# Data Collection Service

The Data Collection Service is a production-ready service for cryptocurrency market data collection with clean logging and robust error handling. It manages multiple data collectors for different trading pairs and exchanges.

## Features

- **Clean Logging**: Only essential information (connections, disconnections, errors)
- **Multi-Exchange Support**: Extensible architecture for multiple exchanges
- **Health Monitoring**: Built-in health checks and auto-recovery
- **Configurable**: JSON-based configuration with sensible defaults
- **Graceful Shutdown**: Proper signal handling and cleanup
- **Testing**: Comprehensive unit test coverage

## Quick Start

### Basic Usage

```bash
# Start with default configuration (indefinite run)
python scripts/start_data_collection.py

# Run for 8 hours
python scripts/start_data_collection.py --hours 8

# Use custom configuration
python scripts/start_data_collection.py --config config/my_config.json
```

### Monitoring

```bash
# Check status once
python scripts/monitor_clean.py

# Monitor continuously every 60 seconds
python scripts/monitor_clean.py --interval 60
```

## Configuration

The service uses JSON configuration files and creates a default configuration automatically if none exists.

### Default Configuration Location

`config/data_collection.json`

### Configuration Structure

```json
{
  "exchanges": {
    "okx": {
      "enabled": true,
      "trading_pairs": [
        {
          "symbol": "BTC-USDT",
          "enabled": true,
          "data_types": ["trade"],
          "timeframes": ["1m", "5m", "15m", "1h"]
        },
        {
          "symbol": "ETH-USDT",
          "enabled": true,
          "data_types": ["trade"],
          "timeframes": ["1m", "5m", "15m", "1h"]
        }
      ]
    }
  },
  "collection_settings": {
    "health_check_interval": 120,
    "store_raw_data": true,
    "auto_restart": true,
    "max_restart_attempts": 3
  },
  "logging": {
    "level": "INFO",
    "log_errors_only": true,
    "verbose_data_logging": false
  }
}
```

### Configuration Options

#### Exchange Settings

- **enabled**: Whether to enable this exchange
- **trading_pairs**: Array of trading pair configurations

#### Trading Pair Settings

- **symbol**: Trading pair symbol (e.g., "BTC-USDT")
- **enabled**: Whether to collect data for this pair
- **data_types**: Types of data to collect (["trade"], ["ticker"], etc.)
- **timeframes**: Candle timeframes to generate (["1m", "5m", "15m", "1h", "4h", "1d"])

#### Collection Settings

- **health_check_interval**: Health check frequency in seconds
- **store_raw_data**: Whether to store raw trade data
- **auto_restart**: Enable automatic restart on failures
- **max_restart_attempts**: Maximum restart attempts before giving up

#### Logging Settings

- **level**: Log level ("DEBUG", "INFO", "WARNING", "ERROR")
- **log_errors_only**: Only log errors and essential events
- **verbose_data_logging**: Enable verbose logging of individual trades/candles
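As a sketch of the default-creation behavior described above (write the defaults on first run, otherwise read the existing file), the snippet below shows one way the loading step could look. The `load_config` helper and `DEFAULT_CONFIG` constant are illustrative names, not the service's actual internals.

```python
import json
from pathlib import Path

# Illustrative defaults only; the real defaults live in the service itself.
DEFAULT_CONFIG = {
    "exchanges": {"okx": {"enabled": True, "trading_pairs": []}},
    "collection_settings": {
        "health_check_interval": 120,
        "store_raw_data": True,
        "auto_restart": True,
        "max_restart_attempts": 3,
    },
    "logging": {"level": "INFO", "log_errors_only": True, "verbose_data_logging": False},
}

def load_config(config_path: str = "config/data_collection.json") -> dict:
    """Load the JSON config, writing the defaults to disk if the file is missing."""
    path = Path(config_path)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(DEFAULT_CONFIG, indent=2))
        return DEFAULT_CONFIG
    return json.loads(path.read_text())
```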
## Service Architecture

### Core Components

1. **DataCollectionService**: Main service class managing the lifecycle
2. **CollectorManager**: Manages multiple data collectors with health monitoring
3. **ExchangeFactory**: Creates exchange-specific collectors
4. **BaseDataCollector**: Abstract base class for all data collectors

### Data Flow

```
Exchange API → Data Collector → Data Processor → Database
                     ↓
              Health Monitor → Service Manager
```

### Storage

- **Raw Data**: PostgreSQL `raw_trades` table
- **Candles**: PostgreSQL `market_data` table with multiple timeframes
- **Real-time**: Redis pub/sub for live data distribution

## Logging Philosophy

The service implements **clean production logging** focused on operational needs:

### What Gets Logged

✅ **Service Lifecycle**
- Service start/stop
- Collector initialization
- Database connections

✅ **Connection Events**
- WebSocket connect/disconnect
- Reconnection attempts
- API errors

✅ **Health & Errors**
- Health check results
- Error conditions
- Recovery actions

✅ **Statistics**
- Periodic uptime reports
- Collection summary

### What Doesn't Get Logged

❌ **Individual Data Points**
- Every trade received
- Every candle generated
- Raw market data

❌ **Verbose Operations**
- Database queries
- Internal processing steps
- Routine heartbeats

## API Reference

### DataCollectionService

The main service class for managing data collection.

#### Constructor

```python
DataCollectionService(config_path: str = "config/data_collection.json")
```

#### Methods

##### `async run(duration_hours: Optional[float] = None) -> bool`

Run the service for a specified duration or indefinitely.

**Parameters:**
- `duration_hours`: Optional duration in hours (None = indefinite)

**Returns:**
- `bool`: True if successful, False if an error occurred

##### `async start() -> bool`

Start the data collection service.

**Returns:**
- `bool`: True if started successfully

##### `async stop() -> None`

Stop the service gracefully.

##### `get_status() -> Dict[str, Any]`

Get current service status including uptime, collector counts, and errors.

**Returns:**
- `dict`: Status information

### Standalone Function

#### `run_data_collection_service(config_path, duration_hours)`

```python
async def run_data_collection_service(
    config_path: str = "config/data_collection.json",
    duration_hours: Optional[float] = None
) -> bool
```

Convenience function to run the service.
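A minimal entry point using the convenience function might look like this (the import path follows the integration examples below):

```python
import asyncio
from data.collection_service import run_data_collection_service

# Run with the default configuration for 12 hours; returns True on success.
success = asyncio.run(run_data_collection_service(duration_hours=12))
print(f"Collection finished cleanly: {success}")
```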
## Integration Examples

### Basic Integration

```python
import asyncio
from data.collection_service import DataCollectionService

async def main():
    service = DataCollectionService("config/my_config.json")
    await service.run(duration_hours=24)  # Run for 24 hours

if __name__ == "__main__":
    asyncio.run(main())
```

### Custom Status Monitoring

```python
import asyncio
from data.collection_service import DataCollectionService

async def monitor_service():
    service = DataCollectionService()

    # Start service in background
    start_task = asyncio.create_task(service.run())
    await asyncio.sleep(1)  # Give the service a moment to start

    # Monitor status every 5 minutes
    while service.running:
        status = service.get_status()
        print(f"Uptime: {status['uptime_hours']:.1f}h, "
              f"Collectors: {status['collectors_running']}, "
              f"Errors: {status['errors_count']}")
        await asyncio.sleep(300)  # 5 minutes

    await start_task

asyncio.run(monitor_service())
```

### Programmatic Control

```python
import asyncio
from data.collection_service import DataCollectionService

async def controlled_collection():
    service = DataCollectionService()

    # Initialize and start
    await service.initialize_collectors()
    await service.start()

    try:
        # Run for 1 hour
        await asyncio.sleep(3600)
    finally:
        # Graceful shutdown
        await service.stop()

asyncio.run(controlled_collection())
```

## Error Handling

The service implements robust error handling at multiple levels:

### Service Level

- **Configuration Errors**: Invalid JSON, missing files
- **Initialization Errors**: Database connection, collector creation
- **Runtime Errors**: Unexpected exceptions during operation

### Collector Level

- **Connection Errors**: WebSocket disconnections, API failures
- **Data Errors**: Invalid data formats, processing failures
- **Health Errors**: Failed health checks, timeout conditions

### Recovery Strategies

1. **Automatic Restart**: Collectors auto-restart on failures
2. **Exponential Backoff**: Increasing delays between retry attempts (see the sketch below)
3. **Circuit Breaker**: Stop retrying after max attempts exceeded
4. **Graceful Degradation**: Continue with healthy collectors
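The restart logic is internal to the CollectorManager; the sketch below only illustrates the backoff-with-circuit-breaker pattern from strategies 2 and 3. `restart_with_backoff` and the `collector.start()` call are hypothetical names used for illustration, not the service's actual restart implementation.

```python
import asyncio

# Sketch: retry a failed collector with exponentially increasing delays,
# giving up (circuit breaker) after max_attempts.
async def restart_with_backoff(collector, max_attempts: int = 3, base_delay: float = 2.0) -> bool:
    for attempt in range(1, max_attempts + 1):
        try:
            await collector.start()  # hypothetical collector API
            return True  # recovered
        except Exception as exc:
            delay = base_delay * 2 ** (attempt - 1)  # 2s, 4s, 8s, ...
            print(f"Restart {attempt}/{max_attempts} failed: {exc}; retrying in {delay:.0f}s")
            await asyncio.sleep(delay)
    return False  # circuit breaker: stop retrying after max_attempts
```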
## Testing

### Running Tests

```bash
# Run all data collection service tests
uv run pytest tests/test_data_collection_service.py -v

# Run a specific test
uv run pytest tests/test_data_collection_service.py::TestDataCollectionService::test_service_initialization -v

# Run with coverage
uv run pytest tests/test_data_collection_service.py --cov=data.collection_service
```

### Test Coverage

The test suite covers:

- Service initialization and configuration
- Collector creation and management
- Service lifecycle (start/stop)
- Error handling and recovery
- Configuration validation
- Signal handling
- Status reporting

## Troubleshooting

### Common Issues

#### Configuration Not Found

```
❌ Failed to load config from config/data_collection.json: [Errno 2] No such file or directory
```

**Solution**: The service will create a default configuration. Check the created file and adjust as needed.

#### Database Connection Failed

```
❌ Database connection failed: connection refused
```

**Solution**: Ensure PostgreSQL and Redis are running via Docker:

```bash
docker-compose up -d postgres redis
```

#### No Collectors Created

```
❌ No collectors were successfully initialized
```

**Solution**: Check the configuration - ensure at least one exchange is enabled with valid trading pairs.

#### WebSocket Connection Issues

```
❌ Failed to start data collectors
```

**Solution**: Check network connectivity and API credentials. Verify the exchange is accessible.

### Debug Mode

For verbose debugging, modify the logging configuration:

```json
{
  "logging": {
    "level": "DEBUG",
    "log_errors_only": false,
    "verbose_data_logging": true
  }
}
```

⚠️ **Warning**: Debug mode generates extensive logs and should not be used in production.

## Production Deployment

### Docker

The service can be containerized for production deployment:

```dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY . .

RUN pip install uv
RUN uv pip install --system -r requirements.txt

CMD ["python", "scripts/start_data_collection.py", "--config", "config/production.json"]
```

### Systemd Service

Create a systemd unit for Linux deployment:

```ini
[Unit]
Description=Cryptocurrency Data Collection Service
After=network.target postgres.service redis.service

[Service]
Type=simple
User=crypto-collector
WorkingDirectory=/opt/crypto-dashboard
ExecStart=/usr/bin/python scripts/start_data_collection.py --config config/production.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

### Environment Variables

Configure sensitive data via environment variables:

```bash
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export POSTGRES_DB=crypto_dashboard
export POSTGRES_USER=dashboard_user
export POSTGRES_PASSWORD=secure_password
export REDIS_HOST=localhost
export REDIS_PORT=6379
```

## Performance Considerations

### Resource Usage

- **Memory**: ~100MB base + ~10MB per trading pair
- **CPU**: Low (async I/O bound)
- **Network**: ~1KB/s per trading pair
- **Storage**: ~1GB/day per trading pair (with raw data)

### Scaling

- **Vertical**: Add more timeframes and trading pairs to a single service
- **Horizontal**: Run multiple services with different configurations
- **Database**: Use TimescaleDB for time-series optimization

### Optimization Tips

1. **Disable Raw Data**: Set `store_raw_data: false` to reduce storage
2. **Limit Timeframes**: Only collect needed timeframes
3. **Batch Processing**: Use longer health check intervals
4. **Connection Pooling**: Database connections are automatically pooled

## Changelog

### v1.0.0 (Current)

- Initial implementation
- OKX exchange support
- Clean logging system
- Comprehensive test coverage
- JSON configuration
- Health monitoring
- Graceful shutdown