# Data Collection Service

The Data Collection Service is a production-ready service for cryptocurrency market data collection with clean logging and robust error handling. It manages multiple data collectors for different trading pairs and exchanges.

## Features

- **Clean Logging**: Only essential information (connections, disconnections, errors)
- **Multi-Exchange Support**: Extensible architecture for multiple exchanges
- **Health Monitoring**: Built-in health checks and auto-recovery
- **Configurable**: JSON-based configuration with sensible defaults
- **Graceful Shutdown**: Proper signal handling and cleanup
- **Testing**: Comprehensive unit test coverage

## Quick Start

### Basic Usage

```bash
# Start with default configuration (indefinite run)
python scripts/start_data_collection.py

# Run for 8 hours
python scripts/start_data_collection.py --hours 8

# Use custom configuration
python scripts/start_data_collection.py --config config/my_config.json
```
### Monitoring

```bash
# Check status once
python scripts/monitor_clean.py

# Monitor continuously every 60 seconds
python scripts/monitor_clean.py --interval 60
```
## Configuration

The service uses JSON configuration files; if no configuration file exists, a default one is created automatically.

### Default Configuration Location

`config/data_collection.json`

### Configuration Structure

```json
{
  "exchanges": {
    "okx": {
      "enabled": true,
      "trading_pairs": [
        {
          "symbol": "BTC-USDT",
          "enabled": true,
          "data_types": ["trade"],
          "timeframes": ["1m", "5m", "15m", "1h"]
        },
        {
          "symbol": "ETH-USDT",
          "enabled": true,
          "data_types": ["trade"],
          "timeframes": ["1m", "5m", "15m", "1h"]
        }
      ]
    }
  },
  "collection_settings": {
    "health_check_interval": 120,
    "store_raw_data": true,
    "auto_restart": true,
    "max_restart_attempts": 3
  },
  "logging": {
    "level": "INFO",
    "log_errors_only": true,
    "verbose_data_logging": false
  }
}
```
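The load-or-create behavior can be illustrated with a short sketch; the helper below is hypothetical, not the service's actual loader:

```python
import json
from pathlib import Path

# Hypothetical defaults mirroring the structure above (trimmed for brevity)
DEFAULT_CONFIG = {
    "exchanges": {"okx": {"enabled": True, "trading_pairs": []}},
    "collection_settings": {
        "health_check_interval": 120,
        "store_raw_data": True,
        "auto_restart": True,
        "max_restart_attempts": 3,
    },
    "logging": {"level": "INFO", "log_errors_only": True, "verbose_data_logging": False},
}

def load_config(path: str = "config/data_collection.json") -> dict:
    """Load the JSON config, writing the defaults first if the file is missing."""
    config_file = Path(path)
    if not config_file.exists():
        config_file.parent.mkdir(parents=True, exist_ok=True)
        config_file.write_text(json.dumps(DEFAULT_CONFIG, indent=2))
    return json.loads(config_file.read_text())
```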
### Configuration Options

#### Exchange Settings

- **enabled**: Whether to enable this exchange
- **trading_pairs**: Array of trading pair configurations

#### Trading Pair Settings

- **symbol**: Trading pair symbol (e.g., "BTC-USDT")
- **enabled**: Whether to collect data for this pair
- **data_types**: Types of data to collect (["trade"], ["ticker"], etc.)
- **timeframes**: Candle timeframes to generate (["1m", "5m", "15m", "1h", "4h", "1d"])

#### Collection Settings

- **health_check_interval**: Health check frequency in seconds (see the sketch after this list)
- **store_raw_data**: Whether to store raw trade data
- **auto_restart**: Enable automatic restart on failures
- **max_restart_attempts**: Maximum restart attempts before giving up
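
To illustrate how these settings interact, here is a minimal sketch of a health loop; `is_healthy()` and `restart()` are hypothetical collector hooks, not documented API:

```python
import asyncio

async def health_loop(collector, interval: int = 120, max_restart_attempts: int = 3):
    """Probe a collector every `interval` seconds; restart on failure, up to a limit."""
    restarts = 0
    while True:
        await asyncio.sleep(interval)
        if await collector.is_healthy():   # hypothetical health probe
            restarts = 0                   # healthy again: reset the restart budget
            continue
        restarts += 1
        if restarts > max_restart_attempts:
            break                          # circuit breaker: give up on this collector
        await collector.restart()          # hypothetical restart hook
```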
#### Logging Settings

- **level**: Log level ("DEBUG", "INFO", "WARNING", "ERROR")
- **log_errors_only**: Only log errors and essential events
- **verbose_data_logging**: Enable verbose logging of individual trades/candles

## Service Architecture

### Core Components

1. **DataCollectionService**: Main service class managing the lifecycle
2. **CollectorManager**: Manages multiple data collectors with health monitoring
3. **ExchangeFactory**: Creates exchange-specific collectors (sketched below)
4. **BaseDataCollector**: Abstract base for all data collectors
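
The factory indirection can be sketched roughly as follows; everything beyond the two documented class names is an assumption:

```python
from abc import ABC, abstractmethod

class BaseDataCollector(ABC):
    """Abstract base for all data collectors."""

    @abstractmethod
    async def start(self) -> None: ...

    @abstractmethod
    async def stop(self) -> None: ...

class ExchangeFactory:
    """Maps exchange names from the config onto collector classes."""

    _registry: dict[str, type] = {}

    @classmethod
    def register(cls, name: str, collector_cls: type) -> None:
        cls._registry[name.lower()] = collector_cls

    @classmethod
    def create(cls, name: str, **kwargs) -> BaseDataCollector:
        # Hypothetical: kwargs would carry trading pairs, timeframes, etc.
        return cls._registry[name.lower()](**kwargs)
```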
### Data Flow

```
Exchange API → Data Collector → Data Processor → Database
                     ↓
              Health Monitor → Service Manager
```

### Storage

- **Raw Data**: PostgreSQL `raw_trades` table
- **Candles**: PostgreSQL `market_data` table with multiple timeframes
- **Real-time**: Redis pub/sub for live data distribution (see the subscriber sketch below)
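
As a consumer-side illustration, a client can follow the live feed via Redis pub/sub; the channel naming scheme used here is an assumption, not a documented contract:

```python
import asyncio
import redis.asyncio as redis

async def follow_live_candles():
    client = redis.Redis(host="localhost", port=6379)
    pubsub = client.pubsub()
    await pubsub.subscribe("candles:BTC-USDT:1m")  # hypothetical channel name
    async for message in pubsub.listen():
        if message["type"] == "message":
            print(message["data"])  # raw payload published by the service

asyncio.run(follow_live_candles())
```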
## Logging Philosophy

The service implements **clean production logging** focused on operational needs:

### What Gets Logged

✅ **Service Lifecycle**
- Service start/stop
- Collector initialization
- Database connections

✅ **Connection Events**
- WebSocket connect/disconnect
- Reconnection attempts
- API errors

✅ **Health & Errors**
- Health check results
- Error conditions
- Recovery actions

✅ **Statistics**
- Periodic uptime reports
- Collection summary

### What Doesn't Get Logged

❌ **Individual Data Points**
- Every trade received
- Every candle generated
- Raw market data

❌ **Verbose Operations**
- Database queries
- Internal processing steps
- Routine heartbeats
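
One way to realize this policy with the standard `logging` module; the logger names below are assumptions, not the service's actual setup:

```python
import logging

def configure_clean_logging(verbose_data_logging: bool = False) -> None:
    """Keep lifecycle and error events visible; silence per-trade chatter."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    # Hypothetical data-path loggers: muted unless verbose logging is enabled
    data_level = logging.DEBUG if verbose_data_logging else logging.WARNING
    for name in ("data.trades", "data.candles"):
        logging.getLogger(name).setLevel(data_level)
```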
## API Reference

### DataCollectionService

The main service class for managing data collection.

#### Constructor

```python
DataCollectionService(config_path: str = "config/data_collection.json")
```

#### Methods

##### `async run(duration_hours: Optional[float] = None) -> bool`

Run the service for a specified duration, or indefinitely.

**Parameters:**
- `duration_hours`: Optional duration in hours (None = indefinite)

**Returns:**
- `bool`: True if successful, False if an error occurred

##### `async start() -> bool`

Start the data collection service.

**Returns:**
- `bool`: True if started successfully

##### `async stop() -> None`

Stop the service gracefully.
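
`stop()` pairs naturally with OS signal handlers for graceful shutdown; the wiring below is a sketch, not the service's exact signal-handling code:

```python
import asyncio
import signal
from data.collection_service import DataCollectionService

async def main():
    service = DataCollectionService()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGINT, signal.SIGTERM):
        # Ask the service to shut down cleanly when the process is signalled
        loop.add_signal_handler(sig, lambda: asyncio.ensure_future(service.stop()))
    await service.run()

asyncio.run(main())
```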
##### `get_status() -> Dict[str, Any]`

Get the current service status, including uptime, collector counts, and error totals.

**Returns:**
- `dict`: Status information
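
The monitoring example further below reads three of these fields; the overall shape is roughly as follows (any keys beyond those three are assumptions):

```python
status = service.get_status()
# Illustrative shape:
# {
#     "uptime_hours": 12.5,       # hours since start()
#     "collectors_running": 2,    # currently healthy collectors
#     "errors_count": 0,          # errors observed since start
# }
```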
### Standalone Function

#### `run_data_collection_service(config_path, duration_hours)`

```python
async def run_data_collection_service(
    config_path: str = "config/data_collection.json",
    duration_hours: Optional[float] = None,
) -> bool
```

Convenience function to run the service.
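
For example, to run for eight hours with the default configuration:

```python
import asyncio
from data.collection_service import run_data_collection_service

asyncio.run(run_data_collection_service(duration_hours=8.0))
```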
## Integration Examples

### Basic Integration

```python
import asyncio
from data.collection_service import DataCollectionService

async def main():
    service = DataCollectionService("config/my_config.json")
    await service.run(duration_hours=24)  # Run for 24 hours

if __name__ == "__main__":
    asyncio.run(main())
```

### Custom Status Monitoring

```python
import asyncio
from data.collection_service import DataCollectionService

async def monitor_service():
    service = DataCollectionService()

    # Start service in background
    start_task = asyncio.create_task(service.run())

    # Give the service a moment to start before polling its status
    await asyncio.sleep(5)

    # Monitor status every 5 minutes
    while service.running:
        status = service.get_status()
        print(f"Uptime: {status['uptime_hours']:.1f}h, "
              f"Collectors: {status['collectors_running']}, "
              f"Errors: {status['errors_count']}")

        await asyncio.sleep(300)  # 5 minutes

    await start_task

asyncio.run(monitor_service())
```

### Programmatic Control

```python
import asyncio
from data.collection_service import DataCollectionService

async def controlled_collection():
    service = DataCollectionService()

    # Initialize and start
    await service.initialize_collectors()
    await service.start()

    try:
        # Run for 1 hour
        await asyncio.sleep(3600)
    finally:
        # Graceful shutdown
        await service.stop()

asyncio.run(controlled_collection())
```
## Error Handling

The service implements robust error handling at multiple levels:

### Service Level

- **Configuration Errors**: Invalid JSON, missing files
- **Initialization Errors**: Database connection, collector creation
- **Runtime Errors**: Unexpected exceptions during operation

### Collector Level

- **Connection Errors**: WebSocket disconnections, API failures
- **Data Errors**: Invalid data formats, processing failures
- **Health Errors**: Failed health checks, timeout conditions

### Recovery Strategies

1. **Automatic Restart**: Collectors auto-restart on failures
2. **Exponential Backoff**: Increasing delays between retry attempts (sketched after this list)
3. **Circuit Breaker**: Stop retrying once the maximum number of attempts is exceeded
4. **Graceful Degradation**: Continue with healthy collectors
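
A minimal sketch of the backoff-plus-circuit-breaker combination; the delay constants are illustrative, not the service's actual values:

```python
import asyncio
import random

async def restart_with_backoff(restart, max_attempts: int = 3) -> bool:
    """Retry `restart` with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        delay = min(2 ** attempt, 60) + random.uniform(0, 1)  # 1s, 2s, 4s, ... capped
        await asyncio.sleep(delay)
        try:
            await restart()  # hypothetical restart coroutine
            return True      # recovered
        except ConnectionError:
            continue         # failed again; the next delay doubles
    return False             # circuit breaker: stop retrying after max_attempts
```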
## Testing

### Running Tests

```bash
# Run all data collection service tests
uv run pytest tests/test_data_collection_service.py -v

# Run specific test
uv run pytest tests/test_data_collection_service.py::TestDataCollectionService::test_service_initialization -v

# Run with coverage
uv run pytest tests/test_data_collection_service.py --cov=data.collection_service
```

### Test Coverage

The test suite covers:

- Service initialization and configuration
- Collector creation and management
- Service lifecycle (start/stop)
- Error handling and recovery
- Configuration validation
- Signal handling
- Status reporting
## Troubleshooting

### Common Issues

#### Configuration Not Found

```
❌ Failed to load config from config/data_collection.json: [Errno 2] No such file or directory
```

**Solution**: The service creates a default configuration automatically. Review the generated file and adjust it as needed.

#### Database Connection Failed

```
❌ Database connection failed: connection refused
```

**Solution**: Ensure PostgreSQL and Redis are running via Docker:

```bash
docker-compose up -d postgres redis
```

#### No Collectors Created

```
❌ No collectors were successfully initialized
```

**Solution**: Check the configuration and make sure at least one exchange is enabled with valid trading pairs.

#### WebSocket Connection Issues

```
❌ Failed to start data collectors
```

**Solution**: Check network connectivity and API credentials, and verify that the exchange is reachable.

### Debug Mode

For verbose debugging, modify the logging configuration:

```json
{
  "logging": {
    "level": "DEBUG",
    "log_errors_only": false,
    "verbose_data_logging": true
  }
}
```
⚠️ **Warning**: Debug mode generates extensive logs and should not be used in production.

## Production Deployment

### Docker

The service can be containerized for production deployment:

```dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY . .

RUN pip install uv
RUN uv pip install -r requirements.txt

CMD ["python", "scripts/start_data_collection.py", "--config", "config/production.json"]
```

### Systemd Service

Create a systemd service for Linux deployment:

```ini
[Unit]
Description=Cryptocurrency Data Collection Service
After=network.target postgres.service redis.service

[Service]
Type=simple
User=crypto-collector
WorkingDirectory=/opt/crypto-dashboard
ExecStart=/usr/bin/python scripts/start_data_collection.py --config config/production.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

### Environment Variables

Configure sensitive data via environment variables:

```bash
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export POSTGRES_DB=crypto_dashboard
export POSTGRES_USER=dashboard_user
export POSTGRES_PASSWORD=secure_password
export REDIS_HOST=localhost
export REDIS_PORT=6379
```
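Inside the service these would typically be consumed with `os.getenv`; the fallback defaults shown here are assumptions:

```python
import os

DB_SETTINGS = {
    "host": os.getenv("POSTGRES_HOST", "localhost"),
    "port": int(os.getenv("POSTGRES_PORT", "5432")),
    "database": os.getenv("POSTGRES_DB", "crypto_dashboard"),
    "user": os.getenv("POSTGRES_USER", "dashboard_user"),
    "password": os.getenv("POSTGRES_PASSWORD", ""),
}

REDIS_SETTINGS = {
    "host": os.getenv("REDIS_HOST", "localhost"),
    "port": int(os.getenv("REDIS_PORT", "6379")),
}
```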
## Performance Considerations

### Resource Usage

- **Memory**: ~100 MB base + ~10 MB per trading pair
- **CPU**: Low (async, I/O bound)
- **Network**: ~1 KB/s per trading pair
- **Storage**: ~1 GB/day per trading pair (with raw data enabled)

### Scaling

- **Vertical**: Add more trading pairs and timeframes to a single instance
- **Horizontal**: Run multiple service instances with different configurations
- **Database**: Use TimescaleDB for time-series optimization

### Optimization Tips

1. **Disable Raw Data**: Set `store_raw_data: false` to reduce storage
2. **Limit Timeframes**: Only collect the timeframes you need
3. **Batch Processing**: Use longer health check intervals
4. **Connection Pooling**: Database connections are pooled automatically

## Changelog

### v1.0.0 (Current)

- Initial implementation
- OKX exchange support
- Clean logging system
- Comprehensive test coverage
- JSON configuration
- Health monitoring
- Graceful shutdown