# Data Collection Service
The Data Collection Service is a production-ready service for cryptocurrency market data collection with clean logging and robust error handling. It manages multiple data collectors for different trading pairs and exchanges.
## Features
- **Clean Logging**: Only essential information (connections, disconnections, errors)
- **Multi-Exchange Support**: Extensible architecture for multiple exchanges
- **Health Monitoring**: Built-in health checks and auto-recovery
- **Configurable**: JSON-based configuration with sensible defaults
- **Graceful Shutdown**: Proper signal handling and cleanup
- **Testing**: Comprehensive unit test coverage
## Quick Start
### Basic Usage
```bash
# Start with default configuration (indefinite run)
python scripts/start_data_collection.py
# Run for 8 hours
python scripts/start_data_collection.py --hours 8
# Use custom configuration
python scripts/start_data_collection.py --config config/my_config.json
```
### Monitoring
```bash
# Check status once
python scripts/monitor_clean.py
# Monitor continuously every 60 seconds
python scripts/monitor_clean.py --interval 60
```
## Configuration
The service uses JSON configuration files; if no configuration file exists, a default one is created automatically.
### Default Configuration Location
`config/data_collection.json`
### Configuration Structure
```json
{
  "exchanges": {
    "okx": {
      "enabled": true,
      "trading_pairs": [
        {
          "symbol": "BTC-USDT",
          "enabled": true,
          "data_types": ["trade"],
          "timeframes": ["1m", "5m", "15m", "1h"]
        },
        {
          "symbol": "ETH-USDT",
          "enabled": true,
          "data_types": ["trade"],
          "timeframes": ["1m", "5m", "15m", "1h"]
        }
      ]
    }
  },
  "collection_settings": {
    "health_check_interval": 120,
    "store_raw_data": true,
    "auto_restart": true,
    "max_restart_attempts": 3
  },
  "logging": {
    "level": "INFO",
    "log_errors_only": true,
    "verbose_data_logging": false
  }
}
```
### Configuration Options
#### Exchange Settings
- **enabled**: Whether to enable this exchange
- **trading_pairs**: Array of trading pair configurations
#### Trading Pair Settings
- **symbol**: Trading pair symbol (e.g., "BTC-USDT")
- **enabled**: Whether to collect data for this pair
- **data_types**: Types of data to collect (e.g., `"trade"`, `"ticker"`)
- **timeframes**: Candle timeframes to generate (["1m", "5m", "15m", "1h", "4h", "1d"])
#### Collection Settings
- **health_check_interval**: Health check frequency in seconds
- **store_raw_data**: Whether to store raw trade data
- **auto_restart**: Enable automatic restart on failures
- **max_restart_attempts**: Maximum restart attempts before giving up
#### Logging Settings
- **level**: Log level ("DEBUG", "INFO", "WARNING", "ERROR")
- **log_errors_only**: Only log errors and essential events
- **verbose_data_logging**: Enable verbose logging of individual trades/candles
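As an illustration, here is a minimal sketch of loading this configuration with a fallback to defaults when the file is missing; `load_config` and `DEFAULT_CONFIG` are hypothetical names, and the actual service may handle this differently.
```python
import json
from pathlib import Path

# Illustrative defaults mirroring the structure documented above.
DEFAULT_CONFIG = {
    "exchanges": {
        "okx": {
            "enabled": True,
            "trading_pairs": [
                {"symbol": "BTC-USDT", "enabled": True,
                 "data_types": ["trade"], "timeframes": ["1m", "5m", "15m", "1h"]},
            ],
        }
    },
    "collection_settings": {
        "health_check_interval": 120,
        "store_raw_data": True,
        "auto_restart": True,
        "max_restart_attempts": 3,
    },
    "logging": {
        "level": "INFO",
        "log_errors_only": True,
        "verbose_data_logging": False,
    },
}


def load_config(path: str = "config/data_collection.json") -> dict:
    """Load the JSON configuration, writing the defaults if no file exists."""
    config_file = Path(path)
    if not config_file.exists():
        config_file.parent.mkdir(parents=True, exist_ok=True)
        config_file.write_text(json.dumps(DEFAULT_CONFIG, indent=2))
        return DEFAULT_CONFIG
    return json.loads(config_file.read_text())
```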
## Service Architecture
### Core Components
1. **DataCollectionService**: Main service class managing the lifecycle
2. **CollectorManager**: Manages multiple data collectors with health monitoring
3. **ExchangeFactory**: Creates exchange-specific collectors
4. **BaseDataCollector**: Abstract base for all data collectors
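The sketch below shows one way these components could fit together; only the component names come from this document, while the method names and restart logic are illustrative assumptions.
```python
import asyncio
from abc import ABC, abstractmethod


class BaseDataCollector(ABC):
    """Abstract base for all exchange/pair collectors."""

    @abstractmethod
    async def start(self) -> None: ...

    @abstractmethod
    async def stop(self) -> None: ...

    @abstractmethod
    async def is_healthy(self) -> bool: ...


class CollectorManager:
    """Owns a set of collectors and periodically checks their health."""

    def __init__(self, health_check_interval: int = 120) -> None:
        self.collectors: list[BaseDataCollector] = []
        self.health_check_interval = health_check_interval

    def add(self, collector: BaseDataCollector) -> None:
        self.collectors.append(collector)

    async def monitor(self) -> None:
        # Restart any collector that fails its health check (simplified).
        while True:
            for collector in self.collectors:
                if not await collector.is_healthy():
                    await collector.stop()
                    await collector.start()
            await asyncio.sleep(self.health_check_interval)
```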
### Data Flow
```
Exchange API → Data Collector → Data Processor → Database
Health Monitor → Service Manager
```
### Storage
- **Raw Data**: PostgreSQL `raw_trades` table
- **Candles**: PostgreSQL `market_data` table with multiple timeframes
- **Real-time**: Redis pub/sub for live data distribution
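A hedged sketch of the storage step using `asyncpg` and `redis-py`; the table and channel names follow the layout above, but the exact schema, DSN, and helper name are assumptions.
```python
import json

import asyncpg
import redis.asyncio as aioredis


async def store_and_publish(candle: dict) -> None:
    """Persist a completed candle and publish it to Redis subscribers."""
    conn = await asyncpg.connect(
        dsn="postgresql://dashboard_user@localhost:5432/crypto_dashboard"
    )
    try:
        # Column names are illustrative; the real market_data schema may differ.
        await conn.execute(
            """
            INSERT INTO market_data (symbol, timeframe, open, high, low, close, volume, ts)
            VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
            """,
            candle["symbol"], candle["timeframe"], candle["open"], candle["high"],
            candle["low"], candle["close"], candle["volume"], candle["timestamp"],
        )
    finally:
        await conn.close()

    # Publish the candle on a per-symbol channel for live consumers.
    async with aioredis.Redis(host="localhost", port=6379) as r:
        await r.publish(f"candles:{candle['symbol']}", json.dumps(candle, default=str))
```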
## Logging Philosophy
The service implements **clean production logging** focused on operational needs:
### What Gets Logged
**Service Lifecycle**
- Service start/stop
- Collector initialization
- Database connections
**Connection Events**
- WebSocket connect/disconnect
- Reconnection attempts
- API errors
**Health & Errors**
- Health check results
- Error conditions
- Recovery actions
**Statistics**
- Periodic uptime reports
- Collection summary
### What Doesn't Get Logged
**Individual Data Points**
- Every trade received
- Every candle generated
- Raw market data
**Verbose Operations**
- Database queries
- Internal processing steps
- Routine heartbeats
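A minimal sketch of a logging setup consistent with this philosophy, assuming hypothetical per-data-type logger names; the service's actual logger hierarchy may differ.
```python
import logging


def configure_logging(level: str = "INFO", verbose_data_logging: bool = False) -> None:
    """Log service-level events normally, keep per-message data loggers quiet."""
    logging.basicConfig(
        level=getattr(logging, level),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    # Suppress per-trade / per-candle output unless verbose logging is enabled.
    data_level = logging.DEBUG if verbose_data_logging else logging.WARNING
    for noisy_logger in ("data.trades", "data.candles"):
        logging.getLogger(noisy_logger).setLevel(data_level)
```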
## API Reference
### DataCollectionService
The main service class for managing data collection.
#### Constructor
```python
DataCollectionService(config_path: str = "config/data_collection.json")
```
#### Methods
##### `async run(duration_hours: Optional[float] = None) -> bool`
Run the service for a specified duration or indefinitely.
**Parameters:**
- `duration_hours`: Optional duration in hours (None = indefinite)
**Returns:**
- `bool`: True if successful, False if error occurred
##### `async start() -> bool`
Start the data collection service.
**Returns:**
- `bool`: True if started successfully
##### `async stop() -> None`
Stop the service gracefully.
##### `get_status() -> Dict[str, Any]`
Get current service status including uptime, collector counts, and errors.
**Returns:**
- `dict`: Status information
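For illustration, the keys shown below are the ones used by the monitoring example later in this document; the actual dictionary may contain additional fields.
```python
from data.collection_service import DataCollectionService

service = DataCollectionService()
status = service.get_status()
# Example shape (values illustrative):
# {"uptime_hours": 3.2, "collectors_running": 2, "errors_count": 0, ...}
```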
### Standalone Function
#### `run_data_collection_service(config_path, duration_hours)`
```python
async def run_data_collection_service(
    config_path: str = "config/data_collection.json",
    duration_hours: Optional[float] = None
) -> bool
```
Convenience function to run the service.
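Minimal usage of the convenience function (module path as shown in the examples below):
```python
import asyncio

from data.collection_service import run_data_collection_service

if __name__ == "__main__":
    ok = asyncio.run(run_data_collection_service(duration_hours=8))
    print("completed successfully" if ok else "finished with errors")
```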
## Integration Examples
### Basic Integration
```python
import asyncio

from data.collection_service import DataCollectionService


async def main():
    service = DataCollectionService("config/my_config.json")
    await service.run(duration_hours=24)  # Run for 24 hours


if __name__ == "__main__":
    asyncio.run(main())
```
### Custom Status Monitoring
```python
import asyncio

from data.collection_service import DataCollectionService


async def monitor_service():
    service = DataCollectionService()

    # Start service in background
    start_task = asyncio.create_task(service.run())

    # Monitor status every 5 minutes
    while service.running:
        status = service.get_status()
        print(f"Uptime: {status['uptime_hours']:.1f}h, "
              f"Collectors: {status['collectors_running']}, "
              f"Errors: {status['errors_count']}")
        await asyncio.sleep(300)  # 5 minutes

    await start_task


asyncio.run(monitor_service())
```
### Programmatic Control
```python
import asyncio

from data.collection_service import DataCollectionService


async def controlled_collection():
    service = DataCollectionService()

    # Initialize and start
    await service.initialize_collectors()
    await service.start()

    try:
        # Run for 1 hour
        await asyncio.sleep(3600)
    finally:
        # Graceful shutdown
        await service.stop()


asyncio.run(controlled_collection())
```
## Error Handling
The service implements robust error handling at multiple levels:
### Service Level
- **Configuration Errors**: Invalid JSON, missing files
- **Initialization Errors**: Database connection, collector creation
- **Runtime Errors**: Unexpected exceptions during operation
### Collector Level
- **Connection Errors**: WebSocket disconnections, API failures
- **Data Errors**: Invalid data formats, processing failures
- **Health Errors**: Failed health checks, timeout conditions
### Recovery Strategies
1. **Automatic Restart**: Collectors auto-restart on failures
2. **Exponential Backoff**: Increasing delays between retry attempts
3. **Circuit Breaker**: Stop retrying after max attempts exceeded
4. **Graceful Degradation**: Continue with healthy collectors
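A hedged sketch of the restart strategy described above (exponential backoff capped by `max_restart_attempts`); this is illustrative, not the service's actual code.
```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def restart_with_backoff(start_collector, max_attempts: int = 3,
                               base_delay: float = 5.0) -> bool:
    """Retry a collector start, doubling the delay after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            await start_collector()
            return True
        except Exception as exc:  # connection or API failure
            delay = base_delay * 2 ** (attempt - 1)
            logger.error("Restart attempt %d/%d failed: %s (retrying in %.0fs)",
                         attempt, max_attempts, exc, delay)
            await asyncio.sleep(delay)
    return False  # circuit breaker: give up after max_attempts
```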
## Testing
### Running Tests
```bash
# Run all data collection service tests
uv run pytest tests/test_data_collection_service.py -v
# Run specific test
uv run pytest tests/test_data_collection_service.py::TestDataCollectionService::test_service_initialization -v
# Run with coverage
uv run pytest tests/test_data_collection_service.py --cov=data.collection_service
```
### Test Coverage
The test suite covers:
- Service initialization and configuration
- Collector creation and management
- Service lifecycle (start/stop)
- Error handling and recovery
- Configuration validation
- Signal handling
- Status reporting
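As a hedged illustration of this style of test (not copied from the actual suite), the example below assumes the constructor writes the default configuration when the file is missing and exposes a `running` attribute, as used in the monitoring example above.
```python
from data.collection_service import DataCollectionService


def test_default_config_is_created(tmp_path):
    config_path = tmp_path / "data_collection.json"
    service = DataCollectionService(str(config_path))
    assert config_path.exists()      # defaults written when no config file exists
    assert service.running is False  # service has not been started yet
```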
## Troubleshooting
### Common Issues
#### Configuration Not Found
```
❌ Failed to load config from config/data_collection.json: [Errno 2] No such file or directory
```
**Solution**: The service will create a default configuration. Check the created file and adjust as needed.
#### Database Connection Failed
```
❌ Database connection failed: connection refused
```
**Solution**: Ensure PostgreSQL and Redis are running via Docker:
```bash
docker-compose up -d postgres redis
```
#### No Collectors Created
```
❌ No collectors were successfully initialized
```
**Solution**: Check configuration - ensure at least one exchange is enabled with valid trading pairs.
#### WebSocket Connection Issues
```
❌ Failed to start data collectors
```
**Solution**: Check network connectivity and API credentials. Verify exchange is accessible.
### Debug Mode
For verbose debugging, modify the logging configuration:
```json
{
  "logging": {
    "level": "DEBUG",
    "log_errors_only": false,
    "verbose_data_logging": true
  }
}
```
⚠️ **Warning**: Debug mode generates extensive logs and should not be used in production.
## Production Deployment
### Docker
The service can be containerized for production deployment:
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install uv
RUN uv pip install -r requirements.txt
CMD ["python", "scripts/start_data_collection.py", "--config", "config/production.json"]
```
### Systemd Service
Create a systemd service for Linux deployment:
```ini
[Unit]
Description=Cryptocurrency Data Collection Service
After=network.target postgres.service redis.service

[Service]
Type=simple
User=crypto-collector
WorkingDirectory=/opt/crypto-dashboard
ExecStart=/usr/bin/python scripts/start_data_collection.py --config config/production.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```
### Environment Variables
Configure sensitive data via environment variables:
```bash
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export POSTGRES_DB=crypto_dashboard
export POSTGRES_USER=dashboard_user
export POSTGRES_PASSWORD=secure_password
export REDIS_HOST=localhost
export REDIS_PORT=6379
```
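A hedged sketch of reading these variables on the Python side; the service's actual settings loader may differ.
```python
import os

DB_SETTINGS = {
    "host": os.getenv("POSTGRES_HOST", "localhost"),
    "port": int(os.getenv("POSTGRES_PORT", "5432")),
    "database": os.getenv("POSTGRES_DB", "crypto_dashboard"),
    "user": os.getenv("POSTGRES_USER", "dashboard_user"),
    "password": os.getenv("POSTGRES_PASSWORD", ""),
}

REDIS_SETTINGS = {
    "host": os.getenv("REDIS_HOST", "localhost"),
    "port": int(os.getenv("REDIS_PORT", "6379")),
}
```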
## Performance Considerations
### Resource Usage
- **Memory**: ~100MB base + ~10MB per trading pair
- **CPU**: Low (async I/O bound)
- **Network**: ~1KB/s per trading pair
- **Storage**: ~1GB/day per trading pair (with raw data)
### Scaling
- **Vertical**: Add more trading pairs and timeframes to a single service instance
- **Horizontal**: Run multiple services with different configurations
- **Database**: Use TimescaleDB for time-series optimization
### Optimization Tips
1. **Disable Raw Data**: Set `store_raw_data: false` to reduce storage
2. **Limit Timeframes**: Only collect needed timeframes
3. **Batch Processing**: Use longer health check intervals
4. **Connection Pooling**: Database connections are automatically pooled
## Changelog
### v1.0.0 (Current)
- Initial implementation
- OKX exchange support
- Clean logging system
- Comprehensive test coverage
- JSON configuration
- Health monitoring
- Graceful shutdown