# Data Collection Service
The Data Collection Service is a production-ready service for cryptocurrency market data collection with clean logging and robust error handling. It manages multiple data collectors for different trading pairs and exchanges.
## Features
- **Clean Logging**: Only essential information (connections, disconnections, errors)
- **Multi-Exchange Support**: Extensible architecture for multiple exchanges
- **Health Monitoring**: Built-in health checks and auto-recovery
- **Configurable**: JSON-based configuration with sensible defaults
- **Graceful Shutdown**: Proper signal handling and cleanup
- **Testing**: Comprehensive unit test coverage
## Quick Start
### Basic Usage
```bash
# Start with default configuration (indefinite run)
python scripts/start_data_collection.py
# Run for 8 hours
python scripts/start_data_collection.py --hours 8
# Use custom configuration
python scripts/start_data_collection.py --config config/my_config.json
```
### Monitoring
```bash
# Check status once
python scripts/monitor_clean.py
# Monitor continuously every 60 seconds
python scripts/monitor_clean.py --interval 60
```
## Configuration
The service uses JSON configuration files; if no configuration file exists, a default one is created automatically.
### Default Configuration Location
`config/data_collection.json`
### Configuration Structure
```json
{
  "exchanges": {
    "okx": {
      "enabled": true,
      "trading_pairs": [
        {
          "symbol": "BTC-USDT",
          "enabled": true,
          "data_types": ["trade"],
          "timeframes": ["1m", "5m", "15m", "1h"]
        },
        {
          "symbol": "ETH-USDT",
          "enabled": true,
          "data_types": ["trade"],
          "timeframes": ["1m", "5m", "15m", "1h"]
        }
      ]
    }
  },
  "collection_settings": {
    "health_check_interval": 120,
    "store_raw_data": true,
    "auto_restart": true,
    "max_restart_attempts": 3
  },
  "logging": {
    "level": "INFO",
    "log_errors_only": true,
    "verbose_data_logging": false
  }
}
```
### Configuration Options
#### Exchange Settings
- **enabled**: Whether to enable this exchange
- **trading_pairs**: Array of trading pair configurations
#### Trading Pair Settings
- **symbol**: Trading pair symbol (e.g., "BTC-USDT")
- **enabled**: Whether to collect data for this pair
- **data_types**: Types of data to collect (e.g., `"trade"`, `"ticker"`)
- **timeframes**: Candle timeframes to generate (["1m", "5m", "15m", "1h", "4h", "1d"])
#### Collection Settings
- **health_check_interval**: Health check frequency in seconds
- **store_raw_data**: Whether to store raw trade data
- **auto_restart**: Enable automatic restart on failures
- **max_restart_attempts**: Maximum restart attempts before giving up
#### Logging Settings
- **level**: Log level ("DEBUG", "INFO", "WARNING", "ERROR")
- **log_errors_only**: Only log errors and essential events
- **verbose_data_logging**: Enable verbose logging of individual trades/candles
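As an illustration, here is a minimal sketch of loading this configuration with a fallback to defaults when the file is missing; `load_config` and `DEFAULT_CONFIG` are hypothetical names, and the actual service may handle this differently.
```python
import json
from pathlib import Path

# Illustrative defaults mirroring the structure documented above.
DEFAULT_CONFIG = {
    "exchanges": {
        "okx": {
            "enabled": True,
            "trading_pairs": [
                {"symbol": "BTC-USDT", "enabled": True,
                 "data_types": ["trade"], "timeframes": ["1m", "5m", "15m", "1h"]},
            ],
        }
    },
    "collection_settings": {
        "health_check_interval": 120,
        "store_raw_data": True,
        "auto_restart": True,
        "max_restart_attempts": 3,
    },
    "logging": {
        "level": "INFO",
        "log_errors_only": True,
        "verbose_data_logging": False,
    },
}


def load_config(path: str = "config/data_collection.json") -> dict:
    """Load the JSON configuration, writing the defaults if no file exists."""
    config_file = Path(path)
    if not config_file.exists():
        config_file.parent.mkdir(parents=True, exist_ok=True)
        config_file.write_text(json.dumps(DEFAULT_CONFIG, indent=2))
        return DEFAULT_CONFIG
    return json.loads(config_file.read_text())
```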
## Service Architecture
### Core Components
1. **DataCollectionService**: Main service class managing the lifecycle
2. **CollectorManager**: Manages multiple data collectors with health monitoring
3. **ExchangeFactory**: Creates exchange-specific collectors
4. **BaseDataCollector**: Abstract base for all data collectors
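The sketch below shows one way these components could fit together; only the component names come from this document, while the method names and restart logic are illustrative assumptions.
```python
import asyncio
from abc import ABC, abstractmethod


class BaseDataCollector(ABC):
    """Abstract base for all exchange/pair collectors."""

    @abstractmethod
    async def start(self) -> None: ...

    @abstractmethod
    async def stop(self) -> None: ...

    @abstractmethod
    async def is_healthy(self) -> bool: ...


class CollectorManager:
    """Owns a set of collectors and periodically checks their health."""

    def __init__(self, health_check_interval: int = 120) -> None:
        self.collectors: list[BaseDataCollector] = []
        self.health_check_interval = health_check_interval

    def add(self, collector: BaseDataCollector) -> None:
        self.collectors.append(collector)

    async def monitor(self) -> None:
        # Restart any collector that fails its health check (simplified).
        while True:
            for collector in self.collectors:
                if not await collector.is_healthy():
                    await collector.stop()
                    await collector.start()
            await asyncio.sleep(self.health_check_interval)
```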
### Data Flow
```
Exchange API → Data Collector → Data Processor → Database
Health Monitor → Service Manager
```
### Storage
- **Raw Data**: PostgreSQL `raw_trades` table
- **Candles**: PostgreSQL `market_data` table with multiple timeframes
- **Real-time**: Redis pub/sub for live data distribution
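A hedged sketch of the storage step using `asyncpg` and `redis-py`; the table and channel names follow the layout above, but the exact schema, DSN, and helper name are assumptions.
```python
import json

import asyncpg
import redis.asyncio as aioredis


async def store_and_publish(candle: dict) -> None:
    """Persist a completed candle and publish it to Redis subscribers."""
    conn = await asyncpg.connect(
        dsn="postgresql://dashboard_user@localhost:5432/crypto_dashboard"
    )
    try:
        # Column names are illustrative; the real market_data schema may differ.
        await conn.execute(
            """
            INSERT INTO market_data (symbol, timeframe, open, high, low, close, volume, ts)
            VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
            """,
            candle["symbol"], candle["timeframe"], candle["open"], candle["high"],
            candle["low"], candle["close"], candle["volume"], candle["timestamp"],
        )
    finally:
        await conn.close()

    # Publish the candle on a per-symbol channel for live consumers.
    async with aioredis.Redis(host="localhost", port=6379) as r:
        await r.publish(f"candles:{candle['symbol']}", json.dumps(candle, default=str))
```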
## Logging Philosophy
The service implements **clean production logging** focused on operational needs:
### What Gets Logged
**Service Lifecycle**
- Service start/stop
- Collector initialization
- Database connections
**Connection Events**
- WebSocket connect/disconnect
- Reconnection attempts
- API errors
**Health & Errors**
- Health check results
- Error conditions
- Recovery actions
**Statistics**
- Periodic uptime reports
- Collection summary
### What Doesn't Get Logged
**Individual Data Points**
- Every trade received
- Every candle generated
- Raw market data
**Verbose Operations**
- Database queries
- Internal processing steps
- Routine heartbeats
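A minimal sketch of a logging setup consistent with this philosophy, assuming hypothetical per-data-type logger names; the service's actual logger hierarchy may differ.
```python
import logging


def configure_logging(level: str = "INFO", verbose_data_logging: bool = False) -> None:
    """Log service-level events normally, keep per-message data loggers quiet."""
    logging.basicConfig(
        level=getattr(logging, level),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    # Suppress per-trade / per-candle output unless verbose logging is enabled.
    data_level = logging.DEBUG if verbose_data_logging else logging.WARNING
    for noisy_logger in ("data.trades", "data.candles"):
        logging.getLogger(noisy_logger).setLevel(data_level)
```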
## API Reference
### DataCollectionService
The main service class for managing data collection.
#### Constructor
```python
DataCollectionService(config_path: str = "config/data_collection.json")
```
#### Methods
##### `async run(duration_hours: Optional[float] = None) -> bool`
Run the service for a specified duration or indefinitely.
**Parameters:**
- `duration_hours`: Optional duration in hours (None = indefinite)
**Returns:**
- `bool`: True if successful, False if error occurred
##### `async start() -> bool`
Start the data collection service.
**Returns:**
- `bool`: True if started successfully
##### `async stop() -> None`
Stop the service gracefully.
##### `get_status() -> Dict[str, Any]`
Get current service status including uptime, collector counts, and errors.
**Returns:**
- `dict`: Status information
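For illustration, the keys shown below are the ones used by the monitoring example later in this document; the actual dictionary may contain additional fields.
```python
from data.collection_service import DataCollectionService

service = DataCollectionService()
status = service.get_status()
# Example shape (values illustrative):
# {"uptime_hours": 3.2, "collectors_running": 2, "errors_count": 0, ...}
```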
### Standalone Function
#### `run_data_collection_service(config_path, duration_hours)`
```python
async def run_data_collection_service(
    config_path: str = "config/data_collection.json",
    duration_hours: Optional[float] = None
) -> bool
```
Convenience function to run the service.
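Minimal usage of the convenience function (module path as shown in the examples below):
```python
import asyncio

from data.collection_service import run_data_collection_service

if __name__ == "__main__":
    ok = asyncio.run(run_data_collection_service(duration_hours=8))
    print("completed successfully" if ok else "finished with errors")
```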
## Integration Examples
### Basic Integration
```python
import asyncio

from data.collection_service import DataCollectionService


async def main():
    service = DataCollectionService("config/my_config.json")
    await service.run(duration_hours=24)  # Run for 24 hours


if __name__ == "__main__":
    asyncio.run(main())
```
### Custom Status Monitoring
```python
import asyncio

from data.collection_service import DataCollectionService


async def monitor_service():
    service = DataCollectionService()

    # Start service in background
    start_task = asyncio.create_task(service.run())

    # Monitor status every 5 minutes
    while service.running:
        status = service.get_status()
        print(f"Uptime: {status['uptime_hours']:.1f}h, "
              f"Collectors: {status['collectors_running']}, "
              f"Errors: {status['errors_count']}")
        await asyncio.sleep(300)  # 5 minutes

    await start_task


asyncio.run(monitor_service())
```
### Programmatic Control
```python
import asyncio

from data.collection_service import DataCollectionService


async def controlled_collection():
    service = DataCollectionService()

    # Initialize and start
    await service.initialize_collectors()
    await service.start()

    try:
        # Run for 1 hour
        await asyncio.sleep(3600)
    finally:
        # Graceful shutdown
        await service.stop()


asyncio.run(controlled_collection())
```
## Error Handling
The service implements robust error handling at multiple levels:
### Service Level
- **Configuration Errors**: Invalid JSON, missing files
- **Initialization Errors**: Database connection, collector creation
- **Runtime Errors**: Unexpected exceptions during operation
### Collector Level
- **Connection Errors**: WebSocket disconnections, API failures
- **Data Errors**: Invalid data formats, processing failures
- **Health Errors**: Failed health checks, timeout conditions
### Recovery Strategies
1. **Automatic Restart**: Collectors auto-restart on failures
2. **Exponential Backoff**: Increasing delays between retry attempts
3. **Circuit Breaker**: Stop retrying after max attempts exceeded
4. **Graceful Degradation**: Continue with healthy collectors
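A hedged sketch of the restart strategy described above (exponential backoff capped by `max_restart_attempts`); this is illustrative, not the service's actual code.
```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def restart_with_backoff(start_collector, max_attempts: int = 3,
                               base_delay: float = 5.0) -> bool:
    """Retry a collector start, doubling the delay after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            await start_collector()
            return True
        except Exception as exc:  # connection or API failure
            delay = base_delay * 2 ** (attempt - 1)
            logger.error("Restart attempt %d/%d failed: %s (retrying in %.0fs)",
                         attempt, max_attempts, exc, delay)
            await asyncio.sleep(delay)
    return False  # circuit breaker: give up after max_attempts
```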
## Testing
### Running Tests
```bash
# Run all data collection service tests
uv run pytest tests/test_data_collection_service.py -v
# Run specific test
uv run pytest tests/test_data_collection_service.py::TestDataCollectionService::test_service_initialization -v
# Run with coverage
uv run pytest tests/test_data_collection_service.py --cov=data.collection_service
```
### Test Coverage
The test suite covers:
- Service initialization and configuration
- Collector creation and management
- Service lifecycle (start/stop)
- Error handling and recovery
- Configuration validation
- Signal handling
- Status reporting
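As a hedged illustration of this style of test (not copied from the actual suite), the example below assumes the constructor writes the default configuration when the file is missing and exposes a `running` attribute, as used in the monitoring example above.
```python
from data.collection_service import DataCollectionService


def test_default_config_is_created(tmp_path):
    config_path = tmp_path / "data_collection.json"
    service = DataCollectionService(str(config_path))
    assert config_path.exists()      # defaults written when no config file exists
    assert service.running is False  # service has not been started yet
```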
## Troubleshooting
### Common Issues
#### Configuration Not Found
```
❌ Failed to load config from config/data_collection.json: [Errno 2] No such file or directory
```
**Solution**: The service will create a default configuration. Check the created file and adjust as needed.
#### Database Connection Failed
```
❌ Database connection failed: connection refused
```
**Solution**: Ensure PostgreSQL and Redis are running via Docker:
```bash
docker-compose up -d postgres redis
```
#### No Collectors Created
```
❌ No collectors were successfully initialized
```
**Solution**: Check configuration - ensure at least one exchange is enabled with valid trading pairs.
#### WebSocket Connection Issues
```
❌ Failed to start data collectors
```
**Solution**: Check network connectivity and API credentials. Verify exchange is accessible.
### Debug Mode
For verbose debugging, modify the logging configuration:
```json
{
  "logging": {
    "level": "DEBUG",
    "log_errors_only": false,
    "verbose_data_logging": true
  }
}
```
⚠️ **Warning**: Debug mode generates extensive logs and should not be used in production.
## Production Deployment
### Docker
The service can be containerized for production deployment:
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install uv
RUN uv pip install -r requirements.txt
CMD ["python", "scripts/start_data_collection.py", "--config", "config/production.json"]
```
### Systemd Service
Create a systemd service for Linux deployment:
```ini
[Unit]
Description=Cryptocurrency Data Collection Service
After=network.target postgres.service redis.service

[Service]
Type=simple
User=crypto-collector
WorkingDirectory=/opt/crypto-dashboard
ExecStart=/usr/bin/python scripts/start_data_collection.py --config config/production.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```
### Environment Variables
Configure sensitive data via environment variables:
```bash
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export POSTGRES_DB=crypto_dashboard
export POSTGRES_USER=dashboard_user
export POSTGRES_PASSWORD=secure_password
export REDIS_HOST=localhost
export REDIS_PORT=6379
```
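A hedged sketch of reading these variables on the Python side; the service's actual settings loader may differ.
```python
import os

DB_SETTINGS = {
    "host": os.getenv("POSTGRES_HOST", "localhost"),
    "port": int(os.getenv("POSTGRES_PORT", "5432")),
    "database": os.getenv("POSTGRES_DB", "crypto_dashboard"),
    "user": os.getenv("POSTGRES_USER", "dashboard_user"),
    "password": os.getenv("POSTGRES_PASSWORD", ""),
}

REDIS_SETTINGS = {
    "host": os.getenv("REDIS_HOST", "localhost"),
    "port": int(os.getenv("REDIS_PORT", "6379")),
}
```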
## Performance Considerations
### Resource Usage
- **Memory**: ~100MB base + ~10MB per trading pair
- **CPU**: Low (async I/O bound)
- **Network**: ~1KB/s per trading pair
- **Storage**: ~1GB/day per trading pair (with raw data)
### Scaling
- **Vertical**: Add more trading pairs and timeframes to a single service instance
- **Horizontal**: Run multiple services with different configurations
- **Database**: Use TimescaleDB for time-series optimization
### Optimization Tips
1. **Disable Raw Data**: Set `store_raw_data: false` to reduce storage
2. **Limit Timeframes**: Only collect needed timeframes
3. **Batch Processing**: Use longer health check intervals
4. **Connection Pooling**: Database connections are automatically pooled
## Changelog
### v1.0.0 (Current)
- Initial implementation
- OKX exchange support
- Clean logging system
- Comprehensive test coverage
- JSON configuration
- Health monitoring
- Graceful shutdown