
Data Collection Service

The Data Collection Service is a production-ready service for cryptocurrency market data collection with clean logging and robust error handling. It manages multiple data collectors for different trading pairs and exchanges.

Features

  • Clean Logging: Only essential information (connections, disconnections, errors)
  • Multi-Exchange Support: Extensible architecture for multiple exchanges
  • Health Monitoring: Built-in health checks and auto-recovery
  • Configurable: JSON-based configuration with sensible defaults
  • Graceful Shutdown: Proper signal handling and cleanup
  • Testing: Comprehensive unit test coverage

Quick Start

Basic Usage

# Start with default configuration (indefinite run)
python scripts/start_data_collection.py

# Run for 8 hours
python scripts/start_data_collection.py --hours 8

# Use custom configuration
python scripts/start_data_collection.py --config config/my_config.json

Monitoring

# Check status once
python scripts/monitor_clean.py

# Monitor continuously every 60 seconds
python scripts/monitor_clean.py --interval 60

Configuration

The service is configured through a JSON file; if no configuration file exists, one is created automatically with sensible defaults.

Default Configuration Location

config/data_collection.json

Configuration Structure

{
  "exchanges": {
    "okx": {
      "enabled": true,
      "trading_pairs": [
        {
          "symbol": "BTC-USDT",
          "enabled": true,
          "data_types": ["trade"],
          "timeframes": ["1m", "5m", "15m", "1h"]
        },
        {
          "symbol": "ETH-USDT",
          "enabled": true,
          "data_types": ["trade"],
          "timeframes": ["1m", "5m", "15m", "1h"]
        }
      ]
    }
  },
  "collection_settings": {
    "health_check_interval": 120,
    "store_raw_data": true,
    "auto_restart": true,
    "max_restart_attempts": 3
  },
  "logging": {
    "level": "INFO",
    "log_errors_only": true,
    "verbose_data_logging": false
  }
}

Configuration Options

Exchange Settings

  • enabled: Whether to enable this exchange
  • trading_pairs: Array of trading pair configurations

Trading Pair Settings

  • symbol: Trading pair symbol (e.g., "BTC-USDT")
  • enabled: Whether to collect data for this pair
  • data_types: Types of data to collect (["trade"], ["ticker"], etc.)
  • timeframes: Candle timeframes to generate (["1m", "5m", "15m", "1h", "4h", "1d"])

Collection Settings

  • health_check_interval: Health check frequency in seconds
  • store_raw_data: Whether to store raw trade data
  • auto_restart: Enable automatic restart on failures
  • max_restart_attempts: Maximum restart attempts before giving up

Logging Settings

  • level: Log level ("DEBUG", "INFO", "WARNING", "ERROR")
  • log_errors_only: Only log errors and essential events
  • verbose_data_logging: Enable verbose logging of individual trades/candles
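
If you need to read this file from your own tooling, a minimal sketch using the standard library json module (load_config here is a hypothetical helper, not part of the service API) could look like this:

import json
from pathlib import Path

def load_config(path: str = "config/data_collection.json") -> dict:
    """Load the data collection configuration from disk."""
    config_file = Path(path)
    with config_file.open() as f:
        config = json.load(f)

    # List the trading pairs that are enabled on enabled exchanges
    enabled_pairs = [
        pair["symbol"]
        for exchange in config.get("exchanges", {}).values()
        if exchange.get("enabled")
        for pair in exchange.get("trading_pairs", [])
        if pair.get("enabled")
    ]
    print(f"Enabled pairs: {enabled_pairs}")
    return config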

Service Architecture

Core Components

  1. DataCollectionService: Main service class managing the lifecycle
  2. CollectorManager: Manages multiple data collectors with health monitoring
  3. ExchangeFactory: Creates exchange-specific collectors
  4. BaseDataCollector: Abstract base for all data collectors
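
The exact interfaces live in the source; purely as an illustration (the method names below are assumptions, not the real class definition), an abstract collector base along these lines shows how exchange-specific collectors plug into the manager:

from abc import ABC, abstractmethod

class BaseDataCollector(ABC):
    """Illustrative sketch of an abstract collector interface."""

    def __init__(self, symbol: str, timeframes: list[str]):
        self.symbol = symbol
        self.timeframes = timeframes
        self.running = False

    @abstractmethod
    async def connect(self) -> bool:
        """Open the exchange WebSocket connection."""

    @abstractmethod
    async def collect(self) -> None:
        """Receive trades and forward them to the data processor."""

    @abstractmethod
    async def health_check(self) -> bool:
        """Report whether the connection is still healthy."""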

Data Flow

Exchange API → Data Collector → Data Processor → Database
                     ↓
              Health Monitor → Service Manager

Storage

  • Raw Data: PostgreSQL raw_trades table
  • Candles: PostgreSQL market_data table with multiple timeframes
  • Real-time: Redis pub/sub for live data distribution
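
Downstream consumers can pick up the live feed over Redis pub/sub. A minimal sketch using redis-py's asyncio client, assuming a hypothetical channel name such as "candles:BTC-USDT" (the actual channel naming is defined by the service):

import asyncio
import json
import redis.asyncio as redis

async def consume_live_candles():
    client = redis.Redis(host="localhost", port=6379)
    pubsub = client.pubsub()
    await pubsub.subscribe("candles:BTC-USDT")  # hypothetical channel name
    async for message in pubsub.listen():
        if message["type"] != "message":
            continue  # skip subscribe confirmations
        candle = json.loads(message["data"])
        print(candle)

asyncio.run(consume_live_candles())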

Logging Philosophy

The service implements clean production logging focused on operational needs:

What Gets Logged

Service Lifecycle

  • Service start/stop
  • Collector initialization
  • Database connections

Connection Events

  • WebSocket connect/disconnect
  • Reconnection attempts
  • API errors

Health & Errors

  • Health check results
  • Error conditions
  • Recovery actions

Statistics

  • Periodic uptime reports
  • Collection summary

What Doesn't Get Logged

Individual Data Points

  • Every trade received
  • Every candle generated
  • Raw market data

Verbose Operations

  • Database queries
  • Internal processing steps
  • Routine heartbeats

API Reference

DataCollectionService

The main service class for managing data collection.

Constructor

DataCollectionService(config_path: str = "config/data_collection.json")

Methods

async run(duration_hours: Optional[float] = None) -> bool

Run the service for a specified duration or indefinitely.

Parameters:

  • duration_hours: Optional duration in hours (None = indefinite)

Returns:

  • bool: True if successful, False if error occurred

async start() -> bool

Start the data collection service.

Returns:

  • bool: True if started successfully

async stop() -> None

Stop the service gracefully.

get_status() -> Dict[str, Any]

Get current service status including uptime, collector counts, and errors.

Returns:

  • dict: Status information

Standalone Function

run_data_collection_service(config_path, duration_hours)

async def run_data_collection_service(
    config_path: str = "config/data_collection.json",
    duration_hours: Optional[float] = None
) -> bool

Convenience function to run the service.
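
For example, to run the service with the default configuration for eight hours (assuming the function is importable from data.collection_service alongside DataCollectionService):

import asyncio
from data.collection_service import run_data_collection_service

asyncio.run(run_data_collection_service(duration_hours=8))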

Integration Examples

Basic Integration

import asyncio
from data.collection_service import DataCollectionService

async def main():
    service = DataCollectionService("config/my_config.json")
    await service.run(duration_hours=24)  # Run for 24 hours

if __name__ == "__main__":
    asyncio.run(main())

Custom Status Monitoring

import asyncio
from data.collection_service import DataCollectionService

async def monitor_service():
    service = DataCollectionService()
    
    # Start service in background
    start_task = asyncio.create_task(service.run())

    # Give the service a moment to start before polling its status
    await asyncio.sleep(5)

    # Monitor status every 5 minutes
    while service.running:
        status = service.get_status()
        print(f"Uptime: {status['uptime_hours']:.1f}h, "
              f"Collectors: {status['collectors_running']}, "
              f"Errors: {status['errors_count']}")
        
        await asyncio.sleep(300)  # 5 minutes
    
    await start_task

asyncio.run(monitor_service())

Programmatic Control

import asyncio
from data.collection_service import DataCollectionService

async def controlled_collection():
    service = DataCollectionService()
    
    # Initialize and start
    await service.initialize_collectors()
    await service.start()
    
    try:
        # Run for 1 hour
        await asyncio.sleep(3600)
    finally:
        # Graceful shutdown
        await service.stop()

asyncio.run(controlled_collection())

Error Handling

The service implements robust error handling at multiple levels:

Service Level

  • Configuration Errors: Invalid JSON, missing files
  • Initialization Errors: Database connection, collector creation
  • Runtime Errors: Unexpected exceptions during operation

Collector Level

  • Connection Errors: WebSocket disconnections, API failures
  • Data Errors: Invalid data formats, processing failures
  • Health Errors: Failed health checks, timeout conditions

Recovery Strategies

  1. Automatic Restart: Collectors auto-restart on failures
  2. Exponential Backoff: Increasing delays between retry attempts
  3. Circuit Breaker: Stop retrying after max attempts exceeded
  4. Graceful Degradation: Continue with healthy collectors
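
As a simplified sketch of the restart behaviour described above (exponential backoff capped by max_restart_attempts; the real logic lives in the collector manager and may differ in detail):

import asyncio

async def restart_with_backoff(collector, max_attempts: int = 3) -> bool:
    """Retry collector startup with exponentially increasing delays."""
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        if await collector.start():
            return True  # recovered successfully
        if attempt == max_attempts:
            break  # circuit breaker: stop retrying
        await asyncio.sleep(delay)
        delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
    return False  # graceful degradation: caller continues with healthy collectors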

Testing

Running Tests

# Run all data collection service tests
uv run pytest tests/test_data_collection_service.py -v

# Run specific test
uv run pytest tests/test_data_collection_service.py::TestDataCollectionService::test_service_initialization -v

# Run with coverage
uv run pytest tests/test_data_collection_service.py --cov=data.collection_service

Test Coverage

The test suite covers:

  • Service initialization and configuration
  • Collector creation and management
  • Service lifecycle (start/stop)
  • Error handling and recovery
  • Configuration validation
  • Signal handling
  • Status reporting
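
As an illustration only (not a copy of the actual test file), a service-initialization test might look roughly like this, using pytest's tmp_path fixture to supply a minimal configuration:

import json
from data.collection_service import DataCollectionService

def test_service_initialization(tmp_path):
    # Write a minimal configuration to a temporary location
    config_file = tmp_path / "data_collection.json"
    config_file.write_text(json.dumps({
        "exchanges": {"okx": {"enabled": True, "trading_pairs": []}},
        "collection_settings": {"health_check_interval": 120},
        "logging": {"level": "INFO"},
    }))

    # The service should load the config and report status before being started
    service = DataCollectionService(str(config_file))
    assert isinstance(service.get_status(), dict)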

Troubleshooting

Common Issues

Configuration Not Found

❌ Failed to load config from config/data_collection.json: [Errno 2] No such file or directory

Solution: The service will create a default configuration. Check the created file and adjust as needed.

Database Connection Failed

❌ Database connection failed: connection refused

Solution: Ensure PostgreSQL and Redis are running via Docker:

docker-compose up -d postgres redis

No Collectors Created

❌ No collectors were successfully initialized

Solution: Check configuration - ensure at least one exchange is enabled with valid trading pairs.

WebSocket Connection Issues

❌ Failed to start data collectors

Solution: Check network connectivity and API credentials. Verify exchange is accessible.

Debug Mode

For verbose debugging, modify the logging configuration:

{
  "logging": {
    "level": "DEBUG",
    "log_errors_only": false,
    "verbose_data_logging": true
  }
}

⚠️ Warning: Debug mode generates extensive logs and should not be used in production.

Production Deployment

Docker

The service can be containerized for production deployment:

FROM python:3.11-slim

WORKDIR /app
COPY . .

RUN pip install uv
RUN uv pip install --system -r requirements.txt

CMD ["python", "scripts/start_data_collection.py", "--config", "config/production.json"]

Systemd Service

Create a systemd service for Linux deployment:

[Unit]
Description=Cryptocurrency Data Collection Service
After=network.target postgres.service redis.service

[Service]
Type=simple
User=crypto-collector
WorkingDirectory=/opt/crypto-dashboard
ExecStart=/usr/bin/python scripts/start_data_collection.py --config config/production.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
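
To install and enable the unit (the filename data-collection.service is a placeholder):

sudo cp data-collection.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now data-collection.service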

Environment Variables

Configure sensitive data via environment variables:

export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export POSTGRES_DB=crypto_dashboard
export POSTGRES_USER=dashboard_user
export POSTGRES_PASSWORD=secure_password
export REDIS_HOST=localhost
export REDIS_PORT=6379
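
How the service consumes these variables is defined in its database layer; purely as a sketch of the idea (not the actual code), the connection settings might be assembled like this:

import os

db_settings = {
    "host": os.environ.get("POSTGRES_HOST", "localhost"),
    "port": int(os.environ.get("POSTGRES_PORT", "5432")),
    "dbname": os.environ.get("POSTGRES_DB", "crypto_dashboard"),
    "user": os.environ.get("POSTGRES_USER", "dashboard_user"),
    "password": os.environ["POSTGRES_PASSWORD"],  # no default for secrets
}

redis_settings = {
    "host": os.environ.get("REDIS_HOST", "localhost"),
    "port": int(os.environ.get("REDIS_PORT", "6379")),
}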

Performance Considerations

Resource Usage

  • Memory: ~100MB base + ~10MB per trading pair
  • CPU: Low (async I/O bound)
  • Network: ~1KB/s per trading pair
  • Storage: ~1GB/day per trading pair (with raw data)

Scaling

  • Vertical: Increase timeframes and trading pairs
  • Horizontal: Run multiple services with different configurations
  • Database: Use TimescaleDB for time-series optimization

Optimization Tips

  1. Disable Raw Data: Set store_raw_data: false to reduce storage
  2. Limit Timeframes: Only collect needed timeframes
  3. Batch Processing: Use longer health check intervals
  4. Connection Pooling: Database connections are automatically pooled

Changelog

v1.0.0 (Current)

  • Initial implementation
  • OKX exchange support
  • Clean logging system
  • Comprehensive test coverage
  • JSON configuration
  • Health monitoring
  • Graceful shutdown