
Data Collection Service

The Data Collection Service is a production-ready service for cryptocurrency market data collection with clean logging and robust error handling. It manages multiple data collectors for different trading pairs and exchanges.

Features

  • Clean Logging: Only essential information (connections, disconnections, errors)
  • Multi-Exchange Support: Extensible architecture for multiple exchanges
  • Health Monitoring: Built-in health checks and auto-recovery
  • Configurable: JSON-based configuration with sensible defaults
  • Graceful Shutdown: Proper signal handling and cleanup
  • Testing: Comprehensive unit test coverage

Quick Start

Basic Usage

# Start with default configuration (indefinite run)
python scripts/start_data_collection.py

# Run for 8 hours
python scripts/start_data_collection.py --hours 8

# Use custom configuration
python scripts/start_data_collection.py --config config/my_config.json

Monitoring

# Check status once
python scripts/monitor_clean.py

# Monitor continuously every 60 seconds
python scripts/monitor_clean.py --interval 60

Configuration

The service is configured through a JSON file; if no configuration file exists, one is created automatically with sensible defaults.

Default Configuration Location

config/data_collection.json

Configuration Structure

{
  "exchanges": {
    "okx": {
      "enabled": true,
      "trading_pairs": [
        {
          "symbol": "BTC-USDT",
          "enabled": true,
          "data_types": ["trade"],
          "timeframes": ["1m", "5m", "15m", "1h"]
        },
        {
          "symbol": "ETH-USDT",
          "enabled": true,
          "data_types": ["trade"],
          "timeframes": ["1m", "5m", "15m", "1h"]
        }
      ]
    }
  },
  "collection_settings": {
    "health_check_interval": 120,
    "store_raw_data": true,
    "auto_restart": true,
    "max_restart_attempts": 3
  },
  "logging": {
    "level": "INFO",
    "log_errors_only": true,
    "verbose_data_logging": false
  }
}

Configuration Options

Exchange Settings

  • enabled: Whether to enable this exchange
  • trading_pairs: Array of trading pair configurations

Trading Pair Settings

  • symbol: Trading pair symbol (e.g., "BTC-USDT")
  • enabled: Whether to collect data for this pair
  • data_types: Types of data to collect (["trade"], ["ticker"], etc.)
  • timeframes: Candle timeframes to generate (["1m", "5m", "15m", "1h", "4h", "1d"])

Collection Settings

  • health_check_interval: Health check frequency in seconds
  • store_raw_data: Whether to store raw trade data
  • auto_restart: Enable automatic restart on failures
  • max_restart_attempts: Maximum restart attempts before giving up

Logging Settings

  • level: Log level ("DEBUG", "INFO", "WARNING", "ERROR")
  • log_errors_only: Only log errors and essential events
  • verbose_data_logging: Enable verbose logging of individual trades/candles
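
If you need to read this file from your own tooling, a minimal sketch using the standard library json module (load_config here is a hypothetical helper, not part of the service API) could look like this:

import json
from pathlib import Path

def load_config(path: str = "config/data_collection.json") -> dict:
    """Load the data collection configuration from disk."""
    config_file = Path(path)
    with config_file.open() as f:
        config = json.load(f)

    # List the trading pairs that are enabled on enabled exchanges
    enabled_pairs = [
        pair["symbol"]
        for exchange in config.get("exchanges", {}).values()
        if exchange.get("enabled")
        for pair in exchange.get("trading_pairs", [])
        if pair.get("enabled")
    ]
    print(f"Enabled pairs: {enabled_pairs}")
    return config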

Service Architecture

Core Components

  1. DataCollectionService: Main service class managing the lifecycle
  2. CollectorManager: Manages multiple data collectors with health monitoring
  3. ExchangeFactory: Creates exchange-specific collectors
  4. BaseDataCollector: Abstract base for all data collectors
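
The exact interfaces live in the source; purely as an illustration (the method names below are assumptions, not the real class definition), an abstract collector base along these lines shows how exchange-specific collectors plug into the manager:

from abc import ABC, abstractmethod

class BaseDataCollector(ABC):
    """Illustrative sketch of an abstract collector interface."""

    def __init__(self, symbol: str, timeframes: list[str]):
        self.symbol = symbol
        self.timeframes = timeframes
        self.running = False

    @abstractmethod
    async def connect(self) -> bool:
        """Open the exchange WebSocket connection."""

    @abstractmethod
    async def collect(self) -> None:
        """Receive trades and forward them to the data processor."""

    @abstractmethod
    async def health_check(self) -> bool:
        """Report whether the connection is still healthy."""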

Data Flow

Exchange API → Data Collector → Data Processor → Database
                     ↓
              Health Monitor → Service Manager

Storage

  • Raw Data: PostgreSQL raw_trades table
  • Candles: PostgreSQL market_data table with multiple timeframes
  • Real-time: Redis pub/sub for live data distribution
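
Downstream consumers can pick up the live feed over Redis pub/sub. A minimal sketch using redis-py's asyncio client, assuming a hypothetical channel name such as "candles:BTC-USDT" (the actual channel naming is defined by the service):

import asyncio
import json
import redis.asyncio as redis

async def consume_live_candles():
    client = redis.Redis(host="localhost", port=6379)
    pubsub = client.pubsub()
    await pubsub.subscribe("candles:BTC-USDT")  # hypothetical channel name
    async for message in pubsub.listen():
        if message["type"] != "message":
            continue  # skip subscribe confirmations
        candle = json.loads(message["data"])
        print(candle)

asyncio.run(consume_live_candles())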

Logging Philosophy

The service implements clean production logging focused on operational needs:

What Gets Logged

Service Lifecycle

  • Service start/stop
  • Collector initialization
  • Database connections

Connection Events

  • WebSocket connect/disconnect
  • Reconnection attempts
  • API errors

Health & Errors

  • Health check results
  • Error conditions
  • Recovery actions

Statistics

  • Periodic uptime reports
  • Collection summary

What Doesn't Get Logged

Individual Data Points

  • Every trade received
  • Every candle generated
  • Raw market data

Verbose Operations

  • Database queries
  • Internal processing steps
  • Routine heartbeats

API Reference

DataCollectionService

The main service class for managing data collection.

Constructor

DataCollectionService(config_path: str = "config/data_collection.json")

Methods

async run(duration_hours: Optional[float] = None) -> bool

Run the service for a specified duration or indefinitely.

Parameters:

  • duration_hours: Optional duration in hours (None = indefinite)

Returns:

  • bool: True if successful, False if error occurred

async start() -> bool

Start the data collection service.

Returns:

  • bool: True if started successfully

async stop() -> None

Stop the service gracefully.

get_status() -> Dict[str, Any]

Get current service status including uptime, collector counts, and errors.

Returns:

  • dict: Status information

Standalone Function

run_data_collection_service(config_path, duration_hours)

async def run_data_collection_service(
    config_path: str = "config/data_collection.json",
    duration_hours: Optional[float] = None
) -> bool

Convenience function to run the service.
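
For example, to run the service with the default configuration for eight hours (assuming the function is importable from data.collection_service alongside DataCollectionService):

import asyncio
from data.collection_service import run_data_collection_service

asyncio.run(run_data_collection_service(duration_hours=8))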

Integration Examples

Basic Integration

import asyncio
from data.collection_service import DataCollectionService

async def main():
    service = DataCollectionService("config/my_config.json")
    await service.run(duration_hours=24)  # Run for 24 hours

if __name__ == "__main__":
    asyncio.run(main())

Custom Status Monitoring

import asyncio
from data.collection_service import DataCollectionService

async def monitor_service():
    service = DataCollectionService()
    
    # Start service in background
    start_task = asyncio.create_task(service.run())

    # Give the service a moment to start before polling its status
    await asyncio.sleep(5)

    # Monitor status every 5 minutes
    while service.running:
        status = service.get_status()
        print(f"Uptime: {status['uptime_hours']:.1f}h, "
              f"Collectors: {status['collectors_running']}, "
              f"Errors: {status['errors_count']}")
        
        await asyncio.sleep(300)  # 5 minutes
    
    await start_task

asyncio.run(monitor_service())

Programmatic Control

import asyncio
from data.collection_service import DataCollectionService

async def controlled_collection():
    service = DataCollectionService()
    
    # Initialize and start
    await service.initialize_collectors()
    await service.start()
    
    try:
        # Run for 1 hour
        await asyncio.sleep(3600)
    finally:
        # Graceful shutdown
        await service.stop()

asyncio.run(controlled_collection())

Error Handling

The service implements robust error handling at multiple levels:

Service Level

  • Configuration Errors: Invalid JSON, missing files
  • Initialization Errors: Database connection, collector creation
  • Runtime Errors: Unexpected exceptions during operation

Collector Level

  • Connection Errors: WebSocket disconnections, API failures
  • Data Errors: Invalid data formats, processing failures
  • Health Errors: Failed health checks, timeout conditions

Recovery Strategies

  1. Automatic Restart: Collectors auto-restart on failures
  2. Exponential Backoff: Increasing delays between retry attempts
  3. Circuit Breaker: Stop retrying after max attempts exceeded
  4. Graceful Degradation: Continue with healthy collectors
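
As a simplified sketch of the restart behaviour described above (exponential backoff capped by max_restart_attempts; the real logic lives in the collector manager and may differ in detail):

import asyncio

async def restart_with_backoff(collector, max_attempts: int = 3) -> bool:
    """Retry collector startup with exponentially increasing delays."""
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        if await collector.start():
            return True  # recovered successfully
        if attempt == max_attempts:
            break  # circuit breaker: stop retrying
        await asyncio.sleep(delay)
        delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
    return False  # graceful degradation: caller continues with healthy collectors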

Testing

Running Tests

# Run all data collection service tests
uv run pytest tests/test_data_collection_service.py -v

# Run specific test
uv run pytest tests/test_data_collection_service.py::TestDataCollectionService::test_service_initialization -v

# Run with coverage
uv run pytest tests/test_data_collection_service.py --cov=data.collection_service

Test Coverage

The test suite covers:

  • Service initialization and configuration
  • Collector creation and management
  • Service lifecycle (start/stop)
  • Error handling and recovery
  • Configuration validation
  • Signal handling
  • Status reporting
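
As an illustration only (not a copy of the actual test file), a service-initialization test might look roughly like this, using pytest's tmp_path fixture to supply a minimal configuration:

import json
from data.collection_service import DataCollectionService

def test_service_initialization(tmp_path):
    # Write a minimal configuration to a temporary location
    config_file = tmp_path / "data_collection.json"
    config_file.write_text(json.dumps({
        "exchanges": {"okx": {"enabled": True, "trading_pairs": []}},
        "collection_settings": {"health_check_interval": 120},
        "logging": {"level": "INFO"},
    }))

    # The service should load the config and report status before being started
    service = DataCollectionService(str(config_file))
    assert isinstance(service.get_status(), dict)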

Troubleshooting

Common Issues

Configuration Not Found

❌ Failed to load config from config/data_collection.json: [Errno 2] No such file or directory

Solution: The service will create a default configuration. Check the created file and adjust as needed.

Database Connection Failed

❌ Database connection failed: connection refused

Solution: Ensure PostgreSQL and Redis are running via Docker:

docker-compose up -d postgres redis

No Collectors Created

❌ No collectors were successfully initialized

Solution: Check configuration - ensure at least one exchange is enabled with valid trading pairs.

WebSocket Connection Issues

❌ Failed to start data collectors

Solution: Check network connectivity and API credentials. Verify exchange is accessible.

Debug Mode

For verbose debugging, modify the logging configuration:

{
  "logging": {
    "level": "DEBUG",
    "log_errors_only": false,
    "verbose_data_logging": true
  }
}

⚠️ Warning: Debug mode generates extensive logs and should not be used in production.

Production Deployment

Docker

The service can be containerized for production deployment:

FROM python:3.11-slim

WORKDIR /app
COPY . .

RUN pip install uv
RUN uv pip install --system -r requirements.txt

CMD ["python", "scripts/start_data_collection.py", "--config", "config/production.json"]

Systemd Service

Create a systemd service for Linux deployment:

[Unit]
Description=Cryptocurrency Data Collection Service
After=network.target postgres.service redis.service

[Service]
Type=simple
User=crypto-collector
WorkingDirectory=/opt/crypto-dashboard
ExecStart=/usr/bin/python scripts/start_data_collection.py --config config/production.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
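
To install and enable the unit (the filename data-collection.service is a placeholder):

sudo cp data-collection.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now data-collection.service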

Environment Variables

Configure sensitive data via environment variables:

export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export POSTGRES_DB=crypto_dashboard
export POSTGRES_USER=dashboard_user
export POSTGRES_PASSWORD=secure_password
export REDIS_HOST=localhost
export REDIS_PORT=6379
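
How the service consumes these variables is defined in its database layer; purely as a sketch of the idea (not the actual code), the connection settings might be assembled like this:

import os

db_settings = {
    "host": os.environ.get("POSTGRES_HOST", "localhost"),
    "port": int(os.environ.get("POSTGRES_PORT", "5432")),
    "dbname": os.environ.get("POSTGRES_DB", "crypto_dashboard"),
    "user": os.environ.get("POSTGRES_USER", "dashboard_user"),
    "password": os.environ["POSTGRES_PASSWORD"],  # no default for secrets
}

redis_settings = {
    "host": os.environ.get("REDIS_HOST", "localhost"),
    "port": int(os.environ.get("REDIS_PORT", "6379")),
}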

Performance Considerations

Resource Usage

  • Memory: ~100MB base + ~10MB per trading pair
  • CPU: Low (async I/O bound)
  • Network: ~1KB/s per trading pair
  • Storage: ~1GB/day per trading pair (with raw data)

Scaling

  • Vertical: Increase timeframes and trading pairs
  • Horizontal: Run multiple services with different configurations
  • Database: Use TimescaleDB for time-series optimization

Optimization Tips

  1. Disable Raw Data: Set store_raw_data: false to reduce storage
  2. Limit Timeframes: Only collect needed timeframes
  3. Batch Processing: Use longer health check intervals
  4. Connection Pooling: Database connections are automatically pooled

Changelog

v1.0.0 (Current)

  • Initial implementation
  • OKX exchange support
  • Clean logging system
  • Comprehensive test coverage
  • JSON configuration
  • Health monitoring
  • Graceful shutdown