- Introduced a modular architecture for data processing, including common utilities for validation, transformation, and aggregation.
- Implemented `StandardizedTrade`, `OHLCVCandle`, and `TimeframeBucket` classes for unified data handling across exchanges.
- Developed `OKXDataProcessor` for OKX-specific data validation and processing, leveraging the new common framework.
- Enhanced `OKXCollector` to utilize the common data processing utilities, improving modularity and maintainability.
- Updated documentation to reflect the new architecture and provide guidance on the data processing framework.
- Created comprehensive tests for the new data processing components to ensure reliability and functionality.
Refactored Data Processing Architecture
Overview
The data processing system has been significantly refactored to improve reusability, maintainability, and scalability across different exchanges. The key improvement is the extraction of common utilities into a shared framework while keeping exchange-specific components focused and minimal.
Architecture Changes
Before (Monolithic)
data/exchanges/okx/
├── data_processor.py # 1343 lines - everything in one file
├── collector.py
└── websocket.py
After (Modular)
data/
├── common/ # Shared utilities for all exchanges
│ ├── __init__.py
│ ├── data_types.py # StandardizedTrade, OHLCVCandle, etc.
│ ├── aggregation.py # TimeframeBucket, RealTimeCandleProcessor
│ ├── transformation.py # BaseDataTransformer, UnifiedDataTransformer
│ └── validation.py # BaseDataValidator, common validation
└── exchanges/
└── okx/
├── data_processor.py # ~600 lines - OKX-specific only
├── collector.py # Updated to use common utilities
└── websocket.py
Key Benefits
1. Reusability Across Exchanges
- Candle aggregation logic works for any exchange
- Standardized data formats enable uniform processing
- Base classes provide common patterns for new exchanges
2. Maintainability
- Smaller, focused files are easier to understand and modify
- Common utilities are tested once and reused everywhere
- Clear separation of concerns
3. Extensibility
- Adding new exchanges requires minimal code
- New data types and timeframes are automatically supported
- Validation and transformation patterns are consistent
4. Performance
- Optimized aggregation algorithms and memory usage
- Efficient candle bucketing algorithms
- Lazy evaluation where possible
5. Testing
- Modular components are easier to test independently
Time Aggregation Strategy
Right-Aligned Timestamps (Industry Standard)
The system uses RIGHT-ALIGNED timestamps following industry standards from major exchanges (Binance, OKX, Coinbase):
- Candle timestamp = end time of the interval (close time)
- 5-minute candle with timestamp 09:05:00 represents trades from 09:00:00 up to, but not including, 09:05:00
- 1-minute candle with timestamp 14:32:00 represents trades from 14:31:00 up to, but not including, 14:32:00
- This aligns with how exchanges report historical data
Aggregation Process (No Future Leakage)
def process_trade_realtime(trade: StandardizedTrade, timeframe: str) -> List[OHLCVCandle]:
    """
    Real-time aggregation with strict future leakage prevention.
    CRITICAL: Only emit completed candles, never incomplete ones.
    """
    completed_candles = []
    # 1. Calculate which time bucket this trade belongs to
    trade_bucket_start = get_bucket_start_time(trade.timestamp, timeframe)
    # 2. Check if current bucket exists and matches
    current_bucket = current_buckets.get(timeframe)
    # 3. Handle time boundary crossing
    if current_bucket is None:
        # First bucket for this timeframe
        current_bucket = create_bucket(trade_bucket_start, timeframe)
    elif current_bucket.start_time != trade_bucket_start:
        # Time boundary crossed - complete previous bucket FIRST
        if current_bucket.has_trades():
            completed_candle = current_bucket.to_candle(is_complete=True)
            emit_candle(completed_candle)  # Store in market_data table
            completed_candles.append(completed_candle)
        # Create new bucket for current time period
        current_bucket = create_bucket(trade_bucket_start, timeframe)
    # Remember the active bucket so the next trade sees it
    current_buckets[timeframe] = current_bucket
    # 4. Add trade to current bucket
    current_bucket.add_trade(trade)
    # 5. Return only completed candles (never incomplete/future data)
    return completed_candles  # Empty list unless boundary crossed
Time Bucket Calculation Examples
# 5-minute timeframes (00:00, 00:05, 00:10, 00:15, etc.)
trade_time = "09:03:45" -> bucket_start = "09:00:00", bucket_end = "09:05:00"
trade_time = "09:07:23" -> bucket_start = "09:05:00", bucket_end = "09:10:00"
trade_time = "09:05:00" -> bucket_start = "09:05:00", bucket_end = "09:10:00"
# 1-hour timeframes (align to hour boundaries)
trade_time = "14:35:22" -> bucket_start = "14:00:00", bucket_end = "15:00:00"
trade_time = "15:00:00" -> bucket_start = "15:00:00", bucket_end = "16:00:00"
# 4-hour timeframes (00:00, 04:00, 08:00, 12:00, 16:00, 20:00)
trade_time = "13:45:12" -> bucket_start = "12:00:00", bucket_end = "16:00:00"
trade_time = "16:00:01" -> bucket_start = "16:00:00", bucket_end = "20:00:00"
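The bucket boundaries in the examples above can be reproduced by flooring the trade time to a multiple of the timeframe length. The following is a minimal sketch, not the project's implementation: `TIMEFRAME_SECONDS` and `get_bucket_end_time` are hypothetical names, and the function assumes timezone-aware UTC timestamps and timeframes that divide a day evenly.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical timeframe table; the real module may support a different set.
TIMEFRAME_SECONDS = {"1m": 60, "5m": 300, "15m": 900, "1h": 3600, "4h": 14400, "1d": 86400}


def get_bucket_start_time(timestamp: datetime, timeframe: str) -> datetime:
    """Floor a timezone-aware timestamp to the start of its bucket."""
    seconds = TIMEFRAME_SECONDS[timeframe]
    epoch = int(timestamp.timestamp())  # whole seconds since 1970-01-01 UTC
    return datetime.fromtimestamp(epoch - epoch % seconds, tz=timezone.utc)


def get_bucket_end_time(bucket_start: datetime, timeframe: str) -> datetime:
    """Exclusive end of the bucket; also the right-aligned candle timestamp."""
    return bucket_start + timedelta(seconds=TIMEFRAME_SECONDS[timeframe])


trade_time = datetime(2024, 1, 1, 9, 3, 45, tzinfo=timezone.utc)
start = get_bucket_start_time(trade_time, "5m")
end = get_bucket_end_time(start, "5m")
print(start.time(), end.time())
```

Because the Unix epoch starts at midnight UTC, flooring the epoch seconds keeps 4-hour buckets on the 00:00, 04:00, 08:00 grid without any special casing.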
Future Leakage Prevention
CRITICAL SAFEGUARDS:
- Boundary Crossing Detection: Only complete candles when trade timestamp definitively crosses time boundary
- No Premature Completion: Never emit incomplete candles during real-time processing
- Strict Time Validation: Trades only added to buckets if start_time <= trade.timestamp < end_time
- Historical Consistency: Same logic for real-time and historical processing
# CORRECT: Only complete candle when boundary is crossed
if current_bucket.start_time != trade_bucket_start:
    # Time boundary definitely crossed - safe to complete
    completed_candle = current_bucket.to_candle(is_complete=True)
    emit_to_storage(completed_candle)

# INCORRECT: Would cause future leakage
if some_timer_expires():
    # Never complete based on timers or external events
    completed_candle = current_bucket.to_candle(is_complete=True)  # WRONG!
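To see the boundary-crossing rule in action, here is a small self-contained sketch. `FiveMinuteAggregator` and `bucket_start` are hypothetical stand-ins written for this illustration (they are not the project's `RealTimeCandleProcessor`), candles are plain dicts, and late trades for an already closed bucket are simply dropped, which is an assumption of the sketch.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal
from typing import Dict, List, Optional

BUCKET = timedelta(minutes=5)  # the demo handles a single fixed timeframe


def bucket_start(ts: datetime) -> datetime:
    """Floor a timestamp to its 5-minute boundary."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


class FiveMinuteAggregator:
    """Hypothetical stand-in for a real-time candle processor."""

    def __init__(self) -> None:
        self.current: Optional[Dict] = None  # incomplete candle, memory only

    def process_trade(self, ts: datetime, price: Decimal, size: Decimal) -> List[Dict]:
        completed: List[Dict] = []
        start = bucket_start(ts)
        if self.current is not None and start < self.current["start_time"]:
            return completed  # late trade for a closed bucket: dropped here
        if self.current is not None and start > self.current["start_time"]:
            # Boundary crossed: the previous candle can no longer change.
            self.current["is_complete"] = True
            completed.append(self.current)
            self.current = None
        if self.current is None:
            self.current = {
                "start_time": start, "end_time": start + BUCKET,
                "open": price, "high": price, "low": price, "close": price,
                "volume": Decimal("0"), "trade_count": 0, "is_complete": False,
            }
        candle = self.current
        candle["high"] = max(candle["high"], price)
        candle["low"] = min(candle["low"], price)
        candle["close"] = price
        candle["volume"] += size
        candle["trade_count"] += 1
        return completed


aggregator = FiveMinuteAggregator()
day = datetime(2024, 1, 1, tzinfo=timezone.utc)
emitted = [
    aggregator.process_trade(day.replace(hour=9, minute=1), Decimal("100"), Decimal("1")),
    aggregator.process_trade(day.replace(hour=9, minute=4), Decimal("101"), Decimal("2")),
    aggregator.process_trade(day.replace(hour=9, minute=5), Decimal("99"), Decimal("1")),
]
```

Only the third call returns a candle: the trade at 09:05:00 belongs to the next bucket, so its arrival proves that the 09:00 bucket is final. The candle that is still forming stays in `aggregator.current` and is never emitted.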
Data Storage Flow
WebSocket Trade Data → Validation → Transformation → Aggregation → Storage
        |                                                 |               |
        ↓                                                 ↓               ↓
Raw individual trades                            Completed OHLCV   Incomplete OHLCV
        |                                        candles (storage) (monitoring only)
        ↓                                                 |
raw_trades table                                 market_data table
(debugging/compliance)                           (trading decisions)
Storage Rules:
- Raw trades → raw_trades table (every individual trade/orderbook/ticker)
- Completed candles → market_data table (only when timeframe boundary crossed)
- Incomplete candles → Memory only (never stored, used for monitoring)
Aggregation Logic Implementation
def aggregate_to_timeframe(trades: List[StandardizedTrade], timeframe: str) -> List[OHLCVCandle]:
    """
    Aggregate trades to specified timeframe with right-aligned timestamps
    """
    # Group trades by time intervals
    buckets = {}
    completed_candles = []
    for trade in sorted(trades, key=lambda t: t.timestamp):
        # Calculate bucket start time (left boundary)
        bucket_start = get_bucket_start_time(trade.timestamp, timeframe)
        # Get or create bucket
        if bucket_start not in buckets:
            buckets[bucket_start] = TimeframeBucket(timeframe, bucket_start)
        # Add trade to bucket
        buckets[bucket_start].add_trade(trade)
    # Convert all buckets to candles with right-aligned timestamps
    for bucket in buckets.values():
        candle = bucket.to_candle(is_complete=True)
        # candle.timestamp = bucket.end_time (right-aligned)
        completed_candles.append(candle)
    return completed_candles
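The bucket object used above has to do two things: enforce the half-open interval and track OHLCV values as trades arrive. A minimal sketch of such a bucket follows. `BucketSketch` is a hypothetical name and its constructor differs from the `TimeframeBucket(timeframe, bucket_start)` call shown above; the real class may be structured differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from decimal import Decimal
from typing import Optional


@dataclass
class BucketSketch:
    """Hypothetical bucket; the project's TimeframeBucket may differ."""
    start_time: datetime
    end_time: datetime
    open: Optional[Decimal] = None
    high: Optional[Decimal] = None
    low: Optional[Decimal] = None
    close: Optional[Decimal] = None
    volume: Decimal = Decimal("0")
    trade_count: int = 0

    def add_trade(self, timestamp: datetime, price: Decimal, size: Decimal) -> bool:
        # Half-open interval: a trade exactly at end_time belongs to the next bucket.
        if not (self.start_time <= timestamp < self.end_time):
            return False
        if self.trade_count == 0:
            self.open = self.high = self.low = price
        else:
            self.high = max(self.high, price)
            self.low = min(self.low, price)
        self.close = price
        self.volume += size
        self.trade_count += 1
        return True

    def to_candle(self, is_complete: bool) -> dict:
        return {
            "timestamp": self.end_time,  # right-aligned: candle is stamped with its close time
            "open": self.open, "high": self.high, "low": self.low, "close": self.close,
            "volume": self.volume, "trade_count": self.trade_count,
            "is_complete": is_complete,
        }


start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
bucket = BucketSketch(start, start + timedelta(minutes=5))
accepted_first = bucket.add_trade(start.replace(minute=3, second=45), Decimal("100"), Decimal("1"))
accepted_second = bucket.add_trade(start.replace(minute=4, second=59), Decimal("98"), Decimal("2"))
accepted_boundary = bucket.add_trade(start.replace(minute=5), Decimal("97"), Decimal("1"))
candle = bucket.to_candle(is_complete=True)
```

Returning `False` for an out-of-range trade, rather than raising, lets the caller decide whether to open a new bucket or discard the trade.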
Common Components
Data Types (data/common/data_types.py)
StandardizedTrade: Universal trade format
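from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Any, Dict, Optional

@dataclass
class StandardizedTrade:
    symbol: str
    trade_id: str
    price: Decimal
    size: Decimal
    side: str  # 'buy' or 'sell'
    timestamp: datetime
    exchange: str = "okx"
    raw_data: Optional[Dict[str, Any]] = None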
OHLCVCandle: Universal candle format
@dataclass
class OHLCVCandle:
    symbol: str
    timeframe: str
    start_time: datetime
    end_time: datetime
    open: Decimal
    high: Decimal
    low: Decimal
    close: Decimal
    volume: Decimal
    trade_count: int
    is_complete: bool = False
Aggregation (data/common/aggregation.py)
RealTimeCandleProcessor: Handles real-time candle building for any exchange
- Processes trades immediately as they arrive
- Supports multiple timeframes simultaneously
- Emits completed candles when time boundaries cross
- Thread-safe and memory efficient
BatchCandleProcessor: Handles historical data processing
- Processes large batches of trades efficiently
- Memory-optimized for backfill scenarios
- Same candle output format as real-time processor
Transformation (data/common/transformation.py)
BaseDataTransformer: Abstract base class for exchange transformers
- Common transformation utilities (timestamp conversion, decimal handling)
- Abstract methods for exchange-specific transformations
- Consistent error handling patterns
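As an illustration of the kind of shared helpers a base transformer can offer, the sketch below converts millisecond timestamps and parses decimals defensively. The names `timestamp_ms_to_datetime` and `safe_decimal` are hypothetical and are not taken from `data/common/transformation.py`.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal, InvalidOperation
from typing import Any, Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timestamp_ms_to_datetime(value: Any) -> datetime:
    """Convert a millisecond Unix timestamp (int or numeric string) to aware UTC."""
    # Integer arithmetic via timedelta avoids float rounding of the milliseconds.
    return EPOCH + timedelta(milliseconds=int(value))


def safe_decimal(value: Any) -> Optional[Decimal]:
    """Parse a value into a finite Decimal, or return None if that is not possible."""
    try:
        result = Decimal(str(value))  # str() keeps 0.1 as "0.1", not the binary float
    except InvalidOperation:
        return None
    return result if result.is_finite() else None


ts = timestamp_ms_to_datetime("1704067200123")
price = safe_decimal("42000.5")
```

Going through `str()` before `Decimal` is a deliberate choice: exchanges usually send prices as strings, and floats passed in by accident keep their printed value instead of their binary expansion.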
UnifiedDataTransformer: Unified interface for all transformation scenarios
- Works with real-time, historical, and backfill data
- Handles batch processing efficiently
- Integrates with aggregation components
Validation (data/common/validation.py)
BaseDataValidator: Common validation patterns
- Price, size, volume validation
- Timestamp validation
- Orderbook validation
- Generic symbol validation
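The price and timestamp checks listed above might look roughly like the following sketch. `ValidationResultSketch`, `validate_positive_decimal`, and `validate_timestamp` are hypothetical names, and the 5-second tolerance for future timestamps is an assumed value, not one taken from the project.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from decimal import Decimal, InvalidOperation
from typing import Any, List


@dataclass
class ValidationResultSketch:
    """Hypothetical result type; the project's ValidationResult may differ."""
    is_valid: bool
    errors: List[str] = field(default_factory=list)


def validate_positive_decimal(name: str, value: Any) -> ValidationResultSketch:
    """Accept only finite numbers greater than zero (prices, sizes, volumes)."""
    try:
        number = Decimal(str(value))
    except InvalidOperation:
        return ValidationResultSketch(False, [f"{name} is not numeric: {value!r}"])
    if not number.is_finite() or number <= 0:
        return ValidationResultSketch(False, [f"{name} must be positive and finite: {value!r}"])
    return ValidationResultSketch(True)


def validate_timestamp(
    ts: datetime, now: datetime, max_future: timedelta = timedelta(seconds=5)
) -> ValidationResultSketch:
    """Reject naive timestamps and timestamps too far ahead of the local clock."""
    if ts.tzinfo is None:
        return ValidationResultSketch(False, ["timestamp must be timezone-aware"])
    if ts > now + max_future:
        return ValidationResultSketch(False, ["timestamp is in the future"])
    return ValidationResultSketch(True)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
price_check = validate_positive_decimal("price", "42000.5")
future_check = validate_timestamp(now + timedelta(minutes=1), now)
```

Each check returns as soon as it finds a problem, so invalid messages are rejected before any further work is done on them.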
Exchange-Specific Components
OKX Data Processor (data/exchanges/okx/data_processor.py)
Now focused only on OKX-specific functionality:
OKXDataValidator: Extends BaseDataValidator
- OKX-specific symbol patterns (BTC-USDT format)
- OKX message structure validation
- OKX field mappings and requirements
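For the symbol check, a pattern along these lines would match the BTC-USDT format. This is a sketch that assumes spot pairs only (one base and one quote segment made of uppercase letters and digits); the pattern actually used by `OKXDataValidator` is not shown in this document, and `is_okx_spot_symbol` is a hypothetical helper.

```python
import re

# Assumed spot format: BASE-QUOTE, uppercase letters and digits only.
OKX_SPOT_SYMBOL = re.compile(r"[A-Z0-9]+-[A-Z0-9]+")


def is_okx_spot_symbol(symbol: str) -> bool:
    """Return True when the whole string matches the assumed BASE-QUOTE format."""
    return OKX_SPOT_SYMBOL.fullmatch(symbol) is not None


checks = {s: is_okx_spot_symbol(s) for s in ("BTC-USDT", "BTCUSDT", "btc-usdt")}
```

`fullmatch` is used instead of `match` so that trailing characters, including a stray newline, cause the check to fail.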
OKXDataTransformer: Extends BaseDataTransformer
- OKX WebSocket format transformation
- OKX-specific field extraction
- Integration with common utilities
OKXDataProcessor: Main processor using common framework
- Uses common validation and transformation utilities
- Significantly simplified (~600 lines vs 1343 lines)
- Better separation of concerns
Updated OKX Collector (data/exchanges/okx/collector.py)
Key improvements:
- Uses OKXDataProcessor with common utilities
- Automatic candle generation for trades
- Simplified message processing
- Better error handling and statistics
- Callback system for real-time data
Usage Examples
Creating a New Exchange
To add support for a new exchange (e.g., Binance):
- Create exchange-specific validator:
class BinanceDataValidator(BaseDataValidator):
    def __init__(self, component_name="binance_validator"):
        super().__init__("binance", component_name)
        self._symbol_pattern = re.compile(r'^[A-Z]+[A-Z]+$')  # BTCUSDT format

    def validate_symbol_format(self, symbol: str) -> ValidationResult:
        # Binance-specific symbol validation
        pass
- Create exchange-specific transformer:
class BinanceDataTransformer(BaseDataTransformer):
    def transform_trade_data(self, raw_data: Dict[str, Any], symbol: str) -> Optional[StandardizedTrade]:
        return create_standardized_trade(
            symbol=raw_data['s'],  # Binance field mapping
            trade_id=raw_data['t'],
            price=raw_data['p'],
            size=raw_data['q'],
            # 'm' is True when the buyer is the market maker, i.e. the taker sold
            side='sell' if raw_data['m'] else 'buy',
            timestamp=raw_data['T'],
            exchange="binance",
            raw_data=raw_data
        )
- Automatic candle support:
# Real-time candles work automatically
processor = RealTimeCandleProcessor(symbol, "binance", config)
for trade in trades:
    completed_candles = processor.process_trade(trade)
Using Common Utilities
Data transformation:
# Works with any exchange
transformer = UnifiedDataTransformer(exchange_transformer)
standardized_trade = transformer.transform_trade_data(raw_trade, symbol)

# Batch processing
candles = transformer.process_trades_to_candles(
    trades_iterator,
    ['1m', '5m', '1h'],
    symbol
)
Real-time candle processing:
# Same code works for any exchange
candle_processor = RealTimeCandleProcessor(symbol, exchange, config)
candle_processor.add_candle_callback(my_candle_handler)
for trade in real_time_trades:
    completed_candles = candle_processor.process_trade(trade)
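A handler passed to `add_candle_callback` could be as small as the sketch below. It assumes the callback is invoked with a single candle object exposing the `OHLCVCandle` fields; the actual callback signature is not shown in this document. The `SimpleNamespace` object is only a stand-in so that the example runs without the project installed.

```python
from decimal import Decimal
from types import SimpleNamespace
from typing import Any, Dict, Tuple

# Most recent close per (symbol, timeframe), updated by the handler.
latest_close: Dict[Tuple[str, str], Decimal] = {}


def my_candle_handler(candle: Any) -> None:
    """Record the close of completed candles; ignore anything still forming."""
    if not candle.is_complete:
        return
    latest_close[(candle.symbol, candle.timeframe)] = candle.close


# Stand-in objects that use the OHLCVCandle field names.
forming = SimpleNamespace(symbol="BTC-USDT", timeframe="5m", close=Decimal("100"), is_complete=False)
finished = SimpleNamespace(symbol="BTC-USDT", timeframe="5m", close=Decimal("101"), is_complete=True)
my_candle_handler(forming)
my_candle_handler(finished)
```

The `is_complete` guard is redundant if the processor only ever emits completed candles, but it keeps the handler safe if it is reused for monitoring feeds that include incomplete ones.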
Testing
The refactored architecture includes comprehensive testing:
Test script: scripts/test_refactored_okx.py
- Tests common utilities
- Tests OKX-specific components
- Tests integration between components
- Performance and memory testing
Run tests:
python scripts/test_refactored_okx.py
Migration Guide
For Existing OKX Code
- Update imports:
# Old
from data.exchanges.okx.data_processor import StandardizedTrade, OHLCVCandle
# New
from data.common import StandardizedTrade, OHLCVCandle
- Use new processor:
# Old
from data.exchanges.okx.data_processor import OKXDataProcessor, UnifiedDataTransformer
# New
from data.exchanges.okx.data_processor import OKXDataProcessor # Uses common utilities internally
- Existing functionality preserved:
  - All existing APIs remain the same
  - Performance improved due to optimizations
  - More features available (better candle processing, validation)
For New Exchange Development
- Start with common base classes
- Implement only exchange-specific validation and transformation
- Get candle processing, batch processing, and validation for free
- Focus on exchange API integration rather than data processing logic
Performance Improvements
Memory Usage:
- Streaming processing reduces memory footprint
- Efficient candle bucketing algorithms
- Lazy evaluation where possible
Processing Speed:
- Optimized validation with early returns
- Batch processing capabilities
- Parallel processing support
Maintainability:
- Smaller, focused components
- Better test coverage
- Clear error handling and logging
Future Enhancements
Planned Features:
- Exchange Factory Pattern - Automatically create collectors for any exchange
- Plugin System - Load exchange implementations dynamically
- Configuration-Driven Development - Define new exchanges via config files
- Enhanced Analytics - Built-in technical indicators and statistics
- Multi-Exchange Arbitrage - Cross-exchange data synchronization
This refactored architecture provides a solid foundation for scalable, maintainable cryptocurrency data processing across any number of exchanges while keeping exchange-specific code minimal and focused.