Data Aggregation Strategy
Overview
This document describes the comprehensive data aggregation strategy used in the TCP Trading Platform for converting real-time trade data into OHLCV (Open, High, Low, Close, Volume) candles across multiple timeframes.
Core Principles
1. Right-Aligned Timestamps (Industry Standard)
The system follows the RIGHT-ALIGNED timestamp convention used by major exchanges:
- Candle timestamp = end time of the interval (close time)
- This represents when the candle period closes, not when it opens
- Aligns with Binance, OKX, Coinbase, and other major exchanges
- Ensures consistency with historical data APIs
Examples:
5-minute candle with timestamp 09:05:00:
├─ Represents data from 09:00:00 to 09:05:00
├─ Includes all trades in the interval [09:00:00, 09:05:00)
└─ Candle "closes" at 09:05:00
1-hour candle with timestamp 14:00:00:
├─ Represents data from 13:00:00 to 14:00:00
├─ Includes all trades in the interval [13:00:00, 14:00:00)
└─ Candle "closes" at 14:00:00
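The same convention can be expressed directly as timestamp arithmetic. The sketch below is illustrative only; the helper name and the epoch-based interval table are not part of the production code:
from datetime import datetime, timezone

# Interval lengths in seconds (mapping assumed for this sketch)
TIMEFRAME_SECONDS = {'1m': 60, '5m': 300, '15m': 900, '1h': 3600, '4h': 14400, '1d': 86400}

def right_aligned_timestamp(trade_ts: datetime, timeframe: str) -> datetime:
    """Return the candle timestamp (interval close time) for a trade."""
    interval = TIMEFRAME_SECONDS[timeframe]
    epoch = int(trade_ts.timestamp())
    bucket_start = epoch - (epoch % interval)                  # left boundary of the interval
    return datetime.fromtimestamp(bucket_start + interval,     # right boundary = candle timestamp
                                  tz=timezone.utc)

# A trade at 09:03:45 UTC belongs to the 5-minute candle stamped 09:05:00 UTC
right_aligned_timestamp(datetime(2024, 1, 2, 9, 3, 45, tzinfo=timezone.utc), '5m')
# -> 2024-01-02 09:05:00+00:00
A trade that lands exactly on a boundary (for example 09:05:00) falls into the next interval and is stamped 09:10:00, which matches the half-open buckets used in the examples below.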
2. Future Leakage Prevention
CRITICAL: The system implements strict safeguards to prevent future leakage:
- Only emit completed candles when time boundary is definitively crossed
- Never emit incomplete candles during real-time processing
- No timer-based completion - only trade timestamp-driven
- Strict time validation for all trade additions
Aggregation Process
Real-Time Processing Flow
graph TD
A[Trade Arrives from WebSocket] --> B[Extract Timestamp T]
B --> C[For Each Timeframe]
C --> D[Calculate Bucket Start Time]
D --> E{Bucket Exists?}
E -->|No| F[Create New Bucket]
E -->|Yes| G{Same Time Period?}
G -->|Yes| H[Add Trade to Current Bucket]
G -->|No| I[Complete Previous Bucket]
I --> J[Emit Completed Candle]
J --> K[Store in market_data Table]
K --> F
F --> H
H --> L[Update OHLCV Values]
L --> M[Continue Processing]
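In code, the flow above reduces to a per-timeframe routine along the following lines. This is a sketch only: the bucket container (self.buckets) and the TimeframeBucket constructor arguments are assumptions made for illustration, while get_bucket_start_time is defined in the next section.
def _process_trade_for_timeframe(self, trade, timeframe):
    """Route one trade into its bucket for one timeframe; return any completed candles."""
    completed = []
    bucket_start = get_bucket_start_time(trade.timestamp, timeframe)
    current = self.buckets.get(timeframe)            # currently open bucket, if any

    if current is not None and current.start_time != bucket_start:
        # Time boundary crossed: the previous bucket is definitively complete
        if current.trade_count > 0:
            completed.append(current.to_candle(is_complete=True))  # timestamp = bucket end time
        current = None                               # discard the finished bucket

    if current is None:
        # Create a fresh bucket for this trade's interval
        current = TimeframeBucket(exchange=trade.exchange,
                                  symbol=trade.symbol,
                                  timeframe=timeframe,
                                  start_time=bucket_start)
        self.buckets[timeframe] = current

    current.add_trade(trade)                         # incremental OHLCV update
    return completed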
Time Bucket Calculation
The system calculates which time bucket a trade belongs to based on its timestamp:
from datetime import datetime

def get_bucket_start_time(timestamp: datetime, timeframe: str) -> datetime:
    """
    Calculate the start time of the bucket for a given trade timestamp.
    This determines the LEFT boundary of the time interval.
    The RIGHT boundary (end_time) becomes the candle timestamp.
    """
    # Normalize to remove seconds/microseconds
    dt = timestamp.replace(second=0, microsecond=0)
    if timeframe == '1m':
        # 1-minute: align to minute boundaries
        return dt
    elif timeframe == '5m':
        # 5-minute: 00:00, 00:05, 00:10, 00:15, etc.
        return dt.replace(minute=(dt.minute // 5) * 5)
    elif timeframe == '15m':
        # 15-minute: 00:00, 00:15, 00:30, 00:45
        return dt.replace(minute=(dt.minute // 15) * 15)
    elif timeframe == '1h':
        # 1-hour: align to hour boundaries
        return dt.replace(minute=0)
    elif timeframe == '4h':
        # 4-hour: 00:00, 04:00, 08:00, 12:00, 16:00, 20:00
        return dt.replace(minute=0, hour=(dt.hour // 4) * 4)
    elif timeframe == '1d':
        # 1-day: align to midnight UTC
        return dt.replace(minute=0, hour=0)
    else:
        raise ValueError(f"Unsupported timeframe: {timeframe}")
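For illustration, given a UTC trade timestamp of 09:03:45 (the date is arbitrary):
from datetime import datetime, timezone

ts = datetime(2024, 1, 2, 9, 3, 45, tzinfo=timezone.utc)

get_bucket_start_time(ts, '1m')   # 2024-01-02 09:03:00+00:00
get_bucket_start_time(ts, '5m')   # 2024-01-02 09:00:00+00:00
get_bucket_start_time(ts, '15m')  # 2024-01-02 09:00:00+00:00
get_bucket_start_time(ts, '1h')   # 2024-01-02 09:00:00+00:00
get_bucket_start_time(ts, '4h')   # 2024-01-02 08:00:00+00:00
get_bucket_start_time(ts, '1d')   # 2024-01-02 00:00:00+00:00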
Detailed Examples
5-Minute Timeframe Processing
Current time: 09:03:45
Trade arrives at: 09:03:45
Step 1: Calculate bucket start time
├─ timeframe = '5m'
├─ minute = 3
├─ bucket_minute = (3 // 5) * 5 = 0
└─ bucket_start = 09:00:00
Step 2: Bucket boundaries
├─ start_time = 09:00:00 (inclusive)
├─ end_time = 09:05:00 (exclusive)
└─ candle_timestamp = 09:05:00 (right-aligned)
Step 3: Trade validation
├─ 09:00:00 <= 09:03:45 < 09:05:00 ✓
└─ Trade belongs to this bucket
Step 4: OHLCV update
├─ If first trade: set open price
├─ Update high/low prices
├─ Set close price (latest trade)
├─ Add to volume
└─ Increment trade count
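A minimal sketch of that incremental update as a method on the bucket; the trade.price and trade.size field names are assumptions made for illustration:
def update_ohlcv(self, trade) -> None:
    """Fold a single trade into the bucket's running OHLCV state."""
    if self.trade_count == 0:
        self.open = trade.price          # first trade in the interval sets the open
        self.high = trade.price
        self.low = trade.price
    else:
        self.high = max(self.high, trade.price)
        self.low = min(self.low, trade.price)
    self.close = trade.price             # latest trade always sets the close
    self.volume += trade.size            # volume is assumed to start at 0 for a new bucket
    self.trade_count += 1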
Boundary Crossing Example
Scenario: 5-minute timeframe, transition from 09:04:59 to 09:05:00
Trade 1: timestamp = 09:04:59
├─ bucket_start = 09:00:00
├─ Belongs to current bucket [09:00:00 - 09:05:00)
└─ Add to current bucket
Trade 2: timestamp = 09:05:00
├─ bucket_start = 09:05:00
├─ Different from current bucket (09:00:00)
├─ TIME BOUNDARY CROSSED!
├─ Complete previous bucket → candle with timestamp 09:05:00
├─ Store completed candle in market_data table
├─ Create new bucket [09:05:00 - 09:10:00)
└─ Add Trade 2 to new bucket
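Completing a bucket then simply stamps its right boundary onto the emitted candle. A sketch, assuming an OHLCVCandle value object with the fields shown (the exact constructor is an assumption):
def to_candle(self, is_complete: bool) -> OHLCVCandle:
    """Turn the bucket into a candle; the candle timestamp is the bucket END time."""
    return OHLCVCandle(
        exchange=self.exchange,
        symbol=self.symbol,
        timeframe=self.timeframe,
        timestamp=self.end_time,          # right-aligned: the 09:00-09:05 bucket is stamped 09:05:00
        start_time=self.start_time,
        end_time=self.end_time,
        open=self.open,
        high=self.high,
        low=self.low,
        close=self.close,
        volume=self.volume,
        trades_count=self.trade_count,
        is_complete=is_complete,
    )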
Data Storage Strategy
Storage Tables
1. raw_trades Table
Purpose: Store every individual piece of data as received
Data: Trades, orderbook updates, tickers
Usage: Debugging, compliance, detailed analysis
CREATE TABLE raw_trades (
id SERIAL PRIMARY KEY,
exchange VARCHAR(50) NOT NULL,
symbol VARCHAR(20) NOT NULL,
timestamp TIMESTAMPTZ NOT NULL,
data_type VARCHAR(20) NOT NULL, -- 'trade', 'orderbook', 'ticker'
raw_data JSONB NOT NULL
);
2. market_data Table
Purpose: Store completed OHLCV candles for trading decisions
Data: Only completed candles with right-aligned timestamps
Usage: Bot strategies, backtesting, analysis
CREATE TABLE market_data (
id SERIAL PRIMARY KEY,
exchange VARCHAR(50) NOT NULL,
symbol VARCHAR(20) NOT NULL,
timeframe VARCHAR(5) NOT NULL,
timestamp TIMESTAMPTZ NOT NULL, -- RIGHT-ALIGNED (candle close time)
open DECIMAL(18,8) NOT NULL,
high DECIMAL(18,8) NOT NULL,
low DECIMAL(18,8) NOT NULL,
close DECIMAL(18,8) NOT NULL,
volume DECIMAL(18,8) NOT NULL,
trades_count INTEGER
);
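Strategies and backtests read exclusively from this table. A hedged sketch of such a read path using psycopg2 (driver choice, connection handling, and column subset are assumptions):
import psycopg2  # assumed PostgreSQL driver

def fetch_recent_candles(conn, exchange, symbol, timeframe, limit=200):
    """Return the most recent completed candles, newest first."""
    sql = """
        SELECT timestamp, open, high, low, close, volume
        FROM market_data
        WHERE exchange = %s AND symbol = %s AND timeframe = %s
        ORDER BY timestamp DESC
        LIMIT %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (exchange, symbol, timeframe, limit))
        return cur.fetchall()

# Example (connection parameters are placeholders):
# conn = psycopg2.connect("dbname=trading user=bot")
# candles = fetch_recent_candles(conn, 'okx', 'BTC-USDT', '5m')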
Storage Flow
WebSocket Message
├─ Contains multiple trades
├─ Each trade stored in raw_trades table
└─ Each trade processed through aggregation
Aggregation Engine
├─ Groups trades by timeframe buckets
├─ Updates OHLCV values incrementally
├─ Detects time boundary crossings
└─ Emits completed candles only
Completed Candles
├─ Stored in market_data table
├─ Timestamp = bucket end time (right-aligned)
├─ is_complete = true
└─ Available for trading strategies
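Putting the flow together, a collector's message handler might look roughly like the sketch below; the method names and the message shape are assumptions made for illustration, not the actual collector API:
async def handle_trade_message(self, message):
    """Persist raw data first, then run every trade through aggregation."""
    for raw in message['data']:                    # assumed OKX-style payload with a 'data' array
        await self.store_raw_trade(raw)            # raw_trades table (audit trail)
        trade = self.normalize(raw)                # -> StandardizedTrade
        for candle in self.processor.process_trade(trade):
            await self.store_candle(candle)        # market_data table, right-aligned timestamp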
Future Leakage Prevention
Critical Safeguards
1. Boundary Crossing Detection
# CORRECT: Only complete when boundary definitively crossed
if current_bucket.start_time != trade_bucket_start:
    # Time boundary crossed - safe to complete previous bucket
    if current_bucket.trade_count > 0:
        completed_candle = current_bucket.to_candle(is_complete=True)
        emit_candle(completed_candle)
2. No Premature Completion
# WRONG: Never complete based on timers or external events
if datetime.now(timezone.utc) > bucket.end_time:
    completed_candle = bucket.to_candle(is_complete=True)  # FUTURE LEAKAGE!

# WRONG: Never complete incomplete buckets during real-time
if some_condition:
    completed_candle = current_bucket.to_candle(is_complete=True)  # WRONG!
3. Strict Time Validation
def add_trade(self, trade: StandardizedTrade) -> bool:
    # Only accept trades within bucket boundaries
    if not (self.start_time <= trade.timestamp < self.end_time):
        return False  # Reject trades outside time range

    # Safe to add trade
    self.update_ohlcv(trade)
    return True
4. Historical Consistency
# Same logic for real-time and historical processing
def process_trade(self, trade: StandardizedTrade) -> list:
    """Used for both real-time WebSocket and historical API data"""
    completed = []
    for timeframe in self.timeframes:
        completed.extend(self._process_trade_for_timeframe(trade, timeframe))
    return completed
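For example, a historical backfill can feed REST-sourced trades through the very same processor used for the WebSocket stream; the function below is an illustrative sketch, not part of the codebase:
def backfill_from_history(processor, historical_trades):
    """Replay historical trades (oldest first) through the same aggregation path."""
    completed = []
    for trade in sorted(historical_trades, key=lambda t: t.timestamp):
        completed.extend(processor.process_trade(trade))   # identical code path to real-time
    return completed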
Testing Strategy
Validation Tests
- Timestamp Alignment Tests
  - Verify candle timestamps are right-aligned
  - Check bucket boundary calculations
  - Validate timeframe-specific alignment
- Future Leakage Tests
  - Ensure no incomplete candles are emitted
  - Verify boundary crossing detection
  - Test with edge case timestamps
- Data Integrity Tests
  - OHLCV calculation accuracy
  - Volume aggregation correctness
  - Trade count validation
Test Examples
from datetime import time

def test_right_aligned_timestamps():
    """Test that candle timestamps are right-aligned"""
    trades = [
        create_trade("09:01:30", price=100),
        create_trade("09:03:45", price=101),
        create_trade("09:05:00", price=102),  # Boundary crossing
    ]
    candles = process_trades(trades, timeframe='5m')

    # First candle should have timestamp 09:05:00 (right-aligned)
    assert candles[0].timestamp.time() == time(9, 5)
    assert candles[0].start_time.time() == time(9, 0)
    assert candles[0].end_time.time() == time(9, 5)
def test_no_future_leakage():
    """Test that incomplete candles are never emitted"""
    processor = RealTimeCandleProcessor(symbol='BTC-USDT', timeframes=['5m'])

    # Add trades within same bucket
    trade1 = create_trade("09:01:00", price=100)
    trade2 = create_trade("09:03:00", price=101)

    # Should return empty list (no completed candles)
    completed = processor.process_trade(trade1)
    assert len(completed) == 0
    completed = processor.process_trade(trade2)
    assert len(completed) == 0

    # Only when boundary crossed should candle be emitted
    trade3 = create_trade("09:05:00", price=102)
    completed = processor.process_trade(trade3)
    assert len(completed) == 1  # Previous bucket completed
    assert completed[0].is_complete is True
Performance Considerations
Memory Management
- Keep only current buckets in memory
- Clear completed buckets immediately after emission
- Limit maximum number of active timeframes
Database Optimization
- Batch insert completed candles
- Use prepared statements for frequent inserts
- Index on (symbol, timeframe, timestamp) for queries
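A sketch of the batched write path using psycopg2's execute_values (the driver choice and exact column list are assumptions; any equivalent batched insert achieves the same effect):
from psycopg2.extras import execute_values

def store_candles(conn, candles):
    """Batch-insert completed candles into market_data in a single round trip."""
    rows = [(c.exchange, c.symbol, c.timeframe, c.timestamp,
             c.open, c.high, c.low, c.close, c.volume, c.trades_count)
            for c in candles]
    sql = """
        INSERT INTO market_data
            (exchange, symbol, timeframe, timestamp,
             open, high, low, close, volume, trades_count)
        VALUES %s
    """
    with conn.cursor() as cur:
        execute_values(cur, sql, rows)
    conn.commit()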
Processing Efficiency
- Process all timeframes in single trade iteration
- Use efficient bucket start time calculations
- Minimize object creation in hot paths
Conclusion
This aggregation strategy ensures:
✅ Industry Standard Compliance: Right-aligned timestamps matching major exchanges
✅ Future Leakage Prevention: Strict boundary detection and validation
✅ Data Integrity: Accurate OHLCV calculations and storage
✅ Performance: Efficient real-time and batch processing
✅ Consistency: Same logic for real-time and historical data
The implementation provides a robust foundation for building trading strategies with confidence in data accuracy and timing.