# System Architecture
## Overview
The Orderflow Backtest System is designed as a modular, high-performance data processing pipeline for cryptocurrency trading analysis. The architecture emphasizes separation of concerns, efficient memory usage, and scalable processing of large datasets.
## High-Level Architecture
```
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Data Sources │ │ Processing │ │ Presentation │
│ │ │ │ │ │
│ ┌─────────────┐ │ │ ┌──────────────┐ │ │ ┌─────────────┐ │
│ │SQLite Files │─┼────┼→│ Storage │─┼────┼→│ Visualizer │ │
│ │- orderbook │ │ │ │- Orchestrator│ │ │ │- OHLC Charts│ │
│ │- trades │ │ │ │- Calculator │ │ │ │- OBI/CVD │ │
│ └─────────────┘ │ │ └──────────────┘ │ │ └─────────────┘ │
│ │ │ │ │ │ ▲ │
└─────────────────┘ │ ┌─────────────┐ │ │ ┌─────────────┐ │
│ │ Strategy │──┼────┼→│ Reports │ │
│ │- Analysis │ │ │ │- Metrics │ │
│ │- Alerts │ │ │ │- Summaries │ │
│ └─────────────┘ │ │ └─────────────┘ │
└──────────────────┘ └─────────────────┘
```
## Component Architecture
### Data Layer
#### Models (`models.py`)
**Purpose**: Core data structures and calculation logic
```python
# Core data models
OrderbookLevel # Single price level (price, size, order_count, liquidation_count)
Trade # Individual trade execution (price, size, side, timestamp)
BookSnapshot # Complete orderbook state at timestamp
Book # Container for snapshot sequence
Metric # Calculated OBI/CVD values
# Calculation engine
MetricCalculator # Static methods for OBI/CVD computation
```
**Relationships**:
- `Book` contains multiple `BookSnapshot` instances
- `BookSnapshot` contains dictionaries of `OrderbookLevel` and lists of `Trade`
- `Metric` stores calculated values for each `BookSnapshot`
- `MetricCalculator` operates on snapshots to produce metrics
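The relationships above can be sketched with dataclasses. This is an illustrative shape only — the class names come from the doc, but the exact fields and signatures in `models.py` may differ.

```python
from dataclasses import dataclass, field

@dataclass
class OrderbookLevel:
    price: float
    size: float
    order_count: int = 0
    liquidation_count: int = 0

@dataclass
class Trade:
    price: float
    size: float
    side: str        # "buy" or "sell"
    timestamp: str

@dataclass
class BookSnapshot:
    timestamp: str
    bids: dict = field(default_factory=dict)   # price -> OrderbookLevel
    asks: dict = field(default_factory=dict)   # price -> OrderbookLevel
    trades: list = field(default_factory=list)  # Trade executions at this timestamp

@dataclass
class Book:
    snapshots: list = field(default_factory=list)  # sequence of BookSnapshot

# A Book containing one snapshot with a single bid level
snap = BookSnapshot(timestamp="2024-01-01T00:00:00Z")
snap.bids[100.0] = OrderbookLevel(price=100.0, size=2.5)
book = Book(snapshots=[snap])
```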
#### Repositories (`repositories/`)
**Purpose**: Database access and persistence layer
```python
# Read-only base repository
SQLiteOrderflowRepository:
- connect() # Optimized SQLite connection
- load_trades_by_timestamp() # Efficient trade loading
- iterate_book_rows() # Memory-efficient snapshot streaming
- count_rows() # Performance monitoring
# Write-enabled metrics repository
SQLiteMetricsRepository:
- create_metrics_table() # Schema creation
- insert_metrics_batch() # High-performance batch inserts
- load_metrics_by_timerange() # Time-range queries
- table_exists() # Schema validation
```
**Design Patterns**:
- **Repository Pattern**: Clean separation between data access and business logic
- **Batch Processing**: Process 1000 records per database operation
- **Connection Management**: Caller manages connection lifecycle
- **Performance Optimization**: SQLite PRAGMAs for high-speed operations
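The batch-insert pattern can be sketched with `sqlite3.executemany`, which sends all rows in one driver call. The function name mirrors `insert_metrics_batch` above, but the column list here is a simplified assumption.

```python
import sqlite3

BATCH_SIZE = 1000  # matches the batch size described above

def insert_metrics_batch(conn, rows):
    """Insert (snapshot_id, timestamp, obi, cvd) rows in a single
    executemany call; one commit per batch, not per row."""
    conn.executemany(
        "INSERT INTO metrics (snapshot_id, timestamp, obi, cvd) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()

# Demonstration against an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " snapshot_id INTEGER, timestamp TEXT, obi REAL, cvd REAL)"
)
insert_metrics_batch(conn, [(i, "2024-01-01", 0.1, 5.0) for i in range(BATCH_SIZE)])
count = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
```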
### Processing Layer
#### Storage (`storage.py`)
**Purpose**: Orchestrates data loading, processing, and metrics calculation
```python
class Storage:
- build_booktick_from_db() # Main processing pipeline
- _create_snapshots_and_metrics() # Per-snapshot processing
- _snapshot_from_row() # Individual snapshot creation
```
**Processing Pipeline**:
1. **Initialize**: Create metrics repository and table if needed
2. **Load Trades**: Group trades by timestamp for efficient access
3. **Stream Processing**: Process snapshots one-by-one to minimize memory
4. **Calculate Metrics**: OBI and CVD calculation per snapshot
5. **Batch Persistence**: Store metrics in batches of 1000
6. **Memory Management**: Discard full snapshots after metric extraction
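The six steps above can be sketched as a single streaming loop. This simplified version works over plain dicts instead of the real repositories, so the names and row shapes are illustrative, not the actual `storage.py` API.

```python
def stream_metrics(book_rows, trades_by_ts, batch_size=1000):
    """Streaming sketch: one snapshot in memory at a time, metrics
    flushed in batches, full snapshot data discarded after use."""
    stored, batch, cvd = [], [], 0.0
    for row in book_rows:                                  # 3. stream snapshots
        trades = trades_by_ts.get(row["timestamp"], [])    # 2. grouped trades
        bid_vol = sum(size for _, size in row["bids"])
        ask_vol = sum(size for _, size in row["asks"])
        total = bid_vol + ask_vol
        obi = (bid_vol - ask_vol) / total if total else 0.0  # 4. per-snapshot OBI
        cvd += sum(t["size"] if t["side"] == "buy" else -t["size"] for t in trades)
        batch.append((row["id"], row["timestamp"], obi, cvd))
        if len(batch) >= batch_size:                       # 5. batch persistence
            stored.extend(batch)
            batch.clear()
        # 6. the full snapshot (row) goes out of scope on the next iteration
    stored.extend(batch)
    return stored

rows = [{"id": i, "timestamp": f"t{i}", "bids": [(100.0, 2.0)], "asks": [(101.0, 1.0)]}
        for i in range(3)]
trades = {"t0": [{"side": "buy", "size": 1.0}]}
metrics = stream_metrics(rows, trades)
```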
#### Strategy Framework (`strategies.py`)
**Purpose**: Trading analysis and signal generation
```python
class DefaultStrategy:
- set_db_path() # Configure database access
- compute_OBI() # Real-time OBI calculation (fallback)
- load_stored_metrics() # Retrieve persisted metrics
- get_metrics_summary() # Statistical analysis
- on_booktick() # Main analysis entry point
```
**Analysis Capabilities**:
- **Stored Metrics**: Primary analysis using persisted data
- **Real-time Fallback**: Live OBI calculation when stored metrics are unavailable
- **Statistical Summaries**: Min/max/average OBI, CVD changes
- **Alert System**: Configurable thresholds for significant imbalances
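A minimal sketch of the summary and alert logic above, assuming metrics rows shaped as `(snapshot_id, timestamp, obi, cvd)`; the actual `get_metrics_summary` output format may differ.

```python
def get_metrics_summary(metrics, obi_alert_threshold=0.5):
    """Min/max/average OBI, CVD change, and threshold-based alerts
    over a list of (snapshot_id, timestamp, obi, cvd) rows."""
    obis = [m[2] for m in metrics]
    cvds = [m[3] for m in metrics]
    return {
        "obi_min": min(obis),
        "obi_max": max(obis),
        "obi_avg": sum(obis) / len(obis),
        "cvd_change": cvds[-1] - cvds[0],
        # Alert on significant imbalances past the configurable threshold
        "alerts": [m for m in metrics if abs(m[2]) > obi_alert_threshold],
    }

summary = get_metrics_summary([
    (1, "t0", -0.2, 0.0),
    (2, "t1", 0.4, 3.0),
    (3, "t2", 0.1, 5.0),
])
```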
### Presentation Layer
#### Visualization (`visualizer.py`)
**Purpose**: Multi-chart rendering and display
```python
class Visualizer:
- set_db_path() # Configure metrics access
- update_from_book() # Main rendering pipeline
- _load_stored_metrics() # Retrieve metrics for chart range
- _draw() # Multi-subplot rendering
- show() # Display interactive charts
```
**Chart Layout**:
```
┌─────────────────────────────────────┐
│ OHLC Candlesticks │ ← Price action
├─────────────────────────────────────┤
│ Volume Bars │ ← Trading volume
├─────────────────────────────────────┤
│ OBI Line Chart │ ← Order book imbalance
├─────────────────────────────────────┤
│ CVD Line Chart │ ← Cumulative volume delta
└─────────────────────────────────────┘
```
**Features**:
- **Shared Time Axis**: Synchronized X-axis across all subplots
- **Auto-scaling**: Y-axis optimization for each metric type
- **Performance**: Efficient rendering of large datasets
- **Interactive**: Qt5Agg backend for zooming and panning
## Data Flow
### Processing Flow
```
1. SQLite DB → Repository → Raw Data
2. Raw Data → Storage → BookSnapshot
3. BookSnapshot → MetricCalculator → OBI/CVD
4. Metrics → Repository → Database Storage
5. Stored Metrics → Strategy → Analysis
6. Stored Metrics → Visualizer → Charts
```
### Memory Management Flow
```
Traditional: DB → All Snapshots in Memory → Analysis (High Memory)
Optimized: DB → Process Snapshot → Calculate Metrics → Store → Discard (Low Memory)
```
## Database Schema
### Input Schema (Required)
```sql
-- Orderbook snapshots
CREATE TABLE book (
id INTEGER PRIMARY KEY,
instrument TEXT,
bids TEXT, -- JSON: [[price, size, liq_count, order_count], ...]
asks TEXT, -- JSON: [[price, size, liq_count, order_count], ...]
timestamp TEXT
);
-- Trade executions
CREATE TABLE trades (
id INTEGER PRIMARY KEY,
instrument TEXT,
trade_id TEXT,
price REAL,
size REAL,
side TEXT, -- "buy" or "sell"
timestamp TEXT
);
```
### Output Schema (Auto-created)
```sql
-- Calculated metrics
CREATE TABLE metrics (
id INTEGER PRIMARY KEY AUTOINCREMENT,
snapshot_id INTEGER,
timestamp TEXT,
obi REAL, -- Order Book Imbalance [-1, 1]
cvd REAL, -- Cumulative Volume Delta
best_bid REAL,
best_ask REAL,
FOREIGN KEY (snapshot_id) REFERENCES book(id)
);
-- Performance indexes
CREATE INDEX idx_metrics_timestamp ON metrics(timestamp);
CREATE INDEX idx_metrics_snapshot_id ON metrics(snapshot_id);
```
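The output schema and a typical time-range query can be exercised end to end with the stdlib `sqlite3` module. This uses the same DDL as above against an in-memory database for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same DDL as the auto-created output schema above
conn.executescript("""
CREATE TABLE metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    snapshot_id INTEGER,
    timestamp TEXT,
    obi REAL,
    cvd REAL,
    best_bid REAL,
    best_ask REAL
);
CREATE INDEX idx_metrics_timestamp ON metrics(timestamp);
CREATE INDEX idx_metrics_snapshot_id ON metrics(snapshot_id);
""")
conn.execute(
    "INSERT INTO metrics (snapshot_id, timestamp, obi, cvd, best_bid, best_ask)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    (1, "2024-01-01T00:00:00Z", 0.25, 10.0, 100.0, 100.5),
)
# Time-range query served by idx_metrics_timestamp
rows = conn.execute(
    "SELECT obi, cvd FROM metrics WHERE timestamp BETWEEN ? AND ?",
    ("2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z"),
).fetchall()
```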
## Performance Characteristics
### Memory Optimization
- **Before**: Store all snapshots in memory (~1GB for 600K snapshots)
- **After**: Store only metrics data (~300MB for same dataset)
- **Reduction**: >70% memory usage decrease
### Processing Performance
- **Batch Size**: 1000 records per database operation
- **Processing Speed**: ~1000 snapshots/second on modern hardware
- **Database Overhead**: <20% storage increase for metrics table
- **Query Performance**: Sub-second retrieval for typical time ranges
### Scalability Limits
- **Single File**: 1M+ snapshots per database file
- **Time Range**: Months to years of historical data
- **Memory Peak**: <2GB for year-long datasets
- **Disk Space**: Original size + 20% for metrics
## Integration Points
### External Interfaces
```python
# Main application entry point
main.py:
- CLI argument parsing
- Database file discovery
- Component orchestration
- Progress monitoring
# Plugin interfaces
Strategy.on_booktick(book: Book) # Strategy integration point
Visualizer.update_from_book(book) # Visualization integration
```
### Internal Interfaces
```python
# Repository interfaces
Repository.connect() → Connection
Repository.load_data() → TypedData
Repository.store_data(data) → None
# Calculator interfaces
MetricCalculator.calculate_obi(snapshot) → float
MetricCalculator.calculate_cvd(prev_cvd, trades) → float
```
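The calculator interfaces can be sketched as pure functions. The OBI formula below is the standard normalized imbalance, which matches the `[-1, 1]` range noted in the metrics schema; the CVD accumulation (buys add, sells subtract) follows the processing-flow description. Both are assumptions about the internals, simplified to take sizes and tuples rather than full snapshot objects.

```python
def calculate_obi(bid_sizes, ask_sizes):
    """OBI = (bid_volume - ask_volume) / (bid_volume + ask_volume),
    bounded in [-1, 1]; 0.0 when the book is empty."""
    bid_vol, ask_vol = sum(bid_sizes), sum(ask_sizes)
    total = bid_vol + ask_vol
    return (bid_vol - ask_vol) / total if total else 0.0

def calculate_cvd(prev_cvd, trades):
    """CVD accumulates signed trade sizes onto the previous value:
    buys add, sells subtract. Trades are (side, size) tuples here."""
    return prev_cvd + sum(size if side == "buy" else -size for side, size in trades)

obi = calculate_obi([3.0, 2.0], [1.0, 1.0])               # (5 - 2) / 7
cvd = calculate_cvd(10.0, [("buy", 2.0), ("sell", 0.5)])  # 10 + 2 - 0.5
```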
## Security Considerations
### Data Protection
- **SQL Injection**: All queries use parameterized statements
- **File Access**: Validates database file paths and permissions
- **Error Handling**: No sensitive data in error messages
- **Input Validation**: Sanitizes all external inputs
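The parameterized-statement protection works because the driver binds values separately from the SQL text, so hostile input is treated as data, never as SQL. A minimal demonstration (table and input are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (instrument TEXT, price REAL)")
conn.execute("INSERT INTO trades VALUES ('BTC-USD', 50000.0)")

instrument = "BTC-USD'; DROP TABLE trades; --"  # hostile input

# Parameterized: the '?' placeholder binds the value, so the payload is inert
safe = conn.execute(
    "SELECT COUNT(*) FROM trades WHERE instrument = ?", (instrument,)
).fetchone()[0]

# Never do this: string formatting splices untrusted input into the SQL text
# conn.execute(f"SELECT * FROM trades WHERE instrument = '{instrument}'")

table_intact = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
```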
### Access Control
- **Database**: Respects file system permissions
- **Memory**: No sensitive data persistence beyond processing
- **Logging**: Configurable log levels without data exposure
## Configuration Management
### Performance Tuning
```python
# Storage configuration
BATCH_SIZE = 1000 # Records per database operation
LOG_FREQUENCY = 20 # Progress reports per processing run
# SQLite optimization
PRAGMA journal_mode = OFF # Maximum write performance
PRAGMA synchronous = OFF # Disable synchronous writes
PRAGMA cache_size = 100000 # Large memory cache
```
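Applying these PRAGMAs per connection might look like the sketch below. Note the trade-off: `journal_mode = OFF` and `synchronous = OFF` sacrifice crash safety for write speed, which is acceptable here because the metrics table can always be rebuilt from the source data.

```python
import sqlite3

def connect_optimized(db_path):
    """Open a SQLite connection with the write-throughput PRAGMAs above.
    Illustrative sketch; the repository's connect() may differ."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode = OFF")   # no rollback journal
    conn.execute("PRAGMA synchronous = OFF")    # no fsync on write
    conn.execute("PRAGMA cache_size = 100000")  # large page cache
    return conn

conn = connect_optimized(":memory:")
sync = conn.execute("PRAGMA synchronous").fetchone()[0]  # 0 == OFF
```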
### Visualization Settings
```python
# Chart configuration
WINDOW_SECONDS = 60 # OHLC aggregation window
MAX_BARS = 500 # Maximum bars displayed
FIGURE_SIZE = (12, 10) # Chart dimensions
```
## Error Handling Strategy
### Graceful Degradation
- **Database Errors**: Continue with reduced functionality
- **Calculation Errors**: Skip problematic snapshots with logging
- **Visualization Errors**: Display available data, note issues
- **Memory Pressure**: Adjust batch sizes automatically
### Recovery Mechanisms
- **Partial Processing**: Resume from last successful batch
- **Data Validation**: Verify metrics calculations before storage
- **Rollback Support**: Transaction boundaries for data consistency
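The validate-then-store pattern with transaction boundaries can be sketched using `sqlite3`'s connection context manager, which commits on success and rolls back on any exception — so a table never holds a partial batch. The validation rule and table shape below are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (snapshot_id INTEGER, obi REAL)")

def store_batch(conn, batch):
    """Validate and persist one batch inside a single transaction;
    any bad row rolls back the entire batch."""
    try:
        with conn:  # commits on success, rolls back on exception
            for snapshot_id, obi in batch:
                if not -1.0 <= obi <= 1.0:
                    raise ValueError(f"OBI out of range: {obi}")
                conn.execute("INSERT INTO metrics VALUES (?, ?)", (snapshot_id, obi))
        return True
    except ValueError:
        return False

ok = store_batch(conn, [(1, 0.2), (2, 0.3)])   # both rows commit
bad = store_batch(conn, [(3, 0.5), (4, 7.0)])  # second row invalid: both roll back
count = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
```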
---
This architecture provides a robust, scalable foundation for high-frequency trading data analysis while maintaining clean separation of concerns and efficient resource utilization.