ADR-001: Persistent Metrics Storage

Status

Accepted

Context

The original orderflow backtest system kept all orderbook snapshots in memory during processing, leading to excessive memory usage (>1 GB for typical datasets). With the addition of OBI (order book imbalance) and CVD (cumulative volume delta) metric calculations, we needed to decide how to store the computed metrics and manage memory efficiently.
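For reference, the two metrics can be sketched with their common definitions. The exact formulas used by the system are not spelled out in this ADR, so treat the following as an illustrative sketch, not the production implementation:

```python
# Illustrative definitions only; the system's actual formulas may differ.

def order_book_imbalance(bid_volume: float, ask_volume: float) -> float:
    """OBI in [-1, 1]: positive when bids dominate, negative when asks do."""
    total = bid_volume + ask_volume
    return 0.0 if total == 0 else (bid_volume - ask_volume) / total

def cumulative_volume_delta(prev_cvd: float, trade_volume: float, is_buy: bool) -> float:
    """CVD: running total of buy volume minus sell volume."""
    return prev_cvd + (trade_volume if is_buy else -trade_volume)
```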

Decision

We will implement persistent storage of calculated metrics in the SQLite database with the following approach:

  1. Metrics Table: Create a dedicated metrics table to store OBI, CVD, and related data
  2. Streaming Processing: Process snapshots one-by-one, calculate metrics, store results, then discard snapshots
  3. Batch Operations: Use batch inserts (1000 records) for optimal database performance
  4. Query Interface: Provide time-range queries for metrics retrieval and analysis

Consequences

Positive

  • Memory Reduction: >70% reduction in peak memory usage during processing
  • Avoid Recalculation: Metrics calculated once and reused for multiple analysis runs
  • Scalability: Can process months/years of data without memory constraints
  • Performance: Batch database operations provide high throughput
  • Persistence: Metrics survive between application runs
  • Analysis Ready: Stored metrics enable complex time-series analysis

Negative

  • Storage Overhead: Metrics table adds ~20% to database size
  • Complexity: Additional database schema and management code
  • Dependencies: Tighter coupling between processing and database layer
  • Migration: Existing databases need schema updates for metrics table

Alternatives Considered

Option 1: Keep All Snapshots in Memory

Rejected: Unsustainable memory usage for large datasets. Would limit analysis to small time ranges.

Option 2: Calculate Metrics On-Demand

Rejected: Recalculating metrics for every analysis run is computationally expensive and time-consuming.

Option 3: External Metrics Database

Rejected: Adds deployment complexity. SQLite co-location provides better performance and simpler management.

Option 4: Compressed In-Memory Cache

Rejected: Still faces fundamental memory scaling issues. Compression/decompression adds CPU overhead.

Implementation Details

Database Schema

CREATE TABLE IF NOT EXISTS metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    snapshot_id INTEGER NOT NULL,
    timestamp TEXT NOT NULL,
    obi REAL NOT NULL,
    cvd REAL NOT NULL,
    best_bid REAL,
    best_ask REAL,
    FOREIGN KEY (snapshot_id) REFERENCES book(id)
);

CREATE INDEX IF NOT EXISTS idx_metrics_timestamp ON metrics(timestamp);
CREATE INDEX IF NOT EXISTS idx_metrics_snapshot_id ON metrics(snapshot_id);
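Since the system auto-creates the table on first processing run (see Migration Strategy), the DDL is expected to be applied idempotently. A minimal sketch using Python's sqlite3 module, with a hypothetical helper name:

```python
import sqlite3

# Hypothetical helper: applies the metrics DDL idempotently, so
# re-running it against an existing database is a no-op.
METRICS_DDL = """
CREATE TABLE IF NOT EXISTS metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    snapshot_id INTEGER NOT NULL,
    timestamp TEXT NOT NULL,
    obi REAL NOT NULL,
    cvd REAL NOT NULL,
    best_bid REAL,
    best_ask REAL,
    FOREIGN KEY (snapshot_id) REFERENCES book(id)
);
CREATE INDEX IF NOT EXISTS idx_metrics_timestamp ON metrics(timestamp);
CREATE INDEX IF NOT EXISTS idx_metrics_snapshot_id ON metrics(snapshot_id);
"""

def ensure_metrics_table(conn: sqlite3.Connection) -> None:
    conn.executescript(METRICS_DDL)
```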

Processing Pipeline

  1. Create metrics table if not exists
  2. Stream through orderbook snapshots
  3. For each snapshot:
    • Calculate OBI and CVD metrics
    • Batch store metrics (1000 records per commit)
    • Discard snapshot from memory
  4. Provide query interface for time-range retrieval
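The batching and streaming steps above can be sketched as follows; function and constant names are illustrative, and the sketch assumes the metrics table already exists:

```python
import sqlite3
from typing import Iterable, Tuple

BATCH_SIZE = 1000  # records per commit, per the decision above

def store_metrics(conn: sqlite3.Connection,
                  rows: Iterable[Tuple[int, str, float, float, float, float]]) -> int:
    """Stream metric rows into the metrics table, committing in batches
    so at most BATCH_SIZE pending rows are held in memory at once."""
    sql = ("INSERT INTO metrics (snapshot_id, timestamp, obi, cvd, best_bid, best_ask) "
           "VALUES (?, ?, ?, ?, ?, ?)")
    batch, total = [], 0
    for row in rows:
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            conn.executemany(sql, batch)
            conn.commit()
            total += len(batch)
            batch.clear()
    if batch:  # flush the final partial batch
        conn.executemany(sql, batch)
        conn.commit()
        total += len(batch)
    return total
```

Because the input is consumed as an iterable, callers can feed it from a snapshot stream without ever materializing the full dataset.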

Memory Management

  • Before: Store all snapshots → Calculate on demand → High memory usage
  • After: Stream snapshots → Calculate immediately → Store metrics → Low memory usage

Migration Strategy

Backward Compatibility

  • Existing databases continue to work without metrics table
  • System auto-creates metrics table on first processing run
  • Fallback to real-time calculation if metrics unavailable
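The fallback decision reduces to a cheap existence check against sqlite_master; a sketch with a hypothetical helper name:

```python
import sqlite3

def metrics_available(conn: sqlite3.Connection) -> bool:
    """Return True if the metrics table exists, so callers can fall back
    to real-time calculation on databases created before this schema."""
    row = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name='metrics'"
    ).fetchone()
    return row is not None
```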

Performance Impact

  • Processing Time: Slight increase due to database writes (~10%)
  • Query Performance: Significant improvement for repeated analysis
  • Overall: Net positive performance for typical usage patterns

Monitoring and Validation

Success Metrics

  • Memory Usage: Target >70% reduction in peak memory usage
  • Processing Speed: Maintain >500 snapshots/second processing rate
  • Storage Efficiency: Metrics table <25% of total database size
  • Query Performance: <1 second retrieval for typical time ranges
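As a concrete illustration of the query-performance target, a time-range retrieval might look like the following (helper name is illustrative; ISO-8601 timestamp strings compare lexicographically, so the idx_metrics_timestamp index can serve the range predicate):

```python
import sqlite3

def query_metrics(conn: sqlite3.Connection, start: str, end: str):
    """Retrieve metrics with timestamps in [start, end)."""
    return conn.execute(
        "SELECT timestamp, obi, cvd FROM metrics "
        "WHERE timestamp >= ? AND timestamp < ? ORDER BY timestamp",
        (start, end),
    ).fetchall()
```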

Validation Methods

  • Memory profiling during large dataset processing
  • Performance benchmarks vs. original system
  • Storage overhead analysis across different dataset sizes
  • Query performance testing with various time ranges

Future Considerations

Potential Enhancements

  • Compression: Consider compression for metrics storage if overhead becomes significant
  • Partitioning: Time-based partitioning for very large datasets
  • Caching: In-memory cache for frequently accessed metrics
  • Export: Direct export capabilities for external analysis tools

Scalability Options

  • Database Upgrade: PostgreSQL if SQLite becomes a limiting factor
  • Parallel Processing: Multi-threaded metrics calculation
  • Distributed Storage: For institutional-scale datasets

This decision provides a solid foundation for efficient, scalable metrics processing while maintaining simplicity and performance characteristics suitable for the target use cases.