# ADR-001: Persistent Metrics Storage

## Status

Accepted

## Context

The original orderflow backtest system kept all orderbook snapshots in memory during processing, leading to excessive memory usage (>1 GB for typical datasets). With the addition of OBI and CVD metrics calculation, we needed to decide how to handle the computed metrics and manage memory efficiently.

## Decision

We will implement persistent storage of calculated metrics in the SQLite database with the following approach:

1. **Metrics Table**: Create a dedicated `metrics` table to store OBI, CVD, and related data
2. **Streaming Processing**: Process snapshots one by one, calculate metrics, store results, then discard each snapshot
3. **Batch Operations**: Use batch inserts (1000 records per commit) for optimal database performance
4. **Query Interface**: Provide time-range queries for metrics retrieval and analysis

## Consequences

### Positive

- **Memory Reduction**: >70% reduction in peak memory usage during processing
- **Avoid Recalculation**: Metrics are calculated once and reused across multiple analysis runs
- **Scalability**: Can process months or years of data without memory constraints
- **Performance**: Batch database operations provide high throughput
- **Persistence**: Metrics survive between application runs
- **Analysis Ready**: Stored metrics enable complex time-series analysis

### Negative

- **Storage Overhead**: Metrics table adds ~20% to database size
- **Complexity**: Additional database schema and management code
- **Dependencies**: Tighter coupling between processing and database layers
- **Migration**: Existing databases need schema updates for the metrics table

## Alternatives Considered

### Option 1: Keep All Snapshots in Memory

**Rejected**: Unsustainable memory usage for large datasets; would limit analysis to small time ranges.

### Option 2: Calculate Metrics On-Demand

**Rejected**: Recalculating metrics for every analysis run is computationally expensive and time-consuming.
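In contrast to the rejected on-demand recalculation, the adopted streaming plus batch-insert flow can be sketched as follows. This is a minimal illustration, not the actual codebase API: `process_snapshots`, `calc_obi`, `calc_cvd`, and the shape of the snapshot iterable are all hypothetical names, and the insert omits the optional `best_bid`/`best_ask` columns.

```python
import sqlite3

BATCH_SIZE = 1000  # records per commit, per the decision above


def process_snapshots(conn, snapshots, calc_obi, calc_cvd):
    """Stream snapshots, compute metrics, batch-insert, discard.

    `snapshots` yields (snapshot_id, timestamp, snapshot) tuples;
    `calc_obi`/`calc_cvd` are the metric functions. Names are illustrative.
    """
    batch = []
    for snap_id, ts, snap in snapshots:
        batch.append((snap_id, ts, calc_obi(snap), calc_cvd(snap)))
        if len(batch) >= BATCH_SIZE:
            _flush(conn, batch)
            batch.clear()
        # the snapshot itself is not retained past this point,
        # so peak memory stays bounded by one batch of metric rows
    if batch:  # flush the final partial batch
        _flush(conn, batch)


def _flush(conn, batch):
    conn.executemany(
        "INSERT INTO metrics (snapshot_id, timestamp, obi, cvd)"
        " VALUES (?, ?, ?, ?)",
        batch,
    )
    conn.commit()
```

One transaction per 1000-row batch keeps SQLite's per-commit fsync cost amortized, which is where the batch-insert throughput in the Consequences section comes from.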
### Option 3: External Metrics Database

**Rejected**: Adds deployment complexity. SQLite co-location provides better performance and simpler management.

### Option 4: Compressed In-Memory Cache

**Rejected**: Still faces fundamental memory scaling issues. Compression/decompression adds CPU overhead.

## Implementation Details

### Database Schema

```sql
CREATE TABLE metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    snapshot_id INTEGER NOT NULL,
    timestamp TEXT NOT NULL,
    obi REAL NOT NULL,
    cvd REAL NOT NULL,
    best_bid REAL,
    best_ask REAL,
    FOREIGN KEY (snapshot_id) REFERENCES book(id)
);

CREATE INDEX idx_metrics_timestamp ON metrics(timestamp);
CREATE INDEX idx_metrics_snapshot_id ON metrics(snapshot_id);
```

### Processing Pipeline

1. Create the metrics table if it does not exist
2. Stream through orderbook snapshots
3. For each snapshot:
   - Calculate OBI and CVD metrics
   - Batch-store metrics (1000 records per commit)
   - Discard the snapshot from memory
4. Provide a query interface for time-range retrieval

### Memory Management

- **Before**: Store all snapshots → calculate on demand → high memory usage
- **After**: Stream snapshots → calculate immediately → store metrics → low memory usage

## Migration Strategy

### Backward Compatibility

- Existing databases continue to work without the metrics table
- The system auto-creates the metrics table on the first processing run
- Falls back to real-time calculation if metrics are unavailable

### Performance Impact

- **Processing Time**: Slight increase (~10%) due to database writes
- **Query Performance**: Significant improvement for repeated analysis
- **Overall**: Net positive performance for typical usage patterns

## Monitoring and Validation

### Success Metrics

- **Memory Usage**: Target >70% reduction in peak memory usage
- **Processing Speed**: Maintain >500 snapshots/second processing rate
- **Storage Efficiency**: Metrics table <25% of total database size
- **Query Performance**: <1 second retrieval for typical time ranges

### Validation Methods

- Memory
profiling during large dataset processing
- Performance benchmarks vs. original system
- Storage overhead analysis across different dataset sizes
- Query performance testing with various time ranges

## Future Considerations

### Potential Enhancements

- **Compression**: Consider compression for metrics storage if overhead becomes significant
- **Partitioning**: Time-based partitioning for very large datasets
- **Caching**: In-memory cache for frequently accessed metrics
- **Export**: Direct export capabilities for external analysis tools

### Scalability Options

- **Database Upgrade**: PostgreSQL if SQLite becomes a limiting factor
- **Parallel Processing**: Multi-threaded metrics calculation
- **Distributed Storage**: For institutional-scale datasets

---

This decision provides a solid foundation for efficient, scalable metrics processing while maintaining simplicity and performance characteristics suitable for the target use cases.
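As a closing illustration of the time-range query interface this decision calls for, a minimal retrieval sketch might look like the following. The function name and return shape are hypothetical, not the actual codebase API; it assumes timestamps are stored as a consistent ISO-8601 text format (as the `TEXT` column suggests), so lexicographic comparison matches chronological order and `idx_metrics_timestamp` can serve the range scan.

```python
import sqlite3


def metrics_in_range(conn, start_ts, end_ts):
    """Return (timestamp, obi, cvd) rows with start_ts <= timestamp <= end_ts.

    Hypothetical helper: relies on idx_metrics_timestamp for the range scan,
    and on ISO-8601 text timestamps comparing lexicographically.
    """
    return conn.execute(
        "SELECT timestamp, obi, cvd FROM metrics"
        " WHERE timestamp BETWEEN ? AND ?"
        " ORDER BY timestamp",
        (start_ts, end_ts),
    ).fetchall()
```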