ADR-001: Persistent Metrics Storage
Status
Accepted
Context
The original orderflow backtest system kept all orderbook snapshots in memory during processing, leading to excessive memory usage (>1 GB for typical datasets). With the addition of OBI (order book imbalance) and CVD (cumulative volume delta) metrics calculation, we needed to decide how to handle the computed metrics and manage memory efficiently.
Decision
We will implement persistent storage of calculated metrics in the SQLite database with the following approach:
- Metrics Table: Create a dedicated metrics table to store OBI, CVD, and related data
- Streaming Processing: Process snapshots one-by-one, calculate metrics, store results, then discard snapshots
- Batch Operations: Use batch inserts (1000 records per commit) for optimal database performance (see the sketch after this list)
- Query Interface: Provide time-range queries for metrics retrieval and analysis
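
The following Python sketch illustrates the streaming and batch-insert approach described above. It assumes the standard sqlite3 module, takes the metrics calculation as a caller-supplied callable, and uses assumed snapshot attribute names; it is a sketch, not the project's actual implementation.

import sqlite3

BATCH_SIZE = 1000  # records per commit, per the decision above

def store_metrics(conn: sqlite3.Connection, snapshots, calculate_metrics) -> None:
    # Stream snapshots, compute metrics via the supplied callable, and persist in batches.
    buffer = []
    for snap in snapshots:  # any iterator of orderbook snapshots (attribute names assumed)
        obi, cvd, best_bid, best_ask = calculate_metrics(snap)
        buffer.append((snap.id, snap.timestamp, obi, cvd, best_bid, best_ask))
        if len(buffer) >= BATCH_SIZE:
            _flush(conn, buffer)
            buffer.clear()  # only the metrics persist; the snapshot data is discarded
    if buffer:
        _flush(conn, buffer)

def _flush(conn: sqlite3.Connection, rows: list) -> None:
    # One executemany + commit per batch keeps transaction overhead low.
    conn.executemany(
        "INSERT INTO metrics (snapshot_id, timestamp, obi, cvd, best_bid, best_ask) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()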
Consequences
Positive
- Memory Reduction: >70% reduction in peak memory usage during processing
- Avoid Recalculation: Metrics calculated once and reused for multiple analysis runs
- Scalability: Can process months/years of data without memory constraints
- Performance: Batch database operations provide high throughput
- Persistence: Metrics survive between application runs
- Analysis Ready: Stored metrics enable complex time-series analysis
Negative
- Storage Overhead: Metrics table adds ~20% to database size
- Complexity: Additional database schema and management code
- Dependencies: Tighter coupling between processing and database layer
- Migration: Existing databases need schema updates for metrics table
Alternatives Considered
Option 1: Keep All Snapshots in Memory
Rejected: Unsustainable memory usage for large datasets. Would limit analysis to small time ranges.
Option 2: Calculate Metrics On-Demand
Rejected: Recalculating metrics for every analysis run is computationally expensive and time-consuming.
Option 3: External Metrics Database
Rejected: Adds deployment complexity. SQLite co-location provides better performance and simpler management.
Option 4: Compressed In-Memory Cache
Rejected: Still faces fundamental memory scaling issues. Compression/decompression adds CPU overhead.
Implementation Details
Database Schema
CREATE TABLE metrics (
id INTEGER PRIMARY KEY AUTOINCREMENT,
snapshot_id INTEGER NOT NULL,
timestamp TEXT NOT NULL,
obi REAL NOT NULL,
cvd REAL NOT NULL,
best_bid REAL,
best_ask REAL,
FOREIGN KEY (snapshot_id) REFERENCES book(id)
);
CREATE INDEX idx_metrics_timestamp ON metrics(timestamp);
CREATE INDEX idx_metrics_snapshot_id ON metrics(snapshot_id);
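
Given this schema and the timestamp index, a time-range retrieval could look like the following Python sketch; the function name and the ISO-8601 timestamp strings are illustrative assumptions.

import sqlite3

def load_metrics(conn: sqlite3.Connection, start_ts: str, end_ts: str) -> list:
    # Range scan served by idx_metrics_timestamp; timestamps compare correctly as text
    # when stored in a sortable format such as ISO-8601.
    cur = conn.execute(
        "SELECT timestamp, obi, cvd, best_bid, best_ask "
        "FROM metrics WHERE timestamp BETWEEN ? AND ? ORDER BY timestamp",
        (start_ts, end_ts),
    )
    return cur.fetchall()

# Example usage (assumed timestamp format):
# rows = load_metrics(conn, "2024-01-01T00:00:00", "2024-01-31T23:59:59")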
Processing Pipeline
- Create metrics table if not exists
- Stream through orderbook snapshots
- For each snapshot (sketched below):
  - Calculate OBI and CVD metrics
  - Batch store metrics (1000 records per commit)
  - Discard snapshot from memory
- Provide query interface for time-range retrieval
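
This ADR does not spell out the metric formulas, so the sketch below assumes common definitions: OBI as the normalized difference between bid and ask volume over the top book levels, and CVD as a running sum of buy volume minus sell volume carried across snapshots. Treat both as placeholders for the project's actual definitions.

def compute_obi(bids, asks, depth: int = 5) -> float:
    # Order book imbalance over the top `depth` levels (assumed definition);
    # bids/asks are lists of (price, size) tuples, best level first.
    bid_vol = sum(size for _, size in bids[:depth])
    ask_vol = sum(size for _, size in asks[:depth])
    total = bid_vol + ask_vol
    return 0.0 if total == 0 else (bid_vol - ask_vol) / total

class CvdTracker:
    # Cumulative volume delta carried across the snapshot stream (assumed definition).
    def __init__(self) -> None:
        self.cvd = 0.0

    def update(self, buy_volume: float, sell_volume: float) -> float:
        self.cvd += buy_volume - sell_volume
        return self.cvd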
Memory Management
- Before: Store all snapshots → Calculate on demand → High memory usage
- After: Stream snapshots → Calculate immediately → Store metrics → Low memory usage
Migration Strategy
Backward Compatibility
- Existing databases continue to work without metrics table
- System auto-creates the metrics table on the first processing run (see the sketch after this list)
- Fallback to real-time calculation if stored metrics are unavailable
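
A minimal sketch of the auto-create and availability checks is shown below, assuming sqlite3; the IF NOT EXISTS clauses make the DDL safe to run against existing databases, and the helper names are assumptions.

import sqlite3

METRICS_DDL = """
CREATE TABLE IF NOT EXISTS metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    snapshot_id INTEGER NOT NULL,
    timestamp TEXT NOT NULL,
    obi REAL NOT NULL,
    cvd REAL NOT NULL,
    best_bid REAL,
    best_ask REAL,
    FOREIGN KEY (snapshot_id) REFERENCES book(id)
);
CREATE INDEX IF NOT EXISTS idx_metrics_timestamp ON metrics(timestamp);
CREATE INDEX IF NOT EXISTS idx_metrics_snapshot_id ON metrics(snapshot_id);
"""

def ensure_metrics_table(conn: sqlite3.Connection) -> None:
    # Upgrade older databases in place on the first processing run.
    conn.executescript(METRICS_DDL)
    conn.commit()

def metrics_available(conn: sqlite3.Connection) -> bool:
    # True if stored metrics exist; otherwise callers fall back to real-time calculation.
    row = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name='metrics'"
    ).fetchone()
    return row is not None and conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0] > 0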
Performance Impact
- Processing Time: Slight increase due to database writes (~10%)
- Query Performance: Significant improvement for repeated analysis
- Overall: Net positive performance for typical usage patterns
Monitoring and Validation
Success Metrics
- Memory Usage: Target >70% reduction in peak memory usage
- Processing Speed: Maintain >500 snapshots/second processing rate
- Storage Efficiency: Metrics table <25% of total database size
- Query Performance: <1 second retrieval for typical time ranges
Validation Methods
- Memory profiling during large dataset processing (see the sketch after this list)
- Performance benchmarks vs. original system
- Storage overhead analysis across different dataset sizes
- Query performance testing with various time ranges
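
For the memory-profiling check, a minimal sketch using Python's tracemalloc could look like this; process_all_snapshots stands in for the real processing entry point and is an assumption.

import tracemalloc

tracemalloc.start()
process_all_snapshots("backtest.db")              # hypothetical entry point for a large dataset run
current, peak = tracemalloc.get_traced_memory()   # bytes currently allocated and the observed peak
tracemalloc.stop()
print(f"Peak Python heap usage: {peak / 1024**2:.1f} MiB")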
Future Considerations
Potential Enhancements
- Compression: Consider compression for metrics storage if overhead becomes significant
- Partitioning: Time-based partitioning for very large datasets
- Caching: In-memory cache for frequently accessed metrics
- Export: Direct export capabilities for external analysis tools
Scalability Options
- Database Upgrade: PostgreSQL if SQLite becomes limiting factor
- Parallel Processing: Multi-threaded metrics calculation
- Distributed Storage: For institutional-scale datasets
This decision provides a solid foundation for efficient, scalable metrics processing while maintaining simplicity and performance characteristics suitable for the target use cases.