orderflow_backtest/docs/decisions/ADR-001-metrics-storage.md

# ADR-001: Persistent Metrics Storage
## Status
Accepted
## Context
The original orderflow backtest system kept all orderbook snapshots in memory during processing, leading to excessive memory usage (>1GB for typical datasets). With the addition of OBI (order book imbalance) and CVD (cumulative volume delta) metric calculation, we needed to decide how to handle the computed metrics and manage memory efficiently.
## Decision
We will implement persistent storage of calculated metrics in the SQLite database with the following approach:
1. **Metrics Table**: Create a dedicated `metrics` table to store OBI, CVD, and related data
2. **Streaming Processing**: Process snapshots one-by-one, calculate metrics, store results, then discard snapshots
3. **Batch Operations**: Use batch inserts (1000 records) for optimal database performance
4. **Query Interface**: Provide time-range queries for metrics retrieval and analysis
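This ADR does not define the metric formulas themselves. The sketch below uses the common textbook definitions, which is an assumption about this system rather than something the document confirms: OBI as the normalized difference between bid and ask volume, and CVD as the running sum of signed trade volume.

```python
# Sketch of per-snapshot metric calculation. The formulas are the common
# definitions of OBI and CVD -- an assumption, since this ADR does not
# define them. Function names are hypothetical.

def order_book_imbalance(bid_volume: float, ask_volume: float) -> float:
    """OBI in [-1, 1]: positive when bids outweigh asks."""
    total = bid_volume + ask_volume
    return 0.0 if total == 0 else (bid_volume - ask_volume) / total

def cumulative_volume_delta(prev_cvd: float, buy_volume: float,
                            sell_volume: float) -> float:
    """CVD: running sum of signed trade volume across snapshots."""
    return prev_cvd + buy_volume - sell_volume
```

Because CVD is cumulative, it is a natural fit for streaming: each snapshot only needs the previous CVD value, not the full history.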
## Consequences
### Positive
- **Memory Reduction**: >70% reduction in peak memory usage during processing
- **Avoid Recalculation**: Metrics calculated once and reused for multiple analysis runs
- **Scalability**: Can process months/years of data without memory constraints
- **Performance**: Batch database operations provide high throughput
- **Persistence**: Metrics survive between application runs
- **Analysis Ready**: Stored metrics enable complex time-series analysis
### Negative
- **Storage Overhead**: Metrics table adds ~20% to database size
- **Complexity**: Additional database schema and management code
- **Dependencies**: Tighter coupling between processing and database layer
- **Migration**: Existing databases need schema updates for metrics table
## Alternatives Considered
### Option 1: Keep All Snapshots in Memory
**Rejected**: Unsustainable memory usage for large datasets. Would limit analysis to small time ranges.
### Option 2: Calculate Metrics On-Demand
**Rejected**: Recalculating metrics for every analysis run is computationally expensive and time-consuming.
### Option 3: External Metrics Database
**Rejected**: Adds deployment complexity. SQLite co-location provides better performance and simpler management.
### Option 4: Compressed In-Memory Cache
**Rejected**: Still faces fundamental memory scaling issues. Compression/decompression adds CPU overhead.
## Implementation Details
### Database Schema
```sql
CREATE TABLE metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    snapshot_id INTEGER NOT NULL,
    timestamp TEXT NOT NULL,
    obi REAL NOT NULL,
    cvd REAL NOT NULL,
    best_bid REAL,
    best_ask REAL,
    FOREIGN KEY (snapshot_id) REFERENCES book(id)
);
CREATE INDEX idx_metrics_timestamp ON metrics(timestamp);
CREATE INDEX idx_metrics_snapshot_id ON metrics(snapshot_id);
```
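The time-range query interface could be sketched as follows. Table and column names come from the schema above; the helper name `query_metrics` is hypothetical. The range predicate on `timestamp` lets SQLite use `idx_metrics_timestamp`.

```python
import sqlite3

def query_metrics(conn: sqlite3.Connection, start_ts: str, end_ts: str):
    """Return (timestamp, obi, cvd) rows within [start_ts, end_ts].

    Hypothetical helper; assumes timestamps are stored in a sortable
    text format (e.g. ISO 8601) so BETWEEN works lexicographically.
    """
    cur = conn.execute(
        "SELECT timestamp, obi, cvd FROM metrics "
        "WHERE timestamp BETWEEN ? AND ? ORDER BY timestamp",
        (start_ts, end_ts),
    )
    return cur.fetchall()
```

Note that range queries on TEXT timestamps only behave correctly if the stored format sorts chronologically, which is worth asserting at ingest time.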
### Processing Pipeline
1. Create metrics table if not exists
2. Stream through orderbook snapshots
3. For each snapshot:
- Calculate OBI and CVD metrics
- Batch store metrics (1000 records per commit)
- Discard snapshot from memory
4. Provide query interface for time-range retrieval
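Steps 2–3 above can be condensed into a sketch like the following. The 1000-record batch size comes from the decision above; the function name and the shape of the `snapshots` iterable are assumptions made for illustration.

```python
import sqlite3

BATCH_SIZE = 1000  # batch size chosen in this ADR

def process_snapshots(conn: sqlite3.Connection, snapshots) -> None:
    """Stream snapshots, buffering metrics and committing in batches.

    `snapshots` yields pre-computed
    (snapshot_id, timestamp, obi, cvd, best_bid, best_ask) tuples here
    for simplicity; in the real pipeline the metrics would be computed
    per snapshot just before buffering.
    """
    insert_sql = (
        "INSERT INTO metrics "
        "(snapshot_id, timestamp, obi, cvd, best_bid, best_ask) "
        "VALUES (?, ?, ?, ?, ?, ?)"
    )
    batch = []
    for row in snapshots:
        batch.append(row)            # buffer the computed metrics row
        if len(batch) >= BATCH_SIZE:
            conn.executemany(insert_sql, batch)
            conn.commit()            # one commit per 1000 records
            batch.clear()            # discard processed rows from memory
    if batch:                        # flush the final partial batch
        conn.executemany(insert_sql, batch)
        conn.commit()
```

Because only one batch is resident at a time, peak memory is bounded by the batch size rather than the dataset size, which is the mechanism behind the >70% memory reduction claimed above.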
### Memory Management
- **Before**: Store all snapshots → Calculate on demand → High memory usage
- **After**: Stream snapshots → Calculate immediately → Store metrics → Low memory usage
## Migration Strategy
### Backward Compatibility
- Existing databases continue to work without metrics table
- System auto-creates metrics table on first processing run
- Fallback to real-time calculation if stored metrics are unavailable
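The fallback path above implies an availability check before choosing between stored and real-time metrics. One way to sketch it (the helper name is hypothetical) is a lookup in SQLite's built-in `sqlite_master` catalog:

```python
import sqlite3

def has_metrics(conn: sqlite3.Connection) -> bool:
    """True if the metrics table exists, so stored metrics can be used;
    otherwise the caller falls back to real-time calculation."""
    row = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'metrics'"
    ).fetchone()
    return row is not None
```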
### Performance Impact
- **Processing Time**: Slight increase due to database writes (~10%)
- **Query Performance**: Significant improvement for repeated analysis
- **Overall**: Net positive performance for typical usage patterns
## Monitoring and Validation
### Success Metrics
- **Memory Usage**: Target >70% reduction in peak memory usage
- **Processing Speed**: Maintain >500 snapshots/second processing rate
- **Storage Efficiency**: Metrics table <25% of total database size
- **Query Performance**: <1 second retrieval for typical time ranges
### Validation Methods
- Memory profiling during large dataset processing
- Performance benchmarks vs. original system
- Storage overhead analysis across different dataset sizes
- Query performance testing with various time ranges
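Memory profiling of the processing run could use the standard-library `tracemalloc` module; the workload below is a stand-in for the real pipeline:

```python
import tracemalloc

tracemalloc.start()
# ... run the streaming processing here; a stand-in allocation for the sketch:
data = [i * 1.0 for i in range(100_000)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak traced memory: {peak / 1024:.1f} KiB")
```

Comparing the peak figure between the original all-in-memory run and the streaming run gives the reduction percentage tracked in the success metrics.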
## Future Considerations
### Potential Enhancements
- **Compression**: Consider compression for metrics storage if overhead becomes significant
- **Partitioning**: Time-based partitioning for very large datasets
- **Caching**: In-memory cache for frequently accessed metrics
- **Export**: Direct export capabilities for external analysis tools
### Scalability Options
- **Database Upgrade**: PostgreSQL if SQLite becomes limiting factor
- **Parallel Processing**: Multi-threaded metrics calculation
- **Distributed Storage**: For institutional-scale datasets
---
This decision provides a solid foundation for efficient, scalable metrics processing while maintaining simplicity and performance characteristics suitable for the target use cases.