# ADR-001: Persistent Metrics Storage

## Status

Accepted

## Context

The original orderflow backtest system kept all orderbook snapshots in memory during processing, leading to excessive memory usage (>1GB for typical datasets). With the addition of OBI and CVD metrics calculation, we needed to decide how to handle the computed metrics and manage memory efficiently.

## Decision

We will persist calculated metrics in the SQLite database, with the following approach:

1. **Metrics Table**: Create a dedicated `metrics` table to store OBI, CVD, and related data
2. **Streaming Processing**: Process snapshots one-by-one, calculate metrics, store results, then discard snapshots
3. **Batch Operations**: Use batch inserts (1000 records) for optimal database performance
4. **Query Interface**: Provide time-range queries for metrics retrieval and analysis
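
The ADR does not spell out the OBI and CVD formulas, so the sketch below assumes their conventional definitions — OBI as the normalized bid/ask volume imbalance, CVD as the running sum of signed trade volume — with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    # Hypothetical fields; the real snapshot type is defined elsewhere.
    bid_volume: float           # total resting bid volume considered
    ask_volume: float           # total resting ask volume considered
    signed_trade_volume: float  # buy volume minus sell volume since the last snapshot

def obi(s: Snapshot) -> float:
    """Order Book Imbalance in [-1, 1]; 0 when the book is balanced."""
    total = s.bid_volume + s.ask_volume
    return 0.0 if total == 0 else (s.bid_volume - s.ask_volume) / total

def cvd(snapshots: list[Snapshot]) -> list[float]:
    """Cumulative Volume Delta: running sum of signed trade volume."""
    out, running = [], 0.0
    for s in snapshots:
        running += s.signed_trade_volume
        out.append(running)
    return out
```

Both metrics depend only on the current snapshot (plus a single running total for CVD), which is what makes the streaming approach in item 2 possible.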

## Consequences

### Positive

- **Memory Reduction**: >70% reduction in peak memory usage during processing
- **Avoid Recalculation**: Metrics calculated once and reused for multiple analysis runs
- **Scalability**: Can process months/years of data without memory constraints
- **Performance**: Batch database operations provide high throughput
- **Persistence**: Metrics survive between application runs
- **Analysis Ready**: Stored metrics enable complex time-series analysis

### Negative

- **Storage Overhead**: Metrics table adds ~20% to database size
- **Complexity**: Additional database schema and management code
- **Dependencies**: Tighter coupling between the processing and database layers
- **Migration**: Existing databases need schema updates for the metrics table

## Alternatives Considered

### Option 1: Keep All Snapshots in Memory

**Rejected**: Unsustainable memory usage for large datasets; would limit analysis to small time ranges.

### Option 2: Calculate Metrics On-Demand

**Rejected**: Recalculating metrics for every analysis run is computationally expensive and time-consuming.

### Option 3: External Metrics Database

**Rejected**: Adds deployment complexity. SQLite co-location provides better performance and simpler management.

### Option 4: Compressed In-Memory Cache

**Rejected**: Still faces fundamental memory scaling issues, and compression/decompression adds CPU overhead.

## Implementation Details

### Database Schema

```sql
CREATE TABLE metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    snapshot_id INTEGER NOT NULL,
    timestamp TEXT NOT NULL,
    obi REAL NOT NULL,
    cvd REAL NOT NULL,
    best_bid REAL,
    best_ask REAL,
    FOREIGN KEY (snapshot_id) REFERENCES book(id)
);

CREATE INDEX idx_metrics_timestamp ON metrics(timestamp);
CREATE INDEX idx_metrics_snapshot_id ON metrics(snapshot_id);
```
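
Since the schema is auto-created on first run for existing databases, applying it idempotently is useful. A minimal sketch using Python's stdlib `sqlite3`, with `IF NOT EXISTS` added so re-runs are safe:

```python
import sqlite3

# Same DDL as above, made idempotent with IF NOT EXISTS so it can be applied
# on every startup; existing databases are left untouched.
SCHEMA = """
CREATE TABLE IF NOT EXISTS metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    snapshot_id INTEGER NOT NULL,
    timestamp TEXT NOT NULL,
    obi REAL NOT NULL,
    cvd REAL NOT NULL,
    best_bid REAL,
    best_ask REAL,
    FOREIGN KEY (snapshot_id) REFERENCES book(id)
);
CREATE INDEX IF NOT EXISTS idx_metrics_timestamp ON metrics(timestamp);
CREATE INDEX IF NOT EXISTS idx_metrics_snapshot_id ON metrics(snapshot_id);
"""

def ensure_schema(conn: sqlite3.Connection) -> None:
    """Create the metrics table and indexes if they do not already exist."""
    conn.executescript(SCHEMA)
    conn.commit()
```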

### Processing Pipeline

1. Create the metrics table if it does not exist
2. Stream through orderbook snapshots
3. For each snapshot:
   - Calculate OBI and CVD metrics
   - Batch-store metrics (1000 records per commit)
   - Discard the snapshot from memory
4. Provide a query interface for time-range retrieval
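
The batch-store step above might be sketched as follows — `sqlite3.executemany` per batch, committing every 1000 rows; the row tuples themselves would come from the metric calculation in step 3:

```python
import sqlite3
from typing import Iterable, Tuple

BATCH_SIZE = 1000  # records per commit, per the decision above

# A metrics row: (snapshot_id, timestamp, obi, cvd, best_bid, best_ask)
Row = Tuple[int, str, float, float, float, float]

def store_metrics(conn: sqlite3.Connection, rows: Iterable[Row]) -> int:
    """Stream rows into the metrics table, committing every BATCH_SIZE inserts.

    Only one batch is held in memory at a time; each source snapshot can be
    discarded as soon as its row has been produced.
    """
    sql = ("INSERT INTO metrics (snapshot_id, timestamp, obi, cvd, best_bid, best_ask) "
           "VALUES (?, ?, ?, ?, ?, ?)")
    batch, total = [], 0
    for row in rows:
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            conn.executemany(sql, batch)
            conn.commit()
            total += len(batch)
            batch.clear()
    if batch:  # flush the final partial batch
        conn.executemany(sql, batch)
        conn.commit()
        total += len(batch)
    return total
```

Committing per batch rather than per row is what drives the throughput: SQLite pays the fsync cost once per 1000 inserts instead of once per insert.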

### Memory Management

- **Before**: Store all snapshots → Calculate on demand → High memory usage
- **After**: Stream snapshots → Calculate immediately → Store metrics → Low memory usage

## Migration Strategy

### Backward Compatibility

- Existing databases continue to work without metrics table
- System auto-creates metrics table on first processing run
- Fallback to real-time calculation if metrics unavailable
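
The fallback check can be as simple as probing `sqlite_master` before querying; if the table is absent, the caller computes metrics on the fly (the on-the-fly path itself lives elsewhere in the system):

```python
import sqlite3

def metrics_table_exists(conn: sqlite3.Connection) -> bool:
    """True if the metrics table has been created in this database.

    Callers use this to decide between reading stored metrics and
    falling back to real-time calculation.
    """
    row = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'metrics'"
    ).fetchone()
    return row is not None
```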

### Performance Impact

- **Processing Time**: Slight increase due to database writes (~10%)
- **Query Performance**: Significant improvement for repeated analysis
- **Overall**: Net positive performance for typical usage patterns

## Monitoring and Validation

### Success Metrics

- **Memory Usage**: Target >70% reduction in peak memory usage
- **Processing Speed**: Maintain >500 snapshots/second processing rate
- **Storage Efficiency**: Metrics table <25% of total database size
- **Query Performance**: <1 second retrieval for typical time ranges
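
The <1 second retrieval target leans on `idx_metrics_timestamp`: ISO-8601 timestamps stored as TEXT sort lexicographically, so a `BETWEEN` on the indexed column becomes an index range scan. A typical retrieval might look like (the ISO-8601 format is an assumption; the ADR only specifies TEXT):

```python
import sqlite3

def metrics_in_range(conn: sqlite3.Connection, start: str, end: str) -> list:
    """Fetch (timestamp, obi, cvd) rows with start <= timestamp <= end."""
    return conn.execute(
        "SELECT timestamp, obi, cvd FROM metrics "
        "WHERE timestamp BETWEEN ? AND ? ORDER BY timestamp",
        (start, end),
    ).fetchall()
```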

### Validation Methods

- Memory profiling during large dataset processing
- Performance benchmarks vs. original system
- Storage overhead analysis across different dataset sizes
- Query performance testing with various time ranges

## Future Considerations

### Potential Enhancements

- **Compression**: Consider compression for metrics storage if overhead becomes significant
- **Partitioning**: Time-based partitioning for very large datasets
- **Caching**: In-memory cache for frequently accessed metrics
- **Export**: Direct export capabilities for external analysis tools

### Scalability Options

- **Database Upgrade**: Move to PostgreSQL if SQLite becomes the limiting factor
- **Parallel Processing**: Multi-threaded metrics calculation
- **Distributed Storage**: For institutional-scale datasets

---

This decision provides a solid foundation for efficient, scalable metrics processing while keeping the system simple and fast for the target use cases.