# ADR-001: Persistent Metrics Storage

## Status

Accepted

## Context

The original orderflow backtest system kept all orderbook snapshots in memory during processing, leading to excessive memory usage (>1 GB for typical datasets). With the addition of OBI and CVD metrics calculation, we needed to decide how to handle the computed metrics and manage memory efficiently.

## Decision

We will implement persistent storage of calculated metrics in the SQLite database with the following approach:

1. **Metrics Table**: Create a dedicated `metrics` table to store OBI, CVD, and related data
2. **Streaming Processing**: Process snapshots one by one, calculate metrics, store results, then discard each snapshot
3. **Batch Operations**: Use batch inserts (1000 records per commit) for optimal database performance
4. **Query Interface**: Provide time-range queries for metrics retrieval and analysis

## Consequences

### Positive

- **Memory Reduction**: >70% reduction in peak memory usage during processing
- **Avoid Recalculation**: Metrics are calculated once and reused across multiple analysis runs
- **Scalability**: Can process months or years of data without memory constraints
- **Performance**: Batch database operations provide high throughput
- **Persistence**: Metrics survive between application runs
- **Analysis Ready**: Stored metrics enable complex time-series analysis

### Negative

- **Storage Overhead**: Metrics table adds ~20% to database size
- **Complexity**: Additional database schema and management code
- **Dependencies**: Tighter coupling between processing and database layers
- **Migration**: Existing databases need schema updates for the metrics table

## Alternatives Considered

### Option 1: Keep All Snapshots in Memory

**Rejected**: Unsustainable memory usage for large datasets; would limit analysis to small time ranges.

### Option 2: Calculate Metrics On-Demand

**Rejected**: Recalculating metrics for every analysis run is computationally expensive and time-consuming.
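In contrast to the rejected on-demand recalculation, the adopted streaming plus batch-insert flow can be sketched as follows. This is a minimal illustration, not the actual codebase API: `process_snapshots`, `calc_obi`, `calc_cvd`, and the shape of the snapshot iterable are all hypothetical names, and the insert omits the optional `best_bid`/`best_ask` columns.

```python
import sqlite3

BATCH_SIZE = 1000  # records per commit, per the decision above


def process_snapshots(conn, snapshots, calc_obi, calc_cvd):
    """Stream snapshots, compute metrics, batch-insert, discard.

    `snapshots` yields (snapshot_id, timestamp, snapshot) tuples;
    `calc_obi`/`calc_cvd` are the metric functions. Names are illustrative.
    """
    batch = []
    for snap_id, ts, snap in snapshots:
        batch.append((snap_id, ts, calc_obi(snap), calc_cvd(snap)))
        if len(batch) >= BATCH_SIZE:
            _flush(conn, batch)
            batch.clear()
        # the snapshot itself is not retained past this point,
        # so peak memory stays bounded by one batch of metric rows
    if batch:  # flush the final partial batch
        _flush(conn, batch)


def _flush(conn, batch):
    conn.executemany(
        "INSERT INTO metrics (snapshot_id, timestamp, obi, cvd)"
        " VALUES (?, ?, ?, ?)",
        batch,
    )
    conn.commit()
```

One transaction per 1000-row batch keeps SQLite's per-commit fsync cost amortized, which is where the batch-insert throughput in the Consequences section comes from.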
### Option 3: External Metrics Database

**Rejected**: Adds deployment complexity. SQLite co-location provides better performance and simpler management.

### Option 4: Compressed In-Memory Cache

**Rejected**: Still faces fundamental memory scaling issues. Compression/decompression adds CPU overhead.

## Implementation Details

### Database Schema

```sql
CREATE TABLE metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    snapshot_id INTEGER NOT NULL,
    timestamp TEXT NOT NULL,
    obi REAL NOT NULL,
    cvd REAL NOT NULL,
    best_bid REAL,
    best_ask REAL,
    FOREIGN KEY (snapshot_id) REFERENCES book(id)
);

CREATE INDEX idx_metrics_timestamp ON metrics(timestamp);
CREATE INDEX idx_metrics_snapshot_id ON metrics(snapshot_id);
```

### Processing Pipeline

1. Create the metrics table if it does not exist
2. Stream through orderbook snapshots
3. For each snapshot:
   - Calculate OBI and CVD metrics
   - Batch-store metrics (1000 records per commit)
   - Discard the snapshot from memory
4. Provide a query interface for time-range retrieval

### Memory Management

- **Before**: Store all snapshots → calculate on demand → high memory usage
- **After**: Stream snapshots → calculate immediately → store metrics → low memory usage

## Migration Strategy

### Backward Compatibility

- Existing databases continue to work without the metrics table
- The system auto-creates the metrics table on the first processing run
- Falls back to real-time calculation if metrics are unavailable

### Performance Impact

- **Processing Time**: Slight increase (~10%) due to database writes
- **Query Performance**: Significant improvement for repeated analysis
- **Overall**: Net positive performance for typical usage patterns

## Monitoring and Validation

### Success Metrics

- **Memory Usage**: Target >70% reduction in peak memory usage
- **Processing Speed**: Maintain >500 snapshots/second processing rate
- **Storage Efficiency**: Metrics table <25% of total database size
- **Query Performance**: <1 second retrieval for typical time ranges

### Validation Methods

- Memory
profiling during large dataset processing
- Performance benchmarks vs. original system
- Storage overhead analysis across different dataset sizes
- Query performance testing with various time ranges

## Future Considerations

### Potential Enhancements

- **Compression**: Consider compression for metrics storage if overhead becomes significant
- **Partitioning**: Time-based partitioning for very large datasets
- **Caching**: In-memory cache for frequently accessed metrics
- **Export**: Direct export capabilities for external analysis tools

### Scalability Options

- **Database Upgrade**: PostgreSQL if SQLite becomes a limiting factor
- **Parallel Processing**: Multi-threaded metrics calculation
- **Distributed Storage**: For institutional-scale datasets

---

This decision provides a solid foundation for efficient, scalable metrics processing while maintaining simplicity and performance characteristics suitable for the target use cases.
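As a closing illustration of the time-range query interface this decision calls for, a minimal retrieval sketch might look like the following. The function name and return shape are hypothetical, not the actual codebase API; it assumes timestamps are stored as a consistent ISO-8601 text format (as the `TEXT` column suggests), so lexicographic comparison matches chronological order and `idx_metrics_timestamp` can serve the range scan.

```python
import sqlite3


def metrics_in_range(conn, start_ts, end_ts):
    """Return (timestamp, obi, cvd) rows with start_ts <= timestamp <= end_ts.

    Hypothetical helper: relies on idx_metrics_timestamp for the range scan,
    and on ISO-8601 text timestamps comparing lexicographically.
    """
    return conn.execute(
        "SELECT timestamp, obi, cvd FROM metrics"
        " WHERE timestamp BETWEEN ? AND ?"
        " ORDER BY timestamp",
        (start_ts, end_ts),
    ).fetchall()
```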