# ADR-001: SQLite Database Choice

## Status

Accepted

## Context

The orderflow backtest system needs to efficiently store and stream large volumes of historical orderbook and trade data. Key requirements include:

- Fast sequential read access for time-series data
- Minimal setup and maintenance overhead
- Support for concurrent reads from the visualization layer
- Ability to handle databases ranging from 100MB to 10GB+
- No network dependencies for data access

## Decision

We will use SQLite as the primary database for storing historical orderbook and trade data.

## Consequences

### Positive

- **Zero configuration**: No database server setup or administration required
- **Excellent read performance**: Optimized for sequential scans with proper PRAGMA settings
- **Built-in Python support**: No external dependencies or connection libraries needed
- **File portability**: Database files can be easily shared and archived
- **ACID compliance**: Ensures data integrity during writes (for data ingestion)
- **Small footprint**: Minimal memory and storage overhead
- **Fast startup**: No connection pooling or server initialization delays

### Negative

- **Single-writer limitation**: Cannot handle concurrent writes (acceptable for a read-only backtest)
- **Limited scalability**: Not suitable for high-concurrency production trading systems
- **No network access**: Cannot query databases remotely (acceptable for local analysis)
- **File locking**: Potential locking issues on shared or network file systems (mitigated by read-only access)

## Implementation Details

### Schema Design

```sql
-- Orderbook snapshots with timestamp windows
CREATE TABLE book (
    id INTEGER PRIMARY KEY,
    instrument TEXT,
    bids TEXT NOT NULL,       -- JSON array of [price, size] pairs
    asks TEXT NOT NULL,       -- JSON array of [price, size] pairs
    timestamp TEXT NOT NULL
);

-- Individual trade records
CREATE TABLE trades (
    id INTEGER PRIMARY KEY,
    instrument TEXT,
    trade_id TEXT,
    price REAL NOT NULL,
    size REAL NOT NULL,
    side TEXT NOT NULL,       -- "buy" or "sell"
    timestamp TEXT NOT NULL
);

-- Indexes for efficient time-based queries
CREATE INDEX idx_book_timestamp ON book(timestamp);
CREATE INDEX idx_trades_timestamp ON trades(timestamp);
```

### Performance Optimizations

```python
import sqlite3

# Read-only connection with optimized PRAGMA settings
connection_uri = f"file:{db_path}?immutable=1&mode=ro"
conn = sqlite3.connect(connection_uri, uri=True)
conn.execute("PRAGMA query_only = 1")
conn.execute("PRAGMA temp_store = MEMORY")
conn.execute("PRAGMA mmap_size = 268435456")  # 256MB memory-mapped I/O
conn.execute("PRAGMA cache_size = 10000")     # page cache size, in pages
```
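As a usage sketch for the settings above, the backtest can stream a time window directly off the `idx_trades_timestamp` index. The helper below is illustrative rather than part of the codebase: the function name `stream_trades`, the batch size, the sample database path, and the assumption that timestamps are stored as lexically sortable ISO-8601 strings are all hypothetical.

```python
import sqlite3

def stream_trades(db_path, start_ts, end_ts, batch_size=10_000):
    """Yield trade rows in [start_ts, end_ts), ordered by timestamp."""
    conn = sqlite3.connect(f"file:{db_path}?immutable=1&mode=ro", uri=True)
    conn.execute("PRAGMA temp_store = MEMORY")
    try:
        cursor = conn.execute(
            "SELECT timestamp, instrument, price, size, side "
            "FROM trades WHERE timestamp >= ? AND timestamp < ? "
            "ORDER BY timestamp",
            (start_ts, end_ts),
        )
        cursor.arraysize = batch_size  # rows per fetchmany() round trip
        while True:
            rows = cursor.fetchmany()
            if not rows:
                break
            yield from rows
    finally:
        conn.close()

# Hypothetical replay of one trading hour
for ts, instrument, price, size, side in stream_trades(
    "data/orderflow.db", "2024-01-02T09:00:00Z", "2024-01-02T10:00:00Z"
):
    pass  # hand each trade to the backtest engine
```

Batched `fetchmany()` keeps memory usage roughly constant regardless of file size, which is what makes the 10GB+ databases mentioned in the context workable for streaming.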
## Alternatives Considered

### PostgreSQL

- **Rejected**: Requires server setup and maintenance
- **Pros**: Better concurrent access, richer query features
- **Cons**: Overkill for a read-only use case, deployment complexity

### Parquet Files

- **Rejected**: Limited query capabilities for time-series data
- **Pros**: Excellent compression, columnar format
- **Cons**: No indexes, cumbersome range queries, requires additional libraries

### MongoDB

- **Rejected**: Document structure not optimal for time-series data
- **Pros**: Flexible schema, good aggregation pipeline
- **Cons**: Requires a server, higher memory usage, learning curve

### CSV Files

- **Rejected**: Poor query performance for large datasets
- **Pros**: Simple format, universal compatibility
- **Cons**: No indexing, slow filtering, type conversion overhead

### InfluxDB

- **Rejected**: Overkill for historical data analysis
- **Pros**: Optimized for time-series, good compression
- **Cons**: Additional service dependency, learning curve

## Migration Path

If scalability becomes an issue in the future:

1. **Phase 1**: Implement a database abstraction layer in `db_interpreter`
2. **Phase 2**: Add a PostgreSQL adapter for production workloads
3. **Phase 3**: Implement data partitioning for very large datasets
4. **Phase 4**: Consider distributed storage for multi-terabyte datasets

## Monitoring

Track the following metrics to validate this decision:

- Database file sizes and growth rates
- Query performance for different date ranges
- Memory usage during streaming operations
- Time to process complete backtests

## Review Date

This decision should be reviewed if:

- Database files consistently exceed 50GB
- Query performance degrades below 1000 rows/second (see the measurement sketch below)
- Concurrent access requirements change
- Network-based data sharing becomes necessary
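To make the rows/second threshold above concretely measurable, the following is a minimal benchmark sketch. The helper name `measure_query_throughput`, the sample database path, and the date range are hypothetical; only the `trades` schema and read-only connection settings come from this ADR.

```python
import sqlite3
import time

def measure_query_throughput(db_path, start_ts, end_ts):
    """Return rows/second for a time-range scan over the trades table."""
    conn = sqlite3.connect(f"file:{db_path}?immutable=1&mode=ro", uri=True)
    try:
        started = time.perf_counter()
        cursor = conn.execute(
            "SELECT timestamp, price, size, side FROM trades "
            "WHERE timestamp >= ? AND timestamp < ?",
            (start_ts, end_ts),
        )
        row_count = sum(1 for _ in cursor)  # force full iteration
        elapsed = time.perf_counter() - started
        return row_count / elapsed if elapsed > 0 else float("inf")
    finally:
        conn.close()

# Hypothetical check against the 1000 rows/second review threshold
throughput = measure_query_throughput(
    "data/orderflow.db", "2024-01-02T00:00:00Z", "2024-01-03T00:00:00Z"
)
print(f"{throughput:,.0f} rows/second")
```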