ADR-001: SQLite Database Choice
Status
Accepted
Context
The orderflow backtest system needs to efficiently store and stream large volumes of historical orderbook and trade data. Key requirements include:
- Fast sequential read access for time-series data
- Minimal setup and maintenance overhead
- Support for concurrent reads from visualization layer
- Ability to handle databases ranging from 100MB to 10GB+
- No network dependencies for data access
Decision
We will use SQLite as the primary database for storing historical orderbook and trade data.
Consequences
Positive
- Zero configuration: No database server setup or administration required
- Excellent read performance: Optimized for sequential scans with proper PRAGMA settings
- Built-in Python support: No external dependencies or connection libraries needed
- File portability: Database files can be easily shared and archived
- ACID compliance: Ensures data integrity during writes (for data ingestion)
- Small footprint: Minimal memory and storage overhead
- Fast startup: No connection pooling or server initialization delays
Negative
- Single writer limitation: Cannot handle concurrent writes (acceptable for read-only backtest)
- Limited scalability: Not suitable for high-concurrency production trading systems
- No network access: Cannot query databases remotely (acceptable for local analysis)
- File locking: Potential issues with file system sharing (mitigated by read-only access)
Implementation Details
Schema Design
-- Orderbook snapshots with timestamp windows
CREATE TABLE book (
    id INTEGER PRIMARY KEY,
    instrument TEXT,
    bids TEXT NOT NULL,        -- JSON array of [price, size] pairs
    asks TEXT NOT NULL,        -- JSON array of [price, size] pairs
    timestamp TEXT NOT NULL
);
-- Individual trade records
CREATE TABLE trades (
    id INTEGER PRIMARY KEY,
    instrument TEXT,
    trade_id TEXT,
    price REAL NOT NULL,
    size REAL NOT NULL,
    side TEXT NOT NULL,        -- "buy" or "sell"
    timestamp TEXT NOT NULL
);
-- Indexes for efficient time-based queries
CREATE INDEX idx_book_timestamp ON book(timestamp);
CREATE INDEX idx_trades_timestamp ON trades(timestamp);
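For illustration, a minimal ingestion sketch showing how a snapshot maps onto this schema; the helper name insert_book_snapshot, the database filename, and the instrument/price values are assumptions made for the example, and timestamps are assumed to be ISO-8601 text so they sort lexicographically and work with the timestamp indexes:
# Illustrative ingestion sketch: bids/asks stored as JSON text, timestamps as ISO-8601 strings
import json
import sqlite3

def insert_book_snapshot(conn, instrument, bids, asks, timestamp):
    # bids/asks are lists of [price, size] pairs, serialized into the JSON text columns
    conn.execute(
        "INSERT INTO book (instrument, bids, asks, timestamp) VALUES (?, ?, ?, ?)",
        (instrument, json.dumps(bids), json.dumps(asks), timestamp),
    )

with sqlite3.connect("orderflow.db") as conn:  # filename is illustrative
    insert_book_snapshot(
        conn,
        instrument="BTC-USD",
        bids=[[50000.0, 1.5], [49999.5, 2.0]],
        asks=[[50000.5, 0.8], [50001.0, 3.2]],
        timestamp="2024-01-15T09:30:00.000Z",
    )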
Performance Optimizations
# Read-only connection with optimized PRAGMA settings
import sqlite3

# db_path: path to the SQLite database file on disk
connection_uri = f"file:{db_path}?immutable=1&mode=ro"
conn = sqlite3.connect(connection_uri, uri=True)
conn.execute("PRAGMA query_only = 1")
conn.execute("PRAGMA temp_store = MEMORY")
conn.execute("PRAGMA mmap_size = 268435456")  # 256MB of memory-mapped I/O
conn.execute("PRAGMA cache_size = 10000")     # page cache size, in pages
Alternatives Considered
PostgreSQL
- Rejected: Requires server setup and maintenance
- Pros: Better concurrent access, richer query features
- Cons: Overkill for read-only use case, deployment complexity
Parquet Files
- Rejected: Limited query capabilities for time-series data
- Pros: Excellent compression, columnar format
- Cons: No indexes, range queries are cumbersome, requires additional libraries
MongoDB
- Rejected: Document structure not optimal for time-series data
- Pros: Flexible schema, good aggregation pipeline
- Cons: Requires server, higher memory usage, learning curve
CSV Files
- Rejected: Poor query performance for large datasets
- Pros: Simple format, universal compatibility
- Cons: No indexing, slow filtering, type conversion overhead
InfluxDB
- Rejected: Overkill for historical data analysis
- Pros: Optimized for time-series, good compression
- Cons: Additional service dependency, learning curve
Migration Path
If scalability becomes an issue in the future:
- Phase 1: Implement a database abstraction layer in db_interpreter (see the interface sketch below)
- Phase 2: Add a PostgreSQL adapter for production workloads
- Phase 3: Implement data partitioning for very large datasets
- Phase 4: Consider distributed storage for multi-terabyte datasets
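One possible shape for the Phase 1 abstraction layer; the class and method names below are hypothetical and do not describe the current db_interpreter API:
# Hypothetical interface: backtest code depends on this abstraction,
# so a PostgreSQL adapter can later be swapped in without touching callers.
from abc import ABC, abstractmethod
from typing import Iterator

class MarketDataStore(ABC):
    @abstractmethod
    def stream_trades(self, instrument: str, start: str, end: str) -> Iterator[tuple]:
        """Yield trade rows for instrument between start and end timestamps."""

    @abstractmethod
    def stream_book(self, instrument: str, start: str, end: str) -> Iterator[tuple]:
        """Yield orderbook snapshots for instrument between start and end timestamps."""

class SqliteMarketDataStore(MarketDataStore):
    def __init__(self, conn):
        self.conn = conn

    def stream_trades(self, instrument, start, end):
        return self.conn.execute(
            "SELECT timestamp, price, size, side FROM trades "
            "WHERE instrument = ? AND timestamp BETWEEN ? AND ? ORDER BY timestamp",
            (instrument, start, end),
        )

    def stream_book(self, instrument, start, end):
        return self.conn.execute(
            "SELECT timestamp, bids, asks FROM book "
            "WHERE instrument = ? AND timestamp BETWEEN ? AND ? ORDER BY timestamp",
            (instrument, start, end),
        )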
Monitoring
Track the following metrics to validate this decision:
- Database file sizes and growth rates
- Query performance for different date ranges (a measurement sketch follows this list)
- Memory usage during streaming operations
- Time to process complete backtests
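A rough way to collect the query-performance metric; measure_scan_rate is an illustrative helper rather than existing tooling:
# Rough rows/second estimate from a full scan of the trades table
import time

def measure_scan_rate(conn, table="trades"):
    start = time.perf_counter()
    rows = 0
    for _ in conn.execute(f"SELECT * FROM {table}"):
        rows += 1
    elapsed = time.perf_counter() - start
    return rows / elapsed if elapsed > 0 else float("inf")

print(f"trades scan rate: {measure_scan_rate(conn):,.0f} rows/second")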
Review Date
This decision should be reviewed if:
- Database files consistently exceed 50GB
- Query performance degrades below 1000 rows/second
- Concurrent access requirements change
- Network-based data sharing becomes necessary