# ADR-001: SQLite Database Choice

## Status

Accepted

## Context

The orderflow backtest system needs to efficiently store and stream large volumes of historical orderbook and trade data. Key requirements include:

- Fast sequential read access for time-series data
- Minimal setup and maintenance overhead
- Support for concurrent reads from the visualization layer
- Ability to handle databases ranging from 100 MB to 10 GB+
- No network dependencies for data access

## Decision

We will use SQLite as the primary database for storing historical orderbook and trade data.

## Consequences

### Positive

- **Zero configuration**: No database server setup or administration required
- **Excellent read performance**: Optimized for sequential scans with proper PRAGMA settings
- **Built-in Python support**: No external dependencies or connection libraries needed
- **File portability**: Database files can be easily shared and archived
- **ACID compliance**: Ensures data integrity during writes (for data ingestion)
- **Small footprint**: Minimal memory and storage overhead
- **Fast startup**: No connection pooling or server initialization delays

### Negative

- **Single-writer limitation**: Cannot handle concurrent writes (acceptable for a read-only backtest)
- **Limited scalability**: Not suitable for high-concurrency production trading systems
- **No network access**: Cannot query databases remotely (acceptable for local analysis)
- **File locking**: Potential issues with file-system sharing (mitigated by read-only access)

## Implementation Details

### Schema Design

```sql
-- Orderbook snapshots with timestamp windows
CREATE TABLE book (
    id INTEGER PRIMARY KEY,
    instrument TEXT,
    bids TEXT NOT NULL,       -- JSON array of [price, size] pairs
    asks TEXT NOT NULL,       -- JSON array of [price, size] pairs
    timestamp TEXT NOT NULL
);

-- Individual trade records
CREATE TABLE trades (
    id INTEGER PRIMARY KEY,
    instrument TEXT,
    trade_id TEXT,
    price REAL NOT NULL,
    size REAL NOT NULL,
    side TEXT NOT NULL,       -- "buy" or "sell"
    timestamp TEXT NOT NULL
);

-- Indexes for efficient time-based queries
CREATE INDEX idx_book_timestamp ON book(timestamp);
CREATE INDEX idx_trades_timestamp ON trades(timestamp);
```
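
The `bids` and `asks` columns hold JSON text, so each snapshot is decoded at read time. A minimal sketch of that decoding, assuming levels are stored best price first; the helper name, the database path, and the `best_bid`/`best_ask` fields are illustrative rather than part of the existing codebase:

```python
import json
import sqlite3

def decode_book_row(row):
    """Turn a raw `book` row into Python structures.

    `row` is (id, instrument, bids, asks, timestamp); the bids/asks columns
    hold JSON arrays of [price, size] pairs, as documented in the schema.
    """
    _id, instrument, bids_json, asks_json, timestamp = row
    bids = json.loads(bids_json)  # e.g. [[100.5, 2.0], [100.4, 1.5], ...]
    asks = json.loads(asks_json)
    return {
        "instrument": instrument,
        "timestamp": timestamp,
        "bids": bids,
        "asks": asks,
        # Assumes levels are stored best price first; adjust if they are not.
        "best_bid": bids[0] if bids else None,
        "best_ask": asks[0] if asks else None,
    }

# Illustrative usage; the database path is hypothetical.
conn = sqlite3.connect("file:history.db?immutable=1&mode=ro", uri=True)
row = conn.execute("SELECT * FROM book ORDER BY timestamp LIMIT 1").fetchone()
snapshot = decode_book_row(row) if row is not None else None
```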

### Performance Optimizations

```python
import sqlite3

# Read-only connection with optimized PRAGMA settings.
# immutable=1 tells SQLite the file cannot change, skipping locking and
# change detection; mode=ro enforces read-only access.
db_path = "history.db"  # illustrative path; supplied by configuration in practice
connection_uri = f"file:{db_path}?immutable=1&mode=ro"
conn = sqlite3.connect(connection_uri, uri=True)

conn.execute("PRAGMA query_only = 1")         # refuse any write statements
conn.execute("PRAGMA temp_store = MEMORY")    # keep temporary tables and indexes in RAM
conn.execute("PRAGMA mmap_size = 268435456")  # 256 MB of memory-mapped I/O
conn.execute("PRAGMA cache_size = 10000")     # page cache of ~10,000 pages
```
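
With a connection opened this way, the backtest can stream a timestamp range without materializing the whole table. A rough sketch of that pattern using the `trades` table and `idx_trades_timestamp` index from the schema above; the function name, chunk size, and date range are illustrative:

```python
def stream_trades(conn, start_ts, end_ts, chunk_size=10_000):
    """Yield trade rows with timestamps in [start_ts, end_ts), oldest first.

    The range predicate can use idx_trades_timestamp, and fetchmany keeps
    memory flat even when the database holds millions of rows.
    """
    cursor = conn.execute(
        "SELECT instrument, trade_id, price, size, side, timestamp "
        "FROM trades WHERE timestamp >= ? AND timestamp < ? "
        "ORDER BY timestamp",
        (start_ts, end_ts),
    )
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield from rows

# Illustrative date range; `conn` is the read-only connection from above.
for trade in stream_trades(conn, "2024-01-01T00:00:00", "2024-01-02T00:00:00"):
    pass  # feed each trade into the backtest engine
```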

## Alternatives Considered

### PostgreSQL

- **Rejected**: Requires server setup and maintenance
- **Pros**: Better concurrent access, richer query features
- **Cons**: Overkill for a read-only use case, deployment complexity

### Parquet Files

- **Rejected**: Limited query capabilities for time-series data
- **Pros**: Excellent compression, columnar format
- **Cons**: No indexes, awkward range queries, requires additional libraries

### MongoDB

- **Rejected**: Document structure not optimal for time-series data
- **Pros**: Flexible schema, good aggregation pipeline
- **Cons**: Requires a server, higher memory usage, learning curve

### CSV Files

- **Rejected**: Poor query performance for large datasets
- **Pros**: Simple format, universal compatibility
- **Cons**: No indexing, slow filtering, type-conversion overhead

### InfluxDB

- **Rejected**: Overkill for historical data analysis
- **Pros**: Optimized for time series, good compression
- **Cons**: Additional service dependency, learning curve

## Migration Path

If scalability becomes an issue in the future:

1. **Phase 1**: Implement a database abstraction layer in `db_interpreter` (a sketch of such an interface follows this list)
2. **Phase 2**: Add a PostgreSQL adapter for production workloads
3. **Phase 3**: Implement data partitioning for very large datasets
4. **Phase 4**: Consider distributed storage for multi-terabyte datasets
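
A minimal sketch of what the Phase 1 abstraction could look like; the `HistoryStore` protocol and `SqliteHistoryStore` class are hypothetical names, not part of the current `db_interpreter` module:

```python
from typing import Iterator, Protocol
import sqlite3

class HistoryStore(Protocol):
    """Backend-agnostic interface the backtest would depend on (hypothetical)."""

    def trades(self, start_ts: str, end_ts: str) -> Iterator[tuple]: ...
    def book_snapshots(self, start_ts: str, end_ts: str) -> Iterator[tuple]: ...

class SqliteHistoryStore:
    """SQLite implementation; a PostgreSQL adapter could satisfy the same protocol later."""

    def __init__(self, db_path: str):
        self._conn = sqlite3.connect(f"file:{db_path}?immutable=1&mode=ro", uri=True)

    def trades(self, start_ts: str, end_ts: str) -> Iterator[tuple]:
        yield from self._conn.execute(
            "SELECT instrument, trade_id, price, size, side, timestamp "
            "FROM trades WHERE timestamp >= ? AND timestamp < ? ORDER BY timestamp",
            (start_ts, end_ts),
        )

    def book_snapshots(self, start_ts: str, end_ts: str) -> Iterator[tuple]:
        yield from self._conn.execute(
            "SELECT instrument, bids, asks, timestamp "
            "FROM book WHERE timestamp >= ? AND timestamp < ? ORDER BY timestamp",
            (start_ts, end_ts),
        )
```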

## Monitoring

Track the following metrics to validate this decision:

- Database file sizes and growth rates
- Query performance for different date ranges (see the timing sketch after this list)
- Memory usage during streaming operations
- Time to process complete backtests
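
One way to capture the query-performance metric is to time a representative range scan and report rows per second, which maps directly onto the review threshold below. A rough sketch; the helper name and date range are illustrative, and `conn` is a read-only connection opened as in Performance Optimizations:

```python
import time

def measure_scan_rate(conn, start_ts, end_ts):
    """Stream a timestamp range from `trades` and report rows per second."""
    started = time.perf_counter()
    count = 0
    for _ in conn.execute(
        "SELECT * FROM trades WHERE timestamp >= ? AND timestamp < ? ORDER BY timestamp",
        (start_ts, end_ts),
    ):
        count += 1
    elapsed = time.perf_counter() - started
    return count / elapsed if elapsed > 0 else float("inf")

# Illustrative range; 1,000 rows/second is the review threshold below.
rate = measure_scan_rate(conn, "2024-01-01T00:00:00", "2024-01-02T00:00:00")
if rate < 1000:
    print(f"Scan rate below review threshold: {rate:.0f} rows/s")
```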

## Review Date

This decision should be reviewed if:

- Database files consistently exceed 50 GB
- Query performance degrades below 1,000 rows/second
- Concurrent access requirements change
- Network-based data sharing becomes necessary