
ADR-001: SQLite Database Choice

Status

Accepted

Context

The orderflow backtest system needs to efficiently store and stream large volumes of historical orderbook and trade data. Key requirements include:

  • Fast sequential read access for time-series data
  • Minimal setup and maintenance overhead
  • Support for concurrent reads from visualization layer
  • Ability to handle databases ranging from 100MB to 10GB+
  • No network dependencies for data access

Decision

We will use SQLite as the primary database for storing historical orderbook and trade data.

Consequences

Positive

  • Zero configuration: No database server setup or administration required
  • Excellent read performance: Optimized for sequential scans with proper PRAGMA settings
  • Built-in Python support: the standard-library sqlite3 module works out of the box, with no external drivers or connection libraries needed
  • File portability: Database files can be easily shared and archived
  • ACID compliance: Ensures data integrity during writes (for data ingestion)
  • Small footprint: Minimal memory and storage overhead
  • Fast startup: No connection pooling or server initialization delays

Negative

  • Single writer limitation: Cannot handle concurrent writes (acceptable for read-only backtest)
  • Limited scalability: Not suitable for high-concurrency production trading systems
  • No network access: Cannot query databases remotely (acceptable for local analysis)
  • File locking: Potential locking issues when the database file is shared over a network file system (mitigated by read-only access)

Implementation Details

Schema Design

-- Orderbook snapshots with timestamp windows
CREATE TABLE book (
    id INTEGER PRIMARY KEY,
    instrument TEXT,
    bids TEXT NOT NULL,        -- JSON array of [price, size] pairs
    asks TEXT NOT NULL,        -- JSON array of [price, size] pairs
    timestamp TEXT NOT NULL
);

-- Individual trade records
CREATE TABLE trades (
    id INTEGER PRIMARY KEY,
    instrument TEXT,
    trade_id TEXT,
    price REAL NOT NULL,
    size REAL NOT NULL,
    side TEXT NOT NULL,        -- "buy" or "sell"
    timestamp TEXT NOT NULL
);

-- Indexes for efficient time-based queries
CREATE INDEX idx_book_timestamp ON book(timestamp);
CREATE INDEX idx_trades_timestamp ON trades(timestamp);
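
Because bids and asks are stored as JSON-encoded text, consumers decode each snapshot with the standard-library json module. A minimal sketch, assuming an already-open sqlite3 connection; the helper name load_book_snapshots is illustrative, not part of the codebase:

import json
import sqlite3

def load_book_snapshots(conn: sqlite3.Connection, start: str, end: str):
    """Yield decoded orderbook snapshots within a timestamp window (illustrative helper)."""
    cursor = conn.execute(
        "SELECT instrument, bids, asks, timestamp FROM book "
        "WHERE timestamp BETWEEN ? AND ? ORDER BY timestamp",
        (start, end),
    )
    for instrument, bids, asks, ts in cursor:
        # bids/asks are JSON arrays of [price, size] pairs, per the schema comments above
        yield instrument, json.loads(bids), json.loads(asks), ts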

Performance Optimizations

import sqlite3

db_path = "orderflow.db"  # placeholder: path to the historical SQLite database

# Read-only connection with optimized PRAGMA settings
connection_uri = f"file:{db_path}?immutable=1&mode=ro"
conn = sqlite3.connect(connection_uri, uri=True)
conn.execute("PRAGMA query_only = 1")          # reject any write statements
conn.execute("PRAGMA temp_store = MEMORY")     # keep temporary tables and indexes in RAM
conn.execute("PRAGMA mmap_size = 268435456")   # memory-map up to 256MB of the database file
conn.execute("PRAGMA cache_size = 10000")      # hold roughly 10,000 pages in the page cache

Alternatives Considered

PostgreSQL

  • Rejected: Requires server setup and maintenance
  • Pros: Better concurrent access, richer query features
  • Cons: Overkill for read-only use case, deployment complexity

Parquet Files

  • Rejected: Limited query capabilities for time-series data
  • Pros: Excellent compression, columnar format
  • Cons: No indexes, range queries are awkward to express, and reading requires additional libraries

MongoDB

  • Rejected: Document structure not optimal for time-series data
  • Pros: Flexible schema, good aggregation pipeline
  • Cons: Requires server, higher memory usage, learning curve

CSV Files

  • Rejected: Poor query performance for large datasets
  • Pros: Simple format, universal compatibility
  • Cons: No indexing, slow filtering, type conversion overhead

InfluxDB

  • Rejected: Overkill for historical data analysis
  • Pros: Optimized for time-series, good compression
  • Cons: Additional service dependency, learning curve

Migration Path

If scalability becomes an issue in the future:

  1. Phase 1: Implement database abstraction layer in db_interpreter (a minimal interface sketch follows this list)
  2. Phase 2: Add PostgreSQL adapter for production workloads
  3. Phase 3: Implement data partitioning for very large datasets
  4. Phase 4: Consider distributed storage for multi-terabyte datasets
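
One possible shape for the Phase 1 abstraction, sketched as a Python Protocol; the HistoricalStore name and method signatures are hypothetical, intended only to show how backtest code could depend on an interface rather than on sqlite3 directly:

from typing import Iterable, Protocol, Tuple

class HistoricalStore(Protocol):
    """Hypothetical Phase 1 interface in db_interpreter.
    A SQLite implementation satisfies it today; a PostgreSQL adapter could be swapped in later."""

    def book_snapshots(self, start: str, end: str) -> Iterable[Tuple]:
        """Yield orderbook snapshots ordered by timestamp."""
        ...

    def trades(self, start: str, end: str) -> Iterable[Tuple]:
        """Yield trade records ordered by timestamp."""
        ...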

Monitoring

Track the following metrics to validate this decision:

  • Database file sizes and growth rates
  • Query performance for different date ranges (see the measurement sketch after this list)
  • Memory usage during streaming operations
  • Time to process complete backtests
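
To make the rows-per-second figure referenced under Review Date measurable, a rough timing sketch using the standard-library time module; measure_query_throughput is an illustrative helper, not an existing function:

import time

def measure_query_throughput(conn, start, end):
    """Return approximate rows per second for a time-range scan (illustrative)."""
    t0 = time.perf_counter()
    rows = 0
    for _ in conn.execute(
        "SELECT timestamp FROM trades WHERE timestamp BETWEEN ? AND ?",
        (start, end),
    ):
        rows += 1
    elapsed = time.perf_counter() - t0
    return rows / elapsed if elapsed > 0 else float("inf")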

Review Date

This decision should be reviewed if:

  • Database files consistently exceed 50GB
  • Query performance degrades below 1000 rows/second
  • Concurrent access requirements change
  • Network-based data sharing becomes necessary