# ADR-001: SQLite Database Choice
## Status
Accepted
## Context
The orderflow backtest system needs to efficiently store and stream large volumes of historical orderbook and trade data. Key requirements include:
- Fast sequential read access for time-series data
- Minimal setup and maintenance overhead
- Support for concurrent reads from the visualization layer
- Ability to handle databases ranging from 100MB to 10GB+
- No network dependencies for data access
## Decision
We will use SQLite as the primary database for storing historical orderbook and trade data.
## Consequences
### Positive
- **Zero configuration**: No database server setup or administration required
- **Excellent read performance**: Optimized for sequential scans with proper PRAGMA settings
- **Built-in Python support**: No external dependencies or connection libraries needed
- **File portability**: Database files can be easily shared and archived
- **ACID compliance**: Ensures data integrity during writes (for data ingestion)
- **Small footprint**: Minimal memory and storage overhead
- **Fast startup**: No connection pooling or server initialization delays
### Negative
- **Single writer limitation**: Cannot handle concurrent writes (acceptable for read-only backtest)
- **Limited scalability**: Not suitable for high-concurrency production trading systems
- **No network access**: Cannot query databases remotely (acceptable for local analysis)
- **File locking**: Potential locking issues when the database file lives on a shared or network file system (mitigated by read-only access)
## Implementation Details
### Schema Design
```sql
-- Orderbook snapshots with timestamp windows
CREATE TABLE book (
    id INTEGER PRIMARY KEY,
    instrument TEXT,
    bids TEXT NOT NULL,        -- JSON array of [price, size] pairs
    asks TEXT NOT NULL,        -- JSON array of [price, size] pairs
    timestamp TEXT NOT NULL
);

-- Individual trade records
CREATE TABLE trades (
    id INTEGER PRIMARY KEY,
    instrument TEXT,
    trade_id TEXT,
    price REAL NOT NULL,
    size REAL NOT NULL,
    side TEXT NOT NULL,        -- "buy" or "sell"
    timestamp TEXT NOT NULL
);

-- Indexes for efficient time-based queries
CREATE INDEX idx_book_timestamp ON book(timestamp);
CREATE INDEX idx_trades_timestamp ON trades(timestamp);
```
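For reference, a minimal sketch of how a `book` row's JSON-encoded ladders could be decoded in Python; the helper name is hypothetical and the column order follows the schema above:
```python
import json

def decode_book_row(row):
    """Decode one `book` row into Python structures (hypothetical helper)."""
    row_id, instrument, bids_json, asks_json, timestamp = row
    return {
        "id": row_id,
        "instrument": instrument,
        # bids/asks are stored as JSON arrays of [price, size] pairs,
        # e.g. [[100.5, 2.0], [100.4, 1.5]]
        "bids": json.loads(bids_json),
        "asks": json.loads(asks_json),
        "timestamp": timestamp,
    }
```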
### Performance Optimizations
```python
# Read-only connection with optimized PRAGMA settings
import sqlite3

db_path = "data/orderflow.db"  # example path to the SQLite database file
connection_uri = f"file:{db_path}?immutable=1&mode=ro"
conn = sqlite3.connect(connection_uri, uri=True)
conn.execute("PRAGMA query_only = 1")
conn.execute("PRAGMA temp_store = MEMORY")
conn.execute("PRAGMA mmap_size = 268435456")  # 256 MB memory-mapped I/O
conn.execute("PRAGMA cache_size = 10000")     # page cache size, in pages
```
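Building on those settings, a hedged sketch of how trade rows might be streamed in timestamp order for a backtest window; the function and parameter names are illustrative, not taken from the codebase:
```python
import sqlite3

def stream_trades(db_path, start_ts, end_ts, batch_size=10_000):
    """Yield trade rows between start_ts and end_ts in timestamp order (illustrative)."""
    conn = sqlite3.connect(f"file:{db_path}?immutable=1&mode=ro", uri=True)
    conn.execute("PRAGMA query_only = 1")
    try:
        cursor = conn.execute(
            "SELECT instrument, trade_id, price, size, side, timestamp "
            "FROM trades WHERE timestamp BETWEEN ? AND ? ORDER BY timestamp",
            (start_ts, end_ts),
        )
        # Fetch in batches to keep memory bounded while streaming large windows
        while True:
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            yield from batch
    finally:
        conn.close()
```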
## Alternatives Considered
### PostgreSQL
- **Rejected**: Requires server setup and maintenance
- **Pros**: Better concurrent access, richer query features
- **Cons**: Overkill for read-only use case, deployment complexity
### Parquet Files
- **Rejected**: Limited query capabilities for time-series data
- **Pros**: Excellent compression, columnar format
- **Cons**: No indexes, time-range queries are awkward without extra tooling, requires additional libraries
### MongoDB
- **Rejected**: Document structure not optimal for time-series data
- **Pros**: Flexible schema, good aggregation pipeline
- **Cons**: Requires server, higher memory usage, learning curve
### CSV Files
- **Rejected**: Poor query performance for large datasets
- **Pros**: Simple format, universal compatibility
- **Cons**: No indexing, slow filtering, type conversion overhead
### InfluxDB
- **Rejected**: Overkill for historical data analysis
- **Pros**: Optimized for time-series, good compression
- **Cons**: Additional service dependency, learning curve
## Migration Path
If scalability becomes an issue in the future:
1. **Phase 1**: Implement a database abstraction layer in `db_interpreter` (a minimal interface sketch follows this list)
2. **Phase 2**: Add PostgreSQL adapter for production workloads
3. **Phase 3**: Implement data partitioning for very large datasets
4. **Phase 4**: Consider distributed storage for multi-terabyte datasets
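As a rough illustration of Phase 1, the abstraction layer could expose a small read-only interface that the current SQLite reader and any future PostgreSQL adapter both implement; the names below are hypothetical:
```python
from abc import ABC, abstractmethod
from typing import Iterator, Tuple

class MarketDataReader(ABC):
    """Hypothetical read-only interface sketch for Phase 1 of the migration path."""

    @abstractmethod
    def stream_book(self, start_ts: str, end_ts: str) -> Iterator[Tuple]:
        """Yield orderbook snapshot rows in timestamp order for the given window."""

    @abstractmethod
    def stream_trades(self, start_ts: str, end_ts: str) -> Iterator[Tuple]:
        """Yield trade rows in timestamp order for the given window."""
```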
## Monitoring
Track the following metrics to validate this decision (a small measurement sketch follows the list):
- Database file sizes and growth rates
- Query performance for different date ranges
- Memory usage during streaming operations
- Time to process complete backtests
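A simple way to capture the rows-per-second metric, assuming rows are consumed from a generator such as the streaming example above; the helper name is illustrative:
```python
import time

def measure_throughput(row_iter):
    """Count rows and elapsed wall time, returning rows per second (illustrative)."""
    start = time.perf_counter()
    count = 0
    for _ in row_iter:
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else 0.0
```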
## Review Date
This decision should be reviewed if:
- Database files consistently exceed 50GB
- Query performance degrades below 1000 rows/second
- Concurrent access requirements change
- Network-based data sharing becomes necessary