# OHLCVPredictor
End-to-end pipeline for engineering OHLCV features, training an XGBoost regressor (GPU by default), and running inference via a small, reusable predictor API.
## Quickstart (uv)
Prereqs:
- Python 3.12+
- `uv` installed (see `https://docs.astral.sh/uv/`)
Install dependencies:
```powershell
uv sync
```
Run training (expects an input CSV; see Data Requirements):
```powershell
uv run python main.py
```
Run the inference demo:
```powershell
uv run python inference_example.py
```
## Data requirements
Your input DataFrame/CSV must include these columns:
- `Timestamp`, `Open`, `High`, `Low`, `Close`, `Volume`, `log_return`
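A quick sanity check before handing a frame to the pipeline can look like this (a minimal sketch; the helper name is ours and not part of the repo):

```python
import pandas as pd

REQUIRED_COLUMNS = ['Timestamp', 'Open', 'High', 'Low', 'Close', 'Volume', 'log_return']

def missing_columns(df: pd.DataFrame) -> list:
    # Names of required columns absent from the frame (empty list == OK).
    return [c for c in REQUIRED_COLUMNS if c not in df.columns]
```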
Notes:
- `Timestamp` can be either a pandas datetime-like column or Unix seconds (int). During inference, the predictor will try to parse strings as datetimes; non-object dtypes are treated as Unix seconds.
- `log_return` should be computed as:
```python
import numpy as np

df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))
```
The training script (`main.py`) computes it automatically. For standalone inference, ensure it exists before calling the predictor.
- The training script filters out rows with `Volume == 0` and focuses on data newer than `2017-06-01` by default.
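The `Timestamp` handling described above can be sketched like this (a minimal illustration of the documented behaviour; the helper name is ours, not part of the repo):

```python
import pandas as pd

def to_datetime_series(ts: pd.Series) -> pd.Series:
    # Object/string columns are parsed as datetimes; numeric columns
    # are treated as Unix seconds, mirroring the predictor's behaviour.
    if ts.dtype == object:
        return pd.to_datetime(ts)
    return pd.to_datetime(ts, unit='s')

# Both representations normalise to the same timestamp:
a = to_datetime_series(pd.Series(['2021-01-01 00:00:00']))
b = to_datetime_series(pd.Series([1609459200]))  # Unix seconds
```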
## Training workflow
The training entrypoint is `main.py`:
- Reads the CSV at `../data/btcusd_1-min_data.csv` by default. Adjust `csv_path` in `main.py` to point to your data, or move your CSV to that path.
- Engineers a large set of technical and OHLCV-derived features (see `feature_engineering.py` and `technical_indicator_functions.py`).
- Optionally performs walk-forward cross-validation to compute averaged feature importances.
- Prunes low-importance and redundant features, trains XGBoost (GPU by default), and saves artifacts.
- Produces charts with Plotly into `charts/`.
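The walk-forward step can be illustrated with a minimal expanding-window splitter (a sketch of the general technique; the actual fold logic lives in `main.py` and may differ):

```python
import numpy as np

def walk_forward_splits(n_rows: int, n_folds: int, min_train: int):
    # Expanding-window splits: each fold trains on everything before its
    # test block, so no future data leaks into training.
    fold_size = (n_rows - min_train) // n_folds
    for i in range(n_folds):
        train_end = min_train + i * fold_size
        yield np.arange(train_end), np.arange(train_end, train_end + fold_size)
```

Feature importances computed per fold can then be averaged across the folds before pruning.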
Outputs produced by training:
- Model: `../data/xgboost_model_all_features.json`
- Results CSV: `../data/cumulative_feature_results.csv`
- Charts: files under `charts/` (e.g., `all_features_prediction_error_distribution.html`)
Run:
```powershell
uv run python main.py
```
If you do not have a CUDA-capable GPU, set the device to CPU (see GPU/CPU section).
## Inference usage
You can reuse the predictor in other projects or run the included example.
Minimal example:
```python
from predictor import OHLCVPredictor

predictor = OHLCVPredictor('../data/xgboost_model_all_features.json')

# df must contain: Timestamp, Open, High, Low, Close, Volume, log_return
log_returns = predictor.predict(df)
prices_pred, prices_actual = predictor.predict_prices(df)
```
Run the comprehensive demo:
```powershell
uv run python inference_example.py
```
Files needed to embed the predictor in another project:
- `predictor.py`
- `custom_xgboost.py`
- `feature_engineering.py`
- `technical_indicator_functions.py`
- your trained model file (e.g., `xgboost_model_all_features.json`)
## GPU/CPU notes
Training uses XGBoost with `device='cuda'` by default (see `custom_xgboost.py`). If you do not have a CUDA-capable GPU or drivers:
- Change the parameter in `CustomXGBoostGPU.train()` from `device='cuda'` to `device='cpu'`, or
- Pass `device='cpu'` when calling `train()` wherever applicable.
Inference works on CPU even if the model was trained on GPU.
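For reference, the switch is a single key in the XGBoost parameter dict (the key names follow XGBoost's native API; the surrounding values are illustrative, not copied from `custom_xgboost.py`):

```python
# Illustrative XGBoost parameters; only 'device' needs to change
# to move between CPU and GPU training.
params = {
    'device': 'cpu',                  # 'cuda' for GPU training
    'tree_method': 'hist',            # works on both devices
    'objective': 'reg:squarederror',
}
```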
## Dependencies
The project is managed via `pyproject.toml` and `uv`. Key runtime deps include:
- `xgboost`, `pandas`, `numpy`, `scikit-learn`, `ta`, `numba`
- `dash`/Plotly for charts (Plotly is used by `plot_results.py`)
Install using:
```powershell
uv sync
```
## Troubleshooting
- KeyError: `'log_return'` during inference: ensure your input DataFrame includes `log_return` as described above.
- Model file not found: confirm the path passed to `OHLCVPredictor(...)` matches where training saved it (default `../data/xgboost_model_all_features.json`).
- Empty/old charts: delete the `charts/` folder and rerun training.
- Memory issues: consider downcasting or using smaller windows; the code already downcasts numerics where possible.
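The downcasting idea can be sketched as follows (the helper name is ours; the project's own downcasting may differ):

```python
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    # Shrink numeric dtypes (e.g. float64 -> float32) to cut memory use.
    df = df.copy()
    for col in df.select_dtypes(include='float').columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    for col in df.select_dtypes(include='integer').columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    return df
```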