# OHLCVPredictor
End-to-end pipeline for engineering OHLCV features, training an XGBoost regressor (GPU by default), and running inference via a small, reusable predictor API.
## Quickstart (uv)
Prereqs:

- Python 3.12+
- uv installed (see https://docs.astral.sh/uv/)
Install dependencies:

```shell
uv sync
```
Run training (expects an input CSV; see Data requirements):

```shell
uv run python main.py
```
Run the inference demo:

```shell
uv run python inference_example.py
```
## Data requirements
Your input DataFrame/CSV must include these columns:

```
Timestamp, Open, High, Low, Close, Volume, log_return
```
Notes:

- `Timestamp` can be either a pandas datetime-like column or Unix seconds (int). During inference, the predictor will try to parse strings as datetimes; non-object dtypes are treated as Unix seconds.
- `log_return` should be computed as `df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))`. The training script (`main.py`) computes it automatically. For standalone inference, ensure it exists before calling the predictor.
- The training script filters out rows with `Volume == 0` and focuses on data newer than `2017-06-01` by default.
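Putting these notes together, data preparation can be sketched as follows (the sample frame is made up purely for illustration; real data comes from your OHLCV CSV):

```python
import numpy as np
import pandas as pd

# Tiny made-up OHLCV frame; Timestamp is Unix seconds (July 2017).
df = pd.DataFrame({
    'Timestamp': [1500000000, 1500000060, 1500000120, 1500000180],
    'Open':   [100.0, 101.0, 102.0, 103.0],
    'High':   [101.0, 102.0, 103.0, 104.0],
    'Low':    [ 99.0, 100.0, 101.0, 102.0],
    'Close':  [101.0, 102.0, 103.0, 104.0],
    'Volume': [  5.0,   0.0,   7.0,   8.0],
})

# Apply the same default filters as the training script.
df = df[df['Volume'] > 0]
df = df[pd.to_datetime(df['Timestamp'], unit='s') >= '2017-06-01']

# log_return as defined above; the first row is NaN by construction.
df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))
```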
## Training workflow
The training entrypoint is `main.py`. It:

- Reads the CSV at `../data/btcusd_1-min_data.csv` by default. Adjust `csv_path` in `main.py` to point to your data, or move your CSV to that path.
- Engineers a large set of technical and OHLCV-derived features (see `feature_engineering.py` and `technical_indicator_functions.py`).
- Optionally performs walk-forward cross-validation to compute averaged feature importances.
- Prunes low-importance and redundant features, trains XGBoost (GPU by default), and saves artifacts.
- Produces charts with Plotly into `charts/`.
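Walk-forward cross-validation means each validation window strictly follows all of its training data in time, so no fold peeks into the future. A generic sketch of the splitting idea (not the project's exact implementation):

```python
import numpy as np

def walk_forward_splits(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs where each test window comes
    strictly after all of its training rows, so there is no lookahead."""
    fold = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        yield np.arange(0, k * fold), np.arange(k * fold, (k + 1) * fold)

# Feature importances from each fold's model can then be averaged.
splits = list(walk_forward_splits(600, n_folds=5))
```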
Outputs produced by training:

- Model: `../data/xgboost_model_all_features.json`
- Results CSV: `../data/cumulative_feature_results.csv`
- Charts: files under `charts/` (e.g., `all_features_prediction_error_distribution.html`)
Run:

```shell
uv run python main.py
```
If you do not have a CUDA-capable GPU, set the device to CPU (see GPU/CPU notes below).
## Inference usage
You can reuse the predictor in other projects or run the included example.
Minimal example (the CSV path here is a placeholder for your own data):

```python
import numpy as np
import pandas as pd

from predictor import OHLCVPredictor

predictor = OHLCVPredictor('../data/xgboost_model_all_features.json')

# df must contain: Timestamp, Open, High, Low, Close, Volume, log_return
df = pd.read_csv('your_ohlcv.csv')  # placeholder path
df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))

log_returns = predictor.predict(df)
prices_pred, prices_actual = predictor.predict_prices(df)
```
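If you only have predicted log returns, reconstructing a price path follows directly from the definition `log_return[t] = log(price[t] / price[t-1])`. A sketch with made-up numbers (not necessarily how `predict_prices` is implemented internally):

```python
import numpy as np

last_price = 100.0                                   # last observed close (made up)
log_returns_pred = np.array([0.01, -0.005, 0.002])   # model output (made up)

# exp of the cumulative sum of log returns gives the future price path.
prices = last_price * np.exp(np.cumsum(log_returns_pred))
```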
Run the comprehensive demo:

```shell
uv run python inference_example.py
```
Files needed to embed the predictor in another project:

- `predictor.py`
- `custom_xgboost.py`
- `feature_engineering.py`
- `technical_indicator_functions.py`
- your trained model file (e.g., `xgboost_model_all_features.json`)
## GPU/CPU notes
Training uses XGBoost with `device='cuda'` by default (see `custom_xgboost.py`). If you do not have a CUDA-capable GPU or drivers:

- Change the parameter in `CustomXGBoostGPU.train()` from `device='cuda'` to `device='cpu'`, or
- Pass `device='cpu'` when calling `train()` wherever applicable.
Inference works on CPU even if the model was trained on GPU.
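For orientation, a generic XGBoost 2.x-style parameter dict showing where `device` goes. The `objective` and `tree_method` values here are illustrative assumptions, not copied from `custom_xgboost.py`:

```python
# Illustrative parameters only; the project's actual setup lives in
# CustomXGBoostGPU.train().
params = {
    'objective': 'reg:squarederror',  # assumed regression objective
    'tree_method': 'hist',
    'device': 'cpu',  # switch to 'cuda' when a CUDA GPU is available
}
```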
## Dependencies
The project is managed via `pyproject.toml` and uv. Key runtime deps include:

- `xgboost`, `pandas`, `numpy`, `scikit-learn`, `ta`, `numba`
- Dash/Plotly for charts (Plotly is used by `plot_results.py`)
Install using:

```shell
uv sync
```
## Troubleshooting
- `KeyError: 'log_return'` during inference: ensure your input DataFrame includes `log_return` as described above.
- Model file not found: confirm the path passed to `OHLCVPredictor(...)` matches where training saved it (default `../data/xgboost_model_all_features.json`).
- Empty/old charts: delete the `charts/` folder and rerun training.
- Memory issues: consider downcasting or using smaller windows; the code already downcasts numerics where possible.
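The downcasting mentioned above can be done with pandas alone. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Close': np.arange(1000, dtype=np.float64)})
before = df.memory_usage(deep=True).sum()

# Downcast float64 columns to the smallest float dtype that fits (float32 here).
for col in df.select_dtypes(include='float64').columns:
    df[col] = pd.to_numeric(df[col], downcast='float')

after = df.memory_usage(deep=True).sum()
```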