OHLCVPredictor
End-to-end pipeline for engineering OHLCV features, training an XGBoost regressor (GPU by default), and running inference via a small, reusable predictor API.
Quickstart (uv)
Prereqs:
- Python 3.12+
- uv installed (see https://docs.astral.sh/uv/)
Install dependencies:
uv sync
Run training (expects an input CSV; see Data Requirements):
uv run python main.py
Run the inference demo:
uv run python inference_example.py
Data requirements
Your input DataFrame/CSV must include these columns:
Timestamp, Open, High, Low, Close, Volume, log_return
Notes:
- Timestamp can be either a pandas datetime-like column or Unix seconds (int). During inference, the predictor will try to parse strings as datetimes; non-object dtypes are treated as Unix seconds.
- log_return should be computed as:
  df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))
  The training script (main.py) computes it automatically. For standalone inference, ensure it exists before calling the predictor.
- The training script filters out rows with Volume == 0 and focuses on data newer than 2017-06-01 by default.
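The notes above can be sketched as a single preparation step. This is a minimal illustration, not the code in main.py: the helper names to_datetime_column and prepare_ohlcv are hypothetical, the order of operations in the real training script may differ, and datetime columns are passed through unchanged here as an assumption.

```python
import numpy as np
import pandas as pd


def to_datetime_column(ts: pd.Series) -> pd.Series:
    """Normalize a Timestamp column (hypothetical helper, not part of the project)."""
    if pd.api.types.is_datetime64_any_dtype(ts):
        return ts  # already datetime-like
    if pd.api.types.is_numeric_dtype(ts):
        return pd.to_datetime(ts, unit="s")  # numeric values read as Unix seconds
    return pd.to_datetime(ts)  # strings parsed as datetimes


def prepare_ohlcv(df: pd.DataFrame, start: str = "2017-06-01") -> pd.DataFrame:
    """Add log_return, drop zero-volume rows, and keep rows newer than `start`."""
    out = df.copy()
    # Log return of each close relative to the previous close; the first row is NaN.
    out["log_return"] = np.log(out["Close"] / out["Close"].shift(1))
    ts = to_datetime_column(out["Timestamp"])
    mask = (out["Volume"] != 0) & (ts > pd.Timestamp(start))
    return out.loc[mask].reset_index(drop=True)


# Tiny synthetic frame: daily rows starting at 2017-06-01 00:00:00 UTC.
raw = pd.DataFrame(
    {
        "Timestamp": [1496275200, 1496361600, 1496448000, 1496534400],
        "Close": [100.0, 110.0, 121.0, 121.0],
        "Volume": [1.0, 2.0, 0.0, 3.0],
    }
)
prepared = prepare_ohlcv(raw)
print(prepared[["Timestamp", "Close", "Volume", "log_return"]])
```

Computing log_return before filtering means each value is relative to the immediately preceding row of the raw data, including rows that are later dropped. Whether that matches the training script is something to check against main.py.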
Training workflow
The training entrypoint is main.py:
- Reads the CSV at ../data/btcusd_1-min_data.csv by default. Adjust csv_path in main.py to point to your data, or move your CSV to that path.
- Engineers a large set of technical and OHLCV-derived features (see feature_engineering.py and technical_indicator_functions.py).
- Optionally performs walk-forward cross validation to compute averaged feature importances.
- Prunes low-importance and redundant features, trains XGBoost (GPU by default), and saves artifacts.
- Produces charts with Plotly into charts/.
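For readers unfamiliar with walk-forward cross validation, the idea can be sketched as follows. This is an illustrative expanding-window version with hypothetical helper names (walk_forward_splits, average_importances) and made-up feature names; the fold layout and averaging used by main.py may differ.

```python
def walk_forward_splits(n_rows, n_folds, min_train):
    """Yield (train_range, test_range) pairs; each test window follows its training window."""
    if n_folds < 1 or min_train < 1:
        raise ValueError("n_folds and min_train must be positive")
    test_size = (n_rows - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough rows for the requested folds")
    for fold in range(n_folds):
        train_end = min_train + fold * test_size
        # Training always starts at row 0 and grows; the test window never overlaps it.
        yield range(0, train_end), range(train_end, train_end + test_size)


def average_importances(per_fold):
    """Average a list of {feature: importance} dicts; a feature absent from a fold counts as 0."""
    names = sorted({name for fold in per_fold for name in fold})
    return {n: sum(f.get(n, 0.0) for f in per_fold) / len(per_fold) for n in names}


splits = list(walk_forward_splits(n_rows=10, n_folds=3, min_train=4))
for train_idx, test_idx in splits:
    print(f"train rows {train_idx.start}-{train_idx.stop - 1}, "
          f"test rows {test_idx.start}-{test_idx.stop - 1}")

# Hypothetical per-fold importances, only to show the averaging step.
averaged = average_importances([{"rsi": 1.0, "ema_20": 0.0}, {"rsi": 0.0}])
print(averaged)
```

Because every test window lies strictly after its training window, no future rows leak into the model being evaluated, which is the reason to prefer this scheme over shuffled k-fold for time series.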
Outputs produced by training:
- Model: ../data/xgboost_model_all_features.json
- Feature list: ../data/xgboost_model_all_features_features.json (exact feature names and order used for training)
- Results CSV: ../data/cumulative_feature_results.csv
- Charts: files under charts/ (e.g., all_features_prediction_error_distribution.html)
Run:
uv run python main.py
If you do not have a CUDA-capable GPU, set the device to CPU (see GPU/CPU section).
Inference usage
You can reuse the predictor in other projects or run the included example.
Minimal example:
from predictor import OHLCVPredictor
import numpy as np
predictor = OHLCVPredictor('../data/xgboost_model_all_features.json')
# df must contain: Timestamp, Open, High, Low, Close, Volume, log_return
log_returns = predictor.predict(df)
prices_pred, prices_actual = predictor.predict_prices(df)
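As background for the two outputs above, a log return and a price are related by Close_t = Close_{t-1} * exp(log_return_t). The sketch below only illustrates that identity; log_returns_to_prices is a hypothetical helper, and this is not a claim about how predict_prices is implemented internally.

```python
import math


def log_returns_to_prices(prev_closes, log_returns):
    """Map each previous close C and log return r to the implied price C * exp(r)."""
    return [c * math.exp(r) for c, r in zip(prev_closes, log_returns)]


# Illustrative values: a +10% move from 100, then no move from 110.
prev_closes = [100.0, 110.0]
example_returns = [math.log(1.1), 0.0]
implied = log_returns_to_prices(prev_closes, example_returns)
print(implied)
```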
Run the comprehensive demo:
uv run python inference_example.py
Files needed to embed the predictor in another project:
- predictor.py
- custom_xgboost.py
- feature_engineering.py
- technical_indicator_functions.py
- your trained model file (e.g., xgboost_model_all_features.json)
- the companion feature list JSON saved next to the model (same basename with _features.json)
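When copying artifacts into another project, a quick check that both JSON files travelled together can save a confusing error later. A minimal sketch, assuming only the naming rule stated above (model basename plus _features.json); the helper names are hypothetical and not part of predictor.py.

```python
from pathlib import Path


def companion_feature_path(model_path):
    """Return the feature-list path that sits next to the model file."""
    model_path = Path(model_path)
    return model_path.with_name(f"{model_path.stem}_features.json")


def missing_artifacts(model_path):
    """List whichever of the model file and its feature list is not on disk."""
    candidates = [Path(model_path), companion_feature_path(model_path)]
    return [p for p in candidates if not p.is_file()]


model = "../data/xgboost_model_all_features.json"
print(companion_feature_path(model))
for path in missing_artifacts(model):
    print(f"missing: {path}")
```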
GPU/CPU notes
Training uses XGBoost with device='cuda' by default (see custom_xgboost.py). If you do not have a CUDA-capable GPU or drivers:
- Change the parameter in CustomXGBoostGPU.train() from device='cuda' to device='cpu', or
- Pass device='cpu' when calling train() wherever applicable.
Inference works on CPU even if the model was trained on GPU.
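If the same code has to run on machines with and without a GPU, the device can be chosen at runtime rather than edited by hand. The sketch below is an assumption-laden illustration: pick_device is a hypothetical helper, and checking for nvidia-smi on the PATH is only a heuristic that does not prove XGBoost was built with CUDA support. The device and tree_method keys are standard XGBoost parameters (XGBoost 2.0 and later); how they are passed into CustomXGBoostGPU.train() depends on that method's signature.

```python
import shutil


def pick_device(prefer_gpu=True):
    """Return 'cuda' when an NVIDIA driver tool is found on PATH, otherwise 'cpu'."""
    # Heuristic only: nvidia-smi being present suggests, but does not guarantee, a usable GPU.
    if prefer_gpu and shutil.which("nvidia-smi"):
        return "cuda"
    return "cpu"


# Parameter dict in the shape XGBoost expects; values here are illustrative.
params = {"tree_method": "hist", "device": pick_device()}
print(params)
```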
Dependencies
The project is managed via pyproject.toml and uv. Key runtime deps include:
- xgboost, pandas, numpy, scikit-learn, ta, numba
- dash/Plotly for charts (Plotly is used by plot_results.py)
Install using:
uv sync
Troubleshooting
- KeyError: 'log_return' during inference: ensure your input DataFrame includes log_return as described above.
- Model file not found: confirm the path passed to OHLCVPredictor(...) matches where training saved it (default ../data/xgboost_model_all_features.json).
- Feature mismatch (e.g., XGBoost "Number of columns does not match"): ensure you use the model together with its companion feature list JSON. The predictor will automatically use it if present. If missing, retrain with the current code so the feature list is generated.
- Empty/old charts: delete the charts/ folder and rerun training.
- Memory issues: consider downcasting or using smaller windows; the code already downcasts numerics where possible.
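When chasing a feature mismatch, it helps to see exactly which names differ between the saved feature list and the columns being scored. A small diagnostic sketch; diff_features is a hypothetical helper and the feature names are made up for illustration.

```python
def diff_features(expected, actual):
    """Compare the training feature list with the columns about to be scored."""
    expected, actual = list(expected), list(actual)
    expected_set, actual_set = set(expected), set(actual)
    return {
        "missing": [name for name in expected if name not in actual_set],
        "unexpected": [name for name in actual if name not in expected_set],
        "same_order": expected == actual,
    }


# Hypothetical names: one feature is missing, one is extra, and the order differs.
report = diff_features(["rsi", "ema_20", "volume_z"], ["ema_20", "rsi", "macd"])
print(report)
```

In practice, the expected list would be loaded from the companion feature list JSON; its exact layout is not documented here, so inspect the file before relying on a particular structure.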