OHLCVPredictor

End-to-end pipeline for engineering OHLCV features, training an XGBoost regressor (GPU by default), and running inference via a small, reusable predictor API.

Quickstart (uv)

Prereqs:

  • Python 3.12+
  • uv installed (see https://docs.astral.sh/uv/)

Install dependencies:

uv sync

Run training (expects an input CSV; see Data Requirements):

uv run python main.py

Run the inference demo:

uv run python inference_example.py

Data requirements

Your input DataFrame/CSV must include these columns:

  • Timestamp, Open, High, Low, Close, Volume, log_return

Notes:

  • Timestamp can be either a pandas datetime-like column or Unix seconds (int). During inference, the predictor will try to parse strings as datetimes; non-object dtypes are treated as Unix seconds.
  • log_return should be computed as (with numpy imported as np):
    df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))
    
    The training script (main.py) computes it automatically. For standalone inference, ensure the column exists before calling the predictor.
  • The training script filters out rows with Volume == 0 and focuses on data newer than 2017-06-01 by default.
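
The preparation steps above can be sketched as follows. The column names come from the requirements list; the exact filtering (and its ordering) lives in main.py, so treat this as an illustrative sketch, not the script's implementation:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real CSV; real data has the same columns.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["2017-05-31", "2017-06-01", "2017-06-02", "2017-06-03"]
    ),
    "Open": [1.0, 2.0, 3.0, 4.0],
    "High": [1.5, 2.5, 3.5, 4.5],
    "Low": [0.5, 1.5, 2.5, 3.5],
    "Close": [1.0, 2.0, 4.0, 4.0],
    "Volume": [10.0, 0.0, 5.0, 7.0],
})

# Drop zero-volume rows and keep data newer than the cutoff, as main.py does.
df = df[df["Volume"] > 0]
df = df[df["Timestamp"] > "2017-06-01"]

# log_return as defined above: ln(Close_t / Close_{t-1}).
df["log_return"] = np.log(df["Close"] / df["Close"].shift(1))
```

The first surviving row's log_return is NaN by construction (no prior close), which is why the training script computes this column before model fitting rather than leaving it to callers.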

Training workflow

The training entrypoint is main.py:

  • Reads the CSV at ../data/btcusd_1-min_data.csv by default. Adjust csv_path in main.py to point to your data, or move your CSV to that path.
  • Engineers a large set of technical and OHLCV-derived features (see feature_engineering.py and technical_indicator_functions.py).
  • Optionally performs walk-forward cross-validation to compute averaged feature importances.
  • Prunes low-importance and redundant features, trains XGBoost (GPU by default), and saves artifacts.
  • Produces charts with Plotly into charts/.
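
Walk-forward cross-validation trains on an expanding window of past rows and validates on the slice that immediately follows it. A minimal index-splitting sketch (fold sizes here are illustrative, not the ones main.py uses):

```python
def walk_forward_splits(n_rows, n_folds, min_train):
    """Yield (train_end, val_end) pairs: train on rows [0, train_end),
    validate on rows [train_end, val_end)."""
    fold = (n_rows - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold
        val_end = min(train_end + fold, n_rows)
        yield train_end, val_end

# 100 rows, 3 folds, at least 40 training rows:
splits = list(walk_forward_splits(100, 3, 40))
# -> [(40, 60), (60, 80), (80, 100)]
```

Because each validation slice lies strictly after its training window, this avoids the look-ahead leakage that ordinary shuffled k-fold would introduce on time-series data.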

Outputs produced by training:

  • Model: ../data/xgboost_model_all_features.json
  • Results CSV: ../data/cumulative_feature_results.csv
  • Charts: files under charts/ (e.g., all_features_prediction_error_distribution.html)

Run:

uv run python main.py

If you do not have a CUDA-capable GPU, set the device to CPU (see GPU/CPU section).

Inference usage

You can reuse the predictor in other projects or run the included example.

Minimal example:

import numpy as np
import pandas as pd

from predictor import OHLCVPredictor

predictor = OHLCVPredictor('../data/xgboost_model_all_features.json')

# df must contain: Timestamp, Open, High, Low, Close, Volume, log_return
df = pd.read_csv('../data/btcusd_1-min_data.csv')
df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))

log_returns = predictor.predict(df)
prices_pred, prices_actual = predictor.predict_prices(df)

Run the comprehensive demo:

uv run python inference_example.py

Files needed to embed the predictor in another project:

  • predictor.py
  • custom_xgboost.py
  • feature_engineering.py
  • technical_indicator_functions.py
  • your trained model file (e.g., xgboost_model_all_features.json)

GPU/CPU notes

Training uses XGBoost with device='cuda' by default (see custom_xgboost.py). If you do not have a CUDA-capable GPU or drivers:

  • Change the parameter in CustomXGBoostGPU.train() from device='cuda' to device='cpu', or
  • Pass device='cpu' when calling train() wherever applicable.

Inference works on CPU even if the model was trained on GPU.
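
For reference, the switch amounts to changing XGBoost's device parameter. The dict below is illustrative (parameter names follow XGBoost >= 2.0); the project's actual call site is CustomXGBoostGPU.train() in custom_xgboost.py:

```python
# Illustrative training-parameter dict; CustomXGBoostGPU.train() builds its own.
params = {
    "objective": "reg:squarederror",
    "tree_method": "hist",
    "device": "cpu",  # use "cuda" on a CUDA-capable machine
}
```

With tree_method="hist", the same histogram algorithm runs on either device, so switching to CPU changes speed but not the training method.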

Dependencies

The project is managed via pyproject.toml and uv. Key runtime dependencies include:

  • xgboost, pandas, numpy, scikit-learn, ta, numba
  • dash/Plotly for charts (Plotly is used by plot_results.py)

Install using:

uv sync

Troubleshooting

  • KeyError: 'log_return' during inference: ensure your input DataFrame includes log_return as described above.
  • Model file not found: confirm the path passed to OHLCVPredictor(...) matches where training saved it (default ../data/xgboost_model_all_features.json).
  • Empty/old charts: delete the charts/ folder and rerun training.
  • Memory issues: consider downcasting or using smaller windows; the code already downcasts numerics where possible.
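
Downcasting can be sketched with pandas alone (toy frame for illustration; the project's own downcasting lives in its feature code):

```python
import pandas as pd

df = pd.DataFrame({"Close": [1.25, 2.5], "Volume": [10, 20]})

# Shrink float64 -> float32 and int64 -> the smallest integer type that fits.
for col in df.select_dtypes("float").columns:
    df[col] = pd.to_numeric(df[col], downcast="float")
for col in df.select_dtypes("integer").columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")
```

Halving float width roughly halves the memory footprint of the numeric columns, which matters most for minute-level OHLCV histories spanning years.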