OHLCVPredictor

End-to-end pipeline for engineering OHLCV features, training an XGBoost regressor (GPU by default), and running inference via a small, reusable predictor API.

Quickstart (uv)

Prereqs:

  • Python 3.12+
  • uv installed (see https://docs.astral.sh/uv/)

Install dependencies:

uv sync

Run training (expects an input CSV; see Data Requirements):

uv run python main.py

Run the inference demo:

uv run python inference_example.py

Data requirements

Your input DataFrame/CSV must include these columns:

  • Timestamp, Open, High, Low, Close, Volume, log_return

Notes:

  • Timestamp can be either a pandas datetime-like column or Unix seconds (int). During inference, the predictor will try to parse strings as datetimes; non-object dtypes are treated as Unix seconds.
  • log_return should be computed as (with numpy imported as np):
    df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))
    
    The training script (main.py) computes it automatically. For standalone inference, ensure the column exists before calling the predictor.
  • The training script filters out rows with Volume == 0 and focuses on data newer than 2017-06-01 by default.
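
The preparation steps above can be sketched as follows. The column names come from the requirements list; the exact filtering (and its ordering) lives in main.py, so treat this as an illustrative sketch, not the script's implementation:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real CSV; real data has the same columns.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["2017-05-31", "2017-06-01", "2017-06-02", "2017-06-03"]
    ),
    "Open": [1.0, 2.0, 3.0, 4.0],
    "High": [1.5, 2.5, 3.5, 4.5],
    "Low": [0.5, 1.5, 2.5, 3.5],
    "Close": [1.0, 2.0, 4.0, 4.0],
    "Volume": [10.0, 0.0, 5.0, 7.0],
})

# Drop zero-volume rows and keep data newer than the cutoff, as main.py does.
df = df[df["Volume"] > 0]
df = df[df["Timestamp"] > "2017-06-01"]

# log_return as defined above: ln(Close_t / Close_{t-1}).
df["log_return"] = np.log(df["Close"] / df["Close"].shift(1))
```

The first surviving row's log_return is NaN by construction (no prior close), which is why the training script computes this column before model fitting rather than leaving it to callers.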

Training workflow

The training entrypoint is main.py:

  • Reads the CSV at ../data/btcusd_1-min_data.csv by default. Adjust csv_path in main.py to point to your data, or move your CSV to that path.
  • Engineers a large set of technical and OHLCV-derived features (see feature_engineering.py and technical_indicator_functions.py).
  • Optionally performs walk-forward cross-validation to compute averaged feature importances.
  • Prunes low-importance and redundant features, trains XGBoost (GPU by default), and saves artifacts.
  • Produces charts with Plotly into charts/.
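
Walk-forward cross-validation trains on an expanding window of past rows and validates on the slice that immediately follows it. A minimal index-splitting sketch (fold sizes here are illustrative, not the ones main.py uses):

```python
def walk_forward_splits(n_rows, n_folds, min_train):
    """Yield (train_end, val_end) pairs: train on rows [0, train_end),
    validate on rows [train_end, val_end)."""
    fold = (n_rows - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold
        val_end = min(train_end + fold, n_rows)
        yield train_end, val_end

# 100 rows, 3 folds, at least 40 training rows:
splits = list(walk_forward_splits(100, 3, 40))
# -> [(40, 60), (60, 80), (80, 100)]
```

Because each validation slice lies strictly after its training window, this avoids the look-ahead leakage that ordinary shuffled k-fold would introduce on time-series data.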

Outputs produced by training:

  • Model: ../data/xgboost_model_all_features.json
  • Results CSV: ../data/cumulative_feature_results.csv
  • Charts: files under charts/ (e.g., all_features_prediction_error_distribution.html)

Run:

uv run python main.py

If you do not have a CUDA-capable GPU, set the device to CPU (see GPU/CPU section).

Inference usage

You can reuse the predictor in other projects or run the included example.

Minimal example:

import numpy as np
import pandas as pd

from predictor import OHLCVPredictor

predictor = OHLCVPredictor('../data/xgboost_model_all_features.json')

# df must contain: Timestamp, Open, High, Low, Close, Volume, log_return
df = pd.read_csv('../data/btcusd_1-min_data.csv')
df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))

log_returns = predictor.predict(df)
prices_pred, prices_actual = predictor.predict_prices(df)

Run the comprehensive demo:

uv run python inference_example.py

Files needed to embed the predictor in another project:

  • predictor.py
  • custom_xgboost.py
  • feature_engineering.py
  • technical_indicator_functions.py
  • your trained model file (e.g., xgboost_model_all_features.json)

GPU/CPU notes

Training uses XGBoost with device='cuda' by default (see custom_xgboost.py). If you do not have a CUDA-capable GPU or drivers:

  • Change the parameter in CustomXGBoostGPU.train() from device='cuda' to device='cpu', or
  • Pass device='cpu' when calling train() wherever applicable.

Inference works on CPU even if the model was trained on GPU.
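
For reference, the switch amounts to changing XGBoost's device parameter. The dict below is illustrative (parameter names follow XGBoost >= 2.0); the project's actual call site is CustomXGBoostGPU.train() in custom_xgboost.py:

```python
# Illustrative training-parameter dict; CustomXGBoostGPU.train() builds its own.
params = {
    "objective": "reg:squarederror",
    "tree_method": "hist",
    "device": "cpu",  # use "cuda" on a CUDA-capable machine
}
```

With tree_method="hist", the same histogram algorithm runs on either device, so switching to CPU changes speed but not the training method.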

Dependencies

The project is managed via pyproject.toml and uv. Key runtime dependencies include:

  • xgboost, pandas, numpy, scikit-learn, ta, numba
  • dash/Plotly for charts (Plotly is used by plot_results.py)

Install using:

uv sync

Troubleshooting

  • KeyError: 'log_return' during inference: ensure your input DataFrame includes log_return as described above.
  • Model file not found: confirm the path passed to OHLCVPredictor(...) matches where training saved it (default ../data/xgboost_model_all_features.json).
  • Empty/old charts: delete the charts/ folder and rerun training.
  • Memory issues: consider downcasting or using smaller windows; the code already downcasts numerics where possible.
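
Downcasting can be sketched with pandas alone (toy frame for illustration; the project's own downcasting lives in its feature code):

```python
import pandas as pd

df = pd.DataFrame({"Close": [1.25, 2.5], "Volume": [10, 20]})

# Shrink float64 -> float32 and int64 -> the smallest integer type that fits.
for col in df.select_dtypes("float").columns:
    df[col] = pd.to_numeric(df[col], downcast="float")
for col in df.select_dtypes("integer").columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")
```

Halving float width roughly halves the memory footprint of the numeric columns, which matters most for minute-level OHLCV histories spanning years.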