# OHLCVPredictor

End-to-end pipeline for engineering OHLCV features, training an XGBoost regressor (GPU by default), and running inference via a small, reusable predictor API.

## Quickstart (uv)

Prereqs:

- Python 3.12+
- `uv` installed (see `https://docs.astral.sh/uv/`)

Install dependencies:

```powershell
uv sync
```

Run training (expects an input CSV; see Data requirements):

```powershell
uv run python main.py
```

Run the inference demo:

```powershell
uv run python inference_example.py
```

## Data requirements

Your input DataFrame/CSV must include these columns:

- `Timestamp`, `Open`, `High`, `Low`, `Close`, `Volume`, `log_return`

Notes:

- `Timestamp` can be either a pandas datetime-like column or Unix seconds (int). During inference, the predictor will try to parse strings as datetimes; non-object dtypes are treated as Unix seconds.
- `log_return` should be computed as:

  ```python
  df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))
  ```

  The training script (`main.py`) computes it automatically. For standalone inference, ensure it exists before calling the predictor.
- The training script filters out rows with `Volume == 0` and keeps only data newer than `2017-06-01` by default.

## Training workflow

The training entrypoint is `main.py`:

- Reads the CSV at `../data/btcusd_1-min_data.csv` by default. Adjust `csv_path` in `main.py` to point to your data, or move your CSV to that path.
- Engineers a large set of technical and OHLCV-derived features (see `feature_engineering.py` and `technical_indicator_functions.py`).
- Optionally performs walk-forward cross-validation to compute averaged feature importances.
- Prunes low-importance and redundant features, trains XGBoost (GPU by default), and saves artifacts.
- Produces Plotly charts in `charts/`.
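The full feature set is defined in `feature_engineering.py` and `technical_indicator_functions.py`. As a minimal sketch of the kind of OHLCV-derived features involved (the column names below are illustrative, not the project's actual feature list):

```python
import numpy as np
import pandas as pd

def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive a few illustrative OHLCV features (hypothetical names)."""
    out = df.copy()
    # Log return of the close price (same formula the README documents)
    out['log_return'] = np.log(out['Close'] / out['Close'].shift(1))
    # Intrabar range, normalized by the close
    out['hl_range'] = (out['High'] - out['Low']) / out['Close']
    # 20-bar simple moving average of the close
    out['close_ma_20'] = out['Close'].rolling(20).mean()
    # Volume z-score over a 20-bar window
    vol_mean = out['Volume'].rolling(20).mean()
    vol_std = out['Volume'].rolling(20).std()
    out['vol_z_20'] = (out['Volume'] - vol_mean) / vol_std
    return out
```

Rolling-window features like these produce NaNs for the first `window - 1` rows; the real pipeline handles such rows before training.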
Outputs produced by training:

- Model: `../data/xgboost_model_all_features.json`
- Results CSV: `../data/cumulative_feature_results.csv`
- Charts: files under `charts/` (e.g., `all_features_prediction_error_distribution.html`)

Run:

```powershell
uv run python main.py
```

If you do not have a CUDA-capable GPU, set the device to CPU (see GPU/CPU notes).

## Inference usage

You can reuse the predictor in other projects or run the included example.

Minimal example:

```python
from predictor import OHLCVPredictor

predictor = OHLCVPredictor('../data/xgboost_model_all_features.json')

# df must contain: Timestamp, Open, High, Low, Close, Volume, log_return
log_returns = predictor.predict(df)
prices_pred, prices_actual = predictor.predict_prices(df)
```

Run the comprehensive demo:

```powershell
uv run python inference_example.py
```

Files needed to embed the predictor in another project:

- `predictor.py`
- `custom_xgboost.py`
- `feature_engineering.py`
- `technical_indicator_functions.py`
- your trained model file (e.g., `xgboost_model_all_features.json`)

## GPU/CPU notes

Training uses XGBoost with `device='cuda'` by default (see `custom_xgboost.py`). If you do not have a CUDA-capable GPU or drivers:

- Change the parameter in `CustomXGBoostGPU.train()` from `device='cuda'` to `device='cpu'`, or
- Pass `device='cpu'` when calling `train()` wherever applicable.

Inference works on CPU even if the model was trained on GPU.

## Dependencies

The project is managed via `pyproject.toml` and `uv`. Key runtime deps include:

- `xgboost`, `pandas`, `numpy`, `scikit-learn`, `ta`, `numba`
- `dash`/Plotly for charts (Plotly is used by `plot_results.py`)

Install using:

```powershell
uv sync
```

## Troubleshooting

- KeyError: `'log_return'` during inference: ensure your input DataFrame includes `log_return` as described above.
- Model file not found: confirm the path passed to `OHLCVPredictor(...)` matches where training saved it (default `../data/xgboost_model_all_features.json`).
- Empty/old charts: delete the `charts/` folder and rerun training.
- Memory issues: consider downcasting or using smaller windows; the code already downcasts numerics where possible.
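A small pre-flight helper can turn the first two failures above into clear, early errors. This is a sketch, not part of the project's API; the function name and behavior are assumptions:

```python
import os

import numpy as np
import pandas as pd

REQUIRED_COLUMNS = ['Timestamp', 'Open', 'High', 'Low', 'Close', 'Volume', 'log_return']

def preflight(df: pd.DataFrame, model_path: str) -> pd.DataFrame:
    """Validate inputs before calling the predictor; fail with clear messages."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if 'log_return' in missing and 'Close' in df.columns:
        # log_return can be derived on the spot from Close
        df = df.assign(log_return=np.log(df['Close'] / df['Close'].shift(1)))
        missing.remove('log_return')
    if missing:
        raise ValueError(f'Input is missing required columns: {missing}')
    if not os.path.exists(model_path):
        raise FileNotFoundError(f'Model file not found: {model_path}')
    return df
```

Calling something like `preflight(df, '../data/xgboost_model_all_features.json')` before `predictor.predict(df)` surfaces a missing column or a wrong model path immediately instead of deep inside inference.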