# OHLCVPredictor

End-to-end pipeline for engineering OHLCV features, training an XGBoost regressor (GPU by default), and running inference via a small, reusable predictor API.

## Quickstart (uv)

Prereqs:
- Python 3.12+
- `uv` installed (see `https://docs.astral.sh/uv/`)

Install dependencies:

```powershell
uv sync
```

Run training (expects an input CSV; see Data requirements):

```powershell
uv run python main.py
```

Run the inference demo:

```powershell
uv run python inference_example.py
```

## Data requirements

Your input DataFrame/CSV must include these columns:
- `Timestamp`, `Open`, `High`, `Low`, `Close`, `Volume`, `log_return`

Notes:
- `Timestamp` can be either a pandas datetime-like column or Unix seconds (int). During inference, the predictor will try to parse strings as datetimes; non-object dtypes are treated as Unix seconds.
- `log_return` should be computed as:
  ```python
  import numpy as np
  df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))
  ```
  The training script (`main.py`) computes it automatically. For standalone inference, ensure it exists before calling the predictor.
- The training script filters out rows with `Volume == 0` and focuses on data newer than `2017-06-01` by default.
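
The requirements above can be checked before calling the predictor. The following is a minimal sketch, not code from this repository: `prepare_ohlcv` and `REQUIRED_COLUMNS` are hypothetical names, and the order of operations (drop zero-volume rows first, then compute returns) is an assumption rather than a description of what `main.py` does.

```python
import numpy as np
import pandas as pd

# Hypothetical helper, not part of this repository.
REQUIRED_COLUMNS = ["Timestamp", "Open", "High", "Low", "Close", "Volume"]


def prepare_ohlcv(df: pd.DataFrame) -> pd.DataFrame:
    """Validate the raw OHLCV columns and add `log_return`."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")

    # Assumption: zero-volume rows are dropped before returns are computed.
    out = df[df["Volume"] != 0].reset_index(drop=True)
    # Log return of each close relative to the previous close; first row is NaN.
    out["log_return"] = np.log(out["Close"] / out["Close"].shift(1))
    return out


raw = pd.DataFrame(
    {
        "Timestamp": [1_600_000_000, 1_600_000_060, 1_600_000_120],
        "Open": [100.0, 101.0, 102.0],
        "High": [101.0, 102.0, 103.0],
        "Low": [99.0, 100.0, 101.0],
        "Close": [100.0, 110.0, 121.0],
        "Volume": [5.0, 0.0, 7.0],
    }
)
prepared = prepare_ohlcv(raw)
```

Failing early with a list of missing columns is usually easier to debug than a `KeyError` raised from inside feature engineering.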

## Training workflow

The training entrypoint is `main.py`:
- Reads the CSV at `../data/btcusd_1-min_data.csv` by default. Adjust `csv_path` in `main.py` to point to your data, or move your CSV to that path.
- Engineers a large set of technical and OHLCV-derived features (see `feature_engineering.py` and `technical_indicator_functions.py`).
- Optionally performs walk-forward cross-validation to compute averaged feature importances.
- Prunes low-importance and redundant features, trains XGBoost (GPU by default), and saves artifacts.
- Produces charts with Plotly into `charts/`.
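
For readers unfamiliar with walk-forward cross-validation, the idea can be sketched as below. This is an illustration under assumptions, not the code in `main.py`: `walk_forward_splits` is a hypothetical name, and the expanding training window is one common variant (the project may use a sliding window or different fold sizes).

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_samples: int, n_folds: int, min_train: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges with an expanding training window.

    Each test block starts where the training window ends, so the model is
    never evaluated on rows that precede its training data.
    """
    if n_folds < 1 or min_train < 1 or min_train >= n_samples:
        raise ValueError("need n_folds >= 1 and 1 <= min_train < n_samples")
    test_size = (n_samples - min_train) // n_folds
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for fold in range(n_folds):
        train_end = min_train + fold * test_size
        yield range(0, train_end), range(train_end, train_end + test_size)


# 10 rows, 3 folds, at least 4 training rows in the first fold.
splits = [(list(tr), list(te)) for tr, te in walk_forward_splits(10, 3, 4)]
```

Averaged feature importances would then come from fitting one model per fold and averaging each feature's importance across folds.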

Outputs produced by training:
- Model: `../data/xgboost_model_all_features.json`
- Results CSV: `../data/cumulative_feature_results.csv`
- Charts: files under `charts/` (e.g., `all_features_prediction_error_distribution.html`)

Run:
```powershell
uv run python main.py
```

If you do not have a CUDA-capable GPU, set the device to CPU (see GPU/CPU notes).

## Inference usage

You can reuse the predictor in other projects or run the included example.

Minimal example:

```python
import numpy as np
import pandas as pd

from predictor import OHLCVPredictor

predictor = OHLCVPredictor('../data/xgboost_model_all_features.json')

# Example input: the default training CSV; substitute your own data.
df = pd.read_csv('../data/btcusd_1-min_data.csv')
# df must contain: Timestamp, Open, High, Low, Close, Volume, log_return
df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))

log_returns = predictor.predict(df)
prices_pred, prices_actual = predictor.predict_prices(df)
```
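
`predict()` returns log returns, while `predict_prices()` returns price series. How the predictor converts between the two is not documented here; the sketch below only shows the standard relationship that follows from the `log_return` definition, using a hypothetical helper `log_returns_to_prices`.

```python
import numpy as np


def log_returns_to_prices(start_price: float, log_returns: np.ndarray) -> np.ndarray:
    """Compound log returns from a known starting price.

    Uses P_t = P_0 * exp(r_1 + ... + r_t), which follows from
    r_t = log(P_t / P_{t-1}).
    """
    return start_price * np.exp(np.cumsum(log_returns))


closes = np.array([100.0, 110.0, 121.0])
returns = np.log(closes[1:] / closes[:-1])
# Rebuilding from the first close should recover the later closes.
rebuilt = log_returns_to_prices(closes[0], returns)
```

Because errors in predicted returns compound through the cumulative sum, price-level errors can grow with the horizon even when per-step return errors are small.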

Run the comprehensive demo:

```powershell
uv run python inference_example.py
```

Files needed to embed the predictor in another project:
- `predictor.py`
- `custom_xgboost.py`
- `feature_engineering.py`
- `technical_indicator_functions.py`
- your trained model file (e.g., `xgboost_model_all_features.json`)

## GPU/CPU notes

Training uses XGBoost with `device='cuda'` by default (see `custom_xgboost.py`). If you do not have a CUDA-capable GPU or drivers:
- Change the parameter in `CustomXGBoostGPU.train()` from `device='cuda'` to `device='cpu'`, or
- Pass `device='cpu'` when calling `train()` wherever applicable.

Inference works on CPU even if the model was trained on GPU.
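
If the same code has to run on machines with and without a GPU, the device can be chosen at runtime. The helper below is a hypothetical sketch, not part of `custom_xgboost.py`: `pick_device` is an invented name, the `nvidia-smi` check is only a heuristic, and the `params` dictionary shows just the `device` and `tree_method` keys as XGBoost 2.x accepts them, not this project's full parameter set.

```python
import shutil
from typing import Optional


def pick_device(force: Optional[str] = None) -> str:
    """Return 'cuda' or 'cpu' for XGBoost's `device` parameter.

    Heuristic only: `nvidia-smi` on PATH suggests an NVIDIA driver is
    installed, but does not prove XGBoost was built with CUDA support.
    """
    if force is not None:
        if force not in ("cuda", "cpu"):
            raise ValueError(f"unsupported device: {force!r}")
        return force
    return "cuda" if shutil.which("nvidia-smi") else "cpu"


# Only the device-related keys are shown; other training parameters are omitted.
params = {"device": pick_device(), "tree_method": "hist"}
```

Passing `pick_device("cpu")` forces CPU training regardless of what is installed, which is useful for reproducing results on a machine without a GPU.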

## Dependencies

The project is managed via `pyproject.toml` and `uv`. Key runtime deps include:
- `xgboost`, `pandas`, `numpy`, `scikit-learn`, `ta`, `numba`
- `dash`/Plotly for charts (Plotly is used by `plot_results.py`)

Install using:
```powershell
uv sync
```

## Troubleshooting

- `KeyError: 'log_return'` during inference: ensure your input DataFrame includes `log_return` (see Data requirements).
- Model file not found: confirm the path passed to `OHLCVPredictor(...)` matches where training saved it (default `../data/xgboost_model_all_features.json`).
- Empty/old charts: delete the `charts/` folder and rerun training.
- Memory issues: consider downcasting or using smaller windows; the code already downcasts numerics where possible.
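
For the memory item above, downcasting your own input frame before feature engineering could look like the following. This is a sketch using pandas' `pd.to_numeric(..., downcast=...)`; `downcast_numeric` is a hypothetical helper and is not how the project performs its own downcasting.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with float and integer columns downcast where possible."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.floating]).columns:
        # float64 -> float32 when the values survive the conversion
        out[col] = pd.to_numeric(out[col], downcast="float")
    for col in out.select_dtypes(include=[np.integer]).columns:
        # int64 -> smallest signed integer type that holds every value
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


big = pd.DataFrame({"Close": [100.5, 101.25, 99.75], "Volume": [3, 0, 12]})
small = downcast_numeric(big)
```

Note that `float32` carries roughly seven significant digits, so check that this precision is acceptable for your price and volume ranges before downcasting.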