# OHLCVPredictor
End-to-end pipeline for engineering OHLCV features, training an XGBoost regressor (GPU by default), and running inference via a small, reusable predictor API.
## Quickstart (uv)
Prereqs:
- Python 3.12+
- `uv` installed (see `https://docs.astral.sh/uv/`)
Install dependencies:
```powershell
uv sync
```
Run training (expects an input CSV; see Data Requirements):
```powershell
uv run python main.py
```
Run the inference demo:
```powershell
uv run python inference_example.py
```
## Data requirements
Your input DataFrame/CSV must include these columns:
- `Timestamp`, `Open`, `High`, `Low`, `Close`, `Volume`, `log_return`
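A quick sanity check before handing a frame to the pipeline can look like this (a minimal sketch; the helper name is ours and not part of the repo):

```python
import pandas as pd

REQUIRED_COLUMNS = ['Timestamp', 'Open', 'High', 'Low', 'Close', 'Volume', 'log_return']

def missing_columns(df: pd.DataFrame) -> list:
    # Names of required columns absent from the frame (empty list == OK).
    return [c for c in REQUIRED_COLUMNS if c not in df.columns]
```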
Notes:
- `Timestamp` can be either a pandas datetime-like column or Unix seconds (int). During inference, the predictor will try to parse strings as datetimes; non-object dtypes are treated as Unix seconds.
- `log_return` should be computed as:
```python
import numpy as np

df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))
```
The training script (`main.py`) computes it automatically. For standalone inference, ensure it exists before calling the predictor.
- The training script filters out rows with `Volume == 0` and focuses on data newer than `2017-06-01` by default.
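The `Timestamp` handling described above can be sketched like this (a minimal illustration of the documented behaviour; the helper name is ours, not part of the repo):

```python
import pandas as pd

def to_datetime_series(ts: pd.Series) -> pd.Series:
    # Object/string columns are parsed as datetimes; numeric columns
    # are treated as Unix seconds, mirroring the predictor's behaviour.
    if ts.dtype == object:
        return pd.to_datetime(ts)
    return pd.to_datetime(ts, unit='s')

# Both representations normalise to the same timestamp:
a = to_datetime_series(pd.Series(['2021-01-01 00:00:00']))
b = to_datetime_series(pd.Series([1609459200]))  # Unix seconds
```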
## Training workflow
The training entrypoint is `main.py`:
- Reads the CSV at `../data/btcusd_1-min_data.csv` by default. Adjust `csv_path` in `main.py` to point to your data, or move your CSV to that path.
- Engineers a large set of technical and OHLCV-derived features (see `feature_engineering.py` and `technical_indicator_functions.py`).
- Optionally performs walk-forward cross-validation to compute averaged feature importances.
- Prunes low-importance and redundant features, trains XGBoost (GPU by default), and saves artifacts.
- Produces charts with Plotly into `charts/`.
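The walk-forward step can be illustrated with a minimal expanding-window splitter (a sketch of the general technique; the actual fold logic lives in `main.py` and may differ):

```python
import numpy as np

def walk_forward_splits(n_rows: int, n_folds: int, min_train: int):
    # Expanding-window splits: each fold trains on everything before its
    # test block, so no future data leaks into training.
    fold_size = (n_rows - min_train) // n_folds
    for i in range(n_folds):
        train_end = min_train + i * fold_size
        yield np.arange(train_end), np.arange(train_end, train_end + fold_size)
```

Feature importances computed per fold can then be averaged across the folds before pruning.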
Outputs produced by training:
- Model: `../data/xgboost_model_all_features.json`
- Results CSV: `../data/cumulative_feature_results.csv`
- Charts: files under `charts/` (e.g., `all_features_prediction_error_distribution.html`)
Run:
```powershell
uv run python main.py
```
If you do not have a CUDA-capable GPU, set the device to CPU (see GPU/CPU section).
## Inference usage
You can reuse the predictor in other projects or run the included example.
Minimal example:
```python
from predictor import OHLCVPredictor

predictor = OHLCVPredictor('../data/xgboost_model_all_features.json')

# df must contain: Timestamp, Open, High, Low, Close, Volume, log_return
log_returns = predictor.predict(df)
prices_pred, prices_actual = predictor.predict_prices(df)
```
Run the comprehensive demo:
```powershell
uv run python inference_example.py
```
Files needed to embed the predictor in another project:
- `predictor.py`
- `custom_xgboost.py`
- `feature_engineering.py`
- `technical_indicator_functions.py`
- your trained model file (e.g., `xgboost_model_all_features.json`)
## GPU/CPU notes
Training uses XGBoost with `device='cuda'` by default (see `custom_xgboost.py`). If you do not have a CUDA-capable GPU or drivers:
- Change the parameter in `CustomXGBoostGPU.train()` from `device='cuda'` to `device='cpu'`, or
- Pass `device='cpu'` when calling `train()` wherever applicable.
Inference works on CPU even if the model was trained on GPU.
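For reference, the switch is a single key in the XGBoost parameter dict (the key names follow XGBoost's native API; the surrounding values are illustrative, not copied from `custom_xgboost.py`):

```python
# Illustrative XGBoost parameters; only 'device' needs to change
# to move between CPU and GPU training.
params = {
    'device': 'cpu',                  # 'cuda' for GPU training
    'tree_method': 'hist',            # works on both devices
    'objective': 'reg:squarederror',
}
```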
## Dependencies
The project is managed via `pyproject.toml` and `uv`. Key runtime deps include:
- `xgboost`, `pandas`, `numpy`, `scikit-learn`, `ta`, `numba`
- `dash`/Plotly for charts (Plotly is used by `plot_results.py`)
Install using:
```powershell
uv sync
```
## Troubleshooting
- KeyError: `'log_return'` during inference: ensure your input DataFrame includes `log_return` as described above.
- Model file not found: confirm the path passed to `OHLCVPredictor(...)` matches where training saved it (default `../data/xgboost_model_all_features.json`).
- Empty/old charts: delete the `charts/` folder and rerun training.
- Memory issues: consider downcasting or using smaller windows; the code already downcasts numerics where possible.
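The downcasting idea can be sketched as follows (the helper name is ours; the project's own downcasting may differ):

```python
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    # Shrink numeric dtypes (e.g. float64 -> float32) to cut memory use.
    df = df.copy()
    for col in df.select_dtypes(include='float').columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    for col in df.select_dtypes(include='integer').columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    return df
```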