# OHLCVPredictor
End-to-end pipeline for engineering OHLCV features, training an XGBoost regressor (GPU by default), and running inference via a small, reusable predictor API.
## Quickstart (uv)
Prereqs:
- Python 3.12+
- `uv` installed (see `https://docs.astral.sh/uv/`)
Install dependencies:
```powershell
uv sync
```
Run training (expects an input CSV; see Data requirements):
```powershell
uv run python main.py
```
Run the inference demo:
```powershell
uv run python inference_example.py
```
## Data requirements
Your input DataFrame/CSV must include these columns:
- `Timestamp`, `Open`, `High`, `Low`, `Close`, `Volume`, `log_return`
Notes:
- `Timestamp` can be either a pandas datetime-like column or Unix seconds (int). During inference, the predictor will try to parse string (object-dtype) values as datetimes; numeric dtypes are treated as Unix seconds.
- `log_return` should be computed as:
```python
import numpy as np

df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))
```
The training script (`main.py`) computes it automatically. For standalone inference, ensure it exists before calling the predictor.
- The training script filters out rows with `Volume == 0` and focuses on data newer than `2017-06-01` by default.
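For reference, here is a minimal sketch of preparing a conforming DataFrame from a raw CSV (the file name is an illustrative assumption; substitute your own data):
```python
import numpy as np
import pandas as pd

# Illustrative input file; substitute your own OHLCV CSV.
df = pd.read_csv('ohlcv.csv')  # columns: Timestamp, Open, High, Low, Close, Volume

# Timestamp may stay as Unix seconds (int) or be parsed to datetimes.
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')

# Derive log_return from Close, as the training script does.
df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))

# Drop zero-volume rows, mirroring the training filter.
df = df[df['Volume'] != 0].reset_index(drop=True)
```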
## Training workflow
The training entrypoint is `main.py`:
- Reads the CSV at `../data/btcusd_1-min_data.csv` by default. Adjust `csv_path` in `main.py` to point to your data, or move your CSV to that path.
- Engineers a large set of technical and OHLCV-derived features (see `feature_engineering.py` and `technical_indicator_functions.py`).
- Optionally performs walk-forward cross-validation to compute averaged feature importances (a generic sketch of the walk-forward idea follows this list).
- Prunes low-importance and redundant features, trains XGBoost (GPU by default), and saves artifacts.
- Produces charts with Plotly into `charts/`.
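To illustrate the walk-forward idea, here is a generic expanding-window sketch (the project's actual splitting logic lives in `main.py` and may differ):
```python
import numpy as np

def walk_forward_splits(n_rows: int, n_folds: int = 5, test_frac: float = 0.1):
    """Yield (train_idx, test_idx) pairs: each fold trains on all rows
    before a cutoff and tests on the window immediately after it."""
    test_size = int(n_rows * test_frac)
    cutoffs = np.linspace(n_rows // 2, n_rows - test_size, n_folds, dtype=int)
    for cut in cutoffs:
        yield np.arange(0, cut), np.arange(cut, cut + test_size)

# Feature importances would be averaged across these folds.
for train_idx, test_idx in walk_forward_splits(100_000):
    pass  # fit on train_idx, score and collect importances on test_idx
```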
Outputs produced by training:
- Model: `../data/xgboost_model_all_features.json`
- Feature list: `../data/xgboost_model_all_features_features.json` (exact feature names and order used for training)
- Results CSV: `../data/cumulative_feature_results.csv`
- Charts: files under `charts/` (e.g., `all_features_prediction_error_distribution.html`)
Run:
```powershell
uv run python main.py
```
If you do not have a CUDA-capable GPU, set the device to CPU (see GPU/CPU section).
## Inference usage
You can reuse the predictor in other projects or run the included example.
Minimal example:
```python
from predictor import OHLCVPredictor

# Path to the model saved by training (see Outputs above).
predictor = OHLCVPredictor('../data/xgboost_model_all_features.json')

# df must contain: Timestamp, Open, High, Low, Close, Volume, log_return
log_returns = predictor.predict(df)                        # predicted log returns
prices_pred, prices_actual = predictor.predict_prices(df)  # predicted vs. actual prices
```
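Assuming `predict_prices` returns two aligned arrays, a quick sanity check could be:
```python
import numpy as np

# Mean absolute error between predicted and actual prices.
mae = np.mean(np.abs(np.asarray(prices_pred) - np.asarray(prices_actual)))
print(f'MAE: {mae:.2f}')
```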
Run the comprehensive demo:
```powershell
uv run python inference_example.py
```
Files needed to embed the predictor in another project:
- `predictor.py`
- `custom_xgboost.py`
- `feature_engineering.py`
- `technical_indicator_functions.py`
- your trained model file (e.g., `xgboost_model_all_features.json`)
- the companion feature list JSON saved next to the model (same basename with `_features.json`)
## GPU/CPU notes
Training uses XGBoost with `device='cuda'` by default (see `custom_xgboost.py`). If you do not have a CUDA-capable GPU or drivers:
- Change the parameter in `CustomXGBoostGPU.train()` from `device='cuda'` to `device='cpu'`, or
- Pass `device='cpu'` when calling `train()` wherever applicable.
Inference works on CPU even if the model was trained on GPU.
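For orientation, this is how device selection looks in plain XGBoost (the project wraps this in `CustomXGBoostGPU`, whose exact signature may differ; the data below is synthetic):
```python
import numpy as np
import xgboost as xgb

# Tiny synthetic stand-in; real features come from feature_engineering.py.
X = np.random.rand(256, 8)
y = np.random.rand(256)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'reg:squarederror',
    'tree_method': 'hist',
    'device': 'cpu',  # set to 'cuda' for GPU training, as the project does by default
}
booster = xgb.train(params, dtrain, num_boost_round=50)
```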
## Dependencies
The project is managed via `pyproject.toml` and `uv`. Key runtime deps include:
- `xgboost`, `pandas`, `numpy`, `scikit-learn`, `ta`, `numba`
- `dash`/Plotly for charts (Plotly is used by `plot_results.py`)
Install using:
```powershell
uv sync
```
## Troubleshooting
- KeyError: `'log_return'` during inference: ensure your input DataFrame includes `log_return` as described above.
- Model file not found: confirm the path passed to `OHLCVPredictor(...)` matches where training saved it (default `../data/xgboost_model_all_features.json`).
- Feature mismatch (e.g., XGBoost "Number of columns does not match"): ensure you use the model together with its companion feature list JSON; the predictor picks it up automatically if present (a manual check is sketched after this list). If missing, retrain with the current code so the feature list is generated.
- Empty/old charts: delete the `charts/` folder and rerun training.
- Memory issues: consider downcasting or using smaller windows; the code already downcasts numerics where possible.
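To check feature alignment by hand, a minimal sketch (assuming the companion file is a JSON array of feature names, and `features_df` stands for the engineered frame just before prediction):
```python
import json

with open('../data/xgboost_model_all_features_features.json') as f:
    expected = json.load(f)

missing = [c for c in expected if c not in features_df.columns]
extra = [c for c in features_df.columns if c not in expected]
print('missing:', missing)
print('extra:', extra)
```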