diff --git a/INFERENCE_README.md b/INFERENCE_README.md
index 03995e7..ea57518 100644
--- a/INFERENCE_README.md
+++ b/INFERENCE_README.md
@@ -1,38 +1,30 @@
-# OHLCV Predictor - Simple Inference
+# OHLCV Predictor - Inference (Quick Reference)
 
-Refactored for easy reuse in other projects.
+For full instructions, see the main README.
 
-## Usage
+## Minimal usage
 
 ```python
 from predictor import OHLCVPredictor
 
-predictor = OHLCVPredictor('model.json')
+predictor = OHLCVPredictor('../data/xgboost_model_all_features.json')
 predictions = predictor.predict(your_ohlcv_dataframe)
 ```
 
-## Files Needed
-
-Copy these 5 files to your other project:
-
-1. `predictor.py`
-2. `custom_xgboost.py`
-3. `feature_engineering.py`
-4. `technical_indicator_functions.py`
-5. `xgboost_model_all_features.json`
-
-## Data Requirements
-
 Your DataFrame needs these columns:
-- `Open`, `High`, `Low`, `Close`, `Volume`, `Timestamp`
+- `Timestamp`, `Open`, `High`, `Low`, `Close`, `Volume`, `log_return`
 
-## Dependencies
-
+Note: If you are only running inference (not training with `main.py`), compute `log_return` first:
+```python
+import numpy as np
+df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))
 ```
-xgboost >= 3.0.2
-pandas >= 2.2.3
-numpy >= 2.2.3
-scikit-learn >= 1.6.1
-ta >= 0.11.0
-numba >= 0.61.2
-```
\ No newline at end of file
+
+## Files to reuse in other projects
+
+- `predictor.py`
+- `custom_xgboost.py`
+- `feature_engineering.py`
+- `technical_indicator_functions.py`
+- your trained model file (e.g., `xgboost_model_all_features.json`)
diff --git a/README.md b/README.md
index 39b8dc7..fd86d4a 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,139 @@
 # OHLCVPredictor
 
+End-to-end pipeline for engineering OHLCV features, training an XGBoost regressor (GPU by default), and running inference via a small, reusable predictor API.
+
+## Quickstart (uv)
+
+Prerequisites:
+- Python 3.12+
+- `uv` installed (see `https://docs.astral.sh/uv/`)
+
+Install dependencies:
+
+```powershell
+uv sync
+```
+
+Run training (expects an input CSV; see Data requirements):
+
+```powershell
+uv run python main.py
+```
+
+Run the inference demo:
+
+```powershell
+uv run python inference_example.py
+```
+
+## Data requirements
+
+Your input DataFrame/CSV must include these columns:
+- `Timestamp`, `Open`, `High`, `Low`, `Close`, `Volume`, `log_return`
+
+Notes:
+- `Timestamp` can be either a pandas datetime-like column or Unix seconds (int). During inference, the predictor tries to parse strings as datetimes; non-object dtypes are treated as Unix seconds.
+- `log_return` should be computed as:
+  ```python
+  df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))
+  ```
+  The training script (`main.py`) computes it automatically. For standalone inference, ensure it exists before calling the predictor.
+- The training script filters out rows with `Volume == 0` and keeps only data newer than `2017-06-01` by default.
+
+## Training workflow
+
+The training entrypoint is `main.py`:
+- Reads the CSV at `../data/btcusd_1-min_data.csv` by default; adjust `csv_path` in `main.py` to point to your data, or move your CSV to that path (input preparation is sketched just after this list).
+- Engineers a large set of technical and OHLCV-derived features (see `feature_engineering.py` and `technical_indicator_functions.py`).
+- Optionally performs walk-forward cross-validation to compute averaged feature importances.
+- Prunes low-importance and redundant features, trains XGBoost (GPU by default), and saves artifacts.
+- Renders Plotly charts into `charts/`.
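+
+The input preparation described above can be sketched as follows. This is a rough illustration of the documented behavior, not the literal code in `main.py`; in particular, treating `Timestamp` as Unix seconds is an assumption based on the notes under Data requirements.
+
+```python
+import numpy as np
+import pandas as pd
+
+# Default input path used by main.py (adjust csv_path there for other data).
+df = pd.read_csv('../data/btcusd_1-min_data.csv')
+
+# Assumption: Timestamp is stored as Unix seconds; parse it for the date cutoff.
+ts = pd.to_datetime(df['Timestamp'], unit='s')
+
+# Apply the documented filters: drop zero-volume rows, keep data newer than 2017-06-01.
+df = df[(df['Volume'] != 0) & (ts >= '2017-06-01')].copy()
+
+# Compute the log-return column expected by the rest of the pipeline.
+df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))
+```
+
+Assuming the CSV already provides the OHLCV and `Timestamp` columns, `df` then satisfies the column list under Data requirements and can be used for training or passed to `OHLCVPredictor`.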
+
+Outputs produced by training:
+- Model: `../data/xgboost_model_all_features.json`
+- Results CSV: `../data/cumulative_feature_results.csv`
+- Charts: files under `charts/` (e.g., `all_features_prediction_error_distribution.html`)
+
+Run:
+```powershell
+uv run python main.py
+```
+
+If you do not have a CUDA-capable GPU, set the device to CPU (see GPU/CPU notes).
+
+## Inference usage
+
+You can reuse the predictor in other projects or run the included example.
+
+Minimal example:
+
+```python
+from predictor import OHLCVPredictor
+
+predictor = OHLCVPredictor('../data/xgboost_model_all_features.json')
+
+# df must contain: Timestamp, Open, High, Low, Close, Volume, log_return
+log_returns = predictor.predict(df)
+prices_pred, prices_actual = predictor.predict_prices(df)
+```
+
+Run the comprehensive demo:
+
+```powershell
+uv run python inference_example.py
+```
+
+Files needed to embed the predictor in another project:
+- `predictor.py`
+- `custom_xgboost.py`
+- `feature_engineering.py`
+- `technical_indicator_functions.py`
+- your trained model file (e.g., `xgboost_model_all_features.json`)
+
+## GPU/CPU notes
+
+Training uses XGBoost with `device='cuda'` by default (see `custom_xgboost.py`). If you do not have a CUDA-capable GPU or drivers:
+- Change the parameter in `CustomXGBoostGPU.train()` from `device='cuda'` to `device='cpu'`, or
+- Pass `device='cpu'` when calling `train()` where applicable.
+
+Inference works on CPU even if the model was trained on GPU.
+
+## Dependencies
+
+The project is managed via `pyproject.toml` and `uv`. Key runtime dependencies include:
+- `xgboost`, `pandas`, `numpy`, `scikit-learn`, `ta`, `numba`
+- `dash`/Plotly for charts (Plotly is used by `plot_results.py`)
+
+Install using:
+```powershell
+uv sync
+```
+
+## Troubleshooting
+
+- `KeyError: 'log_return'` during inference: ensure your input DataFrame includes `log_return` as described above.
+- Model file not found: confirm the path passed to `OHLCVPredictor(...)` matches where training saved it (default `../data/xgboost_model_all_features.json`).
+- Empty or stale charts: delete the `charts/` folder and rerun training.
+- Memory issues: consider downcasting or using smaller windows; the code already downcasts numerics where possible.
+
diff --git a/main.py b/main.py
index d1c4380..3acdeff 100644
--- a/main.py
+++ b/main.py
@@ -272,7 +272,9 @@ if __name__ == '__main__':
     # )
 
     model.save_model(f'../data/xgboost_model_all_features.json')
-    test_preds = model.predict(X_test)
+    # Rebuild the test matrix from the pruned feature set to match the saved model
+    X_test_kept = df[kept_feature_cols].values.astype(np.float32)[split_idx:]
+    test_preds = model.predict(X_test_kept)
     rmse = np.sqrt(mean_squared_error(y_test, test_preds))
 
     # Reconstruct price series from log returns
@@ -322,7 +324,9 @@ if __name__ == '__main__':
         print(f'Cumulative feature run failed: {e}')
 
     print(f'All cumulative feature runs completed. Results saved to {results_csv}')
-    plot_prefix = f'all_features'
-    plot_prediction_error_distribution(predicted_prices, actual_prices, prefix=plot_prefix)
+    # Plot only if the all-features run above succeeded and produced price series
+    if 'predicted_prices' in locals() and 'actual_prices' in locals():
+        plot_prefix = 'all_features'
+        plot_prediction_error_distribution(predicted_prices, actual_prices, prefix=plot_prefix)
 
     sys.exit(0)
\ No newline at end of file