Merge branch 'xgboost'

# Conflicts:
#	.gitignore
#	README.md
#	cycles/backtest.py
#	main.py
#	pyproject.toml
#	uv.lock
2025-07-11 09:04:49 +08:00
parent b013183f67 65f30a4020
commit 267f040fe8
39 changed files with 6311 additions and 1332 deletions
--- a/docs/utils_storage.md
+++ b/docs/utils_storage.md
@@ -1,73 +1,207 @@
 # Storage Utilities

-This document describes the storage utility functions found in `cycles/utils/storage.py`.
+This document describes the refactored storage utilities in `cycles/utils/`, which provide modular, maintainable management of data and results.

 ## Overview

-The `storage.py` module provides a `Storage` class designed for handling the loading and saving of data and results. It supports operations with CSV and JSON files and integrates with pandas DataFrames for data manipulation. The class also manages the creation of necessary `results` and `data` directories.
+The storage utilities have been refactored into a modular architecture with clear separation of concerns:
+
+- **`Storage`** - Main coordinator class providing a unified interface (backward compatible)
+- **`DataLoader`** - Handles loading data from various file formats
+- **`DataSaver`** - Manages saving data with proper format handling
+- **`ResultFormatter`** - Formats and writes backtest results to CSV files
+- **`storage_utils`** - Shared utilities and custom exceptions
+
+This design improves maintainability and testability, and it follows the single responsibility principle.

 ## Constants

-   `RESULTS_DIR`: Defines the default directory name for storing results (default: "results").
-   `DATA_DIR`: Defines the default directory name for storing input data (default: "data").
+-   `RESULTS_DIR`: Default directory for storing results (default: "../results")
+-   `DATA_DIR`: Default directory for storing input data (default: "../data")

-## Class: `Storage`
+## Main Classes

-Handles storage operations for data and results.
+### `Storage` (Coordinator Class)

-### `__init__(self, logging=None, results_dir=RESULTS_DIR, data_dir=DATA_DIR)`
+The main interface that coordinates all storage operations while maintaining backward compatibility.

-   **Description**: Initializes the `Storage` class. It creates the results and data directories if they don't already exist.
-   **Parameters**:
-    -   `logging` (optional): A logging instance for outputting information. Defaults to `None`.
-    -   `results_dir` (str, optional): Path to the directory for storing results. Defaults to `RESULTS_DIR`.
-    -   `data_dir` (str, optional): Path to the directory for storing data. Defaults to `DATA_DIR`.
+#### `__init__(self, logging=None, results_dir=RESULTS_DIR, data_dir=DATA_DIR)`

-### `load_data(self, file_path, start_date, stop_date)`
+**Description**: Initializes the `Storage` coordinator with component instances.

-   **Description**: Loads data from a specified file (CSV or JSON), performs type optimization, filters by date range, and converts column names to lowercase. The timestamp column is set as the DataFrame index.
-   **Parameters**:
-    -   `file_path` (str): Path to the data file (relative to `data_dir`).
-    -   `start_date` (datetime-like): The start date for filtering data.
-    -   `stop_date` (datetime-like): The end date for filtering data.
-   **Returns**: `pandas.DataFrame` - The loaded and processed data, with a `timestamp` index. Returns an empty DataFrame on error.
+**Parameters**:
+- `logging` (optional): A logging instance for outputting information
+- `results_dir` (str, optional): Path to the directory for storing results
+- `data_dir` (str, optional): Path to the directory for storing data

-### `save_data(self, data: pd.DataFrame, file_path: str)`
+**Creates**: Component instances of `DataLoader`, `DataSaver`, and `ResultFormatter`

-   **Description**: Saves a pandas DataFrame to a CSV file within the `data_dir`. If the DataFrame has a DatetimeIndex, it's converted to a Unix timestamp (seconds since epoch) and stored in a column named 'timestamp', which becomes the first column in the CSV. The DataFrame's active index is not saved if a 'timestamp' column is created.
-   **Parameters**:
-    -   `data` (pd.DataFrame): The DataFrame to save.
-    -   `file_path` (str): Path to the data file (relative to `data_dir`).
+#### `load_data(self, file_path: str, start_date: Union[str, pd.Timestamp], stop_date: Union[str, pd.Timestamp]) -> pd.DataFrame`

-### `format_row(self, row)`
+**Description**: Loads data with optimized dtypes and filtering, supporting CSV and JSON input.

-   **Description**: Formats a dictionary row for output to a combined results CSV file, applying specific string formatting for percentages and float values.
-   **Parameters**:
-    -   `row` (dict): The row of data to format.
-   **Returns**: `dict` - The formatted row.
+**Parameters**:
+- `file_path` (str): Path to the data file (relative to `data_dir`)
+- `start_date` (datetime-like): The start date for filtering data
+- `stop_date` (datetime-like): The end date for filtering data

-### `write_results_chunk(self, filename, fieldnames, rows, write_header=False, initial_usd=None)`
+**Returns**: `pandas.DataFrame` with timestamp index

-   **Description**: Writes a chunk of results (list of dictionaries) to a CSV file. Can append to an existing file or write a new one with a header. An optional `initial_usd` can be written as a comment in the header.
-   **Parameters**:
-    -   `filename` (str): The name of the file to write to (path is absolute or relative to current working dir).
-    -   `fieldnames` (list): A list of strings representing the CSV header/column names.
-    -   `rows` (list): A list of dictionaries, where each dictionary is a row.
-    -   `write_header` (bool, optional): If `True`, writes the header. Defaults to `False`.
-    -   `initial_usd` (numeric, optional): If provided and `write_header` is `True`, this value is written as a comment in the CSV header. Defaults to `None`.
+**Raises**: `DataLoadingError` if loading fails

-### `write_results_combined(self, filename, fieldnames, rows)`
+#### `save_data(self, data: pd.DataFrame, file_path: str) -> None`

-   **Description**: Writes combined results to a CSV file in the `results_dir`. Uses tab as a delimiter and formats rows using `format_row`.
-   **Parameters**:
-    -   `filename` (str): The name of the file to write to (relative to `results_dir`).
-    -   `fieldnames` (list): A list of strings representing the CSV header/column names.
-    -   `rows` (list): A list of dictionaries, where each dictionary is a row.
+**Description**: Saves processed data to a CSV file with proper timestamp handling.

-### `write_trades(self, all_trade_rows, trades_fieldnames)`
+**Parameters**:
+- `data` (pd.DataFrame): The DataFrame to save
+- `file_path` (str): Path to the data file (relative to `data_dir`)

-   **Description**: Writes trade data to separate CSV files based on timeframe and stop-loss percentage. Files are named `trades_{tf}_ST{sl_percent}pct.csv` and stored in `results_dir`.
-   **Parameters**:
-    -   `all_trade_rows` (list): A list of dictionaries, where each dictionary represents a trade.
-    -   `trades_fieldnames` (list): A list of strings for the CSV header of trade files.
+**Raises**: `DataSavingError` if saving fails
+
+#### `format_row(self, row: Dict[str, Any]) -> Dict[str, str]`
+
+**Description**: Formats a dictionary row for output to results CSV files.
+
+**Parameters**:
+- `row` (dict): The row of data to format
+
+**Returns**: `dict` with formatted values (percentages, currency, etc.)
+
+#### `write_results_chunk(self, filename: str, fieldnames: List[str], rows: List[Dict], write_header: bool = False, initial_usd: Optional[float] = None) -> None`
+
+**Description**: Writes a chunk of results to a CSV file with optional header.
+
+**Parameters**:
+- `filename` (str): The name of the file to write to
+- `fieldnames` (list): CSV header/column names
+- `rows` (list): List of dictionaries representing rows
+- `write_header` (bool, optional): Whether to write the header
+- `initial_usd` (float, optional): Initial USD value for header comment
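+
+The chunked-write behaviour described above can be sketched as follows. This is a minimal standalone illustration, not the project's implementation: it assumes `csv.DictWriter`, assumes that a header write starts a fresh file while later chunks append, and uses a hypothetical `# initial_usd: ...` comment format.
+
+```python
+import csv
+from typing import Dict, List, Optional
+
+
+def write_results_chunk(filename: str, fieldnames: List[str], rows: List[Dict],
+                        write_header: bool = False,
+                        initial_usd: Optional[float] = None) -> None:
+    # Assumption: writing the header starts a new file; later chunks append.
+    mode = "w" if write_header else "a"
+    with open(filename, mode, newline="") as f:
+        if write_header and initial_usd is not None:
+            # Hypothetical comment format; the real header comment may differ.
+            f.write(f"# initial_usd: {initial_usd}\n")
+        writer = csv.DictWriter(f, fieldnames=fieldnames, lineterminator="\n")
+        if write_header:
+            writer.writeheader()
+        writer.writerows(rows)
+```
+
+Calling it once with `write_header=True` and then again without it yields one file containing the comment line, the header, and the rows of both chunks.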
+
+#### `write_backtest_results(self, filename: str, fieldnames: List[str], rows: List[Dict], metadata_lines: Optional[List[str]] = None) -> str`
+
+**Description**: Writes combined backtest results to a CSV file with metadata.
+
+**Parameters**:
+- `filename` (str): Name of the file to write to (relative to `results_dir`)
+- `fieldnames` (list): CSV header/column names
+- `rows` (list): List of result dictionaries
+- `metadata_lines` (list, optional): Header comment lines
+
+**Returns**: Full path to the written file
+
+#### `write_trades(self, all_trade_rows: List[Dict], trades_fieldnames: List[str]) -> None`
+
+**Description**: Writes trade data to separate CSV files grouped by timeframe and stop-loss.
+
+**Parameters**:
+- `all_trade_rows` (list): List of trade dictionaries
+- `trades_fieldnames` (list): CSV header for trade files
+
+**Files Created**: `trades_{timeframe}_ST{sl_percent}pct.csv` in `results_dir`
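+
+The grouping step can be sketched as below. This is a standalone illustration under stated assumptions: the row keys `timeframe` and `sl_percent` are hypothetical, and `results_dir` is passed explicitly instead of being read from the instance.
+
+```python
+import csv
+import os
+from collections import defaultdict
+from typing import Dict, List
+
+
+def write_trades(all_trade_rows: List[Dict], trades_fieldnames: List[str],
+                 results_dir: str) -> None:
+    # Hypothetical grouping keys: "timeframe" and "sl_percent".
+    groups = defaultdict(list)
+    for row in all_trade_rows:
+        groups[(row["timeframe"], row["sl_percent"])].append(row)
+
+    for (timeframe, sl_percent), rows in groups.items():
+        path = os.path.join(results_dir, f"trades_{timeframe}_ST{sl_percent}pct.csv")
+        with open(path, "w", newline="") as f:
+            # extrasaction="ignore" skips keys that are not listed in trades_fieldnames.
+            writer = csv.DictWriter(f, fieldnames=trades_fieldnames, extrasaction="ignore")
+            writer.writeheader()
+            writer.writerows(rows)
+```
+
+Each distinct (timeframe, stop-loss) pair produces exactly one file, so the number of output files equals the number of parameter combinations present in the rows.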
+
+### `DataLoader`
+
+Handles loading and preprocessing of data from various file formats.
+
+#### Key Features:
+- Supports CSV and JSON formats
+- Optimized pandas dtypes for financial data
+- Intelligent timestamp parsing (Unix timestamps and datetime strings)
+- Date range filtering
+- Column name normalization (lowercase)
+- Comprehensive error handling
+
+#### Methods:
+- `load_data()` - Main loading interface
+- `_load_json_data()` - JSON-specific loading logic
+- `_load_csv_data()` - CSV-specific loading logic
+- `_process_csv_timestamps()` - Timestamp parsing for CSV data
+
+### `DataSaver`
+
+Manages saving data with proper format handling and index conversion.
+
+#### Key Features:
+- Converts DatetimeIndex to Unix timestamps for CSV compatibility
+- Handles numeric indexes appropriately
+- Ensures 'timestamp' column is first in output
+- Comprehensive error handling and logging
+
+#### Methods:
+- `save_data()` - Main saving interface
+- `_prepare_data_for_saving()` - Data preparation logic
+- `_convert_datetime_index_to_timestamp()` - DatetimeIndex conversion
+- `_convert_numeric_index_to_timestamp()` - Numeric index conversion
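+
+The index conversion can be sketched as below. This is a minimal sketch, not the actual `DataSaver` code: the helper name `prepare_for_saving` is hypothetical, and it assumes a timezone-naive `DatetimeIndex`.
+
+```python
+import pandas as pd
+
+
+def prepare_for_saving(data: pd.DataFrame) -> pd.DataFrame:
+    out = data.copy()
+    if isinstance(out.index, pd.DatetimeIndex):
+        # Whole seconds since the Unix epoch (assumes a tz-naive index).
+        seconds = (out.index - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
+        out["timestamp"] = seconds.to_numpy()
+        out = out.reset_index(drop=True)
+    if "timestamp" in out.columns:
+        # Move 'timestamp' to the front so it becomes the first CSV column.
+        cols = ["timestamp"] + [c for c in out.columns if c != "timestamp"]
+        out = out[cols]
+    return out
+```
+
+The result would then be written with `out.to_csv(path, index=False)`, so the positional index is not saved alongside the `timestamp` column.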
+
+### `ResultFormatter`
+
+Handles formatting and writing of backtest results to CSV files.
+
+#### Key Features:
+- Consistent formatting for percentages and currency
+- Grouped trade file writing by timeframe/stop-loss
+- Metadata header support
+- Tab-delimited output for results
+- Error handling for all write operations
+
+#### Methods:
+- `format_row()` - Format individual result rows
+- `write_results_chunk()` - Write result chunks with headers
+- `write_backtest_results()` - Write combined results with metadata
+- `write_trades()` - Write grouped trade files
+
+## Utility Functions and Exceptions
+
+### Custom Exceptions
+
+- **`TimestampParsingError`** - Raised when timestamp parsing fails
+- **`DataLoadingError`** - Raised when data loading operations fail
+- **`DataSavingError`** - Raised when data saving operations fail
+
+### Utility Functions
+
+- **`_parse_timestamp_column()`** - Parse timestamp columns with format detection
+- **`_filter_by_date_range()`** - Filter DataFrames by date range
+- **`_normalize_column_names()`** - Convert column names to lowercase
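+
+A rough sketch of how timestamp format detection and the custom exception might fit together. The function `parse_timestamp_column` is a hypothetical stand-in for the private helper, and treating all-numeric values as Unix timestamps in seconds is an assumption.
+
+```python
+import pandas as pd
+
+
+class TimestampParsingError(Exception):
+    """Raised when timestamp parsing fails."""
+
+
+def parse_timestamp_column(series: pd.Series) -> pd.Series:
+    numeric = pd.to_numeric(series, errors="coerce")
+    try:
+        if numeric.notna().all():
+            # Assumption: all-numeric values are Unix timestamps in seconds.
+            return pd.to_datetime(numeric, unit="s")
+        # Otherwise treat the values as datetime strings.
+        return pd.to_datetime(series)
+    except (ValueError, TypeError) as exc:
+        raise TimestampParsingError(f"Could not parse timestamps: {exc}") from exc
+```
+
+Wrapping the pandas errors in a single exception type lets callers handle every parsing failure with one `except` clause.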
+
+## Architecture Benefits
+
+### Separation of Concerns
+- Each class has a single, well-defined responsibility
+- Data loading, saving, and result formatting are cleanly separated
+- Shared utilities are extracted to prevent code duplication
+
+### Maintainability
+- All files are under 250 lines (quality gate)
+- All methods are under 50 lines (quality gate)
+- Clear interfaces and comprehensive documentation
+- Type hints for better IDE support and clarity
+
+### Error Handling
+- Custom exceptions for different error types
+- Consistent error logging patterns
+- Graceful degradation (empty DataFrames on load failure)
+
+### Backward Compatibility
+- The `Storage` class maintains the exact same public interface
+- All existing code continues to work unchanged
+- Component classes are available for advanced usage
+
+## Migration Notes
+
+The refactoring maintains full backward compatibility. Existing code using `Storage` will continue to work unchanged. For new code, consider using the component classes directly for more focused functionality:
+
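+```python
+import logging
+
+logger = logging.getLogger(__name__)
+start, end = '2024-01-01', '2024-12-31'  # example date range (illustrative values)
+
+# Existing pattern (still works)
+from cycles.utils.storage import Storage
+storage = Storage(logging=logger)
+data = storage.load_data('file.csv', start, end)
+
+# New pattern for focused usage
+from cycles.utils.data_loader import DataLoader
+data_dir = '../data'  # matches the documented DATA_DIR default
+loader = DataLoader(data_dir, logger)
+data = loader.load_data('file.csv', start, end)
+```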