napistu_torch.load.encoders
Custom transformations for NapistuGraph vertex and edge features.
This module provides custom sklearn-compatible transformers for encoding sparse continuous features and other specialized transformations.
Classes
- SparseContScaler
Transformer for encoding sparse continuous features as indicator + standardized value pairs.
Public Functions
- encode_sparse_continuous(values, scaler=None, fit=True)
Encode sparse continuous features as indicator + standardized value pairs.
Functions
|
Encode sparse continuous features as indicator + standardized value pairs. |
Classes
Wrapper for encode_sparse_continuous to use in NapistuFeatureEncoder. |
- class napistu_torch.load.encoders.SparseContScaler
Bases:
BaseEstimator,TransformerMixinWrapper for encode_sparse_continuous to use in NapistuFeatureEncoder.
Standardizes continuous features with missing values by encoding each feature as two columns: [indicator, scaled_value].
Examples
>>> import pandas as pd >>> import numpy as np >>> >>> # Configuration with sparse continuous features >>> config = { ... 'sparse_num': { ... 'columns': ['kinetic_constant'], ... 'transformer': SparseContScaler() ... } ... } >>> >>> train_df = pd.DataFrame({ ... 'kinetic_constant': [1.0, 2.0, np.nan, 4.0, np.nan] ... }) >>> >>> encoder = NapistuFeatureEncoder(config) >>> features = encoder.fit_transform(train_df) >>> print(features.shape) # (5, 2) - indicator + value for 1 column
- __init__()
- fit(X)
Fit a scaler for each column in X.
- Parameters:
X (pd.DataFrame or np.ndarray) – Input data with NaN indicating missing values.
- Returns:
self
- Return type:
- get_feature_names_out(input_features=None)
Get output feature names for transformation.
Each input feature becomes two output features: - is_{feature_name}: Binary indicator (1 if present, 0 if missing) - value_{feature_name}: Standardized value (scaled if present, 0 if missing)
- Parameters:
input_features (array-like of str or None, default=None) – Input feature names. If None, uses feature_names_in_ from fit().
- Returns:
feature_names_out – Output feature names.
- Return type:
ndarray of str
Examples
>>> scaler = SparseContScaler() >>> df = pd.DataFrame({'kinetic_constant': [1.0, 2.0, np.nan]}) >>> scaler.fit(df) >>> scaler.get_feature_names_out() array(['is_kinetic_constant', 'value_kinetic_constant'], dtype=object)
- transform(X)
Transform X using fitted scalers.
- Parameters:
X (pd.DataFrame or np.ndarray) – Input data with same columns as fit().
- Returns:
encoded – Transformed features with shape (n_samples, 2 * n_columns).
- Return type:
np.ndarray
- _sklearn_auto_wrap_output_keys = {'transform'}
- napistu_torch.load.encoders.encode_sparse_continuous(values, scaler=None, fit=True)
Encode sparse continuous features as indicator + standardized value pairs.
Each sparse continuous feature is encoded as two columns: 1. Binary indicator (1 if present, 0 if missing/NaN) 2. Standardized value (scaled if present, 0 if missing)
The standardization is computed only from non-missing values to avoid zeros skewing the distribution statistics.
- Parameters:
values (array-like of shape (n_samples,)) – Feature values with NaN indicating missing entries.
scaler (StandardScaler or None, optional) – Pre-fitted scaler. If None and fit=True, a new scaler is created and fitted. If None and fit=False, values are not scaled.
fit (bool, default=True) – Whether to fit a new scaler on the non-missing values. Ignored if scaler is provided.
- Returns:
encoded (ndarray of shape (n_samples, 2)) – Array with columns [indicator, standardized_value].
scaler (StandardScaler or None) – The fitted scaler (newly created if fit=True, otherwise the input scaler). Returns None if no scaling was performed.
Examples
>>> import numpy as np >>> from sklearn.preprocessing import StandardScaler >>> >>> # Training data >>> train_vals = np.array([1.0, 2.0, np.nan, 4.0, np.nan]) >>> encoded_train, scaler = encode_sparse_continuous(train_vals, fit=True) >>> print(encoded_train) [[1.0, -1.134...], [1.0, -0.378...], [0.0, 0.0], [1.0, 1.512...], [0.0, 0.0]] >>> >>> # Test data with same scaler >>> test_vals = np.array([3.0, np.nan, 5.0]) >>> encoded_test, _ = encode_sparse_continuous(test_vals, scaler=scaler, fit=False)
Notes
Missing values are encoded as [0, 0] while present values are encoded as [1, (value - mean) / std] where mean and std are computed only from the non-missing values in the training data.