napistu_torch.load.encoding
DataFrame encoding and transformation utilities.
This module provides functions for automatically selecting encodings, fitting transformers, and transforming DataFrames for use in machine learning pipelines.
Public Functions
- auto_encode(graph_df, existing_encodings, encoders=DEFAULT_ENCODERS)
Select appropriate encodings for each column in a graph dataframe.
- classify_encoding(series, max_categories=50)
Classify the appropriate encoding type for a pandas Series.
- compose_encoding_configs(encoding_defaults, encoding_overrides=None, encoders=DEFAULT_ENCODERS, verbose=False)
Compose encoding configurations with optional overrides.
- deduplicate_features(encoded_array, feature_names, min_prefix_length=3)
Deduplicate identical feature columns and give each group a canonical name.
- config_to_column_transformer(encoding_config)
Convert an encoding config to an sklearn ColumnTransformer.
- encode_dataframe(df, encoding_defaults, encoding_overrides=None, encoders=DEFAULT_ENCODERS, deduplicate=True, verbose=False)
Fit encoders on a DataFrame and encode it in one step.
- expand_deduplicated_features(encoded_array, feature_names, feature_aliases)
Expand deduplicated features back to their original columns using aliases.
- fit_encoders(df, encoding_defaults, encoding_overrides=None, encoders=DEFAULT_ENCODERS, verbose=False)
Fit encoders on a DataFrame using the specified configuration.
- transform_dataframe(df, fitted_transformer, deduplicate=True)
Transform a DataFrame using a fitted ColumnTransformer.
Functions
auto_encode | Select appropriate encodings for each column in a graph dataframe (either the vertex_df or edge_df).
classify_encoding | Classify the encoding type for a pandas Series.
compose_encoding_configs | Compose encoding configurations with optional overrides.
config_to_column_transformer | Convert validated config dict to sklearn ColumnTransformer.
deduplicate_features | Deduplicate identical feature columns using shortest common prefix.
encode_dataframe | Encode a DataFrame using sklearn transformers with configurable encoding rules.
expand_deduplicated_features | Expand deduplicated features back to original form.
fit_encoders | Fit encoding transformers on a DataFrame.
transform_dataframe | Transform a DataFrame using a fitted ColumnTransformer.
- napistu_torch.load.encoding._build_deduplicated_array(encoded_array: ndarray, feature_names: List[str], canonical_mapping: Dict[int, str], unique_columns: set) → Tuple[ndarray, List[str]]
Build deduplicated array and feature name list.
- Parameters:
encoded_array (np.ndarray) – Original feature matrix
feature_names (List[str]) – Original feature names
canonical_mapping (Dict[int, str]) – Mapping from kept duplicate indices to canonical names
unique_columns (set) – Names of non-duplicate columns to keep as-is
- Returns:
pruned_array (np.ndarray) – Array with duplicate columns removed
canonical_names (List[str]) – Feature names for pruned array
- napistu_torch.load.encoding._get_feature_names(preprocessor: ColumnTransformer) → List[str]
Get feature names from fitted ColumnTransformer using sklearn’s standard method.
- Parameters:
preprocessor (ColumnTransformer) – Fitted ColumnTransformer instance.
- Returns:
List of feature names in the same order as transform output columns.
- Return type:
List[str]
Examples
>>> preprocessor = config_to_column_transformer(config)
>>> preprocessor.fit(data)  # Must fit first!
>>> feature_names = _get_feature_names(preprocessor)
>>> # ['cat__node_type_A', 'cat__node_type_B', 'num__weight']
- napistu_torch.load.encoding._group_identical_columns(encoded_array: ndarray, feature_names: List[str]) → Dict[int, List[Tuple[int, str]]]
Group columns by identical values using matrix operations.
- Parameters:
encoded_array (np.ndarray) – Feature matrix
feature_names (List[str]) – Feature names
- Returns:
Mapping from representative column index to list of (index, name) tuples
- Return type:
Dict[int, List[Tuple[int, str]]]
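The grouping step can be illustrated with a small standalone sketch. It is not the library's implementation: it hashes whole columns instead of using matrix operations, works on plain nested lists, and assumes that only groups with two or more members are reported. The function name `group_identical_columns` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_identical_columns(
    rows: Sequence[Sequence[float]], feature_names: List[str]
) -> Dict[int, List[Tuple[int, str]]]:
    """Group columns whose values are identical in every row (hypothetical helper)."""
    first_seen: Dict[tuple, int] = {}
    groups: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for idx, name in enumerate(feature_names):
        # A column is identified by the tuple of its values across all rows.
        column = tuple(row[idx] for row in rows)
        # The first column seen with these values becomes the representative.
        representative = first_seen.setdefault(column, idx)
        groups[representative].append((idx, name))
    # Assumption: only groups with at least two members count as duplicates.
    return {rep: members for rep, members in groups.items() if len(members) > 1}


rows = [[1, 1, 0], [0, 0, 1]]
names = ["is_string_x", "is_string_y", "value_weight"]
duplicate_groups = group_identical_columns(rows, names)
print(duplicate_groups)
```

The result has the same shape as the documented return value: a representative column index mapped to the (index, name) pairs that share its values.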
- napistu_torch.load.encoding._resolve_canonical_names(duplicate_groups: Dict[int, List[Tuple[int, str]]], unique_columns: set, min_prefix_length: int) → Tuple[Dict[int, str], Dict[str, str]]
Resolve canonical names for duplicate groups with uniqueness guarantees.
Processes groups serially, checking each proposed canonical name against all previously assigned names and non-duplicate feature names to ensure uniqueness.
- Parameters:
duplicate_groups (Dict[int, List[Tuple[int, str]]]) – Groups of duplicate columns, mapping representative index to (index, name) tuples
unique_columns (set) – Names of non-duplicate columns that must not be shadowed
min_prefix_length (int) – Minimum prefix length for canonical names
- Returns:
canonical_mapping (Dict[int, str]) – Mapping from kept column index to its canonical name
alias_dict (Dict[str, str]) – Mapping from removed names to canonical names
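A minimal sketch of the serial name resolution described above, under stated assumptions: the candidate name is the shortest common prefix of the group with trailing underscores stripped, the fallback is the alphabetically first free name in the group, and every group member other than the canonical name is recorded as an alias. The function below is hypothetical and only mirrors the documented inputs and outputs.

```python
import os
from typing import Dict, List, Set, Tuple


def resolve_canonical_names(
    duplicate_groups: Dict[int, List[Tuple[int, str]]],
    unique_columns: Set[str],
    min_prefix_length: int = 3,
) -> Tuple[Dict[int, str], Dict[str, str]]:
    """Assign one unique canonical name per duplicate group (hypothetical helper)."""
    taken = set(unique_columns)  # names that must not be shadowed
    canonical_mapping: Dict[int, str] = {}
    alias_dict: Dict[str, str] = {}
    # Process groups one at a time so each new name is checked against earlier ones.
    for rep_idx in sorted(duplicate_groups):
        names = sorted(name for _, name in duplicate_groups[rep_idx])
        # Assumption: trailing underscores are stripped from the common prefix.
        prefix = os.path.commonprefix(names).rstrip("_")
        if len(prefix) >= min_prefix_length and prefix not in taken:
            candidate = prefix
        else:
            # Fall back to the alphabetically first name that is still free.
            candidate = next((n for n in names if n not in taken), names[0])
        taken.add(candidate)
        canonical_mapping[rep_idx] = candidate
        for name in names:
            if name != candidate:
                alias_dict[name] = candidate
    return canonical_mapping, alias_dict


groups = {0: [(0, "is_string_x"), (1, "is_string_y")]}
canonical_mapping, alias_dict = resolve_canonical_names(groups, {"value_weight"})
print(canonical_mapping)
print(alias_dict)
```

Tracking a single `taken` set is what gives the uniqueness guarantee in this sketch: a prefix that collides with a non-duplicate column or an earlier canonical name is rejected in favor of the fallback.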
- napistu_torch.load.encoding._validate_feature_names(feature_names: List[str]) → None
Check for duplicates in feature_names.
- napistu_torch.load.encoding.auto_encode(graph_df: DataFrame, existing_encodings: Dict | EncodingManager, encoders: Dict = {'binary': 'passthrough', 'categorical': OneHotEncoder(drop='if_binary', sparse_output=False), 'numeric': StandardScaler(), 'sparse_categorical': OneHotEncoder(handle_unknown='ignore', sparse_output=False), 'sparse_numeric': SparseContScaler()}) → EncodingManager
Select appropriate encodings for each column in a graph dataframe (either the vertex_df or edge_df)
- Parameters:
graph_df (pd.DataFrame) – The dataframe to select encodings for.
existing_encodings (Union[Dict, EncodingManager]) – The existing encodings to use. This could be VERTEX_DEFAULT_TRANSFORMS or EDGE_DEFAULT_TRANSFORMS or any modified version of these.
encoders (Dict, default=DEFAULT_ENCODERS) – The encoders to use. These map column encoding classes to the encoders themselves. If existing_encodings is a dict, it must be passed in the 'simple' format, which is a lookup from encoder keys to the columns using that encoder.
- Returns:
A new EncodingManager with the selected encodings.
- Return type:
EncodingManager
- napistu_torch.load.encoding.classify_encoding(series: Series, max_categories: int = 50) → str | None
Classify the encoding type for a pandas Series.
- Parameters:
series (pd.Series) – The column to classify
max_categories (int, default=50) – Maximum number of unique values for categorical encoding. If exceeded, logs a warning and returns None.
- Returns:
One of: 'binary', 'categorical', 'numeric', 'numeric_sparse', or None. Returns None for constant variables or high-cardinality features.
- Return type:
Optional[str]
Examples
>>> classify_encoding(pd.Series([0, 1, 0, 1]))
'binary'
>>> classify_encoding(pd.Series([0, 1, np.nan]))
'categorical'
>>> classify_encoding(pd.Series([1.5, 2.3, 4.1]))
'numeric'
>>> classify_encoding(pd.Series([1.5, np.nan, 4.1]))
'numeric_sparse'
>>> classify_encoding(pd.Series([5, 5, 5, 5]))  # Constant
None
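The decision rules behind these examples are not spelled out here, so the following is only a guess at a rule set that reproduces the five documented outputs. It works on plain Python lists rather than a pandas Series, and the function name `classify_values` and the exact ordering of the checks are assumptions.

```python
import math
from typing import List, Optional


def _is_missing(value: object) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def classify_values(values: List[object], max_categories: int = 50) -> Optional[str]:
    """Pick an encoding label for a column of values (hypothetical rule set)."""
    has_missing = any(_is_missing(v) for v in values)
    unique = {v for v in values if not _is_missing(v)}
    # Constant (or empty) columns carry no information.
    if not unique or (len(unique) == 1 and not has_missing):
        return None
    # 0/1 columns: binary when complete, categorical when values are missing.
    if unique <= {0, 1}:
        return "categorical" if has_missing else "binary"
    # Remaining numeric columns: sparse variant when values are missing.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in unique):
        return "numeric_sparse" if has_missing else "numeric"
    # Non-numeric columns are categorical unless cardinality is too high.
    if len(unique) > max_categories:
        return None
    return "categorical"


print(classify_values([0, 1, 0, 1]))
print(classify_values([1.5, float("nan"), 4.1]))
```

The actual function may order or define these checks differently, for example in how it treats integer columns with few distinct values.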
- napistu_torch.load.encoding.compose_encoding_configs(encoding_defaults: Dict | EncodingManager, encoding_overrides: Dict | EncodingManager | None = None, encoders: Dict = {'binary': 'passthrough', 'categorical': OneHotEncoder(drop='if_binary', sparse_output=False), 'numeric': StandardScaler(), 'sparse_categorical': OneHotEncoder(handle_unknown='ignore', sparse_output=False), 'sparse_numeric': SparseContScaler()}, verbose: bool = False) → EncodingManager
Compose encoding configurations with optional overrides.
- Parameters:
encoding_defaults (Union[Dict, EncodingManager]) – Base encoding configuration.
encoding_overrides (Optional[Union[Dict, EncodingManager]], default=None) – Optional override configuration to merge with defaults. For column conflicts, overrides take precedence.
encoders (Dict, default=DEFAULT_ENCODERS) – Encoder instances to use when configs are in simple format.
verbose (bool, default=False) – If True, log config composition details.
- Returns:
Composed configuration (or just defaults if no overrides).
- Return type:
EncodingManager
Examples
>>> defaults = {ENCODINGS.NUMERIC: ['col1']}
>>> overrides = {ENCODINGS.CATEGORICAL: ['col2']}
>>> config = compose_encoding_configs(defaults, overrides)
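The precedence rule ("for column conflicts, overrides take precedence") can be sketched on simple-format dicts, which map an encoder key to the columns using that encoder. This is an illustration with plain dicts, not the EncodingManager logic; `compose_simple_configs` is a hypothetical name.

```python
from typing import Dict, List, Optional


def compose_simple_configs(
    defaults: Dict[str, List[str]],
    overrides: Optional[Dict[str, List[str]]] = None,
) -> Dict[str, List[str]]:
    """Merge two simple-format configs; a column named in overrides wins."""
    if not overrides:
        # No overrides: return a copy of the defaults.
        return {key: list(cols) for key, cols in defaults.items()}
    overridden = {col for cols in overrides.values() for col in cols}
    composed: Dict[str, List[str]] = {}
    for key, cols in defaults.items():
        # Drop any default assignment for a column that the overrides reassign.
        kept = [col for col in cols if col not in overridden]
        if kept:
            composed[key] = kept
    for key, cols in overrides.items():
        composed.setdefault(key, []).extend(cols)
    return composed


defaults = {"numeric": ["col1", "col2"]}
overrides = {"categorical": ["col2"]}
config = compose_simple_configs(defaults, overrides)
print(config)
```

Here `col2` appears in both configs, so it ends up under the override's encoder only.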
- napistu_torch.load.encoding.config_to_column_transformer(encoding_config: Dict[str, Dict] | EncodingConfig) → ColumnTransformer
Convert validated config dict to sklearn ColumnTransformer.
- Parameters:
encoding_config (Union[Dict[str, Dict], EncodingConfig]) – Configuration dictionary (will be validated first).
- Returns:
sklearn ColumnTransformer ready for fit/transform.
- Return type:
ColumnTransformer
- Raises:
ValueError – If config is invalid.
Examples
>>> config = {
...     'categorical': {
...         'columns': ['node_type', 'species_type'],
...         'transformer': OneHotEncoder(handle_unknown='ignore')
...     },
...     'numerical': {
...         'columns': ['weight', 'score'],
...         'transformer': StandardScaler()
...     }
... }
>>> preprocessor = config_to_column_transformer(config)
>>> # Equivalent to:
>>> # ColumnTransformer([
>>> #     ('categorical', OneHotEncoder(handle_unknown='ignore'), ['node_type', 'species_type']),
>>> #     ('numerical', StandardScaler(), ['weight', 'score'])
>>> # ])
- napistu_torch.load.encoding.deduplicate_features(encoded_array: ndarray, feature_names: List[str], min_prefix_length: int = 3) → Tuple[ndarray, List[str], Dict[str, str]]
Deduplicate identical feature columns using shortest common prefix.
Ensures all canonical names are unique by checking against both other canonical names and non-duplicate feature names.
- Parameters:
encoded_array (np.ndarray) – Feature matrix with potential duplicates, shape (n_samples, n_features)
feature_names (List[str]) – Names corresponding to columns in encoded_array
min_prefix_length (int, default=3) – Minimum prefix length for canonical names. If common prefix is shorter, falls back to alphabetically first name in the group.
- Returns:
pruned_array (np.ndarray) – Array with duplicate columns removed
canonical_names (List[str]) – Names of kept features (using shortest common prefix for duplicates)
feature_aliases (Dict[str, str]) – Mapping from removed feature names to their canonical representatives
- Raises:
ValueError – If feature_names contains duplicates
Examples
>>> array = np.array([[1, 1, 0], [0, 0, 1]])
>>> names = ['is_string_x', 'is_string_y', 'value_weight']
>>> pruned, canonical, aliases = deduplicate_features(array, names)
>>> canonical
['is_string', 'value_weight']
>>> aliases
{'is_string_y': 'is_string'}
- napistu_torch.load.encoding.encode_dataframe(df: DataFrame, encoding_defaults: Dict[str, Dict] | EncodingManager, encoding_overrides: Dict[str, Dict] | EncodingManager | None = None, encoders: Dict = {'binary': 'passthrough', 'categorical': OneHotEncoder(drop='if_binary', sparse_output=False), 'numeric': StandardScaler(), 'sparse_categorical': OneHotEncoder(handle_unknown='ignore', sparse_output=False), 'sparse_numeric': SparseContScaler()}, deduplicate: bool = True, verbose: bool = False) → tuple[ndarray, List[str]]
Encode a DataFrame using sklearn transformers with configurable encoding rules.
This is a convenience function that combines fitting and transforming in one step. For more control (e.g., fitting on training data and transforming test data), use fit_encoders() and transform_dataframe() separately.
This function applies a series of transformations to a DataFrame based on encoding configurations. It supports both default encoding rules and optional overrides that can modify or extend the default behavior.
- Parameters:
df (pd.DataFrame) – Input DataFrame to be encoded. Must contain all columns specified in the encoding configurations.
encoding_defaults (Union[Dict[str, Dict], EncodingManager]) – Base encoding configuration dictionary. Each key is a transform name and each value is a dict with 'columns' and 'transformer' keys. Example:
{
    'categorical': {'columns': ['col1', 'col2'], 'transformer': OneHotEncoder()},
    'numerical': {'columns': ['col3'], 'transformer': StandardScaler()}
}
encoding_overrides (Optional[Union[Dict[str, Dict], EncodingManager]], default=None) – Optional override configuration that will be merged with encoding_defaults. For column conflicts, the override configuration takes precedence. If None, only encoding_defaults will be used.
encoders (Dict, default=DEFAULT_ENCODERS) – The encoders to use. If encoding_defaults or encoding_overrides are dicts, these are used to map from column encoding classes to the encoders themselves. In the 'simple' format, a dict is a lookup from encoder keys to the columns using that encoder.
deduplicate (bool, default=True) – If True, deduplicate identical features and name the resulting columns using the shortest common prefix of the merged columns.
verbose (bool, default=False) – If True, log detailed information about config composition and conflicts.
- Returns:
A tuple containing:
- encoded_array : np.ndarray
Transformed numpy array with encoded features. The number of columns may differ from the input due to transformations like OneHotEncoder.
- feature_names : List[str]
List of feature names corresponding to the columns in encoded_array. Names follow sklearn's convention: 'transform_name__column_name'.
- feature_aliases : Dict[str, str]
Mapping from removed feature names to their canonical names if deduplicate is True; an empty dictionary if deduplicate is False.
- Return type:
tuple[np.ndarray, List[str]]
- Raises:
ValueError – If encoding configurations are invalid, have column conflicts, or if required columns are missing from the input DataFrame.
KeyError – If the input DataFrame is missing columns specified in the encoding config.
Examples
>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder, StandardScaler
>>>
>>> # Sample data
>>> df = pd.DataFrame({
...     'category': ['A', 'B', 'A', 'C'],
...     'value': [1.0, 2.0, 3.0, 4.0]
... })
>>>
>>> # Encoding configuration
>>> defaults = {
...     'categorical': {
...         'columns': ['category'],
...         'transformer': OneHotEncoder(sparse_output=False)
...     },
...     'numerical': {
...         'columns': ['value'],
...         'transformer': StandardScaler()
...     }
... }
>>>
>>> # Encode the DataFrame (fit and transform in one step)
>>> encoded_array, feature_names = encode_dataframe(df, defaults)
>>> print(f"Encoded shape: {encoded_array.shape}")
>>> print(f"Feature names: {feature_names}")
>>>
>>> # For train/test split, use the two-step approach:
>>> fitted_transformer = fit_encoders(train_df, defaults)
>>> train_encoded, train_features, train_aliases = transform_dataframe(train_df, fitted_transformer)
>>> test_encoded, test_features, test_aliases = transform_dataframe(test_df, fitted_transformer)
- napistu_torch.load.encoding.expand_deduplicated_features(encoded_array: ndarray, feature_names: List[str], feature_aliases: Dict[str, str]) → Tuple[ndarray, List[str]]
Expand deduplicated features back to original form.
Takes a deduplicated feature matrix and expands it by duplicating columns for all aliased features. The expanded array will have the same number of columns as the original pre-deduplication array (though column order may differ).
- Parameters:
encoded_array (np.ndarray) – Deduplicated feature matrix, shape (n_samples, n_deduplicated_features)
feature_names (List[str]) – Canonical feature names corresponding to encoded_array columns
feature_aliases (Dict[str, str]) – Mapping from removed feature names to canonical names (output from deduplicate_features). This can be a subset of the dictionary to only restore specific aliases.
- Returns:
expanded_array (np.ndarray) – Array with aliased columns duplicated, shape (n_samples, n_original_features)
expanded_names (List[str]) – Feature names for expanded array (includes all original names)
Examples
>>> # After deduplication
>>> deduplicated = np.array([[1, 0], [0, 1]])
>>> names = ['is_string', 'value_weight']
>>> aliases = {'is_string_x': 'is_string', 'is_string_y': 'is_string'}
>>>
>>> expanded, expanded_names = expand_deduplicated_features(
...     deduplicated, names, aliases
... )
>>> expanded.shape
(2, 4)  # 2 samples, 4 features (is_string, is_string_x, is_string_y, value_weight)
>>> expanded_names
['is_string', 'is_string_x', 'is_string_y', 'value_weight']
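The expansion amounts to copying each canonical column once per alias. A minimal sketch on nested lists follows; it appends aliased columns at the end, which is an assumption about ordering (the description above only notes that column order may differ), and `expand_features` is a hypothetical name.

```python
from typing import Dict, List, Sequence, Tuple


def expand_features(
    rows: Sequence[Sequence[float]],
    feature_names: List[str],
    feature_aliases: Dict[str, str],
) -> Tuple[List[List[float]], List[str]]:
    """Duplicate canonical columns for every alias (hypothetical helper)."""
    index_of = {name: i for i, name in enumerate(feature_names)}
    expanded_names = list(feature_names)
    source_index = list(range(len(feature_names)))
    for removed_name, canonical_name in feature_aliases.items():
        # Each alias reuses the values of its canonical column.
        expanded_names.append(removed_name)
        source_index.append(index_of[canonical_name])
    expanded_rows = [[row[i] for i in source_index] for row in rows]
    return expanded_rows, expanded_names


deduplicated = [[1, 0], [0, 1]]
names = ["is_string", "value_weight"]
aliases = {"is_string_x": "is_string", "is_string_y": "is_string"}
expanded, expanded_names = expand_features(deduplicated, names, aliases)
print(expanded_names)
```

Passing only a subset of the alias dictionary restores only those columns, matching the note on the feature_aliases parameter.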
- napistu_torch.load.encoding.fit_encoders(df: DataFrame, encoding_defaults: Dict[str, Dict] | EncodingManager, encoding_overrides: Dict[str, Dict] | EncodingManager | None = None, encoders: Dict = {'binary': 'passthrough', 'categorical': OneHotEncoder(drop='if_binary', sparse_output=False), 'numeric': StandardScaler(), 'sparse_categorical': OneHotEncoder(handle_unknown='ignore', sparse_output=False), 'sparse_numeric': SparseContScaler()}, verbose: bool = False) → ColumnTransformer
Fit encoding transformers on a DataFrame.
This function creates and fits a ColumnTransformer based on encoding configurations. The fitted transformer can then be used to transform this DataFrame or other DataFrames with the same schema.
- Parameters:
df (pd.DataFrame) – Input DataFrame to fit encoders on. Must contain all columns specified in the encoding configurations.
encoding_defaults (Union[Dict[str, Dict], EncodingManager]) – Base encoding configuration. Each key is a transform name and each value is a dict with ‘columns’ and ‘transformer’ keys.
encoding_overrides (Optional[Union[Dict[str, Dict], EncodingManager]], default=None) – Optional override configuration that will be merged with encoding_defaults. For column conflicts, the override configuration takes precedence.
verbose (bool, default=False) – If True, log detailed information about config composition and conflicts.
- Returns:
Fitted sklearn ColumnTransformer ready to transform data.
- Return type:
ColumnTransformer
- Raises:
ValueError – If encoding configurations are invalid or if the DataFrame is empty.
KeyError – If the input DataFrame is missing columns specified in the encoding config.
Examples
>>> # Fit encoders on training data
>>> fitted_transformer = fit_encoders(train_df, encoding_defaults)
>>>
>>> # Use the fitted transformer on train and test data
>>> train_encoded, train_features, train_aliases = transform_dataframe(train_df, fitted_transformer)
>>> test_encoded, test_features, test_aliases = transform_dataframe(test_df, fitted_transformer)
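Why fitting and transforming are kept separate can be shown without sklearn. The sketch below standardizes values with statistics learned from the training split only, which is the same idea a fitted StandardScaler applies; `fit_scaler` and `transform_values` are hypothetical stand-ins, not part of this module.

```python
from statistics import fmean, pstdev
from typing import List, Tuple


def fit_scaler(train_values: List[float]) -> Tuple[float, float]:
    """Learn mean and population standard deviation from training data only."""
    mean = fmean(train_values)
    std = pstdev(train_values) or 1.0  # avoid division by zero for constant data
    return mean, std


def transform_values(values: List[float], params: Tuple[float, float]) -> List[float]:
    """Apply previously fitted statistics to any split."""
    mean, std = params
    return [(value - mean) / std for value in values]


train = [1.0, 2.0, 3.0]
test = [2.0, 4.0]
params = fit_scaler(train)                     # fit once, on training data
train_scaled = transform_values(train, params)
test_scaled = transform_values(test, params)   # reuse the same statistics
print(test_scaled)
```

Because the test split is scaled with the training mean and deviation, no information from the test data leaks into the fitted parameters.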
- napistu_torch.load.encoding.transform_dataframe(df: DataFrame, fitted_transformer: ColumnTransformer, deduplicate: bool = True) → tuple[ndarray, List[str], Dict[str, str]]
Transform a DataFrame using a fitted ColumnTransformer.
This function applies pre-fitted transformations to a DataFrame. The transformer must have been fitted previously using fit_encoders() or by calling .fit() directly.
- Parameters:
df (pd.DataFrame) – Input DataFrame to transform. Must contain all columns that the transformer expects.
fitted_transformer (ColumnTransformer) – A fitted sklearn ColumnTransformer instance.
deduplicate (bool, default=True) – If True, deduplicate identical features and name the resulting columns using the shortest common prefix of the merged columns.
- Returns:
A tuple containing:
- encoded_array : np.ndarray
Transformed numpy array with encoded features.
- feature_names : List[str]
List of feature names corresponding to columns in encoded_array.
- feature_aliases : Dict[str, str]
Mapping from removed feature names to their canonical names if deduplicate is True; an empty dictionary if deduplicate is False.
- Return type:
tuple[np.ndarray, List[str], Dict[str, str]]
- Raises:
ValueError – If the transformer is not fitted or if the DataFrame is empty.
KeyError – If the DataFrame is missing columns required by the transformer.
Examples
>>> # Fit on training data
>>> fitted_transformer = fit_encoders(train_df, encoding_config)
>>>
>>> # Transform multiple DataFrames with same fitted transformer
>>> train_encoded, train_features, train_aliases = transform_dataframe(train_df, fitted_transformer)
>>> test_encoded, test_features, test_aliases = transform_dataframe(test_df, fitted_transformer)
>>> val_encoded, val_features, val_aliases = transform_dataframe(val_df, fitted_transformer)