napistu_torch.load.encoding

DataFrame encoding and transformation utilities.

This module provides functions for automatically selecting encodings, fitting transformers, and transforming DataFrames for use in machine learning pipelines.

Public Functions

auto_encode(graph_df, existing_encodings, encoders=DEFAULT_ENCODERS): Select appropriate encodings for each column in a graph dataframe.
classify_encoding(series, max_categories=50): Classify the appropriate encoding type for a pandas Series.
compose_encoding_configs(encoding_defaults, encoding_overrides=None, encoders=DEFAULT_ENCODERS, verbose=False): Compose encoding configurations with optional overrides.
config_to_column_transformer(encoding_config): Convert a validated encoding config to an sklearn ColumnTransformer.
deduplicate_features(encoded_array, feature_names, min_prefix_length=3): Deduplicate identical feature columns, naming each group by its shortest common prefix.
encode_dataframe(df, encoding_defaults, encoding_overrides=None, encoders=DEFAULT_ENCODERS, deduplicate=True, verbose=False): Fit and transform a DataFrame in one step using the specified configuration.
expand_deduplicated_features(encoded_array, feature_names, feature_aliases): Expand deduplicated features back to their original form using aliases.
fit_encoders(df, encoding_defaults, encoding_overrides=None, encoders=DEFAULT_ENCODERS, verbose=False): Fit encoders on a DataFrame using the specified configuration.
transform_dataframe(df, fitted_transformer, deduplicate=True): Transform a DataFrame using a fitted ColumnTransformer.

Functions

`auto_encode`(graph_df, existing_encodings[, ...])	Select appropriate encodings for each column in a graph dataframe (either the vertex_df or edge_df)
`classify_encoding`(series[, max_categories])	Classify the encoding type for a pandas Series.
`compose_encoding_configs`(encoding_defaults)	Compose encoding configurations with optional overrides.
`config_to_column_transformer`(encoding_config)	Convert validated config dict to sklearn ColumnTransformer.
`deduplicate_features`(encoded_array, ...[, ...])	Deduplicate identical feature columns using shortest common prefix.
`encode_dataframe`(df, encoding_defaults[, ...])	Encode a DataFrame using sklearn transformers with configurable encoding rules.
`expand_deduplicated_features`(encoded_array, ...)	Expand deduplicated features back to original form.
`fit_encoders`(df, encoding_defaults[, ...])	Fit encoding transformers on a DataFrame.
`transform_dataframe`(df, fitted_transformer)	Transform a DataFrame using a fitted ColumnTransformer.

napistu_torch.load.encoding._build_deduplicated_array(encoded_array: ndarray, feature_names: List[str], canonical_mapping: Dict[int, str], unique_columns: set) → Tuple[ndarray, List[str]]

Build deduplicated array and feature name list.

Parameters:

encoded_array (np.ndarray) – Original feature matrix
feature_names (List[str]) – Original feature names
canonical_mapping (Dict[int, str]) – Mapping from kept duplicate indices to canonical names
unique_columns (set) – Names of non-duplicate columns to keep as-is

Returns:

pruned_array (np.ndarray) – Array with duplicate columns removed
canonical_names (List[str]) – Feature names for pruned array

napistu_torch.load.encoding._get_feature_names(preprocessor: ColumnTransformer) → List[str]

Get feature names from fitted ColumnTransformer using sklearn’s standard method.

Parameters: preprocessor (ColumnTransformer) – Fitted ColumnTransformer instance.
Returns: List of feature names in the same order as transform output columns.
Return type: List[str]

Examples

>>> preprocessor = config_to_column_transformer(config)
>>> preprocessor.fit(data)  # Must fit first!
>>> feature_names = _get_feature_names(preprocessor)
>>> # ['cat__node_type_A', 'cat__node_type_B', 'num__weight']

napistu_torch.load.encoding._group_identical_columns(encoded_array: ndarray, feature_names: List[str]) → Dict[int, List[Tuple[int, str]]]

Group columns by identical values using matrix operations.

Parameters:

encoded_array (np.ndarray) – Feature matrix
feature_names (List[str]) – Feature names

Returns:

Mapping from representative column index to list of (index, name) tuples

Return type:

Dict[int, List[Tuple[int, str]]]
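
The grouping step can be illustrated with a minimal sketch. It assumes that only groups with more than one member are returned and that each group is keyed by its lowest column index; `group_identical_columns_sketch` is a hypothetical name, and fingerprinting columns by their raw bytes is a stand-in for whatever matrix operations the library actually uses.

```python
from typing import Dict, List, Tuple

import numpy as np


def group_identical_columns_sketch(
    encoded_array: np.ndarray, feature_names: List[str]
) -> Dict[int, List[Tuple[int, str]]]:
    """Group columns whose values are identical in every row (illustrative only)."""
    by_fingerprint: Dict[bytes, List[Tuple[int, str]]] = {}
    for idx, name in enumerate(feature_names):
        # The raw bytes of a column serve as a hashable fingerprint of its values.
        fingerprint = encoded_array[:, idx].tobytes()
        by_fingerprint.setdefault(fingerprint, []).append((idx, name))
    # Assumption: singletons are dropped and the first member is the representative.
    return {
        members[0][0]: members
        for members in by_fingerprint.values()
        if len(members) > 1
    }


array = np.array([[1, 1, 0], [0, 0, 1]])
names = ["is_string_x", "is_string_y", "value_weight"]
duplicate_groups = group_identical_columns_sketch(array, names)
print(duplicate_groups)  # {0: [(0, 'is_string_x'), (1, 'is_string_y')]}
```

Byte-level comparison treats NaN patterns and signed zeros literally, so a production version would likely need a more careful equality check.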

napistu_torch.load.encoding._resolve_canonical_names(duplicate_groups: Dict[int, List[Tuple[int, str]]], unique_columns: set, min_prefix_length: int) → Tuple[Dict[int, str], Dict[str, str]]

Resolve canonical names for duplicate groups with uniqueness guarantees.

Processes groups serially, checking each proposed canonical name against all previously assigned names and non-duplicate feature names to ensure uniqueness.

Parameters:

duplicate_groups (Dict[int, List[Tuple[int, str]]]) – Groups of duplicate columns, mapping representative index to (index, name) tuples
unique_columns (set) – Names of non-duplicate columns that must not be shadowed
min_prefix_length (int) – Minimum prefix length for canonical names

Returns:

canonical_mapping (Dict[int, str]) – Mapping from kept column index to its canonical name
alias_dict (Dict[str, str]) – Mapping from removed names to canonical names
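
The serial naming procedure described above might look roughly like the following. This is a sketch under stated assumptions: `resolve_canonical_names_sketch` is a hypothetical name, trailing underscores are trimmed from the common prefix (inferred from the `'is_string'` example for `deduplicate_features`), and a name collision falls back to the first member name that is still free.

```python
import os
from typing import Dict, List, Set, Tuple


def resolve_canonical_names_sketch(
    duplicate_groups: Dict[int, List[Tuple[int, str]]],
    unique_columns: Set[str],
    min_prefix_length: int = 3,
) -> Tuple[Dict[int, str], Dict[str, str]]:
    """Assign one unique canonical name per duplicate group (illustrative only)."""
    taken = set(unique_columns)  # names that must not be shadowed
    canonical_mapping: Dict[int, str] = {}
    alias_dict: Dict[str, str] = {}
    for rep_idx, members in duplicate_groups.items():
        names = sorted(name for _, name in members)
        # Character-wise common prefix; trimming "_" is an assumption.
        prefix = os.path.commonprefix(names).rstrip("_")
        candidate = prefix if len(prefix) >= min_prefix_length else names[0]
        if candidate in taken:
            # Assumed fallback: first member name not already in use.
            candidate = next(n for n in names if n not in taken)
        canonical_mapping[rep_idx] = candidate
        taken.add(candidate)
        for idx, name in members:
            if idx != rep_idx:
                alias_dict[name] = candidate  # removed column -> canonical name
    return canonical_mapping, alias_dict


groups = {0: [(0, "is_string_x"), (1, "is_string_y")]}
mapping, aliases = resolve_canonical_names_sketch(groups, {"value_weight"})
print(mapping)  # {0: 'is_string'}
print(aliases)  # {'is_string_y': 'is_string'}
```

Processing groups one at a time against a growing `taken` set is what provides the uniqueness guarantee: a later group can never reuse a name claimed by an earlier group or by a non-duplicate column.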

napistu_torch.load.encoding._validate_feature_names(feature_names: List[str]) → None

Check for duplicates in feature_names.

napistu_torch.load.encoding.auto_encode(graph_df: DataFrame, existing_encodings: Dict | EncodingManager, encoders: Dict = {'binary': 'passthrough', 'categorical': OneHotEncoder(drop='if_binary', sparse_output=False), 'numeric': StandardScaler(), 'sparse_categorical': OneHotEncoder(handle_unknown='ignore', sparse_output=False), 'sparse_numeric': SparseContScaler()}) → EncodingManager

Select appropriate encodings for each column in a graph dataframe (either the vertex_df or edge_df)

Parameters:

graph_df (pd.DataFrame) – The dataframe to select encodings for.
existing_encodings (Union[Dict, EncodingManager]) – The existing encodings to use. This could be VERTEX_DEFAULT_TRANSFORMS or EDGE_DEFAULT_TRANSFORMS or any modified version of these. If a dict is passed, it must be in the ‘simple’ format: a lookup from encoder keys to the columns using that encoder.
encoders (Dict, default=DEFAULT_ENCODERS) – The encoders to use. These map column encoding classes to the encoder instances themselves.

Returns:

A new EncodingManager with the selected encodings.

Return type:

EncodingManager

napistu_torch.load.encoding.classify_encoding(series: Series, max_categories: int = 50) → str | None

Classify the encoding type for a pandas Series.

Parameters:

series (pd.Series) – The column to classify
max_categories (int, default=50) – Maximum number of unique values for categorical encoding. If exceeded, logs a warning and returns None.

Returns:

One of: ‘binary’, ‘categorical’, ‘numeric’, ‘numeric_sparse’, or None. Returns None for constant variables or high-cardinality features.

Return type:

Optional[str]

Examples

>>> classify_encoding(pd.Series([0, 1, 0, 1]))
'binary'
>>> classify_encoding(pd.Series([0, 1, np.nan]))
'categorical'
>>> classify_encoding(pd.Series([1.5, 2.3, 4.1]))
'numeric'
>>> classify_encoding(pd.Series([1.5, np.nan, 4.1]))
'numeric_sparse'
>>> classify_encoding(pd.Series([5, 5, 5, 5])) is None  # Constant
True
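
The documented outcomes above are consistent with a small set of rules. The following is a minimal sketch that reproduces them on plain Python lists rather than pandas Series; `classify_values_sketch` and `_is_missing` are hypothetical names, and the rule ordering is an assumption inferred from the examples, not the library's implementation.

```python
import math
from typing import Optional, Sequence


def _is_missing(value) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def classify_values_sketch(
    values: Sequence, max_categories: int = 50
) -> Optional[str]:
    """Guess an encoding label from raw values (illustrative only)."""
    observed = [v for v in values if not _is_missing(v)]
    has_missing = len(observed) < len(values)
    unique = set(observed)

    # Constant (or empty) columns carry no information.
    if len(unique) <= 1:
        return None
    # Two-level 0/1 indicator: 'binary' if complete, else 'categorical'.
    if unique <= {0, 1}:
        return "categorical" if has_missing else "binary"
    # Remaining numeric columns: missing values select the sparse variant.
    if all(isinstance(v, (int, float)) for v in unique):
        return "numeric_sparse" if has_missing else "numeric"
    # Non-numeric columns with too many levels are not encoded.
    if len(unique) > max_categories:
        return None
    return "categorical"


print(classify_values_sketch([0, 1, 0, 1]))               # binary
print(classify_values_sketch([1.5, float("nan"), 4.1]))   # numeric_sparse
```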

napistu_torch.load.encoding.compose_encoding_configs(encoding_defaults: Dict | EncodingManager, encoding_overrides: Dict | EncodingManager | None = None, encoders: Dict = {'binary': 'passthrough', 'categorical': OneHotEncoder(drop='if_binary', sparse_output=False), 'numeric': StandardScaler(), 'sparse_categorical': OneHotEncoder(handle_unknown='ignore', sparse_output=False), 'sparse_numeric': SparseContScaler()}, verbose: bool = False) → EncodingManager

Compose encoding configurations with optional overrides.

Parameters:

encoding_defaults (Union[Dict, EncodingManager]) – Base encoding configuration.
encoding_overrides (Optional[Union[Dict, EncodingManager]], default=None) – Optional override configuration to merge with defaults. For column conflicts, overrides take precedence.
encoders (Dict, default=DEFAULT_ENCODERS) – Encoder instances to use when configs are in simple format.
verbose (bool, default=False) – If True, log config composition details.

Returns:

Composed configuration (or just defaults if no overrides).

Return type:

EncodingManager

Examples

>>> defaults = {ENCODINGS.NUMERIC: ['col1']}
>>> overrides = {ENCODINGS.CATEGORICAL: ['col2']}
>>> config = compose_encoding_configs(defaults, overrides)
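
To make the precedence rule concrete, here is a minimal sketch of merging two ‘simple’ format configs (encoder key to column list). It assumes conflicts are resolved per column, so an overridden column is removed from its default encoder and other columns are left alone; `compose_simple_configs_sketch` is a hypothetical helper and the real function may resolve conflicts differently.

```python
from typing import Dict, List, Optional


def compose_simple_configs_sketch(
    defaults: Dict[str, List[str]],
    overrides: Optional[Dict[str, List[str]]] = None,
) -> Dict[str, List[str]]:
    """Merge simple-format configs; overrides win on column conflicts."""
    if not overrides:
        return {enc: list(cols) for enc, cols in defaults.items()}
    overridden = {col for cols in overrides.values() for col in cols}
    composed: Dict[str, List[str]] = {}
    for enc, cols in defaults.items():
        # Drop any default assignment for a column the overrides reassign.
        kept = [col for col in cols if col not in overridden]
        if kept:
            composed[enc] = kept
    for enc, cols in overrides.items():
        composed.setdefault(enc, []).extend(cols)
    return composed


defaults = {"numeric": ["col1", "col2"], "binary": ["flag"]}
overrides = {"categorical": ["col2"]}
print(compose_simple_configs_sketch(defaults, overrides))
# {'numeric': ['col1'], 'binary': ['flag'], 'categorical': ['col2']}
```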

napistu_torch.load.encoding.config_to_column_transformer(encoding_config: Dict[str, Dict] | EncodingConfig) → ColumnTransformer

Convert validated config dict to sklearn ColumnTransformer.

Parameters: encoding_config (Union[Dict[str, Dict], EncodingConfig]) – Configuration dictionary (will be validated first).
Returns: sklearn ColumnTransformer ready for fit/transform.
Return type: ColumnTransformer
Raises: ValueError – If config is invalid.

Examples

>>> config = {
...     'categorical': {
...         'columns': ['node_type', 'species_type'],
...         'transformer': OneHotEncoder(handle_unknown='ignore')
...     },
...     'numerical': {
...         'columns': ['weight', 'score'],
...         'transformer': StandardScaler()
...     }
... }
>>> preprocessor = config_to_column_transformer(config)
>>> # Equivalent to:
>>> # ColumnTransformer([
>>> #     ('categorical', OneHotEncoder(handle_unknown='ignore'), ['node_type', 'species_type']),
>>> #     ('numerical', StandardScaler(), ['weight', 'score'])
>>> # ])

napistu_torch.load.encoding.deduplicate_features(encoded_array: ndarray, feature_names: List[str], min_prefix_length: int = 3) → Tuple[ndarray, List[str], Dict[str, str]]

Deduplicate identical feature columns using shortest common prefix.

Ensures all canonical names are unique by checking against both other canonical names and non-duplicate feature names.

Parameters:

encoded_array (np.ndarray) – Feature matrix with potential duplicates, shape (n_samples, n_features)
feature_names (List[str]) – Names corresponding to columns in encoded_array
min_prefix_length (int, default=3) – Minimum prefix length for canonical names. If common prefix is shorter, falls back to alphabetically first name in the group.

Returns:

pruned_array (np.ndarray) – Array with duplicate columns removed
canonical_names (List[str]) – Names of kept features (using shortest common prefix for duplicates)
feature_aliases (Dict[str, str]) – Mapping from removed feature names to their canonical representatives

Raises:

ValueError – If feature_names contains duplicates

Examples

>>> array = np.array([[1, 1, 0], [0, 0, 1]])
>>> names = ['is_string_x', 'is_string_y', 'value_weight']
>>> pruned, canonical, aliases = deduplicate_features(array, names)
>>> canonical
['is_string', 'value_weight']
>>> aliases
{'is_string_y': 'is_string'}

napistu_torch.load.encoding.encode_dataframe(df: DataFrame, encoding_defaults: Dict[str, Dict] | EncodingManager, encoding_overrides: Dict[str, Dict] | EncodingManager | None = None, encoders: Dict = {'binary': 'passthrough', 'categorical': OneHotEncoder(drop='if_binary', sparse_output=False), 'numeric': StandardScaler(), 'sparse_categorical': OneHotEncoder(handle_unknown='ignore', sparse_output=False), 'sparse_numeric': SparseContScaler()}, deduplicate: bool = True, verbose: bool = False) → tuple[ndarray, List[str]]

Encode a DataFrame using sklearn transformers with configurable encoding rules.

This is a convenience function that combines fitting and transforming in one step. For more control (e.g., fitting on training data and transforming test data), use fit_encoders() and transform_dataframe() separately.

This function applies a series of transformations to a DataFrame based on encoding configurations. It supports both default encoding rules and optional overrides that can modify or extend the default behavior.

Parameters:

df (pd.DataFrame) – Input DataFrame to be encoded. Must contain all columns specified in the encoding configurations.
encoding_defaults (Union[Dict[str, Dict], EncodingManager]) – Base encoding configuration dictionary. Each key is a transform name and each value is a dict with ‘columns’ and ‘transformer’ keys. Example: {'categorical': {'columns': ['col1', 'col2'], 'transformer': OneHotEncoder()}, 'numerical': {'columns': ['col3'], 'transformer': StandardScaler()}}
encoding_overrides (Optional[Union[Dict[str, Dict], EncodingManager]], default=None) – Optional override configuration that will be merged with encoding_defaults. For column conflicts, the override configuration takes precedence. If None, only encoding_defaults will be used.
encoders (Dict, default=DEFAULT_ENCODERS) – The encoders to use. If encoding_defaults or encoding_overrides are dicts in the ‘simple’ format (a lookup from encoder keys to the columns using that encoder), these map the encoder keys to the encoders themselves.
deduplicate (bool, default=True) – If True, deduplicate identical features and name the resulting columns using the shortest common prefix of the merged columns.
verbose (bool, default=False) – If True, log detailed information about config composition and conflicts.

Returns:

encoded_array (np.ndarray) – Transformed numpy array with encoded features. The number of columns may differ from the input due to transformations like OneHotEncoder.
feature_names (List[str]) – List of feature names corresponding to the columns in encoded_array. Names follow sklearn’s convention: ‘transform_name__column_name’.
feature_aliases (Dict[str, str]) – Mapping from removed feature names to their canonical names if deduplicate is True; an empty dictionary if deduplicate is False.

Return type:

tuple[np.ndarray, List[str]]

Raises:

ValueError – If encoding configurations are invalid, have column conflicts, or if required columns are missing from the input DataFrame.
KeyError – If the input DataFrame is missing columns specified in the encoding config.

Examples

>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder, StandardScaler
>>>
>>> # Sample data
>>> df = pd.DataFrame({
...     'category': ['A', 'B', 'A', 'C'],
...     'value': [1.0, 2.0, 3.0, 4.0]
... })
>>>
>>> # Encoding configuration
>>> defaults = {
...     'categorical': {
...         'columns': ['category'],
...         'transformer': OneHotEncoder(sparse_output=False)
...     },
...     'numerical': {
...         'columns': ['value'],
...         'transformer': StandardScaler()
...     }
... }
>>>
>>> # Encode the DataFrame (fit and transform in one step)
>>> encoded_array, feature_names = encode_dataframe(df, defaults)
>>> print(f"Encoded shape: {encoded_array.shape}")
>>> print(f"Feature names: {feature_names}")
>>>
>>> # For train/test split, use the two-step approach:
>>> fitted_transformer = fit_encoders(train_df, defaults)
>>> train_encoded, train_features, train_aliases = transform_dataframe(train_df, fitted_transformer)
>>> test_encoded, test_features, test_aliases = transform_dataframe(test_df, fitted_transformer)

napistu_torch.load.encoding.expand_deduplicated_features(encoded_array: ndarray, feature_names: List[str], feature_aliases: Dict[str, str]) → Tuple[ndarray, List[str]]

Expand deduplicated features back to original form.

Takes a deduplicated feature matrix and expands it by duplicating columns for all aliased features. The expanded array will have the same number of columns as the original pre-deduplication array (though column order may differ).

Parameters:

encoded_array (np.ndarray) – Deduplicated feature matrix, shape (n_samples, n_deduplicated_features)
feature_names (List[str]) – Canonical feature names corresponding to encoded_array columns
feature_aliases (Dict[str, str]) – Mapping from removed feature names to canonical names (output from deduplicate_features). This can be a subset of the dictionary to only restore specific aliases.

Returns:

expanded_array (np.ndarray) – Array with aliased columns duplicated, shape (n_samples, n_original_features)
expanded_names (List[str]) – Feature names for expanded array (includes all original names)

Examples

>>> # After deduplication
>>> deduplicated = np.array([[1, 0], [0, 1]])
>>> names = ['is_string', 'value_weight']
>>> aliases = {'is_string_x': 'is_string', 'is_string_y': 'is_string'}
>>>
>>> expanded, expanded_names = expand_deduplicated_features(
...     deduplicated, names, aliases
... )
>>> expanded.shape
(2, 4)  # 2 samples, 4 features (is_string, is_string_x, is_string_y, value_weight)
>>> expanded_names
['is_string', 'is_string_x', 'is_string_y', 'value_weight']
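
The expansion step amounts to copying each canonical column once per alias. The sketch below is illustrative only: `expand_features_sketch` is a hypothetical name, and it appends alias columns at the end, so its column order differs from the output shown above (the documentation notes that order may differ).

```python
from typing import Dict, List, Tuple

import numpy as np


def expand_features_sketch(
    encoded_array: np.ndarray,
    feature_names: List[str],
    feature_aliases: Dict[str, str],
) -> Tuple[np.ndarray, List[str]]:
    """Duplicate canonical columns for every alias (illustrative only)."""
    column_of = {name: idx for idx, name in enumerate(feature_names)}
    alias_names = list(feature_aliases)
    if not alias_names:
        return encoded_array.copy(), list(feature_names)
    # Look up the canonical column each alias points to and copy it.
    alias_columns = [
        encoded_array[:, column_of[feature_aliases[alias]]] for alias in alias_names
    ]
    expanded = np.column_stack([encoded_array, *alias_columns])
    return expanded, list(feature_names) + alias_names


deduplicated = np.array([[1, 0], [0, 1]])
names = ["is_string", "value_weight"]
aliases = {"is_string_x": "is_string", "is_string_y": "is_string"}
expanded, expanded_names = expand_features_sketch(deduplicated, names, aliases)
print(expanded.shape)   # (2, 4)
print(expanded_names)   # ['is_string', 'value_weight', 'is_string_x', 'is_string_y']
```

Passing only a subset of the alias dictionary restores just those aliases, which matches the behavior described for the feature_aliases parameter.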

napistu_torch.load.encoding.fit_encoders(df: DataFrame, encoding_defaults: Dict[str, Dict] | EncodingManager, encoding_overrides: Dict[str, Dict] | EncodingManager | None = None, encoders: Dict = {'binary': 'passthrough', 'categorical': OneHotEncoder(drop='if_binary', sparse_output=False), 'numeric': StandardScaler(), 'sparse_categorical': OneHotEncoder(handle_unknown='ignore', sparse_output=False), 'sparse_numeric': SparseContScaler()}, verbose: bool = False) → ColumnTransformer

Fit encoding transformers on a DataFrame.

This function creates and fits a ColumnTransformer based on encoding configurations. The fitted transformer can then be used to transform this DataFrame or other DataFrames with the same schema.

Parameters:

df (pd.DataFrame) – Input DataFrame to fit encoders on. Must contain all columns specified in the encoding configurations.
encoding_defaults (Union[Dict[str, Dict], EncodingManager]) – Base encoding configuration. Each key is a transform name and each value is a dict with ‘columns’ and ‘transformer’ keys.
encoding_overrides (Optional[Union[Dict[str, Dict], EncodingManager]], default=None) – Optional override configuration that will be merged with encoding_defaults. For column conflicts, the override configuration takes precedence.
encoders (Dict, default=DEFAULT_ENCODERS) – Encoder instances to use when configs are in simple format.
verbose (bool, default=False) – If True, log detailed information about config composition and conflicts.

Returns:

Fitted sklearn ColumnTransformer ready to transform data.

Return type:

ColumnTransformer

Raises:

ValueError – If encoding configurations are invalid or if the DataFrame is empty.
KeyError – If the input DataFrame is missing columns specified in the encoding config.

Examples

>>> # Fit encoders on training data
>>> fitted_transformer = fit_encoders(train_df, encoding_defaults)
>>>
>>> # Use the fitted transformer on train and test data
>>> train_encoded, train_features, train_aliases = transform_dataframe(train_df, fitted_transformer)
>>> test_encoded, test_features, test_aliases = transform_dataframe(test_df, fitted_transformer)
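
Why fitting and transforming are kept separate can be shown without sklearn. The sketch below is a dependency-free stand-in for a single StandardScaler column (population standard deviation, as StandardScaler uses); `fit_scaler_sketch` and `transform_scaler_sketch` are hypothetical names and not part of this module.

```python
from statistics import fmean, pstdev
from typing import List, Tuple


def fit_scaler_sketch(train_values: List[float]) -> Tuple[float, float]:
    """Learn centering and scaling statistics from the training data only."""
    std = pstdev(train_values)
    # Guard against zero variance so the transform never divides by zero.
    return fmean(train_values), std or 1.0


def transform_scaler_sketch(
    values: List[float], params: Tuple[float, float]
) -> List[float]:
    """Apply previously fitted statistics to any data with the same schema."""
    mean, std = params
    return [(value - mean) / std for value in values]


train = [1.0, 2.0, 3.0]
test = [4.0]

params = fit_scaler_sketch(train)                    # fit once, on train only
train_scaled = transform_scaler_sketch(train, params)
test_scaled = transform_scaler_sketch(test, params)  # reuse the train statistics
```

Because the test values are scaled with the training mean and standard deviation, no information from the test set leaks into the fitted parameters, and both sets end up on the same scale.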

napistu_torch.load.encoding.transform_dataframe(df: DataFrame, fitted_transformer: ColumnTransformer, deduplicate: bool = True) → tuple[ndarray, List[str], Dict[str, str]]

Transform a DataFrame using a fitted ColumnTransformer.

This function applies pre-fitted transformations to a DataFrame. The transformer must have been fitted previously using fit_encoders() or by calling .fit() directly.

Parameters:

df (pd.DataFrame) – Input DataFrame to transform. Must contain all columns that the transformer expects.
fitted_transformer (ColumnTransformer) – A fitted sklearn ColumnTransformer instance.
deduplicate (bool, default=True) – If True, deduplicate identical features and name the resulting columns using the shortest common prefix of the merged columns.

Returns:

encoded_array (np.ndarray) – Transformed numpy array with encoded features.
feature_names (List[str]) – List of feature names corresponding to columns in encoded_array.
feature_aliases (Dict[str, str]) – Mapping from removed feature names to their canonical names if deduplicate is True; an empty dictionary if deduplicate is False.

Return type:

tuple[np.ndarray, List[str], Dict[str, str]]

Raises:

ValueError – If the transformer is not fitted or if the DataFrame is empty.
KeyError – If the DataFrame is missing columns required by the transformer.

Examples

>>> # Fit on training data
>>> fitted_transformer = fit_encoders(train_df, encoding_config)
>>>
>>> # Transform multiple DataFrames with same fitted transformer
>>> train_encoded, train_features, train_aliases = transform_dataframe(train_df, fitted_transformer)
>>> test_encoded, test_features, test_aliases = transform_dataframe(test_df, fitted_transformer)
>>> val_encoded, val_features, val_aliases = transform_dataframe(val_df, fitted_transformer)