napistu_torch.load.encoding

DataFrame encoding and transformation utilities.

This module provides functions for automatically selecting encodings, fitting transformers, and transforming DataFrames for use in machine learning pipelines.

Public Functions

auto_encode(graph_df, existing_encodings, encoders=DEFAULT_ENCODERS)

Select appropriate encodings for each column in a graph dataframe.

classify_encoding(series, max_categories=50)

Classify the appropriate encoding type for a pandas Series.

compose_encoding_configs(encoding_defaults, encoding_overrides=None, encoders=DEFAULT_ENCODERS, verbose=False)

Compose encoding configurations with optional overrides.

deduplicate_features(encoded_array, feature_names, min_prefix_length=3)

Deduplicate identical feature columns using shortest common prefix.

config_to_column_transformer(encoding_config)

Convert a validated encoding config to an sklearn ColumnTransformer.

encode_dataframe(df, encoding_defaults, encoding_overrides=None, encoders=DEFAULT_ENCODERS, deduplicate=True, verbose=False)

Fit encoders on a DataFrame and transform it in one step.

expand_deduplicated_features(encoded_array, feature_names, feature_aliases)

Expand deduplicated features back to their original form using aliases.

fit_encoders(df, encoding_defaults, encoding_overrides=None, encoders=DEFAULT_ENCODERS, verbose=False)

Fit encoders on a DataFrame using the specified configuration.

transform_dataframe(df, fitted_transformer, deduplicate=True)

Transform a DataFrame using a fitted ColumnTransformer.

Functions

auto_encode(graph_df, existing_encodings[, ...])

Select appropriate encodings for each column in a graph dataframe (either the vertex_df or edge_df).

classify_encoding(series[, max_categories])

Classify the encoding type for a pandas Series.

compose_encoding_configs(encoding_defaults)

Compose encoding configurations with optional overrides.

config_to_column_transformer(encoding_config)

Convert validated config dict to sklearn ColumnTransformer.

deduplicate_features(encoded_array, ...[, ...])

Deduplicate identical feature columns using shortest common prefix.

encode_dataframe(df, encoding_defaults[, ...])

Encode a DataFrame using sklearn transformers with configurable encoding rules.

expand_deduplicated_features(encoded_array, ...)

Expand deduplicated features back to original form.

fit_encoders(df, encoding_defaults[, ...])

Fit encoding transformers on a DataFrame.

transform_dataframe(df, fitted_transformer)

Transform a DataFrame using a fitted ColumnTransformer.

napistu_torch.load.encoding._build_deduplicated_array(encoded_array: ndarray, feature_names: List[str], canonical_mapping: Dict[int, str], unique_columns: set) → Tuple[ndarray, List[str]]

Build deduplicated array and feature name list.

Parameters:
  • encoded_array (np.ndarray) – Original feature matrix

  • feature_names (List[str]) – Original feature names

  • canonical_mapping (Dict[int, str]) – Mapping from kept duplicate indices to canonical names

  • unique_columns (set) – Names of non-duplicate columns to keep as-is

Returns:

  • pruned_array (np.ndarray) – Array with duplicate columns removed

  • canonical_names (List[str]) – Feature names for pruned array

napistu_torch.load.encoding._get_feature_names(preprocessor: ColumnTransformer) → List[str]

Get feature names from fitted ColumnTransformer using sklearn’s standard method.

Parameters:

preprocessor (ColumnTransformer) – Fitted ColumnTransformer instance.

Returns:

List of feature names in the same order as transform output columns.

Return type:

List[str]

Examples

>>> preprocessor = config_to_column_transformer(config)
>>> preprocessor.fit(data)  # Must fit first!
>>> feature_names = _get_feature_names(preprocessor)
>>> # ['cat__node_type_A', 'cat__node_type_B', 'num__weight']

napistu_torch.load.encoding._group_identical_columns(encoded_array: ndarray, feature_names: List[str]) → Dict[int, List[Tuple[int, str]]]

Group columns by identical values using matrix operations.

Parameters:
  • encoded_array (np.ndarray) – Feature matrix

  • feature_names (List[str]) – Feature names

Returns:

Mapping from representative column index to list of (index, name) tuples

Return type:

Dict[int, List[Tuple[int, str]]]
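
The grouping step can be illustrated with a minimal, standard-library sketch. This is not the library's implementation (which is described as using matrix operations); the function name `group_identical_columns_sketch` is hypothetical, and it is an assumption here that singleton groups are included in the result.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_identical_columns_sketch(
    rows: Sequence[Sequence[float]], feature_names: List[str]
) -> Dict[int, List[Tuple[int, str]]]:
    """Hypothetical helper: group columns whose values match in every row."""
    by_values: Dict[tuple, List[Tuple[int, str]]] = defaultdict(list)
    for idx, name in enumerate(feature_names):
        # The tuple of column values acts as a hashable fingerprint
        column = tuple(row[idx] for row in rows)
        by_values[column].append((idx, name))
    # Key each group by its first (representative) column index
    return {members[0][0]: members for members in by_values.values()}


groups = group_identical_columns_sketch(
    [[1, 1, 0], [0, 0, 1]],
    ["is_string_x", "is_string_y", "value_weight"],
)
print(groups)
```

Here the first two columns share a fingerprint and land in one group keyed by index 0, while the third column forms its own group.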

napistu_torch.load.encoding._resolve_canonical_names(duplicate_groups: Dict[int, List[Tuple[int, str]]], unique_columns: set, min_prefix_length: int) → Tuple[Dict[int, str], Dict[str, str]]

Resolve canonical names for duplicate groups with uniqueness guarantees.

Processes groups serially, checking each proposed canonical name against all previously assigned names and non-duplicate feature names to ensure uniqueness.

Parameters:
  • duplicate_groups (Dict[int, List[Tuple[int, str]]]) – Groups of duplicate columns, mapping representative index to (index, name) tuples

  • unique_columns (set) – Names of non-duplicate columns that must not be shadowed

  • min_prefix_length (int) – Minimum prefix length for canonical names

Returns:

  • canonical_mapping (Dict[int, str]) – Mapping from kept column index to its canonical name

  • alias_dict (Dict[str, str]) – Mapping from removed names to canonical names
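
A rough sketch of serial name resolution follows, under stated assumptions: `resolve_canonical_names_sketch` is a hypothetical name, trailing underscores are trimmed from the common prefix, and a prefix that collides with an already used name falls back to the alphabetically first member of the group. The library's actual collision handling may differ.

```python
import os
from typing import Dict, List, Set, Tuple


def resolve_canonical_names_sketch(
    duplicate_groups: Dict[int, List[Tuple[int, str]]],
    unique_columns: Set[str],
    min_prefix_length: int = 3,
) -> Tuple[Dict[int, str], Dict[str, str]]:
    """Hypothetical helper: pick one unique canonical name per duplicate group."""
    taken = set(unique_columns)  # names that must not be shadowed
    canonical_mapping: Dict[int, str] = {}
    alias_dict: Dict[str, str] = {}
    for rep_idx, members in duplicate_groups.items():
        names = [name for _, name in members]
        # Character-wise common prefix; trimming "_" is an assumption
        candidate = os.path.commonprefix(names).rstrip("_")
        if len(candidate) < min_prefix_length or candidate in taken:
            candidate = sorted(names)[0]  # fallback: alphabetically first name
        taken.add(candidate)
        canonical_mapping[rep_idx] = candidate
        for idx, name in members:
            if idx != rep_idx:  # only removed columns get an alias entry
                alias_dict[name] = candidate
    return canonical_mapping, alias_dict


mapping, aliases = resolve_canonical_names_sketch(
    {0: [(0, "is_string_x"), (1, "is_string_y")]}, {"value_weight"}
)
short_mapping, short_aliases = resolve_canonical_names_sketch(
    {0: [(0, "ab_x"), (1, "ab_y")]}, set()
)
print(mapping, aliases)
print(short_mapping, short_aliases)
```

The second call shows the fallback: the shared prefix `ab` is shorter than `min_prefix_length`, so the group keeps the alphabetically first original name.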

napistu_torch.load.encoding._validate_feature_names(feature_names: List[str]) → None

Check for duplicates in feature_names.

napistu_torch.load.encoding.auto_encode(graph_df: DataFrame, existing_encodings: Dict | EncodingManager, encoders: Dict = {'binary': 'passthrough', 'categorical': OneHotEncoder(drop='if_binary', sparse_output=False), 'numeric': StandardScaler(), 'sparse_categorical': OneHotEncoder(handle_unknown='ignore', sparse_output=False), 'sparse_numeric': SparseContScaler()}) → EncodingManager

Select appropriate encodings for each column in a graph dataframe (either the vertex_df or edge_df).

Parameters:
  • graph_df (pd.DataFrame) – The dataframe to select encodings for.

  • existing_encodings (Union[Dict, EncodingManager]) – The existing encodings to use. This could be VERTEX_DEFAULT_TRANSFORMS or EDGE_DEFAULT_TRANSFORMS or any modified version of these.

  • encoders (Dict, default=DEFAULT_ENCODERS) – The encoders to use. These map column encoding classes to the encoders themselves. If existing_encodings is a dict, it must be passed in the ‘simple’ format, which is a lookup from encoder keys to the columns using that encoder.

Returns:

A new EncodingManager with the selected encodings.

Return type:

EncodingManager

napistu_torch.load.encoding.classify_encoding(series: Series, max_categories: int = 50) → str | None

Classify the encoding type for a pandas Series.

Parameters:
  • series (pd.Series) – The column to classify

  • max_categories (int, default=50) – Maximum number of unique values for categorical encoding. If exceeded, logs a warning and returns None.

Returns:

One of: ‘binary’, ‘categorical’, ‘numeric’, ‘numeric_sparse’, or None. Returns None for constant variables or high-cardinality features.

Return type:

Optional[str]

Examples

>>> import numpy as np
>>> import pandas as pd
>>> classify_encoding(pd.Series([0, 1, 0, 1]))
'binary'
>>> classify_encoding(pd.Series([0, 1, np.nan]))
'categorical'
>>> classify_encoding(pd.Series([1.5, 2.3, 4.1]))
'numeric'
>>> classify_encoding(pd.Series([1.5, np.nan, 4.1]))
'numeric_sparse'
>>> classify_encoding(pd.Series([5, 5, 5, 5])) is None  # Constant
True
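
To make the decision rules behind these examples concrete, here is a minimal sketch that reproduces the documented outputs on plain Python lists. The rules are inferred from the examples above and are an assumption, not the library's code; `classify_encoding_sketch` is a hypothetical name.

```python
import math
from typing import List, Optional


def classify_encoding_sketch(values: List[object], max_categories: int = 50) -> Optional[str]:
    """Hypothetical heuristic inferred from the documented examples."""

    def is_missing(v: object) -> bool:
        return v is None or (isinstance(v, float) and math.isnan(v))

    present = [v for v in values if not is_missing(v)]
    has_missing = len(present) < len(values)
    distinct = set(present)
    if len(distinct) <= 1:
        return None  # constant column: nothing to encode
    if distinct <= {0, 1}:
        # 0/1 values are binary unless missing values add a third state
        return "categorical" if has_missing else "binary"
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        return "numeric_sparse" if has_missing else "numeric"
    if len(distinct) > max_categories:
        return None  # high cardinality: skip rather than one-hot encode
    return "categorical"


nan = float("nan")
results = [
    classify_encoding_sketch([0, 1, 0, 1]),
    classify_encoding_sketch([0, 1, nan]),
    classify_encoding_sketch([1.5, 2.3, 4.1]),
    classify_encoding_sketch([1.5, nan, 4.1]),
    classify_encoding_sketch([5, 5, 5, 5]),
]
print(results)
```

The real function also logs a warning when `max_categories` is exceeded; logging is omitted here for brevity.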

napistu_torch.load.encoding.compose_encoding_configs(encoding_defaults: Dict | EncodingManager, encoding_overrides: Dict | EncodingManager | None = None, encoders: Dict = {'binary': 'passthrough', 'categorical': OneHotEncoder(drop='if_binary', sparse_output=False), 'numeric': StandardScaler(), 'sparse_categorical': OneHotEncoder(handle_unknown='ignore', sparse_output=False), 'sparse_numeric': SparseContScaler()}, verbose: bool = False) → EncodingManager

Compose encoding configurations with optional overrides.

Parameters:
  • encoding_defaults (Union[Dict, EncodingManager]) – Base encoding configuration.

  • encoding_overrides (Optional[Union[Dict, EncodingManager]], default=None) – Optional override configuration to merge with defaults. For column conflicts, overrides take precedence.

  • encoders (Dict, default=DEFAULT_ENCODERS) – Encoder instances to use when configs are in simple format.

  • verbose (bool, default=False) – If True, log config composition details.

Returns:

Composed configuration (or just defaults if no overrides).

Return type:

EncodingManager

Examples

>>> defaults = {ENCODINGS.NUMERIC: ['col1']}
>>> overrides = {ENCODINGS.CATEGORICAL: ['col2']}
>>> config = compose_encoding_configs(defaults, overrides)
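
The override precedence described above can be sketched for ‘simple’ format dicts (encoder key to column list). This is an illustration only: `compose_simple_configs_sketch` is a hypothetical helper that returns a plain dict, whereas the real function returns an EncodingManager.

```python
from typing import Dict, List, Optional


def compose_simple_configs_sketch(
    defaults: Dict[str, List[str]],
    overrides: Optional[Dict[str, List[str]]] = None,
) -> Dict[str, List[str]]:
    """Hypothetical helper: merge configs, letting overrides win per column."""
    if not overrides:
        return {key: list(cols) for key, cols in defaults.items()}
    overridden = {col for cols in overrides.values() for col in cols}
    composed: Dict[str, List[str]] = {}
    for key, cols in defaults.items():
        # Drop any default assignment for a column the overrides re-assign
        kept = [col for col in cols if col not in overridden]
        if kept:
            composed[key] = kept
    for key, cols in overrides.items():
        composed.setdefault(key, []).extend(cols)
    return composed


defaults = {"numeric": ["col1", "col2"]}
overrides = {"categorical": ["col2"]}
composed = compose_simple_configs_sketch(defaults, overrides)
print(composed)
```

In this sketch `col2` moves from the numeric encoder to the categorical one because the override claims it, and `col1` keeps its default.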

napistu_torch.load.encoding.config_to_column_transformer(encoding_config: Dict[str, Dict] | EncodingConfig) → ColumnTransformer

Convert validated config dict to sklearn ColumnTransformer.

Parameters:

encoding_config (Union[Dict[str, Dict], EncodingConfig]) – Configuration dictionary (will be validated first).

Returns:

sklearn ColumnTransformer ready for fit/transform.

Return type:

ColumnTransformer

Raises:

ValueError – If config is invalid.

Examples

>>> from sklearn.preprocessing import OneHotEncoder, StandardScaler
>>> config = {
...     'categorical': {
...         'columns': ['node_type', 'species_type'],
...         'transformer': OneHotEncoder(handle_unknown='ignore')
...     },
...     'numerical': {
...         'columns': ['weight', 'score'],
...         'transformer': StandardScaler()
...     }
... }
>>> preprocessor = config_to_column_transformer(config)
>>> # Equivalent to:
>>> # ColumnTransformer([
>>> #     ('categorical', OneHotEncoder(handle_unknown='ignore'), ['node_type', 'species_type']),
>>> #     ('numerical', StandardScaler(), ['weight', 'score'])
>>> # ])

napistu_torch.load.encoding.deduplicate_features(encoded_array: ndarray, feature_names: List[str], min_prefix_length: int = 3) → Tuple[ndarray, List[str], Dict[str, str]]

Deduplicate identical feature columns using shortest common prefix.

Ensures all canonical names are unique by checking against both other canonical names and non-duplicate feature names.

Parameters:
  • encoded_array (np.ndarray) – Feature matrix with potential duplicates, shape (n_samples, n_features)

  • feature_names (List[str]) – Names corresponding to columns in encoded_array

  • min_prefix_length (int, default=3) – Minimum prefix length for canonical names. If common prefix is shorter, falls back to alphabetically first name in the group.

Returns:

  • pruned_array (np.ndarray) – Array with duplicate columns removed

  • canonical_names (List[str]) – Names of kept features (using shortest common prefix for duplicates)

  • feature_aliases (Dict[str, str]) – Mapping from removed feature names to their canonical representatives

Raises:

ValueError – If feature_names contains duplicates

Examples

>>> import numpy as np
>>> array = np.array([[1, 1, 0], [0, 0, 1]])
>>> names = ['is_string_x', 'is_string_y', 'value_weight']
>>> pruned, canonical, aliases = deduplicate_features(array, names)
>>> canonical
['is_string', 'value_weight']
>>> aliases
{'is_string_y': 'is_string'}

napistu_torch.load.encoding.encode_dataframe(df: DataFrame, encoding_defaults: Dict[str, Dict] | EncodingManager, encoding_overrides: Dict[str, Dict] | EncodingManager | None = None, encoders: Dict = {'binary': 'passthrough', 'categorical': OneHotEncoder(drop='if_binary', sparse_output=False), 'numeric': StandardScaler(), 'sparse_categorical': OneHotEncoder(handle_unknown='ignore', sparse_output=False), 'sparse_numeric': SparseContScaler()}, deduplicate: bool = True, verbose: bool = False) → tuple[ndarray, List[str]]

Encode a DataFrame using sklearn transformers with configurable encoding rules.

This is a convenience function that combines fitting and transforming in one step. For more control (e.g., fitting on training data and transforming test data), use fit_encoders() and transform_dataframe() separately.

This function applies a series of transformations to a DataFrame based on encoding configurations. It supports both default encoding rules and optional overrides that can modify or extend the default behavior.

Parameters:
  • df (pd.DataFrame) – Input DataFrame to be encoded. Must contain all columns specified in the encoding configurations.

  • encoding_defaults (Union[Dict[str, Dict], EncodingManager]) –

    Base encoding configuration dictionary. Each key is a transform name and each value is a dict with 'columns' and 'transformer' keys. Example:

        {
            'categorical': {
                'columns': ['col1', 'col2'],
                'transformer': OneHotEncoder()
            },
            'numerical': {
                'columns': ['col3'],
                'transformer': StandardScaler()
            }
        }

  • encoding_overrides (Optional[Union[Dict[str, Dict], EncodingManager]], default=None) – Optional override configuration that will be merged with encoding_defaults. For column conflicts, the override configuration takes precedence. If None, only encoding_defaults will be used.

  • encoders (Dict, default=DEFAULT_ENCODERS) – The encoders to use. If encoding_defaults or encoding_overrides are dicts in the ‘simple’ format (a lookup from encoder keys to the columns using that encoder), these map each encoder key to the encoder itself.

  • deduplicate (bool, default=True) – If True, deduplicate identical features and name the resulting columns using the shortest common prefix of the merged columns.

  • verbose (bool, default=False) – If True, log detailed information about config composition and conflicts.

Returns:

A tuple containing:

  • encoded_array (np.ndarray) – Transformed numpy array with encoded features. The number of columns may differ from the input due to transformations like OneHotEncoder.

  • feature_names (List[str]) – List of feature names corresponding to the columns in encoded_array. Names follow sklearn’s convention: ‘transform_name__column_name’.

  • feature_aliases (Dict[str, str]) – Mapping from feature names to their aliases. If deduplicate is True, this maps feature names to their canonical names; if deduplicate is False, it is an empty dictionary.

Return type:

tuple[np.ndarray, List[str], Dict[str, str]]

Raises:
  • ValueError – If encoding configurations are invalid, have column conflicts, or if required columns are missing from the input DataFrame.

  • KeyError – If the input DataFrame is missing columns specified in the encoding config.

Examples

>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder, StandardScaler
>>>
>>> # Sample data
>>> df = pd.DataFrame({
...     'category': ['A', 'B', 'A', 'C'],
...     'value': [1.0, 2.0, 3.0, 4.0]
... })
>>>
>>> # Encoding configuration
>>> defaults = {
...     'categorical': {
...         'columns': ['category'],
...         'transformer': OneHotEncoder(sparse_output=False)
...     },
...     'numerical': {
...         'columns': ['value'],
...         'transformer': StandardScaler()
...     }
... }
>>>
>>> # Encode the DataFrame (fit and transform in one step)
>>> encoded_array, feature_names, feature_aliases = encode_dataframe(df, defaults)
>>> print(f"Encoded shape: {encoded_array.shape}")
>>> print(f"Feature names: {feature_names}")
>>>
>>> # For train/test split, use the two-step approach:
>>> fitted_transformer = fit_encoders(train_df, defaults)
>>> train_encoded, train_features, train_aliases = transform_dataframe(train_df, fitted_transformer)
>>> test_encoded, test_features, test_aliases = transform_dataframe(test_df, fitted_transformer)

napistu_torch.load.encoding.expand_deduplicated_features(encoded_array: ndarray, feature_names: List[str], feature_aliases: Dict[str, str]) → Tuple[ndarray, List[str]]

Expand deduplicated features back to original form.

Takes a deduplicated feature matrix and expands it by duplicating columns for all aliased features. The expanded array will have the same number of columns as the original pre-deduplication array (though column order may differ).

Parameters:
  • encoded_array (np.ndarray) – Deduplicated feature matrix, shape (n_samples, n_deduplicated_features)

  • feature_names (List[str]) – Canonical feature names corresponding to encoded_array columns

  • feature_aliases (Dict[str, str]) – Mapping from removed feature names to canonical names (output from deduplicate_features). This can be a subset of the dictionary to only restore specific aliases.

Returns:

  • expanded_array (np.ndarray) – Array with aliased columns duplicated, shape (n_samples, n_original_features)

  • expanded_names (List[str]) – Feature names for expanded array (includes all original names)

Examples

>>> import numpy as np
>>> # After deduplication
>>> deduplicated = np.array([[1, 0], [0, 1]])
>>> names = ['is_string', 'value_weight']
>>> aliases = {'is_string_x': 'is_string', 'is_string_y': 'is_string'}
>>>
>>> expanded, expanded_names = expand_deduplicated_features(
...     deduplicated, names, aliases
... )
>>> expanded.shape
(2, 4)  # 2 samples, 4 features (is_string, is_string_x, is_string_y, value_weight)
>>> expanded_names
['is_string', 'is_string_x', 'is_string_y', 'value_weight']
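
The expansion itself amounts to copying each canonical column once per alias. A minimal sketch on nested lists follows; `expand_deduplicated_sketch` is a hypothetical name, and unlike the output shown above it appends alias columns at the end, which the description permits since column order may differ.

```python
from typing import Dict, List, Sequence, Tuple


def expand_deduplicated_sketch(
    rows: Sequence[Sequence[float]],
    feature_names: List[str],
    feature_aliases: Dict[str, str],
) -> Tuple[List[List[float]], List[str]]:
    """Hypothetical helper: re-create aliased columns as copies of their canonical column."""
    position = {name: i for i, name in enumerate(feature_names)}
    expanded_names = list(feature_names)
    source_idx = list(range(len(feature_names)))
    for alias, canonical in feature_aliases.items():
        expanded_names.append(alias)
        source_idx.append(position[canonical])  # copy from the canonical column
    expanded_rows = [[row[i] for i in source_idx] for row in rows]
    return expanded_rows, expanded_names


expanded_rows, expanded_names = expand_deduplicated_sketch(
    [[1, 0], [0, 1]],
    ["is_string", "value_weight"],
    {"is_string_x": "is_string", "is_string_y": "is_string"},
)
print(expanded_names)
print(expanded_rows)
```

Passing only a subset of the alias dictionary restores only those columns, matching the behavior described for the feature_aliases parameter.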

napistu_torch.load.encoding.fit_encoders(df: DataFrame, encoding_defaults: Dict[str, Dict] | EncodingManager, encoding_overrides: Dict[str, Dict] | EncodingManager | None = None, encoders: Dict = {'binary': 'passthrough', 'categorical': OneHotEncoder(drop='if_binary', sparse_output=False), 'numeric': StandardScaler(), 'sparse_categorical': OneHotEncoder(handle_unknown='ignore', sparse_output=False), 'sparse_numeric': SparseContScaler()}, verbose: bool = False) → ColumnTransformer

Fit encoding transformers on a DataFrame.

This function creates and fits a ColumnTransformer based on encoding configurations. The fitted transformer can then be used to transform this DataFrame or other DataFrames with the same schema.

Parameters:
  • df (pd.DataFrame) – Input DataFrame to fit encoders on. Must contain all columns specified in the encoding configurations.

  • encoding_defaults (Union[Dict[str, Dict], EncodingManager]) – Base encoding configuration. Each key is a transform name and each value is a dict with ‘columns’ and ‘transformer’ keys.

  • encoding_overrides (Optional[Union[Dict[str, Dict], EncodingManager]], default=None) – Optional override configuration that will be merged with encoding_defaults. For column conflicts, the override configuration takes precedence.

  • encoders (Dict, default=DEFAULT_ENCODERS) – Encoder instances to use when configs are in simple format.

  • verbose (bool, default=False) – If True, log detailed information about config composition and conflicts.

Returns:

Fitted sklearn ColumnTransformer ready to transform data.

Return type:

ColumnTransformer

Raises:
  • ValueError – If encoding configurations are invalid or if the DataFrame is empty.

  • KeyError – If the input DataFrame is missing columns specified in the encoding config.

Examples

>>> # Fit encoders on training data
>>> fitted_transformer = fit_encoders(train_df, encoding_defaults)
>>>
>>> # Use the fitted transformer on train and test data
>>> train_encoded, train_features, train_aliases = transform_dataframe(train_df, fitted_transformer)
>>> test_encoded, test_features, test_aliases = transform_dataframe(test_df, fitted_transformer)

napistu_torch.load.encoding.transform_dataframe(df: DataFrame, fitted_transformer: ColumnTransformer, deduplicate: bool = True) → tuple[ndarray, List[str], Dict[str, str]]

Transform a DataFrame using a fitted ColumnTransformer.

This function applies pre-fitted transformations to a DataFrame. The transformer must have been fitted previously using fit_encoders() or by calling .fit() directly.

Parameters:
  • df (pd.DataFrame) – Input DataFrame to transform. Must contain all columns that the transformer expects.

  • fitted_transformer (ColumnTransformer) – A fitted sklearn ColumnTransformer instance.

  • deduplicate (bool, default=True) – If True, deduplicate identical features and name the resulting columns using the shortest common prefix of the merged columns.

Returns:

A tuple containing:

  • encoded_array (np.ndarray) – Transformed numpy array with encoded features.

  • feature_names (List[str]) – List of feature names corresponding to columns in encoded_array.

  • feature_aliases (Dict[str, str]) – Mapping from feature names to their aliases. If deduplicate is True, this maps feature names to their canonical names; if deduplicate is False, it is an empty dictionary.

Return type:

tuple[np.ndarray, List[str], Dict[str, str]]

Raises:
  • ValueError – If the transformer is not fitted or if the DataFrame is empty.

  • KeyError – If the DataFrame is missing columns required by the transformer.

Examples

>>> # Fit on training data
>>> fitted_transformer = fit_encoders(train_df, encoding_config)
>>>
>>> # Transform multiple DataFrames with same fitted transformer
>>> train_encoded, train_features, train_aliases = transform_dataframe(train_df, fitted_transformer)
>>> test_encoded, test_features, test_aliases = transform_dataframe(test_df, fitted_transformer)
>>> val_encoded, val_features, val_aliases = transform_dataframe(val_df, fitted_transformer)