napistu_torch.utils.pd_utils

Utilities for Pandas operations.

Public Functions

calculate_ranks(df, value_col, by_absolute_value=True, grouping_vars=None)

Compute integer ranks for values in a DataFrame, ranking within groups.

reorder_multindex_by_categorical_and_numeric(multindex, categorical_order, categorical_level, numeric_level)

Reorder MultiIndex by categorical order (from reference) then by numeric value.

Functions

calculate_ranks(df, value_col[, ...])

Compute integer ranks for values in a DataFrame, ranking within groups.

filter_and_reorder_df(df, target_ids, id_column)

Filter and reorder a DataFrame to match a target ID list.

reorder_multindex_by_categorical_and_numeric(...)

Reorder MultiIndex by categorical order (from reference) then by numeric value.

napistu_torch.utils.pd_utils.calculate_ranks(df: DataFrame, value_col: str, by_absolute_value: bool = True, grouping_vars: str | List[str] | None = None) → Series

Compute integer ranks for values in a DataFrame, ranking within groups.

Assumes the entries in df have already been filtered to the top N, so values are ranked directly within each group. Rank 1 is the highest value, rank 2 the second highest, and so on.

Parameters:
  • df (pd.DataFrame) – DataFrame containing values to rank

  • value_col (str) – Name of the column containing values to rank

  • by_absolute_value (bool, optional) – If True, rank by absolute value (default: True). If False, rank by raw value.

  • grouping_vars (str or List[str], optional) – Column name(s) to group by when calculating ranks. If None, ranks globally. If a single string, ranks within each value of that column. If a list of strings, ranks within each combination of those columns. Example: ['model'] or ['model', 'layer'] (default: None)

Returns:

Series of integer ranks with same index as df. Rank 1 = highest value, rank 2 = second highest, etc. Ranks are calculated within each group if grouping_vars is provided.

Return type:

pd.Series

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'model': ['A', 'A', 'B', 'B'],
...     'attention': [0.9, 0.8, 0.7, 0.6]
... })
>>> ranks = calculate_ranks(df, 'attention', grouping_vars='model')
>>> # Ranks within each model: A gets [1, 2], B gets [1, 2]
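
The ranking behavior described above can be approximated in plain pandas. The block below is a minimal sketch, not the library's implementation: the helper name rank_within_groups_sketch is hypothetical, and breaking ties by order of appearance (method="first") is an assumption, since the documentation does not say how ties are handled.

```python
import pandas as pd


def rank_within_groups_sketch(df, value_col, by_absolute_value=True, grouping_vars=None):
    """Hypothetical stand-in that mimics the documented ranking behavior."""
    # Rank on magnitude or on the raw signed value.
    values = df[value_col].abs() if by_absolute_value else df[value_col]

    if grouping_vars is None:
        # No grouping: rank across the whole column.
        ranks = values.rank(ascending=False, method="first")
    else:
        # Accept a single column name or a list of column names.
        keys = [grouping_vars] if isinstance(grouping_vars, str) else list(grouping_vars)
        ranks = values.groupby([df[k] for k in keys]).rank(ascending=False, method="first")

    # rank() returns floats; convert to integers (assumes no missing values).
    return ranks.astype(int)


df = pd.DataFrame({
    "model": ["A", "A", "B", "B"],
    "attention": [0.9, -0.95, 0.7, 0.6],
})

# Grouped, by absolute value: -0.95 outranks 0.9 within model A.
ranks = rank_within_groups_sketch(df, "attention", grouping_vars="model")
print(ranks.tolist())  # [2, 1, 1, 2]
```

With by_absolute_value=False and no grouping, the same data would rank as [1, 4, 2, 3], because -0.95 becomes the lowest value.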
napistu_torch.utils.pd_utils.filter_and_reorder_df(df: DataFrame, target_ids: List[str], id_column: str) → DataFrame

Filter and reorder a DataFrame to match a target ID list.

Selects rows from df where id_column matches an entry in target_ids, then reorders to match the order of target_ids.

Parameters:
  • df (pd.DataFrame) – DataFrame to filter and reorder. Must contain id_column.

  • target_ids (List[str]) – Ordered list of identifiers to filter and reorder by.

  • id_column (str, optional) – Column in df to match against target_ids (default: 'ensembl_gene').

Returns:

Filtered and reordered DataFrame with a reset index. It has exactly len(filtered_target_ids) rows, where filtered_target_ids is the subset of target_ids found in the DataFrame, so target IDs absent from df do not produce rows.

Return type:

pd.DataFrame

Raises:

ValueError – Raised in any of these cases:
  • id_column is not in df.
  • No target_ids are found in the DataFrame.
  • id_column contains duplicates among the matched rows (which would make reordering ambiguous).

Examples

>>> import pandas as pd
>>> full_df = pd.DataFrame({
...     'ensembl_gene': ['ENSG1', 'ENSG2', 'ENSG3'],
...     'symbol': ['TP53', 'BRAF', 'PTEN'],
...     'vocab_name': ['TP53', 'BRAF', 'PTEN'],
... })
>>> filtered = filter_and_reorder_df(
...     full_df,
...     target_ids=['ENSG3', 'ENSG1'],
...     id_column='ensembl_gene',
... )
>>> filtered['symbol'].tolist()
['PTEN', 'TP53']
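
The filter, validate, and reorder steps described above can be sketched in plain pandas. This is a minimal sketch under assumptions, not the library's code: the helper name filter_and_reorder_sketch and the error messages are hypothetical, and only the three documented error conditions are checked.

```python
import pandas as pd


def filter_and_reorder_sketch(df, target_ids, id_column):
    """Hypothetical stand-in that mimics the documented behavior."""
    if id_column not in df.columns:
        raise ValueError(f"{id_column!r} is not a column of df")

    # Keep only rows whose ID appears in target_ids.
    matched = df[df[id_column].isin(target_ids)]
    if matched.empty:
        raise ValueError("no target_ids were found in df")
    if matched[id_column].duplicated().any():
        raise ValueError("duplicate IDs among matched rows make reordering ambiguous")

    # Preserve the order of target_ids, skipping IDs absent from df.
    present = set(matched[id_column])
    ordered_ids = [i for i in target_ids if i in present]
    return matched.set_index(id_column).loc[ordered_ids].reset_index()


full_df = pd.DataFrame({
    "ensembl_gene": ["ENSG1", "ENSG2", "ENSG3"],
    "symbol": ["TP53", "BRAF", "PTEN"],
})

# 'ENSG9' is not in full_df, so it contributes no row.
result = filter_and_reorder_sketch(full_df, ["ENSG3", "ENSG9", "ENSG1"], "ensembl_gene")
print(result["symbol"].tolist())  # ['PTEN', 'TP53']
```

Note that this sketch moves id_column to the first position as a side effect of set_index followed by reset_index; whether the real function preserves the original column order is not stated in the documentation.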
napistu_torch.utils.pd_utils.reorder_multindex_by_categorical_and_numeric(multindex: MultiIndex, categorical_order: List, categorical_level: int = 0, numeric_level: int = 1) → MultiIndex

Reorder MultiIndex by categorical order (from reference) then by numeric value.

This function takes a MultiIndex and reorders it to match a desired categorical ordering, then sorts by numeric values within each categorical group.

Parameters:
  • multindex (pd.MultiIndex) – MultiIndex to reorder

  • categorical_order (List) – Desired order for categorical values. Every value in this list is expected to be present in the MultiIndex at categorical_level; if some are missing, a warning is logged and processing continues. If the MultiIndex contains values at categorical_level that aren’t in categorical_order, a ValueError is raised.

  • categorical_level (int, optional) – Level index for categorical variable in MultiIndex (default: 0)

  • numeric_level (int, optional) – Level index for numeric variable in MultiIndex (default: 1)

Returns:

Reordered MultiIndex

Return type:

pd.MultiIndex

Raises:

ValueError – If the MultiIndex contains categorical values not in categorical_order

Examples

>>> import pandas as pd
>>> # MultiIndex to reorder
>>> idx = pd.MultiIndex.from_tuples([
...     ('model_B', 2), ('model_A', 1), ('model_A', 0), ('model_B', 0)
... ], names=['model', 'layer'])
>>> # Desired categorical order
>>> categorical_order = ['model_A', 'model_B']
>>> # Reorder
>>> idx_reordered = reorder_multindex_by_categorical_and_numeric(
...     idx, categorical_order, categorical_level=0, numeric_level=1
... )
>>> # Result: ('model_A', 0), ('model_A', 1), ('model_B', 0), ('model_B', 2)
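
The two-key ordering described above (position in categorical_order first, numeric value second) can be sketched with a sort over row positions. This is a minimal sketch under assumptions, not the library's implementation: the helper name reorder_sketch, the logger, and the message texts are hypothetical.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def reorder_sketch(multindex, categorical_order, categorical_level=0, numeric_level=1):
    """Hypothetical stand-in that mimics the documented ordering rules."""
    cat_values = multindex.get_level_values(categorical_level)
    num_values = multindex.get_level_values(numeric_level)

    # Values in the index that the requested order does not cover are an error.
    extra = set(cat_values) - set(categorical_order)
    if extra:
        raise ValueError(f"values not in categorical_order: {sorted(extra)}")

    # Values requested but absent from the index only trigger a warning.
    missing = set(categorical_order) - set(cat_values)
    if missing:
        logger.warning("values missing from MultiIndex: %s", sorted(missing))

    # Sort row positions by (rank in categorical_order, numeric value).
    position = {value: i for i, value in enumerate(categorical_order)}
    order = sorted(
        range(len(multindex)),
        key=lambda i: (position[cat_values[i]], num_values[i]),
    )
    return multindex.take(order)


idx = pd.MultiIndex.from_tuples(
    [("model_B", 2), ("model_A", 1), ("model_A", 0), ("model_B", 0)],
    names=["model", "layer"],
)
reordered = reorder_sketch(idx, ["model_A", "model_B"])
print(reordered.tolist())
# [('model_A', 0), ('model_A', 1), ('model_B', 0), ('model_B', 2)]
```

The returned index can then be passed to DataFrame.reindex or DataFrame.loc to reorder the rows of a frame that carries the original MultiIndex.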