napistu_torch.utils.pd_utils
Utilities for Pandas operations.
Public Functions
- calculate_ranks(df, value_col, by_absolute_value=True, grouping_vars=None)
Compute integer ranks for values in a DataFrame, ranking within groups.
- reorder_multindex_by_categorical_and_numeric(multindex, categorical_order, categorical_level, numeric_level)
Reorder MultiIndex by categorical order (from reference) then by numeric value.
Functions
calculate_ranks | Compute integer ranks for values in a DataFrame, ranking within groups.
filter_and_reorder_df | Filter and reorder a DataFrame to match a target ID list.
reorder_multindex_by_categorical_and_numeric | Reorder MultiIndex by categorical order (from reference) then by numeric value.
- napistu_torch.utils.pd_utils.calculate_ranks(df: DataFrame, value_col: str, by_absolute_value: bool = True, grouping_vars: str | List[str] | None = None) → Series
Compute integer ranks for values in a DataFrame, ranking within groups.
Assumes all entries are already the top N, so they are ranked directly by value within each group. Rank 1 = highest value, rank 2 = second highest, etc.
- Parameters:
df (pd.DataFrame) – DataFrame containing values to rank
value_col (str) – Name of the column containing values to rank
by_absolute_value (bool, optional) – If True, rank by absolute value (default: True). If False, rank by raw value.
grouping_vars (str or List[str], optional) – Column name(s) to group by when calculating ranks (default: None). If None, ranks globally. If a single string, ranks within each value of that column. If a list of strings, ranks within each combination of those columns. Example: ['model'] or ['model', 'layer'].
- Returns:
Series of integer ranks with same index as df. Rank 1 = highest value, rank 2 = second highest, etc. Ranks are calculated within each group if grouping_vars is provided.
- Return type:
pd.Series
Examples
>>> import pandas as pd
>>> from napistu_torch.utils.pd_utils import calculate_ranks
>>> df = pd.DataFrame({
...     'model': ['A', 'A', 'B', 'B'],
...     'attention': [0.9, 0.8, 0.7, 0.6]
... })
>>> ranks = calculate_ranks(df, 'attention', grouping_vars='model')
>>> # Ranks within each model: A gets [1, 2], B gets [1, 2]
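The ranking behavior described above can be approximated with plain pandas. The following is a minimal sketch, not the library's implementation: the name rank_within_groups is hypothetical, and the tie-breaking rule (method="first") is an assumption, since the documentation does not say how ties are handled.

```python
import pandas as pd


def rank_within_groups(df, value_col, by_absolute_value=True, grouping_vars=None):
    """Illustrative stand-in for calculate_ranks (not the library's code)."""
    values = df[value_col].abs() if by_absolute_value else df[value_col]
    if grouping_vars is None:
        # No grouping: rank across the whole column, largest value first.
        ranks = values.rank(method="first", ascending=False)
    else:
        keys = [grouping_vars] if isinstance(grouping_vars, str) else list(grouping_vars)
        # Group the values by the key columns and rank inside each group.
        ranks = values.groupby([df[k] for k in keys]).rank(method="first", ascending=False)
    # Assumption: ties are broken by order of appearance (method="first").
    return ranks.astype(int)


df = pd.DataFrame({
    "model": ["A", "A", "B", "B"],
    "attention": [0.9, -0.95, 0.7, 0.6],
})
grouped_abs = rank_within_groups(df, "attention", grouping_vars="model")
grouped_raw = rank_within_groups(df, "attention", by_absolute_value=False, grouping_vars="model")
global_abs = rank_within_groups(df, "attention")
print(grouped_abs.tolist())  # [2, 1, 1, 2]
print(grouped_raw.tolist())  # [1, 2, 1, 2]
print(global_abs.tolist())   # [2, 1, 3, 4]
```

The negative value in model A shows why by_absolute_value matters: -0.95 ranks first by magnitude but second by raw value.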
- napistu_torch.utils.pd_utils.filter_and_reorder_df(df: DataFrame, target_ids: List[str], id_column: str) → DataFrame
Filter and reorder a DataFrame to match a target ID list.
Selects rows from df where id_column matches an entry in target_ids, then reorders to match the order of target_ids.
- Parameters:
df (pd.DataFrame) – DataFrame to filter and reorder. Must contain id_column.
target_ids (List[str]) – Ordered list of identifiers to filter and reorder by.
id_column (str, optional) – Column in df to match against target_ids (default: 'ensembl_gene').
- Returns:
Filtered and reordered DataFrame with reset index. Has exactly len(filtered_target_ids) rows, where filtered_target_ids is the subset of target_ids found in the DataFrame.
- Return type:
pd.DataFrame
- Raises:
ValueError – If id_column is not in df. If no target_ids are found in the DataFrame. If id_column contains duplicates among the matched rows (would make reordering ambiguous).
Examples
>>> import pandas as pd
>>> from napistu_torch.utils.pd_utils import filter_and_reorder_df
>>> full_df = pd.DataFrame({
...     'ensembl_gene': ['ENSG1', 'ENSG2', 'ENSG3'],
...     'symbol': ['TP53', 'BRAF', 'PTEN'],
...     'vocab_name': ['TP53', 'BRAF', 'PTEN'],
... })
>>> filtered = filter_and_reorder_df(
...     full_df,
...     target_ids=['ENSG3', 'ENSG1'],
...     id_column='ensembl_gene',
... )
>>> filtered['symbol'].tolist()
['PTEN', 'TP53']
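The filtering, reordering, and error conditions listed above can be sketched with plain pandas as follows. This is an illustration under assumptions rather than the library's code: filter_and_reorder_sketch is a hypothetical name, the error messages are invented, and duplicate entries in target_ids are not handled.

```python
import pandas as pd


def filter_and_reorder_sketch(df, target_ids, id_column):
    """Illustrative stand-in for filter_and_reorder_df (not the library's code)."""
    if id_column not in df.columns:
        raise ValueError(f"column {id_column!r} not found in DataFrame")
    # Keep only the rows whose ID appears in target_ids.
    matched = df[df[id_column].isin(target_ids)]
    if matched.empty:
        raise ValueError("none of the target_ids were found")
    if matched[id_column].duplicated().any():
        raise ValueError("duplicate IDs among matched rows; ordering is ambiguous")
    # Subset of target_ids actually present, kept in the caller's order.
    found = set(matched[id_column])
    present = [t for t in target_ids if t in found]
    # Index by ID (keeping the column), select in target order, then reset the index.
    return matched.set_index(id_column, drop=False).loc[present].reset_index(drop=True)


full_df = pd.DataFrame({
    "ensembl_gene": ["ENSG1", "ENSG2", "ENSG3"],
    "symbol": ["TP53", "BRAF", "PTEN"],
})
# "ENSG9" is absent from full_df, so it is silently dropped from the result.
filtered = filter_and_reorder_sketch(full_df, ["ENSG3", "ENSG9", "ENSG1"], "ensembl_gene")
print(filtered["symbol"].tolist())  # ['PTEN', 'TP53']
```

Checking for duplicates only among the matched rows mirrors the documented behavior: duplicates elsewhere in the DataFrame do not affect the result.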
- napistu_torch.utils.pd_utils.reorder_multindex_by_categorical_and_numeric(multindex: MultiIndex, categorical_order: List, categorical_level: int = 0, numeric_level: int = 1) → MultiIndex
Reorder MultiIndex by categorical order (from reference) then by numeric value.
This function takes a MultiIndex and reorders it to match a desired categorical ordering, then sorts by numeric values within each categorical group.
- Parameters:
multindex (pd.MultiIndex) – MultiIndex to reorder
categorical_order (List) – Desired order for categorical values. Every value in this list is expected to be present in the MultiIndex at categorical_level; if some are missing, a warning is logged. If the MultiIndex contains values that aren’t in categorical_order, a ValueError is raised.
categorical_level (int, optional) – Level index for categorical variable in MultiIndex (default: 0)
numeric_level (int, optional) – Level index for numeric variable in MultiIndex (default: 1)
- Returns:
Reordered MultiIndex
- Return type:
pd.MultiIndex
- Raises:
ValueError – If the MultiIndex contains categorical values not in categorical_order
Examples
>>> import pandas as pd
>>> from napistu_torch.utils.pd_utils import reorder_multindex_by_categorical_and_numeric
>>> # MultiIndex to reorder
>>> idx = pd.MultiIndex.from_tuples([
...     ('model_B', 2), ('model_A', 1), ('model_A', 0), ('model_B', 0)
... ], names=['model', 'layer'])
>>> # Desired categorical order
>>> categorical_order = ['model_A', 'model_B']
>>> # Reorder
>>> idx_reordered = reorder_multindex_by_categorical_and_numeric(
...     idx, categorical_order, categorical_level=0, numeric_level=1
... )
>>> # Result: ('model_A', 0), ('model_A', 1), ('model_B', 0), ('model_B', 2)
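The two-key ordering described above (position in categorical_order first, numeric value second) can be sketched with plain pandas. This is a hedged illustration, not the library's implementation: reorder_multiindex_sketch is a hypothetical name, and the warning and error messages are invented.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def reorder_multiindex_sketch(multindex, categorical_order, categorical_level=0, numeric_level=1):
    """Illustrative stand-in for reorder_multindex_by_categorical_and_numeric."""
    cats = multindex.get_level_values(categorical_level)
    nums = multindex.get_level_values(numeric_level)
    present = set(cats)
    # Values in the index but not in the requested order are an error.
    extra = present - set(categorical_order)
    if extra:
        raise ValueError(f"values not in categorical_order: {sorted(extra)}")
    # Values requested but absent from the index only trigger a warning.
    missing = [c for c in categorical_order if c not in present]
    if missing:
        logger.warning("categorical_order values missing from index: %s", missing)
    # Sort positions by (rank of category in categorical_order, numeric value).
    rank = {c: i for i, c in enumerate(categorical_order)}
    order = sorted(range(len(multindex)), key=lambda i: (rank[cats[i]], nums[i]))
    return multindex.take(order)


idx = pd.MultiIndex.from_tuples(
    [("model_B", 2), ("model_A", 1), ("model_A", 0), ("model_B", 0)],
    names=["model", "layer"],
)
idx_reordered = reorder_multiindex_sketch(idx, ["model_A", "model_B"])
print(idx_reordered.tolist())
# [('model_A', 0), ('model_A', 1), ('model_B', 0), ('model_B', 2)]
```

The reordered index can then be passed to DataFrame.reindex or Series.reindex to put the data itself in the same order.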