napistu_torch.utils.statistics

Utilities for calculating statistics.

Public Functions

calculate_rank_shift(df, rank_col, max_rank, grouping_vars=None, alternative="two-sided", test_method="wilcoxon")

Calculate rank percentile shift for a given DataFrame.

compare_top_k_union_ranks(top_k_union, grouping_vars, defining_vars, top_k, max_rank, rank_col, test_method="wilcoxon", alternative="two-sided")

Calculate the rank agreement between the top-k attention pairs of a given partition and all other partitions.

Functions

calculate_rank_shift(df, rank_col, max_rank)

Calculate rank shift statistics for a given DataFrame.

compare_top_k_union_ranks(top_k_union, ...)

Calculate the rank agreement between the top-k attention pairs of a given partition and all other partitions.

napistu_torch.utils.statistics._compute_rank_shift_for_group(group_df: DataFrame, rank_col: str, max_rank: int, alternative: str, test_method: str) → Series

Compute rank shift statistics for a single group.

napistu_torch.utils.statistics.calculate_rank_shift(df: DataFrame, rank_col: str, max_rank: int, grouping_vars: str | List[str] | None = None, alternative: str = 'two-sided', test_method: str = 'wilcoxon') → DataFrame

Calculate rank shift statistics for a given DataFrame.

For each group (e.g., layer), tests whether the rank quantiles of the selected queries deviate from the null expectation of 0.5.

Parameters:
  • df (pd.DataFrame) – DataFrame containing rank column and grouping variable.

  • rank_col (str) – Column name containing ranks.

  • max_rank (int) – Maximum possible rank.

  • grouping_vars (str | List[str] | None, optional) – Column name(s) to group by when running the tests. If None, a single test is run across all rows. If a single string, tests are run separately for each unique value of that column. If a list of strings, tests are run separately for each combination of those columns, e.g. ['model'] or ['model', 'layer'] (default: None).

  • alternative (str, optional) – Alternative hypothesis for the test (default: "two-sided").
      - "greater": selected queries have higher quantiles than 0.5
      - "less": selected queries have lower quantiles than 0.5
      - "two-sided": selected queries differ from 0.5

  • test_method (str, optional) – Statistical test to use (default: "wilcoxon").
      - "wilcoxon": Wilcoxon signed-rank test (non-parametric)
      - "ttest": One-sample t-test (parametric)

Returns:

Results with columns:
  - <grouping_vars> : Grouping variable value(s)
  - n_queries : Number of queries in this group
  - mean_quantile : Mean quantile (0.5 = null expectation)
  - median_quantile : Median quantile
  - statistic : Test statistic
  - p_value : P-value for the test

Return type:

pd.DataFrame
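The per-group test described above can be approximated with pandas and SciPy. The following is a minimal sketch, not the napistu_torch implementation: the function name sketch_rank_shift, the toy data, and in particular the rank-to-quantile mapping (rank 1 maps to a quantile near 1) are assumptions made for illustration, and the library's own convention may differ.

```python
import numpy as np
import pandas as pd
from scipy import stats


def sketch_rank_shift(df, rank_col, max_rank, group_col,
                      alternative="two-sided", test_method="wilcoxon"):
    """Hypothetical stand-in that mimics the documented output columns."""
    rows = []
    for key, group in df.groupby(group_col):
        # ASSUMPTION: rank 1 is the strongest and maps to a quantile near 1.
        quantiles = 1.0 - (group[rank_col].to_numpy() - 0.5) / max_rank
        if test_method == "wilcoxon":
            # Signed-rank test on the deviations from the null value 0.5.
            res = stats.wilcoxon(quantiles - 0.5, alternative=alternative)
        elif test_method == "ttest":
            res = stats.ttest_1samp(quantiles, popmean=0.5,
                                    alternative=alternative)
        else:
            raise ValueError(f"Unknown test_method: {test_method}")
        rows.append({
            group_col: key,
            "n_queries": len(quantiles),
            "mean_quantile": float(np.mean(quantiles)),
            "median_quantile": float(np.median(quantiles)),
            "statistic": float(res.statistic),
            "p_value": float(res.pvalue),
        })
    return pd.DataFrame(rows)


# Toy data: layer "a" holds only top ranks, layer "b" is centred on the middle.
toy = pd.DataFrame({
    "layer": ["a"] * 20 + ["b"] * 10,
    "rank": list(range(1, 21)) + [10, 20, 30, 40, 50, 91, 81, 71, 61, 51],
})
results = sketch_rank_shift(toy, rank_col="rank", max_rank=100,
                            group_col="layer", alternative="greater")
```

With alternative="greater", a group whose ranks sit near the top should yield a small p_value, while a group whose quantiles are centred on 0.5 should not.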

Examples

>>> # Test if top-k queries occupy high quantiles within each layer
>>> results = calculate_rank_shift(
...     df=top_k_union,
...     rank_col="rank",
...     max_rank=len(common_ids)**2,
...     grouping_vars="layer"
... )
>>> # Significant layers have p_value < 0.05
>>> significant = results[results['p_value'] < 0.05]
>>> # Compare across models instead of layers
>>> results = calculate_rank_shift(
...     df=multi_model_edges,
...     rank_col="rank",
...     max_rank=len(common_ids)**2,
...     grouping_vars="model"
... )

napistu_torch.utils.statistics.compare_top_k_union_ranks(top_k_union: DataFrame, grouping_vars: list, defining_vars: list, top_k: int, max_rank: int, rank_col: str, test_method: str = 'wilcoxon', alternative: str = 'two-sided') → DataFrame

Calculate the rank agreement between the top-k attention pairs of a given partition and all other partitions.

Parameters:
  • top_k_union (pd.DataFrame) – The top-k attention pairs of a given partition.

  • grouping_vars (list) – The variables to group by.

  • defining_vars (list) – The variables that define the top-k attention pairs.

  • top_k (int) – The number of top-k attention pairs considered. This should match the value used to create top_k_union.

  • max_rank (int) – The maximum rank considered.

  • rank_col (str) – The name of the column containing the ranks.

  • test_method (str) – Statistical test to use (default: "wilcoxon").
      - "wilcoxon": Wilcoxon signed-rank test (non-parametric)
      - "ttest": One-sample t-test (parametric)

  • alternative (str) – Alternative hypothesis for the test (default: "two-sided").
      - "greater": selected queries have higher quantiles than 0.5
      - "less": selected queries have lower quantiles than 0.5
      - "two-sided": selected queries differ from 0.5

Returns:

The rank agreement between the top-k attention pairs of a given partition and all other partitions.

Return type:

pd.DataFrame
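To make the notion of rank agreement concrete, here is a rough sketch of one way it could be computed from a long-format rank table. This is an illustration under stated assumptions, not the library's algorithm: the function sketch_top_k_agreement, the column names layer and pair, the toy ranks, and the rank-to-quantile mapping are all hypothetical, and the real function takes a precomputed top_k_union rather than a full rank table.

```python
import pandas as pd


def sketch_top_k_agreement(df, defining_col, pair_col, rank_col, top_k, max_rank):
    """Hypothetical helper: how do one partition's top-k pairs rank elsewhere?"""
    rows = []
    for defining_value, part in df.groupby(defining_col):
        # Pairs that fall in the top k of the defining partition.
        top_pairs = part.loc[part[rank_col] <= top_k, pair_col]
        # The same pairs as they appear in every other partition.
        others = df[(df[defining_col] != defining_value)
                    & df[pair_col].isin(top_pairs)]
        for other_value, other in others.groupby(defining_col):
            # ASSUMPTION: rank 1 maps to a quantile near 1.
            quantiles = 1.0 - (other[rank_col] - 0.5) / max_rank
            rows.append({
                "defined_in": defining_value,
                "compared_to": other_value,
                "n_pairs": len(other),
                "mean_quantile": float(quantiles.mean()),
            })
    return pd.DataFrame(rows)


# Toy table: five pairs ranked independently in two layers.
ranks = pd.DataFrame({
    "layer": [0] * 5 + [1] * 5,
    "pair": list("ABCDE") * 2,
    "rank": [1, 2, 3, 4, 5, 5, 1, 2, 3, 4],
})
agreement = sketch_top_k_agreement(ranks, defining_col="layer", pair_col="pair",
                                   rank_col="rank", top_k=2, max_rank=5)
```

A mean quantile well above 0.5 would suggest that pairs ranked highly in one partition also rank highly in the other. A formal test of that shift is what the test_method and alternative parameters control.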