napistu_torch.utils.statistics

Utilities for calculating statistics.

Public Functions

calculate_rank_shift(df, rank_col, max_rank, grouping_vars=None, alternative="two-sided", test_method="wilcoxon"): Calculate rank shift statistics for a given DataFrame.
compare_top_k_union_ranks(top_k_union, grouping_vars, defining_vars, top_k, max_rank, rank_col, test_method="wilcoxon", alternative="two-sided"): Calculate the rank agreement between the top-k attention pairs of a given partition and all other partitions.

Functions

`calculate_rank_shift`(df, rank_col, max_rank)	Calculate rank shift statistics for a given DataFrame.
`compare_top_k_union_ranks`(top_k_union, ...)	Calculate the rank agreement between the top-k attention pairs of a given partition and all other partitions.

napistu_torch.utils.statistics._compute_rank_shift_for_group(group_df: DataFrame, rank_col: str, max_rank: int, alternative: str, test_method: str) → Series: Compute rank shift statistics for a single group.

napistu_torch.utils.statistics.calculate_rank_shift(df: DataFrame, rank_col: str, max_rank: int, grouping_vars: str | List[str] | None = None, alternative: str = 'two-sided', test_method: str = 'wilcoxon') → DataFrame

Calculate rank shift statistics for a given DataFrame.

For each group (e.g., each layer), tests whether the quantiles derived from rank_col and max_rank are shifted away from the null expectation of 0.5.

Parameters:

df (pd.DataFrame) – DataFrame containing rank column and grouping variable.
rank_col (str) – Column name containing ranks.
max_rank (int) – Maximum possible rank.
grouping_vars (str | List[str] | None, optional) – Column name(s) to group by (default: None). If None, a single test is run on all rows. If a single string, tests are run within each value of that column. If a list of strings, tests are run within each combination of those columns. Example: ['model'] or ['model', 'layer']. Tests are performed separately for each group.
alternative (str, optional) – Alternative hypothesis for the test (default: "two-sided").
    - "greater": selected queries have quantiles higher than 0.5
    - "less": selected queries have quantiles lower than 0.5
    - "two-sided": selected queries have quantiles that differ from 0.5
test_method (str, optional) – Statistical test to use (default: "wilcoxon").
    - "wilcoxon": Wilcoxon signed-rank test (non-parametric)
    - "ttest": One-sample t-test (parametric)
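The test_method and alternative options can be illustrated on a single group with SciPy. This is a minimal sketch, not napistu_torch's implementation: the helper name `rank_shift_test` is hypothetical, and the rank-to-quantile convention (`1 - (rank - 0.5) / max_rank`, so that rank 1 maps to a quantile near 1.0 and uniformly distributed ranks average 0.5) is an assumption made for illustration.

```python
import numpy as np
from scipy import stats


def rank_shift_test(ranks, max_rank, alternative="two-sided", test_method="wilcoxon"):
    """Test whether rank-derived quantiles are shifted away from 0.5 (hypothetical helper)."""
    ranks = np.asarray(ranks, dtype=float)
    # Assumed convention: rank 1 is best and maps to a quantile near 1.0;
    # ranks spread uniformly over 1..max_rank then average a quantile of 0.5.
    quantiles = 1.0 - (ranks - 0.5) / max_rank
    if test_method == "wilcoxon":
        # Signed-rank test on the differences from the null value 0.5
        result = stats.wilcoxon(quantiles - 0.5, alternative=alternative)
    elif test_method == "ttest":
        # One-sample t-test against a population mean of 0.5
        result = stats.ttest_1samp(quantiles, popmean=0.5, alternative=alternative)
    else:
        raise ValueError(f"Unknown test_method: {test_method!r}")
    return quantiles, float(result.statistic), float(result.pvalue)


# Toy data: 20 distinct ranks, all near the top of 10,000 possible ranks
rng = np.random.default_rng(0)
top_ranks = rng.choice(np.arange(1, 500), size=20, replace=False)

quantiles, stat, p_greater = rank_shift_test(top_ranks, 10_000, alternative="greater")
_, _, p_less = rank_shift_test(top_ranks, 10_000, alternative="less")
print(f"greater: p={p_greater:.3g}, less: p={p_less:.3g}")
```

Under this assumed convention, ranks concentrated near the top give a small p-value for "greater" and a large one for "less"; reversing the convention would swap the two.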

Returns:

Results with columns:
    - <grouping_vars> : Grouping variable value(s)
    - n_queries : Number of queries in this group
    - mean_quantile : Mean quantile (0.5 = null expectation)
    - median_quantile : Median quantile
    - statistic : Test statistic
    - p_value : P-value for the test

Return type:

pd.DataFrame
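To make the shape of the returned table concrete, the sketch below builds a result with the same column names from toy data, using pandas and SciPy directly. It does not call napistu_torch; the toy DataFrame, the constant `MAX_RANK`, and the rank-to-quantile convention are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats

MAX_RANK = 1000  # assumed total number of possible ranks in the toy data
rng = np.random.default_rng(1)

# Toy input: two layers with 30 ranked pairs each
toy = pd.DataFrame(
    {
        "layer": ["layer_0"] * 30 + ["layer_1"] * 30,
        "rank": np.concatenate(
            [
                rng.choice(np.arange(1, 101), size=30, replace=False),   # near the top
                rng.choice(np.arange(1, 1001), size=30, replace=False),  # spread over all ranks
            ]
        ),
    }
)

rows = []
for layer, group in toy.groupby("layer"):
    # Assumed convention: rank 1 maps to a quantile near 1.0
    quantiles = 1.0 - (group["rank"].to_numpy() - 0.5) / MAX_RANK
    test = stats.wilcoxon(quantiles - 0.5, alternative="two-sided")
    rows.append(
        {
            "layer": layer,
            "n_queries": len(group),
            "mean_quantile": float(np.mean(quantiles)),
            "median_quantile": float(np.median(quantiles)),
            "statistic": float(test.statistic),
            "p_value": float(test.pvalue),
        }
    )

results = pd.DataFrame(rows)
print(results)
```

Each group contributes one row, so filtering on p_value selects whole groups rather than individual queries.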

Examples

>>> # Test if top-k queries occupy high quantiles within each layer
>>> results = calculate_rank_shift(
...     df=top_k_union,
...     rank_col="rank",
...     max_rank=len(common_ids)**2,
...     grouping_vars="layer"
... )
>>> # Significant layers have p_value < 0.05
>>> significant = results[results['p_value'] < 0.05]

>>> # Compare across models instead of layers
>>> results = calculate_rank_shift(
...     df=multi_model_edges,
...     rank_col="rank",
...     max_rank=len(common_ids)**2,
...     grouping_vars="model"
... )

napistu_torch.utils.statistics.compare_top_k_union_ranks(top_k_union: DataFrame, grouping_vars: list, defining_vars: list, top_k: int, max_rank: int, rank_col: str, test_method: str = 'wilcoxon', alternative: str = 'two-sided') → DataFrame

Calculate the rank agreement between the top-k attention pairs of a given partition and all other partitions.

Parameters:

top_k_union (pd.DataFrame) – The top-k attention pairs of a given partition.
grouping_vars (list) – The variables to group by.
defining_vars (list) – The variables that define the top-k attention pairs.
top_k (int) – The number of top-k attention pairs considered (this should match the value used to create top_k_union).
max_rank (int) – The maximum rank considered.
rank_col (str) – The column name of the ranks.
test_method (str, optional) – Statistical test to use (default: "wilcoxon").
    - "wilcoxon": Wilcoxon signed-rank test (non-parametric)
    - "ttest": One-sample t-test (parametric)
alternative (str, optional) – Alternative hypothesis for the test (default: "two-sided").
    - "greater": selected queries have quantiles higher than 0.5
    - "less": selected queries have quantiles lower than 0.5
    - "two-sided": selected queries have quantiles that differ from 0.5

Returns:

A DataFrame containing the rank agreement between the top-k attention pairs of a given partition and all other partitions.

Return type:

pd.DataFrame
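One way to picture the kind of input this function works on is a table holding every pair that reaches the top-k in at least one partition, together with that pair's rank in every partition. The sketch below builds such a table from toy data with pandas only. The column names (`model`, `pair`, `attention`, `rank`) and the construction itself are assumptions for illustration; the actual top_k_union used by napistu_torch may be built differently.

```python
import pandas as pd

# Toy attention scores for 5 pairs in two partitions (here: two models)
scores = pd.DataFrame(
    {
        "model": ["a"] * 5 + ["b"] * 5,
        "pair": ["p1", "p2", "p3", "p4", "p5"] * 2,
        "attention": [0.9, 0.7, 0.5, 0.3, 0.1, 0.2, 0.8, 0.1, 0.9, 0.4],
    }
)
top_k = 2

# Rank pairs within each partition (rank 1 = highest attention)
scores["rank"] = (
    scores.groupby("model")["attention"]
    .rank(ascending=False, method="first")
    .astype(int)
)

# Pairs that fall in the top-k of at least one partition
union_pairs = scores.loc[scores["rank"] <= top_k, "pair"].unique()

# Keep those pairs, with their ranks, in every partition
top_k_union = scores[scores["pair"].isin(union_pairs)].reset_index(drop=True)
print(top_k_union)
```

In this toy table, max_rank would be 5 (the number of pairs ranked per partition) and top_k would be 2, matching the value used to select the pairs.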