CompStats.performance

performance(data: pandas.DataFrame, gold: str = 'y', score: Callable[[numpy.ndarray, numpy.ndarray], float] = accuracy_score, num_samples: int = 500, n_jobs: int = -1, BiB: bool = True, statistic_samples: StatisticSamples = None) -> StatisticSamples

Calculate bootstrap samples of a performance score for a given dataset.

Parameters:
  • data (pd.DataFrame): Input dataset.
  • gold (str, optional): Column name of the ground truth or target variable. Defaults to ‘y’.
  • score (Callable, optional): Performance score function. Defaults to accuracy_score.
  • num_samples (int, optional): Number of bootstrap samples. Defaults to 500.
  • n_jobs (int, optional): Number of jobs to run in parallel. Defaults to -1.
  • BiB (bool, optional): Whether the metric is Bigger is Better. Defaults to True.
  • statistic_samples (StatisticSamples, optional): Pre-initialized StatisticSamples object. Defaults to None.

Returns: StatisticSamples: Object containing the bootstrap samples of the performance score.

Example usage:

>>> from sklearn.metrics import accuracy_score
>>> import pandas as pd
>>> from CompStats import performance
>>> df = pd.read_csv('path/to/data.csv')
>>> perf = performance(df, gold='y', score=accuracy_score, num_samples=1000)
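
As a rough illustration of what "bootstrap samples of a performance score" means, the sketch below resamples rows with replacement and recomputes the score for every algorithm. It uses only the standard library so that it runs on its own; the names bootstrap_scores and accuracy, the toy labels, and the choice to reuse the same resampled rows for all algorithms are assumptions made for this example, not a description of CompStats internals.

```python
import random


def accuracy(y, hy):
    """Fraction of positions where the prediction equals the gold label."""
    return sum(a == b for a, b in zip(y, hy)) / len(y)


def bootstrap_scores(gold, predictions, score=accuracy, num_samples=500, seed=0):
    """Return num_samples bootstrap replicates of score per algorithm.

    gold is a list of true labels; predictions maps a (hypothetical)
    algorithm name to its list of predicted labels.
    """
    rng = random.Random(seed)
    n = len(gold)
    samples = {name: [] for name in predictions}
    for _ in range(num_samples):
        # Draw n row indices with replacement. Reusing the same indices for
        # every algorithm keeps the replicates paired (an assumption here).
        idx = [rng.randrange(n) for _ in range(n)]
        y = [gold[i] for i in idx]
        for name, hy in predictions.items():
            samples[name].append(score(y, [hy[i] for i in idx]))
    return samples


gold = [0, 1, 1, 0, 1, 0, 1, 1]
preds = {"alg-1": [0, 1, 1, 0, 1, 0, 0, 1],
         "alg-2": [1, 1, 0, 0, 1, 0, 0, 1]}
samples = bootstrap_scores(gold, preds, num_samples=200)
```

Each entry of samples then holds 200 replicated accuracy values, which is the kind of data the remaining functions in this module summarize and compare.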

difference(statistic_samples: StatisticSamples)

Computes the difference in performance between the best performing algorithm and others using bootstrap samples.

Parameters: statistic_samples (StatisticSamples): An instance of StatisticSamples containing the performance data.

Returns: StatisticSamples: A new instance of StatisticSamples with the computed differences and information about the best algorithm.

The function works as follows:
  1. Determines the index of the best performing algorithm based on the BiB attribute.
  2. Extracts and calculates the mean performance for each algorithm.
  3. Sorts the algorithms by their mean performance.
  4. Identifies the best performing algorithm.
  5. Computes the difference in performance between the best algorithm and each other algorithm.
  6. Returns a new StatisticSamples instance with the computed differences and the name of the best performing algorithm.

Example usage:

>>> from CompStats import performance, difference
>>> from CompStats.tests.test_performance import DATA
>>> from sklearn.metrics import f1_score
>>> import pandas as pd
>>> df = pd.read_csv(DATA)
>>> score = lambda y, hy: f1_score(y, hy, average='weighted')
>>> perf = performance(df, score=score)
>>> diff = difference(perf)
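
The steps listed for difference can be sketched on plain dictionaries. The function name difference_from_best, the toy scores, and the sign convention (positive values mean the best algorithm is ahead) are assumptions for illustration; the library stores its results in a StatisticSamples object instead.

```python
from statistics import mean


def difference_from_best(samples, bib=True):
    """samples maps algorithm name -> list of bootstrap scores (paired by index)."""
    means = {name: mean(values) for name, values in samples.items()}
    # With Bigger is Better the best algorithm has the highest mean,
    # otherwise the lowest.
    pick = max if bib else min
    best = pick(means, key=means.get)
    # Assumed sign convention: positive means the best algorithm is ahead.
    sign = 1 if bib else -1
    diffs = {
        name: [sign * (b - v) for b, v in zip(samples[best], values)]
        for name, values in samples.items()
        if name != best
    }
    return best, diffs


samples = {
    "alg-1": [0.80, 0.82, 0.79],
    "alg-2": [0.70, 0.75, 0.72],
    "alg-3": [0.78, 0.80, 0.81],
}
best, diffs = difference_from_best(samples)
```

Because the differences are taken replicate by replicate, an algorithm can trail the best one on average and still win in individual replicates, as alg-3 does in the last position here.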

all_differences(statistic_samples: StatisticSamples)

Calculates all possible differences in performance among algorithms and sorts them by average performance.

Parameters: statistic_samples (StatisticSamples): An instance of StatisticSamples containing the performance data.

Returns: StatisticSamples: A new instance of StatisticSamples with the computed performance differences among all algorithms.

The function works as follows:
  1. Extracts the performance data for each algorithm.
  2. Calculates the mean performance for each algorithm and sorts the algorithms based on their mean performance.
  3. Iterates over all possible pairs of algorithms.
  4. Computes the difference in performance for each pair and stores it in a dictionary.
  5. Returns a new StatisticSamples instance with the computed differences.

Example usage:

>>> from CompStats import performance, all_differences
>>> from CompStats.tests.test_performance import DATA
>>> from sklearn.metrics import f1_score
>>> import pandas as pd
>>> df = pd.read_csv(DATA)
>>> score = lambda y, hy: f1_score(y, hy, average='weighted')
>>> perf = performance(df, score=score)
>>> all_diff = all_differences(perf)
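
The pairwise procedure described for all_differences might look like the following on plain dictionaries. The helper name all_pairwise_differences and the "a - b" key format are assumptions chosen for readability; the keys used by the library may differ.

```python
from itertools import combinations
from statistics import mean


def all_pairwise_differences(samples, bib=True):
    """Return bootstrap differences for every unordered pair of algorithms."""
    # Order algorithms from best to worst mean score.
    order = sorted(samples, key=lambda name: mean(samples[name]), reverse=bib)
    diffs = {}
    for a, b in combinations(order, 2):
        # Illustrative key format; each pair appears exactly once.
        diffs[f"{a} - {b}"] = [x - y for x, y in zip(samples[a], samples[b])]
    return diffs


samples = {
    "alg-1": [0.80, 0.82, 0.79],
    "alg-2": [0.70, 0.75, 0.72],
    "alg-3": [0.78, 0.80, 0.81],
}
pairs = all_pairwise_differences(samples)
```

Sorting first means every pair is written as "better minus worse", so with three algorithms the result contains three comparisons rather than six.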

plot_performance(statistic_samples: StatisticSamples, CI: float = 0.05, var_name='Algorithm', value_name='Score', capsize=0.2, linestyle='none', kind='point', sharex=False, **kwargs)

Plots the performance of algorithms with confidence intervals.

Parameters:
  • statistic_samples (StatisticSamples or pd.DataFrame) – An instance of StatisticSamples containing the performance data, or a DataFrame in long format.

  • CI (float) – Confidence interval level (default is 0.05).

  • var_name (str) – Variable name for algorithms (default is ‘Algorithm’).

  • value_name (str) – Variable name for scores (default is ‘Score’).

  • capsize (float) – Size of the caps on error bars (default is 0.2).

  • linestyle (str) – Line style for the plot (default is ‘none’).

  • kind (str) – Type of plot (default is ‘point’).

  • sharex (bool) – Whether to share the x-axis among subplots (default is False).

  • kwargs – Additional keyword arguments passed to seaborn’s catplot function.

Returns:

A seaborn FacetGrid object containing the plot.

Return type:

sns.axisgrid.FacetGrid

The function works as follows:
  1. If statistic_samples is an instance of StatisticSamples, it extracts and sorts the performance data.
  2. Converts the data into a long format DataFrame.
  3. Computes the confidence intervals if CI is provided as a float.
  4. Plots the performance data with confidence intervals using seaborn’s catplot.

>>> from CompStats import performance, plot_performance
>>> from CompStats.tests.test_performance import DATA
>>> from sklearn.metrics import f1_score
>>> import pandas as pd
>>> df = pd.read_csv(DATA)
>>> score = lambda y, hy: f1_score(y, hy, average='weighted')
>>> perf = performance(df, score=score)
>>> ins = plot_performance(perf)
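
One common way to turn bootstrap samples into an interval is the percentile method, where CI=0.05 corresponds to the 2.5% and 97.5% empirical quantiles. Whether plot_performance computes its error bars in exactly this way is an assumption; the sketch below, with the hypothetical helper percentile_interval, only illustrates the idea.

```python
def percentile_interval(values, alpha=0.05):
    """Return the (alpha/2, 1 - alpha/2) empirical quantiles of values."""
    ordered = sorted(values)
    n = len(ordered)

    def quantile(q):
        # Linear interpolation between the two closest order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    return quantile(alpha / 2), quantile(1 - alpha / 2)


# Toy "bootstrap scores": 0.00, 0.01, ..., 1.00
scores = [i / 100 for i in range(101)]
low, high = percentile_interval(scores, alpha=0.05)
```

With alpha=0.05 the interval covers the central 95% of the replicated scores, so a smaller CI value gives a wider interval.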

plot_difference(statistic_samples: StatisticSamples, CI: float = 0.05, var_name='Comparison', value_name='Difference', set_refline=True, set_title=True, hue='Significant', palette=None, **kwargs)

Plot the difference in performance with its confidence intervals.

Parameters:
  • statistic_samples (StatisticSamples): An instance of StatisticSamples containing the performance data.
  • CI (float, optional): Confidence interval level. Defaults to 0.05.
  • var_name (str, optional): Variable name for the comparisons. Defaults to ‘Comparison’.
  • value_name (str, optional): Variable name for the differences. Defaults to ‘Difference’.
  • set_refline (bool, optional): Whether to set a reference line at x=0. Defaults to True.
  • set_title (bool, optional): Whether to set the title of the plot with the best performing algorithm. Defaults to True.
  • hue (str or None, optional): Column name for hue encoding. Defaults to ‘Significant’.
  • palette (list or None, optional): Colors to use for different hue levels. Defaults to None.
  • **kwargs: Additional keyword arguments passed to the plot_performance function.

Returns: sns.axisgrid.FacetGrid: A seaborn FacetGrid object containing the plot.

The function works as follows:
  1. Converts the differences stored in statistic_samples into a long format DataFrame.
  2. Adds a ‘Significant’ column to indicate whether the confidence interval includes zero.
  3. Plots the differences with confidence intervals using the plot_performance function.
  4. Optionally sets a reference line at x=0 and a title indicating the best performing algorithm.

>>> from CompStats import performance, difference, plot_difference
>>> from CompStats.tests.test_performance import DATA
>>> from sklearn.metrics import f1_score
>>> import pandas as pd
>>> df = pd.read_csv(DATA)
>>> score = lambda y, hy: f1_score(y, hy, average='weighted')
>>> perf = performance(df, score=score)
>>> diff = difference(perf)
>>> ins = plot_difference(diff)
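
The ‘Significant’ flag can be illustrated with a small standard-library sketch. It assumes a fixed 95% percentile interval and treats a comparison as significant when that interval excludes zero; the helper name flag_significant, the row layout, and the toy differences are hypothetical.

```python
from statistics import quantiles


def flag_significant(diffs):
    """diffs maps comparison name -> list of bootstrap differences."""
    rows = []
    for name, values in diffs.items():
        # n=40 gives cut points every 2.5%, so the first and last cut points
        # are the 2.5% and 97.5% quantiles (a 95% interval only).
        cuts = quantiles(values, n=40, method="inclusive")
        low, high = cuts[0], cuts[-1]
        rows.append({
            "Comparison": name,
            "low": low,
            "high": high,
            # Assumed meaning: significant when zero lies outside the interval.
            "Significant": not (low <= 0 <= high),
        })
    return rows


diffs = {
    "alg-1 - alg-2": [0.05 + 0.001 * i for i in range(100)],   # all positive
    "alg-1 - alg-3": [-0.05 + 0.001 * i for i in range(100)],  # straddles zero
}
rows = flag_significant(diffs)
```

In a plot with a reference line at zero, the first comparison would sit entirely to one side of the line and the second would cross it.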

performance_multiple_metrics(data: DataFrame, gold: str, scores: List[dict], num_samples: int = 500, n_jobs: int = -1)

Calculate bootstrap samples of multiple performance metrics for a given dataset.

Parameters:
  • data (pd.DataFrame): Input dataset.
  • gold (str): Column name of the ground truth or target variable.
  • scores (List[dict]): A list of dictionaries, each containing:
    • “func”: The performance score function.
    • “args” (optional): Arguments to pass to the score function.
    • “BiB”: Whether the metric is Bigger is Better.
  • num_samples (int, optional): Number of bootstrap samples. Defaults to 500.
  • n_jobs (int, optional): Number of jobs to run in parallel. Defaults to -1.

Returns: dict: A dictionary containing the results for each metric, including:

  • ‘samples’: Bootstrap samples of the performance scores.

  • ‘performance’: Calculated performance scores for each algorithm.

  • ‘compg’: General performance comparison metrics, including:
    • ‘n’: Number of samples.

    • ‘m’: Number of algorithms.

    • ‘cv’: Coefficient of variation for each metric.

    • ‘dist’: Distance metric for each metric.

    • ‘PPI’: Performance potential index for each metric.

  • ‘BiB’: Whether each metric is Bigger is Better.

The function works as follows:
  1. Defines auxiliary functions for calculating additional performance metrics.
  2. Iterates over the list of score functions and their respective arguments.
  3. Initializes a StatisticSamples object for each score function.
  4. Calculates the performance scores for each column in the dataset (excluding the ground truth column).
  5. Computes additional performance metrics (CV, distance, PPI) for each score function.
  6. Compiles the results into a dictionary and returns it.

Example usage:

>>> from sklearn.metrics import accuracy_score, f1_score
>>> import pandas as pd
>>> from CompStats import performance_multiple_metrics
>>> df = pd.read_csv('path/to/data.csv')
>>> scores = [
...     {"func": accuracy_score, "BiB": True},
...     {"func": f1_score, "args": {"average": "weighted"}, "BiB": True}
... ]
>>> results = performance_multiple_metrics(df, gold='target', scores=scores, num_samples=1000)
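
To make the role of the “func”, “args”, and “BiB” keys concrete, the sketch below loops over such a list and forwards the optional arguments to each score function. The toy metrics, the helper evaluate_metrics, and keying the result by the function's __name__ are assumptions for this example; the bootstrap step and the CV, distance, and PPI computations are deliberately left out.

```python
def accuracy(y, hy):
    """Fraction of correct predictions."""
    return sum(a == b for a, b in zip(y, hy)) / len(y)


def errors(y, hy, normalize=True):
    """Number (or fraction, if normalize) of wrong predictions."""
    wrong = sum(a != b for a, b in zip(y, hy))
    return wrong / len(y) if normalize else wrong


def evaluate_metrics(gold, predictions, scores):
    """Apply every score specification to every algorithm's predictions."""
    results = {}
    for spec in scores:
        func = spec["func"]
        kwargs = spec.get("args", {})  # "args" is optional
        results[func.__name__] = {
            "performance": {name: func(gold, hy, **kwargs)
                            for name, hy in predictions.items()},
            "BiB": spec["BiB"],
        }
    return results


gold = [0, 1, 1, 0, 1, 0, 1, 1]
preds = {"alg-1": [0, 1, 1, 0, 1, 0, 0, 1],
         "alg-2": [1, 1, 0, 0, 1, 0, 0, 1]}
scores = [
    {"func": accuracy, "BiB": True},
    {"func": errors, "args": {"normalize": False}, "BiB": False},
]
results = evaluate_metrics(gold, preds, scores)
```

Carrying the BiB flag alongside each metric lets later steps decide whether the best algorithm is the one with the highest or the lowest value.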

plot_performance2(results: dict, CI: float = 0.05, var_name='Algorithm', value_name='Score', capsize=0.2, linestyle='none', kind='point', sharex=False, **kwargs)

Plot the performance with confidence intervals. This function is used by plot_difference_multiple.

Parameters:
  • results (dict): A dictionary where keys are algorithm names and values are lists of performance scores.
  • CI (float, optional): Confidence interval level for error bars. Defaults to 0.05.
  • var_name (str, optional): Variable name for the algorithms. Defaults to ‘Algorithm’.
  • value_name (str, optional): Variable name for the scores. Defaults to ‘Score’.
  • capsize (float, optional): Cap size for error bars. Defaults to 0.2.
  • linestyle (str, optional): Line style for the plot. Defaults to ‘none’.
  • kind (str, optional): Type of the plot, e.g., ‘point’, ‘bar’. Defaults to ‘point’.
  • sharex (bool, optional): Whether to share the x-axis among subplots. Defaults to False.
  • **kwargs: Additional keyword arguments for seaborn.catplot.

Returns: sns.axisgrid.FacetGrid: A seaborn FacetGrid object containing the plot.

The function works as follows:
  1. If results is a dictionary, it sorts the algorithms by their mean performance scores.
  2. Converts the sorted data into a long format DataFrame.
  3. Computes the confidence intervals if CI is provided as a float.
  4. Uses seaborn’s catplot to create and display the performance plot with confidence intervals.
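
The sorting and long-format steps can be pictured with plain Python rows, one per (algorithm, score) pair, which is the shape seaborn's catplot expects once loaded into a DataFrame. The helper to_long_format and the descending sort order are assumptions for illustration.

```python
from statistics import mean


def to_long_format(results, var_name="Algorithm", value_name="Score"):
    """Flatten {algorithm: [scores]} into one row per score."""
    # Assumed ordering: highest mean score first.
    order = sorted(results, key=lambda name: mean(results[name]), reverse=True)
    return [
        {var_name: name, value_name: value}
        for name in order
        for value in results[name]
    ]


results = {"alg2": [0.05, 0.1, 0.07], "alg1": [0.1, 0.2, 0.15]}
rows = to_long_format(results)
```

The var_name and value_name arguments simply become the column names of the long table, mirroring the parameters of the same name above.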

difference_multiple(results_dict, CI: float = 0.05)

Calculate performance differences for multiple metrics, excluding the comparison of the best with itself. Additionally, identify the best performing algorithm for each metric.

Parameters:
  • results_dict (dict): A dictionary where keys are metric names and values are dictionaries. Each sub-dictionary has algorithm names as keys and lists of performance scores as values.
  • CI (float, optional): Confidence interval level. Defaults to 0.05.

Returns: dict: A dictionary with the same structure, but where the scores for each algorithm are replaced by their differences to the scores of the best performing algorithm for that metric, excluding the best performing algorithm comparing with itself. Also includes the best algorithm name for each metric.

The function works as follows:
  1. Iterates over each metric in the results dictionary.
  2. Converts performance scores to numpy arrays for efficient computations.
  3. Identifies the best performing algorithm for each metric based on the mean performance scores.
  4. Calculates the differences in performance scores relative to the best performing algorithm.
  5. Computes confidence intervals and p-values for these differences.
  6. Stores the differences, confidence intervals, p-values, and the best algorithm for each metric.
  7. Returns a dictionary with these calculated differences and additional information.

Example usage:

>>> from CompStats import performance, difference_multiple
>>> from CompStats.tests.test_performance import DATA
>>> from sklearn.metrics import f1_score
>>> import pandas as pd
>>> df = pd.read_csv(DATA)
>>> score = lambda y, hy: f1_score(y, hy, average='weighted')
>>> perf = performance(df, score=score)
>>> diff_mult = difference_multiple(perf, CI=0.05)

plot_difference2(diff_dictionary: dict, CI: float = 0.05, var_name='Comparison', value_name='Difference', set_refline=True, set_title=True, hue='Significant', palette=None, BiB: bool = True, **kwargs)

Plot the difference in performance with its confidence intervals.

>>> from CompStats import performance, difference, plot_difference
>>> from CompStats.tests.test_performance import DATA
>>> from sklearn.metrics import f1_score
>>> import pandas as pd
>>> df = pd.read_csv(DATA)
>>> score = lambda y, hy: f1_score(y, hy, average='weighted')
>>> perf = performance(df, score=score)
>>> diff = difference(perf)
>>> ins = plot_difference(diff)

plot_performance_multiple(results_dict: dict, CI: float = 0.05, capsize: float = 0.2, linestyle: str = 'none', kind: str = 'point', **kwargs)

Create multiple performance plots, one for each performance metric in the results dictionary.

Parameters:
  • results_dict (dict): A dictionary where keys are metric names and values are dictionaries with algorithm names as keys and lists of performance scores as values.
  • CI (float, optional): Confidence interval level for error bars. Defaults to 0.05.
  • capsize (float, optional): Cap size for error bars. Defaults to 0.2.
  • linestyle (str, optional): Line style for the plot. Defaults to ‘none’.
  • kind (str, optional): Type of the plot, e.g., ‘point’, ‘bar’. Defaults to ‘point’.
  • **kwargs: Additional keyword arguments for seaborn.catplot.

Returns: None: The function creates and displays plots.

The function works as follows:
  1. Iterates over each metric in the results dictionary.
  2. Uses the plot_performance2 function to create and display the plot for each metric.
  3. Sets the title of each plot to the metric name and the best performing algorithm.

Example usage:

>>> from CompStats import plot_performance_multiple
>>> results = {
...     'accuracy': {
...         'alg1': [0.1, 0.2, 0.15],
...         'alg2': [0.05, 0.1, 0.07]
...     },
...     'f1_score': {
...         'alg1': [0.3, 0.25, 0.2],
...         'alg2': [0.2, 0.15, 0.1]
...     }
... }
>>> plot_performance_multiple(results, CI=0.05)

plot_difference_multiple(results_dict, CI=0.05, capsize=0.2, linestyle='none', kind='point', **kwargs)

Create multiple difference plots, one for each performance metric in the results dictionary.

Parameters:
  • results_dict – A dictionary where keys are metric names and values are dictionaries with algorithm names as keys and lists of scores as values.

  • CI – Confidence interval level for error bars.

  • capsize – Cap size for error bars.

  • linestyle – Line style for the plot.

  • kind – Type of the plot, e.g., ‘point’, ‘bar’.

  • kwargs – Additional keyword arguments for seaborn.catplot.

plot_scatter_matrix(perf)

Generate a scatter plot matrix comparing the performance of the same algorithm across different metrics contained in the ‘perf’ dictionary.

Parameters:

perf – A dictionary where keys are metric names and values are dictionaries with algorithm names as keys and lists of performance scores as values.
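
A scatter plot matrix needs, for each pair of metrics, the two scores of the same algorithm lined up point by point. The sketch below builds those pairs from the nested dictionary; the helper paired_points is hypothetical, and it assumes that the i-th score under every metric comes from the same bootstrap resample, which the text above does not state.

```python
from itertools import combinations


def paired_points(perf):
    """Return {(metric_x, metric_y, algorithm): [(x, y), ...]}."""
    # Assumes every metric reports the same set of algorithms.
    algorithms = list(next(iter(perf.values())))
    points = {}
    for metric_x, metric_y in combinations(perf, 2):
        for alg in algorithms:
            # Pair the i-th score under one metric with the i-th under the other.
            points[(metric_x, metric_y, alg)] = list(
                zip(perf[metric_x][alg], perf[metric_y][alg])
            )
    return points


perf = {
    "accuracy": {"alg1": [0.1, 0.2, 0.15], "alg2": [0.05, 0.1, 0.07]},
    "f1_score": {"alg1": [0.3, 0.25, 0.2], "alg2": [0.2, 0.15, 0.1]},
}
points = paired_points(perf)
```

Each entry corresponds to one cloud of points in one off-diagonal panel of the matrix, with one cloud per algorithm.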

all_differences_multiple(results_dict, alpha: float = 0.05)

Calculate performance differences for unique pairs of algorithms for multiple metrics. Also, calculates the confidence interval for the differences.

Parameters:

results_dict – A dictionary where keys are metric names and values are dictionaries. Each sub-dictionary has algorithm names as keys and lists of performance scores as values.

Returns:

A dictionary where each metric name maps to another dictionary. This dictionary contains keys for unique pairs of algorithms and their performance differences, including the confidence interval for these differences.