orion.benchmark.benchmark

orion.benchmark.benchmark(pipelines=None, datasets=None, hyperparameters=None, metrics={'accuracy': contextual_accuracy, 'f1': contextual_f1_score, 'precision': contextual_precision, 'recall': contextual_recall}, rank='f1', test_split=False, detrend=False, iterations=1, workers=1, show_progress=False, cache_dir=None, resume=False, output_path=None, pipeline_dir=None, anomaly_dir=None)

Run pipelines on the given datasets and evaluate the performance.

The pipelines are used to analyze the given signals, and the detected anomalies are then scored against the known anomalies using the indicated metrics.

Finally, the scores obtained with each metric are averaged across all the signals, ranked by the indicated metric and returned as a pandas.DataFrame.

Parameters
  • pipelines (dict or list) – dictionary with pipeline names as keys and their JSON paths as values. If a list is given, it should be a list of JSON paths, and the paths themselves will be used as names. If not given, all verified pipelines will be used for evaluation.

  • datasets (dict or list) – dictionary with dataset names as keys and lists of signals as values. If a list is given, the signals will be grouped under a single generic dataset name. If not given, all benchmark datasets will be used.

  • hyperparameters (dict or list) – dictionary with pipeline names as keys and their hyperparameter JSON paths or dictionaries as values. If a list is given, it should be in the same order as pipelines.

  • metrics (dict or list) – dictionary with metric names as keys and scoring functions as values. If a list is given, it should be a list of scoring functions, and their __name__ value will be used as the metric name. If not given, all the available metrics will be used.

  • rank (str) – Sort and rank the pipelines based on the given metric. If not given, rank using the first metric.

  • test_split (bool or float) – Whether to use the prespecified train-test split. If float, then it should be between 0.0 and 1.0 and represent the proportion of the signal to include in the test split. If not given, use False.

  • detrend (bool) – Whether to detrend the signal using scipy.signal.detrend. If not given, use False.

  • iterations (int) – Number of iterations to perform over each signal and pipeline. Defaults to 1.

  • workers (int or str) – If workers is given as an integer value other than 0 or 1, a multiprocessing Pool is used to distribute the computation across the indicated number of workers. If the string dask is given, the computation is distributed using dask. In this case, setting up the dask cluster and client is expected to be handled outside of this function, as shown in the sketch after this parameter list.

  • show_progress (bool) – Whether to use tqdm to keep track of the progress. Defaults to False.

  • cache_dir (str) – If a cache_dir is given, intermediate results are stored in the indicated directory as CSV files as they get computed. This allows inspecting results while the benchmark is still running and also recovering results in case the process does not finish properly. Defaults to None.

  • resume (bool) – Whether to continue running the experiments in the benchmark from the current progress in cache_dir.

  • output_path (str) – Location to save the intermediate results. If not given, intermediate results will not be saved.

  • pipeline_dir (str) – If a pipeline_dir is given, pipelines will get dumped in the specified directory as pickle files. Defaults to None.

  • anomaly_dir (str) – If an anomaly_dir is given, detected anomalies will get dumped in the specified directory as CSV files. Defaults to None.

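When workers='dask' is used, the dask cluster and client must be created before calling benchmark. The sketch below shows one way to do this with dask.distributed; the number of workers and the benchmark arguments are illustrative placeholders, not part of this API.

    # Minimal sketch: start a local dask cluster, then let benchmark delegate distribution to it.
    from dask.distributed import Client, LocalCluster

    from orion.benchmark import benchmark

    cluster = LocalCluster(n_workers=4)  # cluster size chosen only for illustration
    client = Client(cluster)             # becomes the default client picked up by dask

    try:
        scores = benchmark(workers='dask')  # distributes the signal/pipeline runs over the cluster
    finally:
        client.close()
        cluster.close()
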
Returns

A table containing the scores obtained with each scoring function across all the signals for each pipeline.

Return type

pandas.DataFrame
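
Example

A minimal usage sketch follows. The pipeline names, dataset name, signal names and cache directory are illustrative placeholders and should be replaced with ones available in your Orion installation.

    # Hedged sketch of a typical benchmark call; the names below are placeholders.
    from orion.benchmark import benchmark

    pipelines = ['arima', 'lstm_dynamic_threshold']  # assumed names of verified pipelines
    datasets = {'demo': ['S-1', 'E-1']}              # assumed dataset/signal names

    results = benchmark(
        pipelines=pipelines,
        datasets=datasets,
        rank='f1',                     # sort the resulting table by the f1 metric
        iterations=3,                  # repeat each signal/pipeline combination three times
        cache_dir='benchmark_cache',   # keep partial CSV results while the run is in progress
        show_progress=True,
    )

    print(results)  # one ranked row of averaged scores per pipeline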