Orion
The main component in the Orion project are the Orion Pipelines, which consist of MLBlocks Pipelines specialized in detecting anomalies in time series.
As MLPipeline instances, Orion Pipelines:
MLPipeline
consist of a list of one or more mlstars
can be fitted on some data and later on used to predict anomalies on more data
can be scored by comparing their predictions with some known anomalies
have hyperparameters that can be tuned to improve their anomaly detection performance
can be stored as a JSON file that includes all the primitives that compose them, as well as other required configuration options.
As previously mentioned, a pipeline is composed of a sequence of Primitives. In Orion, we store pipelines as annotated JSON files. Let’s view the structure of a pipeline JSON for anomaly detection in orion/pipelines. For example let’s consider the ARIMA pipeline. There are four main categories defined in the JSON:
{ "primitives": [ ... ], "init_params": { ... }, "input_names": { ... }, "output_names": { ... } }
primitives: this is where we list the necessary primitives for the pipeline, where we can notice a series of preprocessing, modeling, and postprocessing primitives.
init_params: this is where we initialize our pipeline parameters. If the parameters are not specified, then they will use default values. Notice how to change a value, we first specify the primitive as key then we change the parameter value.
input_names: this is where we map the input name of a parameter in a primitive to a variable within the Context.
output_names: this is where we assign the output of the primitive a variable name other than the default one within the primitive JSON.
Optionally there is an addition category outputs. In the case where you would like to view intermediatry outputs from the pipeline we can use this category to define it. Continuing on the previous example, ARIMA. To view the output of orion.primitives.timeseries_anomalies.regression_errors primitive, we include an additional visualization key with the output of interest:
orion.primitives.timeseries_anomalies.regression_errors
visualization
{ "outputs": { "default": [ { "name": "events", "variable": "orion.primitives.timeseries_anomalies.find_anomalies#1.y" } ], "visualization": [ { "name": "errors", "variable": "orion.primitives.timeseries_anomalies.regression_errors#1.errors" }, ] } }
Then orion.detect(.., visualization=True) will return two outputs: the first output being the detected anomalies, and the second output is a dictionary containing the specified outputs in visualization. You can read more about the specifications of orion API.
orion.detect(.., visualization=True)
To read more about the intricacies of composing a pipeline, please refer to MLBlocks Pipelines.
In the Orion Project, the pipelines are included as JSON files, which can be found in the subdirectories inside the orion/pipelines folder.
orion/pipelines
This is the list of pipelines available so far, which will grow over time:
name
description
ARIMA
ARIMA based pipeline
LSTM
LSTM based pipeline inspired by the NASA
Dummy
Dummy pipeline for testing
TadGAN
GAN based pipeline with reconstruction based errors
LSTM AE
Autoencoder based pipeline with LSTM layers
Dense AE
Autoencoder based pipeline with Dense layers
VAE
Variational autoencoder
AER
Autoencoder with Regression based pipeline
Azure
Azure API for Anomaly Detector
For each pipeline, there is a dedicated folder that stores: * the pipeline itself * the hyperparameter settings used for this pipeline to produce the results in the benchmark. To Learn more about it, we detail the process of benchmarking here.
We store a pipeline json within the pipeline subfolder. In addition, the hyperparameters would be called pipeline_dataset.json within the same folder. For example:
json
pipeline_dataset.json
├── tadgan/ ├── tadgan.json └── tadgan_dataset.json └── arima/ ├── arima.json └── arima_dataset.json
Note
the pipeline name must follow the subfolder name.
In Orion, we organize pipelines into verified and sandbox. The distinction between verified and sandbox is kept until several tests and verifications are made. We consider two cases when pipelines are inspected before transferring:
When a new pipeline is proposed.
When a new set of hyperparameters are suggested.
In both of these cases, the user is expected to open a new PR and pass tests before considering its merge and storage in sandbox. Next, we test the new pipeline/hyperparameters in the benchmark and verify that they perform as expected and indicated by the user. Once these checks have passed, we make the transfer.
To know more about our process in contributing and testing, read our Contributing guidelines and Benchmarking.
To use a pipeline on your own data, first you need to make sure that it follows the data format described in Data. Additionally, some hyperparameter changes in the pipeline might be necessary.
Since pipelines are composed of Primitives, you can discover the interpretation of each hyperparameter by visiting the primitive’s documentation. One of the most used primitives is time_segments_aggregate which makes your signal equi-spaced. You need to set the interval hyperparameter to the frequency of your data. For example, if your data has a frequency of 5 minutes then interval=300 would be most appropriate since we are dealing with second intervals. A hands on example is shown here:
time_segments_aggregate
interval
interval=300
In [1]: import numpy as np In [2]: import pandas as pd In [3]: from orion import Orion In [4]: np.random.seed(0) In [5]: custom_data = pd.DataFrame({"timestamp": np.arange(0, 150000, 300), ...: "value": np.random.randint(0, 10, 500)}) ...: In [6]: hyperparameters = { ...: "mlstars.custom.timeseries_preprocessing.time_segments_aggregate#1": { ...: "interval": 300 ...: }, ...: 'keras.Sequential.LSTMTimeSeriesRegressor#1': { ...: 'epochs': 5, ...: 'verbose': True ...: } ...: } ...: In [7]: orion = Orion( ...: pipeline='lstm_dynamic_threshold', ...: hyperparameters=hyperparameters ...: ) ...: In [8]: orion.fit(custom_data) Epoch 1/5 1/4 [======>.......................] - ETA: 9s - loss: 0.4033 - mse: 0.4033 2/4 [==============>...............] - ETA: 0s - loss: 0.3734 - mse: 0.3734 3/4 [=====================>........] - ETA: 0s - loss: 0.3962 - mse: 0.3962 4/4 [==============================] - ETA: 0s - loss: 0.3945 - mse: 0.3945 4/4 [==============================] - 5s 519ms/step - loss: 0.3945 - mse: 0.3945 - val_loss: 0.4180 - val_mse: 0.4180 Epoch 2/5 1/4 [======>.......................] - ETA: 1s - loss: 0.4790 - mse: 0.4790 2/4 [==============>...............] - ETA: 0s - loss: 0.4204 - mse: 0.4204 3/4 [=====================>........] - ETA: 0s - loss: 0.3972 - mse: 0.3972 4/4 [==============================] - ETA: 0s - loss: 0.3939 - mse: 0.3939 4/4 [==============================] - 1s 345ms/step - loss: 0.3939 - mse: 0.3939 - val_loss: 0.4166 - val_mse: 0.4166 Epoch 3/5 1/4 [======>.......................] - ETA: 1s - loss: 0.4346 - mse: 0.4346 2/4 [==============>...............] - ETA: 0s - loss: 0.3812 - mse: 0.3812 3/4 [=====================>........] - ETA: 0s - loss: 0.3877 - mse: 0.3877 4/4 [==============================] - ETA: 0s - loss: 0.3921 - mse: 0.3921 4/4 [==============================] - 1s 343ms/step - loss: 0.3921 - mse: 0.3921 - val_loss: 0.4175 - val_mse: 0.4175 Epoch 4/5 1/4 [======>.......................] - ETA: 1s - loss: 0.3809 - mse: 0.3809 2/4 [==============>...............] - ETA: 0s - loss: 0.3774 - mse: 0.3774 3/4 [=====================>........] - ETA: 0s - loss: 0.3838 - mse: 0.3838 4/4 [==============================] - ETA: 0s - loss: 0.3933 - mse: 0.3933 4/4 [==============================] - 1s 344ms/step - loss: 0.3933 - mse: 0.3933 - val_loss: 0.4197 - val_mse: 0.4197 Epoch 5/5 1/4 [======>.......................] - ETA: 1s - loss: 0.3718 - mse: 0.3718 2/4 [==============>...............] - ETA: 0s - loss: 0.4017 - mse: 0.4017 3/4 [=====================>........] - ETA: 0s - loss: 0.3931 - mse: 0.3931 4/4 [==============================] - ETA: 0s - loss: 0.3946 - mse: 0.3946 4/4 [==============================] - 1s 342ms/step - loss: 0.3946 - mse: 0.3946 - val_loss: 0.4202 - val_mse: 0.4202
A Pipeline is a sequence of primitives that perform a number of operations to the given input to produce a desired output. In Orion, the pipeline takes a signal as input and produces a list of detected anomalies as output.
To create an Orion pipeline, we need to assemble a JSON representation of the pipeline. The representation contains four blocks mentioned above:
primitives
init_params
input_names
output_names
Let’s visit one of the existing Orion pipelines to help understand how it is working, as an example we will work with the lstm_dynamic_threshold pipeline.
lstm_dynamic_threshold
The first block in the pipeline JSON is a list containing the sequence of primitives applied to the signal in order to produce a list of detected anomalies.
"primitives": [ "mlstars.custom.timeseries_preprocessing.time_segments_aggregate", "sklearn.impute.SimpleImputer", "sklearn.preprocessing.MinMaxScaler", "mlstars.custom.timeseries_preprocessing.rolling_window_sequences", "keras.Sequential.LSTMTimeSeriesRegressor", "orion.primitives.timeseries_errors.regression_errors", "orion.primitives.timeseries_anomalies.find_anomalies" ]
The primitive time_segments_aggregate takes a raw signal to produce another signal where the time intervals between each sample and another is equal. Then the pipeline uses SimpleImputer to impute missing values in the signal with the mean value. Next, the pipeline scale the data into the range [-1, 1] using MinMaxScaler. After that, rolling_window_sequences create the training window sequences for the model. The processed signal is now ready to train the double-stacked LSTM network through LSTMTimeSeriesRegressor primitive. Once the network is trained, we generate the predicted signal and compute the discrepancies (error) using regression_errors, which is an absolute point-wise difference. Lastly, we use find_anomalies on error values to find anomalous regions.
SimpleImputer
[-1, 1]
MinMaxScaler
rolling_window_sequences
LSTMTimeSeriesRegressor
regression_errors
find_anomalies
Customizing pipelines and swapping one primitive with another can be done by changing and adding primitives in this block.
The second block in the pipeline JSON is a dictionary of primitive names and their associated initialized parameters.
"init_params": { "mlstars.custom.timeseries_preprocessing.time_segments_aggregate#1": { "time_column": "timestamp", "interval": 21600, "method": "mean" }, "sklearn.preprocessing.MinMaxScaler#1": { "feature_range": [ -1, 1 ] }, "mlstars.custom.timeseries_preprocessing.rolling_window_sequences#1": { "target_column": 0, "window_size": 250 }, "keras.Sequential.LSTMTimeSeriesRegressor": { "epochs": 35 }, "orion.primitives.timeseries_anomalies.find_anomalies#1": { "window_size_portion": 0.33, "window_step_size_portion": 0.1, "fixed_threshold": true } }
For example, in time_segments_aggregate we are marking that the interval between one sample and another in the signal is 6 hours which is 21600. Another example is that the LSTMTimeSeriesRegressor model is trained for 35 epochs.
If the parameters are not present in this block, the default value in the original primitive JSON file will be used. For example if the interval was not specified, then it will use the default value (which is 3600) that is defined in the original primitive.
Block three is used for mapping name of argument to existing variables. Recall that primitives define input arguments and outputs. These variables names are how the pipeline knows which primitive uses which variables.
Sometimes we need to change the names of these input arguments to find the correct variable.
"input_names": { "orion.primitives.timeseries_anomalies.find_anomalies#1": { "index": "target_index" } }
Here the pipeline changes the input argument index to target_index which is basically denoting that the input index will now be the output target_index of a previous primitive which is the rolling_window_sequences primitive.
index
target_index
Similarly to input names, sometimes we need to change output names, which is what the pipeline specifies in block four.
"output_names": { "keras.Sequential.LSTMTimeSeriesRegressor#1": { "y": "y_hat" } }
Here we change the output of the primitive LSTMTimeSeriesRegressor from y to y_hat in order not to override the existing value of y and also to prepare it as input for the next primitive regression_errors.
y
y_hat
To add a new pipeline to Orion, follow the same steps mentioned in Contributing to Orion Benchmark page.