Commit 07fe0654 authored by Alessio Netti

README for Data Analytics Framework
# Table of contents
1. [Introduction](#introduction)
2. [DCDBAnalytics](#dcdbanalytics)
   1. [Global Configuration](#globalConfiguration)
   2. [Analyzers](#analyzers)
      1. [The Sensor Tree](#sensorTree)
      2. [The Unit System](#unitSystem)
      3. [Operational Modes](#opModes)
      4. [Analyzer Configuration](#analyzerConfiguration)
         1. [Configuration Syntax](#configSyntax)
         2. [Instantiating Units](#instantiatingUnits)
         3. [MQTT Topics](#mqttTopics)
   3. [Rest API](#restApi)
      1. [Table of queries](#tableOfQueries)
      2. [Resources for GET request](#resourcesGET)
      3. [Resources for PUT request](#resourcesPUT)
      4. [Examples](#restExamples)
3. [Plugins](#plugins)
   1. [Average Plugin](#averagePlugin)
   2. [Writing Plugins](#writingPlugins)
# Introduction <a name="introduction"></a>
In this Readme we describe the DCDB Data Analytics framework, and all data abstractions that are associated with it.
# DCDBAnalytics <a name="dcdbanalytics"></a>
The DCDBAnalytics framework is built on top of DCDB, and allows performing data analytics based on sensor data
in a variety of ways. DCDBAnalytics can be deployed both in DCDBPusher and in DCDBCollectAgent, with some minor differences:
* **DCDBPusher**: only sensor data that is sampled locally and contained within the sensor cache can be used for
data analytics. However, this is the preferable way to deploy simple models at a large scale, as all computation is
performed within compute nodes, dramatically increasing scalability;
* **DCDBCollectAgent**: all available sensor data, in the local cache and in the Cassandra database, can be used for data
analytics. This approach is preferable for models that require data from multiple sources at once
(e.g., clustering-based anomaly detection), or for models that are deployed in [on-demand](#analyzerConfiguration) mode.
## Global Configuration <a name="globalConfiguration"></a>
DCDBAnalytics shares the same configuration structure as DCDBPusher and DCDBCollectAgent, using a global.conf configuration file.
All output sensors of the frameworks are therefore affected by configuration parameters described in the global Readme.
Additional parameters specific to this framework are the following:
| Value | Explanation |
|:----- |:----------- |
| global | Wrapper structure for the global values.
| hierarchy | Space-separated sequence of regular expressions used to infer the local (DCDBPusher) or global (DCDBCollectAgent) sensor hierarchy. This parameter should be wrapped in quotes to ensure proper parsing. See the Sensor Tree [section](#sensorTree) for more details.
| analyzerPlugins | Block containing the specification of all data analytics plugins to be instantiated.
| plugin _name_ | The plugin name is used to build the corresponding lib-name (e.g. average --> libdcdbanalyzer_average.1.0)
| path | Specifies the path where the plugin (the shared library) is located. If left empty, DCDB will look in the default lib-directories (/usr/lib and friends) for the plugin file.
| config | One can specify a separate config-file (including path to it) for the plugin to use. If not specified, DCDB will look up pluginName.conf (e.g. average.conf) in the same directory where global.conf is located.
| | |
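As an illustration, a hypothetical global.conf fragment using the parameters above could look as follows (the plugin name, paths and hierarchy string are placeholders, not defaults):

```
global {
    hierarchy    "rack\d{2}. node\d{2}. cpu\d{2}."

    analyzerPlugins {
        plugin average {
            path    ./lib
            config  ./average.conf
        }
    }
}
```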
## Analyzers <a name="analyzers"></a>
Analyzers are the basic building block in DCDBAnalytics. An Analyzer is instantiated within a plugin, performs a specific
task and acts on sets of inputs and outputs called _units_. Analyzers are functionally equivalent to _sensor groups_
in DCDBPusher, but instead of sampling data, they process it and output new sensors. Some high-level examples
of analyzers are the following:
* An analyzer that performs time series regression on a particular input sensor, and outputs its prediction;
* An analyzer that aggregates a series of input sensors, builds feature vectors, and performs machine
learning-based tasks using a supervised model;
* An analyzer that performs clustering-based anomaly detection by using different sets of inputs associated to different
compute nodes;
* An analyzer that outputs statistical features related to the time series of a certain input sensor.
### The Sensor Tree <a name="sensorTree"></a>
Before diving into the configuration and instantiation of analyzers, we introduce the concept of _sensor tree_. A sensor
tree is simply a data structure expressing the hierarchy of sensors that are being sampled; internal nodes express
hierarchical entities (e.g. clusters, racks, nodes, cpus), whereas leaf nodes express actual sensors. In DCDBPusher,
a sensor tree refers only to the local hierarchy, while in DCDBCollectAgent it can capture the hierarchy of the entire
system being sampled.
A sensor tree is built at initialization time of DCDBAnalytics, and is implemented in the _SensorNavigator_ class.
Its construction is regulated by the _hierarchy_ global configuration parameter, which can be for example the following:
```
rack\d{2}. node\d{2}. cpu\d{2}.
```
Given the above example hierarchy string, we are enforcing the target sensor tree to have three levels, the topmost of
which expresses the racks to which sensors belong, and the lowest one the cpu core (if any). Such a string could be used,
for example, to build a sensor tree starting from the following set of sensor names:
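A plausible set of such sensor names, consistent with the _instr_ and _branch-misses_ sensors referenced below and the example units discussed later in this Readme, is the following:

```
rack00.node05.cpu00.instr
rack00.node05.cpu00.branch-misses
rack00.node05.cpu01.instr
rack00.node05.cpu01.branch-misses
rack02.node03.cpu00.instr
rack02.node03.cpu00.branch-misses
```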
Each sensor name is interpreted as a path within the sensor tree. Therefore, the _instr_ and _branch-misses_ sensors
will be placed as leaf nodes in the deepest level of the tree, as children of the respective cpu node they belong to.
Such cpu nodes will be in turn children of the nodes they belong to, and so on.
The generated sensor tree can then be used to navigate the sensor hierarchy, and perform actions such as _retrieving
all sensors belonging to a certain node, to a neighbor of a certain node, or to the rack a certain node belongs to_.
Please refer to the documentation of the _SensorNavigator_ class for more details.
> NOTE &ensp;&ensp;&ensp;&ensp;&ensp; If no hierarchy string has been specified in the configuration, the tree is built automatically by assuming that each dot-separated part of the sensor name expresses a level in the hierarchy. The total depth of the tree is thus determined at runtime as well.

> NOTE 2 &ensp;&ensp;&ensp; Sensor trees are always built from the names of sensors _as they are published_. Therefore, please make sure to use the _-a_ option in DCDBPusher appropriately, to build sensor names that express the desired hierarchy.
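The tree construction described above can be sketched in a few lines of Python (an illustrative simplification, not the actual _SensorNavigator_ implementation; the nested-dict representation is an assumption made here for clarity):

```python
import re

def build_tree(sensor_names, hierarchy):
    r"""Build a nested dict from sensor names, one regex per tree level.

    hierarchy is a list of regular expressions such as
    [r"rack\d{2}\.", r"node\d{2}\.", r"cpu\d{2}\."]; whatever remains
    of a name after matching becomes a leaf node (the sensor itself).
    """
    patterns = [re.compile(h) for h in hierarchy]
    tree = {}
    for name in sensor_names:
        node, rest = tree, name
        for pat in patterns:
            m = pat.match(rest)
            if not m:
                break  # this name is shallower than the full hierarchy
            node = node.setdefault(m.group(0), {})
            rest = rest[m.end():]
        if rest:
            node[rest] = None  # leaf node: the sensor itself
    return tree

tree = build_tree(
    ["rack00.node05.cpu00.instr", "rack00.node05.MemFree"],
    [r"rack\d{2}\.", r"node\d{2}\.", r"cpu\d{2}\."],
)
# tree == {"rack00.": {"node05.": {"cpu00.": {"instr": None}, "MemFree": None}}}
```

Note how _MemFree_, which does not match the cpu-level expression, becomes a leaf one level higher in the tree, exactly as described above for sensors attached to intermediate nodes.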
### The Unit System <a name="unitSystem"></a>
Each analyzer operates on one or more _units_. A unit represents an abstract (or physical) entity in the current system that
is the target of analysis. A unit could be, for example, a rack, a node within a rack, a CPU within a node or an entire HPC system.
Units are identified by three components:
* **Name**: The name of this unit, that corresponds to the entity it represents. For example, _rack02.node03._ or _rack00.node05.cpu01._ could be unit names. A unit must always correspond to an existing internal node in the current sensor tree;
* **Input**: The set of sensors that constitute the input for analysis conducted on this unit. The sensors must share a hierarchical relationship with the unit: that is, they can either belong to the node represented by this unit, to its subtree, or to one of its ancestors;
* **Output**: The set of output sensors that are produced from any analysis conducted on this unit. The output sensors are always directly associated with the node represented by the unit.
Units are a way to define _patterns_ in the sensor tree and retrieve sensors that are associated to each other by a
hierarchical relationship. See the configuration [section](#instantiatingUnits) for more details on how to create
templates in order to define units suitable for analyzers.
### Operational Modes <a name="opModes"></a>
Analyzers can operate in two different modes:
* **Streaming**: streaming analyzers perform data analytics online and autonomously, processing incoming sensor data at regular intervals.
The units of streaming analyzers are completely resolved and instantiated at configuration time. The type of output of streaming
analyzers is identical to that of _sensors_ in DCDBPusher, which are pushed to DCDBCollectAgent and finally to the Cassandra database,
resulting in a time series representation;
* **On-demand**: on-demand analyzers do not perform data analytics autonomously, but only when queried by users. Unlike
for streaming analyzers, the units of on-demand analyzers are not instantiated at configuration time, but only when a query is performed. When
such an event occurs, the analyzer verifies that the queried unit belongs to its _unit domain_, and then instantiates it,
resolving its inputs and outputs. The unit is then stored in a local cache for future re-use. The outputs of an on-demand
analyzer are exposed through the REST API, and are never pushed to the Cassandra database.
Use of streaming analyzers is advised when a time series-like output is required, whereas on-demand analyzers are effective
when data is required at specific times and for specific purposes, and when the unit domain's size makes the use of streaming
analyzers unfeasible.
### Analyzer Configuration <a name="analyzerConfiguration"></a>
Here we describe how to configure and instantiate analyzers in DCDBAnalytics. The configuration scheme is very similar
to that of _sensor groups_ in DCDBPusher, and a _global_ configuration block can be defined in each plugin configuration
file. The following is instead a list of configuration parameters that are available for the analyzers themselves:
| Value | Explanation |
|:----- |:----------- |
| default | Name of the template that must be used to configure this analyzer.
| interval | Specifies how often the analyzer will be invoked to perform computations, and thus the sampling interval of its output sensors. Only used for analyzers in _streaming_ mode.
| minValues | Minimum number of readings that need to be stored in output sensors before these are pushed as MQTT messages. Only used for analyzers in _streaming_ mode.
| mqttPart | Part of the MQTT topic associated to this analyzer. Only used when the Unit system is not employed (see this [section](#mqttTopics)).
| duplicate | If set to _false_, only one analyzer object will be instantiated. Such analyzer will perform computation over all units that are instantiated, at every interval, sequentially. If set to _true_, the analyzer object will be duplicated such that each copy will have one unit associated to it. This makes it possible to exploit parallelism between units, but results in separate models so as to avoid race conditions.
| streaming | If set to _true_, the analyzer will operate in _streaming_ mode, pushing output sensors regularly. If set to _false_, the analyzer will instead operate in _on-demand_ mode.
| input | Block of input sensors that must be used to instantiate the units of this analyzer. These can either be a list of strings, or fully-qualified _Sensor_ blocks containing specific attributes (see DCDBPusher Readme).
| output | Block of output sensors that will be associated to this analyzer. These must be _Sensor_ blocks containing valid MQTT suffixes. Note that the number of output sensors is usually fixed depending on the type of analyzer.
| | |
#### Configuration Syntax <a name="configSyntax"></a>
In the following we show a sample configuration block for the _Average_ plugin. For the full version, please refer to the
default configuration file in the _config_ directory:
```
template_average def1 {
    interval    1000
    minValues   3
    duplicate   false
    streaming   true
}

average avg1 {
    default     def1
    mqttPart    FF0

    input {
        sensor col_user
        sensor MemFree
    }

    output {
        sensor sum {
            mqttsuffix 76
        }

        sensor max {
            mqttsuffix 77
        }

        sensor avg {
            mqttsuffix 78
        }
    }
}
```
The configuration shown above uses a template _def1_ for some configuration parameters, which are then applied to the
_avg1_ analyzer. This analyzer takes the _col_user_ and _MemFree_ sensors as input (which must be available under these names),
and outputs _sum_, _max_, and _avg_ sensors. In this configuration, the Unit system and sensor hierarchy are not used,
and therefore only one generic unit (called the _root_ unit) will be instantiated.
#### Instantiating Units <a name="instantiatingUnits"></a>
Here we propose once again the configuration discussed above, this time making use of the Unit system to abstract from
the specific system being used and simplify configuration. The adjusted configuration block is the following:
```
template_average def1 {
    interval    1000
    minValues   3
    duplicate   false
    streaming   true
}

average avg1 {
    default     def1
    mqttPart    FF0

    input {
        sensor "<bottomup>col_user"
        sensor "<bottomup 1>MemFree"
    }

    output {
        sensor "<bottomup, filter cpu01>sum" {
            mqttsuffix 76
        }

        sensor "<bottomup, filter cpu01>max" {
            mqttsuffix 77
        }

        sensor "<bottomup, filter cpu01>avg" {
            mqttsuffix 78
        }
    }
}
```
In each sensor declaration, the _< >_ block is a placeholder that will be replaced with the names of the units that will
be associated to the analyzer, thus resolving the sensor names. This block allows navigating the current sensor tree
and selecting the nodes that will constitute the units. Its syntax is the following:
```
< bottomup|topdown X, filter Y >SENSORNAME
```
The first section specifies the _level_ in the sensor tree at which nodes must be selected. _bottomup X_ and _topdown X_
respectively mean _"search X levels up from the deepest level in the sensor tree"_ and _"search X levels down from the
topmost level in the sensor tree"_. The _X_ numerical value can be omitted as well.
The second section, on the other hand, allows searching the sensor tree _horizontally_: within the level specified in the
first section of the configuration block, only the nodes whose names match the regular expression Y will be selected.
This way, we can navigate the current sensor tree both vertically and horizontally, and easily instantiate units starting
from nodes in the tree. The set of nodes in the current sensor tree that match the specified configuration block is
defined as the _unit domain_ of the analyzer.
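How such a block could be resolved can be sketched in Python (an illustrative simplification of the actual mechanism; the flat _levels_ dictionary standing in for the sensor tree is an assumption made here for brevity):

```python
import re

# Hypothetical flat view of the sensor tree from the previous section:
# depth -> names of the internal nodes at that depth.
levels = {
    0: ["rack00.", "rack02."],
    1: ["rack00.node05.", "rack02.node03."],
    2: ["rack00.node05.cpu00.", "rack00.node05.cpu01.", "rack02.node03.cpu00."],
}

def resolve_units(levels, selector):
    """Return the unit nodes selected by the body of a '< >' block.

    selector examples: "bottomup", "bottomup 1", "topdown 2",
    optionally followed by ", filter Y" where Y is a regex.
    """
    m = re.match(r"(bottomup|topdown)\s*(\d*)(?:\s*,\s*filter\s+(.+))?$", selector)
    direction, offset, filt = m.group(1), int(m.group(2) or 0), m.group(3)
    # Vertical navigation: pick the level, counting from the bottom or the top.
    depth = max(levels) - offset if direction == "bottomup" else offset
    nodes = levels[depth]
    # Horizontal navigation: keep only nodes matching the filter regex.
    if filt:
        nodes = [n for n in nodes if re.search(filt, n)]
    return nodes

units = resolve_units(levels, "bottomup, filter cpu01")
# units == ["rack00.node05.cpu01."]
```

With this tree, `"bottomup 1"` would select all node-level entries, while `"topdown"` would select the two racks.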
The configuration algorithm then works in two steps:
1. The _output_ block of the analyzer is read, and its unit domain is determined; this implies that all sensors in the
output block must share the same _< >_ block, and therefore match the same unit domain;
2. For each unit in the domain, its input sensors are identified. We start from the _unit_ node in the sensor tree, and
navigate to the corresponding sensor node according to its _< >_ block, which identifies its level in the tree. Each
unit, once its inputs and outputs are defined, is then added to the analyzer.
According to the sensor tree built in the previous [section](#sensorTree), the configuration above would result in
an analyzer with the following set of units:
```
rack00.node05.cpu00. {
    Inputs {
        rack00.node05.cpu00.col_user
        rack00.node05.MemFree
    }

    Outputs {
        rack00.node05.cpu00.sum
        rack00.node05.cpu00.max
        rack00.node05.cpu00.avg
    }
}

rack02.node03.cpu00. {
    Inputs {
        rack02.node03.cpu00.col_user
        rack02.node03.MemFree
    }

    Outputs {
        rack02.node03.cpu00.sum
        rack02.node03.cpu00.max
        rack02.node03.cpu00.avg
    }
}
```
#### MQTT Topics <a name="mqttTopics"></a>
The MQTT topics associated to output sensors of a certain analyzer are constructed in different ways depending
on the unit they belong to:
* **Root unit**: if the output sensors belong to the _root_ unit, that is, they do not belong to any level in the sensor
hierarchy and are uniquely defined, the respective topics are constructed like in DCDBPusher sensors, by concatenating
the MQTT prefix, analyzer part and sensor suffix that are defined;
* **Any other unit**: if the output sensor belongs to any other unit in the sensor tree, its MQTT topic is constructed
by concatenating the MQTT prefix associated to the unit (which is defined as _the portion of the MQTT topic shared by all sensors
belonging to such unit_) and the sensor suffix. The middle part of the topic is padded accordingly to ensure a fixed length.
## Rest API <a name="restApi"></a>
DCDBAnalytics provides a REST API that can be used to perform various management operations on the framework. The
API is functionally identical to that of DCDBPusher, and is hosted at the same address. All requests that are targeted
at the data analytics framework must have a resource path starting with _/analytics_.
### Table of Queries <a name="tableOfQueries"></a>
The DCDBAnalytics REST API shares the same query types as DCDBPusher. In addition, the following queries specific to
this framework are defined:
| Key | Value | Explanation |
|:--- |:----- |:----------- |
| unit | A unit name | Name of the target unit for PUT _compute_ queries
### Resources for GET Request <a name="resourcesGET"></a>
The following is the list of available GET resources for the DCDBAnalytics REST API:
| Resource | Explanation |
|:-------- |:----------- |
| /analytics/help | Responds with a small cheatsheet which presents all currently supported resources.
| /analytics/plugins | (Discovery) Returns a list of all currently loaded data analytics plugins.
| /analytics/[plugin]/analyzers | (Discovery) Returns a list of all active analyzers for the given plugin.
| /analytics/[plugin]/[analyzer]/units | (Discovery) Returns a list of all units instantiated in the given analyzer. The [analyzer] argument is optional, in which case all units instantiated in all analyzers will be returned.
| /analytics/[plugin]/[analyzer]/sensors | (Discovery) Returns a list of all output sensors associated to the given analyzer. The [analyzer] argument is optional, in which case all output sensors in all analyzers will be returned.
> NOTE &ensp;&ensp;&ensp;&ensp;&ensp; The value of analyzer output sensors can be retrieved with the _compute_ resource
described in the PUT section, or with the [plugin]/[sensor]/avg resource defined in the DCDBPusher REST API.
### Resources for PUT Request <a name="resourcesPUT"></a>
The following is the list of available PUT resources for the DCDBAnalytics REST API:
| Resource | Explanation |
|:-------- |:----------- |
| /analytics/help | Responds with a small cheatsheet which presents all currently supported resources.
| /analytics/[plugin]/[analyzer]/start | Starts the (stopped) analyzer in the specified plugin. The [analyzer] argument is optional. If not specified, all analyzers in the given plugin will be started.
| /analytics/[plugin]/[analyzer]/stop | Stops the (started) analyzer in the specified plugin. The [analyzer] argument is optional. If not specified, all analyzers in the given plugin will be stopped. This resource, together with _start_, can only be used on _streaming_ analyzers.
| /analytics/[plugin]/reload | Reloads the configuration and re-initializes the given plugin.
| /analytics/[plugin]/[analyzer]/compute | Queries the given analyzer for a certain input unit, and returns the result. This resource should be used with _on-demand_ analyzers, but works with _streaming_ analyzers as well. The target unit must be specified in a query, and if not present, the _root_ unit will be queried instead.
> NOTE &ensp;&ensp;&ensp;&ensp;&ensp; Developers can integrate custom plugin-specific REST API resources by implementing the _REST_ method in _AnalyzerTemplate_. To know more about plugin-specific resources, please refer to the respective documentation.
### Rest Examples <a name="restExamples"></a>
In the following are some examples of REST requests over HTTPS:
* Listing the units associated to the _avgAnalyzer1_ analyzer in the _average_ plugin:

```
GET https://localhost:8000/analytics/average/avgAnalyzer1/units?authkey=myToken
```

* Listing the output sensors associated to all analyzers in the _average_ plugin:

```
GET https://localhost:8000/analytics/average/sensors?authkey=myToken
```

* Reloading the _average_ plugin:

```
PUT https://localhost:8000/analytics/average/reload?authkey=myToken
```

* Stopping the _avgAnalyzer1_ analyzer in the _average_ plugin:

```
PUT https://localhost:8000/analytics/average/avgAnalyzer1/stop?authkey=myToken
```

* Performing a query for unit _node00.cpu03._ to the _avgAnalyzer1_ analyzer in the _average_ plugin:

```
PUT https://localhost:8000/analytics/average/avgAnalyzer1/compute?authkey=myToken;unit=node00.cpu03.
```
# Plugins <a name="plugins"></a>
Here we describe available plugins in DCDBAnalytics, and how to configure them.
## Average Plugin <a name="averagePlugin"></a>
The _Average_ plugin was developed for testing purposes and implements a simple data processing algorithm. Specifically,
this plugin computes and outputs the _sum_, _maximum_ and _average_ of its input sensors. Analyzers in the _Average_
plugin must have three outputs, which are mapped to the three metrics described above.
The configuration parameters specific to the _Average_ plugin are the following:
| Value | Explanation |
|:----- |:----------- |
| window | Length in milliseconds of the time window that is used to retrieve recent readings for the input sensors, starting from the latest one.
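The computation performed at each interval can be sketched as follows (a simplified illustration; the in-memory list of timestamped readings is an assumption made here, not the plugin's actual data structure):

```python
def window_stats(readings, window_ms):
    """Compute (sum, max, avg) over the readings that fall within
    window_ms of the most recent reading.

    readings: list of (timestamp_ms, value) pairs, sorted by timestamp.
    """
    latest = readings[-1][0]
    values = [v for t, v in readings if latest - t <= window_ms]
    return sum(values), max(values), sum(values) / len(values)

# Three readings; with a 1000 ms window only the last two are used.
print(window_stats([(0, 10), (500, 20), (1200, 30)], 1000))  # (50, 30, 25.0)
```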
## Writing DCDBAnalytics Plugins <a name="writingPlugins"></a>
Generating a DCDBAnalytics plugin requires implementing an _Analyzer_ and a _Configurator_ class which contain all logic
tied to the specific plugin. Such classes should be derived from _AnalyzerTemplate_ and _AnalyzerConfiguratorTemplate_
respectively, which contain all plugin-agnostic configuration and runtime features. Please refer to the documentation
of the _Average_ plugin for an overview of how a basic plugin can be implemented.