Commit 48af5716 authored by Alessio Netti

Analytics: Readme for Regressor plugin

- Also fixed a few compilation warnings
parent 5549140f
......@@ -20,7 +20,8 @@
3. [Plugins](#plugins)
1. [Aggregator Plugin](#averagePlugin)
2. [Job Aggregator Plugin](#jobaveragePlugin)
3. [Writing Plugins](#writingPlugins)
3. [Regressor Plugin](#regressorPlugin)
4. [Writing Plugins](#writingPlugins)
# Introduction <a name="introduction"></a>
In this Readme we describe the DCDB Data Analytics framework, and all data abstractions that are associated with it.
......@@ -634,7 +635,38 @@ The _Job Aggregator_ plugin offers the same functionality as the _Aggregator_ pl
it performs aggregation of the specified input sensors across all nodes on which each job is running. Please refer
to the corresponding [section](#jobanalyzers) for more details.
## Writing DCDBAnalytics Plugins <a name="writingPlugins"></a>
## Regressor Plugin <a name="regressorPlugin"></a>
The _Regressor_ plugin performs regression on sensor data using _random forest_ machine learning predictors. The algorithmic backend is provided by the OpenCV library.
For each input sensor of a given unit, statistical features are computed from the recent values of its time series. These are the average, standard deviation, sum of differences, 25th percentile and 75th percentile. The features of all input sensors are then combined into a single _feature vector_.
In order to operate correctly, the models used by the Regressor plugin need to be trained: this procedure is performed automatically online when using the _streaming_ mode, and can also be triggered at any time through the REST API. In _on demand_ mode, automatic training cannot be performed, and as such a pre-trained model must be loaded from a file.
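As an illustration of the five per-sensor features listed above, the following is a minimal sketch and not the plugin's actual code. The names `computeFeatures` and `quantileSorted` are hypothetical, the standard deviation is assumed to be the population form, and the percentiles use a simple index-based rule; the real implementation may differ on all of these points.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of DCDB): picks the element at position
// floor(q * (n - 1)) of an already sorted, non-empty vector.
static double quantileSorted(const std::vector<double>& sorted, double q) {
    std::size_t idx = static_cast<std::size_t>(q * (sorted.size() - 1));
    return sorted[idx];
}

// Returns {average, standard deviation, sum of differences,
//          25th percentile, 75th percentile} for one sensor's window.
// An empty window yields an empty feature list.
std::vector<double> computeFeatures(const std::vector<double>& window) {
    if (window.empty()) return {};
    const double n = static_cast<double>(window.size());

    double sum = 0.0;
    for (double v : window) sum += v;
    const double avg = sum / n;

    // Assumption: population standard deviation (divide by n).
    double sq = 0.0;
    for (double v : window) sq += (v - avg) * (v - avg);
    const double stddev = std::sqrt(sq / n);

    // Sum of differences between consecutive readings.
    double diffSum = 0.0;
    for (std::size_t i = 1; i < window.size(); ++i)
        diffSum += window[i] - window[i - 1];

    std::vector<double> sorted(window);
    std::sort(sorted.begin(), sorted.end());

    return {avg, stddev, diffSum,
            quantileSorted(sorted, 0.25), quantileSorted(sorted, 0.75)};
}
```

Under these assumptions, a unit with several input sensors would concatenate one such five-element block per sensor to form its feature vector, which matches the `getInputs().size()*REG_NUMFEATURES` check in the analyzer code.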
The following are the configuration parameters available for the _Regressor_ plugin:
| Value | Explanation |
|:----- |:----------- |
| window | Length in milliseconds of the time window that is used to retrieve recent readings for the input sensors, starting from the latest one.
| trainingSamples | Number of samples necessary to perform training of the current model.
| targetDistance | Temporal distance (in terms of lags) of the sample that is to be predicted.
| inputPath | Path of a file from which a pre-trained random forest model must be loaded.
| outputPath | Path of a file to which the random forest model trained at runtime must be saved.
> NOTE &ensp;&ensp;&ensp;&ensp;&ensp; When the _duplicate_ option is enabled, the _outputPath_ field is ignored to avoid file collisions from multiple regressors.
Additionally, input sensors in analyzers of the Regressor plugin accept the following parameter:
| Value | Explanation |
|:----- |:----------- |
| target | Boolean value. If true, this sensor represents the target for regression. Every unit in analyzers of the Regressor plugin must have exactly one target sensor.
Finally, the Regressor plugin supports the following additional REST API action:
| Action | Explanation |
|:----- |:----------- |
| train | Triggers a new training phase for the random forest model. Feature vectors are temporarily collected in-memory until _trainingSamples_ vectors are obtained. Until this moment, the old random forest model is still used to perform prediction.
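The buffering behaviour described for the _train_ action can be pictured with the sketch below. It is only an illustration under stated assumptions: `TrainingBuffer` and its methods are hypothetical names, not DCDB classes, and the sketch ignores the extra _targetDistance_ samples that the analyzer code also waits for.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch (not DCDB code): collects feature vectors in memory
// until enough are available to train a new model.
class TrainingBuffer {
public:
    explicit TrainingBuffer(std::size_t trainingSamples) : _needed(trainingSamples) {}

    // Starts a new collection phase, as the "train" action would.
    void requestTraining() {
        _pending = true;
        _samples.clear();
    }

    // Stores one feature vector while training is pending.
    // Returns true once the buffer holds enough vectors to train.
    bool add(const std::vector<double>& fVector) {
        if (!_pending) return false;
        _samples.push_back(fVector);
        if (_samples.size() < _needed) return false;
        _pending = false;  // the caller would now train and swap in the new model
        return true;
    }

    bool pending() const { return _pending; }
    std::size_t size() const { return _samples.size(); }

private:
    std::size_t _needed;
    bool _pending = false;
    std::vector<std::vector<double>> _samples;
};
```

While `add` keeps returning false, a caller following this pattern would continue to predict with the previous model, which is the behaviour the table row describes.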
## Writing DCDB Analytics Plugins <a name="writingPlugins"></a>
Generating a DCDBAnalytics plugin requires implementing an _Analyzer_ and a _Configurator_ class, which contain all logic
tied to the specific plugin. Such classes should be derived from _AnalyzerTemplate_ and _AnalyzerConfiguratorTemplate_
respectively, which contain all plugin-agnostic configuration and runtime features. Please refer to the documentation
......@@ -78,7 +78,7 @@ void RegressorAnalyzer::init(boost::asio::io_service& io) {
if(_modelIn!="") {
try {
_rForest = cv::ml::RTrees::load(_modelIn);
if(!_rForest->isTrained() || _units.empty() || _units[0]->getInputs().size()*REG_NUMFEATURES!=_rForest->getVarCount())
if(!_rForest->isTrained() || _units.empty() || _units[0]->getInputs().size()*REG_NUMFEATURES!=(uint64_t)_rForest->getVarCount())
LOG(error) << "Analyzer " + _name + ": incompatible model, falling back to default!";
else {
_trainingPending = false;
......@@ -110,7 +110,7 @@ void RegressorAnalyzer::compute(U_Ptr unit) {
_responseSet = new cv::Mat();
if (_trainingSet->size().height >= _trainingSamples + _targetDistance)
if ((uint64_t)_trainingSet->size().height >= _trainingSamples + _targetDistance)
if(_rForest.empty() || !(_rForest->isTrained() || (_trainingPending && _streaming)))
......@@ -118,8 +118,6 @@ void RegressorAnalyzer::compute(U_Ptr unit) {
if(_rForest->isTrained()) {
reading_t predict;
predict.value = (int64_t) _rForest->predict(*_currentfVector);
//LOG(error) << "FVector" << *_currentfVector;
//LOG(error) << "Prediction: " << _rForest->predict(*_currentfVector);
predict.timestamp = getTimestamp();
......@@ -128,15 +126,11 @@ void RegressorAnalyzer::compute(U_Ptr unit) {
void RegressorAnalyzer::trainRandomForest() {
if(!_trainingSet || _rForest.empty())
throw std::runtime_error("Analyzer " + _name + ": cannot perform training, missing model!");
if(_responseSet->size().height <= _targetDistance)
if((uint64_t)_responseSet->size().height <= _targetDistance)
throw std::runtime_error("Analyzer " + _name + ": cannot perform training, insufficient data!");
// Shifting the training and response sets so as to obtain the desired prediction distance
*_responseSet = _responseSet->rowRange(_targetDistance, _responseSet->size().height-1);
*_trainingSet = _trainingSet->rowRange(0, _trainingSet->size().height-1-_targetDistance);
//LOG(error) << "Training set:";
//LOG(error) << *_trainingSet;
//LOG(error) << "Response set:";
//LOG(error) << *_responseSet;
if(!_rForest->train(*_trainingSet, cv::ml::ROW_SAMPLE, *_responseSet))
throw std::runtime_error("Analyzer " + _name + ": model training failed!");
delete _trainingSet;
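
The row shifting in `trainRandomForest()` can be reasoned about with plain indices. The sketch below is a hypothetical stand-in (the function `alignWithLag` is not part of the code base) that lists which feature row ends up paired with which response row. It relies on `cv::Mat::rowRange(start, end)` excluding the `end` row.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the two rowRange() calls: returns the
// (feature row, response row) index pairs that remain after shifting
// a set of `rows` rows by `lag` positions.
std::vector<std::pair<std::size_t, std::size_t>> alignWithLag(std::size_t rows, std::size_t lag) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    if (rows <= lag) return pairs;  // mirrors the "insufficient data" check
    // Responses keep rows lag .. rows-2, features keep rows 0 .. rows-2-lag,
    // because the end index passed to rowRange() is exclusive.
    for (std::size_t i = 0; i + lag + 1 < rows; ++i)
        pairs.emplace_back(i, i + lag);
    return pairs;
}
```

For 6 rows and a lag of 2, this yields the pairs (0,2), (1,3) and (2,4). Both sets keep the same number of rows, and with `height-1` as the exclusive end index the final row is not used by either range.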