Commit e139e2ce authored by Alessio Netti

Analytics: improving model management in Regressor and Classifier plugins

- Specifying a target sensor is no longer required when loading an
external model (e.g., one trained via the Python OpenCV front-end)
- The structure of feature vectors now faithfully reflects the
specified configuration and its ordering, and is easily reproducible
parent 08880247
@@ -845,7 +845,7 @@ Additionally, output sensors in operators of the _Smoothing_ plugin accept the f
## Regressor Plugin <a name="regressorPlugin"></a>
The _Regressor_ plugin performs regression on sensor data using _random forest_ machine learning predictors. The algorithmic backend is provided by the OpenCV library.
- For each input sensor in a certain unit, statistical features from recent values of its time series are computed. These are the average, standard deviation, sum of differences, 25th quantile and 75th quantile. These statistical features are then combined together in a single _feature vector_.
+ For each input sensor in a certain unit, statistical features from recent values of its time series are computed. These are, in this specific order, the average, standard deviation, sum of differences, 25th quantile, 75th quantile and latest value. These statistical features are then combined into a single _feature vector_.
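The per-sensor feature computation can be sketched as follows. This is a minimal Python illustration of the documented feature order, not the plugin's actual C++ implementation; the integer quantile indexing mirrors the scheme visible in `RegressorOperator::computeFeatureVector`:

```python
import statistics

def feature_vector(window):
    """Compute the six per-sensor features in the documented order:
    mean, standard deviation, sum of differences, 25th quantile,
    75th quantile and latest value. Illustrative sketch only."""
    mean = sum(window) / len(window)
    std = statistics.pstdev(window)
    # Sum of consecutive differences (telescopes to last - first)
    diffsum = sum(b - a for a, b in zip(window, window[1:]))
    s = sorted(window)

    def quantile(p):
        # Integer index scheme: average two neighbors when the
        # (n-1)*p product does not fall exactly on an element
        qid, qmod = divmod((len(s) - 1) * p, 100)
        if qmod == 0 or qid == len(s) - 1:
            return s[qid]
        return (s[qid] + s[qid + 1]) / 2

    return [mean, std, diffsum, quantile(25), quantile(75), window[-1]]

fv = feature_vector([10, 12, 11, 13, 14])
```

The resulting list is the per-sensor block that gets concatenated, sensor by sensor, into the final feature vector.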
In order to operate correctly, the models used by the _Regressor_ plugin need to be trained: this procedure is performed automatically online in _streaming_ mode, and can be triggered at any time over the REST API. In _on demand_ mode, automatic training cannot be performed, and as such a pre-trained model must be loaded from a file.
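The compatibility rule applied when loading a pre-trained model can be expressed as a small sketch. This is a hypothetical Python helper, not part of the plugin: `var_count` stands for the value OpenCV reports via `RTrees::getVarCount`, and the target-exclusion logic mirrors the `_numInputs` computation in `RegressorOperator::execOnInit`:

```python
def model_is_compatible(var_count, num_sensors, num_features=6,
                        target_in_config=False, include_target=False):
    """Check that a loaded model exposes exactly one input variable
    per feature slot of the configured feature vector (sketch)."""
    # The target sensor, if configured, is excluded from the
    # feature vector unless it is explicitly included
    num_inputs = num_sensors if (not target_in_config or include_target) \
        else num_sensors - 1
    return var_count == num_inputs * num_features

# Example: 3 sensors with 6 features each and no target sensor
# means the model must expose 18 input variables
compatible = model_is_compatible(18, 3)
```

If the counts do not match, the plugin logs an "incompatible model" error and falls back to its default, untrained model.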
The following are the configuration parameters available for the _Regressor_ plugin:
@@ -886,6 +886,21 @@ The _Classifier_ plugin, as the name implies, performs machine learning classifi
Regressor plugin, the target sensor is always excluded from the feature vectors.
* The _targetDistance_ parameter has a default value of 0. It can be set to higher values to perform predictive classification.
+ ### Managing Models with the Classifier and Regressor Plugins <a name="classifierPluginModels"></a>
+ The _Classifier_ and _Regressor_ Wintermute plugins support saving models generated via online training, as well as loading pre-trained models from files.
+ The models to be loaded do not necessarily need to come from the Wintermute plugins themselves: the OpenCV pipeline is fully supported, and users can load _RTrees_ models
+ that have been generated and trained offline using the Python front-end. However, users must make sure that the loaded models are _compatible_ with the specification in the corresponding Wintermute configuration.
+ The following aspects should be considered:
+ * The trained model must have the same number of inputs as the input specification in the Wintermute configuration file. Hence, the model and the configuration must use the same number of sensors and the same number of features (see the _rawMode_ parameter).
+ * Currently, only models with a single output are supported.
+ * Both the scale and the numerical format of the inputs and outputs should be taken into consideration: DCDB and Wintermute only support integers, and models should be trained accordingly.
+ * Wintermute respects the order of sensors specified in the configuration files to build feature vectors. These are divided into consecutive blocks, each containing all features for a given sensor (again, see the _rawMode_ parameter). In the case of classifier models, the _target_ sensor is never included in the feature vectors and is always skipped.
+ * If a custom data processing step with features different from those available in the Wintermute plugins is required, it is advisable to set the _rawMode_ parameter to true and perform the necessary processing in a separate plugin, constructing a pipeline.
+ * When loading a pre-trained model, online training is not required and hence there is no need to specify a _target_ sensor.
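The block layout described above can be illustrated with a short sketch. This is a hypothetical Python helper; the six-feature order matches the documented full mode, while the single aggregated feature per sensor in raw mode is an assumption for illustration:

```python
def feature_slots(sensors, target=None, raw_mode=False, classifier=False):
    """Enumerate feature-vector slots in configuration order:
    one consecutive block per sensor, target skipped for classifiers."""
    # Per-sensor feature block; raw mode is assumed here to keep a
    # single aggregated feature per sensor (see the rawMode parameter)
    features = ["mean"] if raw_mode else [
        "mean", "std", "diffsum", "qtl25", "qtl75", "latest"]
    slots = []
    for s in sensors:  # configuration order is preserved
        if classifier and s == target:
            continue  # the target never enters classifier feature vectors
        slots.extend(f"{s}.{f}" for f in features)
    return slots

# 3 configured sensors, classifier model: the target "s2" is skipped
slots = feature_slots(["s1", "s2", "s3"], target="s2", classifier=True)
```

A model trained offline must therefore expect its inputs in exactly this slot order to be usable with the corresponding Wintermute configuration.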
## Clustering Plugin <a name="clusteringPlugin"></a>
The _Clustering_ plugin implements a Gaussian mixture model for performance variation analysis and outlier detection. The plugin is based on the OpenCV library, similarly to the _Regressor_ plugin.
......
@@ -80,7 +80,10 @@ bool ClassifierConfigurator::unit(UnitTemplate<RegressorSensorBase>& u) {
}
}
if(!targetSet) {
- LOG(error) << " " << _operatorName << ": No classification target was specified!";
+ LOG(warning) << " " << _operatorName << ": No classification target was specified. Online model training will be unavailable.";
}
if(u.getInputs().empty() || (targetSet && u.getInputs().size() < 2)) {
LOG(error) << " " << _operatorName << ": Insufficient amount of input sensors!";
return false;
}
if(u.getOutputs().size()!=1) {
......
@@ -58,10 +58,12 @@ void ClassifierOperator::compute(U_Ptr unit) {
_trainingSet = new cv::Mat();
if (!_responseSet)
_responseSet = new cv::Mat();
- _trainingSet->push_back(*_currentfVector);
- // Using an int instead of a float for the responses makes OpenCV interpret the variable as categorical
- _currentClass = (int)_currentTarget;
- _responseSet->push_back(_currentClass);
+ if(_targetFound) {
+ _trainingSet->push_back(*_currentfVector);
+ // Using an int instead of a float for the responses makes OpenCV interpret the variable as categorical
+ _currentClass = (int) _currentTarget;
+ _responseSet->push_back(_currentClass);
+ }
if ((uint64_t)_trainingSet->size().height >= _trainingSamples + _targetDistance)
trainRandomForest(true);
}
......
@@ -82,7 +82,10 @@ bool RegressorConfigurator::unit(UnitTemplate<RegressorSensorBase>& u) {
}
}
if(!targetSet) {
- LOG(error) << " " << _operatorName << ": No regression target was specified!";
+ LOG(warning) << " " << _operatorName << ": No regression target was specified. Online model training will be unavailable.";
}
if(u.getInputs().empty()) {
LOG(error) << " " << _operatorName << ": Insufficient amount of input sensors!";
return false;
}
if(u.getOutputs().size()!=1) {
......
@@ -34,6 +34,7 @@ RegressorOperator::RegressorOperator(const std::string& name) : OperatorTemplate
_targetDistance = 1;
_smoothResponses = false;
_numFeatures = REG_NUMFEATURES;
+ _numInputs = 0;
_trainingSamples = 256;
_trainingPending = true;
_importances = false;
@@ -50,6 +51,7 @@ RegressorOperator::RegressorOperator(const RegressorOperator& other) : OperatorT
_targetDistance = other._targetDistance;
_smoothResponses = other._smoothResponses;
_numFeatures = other._numFeatures;
+ _numInputs = 0;
_trainingSamples = other._trainingSamples;
_importances = other._importances;
_includeTarget = true;
@@ -85,11 +87,17 @@ restResponse_t RegressorOperator::REST(const string& action, const unordered_map
}
void RegressorOperator::execOnInit() {
+ // Is a target set at all?
+ bool targetSet = false;
+ for(const auto& s: _units[0]->getInputs()) {
+ targetSet = targetSet || s->getTrainingTarget();
+ }
+ _numInputs = (!targetSet || _includeTarget) ? _units[0]->getInputs().size() : (_units[0]->getInputs().size() - 1);
bool useDefault=true;
if(_modelIn!="") {
try {
_rForest = cv::ml::RTrees::load(_modelIn);
- if(!_rForest->isTrained() || _units.empty() || _units[0]->getInputs().size()*_numFeatures!=(uint64_t)_rForest->getVarCount())
+ if(!_rForest->isTrained() || _units.empty() || _numInputs*_numFeatures!=(uint64_t)_rForest->getVarCount())
LOG(error) << "Operator " + _name + ": incompatible model, falling back to default!";
else {
_trainingPending = false;
@@ -125,8 +133,10 @@ void RegressorOperator::compute(U_Ptr unit) {
_trainingSet = new cv::Mat();
if (!_responseSet)
_responseSet = new cv::Mat();
- _trainingSet->push_back(*_currentfVector);
- _responseSet->push_back(_currentTarget);
+ if(_targetFound) {
+ _trainingSet->push_back(*_currentfVector);
+ _responseSet->push_back(_currentTarget);
+ }
if ((uint64_t)_trainingSet->size().height >= _trainingSamples + _targetDistance)
trainRandomForest(false);
}
@@ -215,9 +225,10 @@ void RegressorOperator::shuffleTrainingSet() {
bool RegressorOperator::computeFeatureVector(U_Ptr unit) {
if(!_currentfVector)
- _currentfVector = new cv::Mat(1, unit->getInputs().size()*_numFeatures, CV_32F);
+ _currentfVector = new cv::Mat(1, _numInputs*_numFeatures, CV_32F);
+ _targetFound = false;
int64_t val;
- size_t qId, qMod, idx, fIdx;
+ size_t qId, qMod, idx=0, fIdx=0;
uint64_t endTs = getTimestamp();
uint64_t startTs = endTs - _aggregationWindow;
std::vector<RegressorSBPtr>& inputs = unit->getInputs();
@@ -229,8 +240,10 @@ bool RegressorOperator::computeFeatureVector(U_Ptr unit) {
return false;
}
_latest = _buffer.back().value;
- if (inputs[idx]->getTrainingTarget())
- _currentTarget = (float)_latest;
+ if (inputs[idx]->getTrainingTarget()) {
+ _currentTarget = (float) _latest;
+ _targetFound = true;
+ }
if(!inputs[idx]->getTrainingTarget() || _includeTarget) {
// Computing MEAN and SUM OF DIFFERENCES
@@ -240,7 +253,9 @@ bool RegressorOperator::computeFeatureVector(U_Ptr unit) {
_diffsum += v.value - val;
val = v.value;
}
+ // Casting and storing the statistical features
_mean /= _buffer.size();
+ _currentfVector->at<float>(fIdx++) = (float) _mean;
// Computing additional features only if we are not in "raw" mode
if(_numFeatures == REG_NUMFEATURES) {
@@ -262,27 +277,12 @@ bool RegressorOperator::computeFeatureVector(U_Ptr unit) {
qId = ((_buffer.size() - 1) * 75) / 100;
qMod = ((_buffer.size() - 1) * 75) % 100;
_qtl75 = (qMod == 0 || qId == _buffer.size() - 1) ? _buffer[qId].value : (_buffer[qId].value + _buffer[qId + 1].value) / 2;
- }
- fIdx = idx * _numFeatures;
- // Casting and storing the statistical features
- _currentfVector->at<float>(fIdx) = (float) _mean;
- if(_numFeatures == REG_NUMFEATURES) {
- _currentfVector->at<float>(fIdx + 1) = (float) _std;
- _currentfVector->at<float>(fIdx + 2) = (float) _diffsum;
- _currentfVector->at<float>(fIdx + 3) = (float) _qtl25;
- _currentfVector->at<float>(fIdx + 4) = (float) _qtl75;
- _currentfVector->at<float>(fIdx + 5) = (float) _latest;
- }
- } else {
- fIdx = idx * _numFeatures;
- _currentfVector->at<float>(fIdx) = 0.0f;
- if(_numFeatures == REG_NUMFEATURES) {
- _currentfVector->at<float>(fIdx + 1) = 0.0f;
- _currentfVector->at<float>(fIdx + 2) = 0.0f;
- _currentfVector->at<float>(fIdx + 3) = 0.0f;
- _currentfVector->at<float>(fIdx + 4) = 0.0f;
- _currentfVector->at<float>(fIdx + 5) = 0.0f;
+ _currentfVector->at<float>(fIdx++) = (float) _std;
+ _currentfVector->at<float>(fIdx++) = (float) _diffsum;
+ _currentfVector->at<float>(fIdx++) = (float) _qtl25;
+ _currentfVector->at<float>(fIdx++) = (float) _qtl75;
+ _currentfVector->at<float>(fIdx++) = (float) _latest;
}
}
}
@@ -299,12 +299,24 @@ std::string RegressorOperator::getImportances() {
std::vector<ImportancePair> impLabels;
cv::Mat_<float> impValues;
_rForest->getVarImportance().convertTo(impValues, CV_32F);
- if(impValues.empty() || _units.empty() || impValues.total()!=_numFeatures*_units[0]->getInputs().size())
+ if(impValues.empty() || _units.empty() || impValues.total()!=_numFeatures*_numInputs)
return "Operator " + _name + ": error when computing feature importances.";
+ size_t sidx=0;
for(size_t idx=0; idx<impValues.total(); idx++) {
+ // Iterating over the vector of sensors in _numFeatures blocks
+ // For classifier models, the target sensor is not included in the feature vectors: in these cases,
+ // we simply skip that sensor in the original array
+ if(idx > 0 && idx%_numFeatures == 0) {
+ sidx += (_units[0]->getInputs().at(sidx+1)->getTrainingTarget() && !_includeTarget) ? 2 : 1;
+ }
+ if(sidx >= _units[0]->getInputs().size()) {
+ return "Operator " + _name + ": Mismatch between the model and input sizes.";
+ }
ImportancePair pair;
- pair.name = _units[0]->getInputs().at(idx/_numFeatures)->getName();
+ pair.name = _units[0]->getInputs().at(sidx)->getName();
pair.value = impValues.at<float>(idx);
switch(idx%_numFeatures) {
case 0:
......
@@ -92,6 +92,7 @@ protected:
unsigned long long _trainingSamples;
unsigned long long _targetDistance;
unsigned long long _numFeatures;
+ unsigned long long _numInputs;
bool _smoothResponses;
bool _trainingPending;
bool _importances;
@@ -103,6 +104,7 @@ protected:
cv::Mat *_responseSet;
cv::Mat *_currentfVector;
float _currentTarget;
+ bool _targetFound;
// Misc buffers
int64_t _mean;
......