Commit 9c2f4406 authored by Alessio Netti

Analytics: regressor plugin can now compute feature importances

- If getImportances is set to true, the regressor will print a sorted
list of features and their importances, after performing training
- A REST action to retrieve the same list has been added
parent 96ae7909
......@@ -581,15 +581,13 @@ Prefix `/analytics` left out!
> NOTE       Opt. = Optional
> NOTE       The value of analyzer output sensors can be retrieved with the _compute_ resource, or with the [plugin]/[sensor]/avg resource defined in the DCDBPusher REST API.
> NOTE 2       The value of analyzer output sensors can be retrieved with the _compute_ resource, or with the [plugin]/[sensor]/avg resource defined in the DCDBPusher REST API.
> NOTE       Developers can integrate their custom REST API resources that are plugin-specific, by implementing the _REST_ method in _AnalyzerTemplate_. To know more about plugin-specific resources, please refer to the respective documentation.
> NOTE 3       Developers can integrate their custom REST API resources that are plugin-specific, by implementing the _REST_ method in _AnalyzerTemplate_. To know more about plugin-specific resources, please refer to the respective documentation.
### Rest Examples <a name="restExamples"></a>
The following are some examples of REST requests over HTTPS:
> NOTE &ensp;&ensp;&ensp;&ensp;&ensp; The analytics RestAPI requires authentication credentials as well.
* Listing the units associated with the _avgAnalyzer1_ analyzer in the _average_ plugin:
```bash
GET https://localhost:8000/analytics/units?plugin=average;analyzer=avgAnalyzer1
......@@ -611,6 +609,8 @@ PUT https://localhost:8000/analytics/stop?plugin=average;analyzer=avgAnalyzer1
PUT https://localhost:8000/analytics/compute?plugin=average;analyzer=avgAnalyzer;unit=node00.cpu03
```
> NOTE &ensp;&ensp;&ensp;&ensp;&ensp; The analytics RestAPI requires authentication credentials as well.
# Plugins <a name="plugins"></a>
Here we describe available plugins in DCDBAnalytics, and how to configure them.
......@@ -651,9 +651,12 @@ The following are the configuration parameters available for the _Regressor_ plu
| targetDistance | Temporal distance (in terms of lags) of the sample that is to be predicted.
| inputPath | Path of a file from which a pre-trained random forest model must be loaded.
| outputPath | Path of a file to which the random forest model trained at runtime must be saved.
| getImportances | If true, the random forest also computes feature importance values during training and prints them to the log once training completes.
> NOTE &ensp;&ensp;&ensp;&ensp;&ensp; When the _duplicate_ option is enabled, the _outputPath_ field is ignored to avoid file collisions from multiple regressors.
> NOTE 2 &ensp;&ensp;&ensp;&ensp;&ensp; When loading the model from a file and getImportances is set to true, importance values will be printed only if the original model had this feature enabled upon training.
Additionally, input sensors in analyzers of the Regressor plugin accept the following parameter:
| Value | Explanation |
......@@ -665,6 +668,7 @@ Finally, the Regressor plugin supports the following additional REST API action:
| Action | Explanation |
|:----- |:----------- |
| train | Triggers a new training phase for the random forest model. Feature vectors are temporarily collected in-memory until _trainingSamples_ vectors are obtained. Until then, the old random forest model is still used for prediction.
| importances | Returns the sorted importance values for the input features, together with the respective labels, if available.
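
To illustrate how the sorted list returned by the _importances_ action is put together, the following is a minimal, standalone sketch of the labeling and sorting step. It is not the plugin code itself: it uses plain `std::vector` inputs instead of OpenCV matrices and unit objects, and the `rankImportances` function, the `kNumFeatures` constant (standing in for `REG_NUMFEATURES`, assumed here to be 5) and the sensor names in the usage note are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for REG_NUMFEATURES: number of features per input sensor.
constexpr std::size_t kNumFeatures = 5;

// Same shape as the helper struct used by the regressor.
struct ImportancePair {
    std::string name;
    float value;
};

// Labels each entry of a flat importance vector as "<sensor> - <feature>"
// and returns the entries sorted by descending importance.
// Assumes values are laid out sensor by sensor, kNumFeatures entries each.
std::vector<ImportancePair> rankImportances(const std::vector<std::string>& sensors,
                                            const std::vector<float>& values) {
    static const char* const suffixes[kNumFeatures] = {"mean", "std", "diffsum", "qtl25", "qtl75"};
    std::vector<ImportancePair> out;
    // Ignore trailing values that have no matching sensor name.
    const std::size_t n = std::min(values.size(), sensors.size() * kNumFeatures);
    for (std::size_t idx = 0; idx < n; idx++) {
        // Integer division selects the sensor, the remainder selects the feature.
        out.push_back({sensors[idx / kNumFeatures] + " - " + suffixes[idx % kNumFeatures], values[idx]});
    }
    std::sort(out.begin(), out.end(),
              [](const ImportancePair& lhs, const ImportancePair& rhs) { return lhs.value > rhs.value; });
    return out;
}
```

Calling `rankImportances({"power", "temp"}, values)` with ten importance values would yield ten labeled entries, highest importance first, which can then be concatenated line by line into the response string.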
## Tester Plugin <a name="testerPlugin"></a>
The _Tester_ plugin can be used to test the functionality and performance of the query engine, as well as of the unit system. It will perform a specified number of queries over the set of input sensors for each unit, and then output as a sensor the total number of retrieved readings. The following are the configuration parameters for analyzers in the _Tester_ plugin:
......
......@@ -34,6 +34,7 @@ RegressorAnalyzer::RegressorAnalyzer(const std::string& name) : AnalyzerTemplate
_targetDistance = 1;
_trainingSamples = 256;
_trainingPending = true;
_importances = false;
_trainingSet = nullptr;
_responseSet = nullptr;
_currentfVector = nullptr;
......@@ -46,6 +47,7 @@ RegressorAnalyzer::RegressorAnalyzer(const RegressorAnalyzer& other) : AnalyzerT
_aggregationWindow = other._aggregationWindow;
_targetDistance = other._targetDistance;
_trainingSamples = other._trainingSamples;
_importances = other._importances;
_trainingPending = true;
_trainingSet = nullptr;
_responseSet = nullptr;
......@@ -72,8 +74,10 @@ RegressorAnalyzer::~RegressorAnalyzer() {
restResponse_t RegressorAnalyzer::REST(const string& action, const unordered_map<string, string>& queries) {
restResponse_t resp;
if(action=="train") {
resp.response = "Re-training triggered for regressor " + this->_name + "!";
resp.response = "Re-training triggered for regressor " + this->_name + "!\n";
this->_trainingPending = true;
} else if(action=="importances") {
resp.response = getImportances();
} else
throw invalid_argument("Unknown plugin action " + action + " requested!");
return resp;
......@@ -93,8 +97,10 @@ void RegressorAnalyzer::init(boost::asio::io_service& io) {
} catch(const std::exception& e) {
LOG(error) << "Analyzer " + _name + ": cannot load model from file, falling back to default!"; }
}
if(useDefault)
if(useDefault) {
_rForest = cv::ml::RTrees::create();
_rForest->setCalculateVarImportance(_importances);
}
AnalyzerTemplate<RegressorSensorBase>::init(io);
}
......@@ -104,6 +110,7 @@ void RegressorAnalyzer::printConfig(LOG_LEVEL ll) {
LOG_VAR(ll) << " Training Sample: " << _trainingSamples;
LOG_VAR(ll) << " Input Path: " << (_modelIn!="" ? _modelIn : std::string("none"));
LOG_VAR(ll) << " Output Path: " << (_modelOut!="" ? _modelOut : std::string("none"));
LOG_VAR(ll) << " Importances: " << (_importances ? "enabled" : "disabled");
AnalyzerTemplate<RegressorSensorBase>::printConfig(ll);
}
......@@ -145,6 +152,7 @@ void RegressorAnalyzer::trainRandomForest() {
_responseSet = nullptr;
_trainingPending = false;
LOG(info) << "Analyzer " + _name + ": model training performed.";
LOG(info) << getImportances();
if(_modelOut!="") {
try {
_rForest->save(_modelOut);
......@@ -210,3 +218,45 @@ void RegressorAnalyzer::computeFeatureVector(U_Ptr unit) {
//for(idx=0; idx<_currentfVector->size().width;idx++)
// LOG(error) << _currentfVector->at<float>(idx);
}
std::string RegressorAnalyzer::getImportances() {
if(!_importances || _rForest.empty() || !_rForest->getCalculateVarImportance())
return "Analyzer " + _name + ": feature importances are not available.";
std::vector<ImportancePair> impLabels;
cv::Mat_<float> impValues;
_rForest->getVarImportance().convertTo(impValues, CV_32F);
if(impValues.empty() || _units.empty())
return "Analyzer " + _name + ": error when computing feature importances.";
for(size_t idx=0; idx<impValues.total(); idx++) {
ImportancePair pair;
pair.name = _units[0]->getInputs().at(idx/REG_NUMFEATURES)->getName();
pair.value = impValues.at<float>(idx);
switch(idx%REG_NUMFEATURES) {
case 0:
pair.name += " - mean";
break;
case 1:
pair.name += " - std";
break;
case 2:
pair.name += " - diffsum";
break;
case 3:
pair.name += " - qtl25";
break;
case 4:
pair.name += " - qtl75";
break;
default:
break;
}
impLabels.push_back(pair);
}
std::sort(impLabels.begin(), impLabels.end(), [ ](const ImportancePair& lhs, const ImportancePair& rhs) { return lhs.value > rhs.value; });
std::string outString="Analyzer " + _name + ": listing feature importances for unit " + _units[0]->getName() + ":\n";
for(const auto& imp : impLabels)
outString += " " + imp.name + " - " + std::to_string(imp.value) + "\n";
return outString;
}
\ No newline at end of file
......@@ -61,12 +61,14 @@ public:
void setAggregationWindow(unsigned long long a) { _aggregationWindow = a; }
void setTrainingSamples(unsigned long long t) { _trainingSamples = t; }
void setTargetDistance(unsigned long long d) { _targetDistance = d; }
void setComputeImportances(bool i) { _importances = i; }
void triggerTraining() { _trainingPending = true;}
std::string getInputPath() { return _modelIn;}
std::string getOutputPath() { return _modelOut; }
unsigned long long getAggregationWindow() { return _aggregationWindow; }
unsigned long long getTrainingSamples() { return _trainingSamples; }
bool getComputeImportances() { return _importances; }
void printConfig(LOG_LEVEL ll) override;
......@@ -75,6 +77,7 @@ protected:
virtual void compute(U_Ptr unit) override;
void computeFeatureVector(U_Ptr unit);
void trainRandomForest();
std::string getImportances();
std::string _modelOut;
std::string _modelIn;
......@@ -82,6 +85,7 @@ protected:
unsigned long long _trainingSamples;
unsigned long long _targetDistance;
bool _trainingPending;
bool _importances;
vector<reading_t> *_buffer;
cv::Ptr<cv::ml::RTrees> _rForest;
......@@ -96,7 +100,12 @@ protected:
int64_t _diffsum;
int64_t _qtl25;
int64_t _qtl75;
// Helper struct to store importance pairs
struct ImportancePair {
std::string name;
float value;
};
};
......
......@@ -57,6 +57,8 @@ void RegressorConfigurator::analyzer(RegressorAnalyzer& a, CFG_VAL config) {
a.setInputPath(val.second.data());
else if(boost::iequals(val.first, "outputPath"))
a.setOutputPath(val.second.data());
else if(boost::iequals(val.first, "getImportances"))
a.setComputeImportances(to_bool(val.second.data()));
}
}
......