Collectagent scalability issues
Collectagent stops working properly once a relatively high incoming message rate is reached.
On a MacBook Pro 2016 (2.9 GHz Intel i5), the issue can be consistently reproduced with the following series of actions:
- Launch dcdbpusher with full ProcFS instantiation and a 1000 ms sampling period (circa 2600 sensors/sec)
- Launch a second dcdbpusher with the same configuration as No.1, but with a 100 ms sampling period (circa 26000 sensors/sec)
- Kill the second dcdbpusher after some time
Collectagent works fine after step 1: all sensors are pushed correctly to Cassandra at a message rate of circa 900/s, matching the number of metrics. After step 2, the message rate decreases only slightly, meaning collectagent cannot keep up with the number of incoming messages, and data is added to Cassandra at an extremely low rate. After step 3, the message rate sharply increases again, and within a few seconds collectagent flushes all pending messages from dcdbpusher No.1, whose recent data is once again reflected in Cassandra.
This problem is very likely scalability-related. The average processing time for a message carrying 10 sensor readings in collectagent is between 2 and 3 ms, which is too high to sustain the required rate. The issue worsens dramatically when MQTT QoS features are enabled (level 1 or 2): messages constantly time out, are resent by dcdbpusher, and pile up indefinitely.
It should also be noted that all dcdbpusher connections are handled sequentially by a single thread until the maximum number of connection slots (16) is reached. Setting the maximum number of connection slots to 1 does not trigger the issue in the scenario above.
Hypothetical solutions:
- Switch to asynchronous Cassandra insert operations (likely the main culprit)
- Aggregate multiple Cassandra insert operations in batches
- Avoid heap allocations when processing messages
- Pre-allocate cache vectors to avoid memory reallocations