This article describes in detail how the yuuvis® RAD metrics-manager works, on what technologies it is based, and what it is used for.
In the past, users often asked, "How can I tell whether yuuvis® RAD is running properly?" And even among systems that are running, some perform better than others. Apart from the obvious differences in hardware sizing, we had a hard time measuring the quality of a system's performance. Lacking informative metrics and criteria, getting a grip on bottlenecks meant thoroughly analyzing the log files and laboriously tracing the path of data and messages through the information gained.
To make this process easier and more convenient, we envisioned a framework or platform that does most of the work of condensing and aggregating the log files for us and presenting the results in a visually comprehensible way. This is what the yuuvis® RAD metrics-manager does.
To be able to do this, the yuuvis® RAD components needed to be adapted in the following way:
Since all our components communicate exclusively via REST web services, we were able to implement a call filter that fulfills the above requirements; thanks to the underlying frameworks, a lot of useful information was already available. A new log, called the metrics log, was introduced in the components; when activated, each component writes this information to a metrics log file.
This information includes the following fields: start and end time, duration, service (name), port, endpoint, URL, parameters, HTTP method, HTTP response code, request and response headers, user (authorization) information, origin (IP) address, CPU load, RAM utilization, and disk I/O.
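The exact on-disk format of the metrics log is not shown here; assuming a JSON-lines layout with the field names listed above (a hypothetical example, not the actual format), reading one record might look like this:

```python
import json

# Hypothetical metrics log entry. The field names follow the list above,
# but the real on-disk format of yuuvis RAD's metrics log may differ.
line = (
    '{"start": "2023-05-04T10:15:00.000Z", "end": "2023-05-04T10:15:00.120Z",'
    ' "duration": 120, "service": "dms-service", "port": 8080,'
    ' "endpoint": "/dms/objects", "method": "GET", "status": 200,'
    ' "user": "jdoe", "origin": "10.0.0.42"}'
)

record = json.loads(line)

# Basic sanity checks before shipping the record onward.
assert record["duration"] >= 0
assert 100 <= record["status"] < 600

print(record["service"], record["endpoint"], record["duration"])
```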
Of course, when a system is heavily used, these metrics log files quickly grow to thousands or even millions of lines. A manual analysis is no longer feasible - or covers only a tiny fraction of the actual data, which is not representative and might not even contain the data that leads you to the cause of your problems. Hence, the data must be processed and aggregated before you can work with it effectively.
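To illustrate what "aggregating" means here, a minimal Python sketch that condenses raw call records into per-endpoint statistics (the records are made up; in the real system this work is done by Elasticsearch):

```python
from collections import defaultdict

# Hypothetical pre-parsed metrics records (durations in milliseconds).
records = [
    {"endpoint": "/dms/objects", "duration": 120},
    {"endpoint": "/dms/objects", "duration": 80},
    {"endpoint": "/dms/search", "duration": 300},
]

# Aggregate: request count and total duration per endpoint.
stats = defaultdict(lambda: {"count": 0, "total": 0})
for r in records:
    s = stats[r["endpoint"]]
    s["count"] += 1
    s["total"] += r["duration"]

# Report count and average duration per endpoint.
for endpoint, s in sorted(stats.items()):
    print(endpoint, s["count"], s["total"] / s["count"])
```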
We decided to use the Elasticsearch database and its stack for this purpose, as it offers powerful aggregation functions and handles queries very quickly, even on millions of records. In detail, the tools that yuuvis® RAD metrics-manager comprises are:
So, this is how yuuvis® RAD metrics-manager works:
We use filebeat to read the metrics log files from the dms-service and service-manager and create a document in Elasticsearch for each logged REST call by sending it to logstash. We also use metricbeat to collect system metrics such as CPU load, RAM utilization, and disk I/O. We let metricbeat write the collected data into a file so that filebeat can read it and send it to Elasticsearch via logstash as well. Logstash takes all the data and puts it into an index called logstash-<datestamp>, so a new index is created every day containing all the logged calls of that day. As mentioned above, on a heavily used system the amount of data can quickly become very large. To keep the hard drive from filling up, we use Elasticsearch's index lifecycle management (ILM) to roll over and delete indices after a predefined period. By default, the indices are rolled over after 1 day and deleted after 45 days. We find this a sufficiently long period for looking back on what happened in the system, but, of course, it can be configured to suit your needs: the longer you want to be able to look back, the more indices and thus space you'll need, and vice versa.
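The rollover and deletion itself is handled by Elasticsearch's ILM; purely to illustrate the daily naming and the 45-day retention window, a Python sketch (assuming logstash's usual logstash-YYYY.MM.dd date pattern; the index names and dates are invented):

```python
from datetime import date, timedelta

RETENTION_DAYS = 45  # the default retention mentioned above; configurable

def expired_indices(index_names, today, retention_days=RETENTION_DAYS):
    """Return the daily logstash-YYYY.MM.dd indices older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        stamp = name.removeprefix("logstash-")
        year, month, day = (int(part) for part in stamp.split("."))
        if date(year, month, day) < cutoff:
            expired.append(name)
    return expired

# With today = 2023-03-10, the cutoff is 2023-01-24, so only the
# January index has aged out of the 45-day window.
print(expired_indices(
    ["logstash-2023.01.01", "logstash-2023.03.01"],
    today=date(2023, 3, 10),
))
```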
Finally, we use kibana to visualize the data in graphs, diagrams, and timelines, taking full advantage of the aggregation and condensing abilities that Elasticsearch and kibana offer. Here are some examples of what you can find out with the available data:
In combination with elastalert, you can send notifications to admins or managers to inform them about conditions such as reached maximums, (too many) errors, or degrading response times.
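elastalert itself is configured with YAML rules evaluated against Elasticsearch; the kind of condition such a rule expresses can be sketched in Python like this (the 5% threshold is an arbitrary example, not a shipped default):

```python
def should_alert(status_codes, max_error_rate=0.05):
    """True if the share of HTTP 5xx responses in a window exceeds max_error_rate."""
    if not status_codes:
        return False
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes) > max_error_rate

# Hypothetical window of response codes: 2 errors out of 10 calls (20% > 5%).
window = [200, 200, 500, 200, 503, 200, 200, 200, 200, 200]
print(should_alert(window))
```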
The yuuvis® RAD metrics-manager is an optional extension to the yuuvis® RAD system. As such, it is not installed by default. To run it, you might have to extend your hardware resources to support the extra load.
The installation - as described in the installation guide - is basically divided into two parts. The first is activating the metrics log files and letting filebeat (plus metricbeat) send the data to logstash. The second is installing elasticsearch, logstash, and kibana on a machine to store and display the data received from filebeat and metricbeat. While the first part "only" adds the load of writing (lots of) lines to a file, the second part adds an entire Elasticsearch database with potentially millions of records plus the kibana backend. The machine hosting this part should have at least 8 GB of free RAM, the equivalent of about 2 free CPUs, and enough free hard drive space for the new data. Depending on the load of the system, this can range from a couple of gigabytes to 20-30 GB per day. If possible, a dedicated machine with 4 CPUs, 16 GB RAM, and about 300 GB of hard drive space would surely be the best choice.
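The disk estimate follows directly from retention times daily volume; a quick back-of-the-envelope calculation (the 20% headroom factor is our assumption, not part of the sizing above):

```python
def disk_needed_gb(gb_per_day, retention_days=45, headroom=1.2):
    """Rough disk estimate for the metrics indices: daily volume x retained days, plus headroom."""
    return gb_per_day * retention_days * headroom

# A mid-load system writing 5 GB of metrics per day with 45-day retention
# lands roughly in line with the ~300 GB suggested above.
print(round(disk_needed_gb(5)))
```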
*You can find more information about Elasticsearch and the Elastic Stack at https://www.elastic.co and its subpages. Information about elastalert is available at https://elastalert.readthedocs.io/en/latest/