metrics-manager Whitepaper

This article describes in detail how yuuvis® RAD metrics-manager works, which technologies it is based on, and what it is used for.

Introduction

In the past, customers often asked: "How can I tell if yuuvis® RAD is running properly?" And even when a system is running, some systems run better than others. Apart from the obvious differences in hardware sizing, we had a hard time measuring the quality of a system's performance. Lacking informative metrics and criteria, we had to analyze the log files thoroughly and laboriously trace the path of the data and messages from the information gained in order to get a grip on the bottlenecks.
To make this process easier and more convenient, we envisioned a framework or platform that does most of the work of condensing and aggregating the log files for us and presents the results in a visually comprehensible way. This is what the yuuvis® RAD metrics manager does.

Basics

To be able to do this, the yuuvis® RAD components needed to be adapted in the following way:

  • every single request and answer that was received or sent needed to be logged into a file
  • the information about these requests or answers needed to be comprehensive, i.e., it should contain as much information as possible - without just dumping the entire request/answer to the file.
  • the format of the log file needed to conform to a standardized format 

Since all our components communicate exclusively via REST web services, we were able to implement a call filter that complied with the above requirements, and thanks to the underlying frameworks, a lot of useful information was already available. A new log, called the metrics log, was introduced in the components; when activated, it writes this information to a dedicated metrics log file.

This information includes the following fields: start and end time, duration, service(name), port, endpoint, URL, parameters, http method, http response code, request and response header, user (authorization) information, origin (IP) address, CPU load, RAM utilization, disk I/O.
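
To make this more concrete, the sketch below shows roughly what a single metrics log record might look like when written as one JSON document per line. Note that the field names, the endpoint, and the file name are illustrative assumptions; the actual format is defined by the yuuvis® RAD components.

    import json

    # Illustrative only: field names, values, and the file name are assumptions,
    # not the actual yuuvis(R) RAD metrics log format.
    record = {
        "startTime": "2024-05-04T09:15:02.123Z",
        "endTime": "2024-05-04T09:15:02.310Z",
        "durationMs": 187,
        "service": "dms-service",
        "port": 8080,
        "endpoint": "/dms/objects/search",  # hypothetical endpoint
        "httpMethod": "POST",
        "httpStatusCode": 200,
        "user": "jdoe",
        "originIp": "10.0.0.42",
        "cpuLoad": 0.37,
        "ramUsedMb": 6144,
    }

    # One JSON document per line keeps the file easy to ship with filebeat.
    with open("metrics.log", "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record) + "\n")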

Of course, when a system is heavily used, these metric log files will quickly contain thousands or even millions of lines. A manual analysis is not feasible anymore - or only covers a very small fraction of the actual data. This is not representative and might not even contain the data that leads you to the cause of your problems. Hence, the data must be processed and aggregated to be able to effectively work with it.

The metrics-manager and the elastic(search) stack*

We decided to use the Elasticsearch database and its stack for this purpose because it offers powerful aggregation functions and handles queries very quickly, even on millions of records. In detail, yuuvis® RAD metrics-manager comprises the following tools:

  • elasticsearch

    Elasticsearch is a distributed, JSON-based search and analytics engine.
  • logstash

    Logstash is a server-side data processing pipeline that ingests data from many sources, such as TCP or one of the elastic beats, transforms it, and then sends it to Elasticsearch. All metrics-manager tools use logstash to send data to Elasticsearch.
  • filebeat

    Filebeat is a small and simple tool that reads log files and sends the data line by line to Elasticsearch using logstash.
  • metricbeat

    Metricbeat is another tool of the beats family that reads system metrics like CPU load or disk I/O and sends the data to Elasticsearch using logstash.
  • kibana

    Kibana is a frontend application that lets you visualize the data in Elasticsearch indices by running aggregations or similar queries and plotting the results in diagrams, graphs, timelines, etc. You can restrict the visualization to specific time ranges or view all of the data at once.
  • elastalert2

    Elastalert2 is a third-party tool that can be used to alert users over various channels on anomalies, spikes, or other patterns of interest from data in Elasticsearch. This can be done by creating definition files that specify the conditions that need to be met for an alert to trigger.
  • Network Share Monitor

    The Network Share Monitor monitors SMB (Samba) and CIFS shares and reports the drive usage information (free / used percentage / bytes) to a file in the metricbeat syntax, so it can be processed by filebeat and merged with the metricbeat data.
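
To illustrate the principle behind the Network Share Monitor, the following Python sketch reads the usage of a share and appends it as a metricbeat-style JSON event that filebeat could pick up. The share path, file name, and field layout are assumptions made for illustration; the real tool ships with its own configuration and output format.

    import json
    import shutil
    from datetime import datetime, timezone

    # Hypothetical SMB/CIFS share path - adjust to a real mount or UNC path.
    share_path = r"\\fileserver\archive"

    # shutil.disk_usage returns total, used, and free space in bytes.
    total, used, free = shutil.disk_usage(share_path)

    # Metricbeat-like event structure; the exact field layout written by the
    # Network Share Monitor may differ.
    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "system": {
            "filesystem": {
                "mount_point": share_path,
                "total": total,
                "free": free,
                "used": {"bytes": used, "pct": round(used / total, 4)},
            }
        },
    }

    # Append as one JSON line so filebeat can process it like any other log.
    with open("networkshare-metrics.json", "a", encoding="utf-8") as out:
        out.write(json.dumps(event) + "\n")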

So, this is how yuuvis® RAD metrics manager works:

We use filebeat to read the metrics log files from the dms-service and the service-manager and create a document in Elasticsearch for each logged REST call by sending it to logstash. We also use metricbeat to collect system metrics like CPU load, RAM utilization, and disk I/O. We let metricbeat write the collected data into a file so that filebeat can read it and send it to Elasticsearch via logstash as well. Logstash takes all the data and puts it into an index called logstash-<datestamp>, so a new index is created every day containing all the logged calls of that day.

As said before, on a heavily used system the amount of data can quickly become very large. To keep the hard drive from filling up, we make use of the Elasticsearch index lifecycle management (ILM) to roll over and delete indices after a predefined period. By default, the indices are rolled over after 1 day and deleted after 45 days. We find this to be a sufficiently long period to look back on what has been happening in the system. But, of course, this can be configured to suit your needs: the longer you want to be able to look back, the more indices and thus space you will need, and vice versa.
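
As an illustration of this retention mechanism, the following snippet creates an ILM policy with these defaults (rollover after 1 day, deletion after 45 days) directly via the Elasticsearch REST API. The policy name and the Elasticsearch URL are placeholders only; in a real installation, ILM is configured as part of the metrics-manager setup.

    import requests  # third-party HTTP client

    # Placeholder policy name and URL - illustrative only.
    policy = {
        "policy": {
            "phases": {
                "hot": {"actions": {"rollover": {"max_age": "1d"}}},
                "delete": {"min_age": "45d", "actions": {"delete": {}}},
            }
        }
    }

    resp = requests.put(
        "http://localhost:9200/_ilm/policy/metrics-retention",
        json=policy,
    )
    resp.raise_for_status()
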
Finally, we use kibana to visualize the data in graphs, diagrams, and timelines. Here, we take full advantage of the aggregation and condensing abilities that Elasticsearch and kibana offer us. Here are some examples of what you can find out with the available data (a sample query is sketched below the list):

  • Aggregate over all http response codes and find out whether and how many 4xx or 5xx errors occurred.
  • Aggregate over all calls and endpoints and find out what endpoints are used the most / exactly how many times. In addition, you can see the percentage of successful vs. erroneous calls.
  • Aggregate over all calls and find out how many calls each user made. You can further refine to see the number of calls per user per endpoint, or who caused the most 4xx/5xx calls and maybe identify users who need more training or are trying to abuse the system.
  • Aggregate over the call duration per endpoint and identify endpoints that need performance tuning.
  • Aggregate over the system metrics and identify systems that are running low on resources or over their targeted utilization level. You can also set this data in relation to the long running endpoints to get more insight on what causes the bottleneck.
  • You can evaluate the messaging service queue lengths and see if messages pile up or how long it takes to process messages created in a batch import or similar.
  • Aggregate over all calls and evaluate main usage times, average response times (e.g., for search requests), or relate the calls to users and their companies for billing/licensing information.
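
For example, the first point (counting 4xx or 5xx responses) boils down to a simple terms aggregation over the response code field. The sketch below sends such a query to the daily logstash indices via the Elasticsearch search API; the field names ("statuscode", "@timestamp") are assumptions and have to be adjusted to the actual mapping of your metrics documents.

    import requests  # third-party HTTP client

    # Count documents per http response code over the last 24 hours.
    # Field names are assumptions - adjust them to the actual mapping.
    query = {
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": "now-24h"}}},
        "aggs": {"by_status": {"terms": {"field": "statuscode"}}},
    }

    resp = requests.post("http://localhost:9200/logstash-*/_search", json=query)
    resp.raise_for_status()

    for bucket in resp.json()["aggregations"]["by_status"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])

Kibana builds this kind of aggregation interactively, so the snippet is only meant to show what happens behind the visualizations.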

In combination with elastalert, you can send notifications to admins or managers to inform them about conditions such as reached limits, (too many) errors, or degrading response times.

Integrating the metrics manager into the yuuvis® RAD environment

The yuuvis® RAD metrics manager is an optional extension to the yuuvis® RAD system. As such, it is not installed by default. To run it, you might have to extend your hardware resources to support the extra load. 
The installation - as described in the installation guide - is basically divided into two parts. The first is the activation of the metrics log files and letting filebeat (plus metricbeat) send the data to logstash. The second is installing elasticsearch, logstash, and kibana on a machine to store and display the data received from filebeat and metricbeat. While the first part "only" adds the load of writing (lots of) lines to a file, the second part adds an entire Elasticsearch database with potentially millions of records plus the kibana backend. The machine hosting this part should have at least 12 GB of free RAM, the equivalent of about 4 free CPUs, and enough free hard drive space for the new data. Depending on the load of the system, this can range from a couple of GB to 20-30 GB per day. If possible, a dedicated machine with 4 CPUs, 16 GB of RAM, and about 300 GB of hard drive space is surely the best choice.



*You can find more information about Elasticsearch and the elastic stack at https://www.elastic.co and its related pages. Information about elastalert can be found at https://elastalert.readthedocs.io/en/latest/