Once you have successfully installed the metrics manager and opened Kibana in your browser, you can open the "metrics-dashboard" in the "Dashboards" section of Kibana. Here you will find a predefined set of visualisations of yuuvis® metrics as well as the status of the hardware resources of your servers. The following picture shows an exemplary view of the metrics-dashboard. Below, each visualisation highlighted in the yellow boxes is explained in detail.



(1) This is the Dashboards section. Click on it to see all available dashboards or, if a dashboard was already opened before, to restore the previously opened dashboard. In the latter case, click on the button again to get back to the overview of all dashboards.

(2) This is the time range selector. Click on the arrow-down symbol to choose your time range. It is very important to keep in mind that all visualisations only represent the data that corresponds to the time range selected here. If you see empty visualisations or think some data is missing, your time range might be too small.

(3) This visualisation shows the CPU utilisation of each of your servers that run metricbeat. The value shown is the average over the entire chosen time range ((2)). Be aware that 100% does not necessarily represent the maximum utilisation, because metricbeat/Windows treat 100% as the full utilisation of one CPU core. The maximum utilisation is therefore the server's number of CPU cores times 100 - a 4-core server, for example, can report up to 400%. (You can change the intervals for the green, yellow and red zones in the visualisation editor, but only for the entire visualisation and not for each gauge.)
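If you want to normalise such a raw value to a familiar 0-100% scale yourself, the arithmetic is a simple division. A minimal sketch (the core count of 4 is an assumed example, substitute your server's value):

```python
# Minimal sketch: normalising a metricbeat-style CPU value to a 0-100% scale.
# The core count (4) is an assumed example; use your server's actual core count.
CORES = 4

def normalized_cpu_percent(raw_percent: float, cores: int) -> float:
    """raw_percent may range from 0 to cores * 100; scale it to 0-100."""
    return raw_percent / cores

print(normalized_cpu_percent(250.0, CORES))  # 62.5 -> 250% of one core on a 4-core server
```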

(4) This visualisation shows the RAM utilisation of each of your servers that run metricbeat. The value shown is the average over the entire chosen time range ((2)). 100% means the RAM is completely utilised. (You can change the intervals for the green, yellow and red zones in the visualisation editor.)

(5) This visualisation shows the course of the CPU utilisation over the chosen time range for each of your servers that run metricbeat. The time range is subdivided into single buckets that each aggregate their data. The amount of time that each bucket represents is automatically determined by Kibana so that the entire diagram fits the available space. Each dot in the diagram represents one bucket; the actual bucket size is the time difference between two dots. The value of each dot is the average CPU utilisation of the depicted server over the time that the bucket represents.
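To make the bucketing concrete, here is a hedged sketch of the kind of Elasticsearch query Kibana issues behind such a diagram. The host, index pattern (metricbeat-*) and field name (system.cpu.total.pct) follow metricbeat defaults but may differ in your setup, and the fixed 10-minute interval stands in for the bucket size that Kibana picks automatically:

```python
import requests

# Hedged sketch of the aggregation behind visualisation (5): one date_histogram
# bucket per interval, each holding the average CPU utilisation over that interval.
# Host, index pattern and field name are assumptions based on metricbeat defaults.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-24h"}}},
    "aggs": {
        "per_bucket": {
            # On Elasticsearch versions before 7.2, use "interval" instead.
            "date_histogram": {"field": "@timestamp", "fixed_interval": "10m"},
            "aggs": {"avg_cpu": {"avg": {"field": "system.cpu.total.pct"}}},
        }
    },
}

resp = requests.post("http://localhost:9200/metricbeat-*/_search", json=query)
for bucket in resp.json()["aggregations"]["per_bucket"]["buckets"]:
    print(bucket["key_as_string"], bucket["avg_cpu"]["value"])
```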

Handling the diagrams

Zooming into a smaller time range:
You can zoom in to a smaller time range by clicking on the desired start of the time range in the diagram, holding the mouse button down and dragging to the desired end of the time range. The time range selector will be adjusted to the new values and the diagram (actually the entire page) will be reloaded to show only this time range. Be aware that all visualisations are affected by this change. You can return to the original time range by clicking the browser's back button or by manually adjusting the time range selector ((2)).

Course line filtering / color:
With a click on the name of the desired course line in the legend of the visualisation, a sub-menu with two magnifying glasses and a color palette shows up. If you click on one of the colors, you change the color of the corresponding course line. If you click on the magnifying glass with the plus symbol, you set a filter so that only this specific course line is shown and all others are taken out of the diagram. Clicking on the magnifying glass with the minus symbol filters the corresponding course line out of the diagram, leaving all others in place.
To go back to the previous color, either click on it in the palette or click the browser's back button. To get a filtered course line back, click the browser's back button or scroll to the top and click on the "x" of the filter that is now present (right above the CPU gauges visualisation).

(6) This visualisation shows the course of the RAM utilisation of each of your servers that run metricbeat over the chosen time range. The visualisation behaves analogously to (5).

(7) This visualisation shows the course of how many bytes (on average over the bucket size) were read from each of the hard drive partitions of each server running metricbeat.

(8) This visualisation shows the course of how many bytes (on average over the bucket size) were written to each of the hard drive partitions of each server running metricbeat.

(9) This visualisation shows how many different users and sessions (on average over the bucket size) were active during the chosen time range. "Users" refers to unique login names, "sessions" to unique session IDs. For example, if one user logs into the client with his/her browser and also has an agent that is logged in, there will be one active user but two active sessions.
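Counting "different" users and sessions maps to cardinality aggregations in Elasticsearch. A hedged sketch follows; the index name (yuuvis-metrics-*) and the field names (user.name, session.id) are hypothetical placeholders, so check them against your actual index schema:

```python
import requests

# Hedged sketch: counting distinct users vs. distinct sessions with cardinality
# aggregations. Index name and field names are hypothetical placeholders.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-24h"}}},
    "aggs": {
        "unique_users": {"cardinality": {"field": "user.name"}},      # login names
        "unique_sessions": {"cardinality": {"field": "session.id"}},  # session IDs
    },
}

resp = requests.post("http://localhost:9200/yuuvis-metrics-*/_search", json=query)
aggs = resp.json()["aggregations"]
print("users:", aggs["unique_users"]["value"], "sessions:", aggs["unique_sessions"]["value"])
```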

(10) This visualisation shows the course of how many REST-API requests each user (login name) has made (summed up per bucket) - either implicitly by using the client, the agent, etc., or explicitly by sending a request directly to the API. It is ordered by the total number of calls made, descending.

(11) This metric shows the count of how many different users (i.e. login-names) made at least one call over the entire time range chosen.

(12) This metric shows the count of how many different sessions (i.e. session-ids) appeared in at least one request over the entire time range chosen.

(13) This visualisation shows the course of how many messages each queue of the messaging service contained - on average over the bucket size. This visualisation is especially suitable for monitoring whether bursts of actions - e.g. importing or updating many objects at once - can be processed before the next burst starts. In such a case you would, for example, see a sudden rise in the FULLTEXTINDEX and RENDITION queues followed by a gradual decline. If the queues reach 0 before the next burst, the performance is sufficient; if not, there might be a performance or hardware resource problem.
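For illustration, here is a minimal sketch of that drain check applied to an already-fetched series of per-bucket queue depths. The sample data is invented:

```python
# Minimal sketch of the drain check described in (13): given per-bucket average
# depths of one queue, split the series into bursts (maximal runs of non-zero
# depth) and check whether each burst drained back to 0. Sample data is invented.
depths = [0, 0, 120, 85, 40, 10, 0, 0, 95, 60, 20, 0]

def bursts(depths):
    """Return the list of bursts and whether the series ended fully drained."""
    runs, current = [], []
    for d in depths:
        if d > 0:
            current.append(d)
        elif current:
            runs.append(current)
            current = []
    if current:  # the window ended while the queue still held messages
        runs.append(current)
        return runs, False
    return runs, True

runs, drained = bursts(depths)
print([max(r) for r in runs], "all drained:", drained)  # [120, 95] all drained: True
```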

(14) This set of metrics shows the average processing duration in milliseconds of a) all requests, b) all search requests (including aggregation requests), c) all object creation requests (with and without content-file) and d) all object read requests (equal to opening the object in the client) made during the entire time range chosen. This only serves to give a rough overview of how the durations of creating, reading and searching objects relate to the overall average.

(15) This visualisation shows the course of the processing duration in milliseconds of a) all requests, b) all search requests (including aggregation requests), c) all object creation requests (with and without content-file) and d) all object read requests (equal to opening the object in the client) made during the time range chosen (averaged over the bucket size).

(16) This set of metrics shows the total number of requests that were sent to the a) dms-service, b) search-service and c) index-service. This only serves to give an understanding of how much load each component is under.

(17) This visualisation shows the course of how many requests resulted in which type of HTTP response code, where 2xx means OK, 3xx means forwarded but OK, 4xx means a logical error (user error) and 5xx means an internal server error. The values are the sum over each bucket. This visualisation is especially suitable for monitoring whether, for example after a change or update, something no longer works correctly. In that case you will see the 4xx and/or 5xx request count rising and (logically) the 2xx request count falling. The Y-axis of this visualisation is logarithmic so that the course of the usually very few 4xx and maybe 5xx requests can still be seen in sufficient detail.
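The grouping into 2xx/3xx/4xx/5xx classes is simply the status code rounded down to its hundreds digit. A minimal sketch on invented sample data:

```python
from collections import Counter

# Minimal sketch of the grouping behind visualisation (17): collapse raw HTTP
# status codes into their classes (2xx, 3xx, 4xx, 5xx) and count per class.
# The sample codes are invented.
codes = [200, 201, 200, 302, 404, 200, 500, 404, 204]

counts = Counter(f"{code // 100}xx" for code in codes)
print(counts)  # Counter({'2xx': 5, '4xx': 2, '3xx': 1, '5xx': 1})
```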

The metrics (18) - (32) show the total number of the following actions that succeeded (left side) and failed (right side):

(18) object (document) creations with a content-file (the endpoint can also be called without actually passing a content-file; this cannot be distinguished here)

(19) object creations without a content-file (documents and folders)

(20) subsequent additions of a content-file to an existing object (document)

(21) batch creations of objects (only the number of batch requests; the actual number of objects created using this endpoint is not shown here)

(22) updates to the indexdata of an object

(23) batch updates to the indexdata of objects (only the number of batch requests; the actual number of objects updated using this endpoint is not shown here)

(24) object deletions (includes soft (to the recycle bin) and hard (actual deletion) deletions)

(25) batch deletions of objects (only the number of batch requests; the actual number of objects deleted using this endpoint is not shown here)

(26) retrievals of the PDF rendition of an object (document with content-file)

(27) retrievals of the original content-file of an object (document with content-file)

(28) search requests excluding aggregations

(29) aggregation requests (previews the number of objects that the current search configuration would yield)

(30) creations of objects using the prepared state of the client (i.e. clicking on "file" or "file and open" in the prepared state of the client)

(31) executed ETL configurations (Be aware that ETL is deprecated and will soon no longer be supported. We recommend using Talend Open Studio.)

(32) starts of workflow instances

(33) This metric set shows the count of requests sent to the top 10 - i.e. most often called - endpoints of the dms service's REST-API. The value is the sum of requests made during the entire time range chosen. This serves to give a rough estimate of which endpoints are most important to the system.

(34) This metric set is analogous to (33), except that it is restricted to the DmsService subset of endpoints. This serves to give an estimate of how many and what kind of object manipulations are performed.

(35) This visualisation corresponds to the "API Trace" page of the REST-API website. It shows one table for each of the HTTP response code categories (2xx, 3xx, 4xx, 5xx). In the tables you see the endpoints that responded at least once with the corresponding HTTP response code, ordered by the total number of requests sent to this endpoint (total invocations). Besides the count, you can see what the minimum, maximum, total (summed up) and average response durations were for the time range chosen. If a table does not show up, like the 5xx table in this example, it means there were no requests that resulted in this response code.
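Such a table maps to a terms aggregation per endpoint with a stats sub-aggregation over the response duration. A hedged sketch follows; the index name (yuuvis-metrics-*) and the field names (endpoint, response.code, response.duration_ms) are hypothetical placeholders for whatever your metrics index actually stores:

```python
import requests

# Hedged sketch of the aggregation behind the API-trace tables in (35): per
# endpoint, the request count plus min/max/avg/sum of the response duration,
# restricted to one response code class. Index and field names are hypothetical.
query = {
    "size": 0,
    "query": {"range": {"response.code": {"gte": 200, "lt": 300}}},  # the 2xx table
    "aggs": {
        "by_endpoint": {
            "terms": {"field": "endpoint", "size": 50, "order": {"_count": "desc"}},
            "aggs": {"duration": {"stats": {"field": "response.duration_ms"}}},
        }
    },
}

resp = requests.post("http://localhost:9200/yuuvis-metrics-*/_search", json=query)
for b in resp.json()["aggregations"]["by_endpoint"]["buckets"]:
    d = b["duration"]
    print(b["key"], b["doc_count"], d["min"], d["max"], d["avg"], d["sum"])
```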


For more information on how to handle Kibana (manually setting filters, editing existing or creating new visualisations, etc.), please refer to the Kibana documentation at https://www.elastic.co/guide/en/kibana/7.1/index.html.
