ML Training Pipeline
Responsible for preparing training data, training machine learning models, evaluating the trained models, and preparing them for deployment to production.
Characteristics
Beta Version
The Machine Learning (ML) Training Pipeline is a component of the Artificial Intelligence Platform. This platform is not included in yuuvis® Momentum installations and is available as a beta version only on request.
Function
The ML Training Pipeline is the part of the Artificial Intelligence Platform that handles data ingestion, data validation, transformation, machine learning training, and model evaluation. The pipeline is based on MLflow, an open-source platform for managing the ML lifecycle.
The ML Training Pipeline is operated via the command line application Kairos CLI.
Data Export
The source of data for machine learning is a document management system, e.g., yuuvis® Momentum. The data exported from yuuvis® Momentum is stored on local storage devices, in S3, or in Azure Blob Storage, in a format suitable for data ingestion.
Machine Learning Pipelines
The machine learning pipelines are components developed and shipped by OPTIMAL SYSTEMS GmbH. They contain all necessary procedures and algorithms to train machine learning models for different purposes (e.g., document classification and metadata extraction).
Machine learning pipelines are shipped as Docker containers that run inside the ML Training Pipeline.
At the moment, there are two different types of pipelines, one for document classification and one for metadata extraction. In the future, more types of pipelines will be added.
Document Classification
In the context of the AI platform, classification means determining the typification classes that fit an object based on its full-text rendition. For each object, one prediction is provided. It contains a mapping of classes to their corresponding relevance probabilities as well as a reference to the object in yuuvis® Momentum via objectId.
Instead of the class names used internally in the ML Pipeline, the prediction response bodies provide the object types as referenced in the Inference Schema.
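The structure of such a prediction can be illustrated with a minimal Python sketch. It is not the actual response format of the AI platform: the Prediction type, the CLASS_TO_OBJECT_TYPE mapping, the internal class names, and the object type names are all hypothetical and only show how internal class names might be replaced by object types before a prediction is returned.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Prediction:
    """One prediction per object (hypothetical structure)."""
    object_id: str                   # reference to the object in yuuvis Momentum
    probabilities: Dict[str, float]  # object type -> relevance probability


# Hypothetical mapping from internal class names to object types,
# standing in for what the Inference Schema would define.
CLASS_TO_OBJECT_TYPE = {"cls_0": "invoice", "cls_1": "contract"}


def to_prediction(object_id: str, internal_scores: Dict[str, float]) -> Prediction:
    """Replace internal class names with the corresponding object types."""
    mapped = {
        CLASS_TO_OBJECT_TYPE[name]: score
        for name, score in internal_scores.items()
    }
    return Prediction(object_id=object_id, probabilities=mapped)


def best_class(prediction: Prediction) -> str:
    """Return the object type with the highest relevance probability."""
    return max(prediction.probabilities, key=prediction.probabilities.get)
```

A caller would typically pick the object type with the highest probability, for example with best_class, and treat the remaining entries as ranked alternatives.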
Metadata Extraction
The ML Pipeline can analyze the PDF rendition of binary content files assigned to objects in yuuvis® Momentum in order to extract specific metadata. Based on the trained models, predictions for the values of specific object properties can be determined. These object properties have to be listed in the Inference Schema, which also specifies conditions for the values and settings for the prediction responses.
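The interplay between value conditions and response settings can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual Inference Schema format: the property names, the "pattern" and "max_predictions" keys, and the filter_predictions function are hypothetical.

```python
import re

# Hypothetical schema: one entry per object property, with a condition
# on the value (regular expression) and a response setting (maximum
# number of predictions to return). Not the real Inference Schema format.
INFERENCE_SCHEMA = {
    "invoiceNumber": {"pattern": r"[A-Z]{2}-\d{4}", "max_predictions": 1},
    "totalAmount": {"pattern": r"\d+\.\d{2}", "max_predictions": 2},
}


def filter_predictions(raw, schema):
    """Keep only predictions for listed properties whose values meet the conditions.

    raw maps a property name to a list of (value, probability) pairs.
    """
    result = {}
    for prop, settings in schema.items():
        # Drop candidate values that do not match the condition.
        candidates = [
            (value, prob)
            for value, prob in raw.get(prop, [])
            if re.fullmatch(settings["pattern"], value)
        ]
        # Most probable candidates first, limited by the response setting.
        candidates.sort(key=lambda c: c[1], reverse=True)
        result[prop] = candidates[: settings["max_predictions"]]
    return result
```

Properties that are not listed in the schema are ignored, which mirrors the requirement that every property to be predicted must be listed there.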
Machine Learning Training
Machine learning models are trained using the Kairos CLI app. The user specifies which data to use and which ML Pipeline to run in order to obtain a model for the desired purpose, for example, invoice metadata extraction.
Model Evaluation
After the machine learning training is done, the model is evaluated. By examining the training results, the user decides whether the model is suitable for use or needs further work, such as hyperparameter tuning or a larger data set.
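The kind of decision described above can be sketched in plain Python. This is a generic illustration, not the evaluation performed by the ML Training Pipeline: the metrics chosen (accuracy, per-class precision and recall), the function names, and the thresholds are assumptions.

```python
from collections import Counter


def evaluate(true_labels, predicted_labels):
    """Compute overall accuracy and per-class precision and recall."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one example is required")

    pairs = list(zip(true_labels, predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    true_positives = Counter(t for t, p in pairs if t == p)
    predicted = Counter(predicted_labels)
    actual = Counter(true_labels)

    per_class = {}
    for label in sorted(actual):
        # A class that was never predicted gets precision 0.0.
        precision = true_positives[label] / predicted[label] if predicted[label] else 0.0
        recall = true_positives[label] / actual[label]
        per_class[label] = {"precision": precision, "recall": recall}
    return accuracy, per_class


def is_suitable(accuracy, per_class, min_accuracy=0.9, min_recall=0.8):
    """Hypothetical acceptance rule: thresholds are illustrative only."""
    return accuracy >= min_accuracy and all(
        metrics["recall"] >= min_recall for metrics in per_class.values()
    )
```

If is_suitable returns False, the per-class values indicate where more training data or different hyperparameters are most likely to help.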
Model Registry
Models that are suitable for further use are stored in the Model Registry component. From the Model Registry component, models can be built and deployed to the Model Serving infrastructure.