Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.


Page Properties
hiddentrue
idPROGRESSDONE

Product Version2021 Autumn
Report Note
AssigneeAntje

Resources & Remarks

Modification History

The service of the Auto ML Platform is responsible for the training of models as well as the determination of predictions based on those models.

NameDateProduct VersionAction
Antje21 JUL 20212021 Autumncreated
Goran18 OCT 20212021 Autumnupdated
Antje25 NOV 20222022 Winterremove beta label, update content



Excerpt

Responsible for preparing data for training, training of machine learning models, evaluation of trained models, and preparing for deployment to production.


Section
bordertrue


Column

Table of Contents

Table of Contents
exclude(Table of Contents|Read on|Basic Use Case Flows|Preprocessing Metadata using Webhooks|Concept of System Hooks|Another interesting Tutorial|System Hooks|Graphical Overview \/ Use Cases \(Flows\))

Characteristics

Note
titleBeta Version
The ML Pipeline is a component of the Auto ML Platform. This platform is not included in yuuvis® Momentum installations and is available as a beta version only on request.


Function

The Machine Learning (ML) Training Pipeline is the heart of the Auto ML Platform and, as such, is responsible for data ingestion, data validation, data transformation, machine learning training, model evaluation, and model serving. The pipeline is based on MLflow, but other providers, such as Google TFX or Kubeflow, can be used alternatively.

Requests for the ML Pipeline are managed via separate services that each provide their own API:

  • The PREDICT-API service provides prediction and status endpoints that can be called by client applications.
    >> PREDICT-API Endpoints

The management of the ML Pipeline is done via the command line application Kairos CLI.

Machine Learning Training

The ML Pipeline needs to be trained by means of reference objects stored in MLflow – an open-source platform for managing ML lifecycles.

Data Export

The source of data for machine learning is a document management system, e.g., yuuvis® Momentum, for which users have manually defined the individual object types. The data exported from yuuvis® Momentum is stored in a format suitable for data ingestion on local storage or S3. This data is used to train the models for the determination of predictions. The data has to be exported in a predefined format and made available to the provided training pipelines.

Machine Learning Pipelines

The machine learning pipelines are components developed and shipped by OPTIMAL SYSTEMS GmbH. They contain all necessary procedures and algorithms to train machine learning models for different purposes (e.g., document classification and metadata extraction).

At the moment, pipelines can be used for document classification (for instance, determining whether a document is an invoice, a contract, a sick leave certificate, or something else) and for metadata extraction (for instance, extracting the issuing date, total amount, and invoice number from an invoice).

Document Classification

In the context of the AI platform, classification means the determination of suitable typification classes fitting for an object based on its full-text rendition. For one object, one prediction is provided that contains mappings of classes and their corresponding relevance probability as well as a reference on the object in yuuvis® Momentum via objectId.
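The shape of such a classification prediction can be sketched as a mapping of classes to relevance probabilities plus the objectId reference. The field names in the snippet below are illustrative assumptions, not the exact PREDICT-API response format:

```python
# Illustrative sketch only: the dictionary layout and field names are
# assumptions, not the exact PREDICT-API response format.
prediction = {
    "objectId": "042e3f5f-0a9c-4c1e-8a2b-111111111111",  # hypothetical ID
    "classes": {
        "INVOICE": 0.91,
        "DOCUMENT_TYPE_2": 0.06,
        "DOCUMENT_TYPE_3": 0.03,
    },
}

def top_class(pred):
    """Return the class with the highest relevance probability."""
    return max(pred["classes"].items(), key=lambda kv: kv[1])

print(top_class(prediction))  # → ('INVOICE', 0.91)
```

A client application would typically apply such a selection (or a probability threshold) before suggesting an object type to the user.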

Instead of the class names used internally in the ML Pipeline, the prediction response bodies provide the object types as referenced in the inference schema described below.

Metadata Extraction

The ML Pipeline can analyze the PDF rendition of binary content files assigned to objects in yuuvis® Momentum in order to extract specific metadata. Based on the trained models, predictions for values of specific object properties can be determined. The object properties have to be listed in the inference schema, where conditions for the values and settings for the prediction responses are also specified.

Inference Schema

The inference schema is a JSON configuration file defining the object types that will be available for the classification as well as the properties for which the metadata extraction should determine suitable values. At the same time, each internal aiObjectTypeId (aiPropertyId) gets a corresponding objectTypeId (propertyId) that will be used in the response bodies of the classification (extraction) endpoints to be compatible with the connected client application.

The inference schema is defined for a specific tenant. It is also possible to further limit the usage of the inference schema to an app by specifying appName (e.g., to distinguish between a client app for single uploads and batch processing apps).

Code Block
languagejson
titleExample for an inference schema
{
    "tenant" : "mytenant",
    "appName" : "AIInvoiceClient",
    "classification" : {
        "enabled" : true,
        "timeout" : 2,
        "aiClassifierId" : "DOCUMENT_CLASSIFICATION",
        "objectTypes": [
            {
                "objectTypeId" : "appImpulse:receiptsot|appImpulse:receiptType|Rechnung",
                "aiObjectTypeId" : "INVOICE"
            },
            {
                "objectTypeId" : "appImpulse:receiptsot|appImpulse:receiptType|Angebot",
                "aiObjectTypeId" : "DOCUMENT_TYPE_2"
            },
            {
                "objectTypeId" : "appImpulse:hrsot|appImpulse:receiptType|Bewerbung",
                "aiObjectTypeId" : "DOCUMENT_TYPE_3"
            }
        ]
    },
    "extraction" : {
        "enabled" : true,
        "timeout" : 5,
        "objects" : [
            {
                "objectTypeId" : "invoice",
                "enabled" : true,
                "timeout" : 10,
                "propertyReference" : [
                    {
                        "propertyId" : "companyName",
                        "aiPropertyId" : "INVOICE_COMPANY_NAME",
                        "allowedValues" : ["Company1", "Company2", "Company3"],
                        "pattern" : "/^[a-z]|\\d?[a-zA-Z0-9]?[a-zA-Z0-9\\s&@.]+$",
                        "validationService" : "my_company_name_validation_service",
                        "maxNumberOfPredictions" : 5
                    },
                    {
                        "propertyId" : "totalAmount",
                        "aiPropertyId" : "INVOICE_TOTAL_AMOUNT",
                        "pattern" : "^[0-9]*[.][0-9]*$",
                        "validationService" : "my_amounts_validation_service",
                        "maxNumberOfPredictions" : 1
                    }
                ]
            }
        ]
    }
}
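The translation between internal and client-facing identifiers that the schema encodes can be sketched in a few lines. The helper below is hypothetical and only illustrates how a client might map an internal aiObjectTypeId from the example schema above back to its objectTypeId:

```python
# Hypothetical helper: translates an internal class name (aiObjectTypeId)
# into the client-facing objectTypeId defined in an inference schema.
# The schema excerpt mirrors the example above; the function itself is
# an illustration, not part of the product API.
schema = {
    "classification": {
        "objectTypes": [
            {"objectTypeId": "appImpulse:receiptsot|appImpulse:receiptType|Rechnung",
             "aiObjectTypeId": "INVOICE"},
            {"objectTypeId": "appImpulse:receiptsot|appImpulse:receiptType|Angebot",
             "aiObjectTypeId": "DOCUMENT_TYPE_2"},
        ]
    }
}

def to_object_type_id(schema, ai_id):
    """Look up the objectTypeId mapped to an internal aiObjectTypeId."""
    for entry in schema["classification"]["objectTypes"]:
        if entry["aiObjectTypeId"] == ai_id:
            return entry["objectTypeId"]
    return None  # unknown internal class

print(to_object_type_id(schema, "INVOICE"))
# → appImpulse:receiptsot|appImpulse:receiptType|Rechnung
```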

The following parameters are available in the inference schema:

...

Optional parameter: name of the app that uses the inference schema. Other apps within the tenant cannot use this inference schema but can only use their own app-specific inference schema or the tenant-wide inference schema.

...

Time limit for the determination of classification predictions in seconds.

An error will be thrown if the calculation process cannot be finished before the timeout is reached.

...

A list of mappings, each containing the following keys. This list defines the object types that are available for the classification prediction.

...

Time limit for the determination of extraction predictions in seconds.

The result will be returned even if the calculation process is still running for some models. Those models will be excluded from the response.

...

Boolean value specifying whether the metadata extraction is activated (true) or deactivated (false) for the specific object type.

Ignored if extraction.enabled is set to false.

...

Optional time limit in seconds overwriting extraction.timeout for the determination of extraction predictions for properties belonging to the object type specified by objectTypeId.

The result will be returned even if the calculation process is still running for some models. Those models will be excluded from the response.

...

The identifier of a property as it will appear in prediction response bodies. You can define a concatenation of several secondary object type IDs, catalog values, etc., that can be interpreted by your client application to display the prediction results in the proper format.

...

Optional parameter: URL of an endpoint for further validation of the value determined for the property specified by propertyId.

Note: Not available in the beta version where the connection of an additional validation service needs more configuration steps.

...

Optional parameter: An integer value defining the maximum number of values included in the prediction response for the property propertyId.

If not specified, the default value 1 will be used.
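The interplay of `pattern` and `maxNumberOfPredictions` can be sketched as a filter-and-truncate step over candidate values. The code below is an assumption about this behavior, not the exact pipeline logic; it reuses the totalAmount pattern from the example schema above:

```python
import re

# Sketch of how `pattern` and `maxNumberOfPredictions` might interact:
# candidate values are filtered by the regular expression, then the list
# is truncated. This is an assumption, not the exact pipeline behavior.
def filter_predictions(candidates, pattern, max_predictions=1):
    valid = [c for c in candidates if re.fullmatch(pattern, c)]
    return valid[:max_predictions]

# totalAmount pattern from the example inference schema
amounts = ["1299.00", "£42", "17.50"]
print(filter_predictions(amounts, "^[0-9]*[.][0-9]*$", max_predictions=1))
# → ['1299.00']
```

With `maxNumberOfPredictions` set to 1 (the default), only the first matching candidate would be returned to the client.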

In order to combine the AI platform with the yuuvis® client as reference implementation, the following inference schema is required:

Code Block
languagejson
titleInference schema for the combination with yuuvis® Momentum CLIENT service
collapsetrue
{
    "tenant" : "os__papi",
    "appName" : "AIInvoiceClient",
    "classification" : {
        "enabled" : true,
        "timeout" : 10,
        "aiClassifierId" : "DOCUMENT_CLASSIFICATION",
        "objectTypes": [
            {
                "objectTypeId" : "appImpulse:hrdocsot|appImpulse:hrDocumentType|Bescheinigung",
                "aiObjectTypeId" : "appImpulse:contractsot"
            },
            {
                "objectTypeId" : "appImpulse:receiptsot",
                "aiObjectTypeId" : "appImpulse:hrdocsot"
            },
            {
                "objectTypeId" : "appImpulse:contractsot|appImpulse:contractType|Arbeitsvertrag",
                "aiObjectTypeId" : "appImpulse:receiptsot"
            },
            {
                "objectTypeId" : "appImpulse:hrdocsot|appImpulse:hrDocumentType|Arbeitsvertrag",
                "aiObjectTypeId" : "appImpulse:basedocumentsot"
            }
        ]
    }
}

Requirements

The Auto ML Pipeline is a part of the Auto ML Platform and can run only in combination with the other included services.

The ML Pipeline furthermore requires:

  • S3 or local storage

If you want to use the ML Pipeline for the AI integration in the yuuvis® client as reference implementation, the requirements of the CLIENT service also have to be considered.

Installation

The Auto ML Platform services including the ML Pipeline are not yet included in yuuvis® Momentum installations but only available on request.

Configuration

...

Model Evaluation

After the machine learning training is done, the model is evaluated. By examining the training results, the user decides whether the model is suitable for use or needs longer training, a larger data set, etc.

Model Registry

Models that are suitable for further use are stored in the Model Registry component. From the Model Registry component, models can be dockerized and deployed to the serving infrastructure (typically, to the same Kubernetes cluster where yuuvis® Momentum is running).

Info
iconfalse

Read on


Section


Column
width25%

Inference Schema

Insert excerpt
Inference Schema
Inference Schema
nopaneltrue
 Keep reading


Column
width25%

KAIROS-API Service

Insert excerpt
KAIROS-API Service
KAIROS-API Service
nopaneltrue
 Keep reading



Column
width25%

PREDICT-API Service

Insert excerpt
PREDICT-API Service
PREDICT-API Service
nopaneltrue
 Keep reading