Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.


Page Properties
hiddentrue
idPROGRESSDONE

Product Version2021 Autumn
Report Note
AssigneeAntje

Resources & Remarks

Modification History

The service of the Auto ML Platform is responsible for the training of models as well as the determination of predictions based on those models.

NameDateProduct VersionAction
Antje21 JUL 20212021 Autumncreated
Goran18 OCT 20212021 Autumnupdated
Antje25 NOV 20222022 Winterremove beta label, update content



Excerpt

Responsible for preparing data for training, training of machine learning models, evaluation of trained models, and preparing for deployment to production.


Section
bordertrue


Column

Table of Contents

Table of Contents
exclude(Table of Contents|Read on|Basic Use Case Flows|Preprocessing Metadata using Webhooks|Concept of System Hooks|Another interesting Tutorial|System Hooks|Graphical Overview \/ Use Cases \(Flows\))

Characteristics

Note
titleBeta Version
The ML Pipeline is a component of the Auto ML Platform. This platform is not included in yuuvis® Momentum installations and is available as a beta version only on request.


Function

The Machine Learning (ML) Training Pipeline is the heart of the Auto ML Platform and, as such, is responsible for data ingestion, data validation, data transformation, machine learning training, model evaluation, and model serving. The pipeline is based on MLflow, but other providers, such as Google TFX or Kubeflow, can be used alternatively.

Requests for the ML Pipeline are managed via separate services that each provide their own API:

  • The PREDICT-API service provides prediction and status endpoints that can be called by client applications.
    >> PREDICT-API Endpoints

The management of the ML Pipeline is done via the command line application Kairos CLI.

Machine Learning Training

The ML Pipeline needs to be trained by means of reference objects stored in MLflow – an open-source platform for managing ML lifecycles.

Data Export

The source of data for machine learning is a document management system, e.g., yuuvis® Momentum, for which users have manually defined the individual object types. The data exported from yuuvis® Momentum is stored in a format suitable for data ingestion on local storage or S3. This data is used to train the models for the determination of predictions. The data has to be exported in a predefined format and made available to the provided training pipelines.

Machine Learning Pipelines

The machine learning pipelines are components developed and shipped by OPTIMAL SYSTEMS GmbH. They contain all necessary procedures and algorithms to train machine learning models for different purposes (e.g., document classification and metadata extraction).

At the moment, pipelines can be used for document classification (for instance, determining whether a document is an invoice, a contract, a sick leave certificate, or something else) and for metadata extraction (for instance, extracting the issuing date, total amount, and invoice number from an invoice).

Document Classification

In the context of the AI platform, classification means the determination of suitable typification classes fitting for an object based on its full-text rendition. For one object, one prediction is provided that contains mappings of classes and their corresponding relevance probability as well as a reference on the object in yuuvis® Momentum via objectId.
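The shape of such a classification prediction can be sketched as a mapping of classes to relevance probabilities plus the objectId reference. The field names in the snippet below are illustrative assumptions, not the exact PREDICT-API response format:

```python
# Illustrative sketch only: the dictionary layout and field names are
# assumptions, not the exact PREDICT-API response format.
prediction = {
    "objectId": "042e3f5f-0a9c-4c1e-8a2b-111111111111",  # hypothetical ID
    "classes": {
        "INVOICE": 0.91,
        "DOCUMENT_TYPE_2": 0.06,
        "DOCUMENT_TYPE_3": 0.03,
    },
}

def top_class(pred):
    """Return the class with the highest relevance probability."""
    return max(pred["classes"].items(), key=lambda kv: kv[1])

print(top_class(prediction))  # → ('INVOICE', 0.91)
```

A client application would typically apply such a selection (or a probability threshold) before suggesting an object type to the user.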

Instead of the class names used internally in the ML Pipeline, the prediction response bodies provide the object types as referenced in the inference schema described below.

Metadata Extraction

The ML Pipeline can analyze the PDF rendition of binary content files assigned to objects in yuuvis® Momentum in order to extract specific metadata. Based on the trained models, predictions for values of specific object properties can be determined. The object properties have to be listed in the inference schema, where conditions for the values and settings for the prediction responses are also specified.

Inference Schema

The inference schema is a JSON configuration file defining the object types that will be available for the classification as well as the properties for which the metadata extraction should determine suitable values. At the same time, each internal aiObjectTypeId (aiPropertyId) gets a corresponding objectTypeId (propertyId) that will be used in the response bodies of the classification (extraction) endpoints to be compatible with the connected client application.

The inference schema is defined for a specific tenant. It is also possible to further limit the usage of the inference schema to an app by specifying appName (e.g., to distinguish between a client app for single uploads and batch processing apps).

Code Block
languagejson
titleExample for an inference schema
{
    "tenant" : "mytenant",
    "appName" : "AIInvoiceClient",
    "classification" : {
        "enabled" : true,
        "timeout" : 2,
        "aiClassifierId" : "DOCUMENT_CLASSIFICATION",
        "objectTypes": [
            {
                "objectTypeId" : "appImpulse:receiptsot|appImpulse:receiptType|Rechnung",
                "aiObjectTypeId" : "INVOICE"
            },
            {
                "objectTypeId" : "appImpulse:receiptsot|appImpulse:receiptType|Angebot",
                "aiObjectTypeId" : "DOCUMENT_TYPE_2"
            },
            {
                "objectTypeId" : "appImpulse:hrsot|appImpulse:receiptType|Bewerbung",
                "aiObjectTypeId" : "DOCUMENT_TYPE_3"
            }
        ]
    },
    "extraction" : {
        "enabled" : true,
        "timeout" : 5,
        "objects" : [
            {
                "objectTypeId" : "invoice",
                "enabled" : true,
                "timeout" : 10,
                "propertyReference" : [
                    {
                        "propertyId" : "companyName",
                        "aiPropertyId" : "INVOICE_COMPANY_NAME",
                        "allowedValues" : ["Company1", "Company2", "Company3"],
                        "pattern" : "/^[a-z]|\\d?[a-zA-Z0-9]?[a-zA-Z0-9\\s&@.]+$",
                        "validationService" : "my_company_name_validation_service",
                        "maxNumberOfPredictions" : 5
                    },
                    {
                        "propertyId" : "totalAmount",
                        "aiPropertyId" : "INVOICE_TOTAL_AMOUNT",
                        "pattern" : "^[0-9]*[.][0-9]*$",
                        "validationService" : "my_amounts_validation_service",
                        "maxNumberOfPredictions" : 1
                    }
                ]
            }
        ]
    }
}
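The translation between internal and client-facing identifiers that the schema encodes can be sketched in a few lines. The helper below is hypothetical and only illustrates how a client might map an internal aiObjectTypeId from the example schema above back to its objectTypeId:

```python
# Hypothetical helper: translates an internal class name (aiObjectTypeId)
# into the client-facing objectTypeId defined in an inference schema.
# The schema excerpt mirrors the example above; the function itself is
# an illustration, not part of the product API.
schema = {
    "classification": {
        "objectTypes": [
            {"objectTypeId": "appImpulse:receiptsot|appImpulse:receiptType|Rechnung",
             "aiObjectTypeId": "INVOICE"},
            {"objectTypeId": "appImpulse:receiptsot|appImpulse:receiptType|Angebot",
             "aiObjectTypeId": "DOCUMENT_TYPE_2"},
        ]
    }
}

def to_object_type_id(schema, ai_id):
    """Look up the objectTypeId mapped to an internal aiObjectTypeId."""
    for entry in schema["classification"]["objectTypes"]:
        if entry["aiObjectTypeId"] == ai_id:
            return entry["objectTypeId"]
    return None  # unknown internal class

print(to_object_type_id(schema, "INVOICE"))
# → appImpulse:receiptsot|appImpulse:receiptType|Rechnung
```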

The following parameters are available in the inference schema:

...

Optional parameter: name of the app that uses the inference schema. Other apps within the tenant cannot use this inference schema but can only use their own app-specific inference schema or the tenant-wide inference schema.

...

Time limit for the determination of classification predictions in seconds.

An error will be thrown if the calculation process cannot be finished before the timeout is reached.

...

A list of mappings, each containing the following keys. This list defines the object types that are available for the classification prediction.

...

Time limit for the determination of extraction predictions in seconds.

The result will be returned even if the calculation process is still running for some models. Those models will be excluded from the response.

...

Boolean value specifying whether the metadata extraction is activated (true) or deactivated (false) for the specific object type.

Ignored if extraction.enabled is set to false.

...

Optional time limit in seconds overwriting extraction.timeout for the determination of extraction predictions for properties belonging to the object type specified by objectTypeId.

The result will be returned even if the calculation process is still running for some models. Those models will be excluded from the response.

...

The identifier of a property as it will appear in prediction response bodies. You can define a concatenation of several secondary object type IDs, catalog values, etc., that can be interpreted by your client application to display the prediction results in the proper format.

...

Optional parameter: URL of an endpoint for further validation of the value determined for the property specified by propertyId.

Note: Not available in the beta version where the connection of an additional validation service needs more configuration steps.

...

Optional parameter: An integer value defining the maximum number of values included in the prediction response for the property propertyId.

If not specified, the default value 1 will be used.
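The interplay of `pattern` and `maxNumberOfPredictions` can be sketched as a filter-and-truncate step over candidate values. The code below is an assumption about this behavior, not the exact pipeline logic; it reuses the totalAmount pattern from the example schema above:

```python
import re

# Sketch of how `pattern` and `maxNumberOfPredictions` might interact:
# candidate values are filtered by the regular expression, then the list
# is truncated. This is an assumption, not the exact pipeline behavior.
def filter_predictions(candidates, pattern, max_predictions=1):
    valid = [c for c in candidates if re.fullmatch(pattern, c)]
    return valid[:max_predictions]

# totalAmount pattern from the example inference schema
amounts = ["1299.00", "£42", "17.50"]
print(filter_predictions(amounts, "^[0-9]*[.][0-9]*$", max_predictions=1))
# → ['1299.00']
```

With `maxNumberOfPredictions` set to 1 (the default), only the first matching candidate would be returned to the client.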

In order to combine the AI platform with the yuuvis® client as reference implementation, the following inference schema is required:

Code Block
languagejson
titleInference schema for the combination with yuuvis® Momentum CLIENT service
collapsetrue
{
    "tenant" : "os__papi",
    "appName" : "AIInvoiceClient",
    "classification" : {
        "enabled" : true,
        "timeout" : 10,
        "aiClassifierId" : "DOCUMENT_CLASSIFICATION",
        "objectTypes": [
            {
                "objectTypeId" : "appImpulse:hrdocsot|appImpulse:hrDocumentType|Bescheinigung",
                "aiObjectTypeId" : "appImpulse:contractsot"
            },
            {
                "objectTypeId" : "appImpulse:receiptsot",
                "aiObjectTypeId" : "appImpulse:hrdocsot"
            },
            {
                "objectTypeId" : "appImpulse:contractsot|appImpulse:contractType|Arbeitsvertrag",
                "aiObjectTypeId" : "appImpulse:receiptsot"
            },
            {
                "objectTypeId" : "appImpulse:hrdocsot|appImpulse:hrDocumentType|Arbeitsvertrag",
                "aiObjectTypeId" : "appImpulse:basedocumentsot"
            }
        ]
    }
}

Requirements

The Auto ML Pipeline is a part of the Auto ML Platform and can run only in combination with the other included services.

The ML Pipeline furthermore requires:

  • S3 or local storage

If you want to use the ML Pipeline for the AI integration in the yuuvis® client as reference implementation, the requirements of the CLIENT service also have to be considered.

Installation

The Auto ML Platform services including the ML Pipeline are not yet included in yuuvis® Momentum installations but only available on request.

Configuration

...

Model Evaluation

After the machine learning training is done, the model is evaluated. By examining the training results, the user decides whether the model is suitable for use or needs longer training, a larger data set, etc.

Model Registry

Models that are suitable for further use are stored in the Model Registry component. From the Model Registry component, models can be dockerized and deployed to the serving infrastructure (typically, to the same Kubernetes cluster where yuuvis® Momentum is running).

Info
iconfalse

Read on


Section


Column
width25%

Inference Schema

Insert excerpt
Inference Schema
Inference Schema
nopaneltrue
 Keep reading


Column
width25%

KAIROS-API Service

Insert excerpt
KAIROS-API Service
KAIROS-API Service
nopaneltrue
 Keep reading



Column
width25%

PREDICT-API Service

Insert excerpt
PREDICT-API Service
PREDICT-API Service
nopaneltrue
 Keep reading