The service of the Auto ML Platform is responsible for the training of models as well as the determination of predictions based on those models.
Responsible for preparing data for training, training machine learning models, evaluating trained models, and preparing for deployment to production.
Characteristics
Note: The ML Pipeline is a component of the Auto ML Platform. This platform is not included in yuuvis® Momentum installations and is available as a beta version only on request.
Function
The Machine Learning (ML) Training Pipeline is the core component of the Auto ML Platform and, as such, is responsible for data ingestion, data validation, data transformation, machine learning training, model evaluation, and model serving. The pipeline is based on MLflow, but other providers, such as Google TFX or Kubeflow, can be used alternatively.
Requests for the ML Pipeline are managed via separate services, each providing its own API:
- The PREDICT-API service provides prediction and status endpoints that can be called by client applications.
>> PREDICT-API Endpoints
The management of the ML Pipeline is done via the command line application Kairos CLI.
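A client application would typically reach the pipeline through the PREDICT-API service mentioned above. As a rough sketch only, the snippet below builds such a prediction request; the base URL, endpoint path, and payload field names here are assumptions for illustration — the authoritative contract is the PREDICT-API endpoint documentation referenced above.

```python
import json
from urllib import request

# Hypothetical values: host and path are NOT taken from the PREDICT-API
# documentation, they only illustrate the request shape.
BASE_URL = "http://predict-api.mytenant.example.com"

def build_classification_request(object_id: str) -> request.Request:
    """Build a POST request asking for a classification prediction
    for one yuuvis(R) Momentum object, referenced by its objectId."""
    payload = json.dumps({"objectId": object_id}).encode("utf-8")
    return request.Request(
        url=f"{BASE_URL}/api/predict/classification",  # assumed path
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_classification_request("demo-object-id")
print(req.get_full_url())
```

Sending the request (e.g., via `urllib.request.urlopen`) would then return a prediction response as described in the sections below.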
Machine Learning Training
The ML Pipeline needs to be trained by means of reference objects stored in MLflow – an open-source platform for managing ML lifecycles.
Data Export
The source of data for machine learning is a document management system, e.g., yuuvis® Momentum, in which users have manually assigned the individual object types. The data exported from yuuvis® Momentum is stored in a format suitable for data ingestion on local storage or S3. This data is used to train the models for the determination of predictions. The data has to be exported in a predefined format and made available to the provided training pipelines.
Machine Learning Pipelines
The machine learning pipelines are components developed and shipped by OPTIMAL SYSTEMS GmbH. They contain all necessary procedures and algorithms to train machine learning models for different purposes (e.g., document classification and metadata extraction).
At the moment, pipelines can be used for document classification (for instance, determining whether a document is an invoice, a contract, a sick note, or something else) and for metadata extraction (for instance, extracting the issuing date, total amount, and invoice number from an invoice).
Document Classification
In the context of the AI platform, classification means the determination of suitable typification classes fitting an object based on its full-text rendition. For each object, one prediction is provided that contains mappings of classes and their corresponding relevance probability, as well as a reference to the object in yuuvis® Momentum via objectId.
Instead of the class names used internally in the ML Pipeline, the prediction response bodies provide the object types as referenced in the inference schema described below.
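To make the description above concrete, the snippet below models the shape of such a classification prediction and shows how a client would pick the most relevant class. The field names (`predictions`, `probability`) are an assumption derived from the description, not a verbatim API response.

```python
# Illustrative prediction shape: one objectId reference plus class mappings
# with relevance probabilities, as described above (field names assumed).
prediction = {
    "objectId": "demo-object-id",
    "predictions": [
        {"objectTypeId": "appImpulse:receiptsot|appImpulse:receiptType|Rechnung",
         "probability": 0.91},
        {"objectTypeId": "appImpulse:contractsot",
         "probability": 0.07},
    ],
}

# A client would typically select the class with the highest probability:
best = max(prediction["predictions"], key=lambda p: p["probability"])
print(best["objectTypeId"])
# → appImpulse:receiptsot|appImpulse:receiptType|Rechnung
```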
Metadata Extraction
The ML Pipeline can analyze the PDF rendition of binary content files assigned to objects in yuuvis® Momentum in order to extract specific metadata. Based on the trained models, predictions for the values of specific object properties can be determined. The object properties have to be listed in the inference schema, where conditions for the values and settings for the prediction responses are also specified.
Inference Schema
The inference schema is a JSON configuration file defining the object types that will be available for the classification as well as the properties for which the metadata extraction should determine suitable values. At the same time, each internal aiObjectTypeId (aiPropertyId) gets a corresponding objectTypeId (propertyId) that will be used in the response bodies of the classification (extraction) endpoints to be compatible with the connected client application.
The inference schema is defined for a specific tenant. It is also possible to further limit the usage of the inference schema to an app by specifying appName (e.g., to distinguish between a client app for single uploads and batch processing apps).
{
"tenant" : "mytenant",
"appName" : "AIInvoiceClient",
"classification" : {
"enabled" : true,
"timeout" : 2,
"aiClassifierId" : "DOCUMENT_CLASSIFICATION",
"objectTypes": [
{
"objectTypeId" : "appImpulse:receiptsot|appImpulse:receiptType|Rechnung",
"aiObjectTypeId" : "INVOICE"
},
{
"objectTypeId" : "appImpulse:receiptsot|appImpulse:receiptType|Angebot",
"aiObjectTypeId" : "DOCUMENT_TYPE_2"
},
{
"objectTypeId" : "appImpulse:hrsot|appImpulse:receiptType|Bewerbung",
"aiObjectTypeId" : "DOCUMENT_TYPE_3"
}
]
},
"extraction" : {
"enabled" : true,
"timeout" : 5,
"objects" : [
{
"objectTypeId" : "invoice",
"enabled" : true,
"timeout" : 10,
"propertyReference" : [
{
"propertyId" : "companyName",
"aiPropertyId" : "INVOICE_COMPANY_NAME",
"allowedValues" : ["Company1", "Company2", "Company3"],
"pattern" : "/^[a-z]|\\d?[a-zA-Z0-9]?[a-zA-Z0-9\\s&@.]+$",
"validationService" : "my_company_name_validation_service",
"maxNumberOfPredictions" : 5
},
{
"propertyId" : "totalAmount",
"aiPropertyId" : "INVOICE_TOTAL_AMOUNT",
"pattern" : "^[0-9]*[.][0-9]*$",
"validationService" : "my_amounts_validation_service",
"maxNumberOfPredictions" : 1
}
]
}
]
}
}
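The mapping role of the schema can be sketched as follows: a model returns the internal aiObjectTypeId, and the response body exposes the corresponding objectTypeId. The lookup logic below is an illustration, not the pipeline's actual implementation; the mapping entries are taken from the example schema above.

```python
# Classification part of the example inference schema above (trimmed).
schema = {
    "classification": {
        "objectTypes": [
            {"objectTypeId": "appImpulse:receiptsot|appImpulse:receiptType|Rechnung",
             "aiObjectTypeId": "INVOICE"},
            {"objectTypeId": "appImpulse:receiptsot|appImpulse:receiptType|Angebot",
             "aiObjectTypeId": "DOCUMENT_TYPE_2"},
        ]
    }
}

def to_object_type_id(ai_id: str) -> str:
    """Translate an internal aiObjectTypeId into the client-facing
    objectTypeId used in classification response bodies."""
    for mapping in schema["classification"]["objectTypes"]:
        if mapping["aiObjectTypeId"] == ai_id:
            return mapping["objectTypeId"]
    raise KeyError(ai_id)

print(to_object_type_id("INVOICE"))
# → appImpulse:receiptsot|appImpulse:receiptType|Rechnung
```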
The following parameters are available in the inference schema:
...
Optional parameter: name of the app that uses the inference schema. Other apps within the tenant cannot use this inference schema but only their own app-specific inference schema or the tenant-wide inference schema.
...
Time limit for the determination of classification predictions in seconds. An error will be thrown if the calculation process could not be finished before the timeout was reached.
...
A list of mappings, each of them containing the following keys. This list defines the object types that are available for the classification prediction.
...
Time limit for the determination of extraction predictions in seconds.
The result will be returned even if the calculation process is still running for some models. Those models will be excluded from the response.
...
Boolean value specifying whether the metadata extraction is activated (true) or deactivated (false) for the specific object type. Ignored if extraction.enabled is set to false.
...
Optional time limit in seconds overwriting extraction.timeout for the determination of extraction predictions for properties belonging to the object type specified by objectTypeId.
The result will be returned even if the calculation process is still running for some models. Those models will be excluded from the response.
...
The identification of a property as it will appear in prediction response bodies. You can define a concatenation of several secondary object type IDs, catalog values etc. that can be interpreted by your client application to show the prediction results in proper format.
...
Optional parameter: URL of an endpoint for further validation of the value determined for the property specified by propertyId.
Note: Not available in the beta version, in which connecting an additional validation service requires further configuration steps.
...
Optional parameter: an integer value defining the maximum number of values included in the prediction response for the property propertyId. If not specified, the default value 1 will be used.
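How the per-property settings interact can be sketched with a small filter: candidate values are checked against pattern and allowedValues, then truncated to maxNumberOfPredictions. The filtering code is an assumption for illustration; only the keys and the totalAmount pattern come from the schema above.

```python
import re

# Per-property settings for totalAmount, taken from the example schema above.
prop = {
    "propertyId": "totalAmount",
    "pattern": "^[0-9]*[.][0-9]*$",
    "maxNumberOfPredictions": 1,
}

def filter_predictions(values, prop):
    """Keep only candidate values matching the configured pattern (and
    allowedValues, if present), truncated to maxNumberOfPredictions."""
    valid = [v for v in values if re.fullmatch(prop["pattern"], v)]
    if "allowedValues" in prop:
        valid = [v for v in valid if v in prop["allowedValues"]]
    return valid[: prop.get("maxNumberOfPredictions", 1)]

print(filter_predictions(["129.90", "abc", "7.5"], prop))
# → ['129.90']
```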
In order to combine the AI platform with yuuvis® client as reference implementation, the following inference schema is required:
{
"tenant" : "os__papi",
"appName" : "AIInvoiceClient",
"classification" : {
"enabled" : true,
"timeout" : 10,
"aiClassifierId" : "DOCUMENT_CLASSIFICATION",
"objectTypes": [
{
"objectTypeId" : "appImpulse:hrdocsot|appImpulse:hrDocumentType|Bescheinigung",
"aiObjectTypeId" : "appImpulse:contractsot"
},
{
"objectTypeId" : "appImpulse:receiptsot",
"aiObjectTypeId" : "appImpulse:hrdocsot"
},
{
"objectTypeId" : "appImpulse:contractsot|appImpulse:contractType|Arbeitsvertrag",
"aiObjectTypeId" : "appImpulse:receiptsot"
},
{
"objectTypeId" : "appImpulse:hrdocsot|appImpulse:hrDocumentType|Arbeitsvertrag",
"aiObjectTypeId" : "appImpulse:basedocumentsot"
}
]
}
}
Requirements
The Auto ML Pipeline is a part of the Auto ML Platform and can run only in combination with the other included services.
ML Pipeline furthermore requires:
- S3 or local storage
If you want to use the ML Pipeline for the AI integration in yuuvis® client as reference implementation, the requirements of the CLIENT Service also have to be considered.
Installation
The Auto ML Platform services including the ML Pipeline are not yet included in yuuvis® Momentum installations but only available on request.
Configuration
...
Model Evaluation
After the machine learning training is done, the model is evaluated. By examining the training results, the user decides whether the model is suitable for use or needs longer training, a larger data set, etc.
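The platform leaves this suitability decision to the user and does not prescribe acceptance metrics. As a purely illustrative sketch, a decision helper could compare evaluation metrics against a threshold; the metric name and threshold below are assumptions, not platform defaults.

```python
def is_suitable(metrics: dict, min_accuracy: float = 0.9) -> bool:
    """Return True if the evaluated model meets an illustrative
    acceptance threshold; otherwise it needs longer training or more data."""
    return metrics.get("accuracy", 0.0) >= min_accuracy

print(is_suitable({"accuracy": 0.94}))  # → True
print(is_suitable({"accuracy": 0.71}))  # → False
```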
Model Registry
Models that are suitable for further use are stored in the Model Registry component. From there, models can be dockerized and deployed to the serving infrastructure (typically, the same Kubernetes cluster in which yuuvis® Momentum is running).