The controller service generates job messages for asynchronous full-text indexing, delivers the required binary content, and stores the extracted text in Elasticsearch.


Characteristics

port range: 7332-7335

service name: controller

profiles: prod,docker,es,oauth2,lc,mq,kubernetes

Function

Asynchronous Full-Text Indexing

Figure: Asynchronous Full-Text Indexing

For compound documents, it makes sense to perform full-text analysis and indexing asynchronously, decoupled from the import. For this purpose, the API-Gateway can generate a message during the import that contains the metadata of a certain number of individual documents of the compound document (1.). The controller-service consumes these messages (2.), creates a source link and a target link (3.), and writes both into a new message for the textextractor-service (4.). With a GET request to the sourceLink, the textextractor-service retrieves the content of the object that the message refers to (6. + 7.). After the text has been extracted, it is saved or updated in Elasticsearch via a POST request to the targetLink (9.-12.).

Reading Messages

The controller-service reads the messages (2.) from the queue configured by the parameter textextraction.in-queue. The default value is lc.textextraction (lc stands for lifecycle).

application-lc.yml

textextraction.in-queue: lc.textextraction

These messages contain the metadata of a DmsApiObject for which the asynchronous full-text indexing is to be executed.

Creating Links

The controller-service then generates the links for the textextractor-service that correspond to the DmsApiObject contained in the message (3.). The aim is for the textextractor-service to remain unaware of the rest of the system: the links carry all the information required to retrieve the content or save the extracted text as query parameters. By default, the controller-service generates links that the textextractor-service must resolve via the discovery-service (how this works can be read here). This allows the controller-service to be scaled, provided that the textextractor-service is integrated into the services landscape.

Creating Messages

The Controller-Service generates job messages for the textextractor-service (4.). These messages contain two links and additional properties in a map.

job message (the objectId property is optional)

{
    "sourceLink": "http://controller/contents/file?objectId={objectId}&versionNr={versionNr}&tenant={tenant}",
    "targetLink": "http://controller/contents/renditions/text?objectId={objectId}&contentstreamId={contentstreamId}&contentstreamRange={contentstreamRange}",
    "properties": {
        "objectId": "{objectId}",
        "useDiscovery": "true"
    }
}

The sourceLink retrieves the content associated with the DMS object, and the targetLink stores the extracted text in Elasticsearch. Additional properties can be specified in the properties map. The useDiscovery property is required: it specifies whether the textextractor-service must resolve the links via the discovery-service before calling them. Its default value is true. The parameter textextraction.job-queue configures the name of the queue to which the controller-service writes the job messages. Its default value is lc.textextraction.job.

application-lc.yml

textextraction.job-queue: lc.textextraction.job
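
The structure of a job message can be illustrated with a minimal Python sketch. It is not the controller-service's implementation: the function name build_job_message, the BASE_URL constant, and the sample values (including the format of the content stream range) are assumptions made for illustration. Only the link paths, query parameter names, and property names are taken from the job message shown above.

```python
import json
from urllib.parse import urlencode

# Assumption: "controller" is the logical service name that the
# textextractor-service resolves via the discovery-service.
BASE_URL = "http://controller"


def build_job_message(object_id, version_nr, tenant,
                      contentstream_id, contentstream_range,
                      use_discovery=True):
    """Build a job message dict with sourceLink, targetLink and properties."""
    source_query = urlencode({
        "objectId": object_id,
        "versionNr": version_nr,
        "tenant": tenant,
    })
    target_query = urlencode({
        "objectId": object_id,
        "contentstreamId": contentstream_id,
        "contentstreamRange": contentstream_range,
    })
    return {
        "sourceLink": f"{BASE_URL}/contents/file?{source_query}",
        "targetLink": f"{BASE_URL}/contents/renditions/text?{target_query}",
        "properties": {
            "objectId": object_id,  # optional
            # The documented example carries the flag as a string.
            "useDiscovery": "true" if use_discovery else "false",
        },
    }


# Hypothetical sample values, chosen only to make the sketch runnable.
message = build_job_message("4711", 1, "tenant1", "cs-1", "0-1023")
print(json.dumps(message, indent=4))
```

Because both links carry everything as query parameters, the consumer of the message needs no knowledge of the DMS object beyond the two URLs.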

Calling the `sourceLink`

The content of a DMS object can be retrieved using a GET request to the sourceLink (6.). The controller-service receives the object ID, version number, and tenant via the query parameters of the sourceLink, uses them to retrieve the content of the object from the API-Gateway, and returns it to the caller (7.).
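
As an illustration of how the three values travel in the link, the following Python sketch extracts them from a sourceLink. The function name parse_source_link and the sample link values are hypothetical, and the error handling for missing parameters is an assumption; the source does not describe how the controller-service reacts in that case.

```python
from urllib.parse import parse_qs, urlsplit


def parse_source_link(link):
    """Extract objectId, versionNr and tenant from a sourceLink."""
    query = parse_qs(urlsplit(link).query)
    params = {}
    for name in ("objectId", "versionNr", "tenant"):
        values = query.get(name)
        if not values:
            # Assumption: a missing parameter is treated as an error.
            raise ValueError(f"missing query parameter: {name}")
        params[name] = values[0]
    return params


# Hypothetical sample link following the documented pattern.
link = "http://controller/contents/file?objectId=4711&versionNr=1&tenant=tenant1"
params = parse_source_link(link)
```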

Calling the `targetLink`

The extracted text of a DMS object can be saved using a POST request to the targetLink (9.). The text must be contained in the body of the request. From the query parameters of the targetLink, the controller-service receives the object ID, content stream ID, and content stream range of the corresponding DMS object. To ensure that the content of the object has not changed between the creation of the job message and the time of the request, the controller-service retrieves the current metadata for the object ID from Elasticsearch (10.) and compares the content stream ID and content stream range from the targetLink with those from the current metadata (11.). If at least one of the two properties does not match, the controller-service terminates the update process and returns HTTP status 409 CONFLICT.

If the object ID contained in the targetLink cannot be found in Elasticsearch, the controller-service terminates the update process and returns HTTP status 404 NOT FOUND.

If the comparison of the content stream ID and the content stream range shows that the content has not changed in the meantime, the text sent in the body is written to the contentfile field of the object with the corresponding object ID in Elasticsearch (12.).
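
The decision logic for a POST to the targetLink can be sketched in Python as follows. This is an illustration under stated assumptions, not the controller-service's code: a plain dict stands in for Elasticsearch, the function name store_extracted_text is hypothetical, and the success status 200 is assumed because the source only names the 404 and 409 cases.

```python
from http import HTTPStatus


def store_extracted_text(index, object_id, contentstream_id,
                         contentstream_range, text):
    """Return the HTTP status for a POST to the targetLink.

    `index` is a plain dict standing in for Elasticsearch (an assumption
    for this sketch): object ID -> current metadata of the object.
    """
    current = index.get(object_id)
    if current is None:
        # Object ID unknown in the index.
        return HTTPStatus.NOT_FOUND
    if (current["contentstreamId"] != contentstream_id
            or current["contentstreamRange"] != contentstream_range):
        # Content changed after the job message was created.
        return HTTPStatus.CONFLICT
    current["contentfile"] = text
    # Assumption: the source does not state the success status.
    return HTTPStatus.OK


# Hypothetical sample data.
index = {"4711": {"contentstreamId": "cs-1", "contentstreamRange": "0-1023"}}
ok = store_extracted_text(index, "4711", "cs-1", "0-1023", "extracted text")
stale = store_extracted_text(index, "4711", "cs-2", "0-1023", "old text")
missing = store_extracted_text(index, "9999", "cs-1", "0-1023", "text")
```

Checking for existence first and for staleness second means that an outdated job never overwrites the text of a newer content version.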

Processing Error/Success Messages

The textextractor-service writes a success or error message for each executed full-text extraction. These are read by the controller-service (14.) which logs the contained reports.

The queue names have the following format:

error queue: "<textextraction.job-queue>.error"
success queue: "<textextraction.job-queue>.success"
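
The naming scheme can be expressed as a small Python sketch. The function name result_queues is hypothetical; the suffixes and the default job queue name are taken from the text above.

```python
def result_queues(job_queue="lc.textextraction.job"):
    """Derive the error and success queue names from the job queue name."""
    return {
        "error": f"{job_queue}.error",
        "success": f"{job_queue}.success",
    }


queues = result_queues()
```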

Configuration

Links can be created in two modes: either the textextractor-service must resolve each link via the discovery-service to a controller-service instance before calling it, or it calls the link directly. The mode is configured with the parameter controller.links.useDiscovery, whose default value is true. To change the default behavior, create a configuration file called controller-prod.yml that contains the following parameters:

controller-prod.yml

# controller configuration
controller:
  links:
    # specifies <ip>:<port> to use for link creation, if useDiscovery=false
    host: <ip>:<port>
    # indicates whether the links should be resolved on discovery or not
    useDiscovery: false

If the controller.links.useDiscovery parameter is set to false, the controller.links.host parameter must also be set, because the controller-service uses it to create the links. When the textextractor-service then calls one of the links, it addresses the controller-service instance configured by controller.links.host directly instead of being assigned one by the discovery-service. In this case, the controller-service instance configured by controller.links.host must remain reachable for the entire duration of the asynchronous full-text indexing so that the textextractor-service can process its jobs successfully.
For this configuration to take effect, the controller-service must be started with the profile prod if this is not already the case (if necessary, adjust the entry in servicewatcher-sw.yml, place the configuration file in the /config directory, and refresh or restart ARGUS and CONTROLLER).
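
How the two modes affect the generated links can be sketched in Python. This is a simplified illustration, not the service's implementation: the function name link_base, the http scheme, and the use of the service name controller as the host of discovery links are assumptions derived from the job message example.

```python
def link_base(use_discovery=True, host=None, service_name="controller"):
    """Return the base URL for generated links in either mode."""
    if use_discovery:
        # Logical name, to be resolved via the discovery-service.
        return f"http://{service_name}"
    if not host:
        raise ValueError(
            "controller.links.host must be set when useDiscovery is false")
    # Fixed <ip>:<port> of one specific controller-service instance.
    return f"http://{host}"


discovery_base = link_base()
# Hypothetical address; any reachable <ip>:<port> would do.
direct_base = link_base(use_discovery=False, host="10.0.0.5:7332")
```

With a fixed host, every job is bound to one instance, which is why that instance must stay reachable until all jobs have been processed.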

Profiles

Profile	Meaning
cloud	central configuration for all cloud services
es	configuration parameters for connecting to the Elasticsearch cluster
oauth2	configuration parameters for the tenants of the system's configured authentication provider
lc	lifecycle configuration containing the queue names for the asynchronous text extraction
mq	messaging configuration for the connection to the messaging system
prod	production configuration; properties from `application-prod.yml` and `controller-prod.yml` are applied