Compound Documents

This tutorial is about compound documents and how they are supported by the Core API.

Introduction

A compound document is essentially the concatenation of the binary content of several individual documents. The individual documents can be recovered by means of the byte intervals (ranges) in which the content of each document is located. In this tutorial, we will create a short Java application that builds a simple example compound document and demonstrates how to import it.

Requirements

To work through this tutorial, the following is required:

  • A set-up yuuvis® API system (see minikube setup, for example)
  • A configured user with appropriate permissions ("clouduser:secret") on tenant "default"
  • A simple Maven project

Maven Configuration

Our Java client will submit its requests to the Core API using OkHttp 3.12 by Square, Inc. To build up the metadata of a compound document and evaluate the responses of the Core API, it also requires a JSON library, with org.json selected in this tutorial. To work with these libraries, the following block must be added to the Maven dependencies in the pom.xml of the project:

pom.xml
<dependencies>
    <dependency>
        <groupId>com.squareup.okhttp3</groupId>
        <artifactId>okhttp</artifactId>
        <version>3.12.0</version>
    </dependency>
    <dependency>
        <groupId>org.json</groupId>
        <artifactId>json</artifactId>
        <version>20180813</version>
    </dependency>
</dependencies>

Client Configuration

The basis for accessing the Core API is an OkHttp3 client that can issue HTTP requests against reachable URLs. Additionally, we need to define some variables that our OkHttp3 client will use to reach and authenticate at the Core API.

OkHttp3 Client and Variables
String baseUrl = "http://127.0.0.1"; //baseUrl of gateway: "http://<host>:<port>"
String username = "clouduser";
String userpassword = "secret";
String tenant = "default";
String auth = java.util.Base64.getEncoder().encodeToString((username + ":" + userpassword).getBytes(java.nio.charset.StandardCharsets.UTF_8)); //explicit charset instead of the platform default
  
OkHttpClient.Builder builder = new OkHttpClient.Builder();
OkHttpClient client = builder.build();

For more information on setting up the OkHttp3 client with cookie handling, see this tutorial on logging into the Core API.

Structure of a Compound Document

Compound documents, like all documents in the yuuvis® API system, consist of content and the associated metadata. The content of a compound document consists of the binary content of the documents contained in the compound document, which we will call subdocuments for the sake of simplicity. To ease the retrieval of the individual subdocuments, an additional set of metadata for each subdocument is imported, each with reference to the specific intervals (ranges) of byte indices denoting the location of the content of the subdocument in its respective compound document. In order to learn how to construct a compound document, we must therefore take a look at both the structure of the binary content and the structure of the metadata.
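
To illustrate how a range leads back to a subdocument, here is a minimal sketch that cuts one subdocument out of compound content held in memory. The class RangeExtractor and its method extract are hypothetical helpers, not part of the Core API. The sketch assumes that a range is written as "first-last" and that both ends are inclusive, as in the metadata examples of this tutorial.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RangeExtractor {

    // Returns the bytes of one subdocument, given an inclusive range such as "5-9".
    // Hypothetical helper for illustration only.
    static byte[] extract(byte[] compound, String range) {
        String[] bounds = range.split("-");
        int first = Integer.parseInt(bounds[0]);
        int last = Integer.parseInt(bounds[1]);
        // copyOfRange treats its upper bound as exclusive, hence last + 1
        return Arrays.copyOfRange(compound, first, last + 1);
    }

    public static void main(String[] args) {
        // two "subdocuments" ("Hello" and "World") concatenated into one content
        byte[] compound = "HelloWorld".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(extract(compound, "0-4"), StandardCharsets.US_ASCII)); // prints Hello
        System.out.println(new String(extract(compound, "5-9"), StandardCharsets.US_ASCII)); // prints World
    }
}
```

The only subtle point is the off-by-one between the inclusive range notation and the exclusive upper bound of Arrays.copyOfRange.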

Creating the Content

To compose the content of the compound document, we first need the binary content of each subdocument. FileUtils.readFileToByteArray(File file) from Apache Commons IO (which must be available on the classpath in addition to the libraries above) reads the contents of a file into a byte array that can then be written into our compound document file, or rather its FileOutputStream representation. During that process, it is important to record, for each subdocument, the interval (range) denoting the position of its bytes within the compound document file. To do this, set an auxiliary variable offset to 0 at the beginning of the compound document creation process. For each subdocument added to the compound document, the range runs from the current offset to the offset plus the length of the subdocument's byte array minus one, because both ends of a range are inclusive. Afterwards, increase offset by that length. The range will then be written into the contentStreams object of the subdocument's metadata.

In the following Java code, the content of a compound document is assembled from three simple text files.

Building Content of a Compound Document
//read ByteArrays from text files
byte[] document1BA = FileUtils.readFileToByteArray(new File("./src/main/resources/test.txt"));
byte[] document2BA = FileUtils.readFileToByteArray(new File("./src/main/resources/test1.txt"));
byte[] document3BA = FileUtils.readFileToByteArray(new File("./src/main/resources/test2.txt"));
 
//generate file for compound document content
File compoundFile = File.createTempFile("compound", ".bin");
OutputStream bos = new BufferedOutputStream(new FileOutputStream(compoundFile));
 
//"partial document" is used below as a synonym for subdocument
String[] ranges = new String[3];        //byte ranges of the partial documents in the compound document
String[] partialNames = new String[3];  //names of the partial documents
 
//write partial document bytestreams into binary compound file
long offset = 0;
long document1BAlength = document1BA.length;
String range1 = offset + "-" + (offset + document1BAlength - 1);
bos.write(document1BA);
ranges[0] = range1;
partialNames[0] = "test.txt";
 
offset += document1BAlength;
long document2BAlength = document2BA.length;
String range2 = offset + "-" + (offset + document2BAlength - 1);
bos.write(document2BA);
ranges[1] = range2;
partialNames[1] = "test1.txt";
 
offset += document2BAlength;
long document3BAlength = document3BA.length;
String range3 = offset + "-" + (offset + document3BAlength - 1);
bos.write(document3BA);
ranges[2] = range3;
partialNames[2] = "test2.txt";
 
//close() also flushes the buffer and reports write errors that closeQuietly would hide
bos.close();
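
The three hand-written steps above generalize to any number of subdocuments. The following is a sketch only: the class CompoundBuilder and its method append are hypothetical names, and an in-memory ByteArrayOutputStream stands in for the file stream to keep the example self-contained. Rejecting empty subdocuments is an assumption made here, because empty content has no valid inclusive range.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class CompoundBuilder {

    // Appends each subdocument to the output and returns one inclusive
    // byte range ("first-last") per subdocument, in the same order.
    static List<String> append(List<byte[]> subdocuments, ByteArrayOutputStream out) {
        List<String> ranges = new ArrayList<>();
        long offset = 0;
        for (byte[] content : subdocuments) {
            if (content.length == 0) {
                // assumption of this sketch: empty content cannot be addressed by a range
                throw new IllegalArgumentException("empty subdocument has no valid range");
            }
            out.write(content, 0, content.length);
            ranges.add(offset + "-" + (offset + content.length - 1));
            offset += content.length;
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<byte[]> docs = new ArrayList<>();
        docs.add("abc".getBytes(StandardCharsets.US_ASCII));
        docs.add("defgh".getBytes(StandardCharsets.US_ASCII));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        System.out.println(append(docs, out)); // prints [0-2, 3-7]
        System.out.println(out.size());        // prints 8
    }
}
```

Computing the range before advancing offset keeps the bookkeeping in one place, so the end of one range and the start of the next cannot drift apart.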

Creating the Metadata

Once the content of the compound document is complete, the corresponding metadata for the compound document and its subdocuments has to be generated for the import. The metadata of a compound document itself is no different from the metadata of any regular document: it consists of properties and a contentStreams object pointing to the binary file. Subdocuments that are imported together with the compound document copy the contentStreams object of the compound document, set the "mimeType" befitting the subdocument and add a "range" attribute containing the byte range recorded during content creation.

Compound Documents and Subdocuments
{
    "objects": [
        {
            "contentStreams": [
                {
                    "fileName": "./compound.bin",
                    "mimeType": "application/octet-stream",
                    "cid": "cid_63apple"
                }
            ],
            "properties": {
                "objectTypeId": {
                    "value": "document"
                },
                "name": {
                    "value": "testCompound"
                }
            }
        },
        {
            "contentStreams": [
                {
                    "fileName": "./compound.bin",
                    "range": "0-1244",
                    "mimeType": "text/plain",
                    "cid": "cid_63apple"
                }
            ],
            "properties": {
                "objectTypeId": {
                    "value": "document"
                },
                "name": {
                    "value": "test.txt"
                }
            }
        },
        {
            "contentStreams": [
                {
                    "fileName": "./compound.bin",
                    "range": "1245-2516",
                    "mimeType": "text/plain",
                    "cid": "cid_63apple"
                }
            ],
            "properties": {
                "objectTypeId": {
                    "value": "document"
                },
                "name": {
                    "value": "test1.txt"
                }
            }
        }
    ]
}
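
Metadata of this shape can be generated from the names and ranges collected during content creation. The sketch below uses plain string building so that it runs without any library; in the project of this tutorial, org.json would be the more natural choice. SubdocumentMetadata, quote, entry and entries are hypothetical names, and the fixed "fileName" and "mimeType" values simply mirror the example above.

```java
class SubdocumentMetadata {

    // Wraps a value in double quotes and escapes backslashes and double quotes.
    // Control characters are not handled in this simplified sketch.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds the metadata object of one subdocument that shares the
    // content stream (cid) of its compound document.
    static String entry(String name, String range, String cid) {
        return "{\"contentStreams\":[{"
                + "\"fileName\":\"./compound.bin\","
                + "\"range\":" + quote(range) + ","
                + "\"mimeType\":\"text/plain\","
                + "\"cid\":" + quote(cid) + "}],"
                + "\"properties\":{"
                + "\"objectTypeId\":{\"value\":\"document\"},"
                + "\"name\":{\"value\":" + quote(name) + "}}}";
    }

    // Joins the entries of all subdocuments with commas, ready to be placed
    // inside the "objects" array after the entry of the compound document.
    static String entries(String[] names, String[] ranges, String cid) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < names.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(entry(names[i], ranges[i], cid));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] partialNames = {"test.txt", "test1.txt"};
        String[] ranges = {"0-1244", "1245-2516"};
        System.out.println(entries(partialNames, ranges, "cid_63apple"));
    }
}
```

Because every subdocument entry carries the same "cid", all of them point to the single binary part of the multipart request.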

If subdocuments are to be imported later, the contentStreams object in the subdocument's metadata contains the contentStreamId and repositoryId taken from the DMS API response to the import of the compound document, the "mimeType" attribute befitting that subdocument, and the same "range" attribute as in the concurrent import.

Metadata for Subsequent Import of Subdocuments
{
    "objects": [
        {
            "contentStreams": [
                {
                    "contentStreamId": "8FF6DBAE-1969-11E9-83A4-DFA1C5E44BD0",
                    "repositoryId": "repo242",
                    "range": "2517-3811",
                    "mimeType": "text/plain"
                }
            ],
            "properties": {
                "objectTypeId": {
                    "value": "document"
                },
                "name": {
                    "value": "test2.txt"
                }
            }
        }
    ]
}

Importing Compound Documents

Importing a compound document with the DMS API works in the same way as a regular import: a multipart body containing the metadata and the content is sent via POST to the endpoint /api/dms/objects.

Importing a Compound Document
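//media types of the two multipart parts
MediaType JSON = MediaType.parse("application/json; charset=utf-8");
MediaType OCTETSTREAM = MediaType.parse("application/octet-stream");
//the name of the content part ("cid_63apple") must match the "cid" in the metadata
RequestBody compoundImportRequestBody = new MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addFormDataPart("data", "metaData.json", RequestBody.create(JSON, compoundImportJsonString))
        .addFormDataPart("cid_63apple", "compound.bin", RequestBody.create(OCTETSTREAM, compoundFile))
        .build();
 
Request compoundImportRequest = new Request.Builder()
        .header("Authorization", "Basic " + auth)
        .header("X-ID-TENANT-NAME", tenant)
        .url(baseUrl + "/api/dms/objects")
        .post(compoundImportRequestBody)
        .build();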
 
Response compoundImportResponse = client.newCall(compoundImportRequest).execute();

Importing Large Compound Documents

Importing very large compound documents (5000+ subdocuments) can overload the Core API and cause some of the import operations to fail. It is therefore recommended to stage the import of such compound documents in several steps. First, import the compound document itself as a single document and extract the contentStreamId and repositoryId from the response. Then, import the metadata of all subdocuments in a series of batch imports that limit the number of subdocuments per import to fewer than 5000 to avoid overloading.
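
The batching step can be sketched as follows. BatchPlanner and partition are hypothetical names, and the limit of 4999 subdocuments per batch is only one possible choice that satisfies "fewer than 5000"; a smaller value may suit a given system better. Each resulting batch would be turned into one metadata JSON and sent in its own import request.

```java
import java.util.ArrayList;
import java.util.List;

class BatchPlanner {

    // Splits the items into consecutive batches of at most maxPerBatch elements,
    // preserving the original order.
    static <T> List<List<T>> partition(List<T> items, int maxPerBatch) {
        if (maxPerBatch < 1) {
            throw new IllegalArgumentException("maxPerBatch must be at least 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += maxPerBatch) {
            int end = Math.min(start + maxPerBatch, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // 12000 made-up ranges of 10 bytes each, standing in for real subdocuments
        List<String> ranges = new ArrayList<>();
        for (int i = 0; i < 12000; i++) {
            ranges.add((i * 10L) + "-" + (i * 10L + 9));
        }
        List<List<String>> batches = partition(ranges, 4999);
        System.out.println(batches.size());        // prints 3
        System.out.println(batches.get(2).size()); // prints 2002
    }
}
```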

Please note that, when subdocuments are imported subsequently, their metadata must be sent in a multipart body, even if this multipart body consists of only one part.

Subsequent Import of a Subdocument
//JSON is assumed to be a MediaType for "application/json", e.g. MediaType.parse("application/json; charset=utf-8")
RequestBody postPartialDocumentImportRequestBody = new MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addFormDataPart("data", "metaData.json", RequestBody.create(JSON, postPartialDocumentImportJsonString))
        .build();
 
Request postPartialDocumentImportRequest = new Request.Builder()
        .header("Authorization", "Basic " + auth)
        .header("X-ID-TENANT-NAME", tenant)
        .url(baseUrl + "/api/dms/objects")
        .post(postPartialDocumentImportRequestBody)
        .build();
 
Response postPartialDocumentImportResponse = client.newCall(postPartialDocumentImportRequest).execute();

Deleting Compound Documents

Compound documents consist of one content document and several subdocuments that reference the content document through a content stream and one or more content ranges. When these documents are deleted, content that other subdocuments may still reference is generally not removed. The subdocuments can be deleted as required, but, unlike with normal documents, the content remains accessible.

Summary

In this tutorial, we assembled the content and metadata of a compound document and used an OkHttp3 client to import it, together with its subdocuments, through the Core API.

A complete code example can be found in this git repository.
