Resources & Remarks

Version 2019 Winter

Modification History

Name   Date         Product Version  Action
Antje  08 FEB 2021  2.4              New page properties macro.



Excerpt

Concatenate multiple binary content files as byte arrays in one compound document. Create sub-documents that refer to specified ranges within the total byte array.




Introduction

Compound documents in yuuvis® Momentum are document objects whose binary content file is a byte array. This byte array can hold the concatenated binary content of multiple individual files. Sub-documents can be defined so that each refers exactly to the range within the byte array where one of the original files is located. A retrieval request for the content file of such a sub-document then returns exactly the binary content of that original file. Any other range, or any combination of ranges, can also be referenced as the content of a sub-document. In that case, a retrieval request for the content file returns the concatenation of the specified ranges.

In this tutorial, we provide an example scenario for the handling of compound documents and sub-documents.

Requirements

To work through this tutorial, the following is required:

  • Set-up yuuvis® Momentum system (see Installation Guide)

  • Configured user with appropriate permissions (in the example: clouduser:secret on tenant default)
  • Simple Maven project

Maven Configuration

Our Java client will submit its requests to the Core API using OkHttp 3.12 by Square, Inc. To build the metadata of a compound document and evaluate the responses of the Core API, it also requires a JSON library, with org.json selected in this tutorial. To work with these libraries, the following block must be added to the Maven dependencies in the pom.xml of the project:

Code Block
titlepom.xml
linenumberstrue
<dependencies>
    <dependency>
        <groupId>com.squareup.okhttp3</groupId>
        <artifactId>okhttp</artifactId>
        <version>3.12.0</version>
    </dependency>
    <dependency>
        <groupId>org.json</groupId>
        <artifactId>json</artifactId>
        <version>20180813</version>
    </dependency>
</dependencies>

Client Configuration

The basis for accessing the Core API is an OkHttp3 client that can issue HTTP requests against reachable URLs. Additionally, we need to define some variables that our OkHttp3 client will use to reach and authenticate at the Core API.

Code Block
titleOkHttp3 Client and Variables
linenumberstrue
String baseUrl = "http://127.0.0.1"; //baseUrl of gateway: "http://<host>:<port>"
String username = "clouduser";
String userpassword = "secret";
String tenant = "default";
String auth = java.util.Base64.getEncoder().encodeToString((username + ":" + userpassword).getBytes());
  
CookieJar cookieJar = new JavaNetCookieJar(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
OkHttpClient client = new OkHttpClient.Builder().cookieJar(cookieJar).build();

For more information on setting up the OkHttp3 client with cookie handling, see this tutorial on logging into the Core API.
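
The auth value defined above is only the Base64-encoded credential pair. The header sent to the Core API additionally needs the "Basic " prefix. The following is a minimal sketch of a helper that builds the complete header value with an explicit charset. The class name BasicAuthHeader and the method name basicAuth are hypothetical and not part of the tutorial's code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of the tutorial's repository.
final class BasicAuthHeader {

    /** Builds the complete value of an HTTP Basic "Authorization" header. */
    static String basicAuth(String username, String password) {
        String credentials = username + ":" + password;
        // An explicit charset keeps the result independent of the platform default.
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Example credentials from the requirements section.
        System.out.println(basicAuth("clouduser", "secret"));
    }
}
```

The returned string can be passed directly as the value of the Authorization header, instead of concatenating "Basic " + auth at every call site.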

Binary Content of the Compound Document

The binary content of a compound document must be a byte array. All individual files you want to concatenate have to be converted into this format. In the example code block below, FileUtils.readFileToByteArray(File file) is used to read the contents of each file into a byte array. If you want to reference the content of the individual original files later on, you need to know their lengths within the byte array to determine their ranges.

Code Block
languagejava
titleconvert individual example files
linenumberstrue
byte[] document1BA = FileUtils.readFileToByteArray(new File("./src/main/resources/test.txt"));
byte[] document2BA = FileUtils.readFileToByteArray(new File("./src/main/resources/test1.txt"));
byte[] document3BA = FileUtils.readFileToByteArray(new File("./src/main/resources/test2.txt"));

//lengths of the individual files, needed to compute their byte ranges
long document1BAlength = document1BA.length;
long document2BAlength = document2BA.length;
long document3BAlength = document3BA.length;

The converted individual files can easily be concatenated to an output stream as shown in the next example code block. Make sure to advance the offset by the length of each file after writing it. Otherwise, the computed ranges overlap and no longer match the individual contents. You should store the byte ranges that correspond to the individual original files. In the example, they are stored in the ranges array. Additionally, the original file names of the individual files are stored in the partialNames array. Those values can be used to make the referenced ranges easier to recognize in the sub-documents' metadata.

Code Block
titleBuilding Content of a Compound Document
linenumberstrue
//generate file for compound document content
File compoundFile = File.createTempFile("compound", ".bin");
OutputStream bos = new BufferedOutputStream(new FileOutputStream(compoundFile));
 
//partial document = sub-document
String[] ranges = new String[3];                     //Byte ranges of the partial documents in the compound document
String[] partialNames = new String[3];        //Names of the partial documents
 
//write partial document bytestreams into binary compound file
long offset = 0;
String range1 = offset + "-" + (offset + document1BAlength - 1);
bos.write(document1BA);
ranges[0] = range1;
partialNames[0] = "test.txt";
 
offset += document1BAlength;
String range2 = offset + "-" + (offset + document2BAlength - 1);
bos.write(document2BA);
ranges[1] = range2;
partialNames[1] = "test1.txt";
 
offset += document2BAlength;
String range3 = (offset) + "-" + (offset + document3BAlength - 1);
bos.write(document3BA);
ranges[2] = range3;
partialNames[2] = "test2.txt";
 
IOUtils.closeQuietly(bos);
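
The offset bookkeeping shown above repeats the same three steps for every file. As a minimal sketch, assuming the content fits in memory, the steps can be wrapped in a small helper that works for any number of parts. The class CompoundContentBuilder and its methods are hypothetical names introduced for illustration. The sketch uses an in-memory ByteArrayOutputStream instead of the temporary file from the tutorial.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the tutorial's repository.
final class CompoundContentBuilder {

    private final ByteArrayOutputStream content = new ByteArrayOutputStream();
    private final List<String> ranges = new ArrayList<>();
    private long offset = 0;

    /** Appends one part and returns its inclusive byte range, e.g. "0-4". */
    String append(byte[] part) {
        if (part.length == 0) {
            throw new IllegalArgumentException("an empty part has no byte range");
        }
        // Range of this part: first byte index to last byte index, both inclusive.
        String range = offset + "-" + (offset + part.length - 1);
        content.write(part, 0, part.length);
        // Advance the offset so the next part starts right after this one.
        offset += part.length;
        ranges.add(range);
        return range;
    }

    /** Returns the concatenated content of all appended parts. */
    byte[] toByteArray() {
        return content.toByteArray();
    }

    /** Returns the ranges in the order in which the parts were appended. */
    List<String> ranges() {
        return ranges;
    }

    public static void main(String[] args) {
        CompoundContentBuilder builder = new CompoundContentBuilder();
        builder.append(new byte[5]);
        builder.append(new byte[6]);
        System.out.println(builder.ranges());
    }
}
```

For very large files, writing to a file-backed stream as in the tutorial's example avoids holding the whole compound content in memory.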

Metadata for the Object Creation

Creating a Compound Document

The import endpoint POST /api/dms/objects expects a multipart request body. Multiple objects can be created in yuuvis® Momentum with one request. Thus, it is possible to create the compound document with the concatenated byte array as content file and, in the same request, some sub-documents.

As shown in the example objects list below, the compound document is defined first. All following objects are sub-documents. They refer to the same cid as the compound document, but additionally specify a range. Here, the ranges correspond to the individual original files, and their original file names are noted in the metadata of the sub-documents.

Code Block
titleCompound Document and Sub-documents
linenumberstrue
{
    "objects": [
        {
            "contentStreams": [
                {
                    "fileName": "./compound.bin",
                    "mimeType": "application/octet-stream",
                    "cid": "cid_63apple"
                }
            ],
            "properties": {
                "objectTypeId": {
                    "value": "document"
                },
                "name": {
                    "value": "testCompound"
                }
            }
        },
        {
            "contentStreams": [
                {
                    "fileName": "./compound.bin",
                    "range": "0-1244",
                    "mimeType": "text/plain",
                    "cid": "cid_63apple"
                }
            ],
            "properties": {
                "objectTypeId": {
                    "value": "document"
                },
                "name": {
                    "value": "test.txt"
                }
            }
        },
        {
            "contentStreams": [
                {
                    "fileName": "./compound.bin",
                    "range": "1245-1338",
                    "mimeType": "text/plain",
                    "cid": "cid_63apple"
                }
            ],
            "properties": {
                "objectTypeId": {
                    "value": "document"
                },
                "name": {
                    "value": "test1.txt"
                }
            }
        },
        {
            "contentStreams": [
                {
                    "fileName": "compound.bin",
                    "range": "1339-2610",
                    "mimeType": "text/plain",
                    "cid": "cid_63apple"
                }
            ],
            "properties": {
                "objectTypeId": {
                    "value": "document"
                },
                "name": {
                    "value": "test2.txt"
                }
            }
        }
    ]
}
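
The objects list can be generated from the ranges and partialNames arrays collected while building the content. The following is a rough sketch that assembles the JSON with plain string concatenation so that it runs without dependencies. In the tutorial's project, the org.json classes would be the more natural choice. The class CompoundMetadata and its methods are hypothetical, and the sketch assumes that none of the values contain quotes or backslashes, because it does no JSON escaping.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of the tutorial's repository.
final class CompoundMetadata {

    /** Builds one entry of the "objects" list; range is null for the compound document itself. */
    static String objectJson(String cid, String fileName, String mimeType,
                             String range, String name) {
        String rangePart = (range == null) ? "" : "\"range\": \"" + range + "\", ";
        return "{\"contentStreams\": [{"
                + "\"fileName\": \"" + fileName + "\", "
                + rangePart
                + "\"mimeType\": \"" + mimeType + "\", "
                + "\"cid\": \"" + cid + "\"}], "
                + "\"properties\": {"
                + "\"objectTypeId\": {\"value\": \"document\"}, "
                + "\"name\": {\"value\": \"" + name + "\"}}}";
    }

    /** Builds the import metadata: the compound document first, then one sub-document per range. */
    static String importJson(String cid, String fileName, String compoundName,
                             String[] ranges, String[] partialNames) {
        StringJoiner objects = new StringJoiner(", ", "{\"objects\": [", "]}");
        objects.add(objectJson(cid, fileName, "application/octet-stream", null, compoundName));
        for (int i = 0; i < ranges.length; i++) {
            // All sub-documents share the cid of the compound document.
            objects.add(objectJson(cid, fileName, "text/plain", ranges[i], partialNames[i]));
        }
        return objects.toString();
    }

    public static void main(String[] args) {
        String[] ranges = {"0-1244", "1245-1338", "1339-2610"};
        String[] names = {"test.txt", "test1.txt", "test2.txt"};
        System.out.println(importJson("cid_63apple", "compound.bin", "testCompound", ranges, names));
    }
}
```

The resulting string corresponds to what the tutorial later passes as compoundImportJsonString.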

Creating Sub-Documents

It is also possible to create sub-documents of an already existing compound document. Do not assign a binary content file a second time. Instead, reference the contentStreamId, repositoryId and archivePath where the binary content file of the compound document is stored. The archivePath is especially required if the path cannot be reconstructed from metadata information (e.g., if a pathTemplate containing dynamic path elements like DATE is configured in the archive profile). As you can see in the example, you can also specify a concatenation of multiple ranges.

Code Block
titleMetadata for Subsequent Import of Subdocuments
linenumberstrue
{
    "objects": [
        {
            "contentStreams": [
                {
                    "contentStreamId": "8FF6DBAE-1969-11E9-83A4-DFA1C5E44BD0",
                    "repositoryId": "repo242",
                    "archivePath": "default/2023/01/06/",
                    "range": "1159-1365,0-45",
                    "mimeType": "text/plain"
                }
            ],
            "properties": {
                "objectTypeId": {
                    "value": "document"
                },
                "name": {
                    "value": "sub-concatenation.txt"
                }
            }
        }
    ]
}
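
To make the meaning of a multi-range value such as "1159-1365,0-45" concrete, the following sketch reproduces the described behavior locally: it cuts the listed inclusive ranges out of a byte array and concatenates them in the given order. This is an illustration of the semantics described in this tutorial, not the server's implementation. The class RangeResolver is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of the tutorial's repository or the Core API.
final class RangeResolver {

    /** Returns the concatenation of the inclusive byte ranges, in the order given. */
    static byte[] resolve(byte[] compound, String rangeSpec) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String range : rangeSpec.split(",")) {
            String[] bounds = range.trim().split("-");
            int start = Integer.parseInt(bounds[0]);
            int end = Integer.parseInt(bounds[1]);
            if (start < 0 || end < start || end >= compound.length) {
                throw new IllegalArgumentException("invalid range: " + range);
            }
            // Both bounds are inclusive, so the length is end - start + 1.
            out.write(compound, start, end - start + 1);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] compound = "alphabetagamma".getBytes(StandardCharsets.US_ASCII);
        byte[] section = resolve(compound, "5-8,0-4");
        // Prints "betaalpha": bytes 5 to 8 followed by bytes 0 to 4.
        System.out.println(new String(section, StandardCharsets.US_ASCII));
    }
}
```

A local check like this can help validate range strings before they are written into sub-document metadata.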

Importing Compound Documents

Importing a compound document with the DMS API works in the same way as regular imports via POST of a multipart body with metadata and content to the endpoint /api/dms/objects. In the example, the compoundImportJsonString contains the metadata for the compound document and sub-documents as shown before.

Code Block
languagejava
titleImporting a Compound Document
linenumberstrue
//JSON and OCTETSTREAM are okhttp3.MediaType constants assumed to be defined elsewhere,
//e.g. MediaType.parse("application/json; charset=utf-8") and MediaType.parse("application/octet-stream")
RequestBody compoundImportRequestBody = new MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addFormDataPart("data", "metaData.json", RequestBody.create(JSON, compoundImportJsonString))
        .addFormDataPart("cid_63apple", "compound.bin", RequestBody.create(OCTETSTREAM, compoundFile))
        .build();

Request compoundImportRequest = new Request.Builder()
        .header("Authorization", "Basic " + auth)
        .header("X-ID-TENANT-NAME", "default")
        .url(baseUrl + "/api/dms/objects")
        .post(compoundImportRequestBody)
        .build();

Response compoundImportResponse = client.newCall(compoundImportRequest).execute();

Importing Large Compound Documents

Importing very large compound documents together with the creation of many sub-documents (about 5000 or more) within one request can overload the Core API and cause some of the operations to fail. Therefore, it is recommended to split such requests into several steps: First, import the compound document itself as a single document and extract the contentStreamId and repositoryId from the response. Then, import the metadata of all sub-documents in a series of batch imports. Limit the number of sub-documents per import to less than 5000 to avoid overloading.

When subsequently importing sub-documents, please note that the metadata (postSubDocumentImportJsonString in the example) must be sent in a multipart body as well, even if this multipart body consists of only one part.

Code Block
languagejava
titleSubsequent Import of a Sub-document
linenumberstrue
RequestBody postSubDocumentImportRequestBody = new MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addFormDataPart("data", "metadata.json", RequestBody.create(JSON, postSubDocumentImportJsonString))
        .build();

Request postSubDocumentImportRequest = new Request.Builder()
        .header("Authorization", "Basic " + auth)
        .header("X-ID-TENANT-NAME", tenant)
        .url(baseUrl + "/api/dms/objects")
        .post(postSubDocumentImportRequestBody)
        .build();

Response postSubDocumentImportResponse = client.newCall(postSubDocumentImportRequest).execute();
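
The batching recommended above can be sketched as a simple partitioning step that runs before the import requests are sent. The class SubDocumentBatches and the method partition are hypothetical names. The batch size of 4999 is an assumption derived from the "less than 5000" guideline, not a documented limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the tutorial's repository.
final class SubDocumentBatches {

    /** Splits the items into consecutive batches of at most maxPerBatch elements. */
    static <T> List<List<T>> partition(List<T> items, int maxPerBatch) {
        if (maxPerBatch < 1) {
            throw new IllegalArgumentException("maxPerBatch must be at least 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += maxPerBatch) {
            int to = Math.min(from + maxPerBatch, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> subDocumentRanges = new ArrayList<>();
        for (int i = 0; i < 12000; i++) {
            subDocumentRanges.add(i + "-" + i);
        }
        // Each batch would become the objects list of one import request.
        for (List<String> batch : partition(subDocumentRanges, 4999)) {
            System.out.println("batch of " + batch.size() + " sub-documents");
        }
    }
}
```

Each batch would then be turned into one metadata JSON and sent in its own multipart request as shown in the code block above.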

Retrieving Sub-Documents

If you request the content of a compound document, you will get the total byte array. However, in most use cases, you want to retrieve a specific section of the total byte array. This is easily possible by retrieving a sub-document with a suitable range value that references the desired section in the total byte array (see the creation of sub-documents described above). You can retrieve its content by the objectId of the sub-document as usual, and you will get the corresponding section of the compound document's content.

Code Block
languagejava
titleRetrieve the content file of the DMS object specified by 'objectId'
Request getContentOfSubDocumentRequest = new Request.Builder()
        .header("Authorization", "Basic " + auth)
        .header("X-ID-TENANT-NAME", tenant)
        .url(baseUrl + "/api/dms/objects/" + objectId + "/contents/file")
        .get()
        .build();

Response getContentOfSubDocumentResponse = client.newCall(getContentOfSubDocumentRequest).execute();

>> Retrieving Documents via Core API

Deleting Compound Documents

Compound documents consist of one content document and several sub-documents that reference its content through a content stream and one or more ranges. If you request the deletion of a compound document, its metadata is deleted from the database, so the DMS object itself no longer exists in the system. However, as long as at least one sub-document references a range within the former compound document's binary content file, the entire original byte array remains in the binary storage. Sub-documents can be deleted individually as required without removing content that other sub-documents still reference. It is even possible to define new sub-documents that reference the same repositoryId and contentStreamId. The binary content is deleted only once all sub-documents of an already deleted compound document have been deleted as well.

>> Deleting Documents via Core API

Summary

In this tutorial, we used an OkHttpClient with cookie handling to import and retrieve a compound document and several sub-documents through the Core API.

A complete code example can be found in this git repository.

Read On

  • Importing Documents via Core API
  • Retrieving Documents via Core API
  • Deleting Documents via Core API