User Preparation of Training Data for Document Extraction

This page covers when a Semi-structured extraction model is needed and the requirements for preparing the training data used to create your custom model. 


Question: Why do you need to prepare your data before being able to create a Semi-structured extraction model in Hero Platform_?

For machine learning to work, the system needs to be told what data it should look for. Preparing, or labeling, data is the step required before machine learning can take place.

This page shows you how to take your raw data (the invoices) and attach a label to the important parts (custom fields), so when the machine learning happens, the machine can understand what it should be looking for.

When Training Data for the Document Extractor is Required

The Document Extractor has a set of default fields that are configured in a model provided by Automation Hero. These fields cover the most common information present on invoices and should be enough for many simple invoice processing use cases. You will only need to build a custom invoice document extractor if:

  • Custom fields are required that are not part of the default model.
  • You want control over retraining and improving the accuracy of one of the default fields.
  • The documents are semi-structured but not an invoice document.
    • Non-invoice documents will most likely start with a lower accuracy until they have been trained and retrained with a large volume of documents. 

Data Preparation Method

Follow the standard data preparation method unless you have additional input data to create a context aware model.

Standard Data Preparation

The standard method of data preparation is to label as many sample documents as possible by drawing boxes around field values. 

Best results are achieved when the bounding boxes are drawn tightly around the text values.

The model is then trained by reading the field values to learn the type of value to be expected. 

Context Aware Data Preparation

The context aware method of data preparation differs as it requires an Input data source containing a list of possible field values.

The same labeling process is used in the context aware method as the standard method, but fewer labeled sample documents are required.

Best results are achieved when the bounding boxes are drawn larger than the printed values. The box drawn on the sample values should try to include the entire area where the values may be found on the input documents. This method helps the model recognize variations such as longer names and multi-line values.
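The difference between a tight box and a larger context aware box can be sketched as follows. This is only an illustration: the Box type, the expand_box helper, the margins, and the page size are hypothetical and are not part of Hero Platform_. In practice the boxes are drawn by hand in the labeling interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in page coordinates (hypothetical units, e.g. points)."""
    left: float
    top: float
    right: float
    bottom: float


def expand_box(box: Box, margin_x: float, margin_y: float,
               page_width: float, page_height: float) -> Box:
    """Grow a tight box by a margin on every side, clamped to the page edges."""
    return Box(
        left=max(0.0, box.left - margin_x),
        top=max(0.0, box.top - margin_y),
        right=min(page_width, box.right + margin_x),
        bottom=min(page_height, box.bottom + margin_y),
    )


# Tight box, as in the standard method (assumed coordinates).
tight = Box(left=400.0, top=120.0, right=470.0, bottom=134.0)

# Larger box, as in the context aware method: room for longer or multi-line values.
wide = expand_box(tight, margin_x=60.0, margin_y=14.0,
                  page_width=595.0, page_height=842.0)
print(wide)
```

The margins here are arbitrary. The point is only that the context aware box should cover the area where a value could appear, not just where it appears on one sample.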

Requirements for Training Data

Before creating a new model in the Document Extractor, users must first prepare training data.

  • PDF files are supported.
    • Multi-page PDFs are supported. 
  • Users must create names for custom fields.
  • Add coordinate boxes for any instances of custom fields on each of the provided documents.

Standard Requirements

  • Training data must contain 20-50+ documents.
    • More training data usually equates to a model that produces more accurate results.
    • To improve the default fields, a significantly larger number of labeled documents is recommended.
    • For best results, the training data should include 20-50 instances of each custom field.
      • If all fields are on each document, then 20-50 total documents should be sufficient.
      • If some documents have only half of the custom fields, users should increase the total number of documents until there are enough instances of each field.
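The per-field counting described above can be sketched in a few lines. This is a rough illustration under stated assumptions: the labeled_docs mapping, the file names, and the field names are hypothetical, and the threshold of 20 is simply the lower end of the 20-50 range given above, not a value read from the platform.

```python
from collections import Counter

# Lower end of the 20-50 range from the requirements above (assumption).
MIN_INSTANCES_PER_FIELD = 20


def count_field_instances(labeled_docs):
    """Count labeled instances of each field across all documents."""
    counts = Counter()
    for fields in labeled_docs.values():
        counts.update(fields)
    return counts


def fields_needing_more_examples(labeled_docs, custom_fields,
                                 minimum=MIN_INSTANCES_PER_FIELD):
    """Return {field: number of missing instances} for under-represented fields."""
    counts = count_field_instances(labeled_docs)
    return {f: minimum - counts[f] for f in custom_fields if counts[f] < minimum}


# Hypothetical training set: 20 documents, but only 10 contain "po number".
labeled_docs = {
    f"invoice_{i:02d}.pdf": (["contact email", "po number"] if i < 10
                             else ["contact email"])
    for i in range(20)
}

shortfall = fields_needing_more_examples(labeled_docs,
                                         ["contact email", "po number"])
print(shortfall)
```

In this example the document count alone looks sufficient, but one field still needs ten more labeled instances.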

Training to extract values with too few examples can cause model training to fail.

In such a case, the logs contain the error message INSUFFICIENT_DATA or INSUFFICIENT_FIELD_DATA.

INSUFFICIENT_DATA is returned when there are too few documents to adequately train the model. If this error is given, add additional sample training documents.

INSUFFICIENT_FIELD_DATA is returned when one or more specific fields have too few examples in the sample training documents. If this error is given, add additional training documents that contain the field causing the error and/or use the context aware training feature.

  • This error is given early in the training process so as not to lose a significant amount of time.
  • Download the training metrics from the settings tab to see which field caused the error.
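As a small sketch, the two error codes and their remedies can be kept in a lookup table and matched against log text. The log line format below is invented for illustration; only the two error code names and the remedies come from this page.

```python
# Remedies paraphrased from this page; the log format is an assumption.
REMEDIES = {
    "INSUFFICIENT_DATA": "Add more sample training documents.",
    "INSUFFICIENT_FIELD_DATA": (
        "Add training documents that contain the affected field, "
        "and/or use the context aware training feature."
    ),
}


def find_training_errors(log_text):
    """Return (error_code, remedy) pairs for each known code found in the text."""
    return [(code, remedy) for code, remedy in REMEDIES.items()
            if code in log_text]


# Hypothetical log excerpt, not real platform output.
sample_log = "training job stopped: INSUFFICIENT_FIELD_DATA"
for code, remedy in find_training_errors(sample_log):
    print(f"{code}: {remedy}")
```

The lookup only tells you which kind of shortage occurred. The training metrics are still needed to identify the specific field.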

Context Aware Requirements

  • Training data must contain at least 15 documents.
    • Additional accurately labeled training data usually equates to a model that produces more accurate results.
  • An Input data source containing 1000+ sample values for each custom field.
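A quick pre-check of these two requirements might look like the sketch below. It assumes the 15 documents are a minimum and that the sample values are available as plain lists per field. The function name, field names, and sample values are hypothetical.

```python
MIN_LABELED_DOCUMENTS = 15     # from the requirement above, read as a minimum
MIN_VALUES_PER_FIELD = 1000    # "1000+ sample values for each custom field"


def check_context_aware_inputs(num_labeled_docs, values_by_field):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if num_labeled_docs < MIN_LABELED_DOCUMENTS:
        problems.append(
            f"Only {num_labeled_docs} labeled documents; "
            f"{MIN_LABELED_DOCUMENTS} are required."
        )
    for field, values in values_by_field.items():
        if len(values) < MIN_VALUES_PER_FIELD:
            problems.append(
                f"Field '{field}' has {len(values)} sample values; "
                f"{MIN_VALUES_PER_FIELD}+ are required."
            )
    return problems


# Hypothetical input data: one field has enough values, the other does not.
sample_values = {
    "contact email": [f"user{i}@example.com" for i in range(1200)],
    "po number": [f"PO-{i:05d}" for i in range(300)],
}
problems = check_context_aware_inputs(15, sample_values)
print(problems)
```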

Languages supported


Documents in other languages can be trained. The model’s accuracy depends on the number of labeled documents.

Automation Hero recommends labeling 1,000-2,000 documents, or using the context aware training feature to improve results with a smaller labeled document set (20-50 documents).

Best Practices

After meeting the requirements above, the following practices are recommended to achieve the highest accuracy in your document extraction model.

Documents used for training:

  • Use a variety of different layouts.
  • Use a variety of different field values.
    • E.g., Different suppliers or different accounts.
  • The custom fields should be present on as many of the documents as possible.
  • The larger the number of documents used for training data, the better the training results.
    • Note: Training on a large number of very similar documents will produce less accurate results. Try to use documents in your training data that represent as many of the layouts from your production documents as possible to ensure best results.
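One way to spot a training set dominated by a single layout is to tally documents per layout before labeling. The sketch below assumes you can assign each document a layout tag yourself, for example the supplier name. The tags and the 50% cut-off are illustrative choices, not platform rules.

```python
from collections import Counter


def layout_shares(doc_layouts):
    """Return the fraction of documents that use each layout tag."""
    counts = Counter(doc_layouts.values())
    total = sum(counts.values())
    return {layout: n / total for layout, n in counts.items()}


def dominant_layouts(doc_layouts, max_share=0.5):
    """List layouts whose share exceeds max_share (an arbitrary cut-off)."""
    return sorted(layout for layout, share in layout_shares(doc_layouts).items()
                  if share > max_share)


# Hypothetical set: 8 of 10 documents come from the same supplier layout.
doc_layouts = {f"invoice_{i:02d}.pdf": "supplier_a" for i in range(8)}
doc_layouts["invoice_08.pdf"] = "supplier_b"
doc_layouts["invoice_09.pdf"] = "supplier_c"

print(dominant_layouts(doc_layouts))
```

If a layout is flagged, adding documents from other suppliers or accounts is likely to help more than adding further copies of the dominant layout.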

Prepare Training Data Summary

This section covers how to prepare training data and then select it as an input from the Semi-structured extraction model creation dialog.

Process of using a Flow and a Human in the Loop Output/skill to label PDFs and images

  • Produce data in Hero Platform_ through an Input.
  • Configure a Human in the Loop Output with custom field names and coordinate boxes for those fields.
  • Run the Flow to process data into the Human in the Loop Output.
  • In Human in the Loop, populate the coordinate value box for each field by drawing boxes on the document. When complete, submit the skills.
  • After the processing from Human in the Loop has been completed, the data is ready for training.

Prepare Training Data Guide

This guide covers how to prepare your data for use in creating a Semi-structured extraction model.

Create a Connection and Input to Your Preparation Data Invoices

Create a Connection to the system where your data is located

The first step in preparing your data is to create a Connection to the system where your data is located.

In this example, a Connection is created to an S3 Bucket where the sample invoices are located.

  1. From the navigation menu, select Integration → Connections.
  2. Click Create New Connection.
  3. Configure the Connection to the system where your invoices are stored.
    In this example, the sample invoice is stored in an S3 Bucket.
  4. Click OK to save the Connection.

Create an Input to bring your invoices to Hero Platform_

Create an Input in order to bring your data preparation invoices into Hero Platform_.

  1. From the navigation menu, select Integration → Inputs.
  2. Click Create New Input.
  3. Configure the Input to access the invoices you want to use to train the Semi-structured extraction model.
    In this example, only a sample invoice is selected. Automation Hero recommends using 20-50+ invoices to train the Semi-structured extraction model.
    Learn more about entering a file extension.
  4. Click OK to save the Input.

Create a Connection and Output for Robin

The purpose of bringing your data preparation invoices into Hero Platform_ is to label the important parts of the data so that machine learning can build a model from it.

This part of the process sends your invoices to Robin, Automation Hero's "Human in the Loop" application where humans can label the custom fields of your data preparation invoices. 

Only custom fields that are not part of the standard list of invoice fields need to be labeled unless you want to take over retraining those specific fields. 

If taking over a default field, use the exact name of the field in your Robin Output. Field names are not case sensitive but any blank spaces must be included.
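The matching rule described above (case is ignored, spaces are significant) can be illustrated with a short comparison function. The field names used here are placeholders rather than the actual default field list, and the function is a sketch of the rule, not the platform's implementation.

```python
def matches_default_field(output_field_name, default_field_name):
    """Compare names ignoring case but keeping every space significant."""
    return output_field_name.casefold() == default_field_name.casefold()


# Placeholder names, for illustration only.
print(matches_default_field("invoice number", "Invoice Number"))   # case differs
print(matches_default_field("InvoiceNumber", "Invoice Number"))    # space missing
```

The first call matches because only the letter case differs. The second does not match because the blank space was dropped.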

Create a Robin Connection

Before being able to send data into Robin for actual humans to review and edit, a Connection must first be made.

  1. From the navigation menu, select Integration → Connections.
  2. Click Create New Connection.
  3. Select RobinSkill as the Connection type. There is no other configuration necessary.
  4. Click OK to save the Connection.

Create a Robin Output

A Robin Output is necessary to send your preparation invoices to the Robin application. 

  1. From the navigation menu, select Integration → Outputs.
  2. Click Create New Output.
  3. Configure the Robin Output to access the invoices you want to use to train the Semi-structured extraction model.
    In the section Fields mapping table, enter all of the custom fields you want to label.
    Also, include an output field for the binary (the invoice document).

    This example creates a custom field for a contact email on a document.
  4. Follow the instructions for configuring a Robin Output.
  5. In the Field schema editor:
    Select Document Review for both the fields that mark locations and the binary invoice images.
  6. Continue configuring the Robin Output and save it in Hero Platform_.

Building a Flow to Move Invoices Into Robin for Human Labeling

After you have completed both creating a Connection and Input to your data preparation invoices and creating a Connection and Output to Robin, you can build a Flow.

This Flow brings your preparation invoices into Robin so that humans are able to label the custom fields.

  1. From the navigation menu, select Automations → Flows.
  2. Click Create New Flow.
  3. Enter a name for the Flow.
  4. In the Flow Studio, locate your Input with the invoice documents from the element browser and drag it onto the canvas.
  5. Locate the Multiple Constants function and drag it onto the canvas.
  6. Connect the Input element to the Multiple Constants element.
  7. Create null value PageBoundingBox fields, one for each custom field from the invoice.
  8. Locate your Robin Output from the element browser and drag it onto the canvas.
  9. Connect the Multiple Constants function to the Robin Output.
  10. Configure the Robin Output.
  11. Click Save in the Flow Studio toolbar to save the Flow.
  12. Click Run Now to run the Flow.

Label Custom Fields in Robin

Open the Robin application and Robin Skill for Labeling

Your invoice documents have now been sent to Robin so humans can label the custom fields and their values.

To open Robin, enter your Hero Platform_ URL in your browser and add "/robin-app" on the end.
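As a tiny sketch, the Robin address can be derived from the platform URL like this. The host name below is a placeholder, and the helper is not part of any Automation Hero tooling.

```python
def robin_url(platform_url):
    """Append the "/robin-app" path, tolerating a trailing slash on the base URL."""
    return platform_url.rstrip("/") + "/robin-app"


# Placeholder host; substitute your own Hero Platform_ URL.
print(robin_url("https://hero.example.com/"))
```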


Enter your name and password for Hero Platform_ to log into Robin.

Open the Robin Skill.

Labeling custom fields in Robin

It is now time to start preparing your data by marking field positions and entering values.

  1. Open a Robin Task.
    In this example, there is only one task.
  2. Click the invoice file name to open it within the Robin task details page.
  3. Draw a label box around the value on the invoice and select the corresponding Robin field from the menu.
    The best practice for drawing bounding boxes depends on which data preparation method was used.  
    • When using the standard data preparation method, it is recommended to draw the bounding box tightly around the value in the Robin task.
    • When using the context aware feature, it is recommended to draw a bounding box larger than the value in the sample document so that the model can more accurately find longer values. Try to draw a box that includes the entire area where a value may be located in the input documents as well as the sample document.

    In this example, the "Net" field value is "£3,479.99".
    Notice the blue box around the image text. While a field is selected, the box is the color blue. When that field is no longer selected, the box turns red. Selecting that field again turns the box blue to indicate the current coordinates.

    The location of the box is displayed in the corresponding Robin field.
    Continue to mark the location of all the custom fields.

    Right-click any drawn boxes to:

    • Select Edit to use the coordinates for this box in another field.
      • A case-sensitive filter is provided to search for specific fields.
    • Select Remove to delete the selected drawn box.

  4. After all fields are populated, click Submit.
  5. Continue this process until all Robin Tasks have been submitted.

Congratulations! The data you will use to begin training your Semi-structured extraction model has now been prepared. That data is being stored in Hero Platform_'s Data Store.

To use data stored in the Data Store:

  1. Create a RobinSkill Connection.
  2. Create a RobinSkill Input to your information stored in the Data Store.

Now that you have prepared data, start training your Semi-structured extraction model.