Work with Semi-Structured Document Extraction


The Semi-structured extraction feature is similar to the Invoice Extraction feature. Both use a set of default fields.

Customization means that you can add non-default fields to the extraction. You can also retrain default fields.


Open Documents

Open AI from the navigation menu and select Documents.

Document Overview Page

The Document Overview page displays all of the saved document extractions.

This includes Invoices, Fixed Forms, and Semi-structured models.

For each document extraction, you can view:

  • Document extraction name - The name given to identify the document model.
  • Number of versions - The number of different versions of the document extraction.
  • Created - The date the document extraction was created.
  • Settings:
    • Edit - Open the extraction in Document Studio to edit it.
    • Clone - Copy the document extraction.
    • Delete - Delete the document extraction.
    • Configuration - Edit the name of the extraction.

Create a Semi-structured Extraction Model

To create a new Semi-structured extraction model:

  1. After opening Documents in Hero Platform_, click Create Document Model.

  2. Enter a name for the Semi-structured extraction model and click Next.

  3. Select Semi-structured and click Next.

  4. Select the training data Input that was saved in the Data Store during the data preparation stage.

    A RobinSkill Input is required to access data in the Data Store.

    Click Save and Preview to enter the Document Studio.

    Languages supported

    Text type     Language              Note
    Typed Text    Multiple languages    Supported


    Documents in other languages can be trained. The model’s accuracy depends on the number of labeled documents.

    Automation Hero recommends labeling 1,000-2,000 documents, or using the Context aware training feature to improve results with a smaller labeled document set (20-50 documents).

  5. The Document Studio displays a training document with the located pre-trained fields highlighted.
    The field names, values, and confidence scores are listed under the Available Fields tab.
    Adjust the UI size of the preview document or tab information by clicking and dragging the center line.
    Documents with multiple pages can be viewed by scrolling. View additional documents added to the extraction by clicking the arrows on either side of Document Studio.

    Only the first five input documents are displayed in the Document Studio. When training the extraction, all submitted documents are used for training.

  6. Click Save and Train in the toolbar after reviewing the sample results.
    • Training is available when new data has been added or a change has been made to a previous version.
    • A specific version can be selected from the drop-down menu at the top of the screen.

    • When saving, choose between saving to the current version or creating a new version.

    Click Start training to begin training and then add the Semi-structured extraction to Hero Platform_.

Information Tabs in a Semi-structured Extraction Model

Available Fields

Custom Fields and Standard Fields

Displays the standard and custom fields.

Filter fields from the header bar, or search for specific fields using the search bar.


  • All fields - All default fields are displayed with or without a value.
  • Only extracted fields - Displays fields only where a value was found.

Sample Documents

Displays the first five documents from the training data Input. 

Use the arrow icons on the side of the Document Studio to preview the model's performance.

After training has been completed, the custom fields are marked on the sample documents. 

Remove sample documents by clicking the trash icon.


Displays the start and end time for the training of the model.

Click Download files to download a zip file containing two log files: training.log and metrics.json.

  • The metrics.json file contains the values displayed in the metric charts below.
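
The downloaded archive can also be inspected outside the platform. The following is a minimal sketch using only the Python standard library. It assumes the zip file contains a file named metrics.json holding valid JSON. The function name and the example path are hypothetical, and the internal layout of the metrics file is not described here, so the sketch returns the parsed content unchanged.

```python
import json
import zipfile


def load_training_metrics(zip_path, metrics_name="metrics.json"):
    """Return the parsed metrics file from a downloaded training archive.

    Hypothetical helper: the archive layout is an assumption, so the file
    is matched by base name wherever it sits inside the zip.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.rsplit("/", 1)[-1] == metrics_name:
                with archive.open(member) as handle:
                    return json.load(handle)
    raise FileNotFoundError(f"{metrics_name} not found in {zip_path}")
```

For example, `load_training_metrics("training_files.zip")` returns whatever JSON structure the file holds, where the file name is a placeholder for the archive you downloaded.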


The metric charts show the following information for each trained field:

  • Precision - How often the extractor is correct when it identifies a value.
  • Recall - How many of the known values the extractor identifies.
  • F1-score - A measure of overall performance that combines precision and recall (their harmonic mean).

In general, higher values are better.
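
To make the three metrics concrete, here is a minimal sketch that computes them from raw counts using their standard definitions. How the platform computes the charted values internally is not documented here, so treat this as an illustration only. The function name and the example counts are hypothetical.

```python
def field_metrics(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1-score for one field from raw counts."""
    predicted = true_positives + false_positives  # values the extractor identified
    actual = true_positives + false_negatives     # values known to be in the documents
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical field: 80 correct extractions, 20 wrong ones, 10 missed values.
precision, recall, f1 = field_metrics(80, 20, 10)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.80 recall=0.89 f1=0.84
```

In this example the extractor is right 80% of the time when it returns a value, but it finds only about 89% of the values that exist, and the F1-score sits between the two.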


Make this a context aware document extractor


Automation Hero has created features called "Context Awareness". These features let you connect and use data from your existing data sources. This data can help our AI make better decisions about the information it detects. 

These features can be used to enhance the training speed and accuracy of document extraction models.

By supplying a source of sample values, the context aware feature for the Semi-structured extractor reduces the number of labeled sample documents needed to produce highly accurate trained models.

Automation Hero recommends this feature when possible to save time and increase accuracy.

  • Quicker data preparation, because fewer labeled documents are needed.
  • Fewer opportunities for human error in the data labeling process.
  • Improved accuracy through a deeper learning of expected field values.

This training method makes creating a data extraction model both faster and easier. It requires the following:

  • An input containing 1000+ sample values for each field.
    • Each custom labeled field needs sample values for training the model. 
  • A minimum of 15 labeled sample documents.
    • Automation Hero recommends increasing the number of accurately labeled sample documents to help the AI better understand what it is looking for.
  • Because the context aware feature builds the model from the source values as well as the supplied labeled sample documents, the labeled sample documents must be accurate to produce accurate results.
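
Before starting a context aware training run, it can help to check the prepared data against the minimums listed above. The following is a minimal sketch of such a check, run outside the platform. The function name, the field names, and the dictionary layout are hypothetical. Only the two thresholds come from this page.

```python
MIN_SAMPLE_VALUES = 1000  # "1000+ sample values for each field"
MIN_LABELED_DOCS = 15     # "A minimum of 15 labeled sample documents"


def check_context_aware_inputs(sample_values_by_field, labeled_doc_count):
    """Return a list of problems; an empty list means the minimums are met."""
    problems = []
    if labeled_doc_count < MIN_LABELED_DOCS:
        problems.append(
            f"only {labeled_doc_count} labeled documents, "
            f"need at least {MIN_LABELED_DOCS}"
        )
    for field, values in sample_values_by_field.items():
        if len(values) < MIN_SAMPLE_VALUES:
            problems.append(
                f"field '{field}' has {len(values)} sample values, "
                f"need at least {MIN_SAMPLE_VALUES}"
            )
    return problems


# Hypothetical custom fields with generated placeholder values.
samples = {
    "policy_number": [f"P-{i}" for i in range(1200)],
    "broker_name": [f"Broker {i}" for i in range(300)],
}
for problem in check_context_aware_inputs(samples, labeled_doc_count=12):
    print(problem)
```

Here the check reports two problems: too few labeled documents, and too few sample values for the broker_name field.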

Select Yes to enable this feature. This feature is disabled by default.

Click Save and Train in the Document Studio.

Before training begins, a pop-up box is displayed with the model's custom fields.

For each custom field in the model, select the Input and the corresponding field name. Each source field should contain 1000+ sample values.

When complete, click Start training to begin training the Semi-structured extraction model.


Select a language used for values in typed text fields that are created for the model.

List of supported languages.

Semi-structured Extraction Model Training Status

Status            Definition
Ready             Training is complete.
Needs training    A model version has been created but has not yet been trained.
Training          The model is in the process of being trained.
Error             The training of the model has crashed and was not completed.

Use a Semi-structured Extraction Model in a Flow

After a Semi-structured extraction model has been saved and trained, it can be used as a function in a Flow.

To use a Semi-structured extraction model in a Flow:

  1. Open and start creating a Flow in the Flow Studio.
  2. View the Document functions in the element browser.
  3. Click and drag the Semi-structured extraction model from the element browser onto the Flow Studio canvas.
  4. Connect the Semi-structured extraction model using a cable from an element in the Flow.
  5. Confirm or select a version of the Semi-structured extraction model.
  6. Add Input documents.
  7. Select the coordinates type:

    Relative coordinates are the recommended option. They are more stable and adjust for document scaling, while absolute coordinates may require adjustments when document scaling changes. Support for absolute coordinates will be removed in a future release.

    • Absolute Coordinates - Returns (as an output field) the position of a value box as a pixel location on the document.
      • Metadata (Tuple) containing x, y, w, h (Long) values.
    • Relative Coordinates - Returns (as an output field) the position of a value box as a percentage of the space on the document.
      • Page_bounding_box (Tuple) containing boundingBox (Tuple) containing left, top, width, height (Double) values.
  8. Configure or review the fields for the Semi-structured extraction model's containerized function deployment.

    1. Capture logs - Select if the containerized function should capture logs.
    2. RAM - Adjust the sliding bar for memory (RAM) allocation for the function.
    3. vCPU - Adjust the sliding bar for CPU consumption (in cores).
    4. Attempt timeout(s) - Enter the timeout setting (in seconds).
    5. Initial Delay - Enter the delay, in seconds, between when the container starts and when the Flow begins to use it.
    6. Retry attempts - Enter the maximum number of retry attempts before failing.

    Automation Hero recommends leaving the containerized function settings at the default levels unless problems arise. 

    For example, raising the default settings may help when the documents being processed are very large.

  9. Configure the fields for the Semi-structured extraction model.
  10. Click OK to finish adding the Semi-structured extraction model to the Flow.
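
The difference between the two coordinate types in step 7 can be illustrated with a short conversion. This is a minimal sketch, not platform code. It assumes the relative left, top, width, and height values are fractions between 0 and 1 of the page size (if they are delivered as 0-100 percentages, divide by 100 first), and that the page size in pixels is known to the caller. The function name, the plain dictionary standing in for the boundingBox Tuple, and the page sizes are hypothetical.

```python
def relative_to_absolute(bounding_box, page_width_px, page_height_px):
    """Convert a relative bounding box to pixel values (x, y, w, h).

    Assumption: left/top/width/height are fractions between 0 and 1.
    """
    return {
        "x": round(bounding_box["left"] * page_width_px),
        "y": round(bounding_box["top"] * page_height_px),
        "w": round(bounding_box["width"] * page_width_px),
        "h": round(bounding_box["height"] * page_height_px),
    }


box = {"left": 0.25, "top": 0.5, "width": 0.5, "height": 0.125}

# The same relative box on a page rendered at two different scales.
print(relative_to_absolute(box, page_width_px=1000, page_height_px=2000))
# prints: {'x': 250, 'y': 1000, 'w': 500, 'h': 250}
print(relative_to_absolute(box, page_width_px=2000, page_height_px=4000))
# prints: {'x': 500, 'y': 2000, 'w': 1000, 'h': 500}
```

The relative box stays the same when the page is rescaled, while the pixel values change. This is why relative coordinates need no adjustment for document scaling.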