PDF Extract Text

Description

Extracts embedded text from a .pdf document. A new column is created for the results.

The PDF text can be parsed into a single string or into a list of strings by page. 

The PDF must be supplied as binary data.

The maximum number of characters that can be extracted from a PDF file is 1,000,000.

  • This function extracts embedded text from PDF files. It does not use optical character recognition (OCR) to locate text on PDFs.
    • There are different types of PDFs. Scanned PDFs, for example, are usually images and contain no embedded text, whereas PDFs generated by desktop applications usually have a text layer.
  • This function's memory consumption is proportional to the size of the input file. Because of this, the Flow execution can require more memory than expected for input PDF files larger than 30MB.
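
The two limits mentioned above (the 1,000,000-character maximum and the 30MB memory guideline) can be checked outside the Flow before and after extraction. The following is a minimal Python sketch, not part of the product. The helper names are hypothetical, and reading "30MB" as 30 × 1024 × 1024 bytes is an assumption.

```python
# Hypothetical pre-flight checks. The names below and the byte
# interpretation of "30MB" are assumptions, not part of the function.
MAX_EXTRACTED_CHARS = 1_000_000        # documented extraction limit
LARGE_INPUT_BYTES = 30 * 1024 * 1024   # assumes "30MB" means 30 * 1024 * 1024 bytes


def may_need_extra_memory(pdf_bytes: bytes) -> bool:
    """Return True when the binary input is larger than the 30MB guideline."""
    return len(pdf_bytes) > LARGE_INPUT_BYTES


def within_character_limit(text: str) -> bool:
    """Return True when the text fits within the documented character limit."""
    return len(text) <= MAX_EXTRACTED_CHARS
```

A check like `may_need_extra_memory` only flags inputs that might need more memory than expected. It does not predict the actual memory use of a Flow execution.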

After the string has been extracted, it can be used in a Flow like any other string.

Use

  • PDF:
    • Select an argument. (Binary) 
      • The function returns an error if the value is <null>.
  • Parse by pages:
    • If unchecked, the embedded text of all pages in a .pdf document is parsed into a record as a single string.
    • If checked, the embedded text of all pages in a .pdf document is parsed into a single record as a list. Each element in the list is a string containing the embedded text of one page.
      • The text per page can be separated using a function like Flatten List.
  • Password:
    • This parameter is optional. It is needed only if the input PDF is password protected.
    • The left/right arrow icon: Select a dynamic password provided by a field in the drop-down menu or enter a constant password. 
      • The function fails if the dynamic password data is a <null> value.
  • Output field name:
    • Enter an output field name.
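
The effect of the Parse by pages option on the output field can be modelled with a short Python sketch. This is an illustration only, not the function's implementation. The names `shape_output` and `flatten_pages` are hypothetical, joining pages with a newline is an assumption (the separator used for the single string is not documented here), and treating Flatten List as "one record per list element" is also an assumption.

```python
from typing import Dict, List, Union


def shape_output(page_texts: List[str], parse_by_pages: bool) -> Union[str, List[str]]:
    """Model the value written to the output field for one input record."""
    if parse_by_pages:
        # Checked: one list, one string per page.
        return list(page_texts)
    # Unchecked: a single string. The newline separator is an assumption.
    return "\n".join(page_texts)


def flatten_pages(record: Dict, field: str) -> List[Dict]:
    """Model a Flatten List step: one output record per page (assumed behavior)."""
    return [{**record, field: page} for page in record[field]]


pages = ["Page 1 text", "Page 2 text"]
single = shape_output(pages, parse_by_pages=False)
by_page = shape_output(pages, parse_by_pages=True)
rows = flatten_pages({"id": 7, "pdf_text": by_page}, "pdf_text")
```

In this model, `single` is one string covering both pages, `by_page` is a two-element list in one record, and `rows` holds two records that each carry the text of one page.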

Type

Formulas