PDF Extract Text

Description

Extracts embedded text from a .pdf document. A new column is created for the results.

The PDF text can be parsed into a single string or into a list of strings by page. 

Requires the user to specify the PDF as binary data.

The maximum number of characters that can be extracted from a PDF file is 1,000,000.

  • This function extracts embedded text from PDF files. It does not use optical character recognition (OCR) to locate text on PDFs.
  • This function's memory consumption is proportional to the size of the input file. Because of this, Flow execution can require more memory than expected for input PDF files larger than 30 MB.
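
The limits above can be checked before a file reaches the Flow. The following Python sketch is only an illustration and is not part of the product: precheck_pdf and exceeds_char_limit are hypothetical helper names, 30 MB is assumed to mean 30 × 1024 × 1024 bytes, and the %PDF- header test is a rough heuristic for spotting non-PDF input.

```python
from typing import List, Optional

MAX_EXTRACTED_CHARS = 1_000_000        # documented extraction limit
LARGE_INPUT_BYTES = 30 * 1024 * 1024   # assumption: "30 MB" read as 30 MiB


def precheck_pdf(data: Optional[bytes]) -> List[str]:
    """Return warnings for a PDF supplied as binary data (hypothetical helper)."""
    if data is None:
        # Mirrors the documented behavior: a <null> binary value makes the function fail.
        raise ValueError("binary content is null")
    warnings: List[str] = []
    if not data.startswith(b"%PDF-"):
        # Heuristic only: PDF files normally begin with this header.
        warnings.append("input does not start with the %PDF- header")
    if len(data) > LARGE_INPUT_BYTES:
        warnings.append("input is larger than 30 MB; expect higher memory use")
    return warnings


def exceeds_char_limit(text: str) -> bool:
    """True if a string is longer than the documented 1,000,000-character maximum."""
    return len(text) > MAX_EXTRACTED_CHARS
```

For example, precheck_pdf(b"%PDF-1.7\n") returns an empty list, while a byte string without the header returns one warning. How the function itself behaves when the character limit is reached is not stated here, so the sketch only reports whether a string is over the limit.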

After the string has been extracted, it can be used in a Flow like any other string.

Use

  • Select an argument (Binary).
    • The function fails if the binary content is a <null> value.
  • Parse by pages:
    • If unchecked, the embedded text of all pages in a .pdf document is parsed into a record as a single string.
    • If checked, the embedded text of all pages in a .pdf document is parsed into a single record as a list. Each element in the list is a string containing the embedded text of one page.
      • The text per page can be separated using a function like Flatten List.
  • Select or enter a password if the input PDF is password protected.
    • The function fails if the password data is a <null> value.
  • Enter an output field name.
  • Click OK.
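
The two output shapes of the Parse by pages option can be pictured with a small Python sketch. It models only the shape of the result, not the extraction itself: shape_output and flatten_list are hypothetical names, the per-page strings are made-up sample data, and joining pages with a newline is an assumption, since the actual separator is not documented here.

```python
from typing import Dict, List, Optional, Union


def shape_output(page_texts: Optional[List[str]], parse_by_pages: bool) -> Union[str, List[str]]:
    """Hypothetical stand-in for the output shape, given already-extracted page texts."""
    if page_texts is None:
        # Mirrors the documented failure on a <null> input.
        raise ValueError("input is null")
    if parse_by_pages:
        # Checked: one list, one string per page.
        return list(page_texts)
    # Unchecked: a single string. Assumption: pages are joined with a newline.
    return "\n".join(page_texts)


def flatten_list(record: Dict[str, object], field: str) -> List[Dict[str, object]]:
    """Rough analogue of a Flatten List step: one output record per list element."""
    return [{**record, field: item} for item in record[field]]


pages = ["Text of page 1", "Text of page 2"]  # sample data

single = {"file": "report.pdf", "pdf_text": shape_output(pages, parse_by_pages=False)}
by_page = {"file": "report.pdf", "pdf_text": shape_output(pages, parse_by_pages=True)}
flattened = flatten_list(by_page, "pdf_text")
```

Here single holds one record with one string, by_page holds one record whose field is a two-element list, and flattened holds two records, one per page.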

Type

Formulas