Repartition

Description

This function creates partitions from input data.

Some input data (e.g., a large CSV file) does not allow for partitioning and therefore cannot be processed in a distributed way. The Repartition function can overcome this obstacle. When data passes through the Repartition function, it is stored on the local hard disk and split into multiple partitions, depending on the parallelism configured in the Flow.
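
Hero Platform_ handles the repartitioning internally, but conceptually the behavior resembles the following Python sketch, which spools rows from a single-partition CSV to local disk and splits them across N partition files. The file layout and round-robin strategy here are illustrative assumptions, not the platform's actual implementation.

```python
import csv
import os

def repartition_csv(source_path, output_dir, parallelism):
    """Illustrative only: split a single-partition CSV into N partition
    files on the local disk so N workers can each process one partition."""
    os.makedirs(output_dir, exist_ok=True)
    partition_paths = [
        os.path.join(output_dir, "partition_%d.csv" % i) for i in range(parallelism)
    ]
    partition_files = [open(path, "w", newline="") for path in partition_paths]
    writers = [csv.writer(f) for f in partition_files]
    try:
        with open(source_path, newline="") as source:
            for row_number, row in enumerate(csv.reader(source)):
                # Round-robin assignment (an assumed strategy, for illustration).
                writers[row_number % parallelism].writerow(row)
    finally:
        for f in partition_files:
            f.close()
    return partition_paths
```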

This function can also act as a barrier or stage break, so that all the data must arrive at this function element before the Flow proceeds. Each stage of the Flow only requires the resources that belong to the Flow elements within that stage, which means the function can restructure a Flow into smaller stages that consume fewer resources at any one time.
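
The barrier behavior can be pictured with the minimal Python sketch below, in which stage_one and stage_two are hypothetical stand-ins for the Flow elements on either side of the Repartition function: the second stage does not start until every record has passed through the first.

```python
def run_with_stage_break(records, stage_one, stage_two):
    """Conceptual sketch of a stage break (barrier): stage_two does not
    start until every record has passed through stage_one."""
    # Stage 1: only the resources used by stage_one are needed here.
    buffered = [stage_one(record) for record in records]  # barrier: fully materialized
    # Stage 2: stage_one's resources can be released before this point.
    return [stage_two(record) for record in buffered]

# Example: double every value in stage one, then format it in stage two.
print(run_with_stage_break([1, 2, 3], lambda x: x * 2, lambda x: "value=%d" % x))
```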

This function is for advanced users familiar with optimizing execution time and resource usage. Please contact support for additional information and help with this function.

Use

This function is located under the aggregation section of the element browser.

By default, there are no fields to configure; all input fields coming from the prior element in the Flow are included in the partitioning.

To exclude fields from partitioning, right-click the Repartition element in the Flow and select Edit Outputs. Uncheck the boxes of any fields that should not go through the partitioning process.

Examples

The following are use case examples using the Repartition function.

Use case 1: Leverage Maximum Parallelism 

In this example, you want a function that repartitions the data to leverage the configured maximum parallelism.

You have a data source that cannot provide multiple partitions (e.g., a .csv file with 5K values).

In the Flow configuration settings, you set the max parallelism to "4" and then run the Flow.

The Flow runs on only a single partition because the .csv file cannot be split.

Now, you add the Repartition function to the Flow after the Input element and run the Flow again.

Hero Platform_ now distributes the run into 4 partitions. 
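
As a rough illustration of the outcome (not of the platform's actual assignment strategy), splitting 5K rows evenly across a parallelism of 4 gives each partition 1,250 rows:

```python
rows = list(range(5000))      # stand-in for the 5K values in the .csv file
parallelism = 4               # max parallelism configured in the Flow
partitions = [rows[i::parallelism] for i in range(parallelism)]
print([len(p) for p in partitions])   # -> [1250, 1250, 1250, 1250]
```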

Use case 2: Optimal Use of Memory

In this example, your Flow uses two Docker functions.

One function has a 7 GB memory requirement. The other function has a 5 GB memory requirement.

Normally, without a stage break, both Docker functions run at the same time, so the Flow would require 12 GB of memory each time it runs.

Adding the Repartition function between the Docker functions in the Flow lets the execution split into two stages.

During the first stage, the Flow requires 7 GB of memory. When the Flow reaches the Repartition function, the first Docker container stops before the second one begins processing, so the second stage requires only 5 GB.

The Repartition function has reduced the Flow's peak memory consumption from 12 GB to 7 GB.
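
The arithmetic behind this example, assuming memory requirements simply add up when both containers run at the same time:

```python
docker_a_gb = 7   # memory requirement of the first Docker function
docker_b_gb = 5   # memory requirement of the second Docker function

# Without a stage break, both containers run at the same time.
peak_without_break = docker_a_gb + docker_b_gb       # 12 GB
# With the Repartition stage break, only one stage runs at a time.
peak_with_break = max(docker_a_gb, docker_b_gb)      # 7 GB

print(peak_without_break, peak_with_break)           # 12 7
```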

Use case 3: Keep Connections From Timing Out

In this example, an Input requires an active Connection to another server (e.g., a database, FTP server, etc.).

Some functions (e.g., Dockerized functions, OCR, etc.) process the data more slowly, which causes a problem.

Here, your Input would need to keep that Connection open for minutes, hours, or longer. The problem is that a network issue or timeout may occur and break the Connection.

The Repartition function stores the incoming data in temporary files, so the Input can finish reading from the Connection quickly while the slower functions process the data from local disk. This helps eliminate the chance that a network issue interrupts the Flow.
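
Conceptually, this works like the sketch below: the rows are spooled from the Connection into a temporary file as quickly as the Connection allows, the Connection is released, and the slow processing then reads from local disk. The fetch_rows and slow_process names are hypothetical placeholders, not Hero Platform_ APIs.

```python
import csv
import tempfile

def spool_then_process(fetch_rows, slow_process):
    """Conceptual sketch: drain the Connection quickly into a temporary file,
    then run the slow step from local disk so no network connection has to
    stay open while it works."""
    with tempfile.NamedTemporaryFile("w", newline="", suffix=".csv", delete=False) as spool:
        writer = csv.writer(spool)
        for row in fetch_rows():              # fast: limited only by the Connection
            writer.writerow(row)
        spool_path = spool.name
    # The Connection can be closed here; only local files are needed below.
    with open(spool_path, newline="") as spooled:
        return [slow_process(row) for row in csv.reader(spooled)]   # slow, offline

# Usage with hypothetical placeholder functions:
results = spool_then_process(
    fetch_rows=lambda: [["a", 1], ["b", 2]],
    slow_process=lambda row: row[0].upper(),
)
print(results)   # ['A', 'B']
```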