Parquet File Writer
Learn more about the Parquet File Writer connector and how to use it in the Digibee Integration Platform.
Parquet File Writer is a Pipeline Engine v2 exclusive connector.
The Parquet File Writer connector allows you to write Parquet files based on Avro files.
Parquet is a columnar file format designed for efficient data storage and retrieval. Further information can be found on the official website.
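To illustrate what "columnar" means here, the sketch below lays out the same records row-wise and column-wise in plain Python. This is only a conceptual illustration, not the connector's internal format:

```python
# Illustrative sketch (not the connector's internals): the same records
# laid out row by row versus column by column, as a Parquet file stores them.
rows = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
]

# Columnar layout: one contiguous sequence per field.
columns = {
    "name": [r["name"] for r in rows],
    "age": [r["age"] for r in rows],
}

# Reading a single field touches only that column, which is what makes
# analytical scans over wide datasets efficient.
print(columns["age"])  # [30, 25]
```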
Parameters
Take a look at the configuration parameters of the connector. Parameters supported by Double Braces expressions are marked with (DB).
General tab
Parameter | Description | Default value | Data type |
---|---|---|---|
Parquet File Name | The file name of the Parquet file to be written. | file.parquet | String |
Avro File Name | The file name of the Avro file that contains the data to be written to the Parquet file. Only Avro files whose schemas use the record type are supported. | file.avro | String |
File Exists Policy | Defines the behavior to follow when a file with the same name (Parquet File Name parameter) already exists in the current pipeline execution. You can select one of the following options: Overwrite (overwrite the existing file) or Fail (interrupt execution with an error if the file already exists). | Overwrite | String |
Fail On Error | If the option is active, the pipeline execution is interrupted with an error. Otherwise, the pipeline execution proceeds, but the result will show a false value for the success property. | False | Boolean |
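The File Exists Policy semantics can be sketched with a small, hypothetical helper (the function name and signature below are illustrative, not part of the connector):

```python
from pathlib import Path

def write_parquet(path: str, data: bytes, file_exists_policy: str = "Overwrite") -> None:
    """Hypothetical helper mimicking the File Exists Policy options."""
    target = Path(path)
    if target.exists() and file_exists_policy == "Fail":
        # "Fail": interrupt with an error when the file already exists.
        raise FileExistsError(f"{path} already exists in the pipeline file directory")
    # "Overwrite": replace the existing file (or create it if absent).
    target.write_bytes(data)

write_parquet("demo.parquet", b"...")                  # first write succeeds
write_parquet("demo.parquet", b"...", "Overwrite")     # replaces the file
# write_parquet("demo.parquet", b"...", "Fail")        # would raise FileExistsError
```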
Advanced tab
Parameter | Description | Default value | Data type |
---|---|---|---|
Dictionary Encoding | Defines if dictionary encoding for columns must be enabled. | False | Boolean |
Compression Codec | The compression codec to be used when compressing the Parquet file, such as Uncompressed or Snappy. | Uncompressed | String |
Row Group Size | Defines the size of row groups in the Parquet file, in bytes. | 134217728 | Integer |
Page Size | Defines the size of pages in the Parquet file, in bytes. | 1048576 | Integer |
Documentation tab
Parameter | Description | Default value | Data type |
---|---|---|---|
Documentation | Section for documenting any necessary information about the connector configuration and business rules. | N/A | String |
Important information
The Parquet File Writer connector can only generate Parquet files based on Avro files. It's not possible to create them directly from a JSON payload.
Despite this limitation, the Digibee Integration Platform provides the Avro File Writer connector to generate Avro files. The Parquet File Writer connector can also handle Avro files generated by another source outside the platform.
When writing a Parquet file using the connector, Avro data of the types BINARY and FIXED is treated as binary data. When the generated file is read with the Parquet File Reader connector, the data for these types is displayed in base64 format.
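The base64 rendering of BINARY and FIXED values can be previewed with the standard library. This is only a sketch of the representation, not the connector's code:

```python
import base64

# Sketch of how BINARY/FIXED column data surfaces when the generated file
# is read back: raw bytes are rendered as a base64 string in the output payload.
raw = bytes([0x00, 0xFF, 0x10, 0x20])       # what the Avro file stored
encoded = base64.b64encode(raw).decode("ascii")
print(encoded)                              # "AP8QIA==" — the base64 form of the bytes
assert base64.b64decode(encoded) == raw     # the original bytes are recoverable
```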
You should also note that performance differences can occur when writing compressed and uncompressed Parquet files. Since compression requires more memory and processing, it's important to validate the limits supported by the pipeline when you apply it.
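A rough stdlib illustration of the size/CPU trade-off (zlib stands in here for codecs such as Snappy or GZIP; the connector handles compression internally):

```python
import zlib

# Repetitive, column-like data compresses very well...
column = b"status=OK;" * 10_000
compressed = zlib.compress(column)

print(len(column), len(compressed))   # the compressed form is far smaller
# ...but producing it consumed extra CPU and memory, which is why the limits
# supported by the pipeline should be validated before enabling a codec.
assert len(compressed) < len(column)
```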
Usage examples
Uncompressed Parquet file
Writing an uncompressed Parquet file based on an Avro file:
Parquet File Name: file.parquet
Avro File Name: file.avro
File Exists Policy: Overwrite
Compression Codec: Uncompressed
Example of Avro file content in JSON format:
Output:
Compressed Parquet file
Writing a compressed Parquet file based on an Avro file:
Parquet File Name: file.parquet
Avro File Name: file.avro
File Exists Policy: Overwrite
Compression Codec: Snappy
Example of Avro file content in JSON format:
Output:
File Exists Policy as Fail
Writing a Parquet file with the same name as an existing file in the pipeline file directory:
Parquet File Name: file.parquet
Avro File Name: file.avro
File Exists Policy: Fail
Output: