Highly scalable ETL model for Digibee
Learn how to design a scalable ETL process using event-based pipelines.
Although an iPaaS like the Digibee Integration Platform is cloud-based software designed to connect applications across different systems and organizations in real time, the need to implement ETL (Extract, Transform, and Load) processes is often inevitable. Even though the capabilities of an iPaaS differ from those of a dedicated ETL tool, this type of solution can be implemented with high efficiency.
This document describes an ETL model used for the extraction, transformation (field mapping), and delivery of large volumes of records from a relational database to a target system using the Digibee Integration Platform.
The objective is to extract the data within a predefined time window by scaling events according to demand.
Consider a scenario where a high volume of data must be extracted from a relational database containing different tables, each representing a distinct business flow — in other words, several integrations. Altogether, these tables can represent about 20 million records per day.
Each table requires field mapping, and the data must be sent in batches to the target, which can be either an API or a File server.
Also consider that there is no restricted time window for data delivery at the destination. However, since extraction takes place from a critical-use database, it must be performed during off-peak hours (overnight). Therefore, it is important that the architecture is highly scalable and allows for parameterization by flow.
The implementation of this model follows an event-based architecture, consisting of:
Two initial pipelines for the Extract phase
Two pipelines that together handle the Transform and Load phases
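To make the event chain between these pipelines concrete, the sketch below outlines one possible set of event names and payloads. The topic names and payload fields are illustrative assumptions, not names defined by the Platform.

```python
# Hypothetical event chain for the ETL model (topic names are illustrative only).
#
#   Trigger pipeline          --publishes-->  "etl.extract.page"  (one event per page)
#   Extraction pipeline       --consumes----  "etl.extract.page"
#                             --publishes-->  "etl.load.page"     (after storing the page)
#   Transform/Load pipeline   --consumes----  "etl.load.page"

# Payload published by the trigger pipeline for a single page (assumed shape):
extract_page_event = {
    "integration": "orders",        # business flow / source table
    "page": 17,                     # page index within the extraction window
    "offset": 170_000,              # first record of the page
    "limit": 10_000,                # page size
    "executionDate": "2024-05-01",  # logical date of the daily run
}

# Payload published after extraction, pointing to the stored page (assumed shape):
load_page_event = {
    "integration": "orders",
    "page": 17,
    "objectStore": "orders-pages",               # flow-specific temporary Object Store
    "documentKey": "orders-2024-05-01-page-17",
    "executionDate": "2024-05-01",
}
```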
The architecture outlined in this article assumes a simple transformation, such as field mapping from source to target with a few basic rules.
However, if the scenario requires complex data transformations that depend on extensive internal processing, it may be necessary to evolve the architecture by introducing a dedicated event pipeline solely for transformation before the data is sent, as shown in the model below:
With the Digibee Integration Platform, evolving this model in a decoupled way is simple, offering great scalability.
In addition to the pipeline structure mentioned above, the following components are important for controlling and operating the ETL model:
Extraction page control: An Object Store is required to store the metadata of each page to be extracted.
Temporary Object Store: A dedicated Object Store per flow where extracted pages are stored before transformation and delivery.
Integration control: Although not shown in the diagram, it is highly recommended to maintain an Object Store for integration control to track the daily status of each integration in the List of integrations.
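As an illustration, the documents held by these control Object Stores could look like the sketch below. The field names and status values are assumptions used throughout the examples in this article, not a required schema.

```python
# Hypothetical document shapes for the control Object Stores (field names are assumptions).

# Extraction page control: one document per page to be extracted.
page_metadata = {
    "integration": "orders",
    "page": 17,
    "offset": 170_000,
    "limit": 10_000,
    "status": "PENDING",            # PENDING -> EXTRACTED -> DELIVERED / ERROR
    "executionDate": "2024-05-01",
}

# Integration control: one document per integration per day, tracking the overall status.
integration_control = {
    "integration": "orders",
    "executionDate": "2024-05-01",
    "totalPages": 200,
    "extractedPages": 120,
    "deliveredPages": 97,
    "status": "RUNNING",            # RUNNING -> COMPLETED / FAILED
}
```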
The architecture presented in this article ensures high scalability in the ETL process. Depending on the demand and constraints of each component, extraction and loading can be scaled independently, pacing each stage as needed. For example, if the extraction window at the source is shorter than the loading window at the target, we can speed up extraction by increasing replicas.
This decoupling of stages also prevents an outage in the target system from blocking the extraction stage, which can be fully completed and delivered later.
Additionally, the model enables sharing of common components across various processes, such as the page generation pipelines for extraction and the page dispatching pipeline.
A generic pipeline initiates the data extraction process. This pipeline acts as a trigger for integrations. Its sole responsibility is to generate page metadata for each flow and publish the extraction event.
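The trigger logic can be summarized as in the sketch below, which assumes a fixed page size and uses hypothetical save_page_metadata and publish_event helpers standing in for the Platform's Object Store and event publication steps.

```python
import math

PAGE_SIZE = 10_000  # assumed number of records per page


def save_page_metadata(doc: dict) -> None:
    """Stand-in for writing the page metadata to the extraction control Object Store."""
    print("store:", doc)


def publish_event(topic: str, payload: dict) -> None:
    """Stand-in for the Platform's event publication step."""
    print("publish:", topic, payload)


def generate_pages(integration: str, total_records: int, execution_date: str) -> list[dict]:
    """Split the total record count of a flow into page metadata entries."""
    total_pages = math.ceil(total_records / PAGE_SIZE)
    return [
        {
            "integration": integration,
            "page": page,
            "offset": page * PAGE_SIZE,
            "limit": PAGE_SIZE,
            "status": "PENDING",
            "executionDate": execution_date,
        }
        for page in range(total_pages)
    ]


def trigger(integrations: dict[str, int], execution_date: str) -> None:
    """For each flow, persist the page metadata and publish one extraction event per page."""
    for integration, total_records in integrations.items():
        for page_metadata in generate_pages(integration, total_records, execution_date):
            save_page_metadata(page_metadata)
            publish_event("etl.extract.page", page_metadata)


# Example run for a single flow with 25,000 records (3 pages):
trigger({"orders": 25_000}, "2024-05-01")
```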
A specific pipeline for each integration receives the page metadata upon publication, performs the extraction, and stores the extracted data in a temporary Object Store (specific to the flow).
As an event-driven process, this pipeline can be scaled based on the flow’s needs. Scaling can be in terms of pipeline size, to support the extracted page, or in terms of replicas, to handle multiple pages simultaneously. For instance, increasing replicas may be a good strategy to reduce extraction time.
On the Digibee Integration Platform, horizontal scaling (replicas) can be configured to adjust automatically. You only need to define the minimum and maximum number of replicas, and the Platform scales within that range.
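The work triggered by each extraction event can be sketched as follows. The sketch assumes a paged SQL query (LIMIT/OFFSET), a source table named after the integration, and hypothetical helpers standing in for the temporary Object Store write and the event publication.

```python
import sqlite3  # stand-in for the source relational database


def store_page(object_store: str, key: str, rows: list) -> None:
    """Stand-in for writing the extracted page to the flow-specific temporary Object Store."""
    print(f"store {len(rows)} rows in {object_store}/{key}")


def publish_event(topic: str, payload: dict) -> None:
    """Stand-in for the Platform's event publication step."""
    print("publish:", topic, payload)


def on_extract_page(event: dict, conn: sqlite3.Connection) -> None:
    """Consume one extraction event: read the page from the source and store it for delivery."""
    # The source table is assumed to share the integration's name.
    query = f"SELECT * FROM {event['integration']} LIMIT ? OFFSET ?"
    rows = conn.execute(query, (event["limit"], event["offset"])).fetchall()

    document_key = f"{event['integration']}-{event['executionDate']}-page-{event['page']}"
    store_page(f"{event['integration']}-pages", document_key, rows)
    publish_event("etl.load.page", {**event, "documentKey": document_key, "status": "EXTRACTED"})
```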
The transform-and-load pipeline is responsible for retrieving each page from the temporary Object Store, performing the transformation, and delivering it to the target. Like the extraction pipeline, this delivery pipeline should be specific to each integration.
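A minimal sketch of this pipeline is shown below, assuming a simple field mapping and delivery to a REST API; the mapping, endpoint, and helper functions are illustrative only.

```python
import requests

# Assumed source-to-target field mapping and target endpoint (illustrative only).
FIELD_MAP = {"order_id": "id", "order_total": "amount", "created_at": "date"}
TARGET_URL = "https://target.example.com/api/orders/batch"


def read_page(object_store: str, key: str) -> list[dict]:
    """Stand-in for reading an extracted page from the temporary Object Store."""
    return []


def mark_page_delivered(event: dict) -> None:
    """Stand-in for updating the page metadata and integration control Object Stores."""
    print("delivered:", event["documentKey"])


def on_load_page(event: dict) -> None:
    """Consume one load event: map the stored page to the target layout and deliver it as a batch."""
    rows = read_page(f"{event['integration']}-pages", event["documentKey"])
    batch = [{FIELD_MAP[k]: v for k, v in row.items() if k in FIELD_MAP} for row in rows]

    response = requests.post(TARGET_URL, json=batch, timeout=60)
    response.raise_for_status()
    mark_page_delivered(event)
```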
List of integrations (Flows): This can be a variable configured with a JSON or a table in the client's database. For this article, we used a Global (a sample structure is sketched after this list).
Common mechanisms: Existing reusable mechanisms, such as reprocessing, can support the ETL process.
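As a reference, the List of integrations Global could hold a structure like the one below, written here as a Python literal of the JSON it would contain; the keys and values are assumptions for illustration.

```python
# Illustrative content for the "List of integrations" Global (keys and values are assumptions).
LIST_OF_INTEGRATIONS = [
    {"integration": "orders",    "sourceTable": "ORDERS",    "pageSize": 10_000, "target": "api"},
    {"integration": "invoices",  "sourceTable": "INVOICES",  "pageSize": 5_000,  "target": "api"},
    {"integration": "inventory", "sourceTable": "INVENTORY", "pageSize": 20_000, "target": "file"},
]
```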