Highly scalable ETL model for Digibee
Learn how to design a scalable ETL process using event-based pipelines.
Although an iPaaS like the Digibee Integration Platform is cloud-based software designed to connect applications across different systems and organizations in real time, the need to implement ETL (Extract, Transform, and Load) processes is often inevitable. Even though the capabilities of an iPaaS differ from those of a dedicated ETL tool, this type of solution can be implemented with high efficiency.
This document describes an ETL model used for the extraction, transformation (field mapping), and delivery of large volumes of records from a relational database to a target system using the Digibee Integration Platform.
The objective is to extract the data within a predefined time window by scaling events according to demand.
Consider a scenario where a high volume of data must be extracted from a relational database containing different tables, each representing a distinct business flow — in other words, several integrations. Altogether, these tables can represent about 20 million records per day.
Each table requires field mapping, and the data must be sent in batches to the target, which can be either an API or a File server.
Also consider that there is no restricted time window for data delivery at the destination. However, since extraction takes place from a critical-use database, it must be performed during off-peak hours (overnight). Therefore, it is important that the architecture is highly scalable and allows for parameterization by flow.
The implementation of this model follows an event-based architecture, consisting of:
Two initial pipelines for the Extract phase
Two pipelines that together handle the Transform and Load phases
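To make the event chain between these pipelines concrete, the sketch below outlines one possible set of event names and payloads. The topic names and payload fields are illustrative assumptions, not names defined by the Platform.

```python
# Hypothetical event chain for the ETL model (topic names are illustrative only).
#
#   Trigger pipeline          --publishes-->  "etl.extract.page"  (one event per page)
#   Extraction pipeline       --consumes----  "etl.extract.page"
#                             --publishes-->  "etl.load.page"     (after storing the page)
#   Transform/Load pipeline   --consumes----  "etl.load.page"

# Payload published by the trigger pipeline for a single page (assumed shape):
extract_page_event = {
    "integration": "orders",        # business flow / source table
    "page": 17,                     # page index within the extraction window
    "offset": 170_000,              # first record of the page
    "limit": 10_000,                # page size
    "executionDate": "2024-05-01",  # logical date of the daily run
}

# Payload published after extraction, pointing to the stored page (assumed shape):
load_page_event = {
    "integration": "orders",
    "page": 17,
    "objectStore": "orders-pages",               # flow-specific temporary Object Store
    "documentKey": "orders-2024-05-01-page-17",
    "executionDate": "2024-05-01",
}
```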
The architecture outlined in this article assumes a simple transformation, such as field mapping from source to target with a few basic rules.
However, if the scenario requires complex data transformations that depend on extensive internal processing, it may be necessary to evolve the architecture by introducing a dedicated event pipeline solely for transformation before the data is sent, as shown in the model below:
With the Digibee Integration Platform, evolving this model in a decoupled way is simple, offering great scalability.
In addition to the pipeline structure mentioned above, the following components are important for controlling and operating the ETL model:
Extraction page control: An Object Store is required to store the metadata of each page to be extracted.
Temporary Object Store: A dedicated Object Store per flow where extracted pages are stored before transformation and delivery.
Integration control: Although not shown in the diagram, it is highly recommended to maintain an Object Store for integration control to track the daily status of each integration in the List of integrations.
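As an illustration, the documents held by these control Object Stores could look like the sketch below. The field names and status values are assumptions used throughout the examples in this article, not a required schema.

```python
# Hypothetical document shapes for the control Object Stores (field names are assumptions).

# Extraction page control: one document per page to be extracted.
page_metadata = {
    "integration": "orders",
    "page": 17,
    "offset": 170_000,
    "limit": 10_000,
    "status": "PENDING",            # PENDING -> EXTRACTED -> DELIVERED / ERROR
    "executionDate": "2024-05-01",
}

# Integration control: one document per integration per day, tracking the overall status.
integration_control = {
    "integration": "orders",
    "executionDate": "2024-05-01",
    "totalPages": 200,
    "extractedPages": 120,
    "deliveredPages": 97,
    "status": "RUNNING",            # RUNNING -> COMPLETED / FAILED
}
```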
The architecture presented in this article ensures high scalability in the ETL process. Depending on the demand and constraints of each component, extraction and loading can be scaled independently, pacing each stage as needed. For example, if the extraction window at the source is shorter than the loading window at the target, we can speed up extraction by increasing replicas.
This decoupling of stages also prevents an outage in the target system from blocking the extraction stage, which can be fully completed and delivered later.
Additionally, the model enables sharing of common components across various processes, such as the page generation pipelines for extraction and the page dispatching pipeline.
A generic pipeline initiates the data extraction process. This pipeline acts as a trigger for integrations. Its sole responsibility is to generate page metadata for each flow and publish the extraction event.
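The trigger logic can be summarized as in the sketch below, which assumes a fixed page size and uses hypothetical save_page_metadata and publish_event helpers standing in for the Platform's Object Store and event publication steps.

```python
import math

PAGE_SIZE = 10_000  # assumed number of records per page


def save_page_metadata(doc: dict) -> None:
    """Stand-in for writing the page metadata to the extraction control Object Store."""
    print("store:", doc)


def publish_event(topic: str, payload: dict) -> None:
    """Stand-in for the Platform's event publication step."""
    print("publish:", topic, payload)


def generate_pages(integration: str, total_records: int, execution_date: str) -> list[dict]:
    """Split the total record count of a flow into page metadata entries."""
    total_pages = math.ceil(total_records / PAGE_SIZE)
    return [
        {
            "integration": integration,
            "page": page,
            "offset": page * PAGE_SIZE,
            "limit": PAGE_SIZE,
            "status": "PENDING",
            "executionDate": execution_date,
        }
        for page in range(total_pages)
    ]


def trigger(integrations: dict[str, int], execution_date: str) -> None:
    """For each flow, persist the page metadata and publish one extraction event per page."""
    for integration, total_records in integrations.items():
        for page_metadata in generate_pages(integration, total_records, execution_date):
            save_page_metadata(page_metadata)
            publish_event("etl.extract.page", page_metadata)


# Example run for a single flow with 25,000 records (3 pages):
trigger({"orders": 25_000}, "2024-05-01")
```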
A specific pipeline for each integration receives the page metadata upon publication, performs the extraction, and stores the extracted data in a temporary Object Store (specific to the flow).
As an event-driven process, this pipeline can be scaled based on the flow’s needs. Scaling can be in terms of pipeline size, to support the extracted page, or in terms of replicas, to handle multiple pages simultaneously. For instance, increasing replicas may be a good strategy to reduce extraction time.
On the Digibee Integration Platform, horizontal scaling (replicas) can be configured to adjust automatically. You only need to define the minimum and maximum number of replicas, and the Platform scales within that range.
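The work triggered by each extraction event can be sketched as follows. The sketch assumes a paged SQL query (LIMIT/OFFSET), a source table named after the integration, and hypothetical helpers standing in for the temporary Object Store write and the event publication.

```python
import sqlite3  # stand-in for the source relational database


def store_page(object_store: str, key: str, rows: list) -> None:
    """Stand-in for writing the extracted page to the flow-specific temporary Object Store."""
    print(f"store {len(rows)} rows in {object_store}/{key}")


def publish_event(topic: str, payload: dict) -> None:
    """Stand-in for the Platform's event publication step."""
    print("publish:", topic, payload)


def on_extract_page(event: dict, conn: sqlite3.Connection) -> None:
    """Consume one extraction event: read the page from the source and store it for delivery."""
    # The source table is assumed to share the integration's name.
    query = f"SELECT * FROM {event['integration']} LIMIT ? OFFSET ?"
    rows = conn.execute(query, (event["limit"], event["offset"])).fetchall()

    document_key = f"{event['integration']}-{event['executionDate']}-page-{event['page']}"
    store_page(f"{event['integration']}-pages", document_key, rows)
    publish_event("etl.load.page", {**event, "documentKey": document_key, "status": "EXTRACTED"})
```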
The transform-and-load pipeline is responsible for retrieving each page from the temporary Object Store, performing the transformation, and delivering it to the target. Like the extraction pipeline, this delivery pipeline should be specific to each integration.
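A minimal sketch of this pipeline is shown below, assuming a simple field mapping and delivery to a REST API; the mapping, endpoint, and helper functions are illustrative only.

```python
import requests

# Assumed source-to-target field mapping and target endpoint (illustrative only).
FIELD_MAP = {"order_id": "id", "order_total": "amount", "created_at": "date"}
TARGET_URL = "https://target.example.com/api/orders/batch"


def read_page(object_store: str, key: str) -> list[dict]:
    """Stand-in for reading an extracted page from the temporary Object Store."""
    return []


def mark_page_delivered(event: dict) -> None:
    """Stand-in for updating the page metadata and integration control Object Stores."""
    print("delivered:", event["documentKey"])


def on_load_page(event: dict) -> None:
    """Consume one load event: map the stored page to the target layout and deliver it as a batch."""
    rows = read_page(f"{event['integration']}-pages", event["documentKey"])
    batch = [{FIELD_MAP[k]: v for k, v in row.items() if k in FIELD_MAP} for row in rows]

    response = requests.post(TARGET_URL, json=batch, timeout=60)
    response.raise_for_status()
    mark_page_delivered(event)
```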
List of integrations (Flows): This can be a variable configured with a JSON or a table in the client's database. For this article, we used a Global (a sample structure is sketched after this list).
Common mechanisms: Existing reusable mechanisms, such as reprocessing, can support the ETL process.
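As a reference, the List of integrations Global could hold a structure like the one below, written here as a Python literal of the JSON it would contain; the keys and values are assumptions for illustration.

```python
# Illustrative content for the "List of integrations" Global (keys and values are assumptions).
LIST_OF_INTEGRATIONS = [
    {"integration": "orders",    "sourceTable": "ORDERS",    "pageSize": 10_000, "target": "api"},
    {"integration": "invoices",  "sourceTable": "INVOICES",  "pageSize": 5_000,  "target": "api"},
    {"integration": "inventory", "sourceTable": "INVENTORY", "pageSize": 20_000, "target": "file"},
]
```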