RTDIP Ingestion Pipeline Framework

RTDIP has been built to simplify ingesting and querying time series data. One of the most anticipated features of the Real Time Data Ingestion Platform for 2023 is the ability to create streaming and batch ingestion pipelines according to the requirements of the data source and the needs of the data consumer. Of equal importance is the need to query this data; an article focusing on egress will follow in due course.

Overview

The goals of the RTDIP Ingestion Pipeline framework are to:

  1. Support Python and PySpark for building pipeline components
  2. Enable execution of source, transformer, sink/destination and utility components in a framework that runs them in a defined order
  3. Create modular components that can be leveraged as a step in a pipeline task, using object-oriented programming techniques including interfaces and implementations per component type (see the sketch after this list)
  4. Deploy pipelines to popular orchestration engines
  5. Ensure pipelines can be constructed and executed using the RTDIP SDK and REST APIs
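
To illustrate goal 3, here is a minimal sketch of how component interfaces and implementations could be modelled. The class names, method names and the TextFileSource example are illustrative assumptions for this article, not the actual RTDIP SDK API.

# Illustrative sketch only: class and method names are assumptions,
# not the actual RTDIP SDK API.
from abc import ABC, abstractmethod

class SourceInterface(ABC):
    """Contract that every source component implements."""

    @abstractmethod
    def read_batch(self):
        ...

    @abstractmethod
    def read_stream(self):
        ...

class DestinationInterface(ABC):
    """Contract that every destination component implements."""

    @abstractmethod
    def write_batch(self, data):
        ...

    @abstractmethod
    def write_stream(self, data):
        ...

class TextFileSource(SourceInterface):
    """Toy batch-only source that reads lines from a local text file."""

    def __init__(self, path: str):
        self.path = path

    def read_batch(self):
        with open(self.path) as file:
            return file.readlines()

    def read_stream(self):
        raise NotImplementedError("This toy source is batch only")

Because every component of a given type shares the same interface, a pipeline step only needs to know which component class to instantiate and which parameters to pass to it.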

Pipeline Jobs

The RTDIP Data Ingestion Pipeline Framework will follow the typical job convention that users will recognise from orchestration engines such as Apache Airflow or Databricks Workflows.

A pipeline job consists of the following components:

erDiagram
  JOB ||--|{ TASK : contains
  TASK ||--|{ STEP : contains
  JOB {
    string name
    string description
    list task_list
  }
  TASK {
    string name
    string description
    string depends_on_task
    list step_list
    bool batch_task
  }
  STEP {
    string name
    string description
    list depends_on_step
    list provides_output_to_step
    class component
    dict component_parameters
  }

As per the above, a pipeline job consists of a list of tasks. Each task consists of a list of steps. Each step consists of a component and a set of parameters that are passed to the component. Dependency Injection will ensure that each component is instantiated with the correct parameters.
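
Below is a minimal sketch of these structures using plain Python dataclasses. The field names mirror the diagram above; the run_step helper is a hypothetical illustration of how dependency injection could instantiate a step's component from its declared parameters, and none of this is the actual RTDIP SDK implementation.

# Illustrative sketch of the job/task/step structures; names are assumptions,
# not the actual RTDIP SDK API.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    description: str
    component: type                                        # component class to instantiate
    component_parameters: dict = field(default_factory=dict)
    depends_on_step: list = field(default_factory=list)
    provides_output_to_step: list = field(default_factory=list)

@dataclass
class Task:
    name: str
    description: str
    step_list: list                                        # ordered list of Step objects
    depends_on_task: str = ""
    batch_task: bool = True

@dataclass
class Job:
    name: str
    description: str
    task_list: list                                        # ordered list of Task objects

def run_step(step: Step):
    """Dependency injection: build the step's component from its declared parameters."""
    return step.component(**step.component_parameters)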

Pipeline Runtime Environments

Pipelines will be able to run in multiple environment types. These will include:

  • Python: Components will be written in python and executed on a python runtime
  • Pyspark: Components will be written in pyspark and executed on an open source Apache Spark runtime
  • Databricks: Components will be written in pyspark and executed on a Databricks runtime
  • Delta Live Tables: Components will be written in pyspark and executed on a Databricks runtime and will write to Delta Live Tables

The runtime for a pipeline task is determined by the components it contains, in the following order of precedence (a minimal sketch of this rule follows the list below).

  • Pipelines with at least one Databricks or DLT component will be executed in a Databricks environment
  • Pipelines with at least one Pyspark component will be executed in a Pyspark environment
  • Pipelines with only Python components will be executed in a Python environment
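
The sketch below illustrates that precedence rule, assuming each component declares the runtime it needs. The SystemType enum and the system_type attribute are illustrative assumptions, not the actual RTDIP SDK API.

# Illustrative sketch of runtime precedence; names are assumptions.
from enum import IntEnum

class SystemType(IntEnum):
    PYTHON = 1
    PYSPARK = 2
    PYSPARK_DATABRICKS = 3    # covers Databricks and Delta Live Tables components

def required_runtime(components):
    """Return the highest runtime required by any component in a task."""
    return max(component.system_type for component in components)

class DemoComponent:
    def __init__(self, system_type):
        self.system_type = system_type

task_components = [DemoComponent(SystemType.PYTHON), DemoComponent(SystemType.PYSPARK)]
print(required_runtime(task_components).name)    # prints "PYSPARK"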

Pipeline Clouds

Certain components are related to specific cloud providers, and the tables below indicate which cloud provider each component relates to. This does not mean the component can only run in that cloud; it simply highlights that the component is related to that cloud provider.

Cloud Target
Azure Q1-Q2 2023
AWS Q2-Q4 2023
GCP 2024

Pipeline Orchestration

Pipelines will be able to be deployed to orchestration engines so that users can schedule and execute jobs using their preferred orchestration engine.

Orchestration Engine Target
Databricks Workflows Q2 2023
Airflow Q2 2023
Delta Live Tables Q3 2023
Dagster Q4 2023

Pipeline Components

The Real Time Data Ingestion Pipeline Framework will support the following components:

  • Sources - connectors to source systems
  • Transformers - perform transformations on data, including data cleansing, enrichment, aggregation, masking, encryption, decryption, validation, conversion, normalization, de-normalization, partitioning, etc.
  • Destinations - connectors to sink/destination systems
  • Utilities - components that perform utility functions such as logging, error handling, data object creation, authentication and maintenance
  • Edge - components that will perform edge functionality such as connectors to protocols like OPC

Pipeline Component Types

Component types determine the system requirements for executing a component:

  • Python - components that are written in python and can be executed on a python runtime
  • Pyspark - components that are written in pyspark and can be executed on an open source Apache Spark runtime
  • Databricks - components that require a Databricks runtime

Sources

Sources are components that connect to source systems and extract data from them. These will typically be real time data sources, but batch sources will also be supported as they remain important and necessary in many real world scenarios.

Source Type Python Apache Spark Databricks Azure AWS Target
Delta *✔ ✔ ✔ ✔ ✔ Q1 2023
Delta Sharing *✔ ✔ ✔ ✔ ✔ Q1 2023
Autoloader ✔ ✔ ✔ Q1 2023
Eventhub *✔ ✔ ✔ ✔ Q1 2023
IoT Hub *✔ ✔ ✔ ✔ Q2 2023
Kafka ✔ ✔ ✔ ✔ ✔ Q2 2023
Kinesis ✔ ✔ ✔ Q2 2023
SSIP PI Connector ✔ ✔ ✔ ✔ Q2 2023
Rest API ✔ ✔ ✔ ✔ ✔ Q2 2023
MongoDB ✔ ✔ ✔ ✔ ✔ Q3 2023

*✔ - target to deliver in the following quarter

There is currently no Spark connector for IoT Core. If you know a way to add it as a source component, please raise it by creating an issue on the GitHub repo.

Transformers

Transformers are components that perform transformations on data. They will target certain data models and the common transformations that source or destination components require before data can be ingested or consumed.

Transformer Type Python Apache Spark Databricks Azure AWS Target
Eventhub Body ✔ ✔ ✔ Q1 2023
OPC UA ✔ ✔ ✔ ✔ ✔ Q2 2023
OPC AE ✔ ✔ ✔ ✔ ✔ Q2 2023
SSIP PI ✔ ✔ ✔ ✔ ✔ Q2 2023
OPC DA ✔ ✔ ✔ ✔ ✔ Q3 2023

*✔ - target to deliver in the following quarter

This list will dynamically change as the framework is developed and new components are added.

Destinations

Destinations are components that connect to sink/destination systems and write data to them.

Destination Type Python Apache Spark Databricks Azure AWS Target
Delta Append *✔ ✔ ✔ ✔ ✔ Q1 2023
Eventhub *✔ ✔ ✔ ✔ Q1 2023
Delta Merge ✔ ✔ ✔ ✔ Q2 2023
Kafka ✔ ✔ ✔ ✔ ✔ Q2 2023
Kinesis ✔ ✔ ✔ Q2 2023
Rest API ✔ ✔ ✔ ✔ ✔ Q2 2023
MongoDB ✔ ✔ ✔ ✔ ✔ Q3 2023
Polygon Blockchain ✔ ✔ ✔ Q3 2023

*✔ - target to deliver in the following quarter

Utilities

Utilities are components that perform utility functions such as logging, error handling, data object creation and maintenance, and are normally components that can be executed as part of a pipeline or standalone.

Utility Type Python Apache Spark Databricks Azure AWS Target
Delta Table Create *✔ ✔ ✔ ✔ ✔ Q1 2023
Delta Optimize ✔ ✔ ✔ ✔ Q2 2023
Delta Vacuum *✔ ✔ ✔ ✔ ✔ Q2 2023
Set ADLS Gen2 ACLs ✔ ✔ Q2 2023
Set S3 ACLs ✔ ✔ Q2 2023
Great Expectations ✔ ✔ ✔ ✔ ✔ Q3 2023

*✔ - target to deliver in the following quarter

Secrets

Secrets are components that perform authentication functions and are normally components that can be executed as part of a pipeline or standalone.

Secrets Type Python Apache Spark Databricks Azure AWS Target
Databricks Secrets ✔ ✔ ✔ Q2 2023
Hashicorp Vault ✔ ✔ ✔ ✔ ✔ Q2 2023
Azure Key Vault ✔ ✔ ✔ ✔ ✔ Q3 2023
AWS Secrets Manager ✔ ✔ ✔ ✔ ✔ Q3 2023

Edge

Edge components are designed to provide a lightweight, low latency, low resource consumption data ingestion framework for edge devices. These components will be designed to run on edge devices such as the Raspberry Pi, Jetson Nano, etc. For cloud providers, they will target AWS Greengrass and Azure IoT Edge.

Edge Type Azure IoT Edge AWS Greengrass Target
OPC CloudPublisher ✔ Q3-Q4 2023
Fledge ✔ ✔ Q3-Q4 2023
Edge X ✔ ✔ Q3-Q4 2023

Conclusion

This is a very high level overview of the framework and the components that will be developed. As the framework is open source, the lists and timelines defined above can change depending on circumstances and resource availability. 2023 is an exciting year for the Real Time Data Ingestion Platform. Check back regularly for updates and new features! If you would like to contribute, please visit our repository on GitHub and connect with us on our Slack channel in the LF Energy Foundation Slack workspace.