
Pipeline Components

Overview

The Real Time Data Ingestion Pipeline Framework supports the following component types:

  • Sources - connectors to source systems
  • Transformers - perform transformations on data, such as cleansing, enrichment, aggregation, masking, encryption and decryption, validation, conversion, normalization, de-normalization and partitioning
  • Destinations - connectors to sink/destination systems
  • Utilities - components that perform utility functions such as logging, error handling, data object creation, authentication and maintenance
  • Secrets - components that facilitate access to secret stores where sensitive information such as passwords, connection strings and keys is stored
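The component types above form a simple chain: a source produces records, transformers modify them in turn, and a destination consumes them. A minimal, illustrative sketch in plain Python (these classes and method names are hypothetical stand-ins, not the framework's actual API):

```python
# Illustrative sketch of how the core component types relate.
# These classes are hypothetical, not the RTDIP SDK's real classes.

class Source:
    """Extracts data from a source system."""
    def read(self):
        return [{"tag": "sensor-1", "value": "42"}]

class Transformer:
    """Performs a transformation on the data."""
    def transform(self, records):
        # e.g. a simple conversion step: cast values to float
        return [{**r, "value": float(r["value"])} for r in records]

class Destination:
    """Writes data to a sink/destination system."""
    def write(self, records):
        self.written = records

# A pipeline is then the chain: source -> transformer(s) -> destination
source, transformer, destination = Source(), Transformer(), Destination()
destination.write(transformer.transform(source.read()))
```

Utilities and Secrets components sit alongside this chain, supporting it rather than carrying data through it.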

Component Types


Component Types determine system requirements to execute the component:

  • Python - components that are written in Python and can be executed on a Python runtime
  • PySpark - components that are written in PySpark and can be executed on an open source Apache Spark runtime
  • Databricks - components that require a Databricks runtime
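One way to think about a component type is as a declared runtime requirement that can be checked before execution. A small hypothetical sketch (the mapping below is an illustrative assumption, e.g. that a Databricks runtime can also satisfy PySpark and Python components, and is not taken from the framework):

```python
# Hypothetical runtime check: each component declares the runtime it
# needs, and execution is refused when the available runtime cannot
# satisfy it. The SATISFIES mapping is an illustrative assumption.
SATISFIES = {
    "python": {"python", "pyspark", "databricks"},
    "pyspark": {"pyspark", "databricks"},
    "databricks": {"databricks"},
}

def can_run(component_type: str, available_runtime: str) -> bool:
    """Return True if the available runtime satisfies the component type."""
    return available_runtime in SATISFIES[component_type]

assert can_run("python", "databricks")       # Python components run anywhere
assert not can_run("databricks", "pyspark")  # Databricks-only components do not
```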

Note

The RTDIP team is continuously adding to this list. For detailed information on timelines, read this blog post and check back on this page regularly.

Sources

Sources are components that connect to source systems and extract data from them. These are typically real time data sources, but batch components are also supported, as batch sources of time series data remain important and necessary in a number of real-world circumstances.
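The real time vs. batch distinction above can be pictured as two read styles: a batch source returns a bounded set of rows at once, while a streaming source yields records as they arrive. A plain-Python sketch (hypothetical classes, not the SDK's source connectors):

```python
# Illustrative batch vs. streaming source shapes (hypothetical, not the SDK API).
from typing import Iterator

class BatchCsvSource:
    """Batch-style source: reads a bounded set of time series rows at once."""
    def __init__(self, lines):
        self.lines = lines

    def read_batch(self):
        return [dict(zip(("timestamp", "value"), ln.split(","))) for ln in self.lines]

class StreamingSource:
    """Streaming-style source: yields records one at a time as they arrive."""
    def __init__(self, events):
        self.events = events

    def read_stream(self) -> Iterator[dict]:
        yield from self.events

batch = BatchCsvSource(["2024-01-01T00:00:00,1.5", "2024-01-01T00:01:00,1.7"])
rows = batch.read_batch()

stream = StreamingSource([{"value": 1.5}, {"value": 1.7}])
first_event = next(stream.read_stream())
```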

Source Type Python Apache Spark Databricks Azure AWS
Delta ✔ ✔ ✔ ✔
Delta Sharing ✔ ✔ ✔ ✔
Autoloader ✔ ✔ ✔
Eventhub ✔ ✔ ✔ ✔
Eventhub Kafka ✔ ✔ ✔ ✔
IoT Hub ✔ ✔ ✔ ✔
Kafka ✔ ✔ ✔ ✔
Kinesis ✔ ✔ ✔
MISO Daily Load ISO ✔ ✔ ✔ ✔
MISO Historical Load ISO ✔ ✔ ✔ ✔
PJM Daily Load ISO ✔ ✔ ✔ ✔
PJM Historical Load ISO ✔ ✔ ✔ ✔
CAISO Daily Load ISO ✔ ✔ ✔ ✔
CAISO Historical Load ISO ✔ ✔ ✔ ✔
ERCOT Daily Load ISO ✔ ✔ ✔ ✔
Weather Forecast API V1 ✔ ✔ ✔ ✔
Weather Forecast API V1 Multi ✔ ✔ ✔ ✔
ECMWF MARS Weather Forecast ✔ ✔ ✔ ✔
MFFBAS API ✔ ✔ ✔ ✔ ✔
ENTSO-E API ✔ ✔ ✔ ✔ ✔

Note

This list will dynamically change as the framework is further developed and new components are added.

Transformers

Transformers are components that perform transformations on data. They target specific data models and the common transformations that source or destination components require before data can be ingested or consumed.
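As an illustration of the kind of work a transformer does, the "Binary To String" entry below decodes binary payloads (for example, message bodies from an event stream) into text before downstream parsing. A plain-Python analogue, assuming records carry a bytes `body` field (the framework's transformer operates on DataFrames, so this is illustrative only):

```python
# Plain-Python analogue of a "binary to string" transformation step.
# Assumes each record has a bytes "body" field (illustrative assumption).

def binary_to_string(records, encoding="utf-8"):
    """Decode the 'body' bytes field on each record into text."""
    return [{**r, "body": r["body"].decode(encoding)} for r in records]

messages = [{"offset": 0, "body": b'{"tag": "FIC101", "value": 3.2}'}]
decoded = binary_to_string(messages)
# decoded[0]["body"] is now a JSON string, ready for downstream parsing
```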

Transformer Type Python Apache Spark Databricks Azure AWS
Binary To String ✔ ✔ ✔ ✔
OPC Publisher OPCUA Json To Process Control Data Model ✔ ✔ ✔ ✔
OPC Publisher OPCAE Json To Process Control Data Model ✔ ✔ ✔ ✔
Fledge OPCUA Json To Process Control Data Model ✔ ✔ ✔ ✔
EdgeX OPCUA Json To Process Control Data Model ✔ ✔ ✔ ✔
SSIP PI Binary Files To Process Control Data Model ✔ ✔ ✔ ✔
SSIP PI Binary JSON To Process Control Data Model ✔ ✔ ✔ ✔
SEM Json To Process Control Data Model ✔ ✔ ✔ ✔
Honeywell APM Json To Process Control Data Model ✔ ✔ ✔ ✔
Process Control Data Model To Honeywell APM Json ✔ ✔ ✔ ✔
Mirico Json To Process Control Data Model ✔ ✔ ✔ ✔
Pandas to PySpark DataFrame Conversion ✔ ✔ ✔ ✔
PySpark to Pandas DataFrame Conversion ✔ ✔ ✔ ✔
MISO To Meters Data Model ✔ ✔ ✔ ✔
Raw Forecast to Weather Data Model ✔ ✔ ✔ ✔
PJM To Meters Data Model ✔ ✔ ✔ ✔
ERCOT To Meters Data Model ✔ ✔ ✔ ✔
CAISO To Meters Data Model ✔ ✔ ✔ ✔
ECMWF NC Forecast Extract Point To Weather Data Model ✔ ✔ ✔ ✔
ECMWF NC Forecast Extract Grid To Weather Data Model ✔ ✔ ✔ ✔

Note

This list will dynamically change as the framework is further developed and new components are added.

Destinations

Destinations are components that connect to sink/destination systems and write data to them.
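The table below distinguishes an append-style destination ("Delta") from a merge-style one ("Delta Merge"): an append write adds every incoming record as a new row, while a merge (upsert) updates the existing row when a key matches and inserts otherwise. A sketch of that difference, using a dict as a stand-in for a table (illustrative only, not the SDK's write semantics in full):

```python
# Illustrative append vs. merge (upsert) write behavior.
# A list and a dict stand in for the destination table.

def append_write(table, records):
    """Append: every incoming record becomes a new row."""
    table.extend(records)

def merge_write(table_by_key, records, key="tag"):
    """Merge (upsert): update the row when the key exists, insert otherwise."""
    for r in records:
        table_by_key[r[key]] = r

table = []
append_write(table, [{"tag": "A", "value": 1}])
append_write(table, [{"tag": "A", "value": 2}])   # two rows for tag A

latest = {}
merge_write(latest, [{"tag": "A", "value": 1}])
merge_write(latest, [{"tag": "A", "value": 2}])   # one row, updated in place
```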

Destination Type Python Apache Spark Databricks Azure AWS
Delta ✔ ✔ ✔ ✔
Delta Merge ✔ ✔ ✔ ✔
Eventhub ✔ ✔ ✔ ✔
Kafka ✔ ✔ ✔ ✔
Eventhub Kafka ✔ ✔ ✔ ✔
Kinesis ✔ ✔ ✔
Rest API ✔ ✔ ✔ ✔
Process Control Data Model To Delta ✔ ✔ ✔
Process Control Data Model Latest Values To Delta ✔ ✔ ✔
EVM ✔ ✔ ✔

Note

This list will dynamically change as the framework is further developed and new components are added.

Utilities

Utilities are components that perform utility functions such as logging, error handling, data object creation, authentication and maintenance, and can normally be executed either as part of a pipeline or standalone.
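The "pipeline or standalone" property can be pictured as a single entry point that works in both settings. A hypothetical sketch (the class and its `execute()` method are illustrative, not the framework's interface):

```python
# Hypothetical sketch: a utility exposes a single execute() entry point,
# so it can run as a pipeline step or be invoked standalone.

class VacuumTableUtility:
    """Stand-in for a table-maintenance utility (e.g. a Delta vacuum)."""
    def __init__(self, table_name: str):
        self.table_name = table_name

    def execute(self) -> bool:
        # A real implementation would remove stale files; here we just record it.
        self.last_action = f"vacuumed {self.table_name}"
        return True

# Standalone invocation
util = VacuumTableUtility("sensor_readings")
assert util.execute()

# Pipeline invocation: just another step in a list of executable components
pipeline_steps = [VacuumTableUtility("sensor_readings")]
results = [step.execute() for step in pipeline_steps]
```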

Utility Type Python Apache Spark Databricks Azure AWS
Spark Session ✔ ✔ ✔ ✔
Spark Configuration ✔ ✔ ✔ ✔
Delta Table Create ✔ ✔ ✔ ✔
Delta Table Optimize ✔ ✔ ✔ ✔
Delta Table Vacuum ✔ ✔ ✔ ✔
AWS S3 Bucket Policy ✔ ✔ ✔ ✔ ✔
AWS S3 Copy ✔ ✔ ✔ ✔ ✔
ADLS Gen 2 ACLs ✔ ✔ ✔ ✔ ✔
Azure Autoloader Resources ✔ ✔ ✔ ✔ ✔
Spark ADLS Gen 2 Service Principal Connect ✔ ✔ ✔ ✔

Note

This list will dynamically change as the framework is further developed and new components are added.

Secrets

Secrets are components that perform functions to interact with secret stores to manage sensitive information such as passwords, keys and certificates.
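The value of a secrets component is that other components retrieve sensitive values through a common interface instead of hard-coding them. A minimal sketch using environment variables as a stand-in backend (illustrative only; the backend and `get()` interface here are assumptions, not the SDK's secrets API):

```python
# Hypothetical secret-store sketch: sensitive values are fetched through
# a get() interface rather than embedded in pipeline code.
import os

class EnvVarSecrets:
    """Illustrative secret backend reading from environment variables."""
    def get(self, name: str) -> str:
        value = os.environ.get(name)
        if value is None:
            raise KeyError(f"secret {name!r} not found")
        return value

os.environ["DB_PASSWORD"] = "s3cret"      # stand-in for a real secret store
secrets = EnvVarSecrets()
connection_string = f"user=ingest;password={secrets.get('DB_PASSWORD')}"
```

Real backends such as Databricks Secret Scopes, Hashicorp Vault or Azure Key Vault would replace the environment-variable lookup, while the calling code stays the same.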

Secret Type Python Apache Spark Databricks Azure AWS
Databricks Secret Scopes ✔ ✔ ✔
Hashicorp Vault ✔ ✔ ✔ ✔ ✔
Azure Key Vault ✔ ✔ ✔ ✔ ✔

Note

This list will dynamically change as the framework is further developed and new components are added.

Conclusion

Components can be used to build RTDIP Pipelines, which are described in more detail here.