Apache Airflow

Databricks Provider

Apache Airflow can orchestrate an RTDIP Pipeline that has been deployed as a Databricks Job. For further information on how to deploy an RTDIP Pipeline as a Databricks Job, please see here.

Databricks has also provided more information about running Databricks jobs from Apache Airflow here.

Prerequisites

  1. An Apache Airflow instance must be running.
  2. Authentication between Apache Airflow and Databricks must be configured (a sample connection setup is shown after this list).
  3. The Python packages apache-airflow and apache-airflow-providers-databricks must be installed.
  4. An RTDIP Pipeline must be created and deployed to Databricks.
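
Where the Databricks connection still needs to be configured, the following is a minimal sketch of one way to do it, assuming a recent Airflow version that supports JSON-serialized connections in environment variables. The connection id databricks_default matches the example DAG below; the workspace URL and personal access token are placeholders.

import json
import os

# Airflow reads connections from environment variables named AIRFLOW_CONN_<CONN_ID>.
# The workspace URL and token below are placeholders and must be replaced.
os.environ["AIRFLOW_CONN_DATABRICKS_DEFAULT"] = json.dumps({
    "conn_type": "databricks",
    "host": "https://<your-workspace>.cloud.databricks.com",
    "password": "<databricks-personal-access-token>",
})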

Example

The JOB_ID in the example below can be obtained from the Databricks Job that was created when the RTDIP Pipeline was deployed.

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from airflow.utils.dates import days_ago

# Replace with the Job ID of the Databricks Job that runs the RTDIP Pipeline
JOB_ID = "<databricks-job-id>"

default_args = {
    'owner': 'airflow'
}

with DAG(
    'databricks_dag',
    start_date=days_ago(2),
    schedule_interval=None,
    default_args=default_args
) as dag:

    # Trigger the deployed Databricks Job using the databricks_default connection
    opr_run_now = DatabricksRunNowOperator(
        task_id='run_now',
        databricks_conn_id='databricks_default',
        job_id=JOB_ID
    )