
Chat Open AI Databricks Connector

The Chat Open AI (Chat GPT) Databricks Connector leverages LangChain to execute SQL queries on tables in Databricks. Keep in mind that you can try this out on any data in Databricks; it is not limited to data ingested by RTDIP pipelines.

Note

This is experimental and you will likely experience variable responses to your questions depending on the complexity of the data you use in this setup. Start small, with only 2-3 tables, before scaling up.

Prerequisites

  1. Obtain an Open AI API key. Register on the Open AI platform, follow these instructions to obtain your API key, and familiarize yourself with the Open AI documentation.
  2. Access to Databricks SQL or a Databricks Cluster and data stored as tables in Unity Catalog or Hive Metastore.

Warning

Consider the implications of exposing your data with Open AI models and seek approval prior to registering with any of these services.

Setup the SQL AI Agent

With all the prerequisites in place, it's time to set up the SQL AI Agent.

First, import the required RTDIP SDK component:

from rtdip_sdk.connectors import ChatOpenAIDatabricksConnection

Next, configure the component with all the relevant connection information:

agent = ChatOpenAIDatabricksConnection(
    catalog="<databricks catalog>",
    schema="<databricks schema>",
    server_hostname="<databricks host name>",
    http_path="<databricks http path>",
    access_token="<Azure AD token or databricks PAT token>",
    openai_api_key="<Open AI API key>",
    openai_model="gpt-4",
    sample_rows_in_table_info=5,
    verbose_logging=True
)

Some notes on the above:

  • server_hostname and http_path can be obtained from your Databricks SQL Warehouse or Databricks Cluster.
  • access_token can be either a Databricks PAT Token or an Azure AD Token. To obtain an Azure AD token, please refer to this documentation.
  • openai_model defaults to gpt-4, which is not easily available at the time of writing. Alternatively, gpt-3.5-turbo-16k-0613 has worked well in our tests.
  • sample_rows_in_table_info limits the number of rows queried in a table when the SQL Database Agent is looking for context in the data. Be careful not to increase this too much, as it is then possible to exceed token limits on the gpt models.
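Rather than hardcoding tokens and keys in source as in the snippet above, a common pattern is to read them from environment variables. A minimal sketch, assuming illustrative environment variable names (these are not names the RTDIP SDK requires):

```python
import os

# Hypothetical helper: gather connection settings from environment
# variables so secrets stay out of source control. The variable names
# below are illustrative assumptions, not required by the RTDIP SDK.
def connection_settings_from_env():
    required = [
        "DATABRICKS_SERVER_HOSTNAME",
        "DATABRICKS_HTTP_PATH",
        "DATABRICKS_TOKEN",
        "OPENAI_API_KEY",
    ]
    missing = [name for name in required if name not in os.environ]
    if missing:
        raise EnvironmentError(f"Missing environment variables: {missing}")
    return {
        "server_hostname": os.environ["DATABRICKS_SERVER_HOSTNAME"],
        "http_path": os.environ["DATABRICKS_HTTP_PATH"],
        "access_token": os.environ["DATABRICKS_TOKEN"],
        "openai_api_key": os.environ["OPENAI_API_KEY"],
        # Optional settings fall back to the documented defaults
        "openai_model": os.environ.get("OPENAI_MODEL", "gpt-4"),
        "sample_rows_in_table_info": int(os.environ.get("SAMPLE_ROWS", "5")),
    }
```

The resulting dictionary can then be unpacked into the constructor, e.g. `ChatOpenAIDatabricksConnection(catalog="...", schema="...", **connection_settings_from_env())`.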

Execute Queries

With the ChatOpenAIDatabricksConnection class configured, it's now possible to execute queries. Try out some questions with the following code:

response = agent.run("What was the average actual power generated by Turbine 1 at ACME Wind Farm on 6 May?")
print(response)

And the answer:

"The average power generated for turbine 1 at Acme on 6 May is 847.4923418083226."

Limitations

The biggest limitation experienced to date is the token limit on the Chat GPT models. These limits can be quickly exceeded as the SQL AI Agent queries the data and uses the responses as tokens to find context and answers to questions. Some of this can be reduced by using the Chat GPT models that allow for higher token limits, like the gpt-3.5-turbo-16k-0613 model. However, if the Databricks schema has many tables and the agent needs to run a number of queries to determine the answer, 16k tokens is still easily exceeded.
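A rough way to anticipate token-limit problems is to estimate token counts before sending text to the model. The four-characters-per-token figure below is only a heuristic for English text, not the model's real tokenizer (an exact count would require the model's tokenizer, e.g. the tiktoken library):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.

    This is a heuristic, not the model's tokenizer; use it only as an
    early warning, not an exact budget.
    """
    return max(1, len(text) // 4)

def fits_context(text: str, limit: int = 16384) -> bool:
    """Check an estimate against a model's context window, e.g. the
    16k-token window of gpt-3.5-turbo-16k-0613."""
    return estimate_tokens(text) <= limit

sample = "SELECT * FROM acme_wind_farm.turbine_readings LIMIT 5"
fits_context(sample)  # True: this snippet is far below 16k tokens
```

Estimating the combined size of table schemas and sample rows this way can help decide whether to reduce sample_rows_in_table_info or narrow the schema before queries start failing.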

The connector does not perform much cross-querying of tables to answer questions. For example, the agent does not try to query a metadata table and then use the result to answer a question about related data in another table.

The answers are not always the same; asking the same question multiple times did not always provide the same answer (however, because it is an AI-based solution, there can be good reasons for that).