Using BigQuery DataFrames with dbt Python models
Introduction
In this guide, you'll learn how to set up dbt so you can use it with BigQuery DataFrames (BigFrames):
- Build scalable data transformation pipelines using dbt and Google Cloud, with SQL and Python.
- Leverage BigFrames from dbt Python models to generate scalable BigQuery SQL.
In addition to the existing Dataproc/PySpark-based submission methods for executing Python models, you can now use the BigFrames submission method to execute Python models with pandas-like and scikit-learn-like APIs, without the need for any Spark setup or knowledge.
BigQuery DataFrames is an open source Python package that transpiles pandas and scikit-learn code to scalable BigQuery SQL. The dbt-bigquery adapter relies on the BigQuery Studio Notebook Executor Service to run the Python client-side code.
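To get a feel for the pandas-like API, here is a minimal standalone sketch (the table and column names are made up for illustration and are not part of this guide's setup); the transformations run as BigQuery SQL rather than locally:
import bigframes.pandas as bpd

# Read a BigQuery table into a BigFrames DataFrame (lazy; nothing is downloaded yet)
df = bpd.read_gbq("your-project.your_dataset.orders")  # hypothetical table

# pandas-like filtering and aggregation, transpiled to BigQuery SQL
totals = df[df["status"] == "complete"].groupby("customer_id")["amount"].sum()

# Execution happens in BigQuery; only the small result is pulled back
print(totals.head())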
Prerequisites
- A Google Cloud account
- A dbt Cloud account
- Basic to intermediate knowledge of SQL and Python.
- Basic understanding of dbt fundamentals. We recommend the dbt Fundamentals course.
What you'll build
Here's what you'll build in two parts:
- Google Cloud project setup
  - A one-time setup to configure the Google Cloud project you’ll be working with.
- Build and run the Python model
  - Create, configure, and execute a Python model using BigQuery DataFrames and dbt.
You will set up the environments, build scalable pipelines in dbt, and execute a Python model.
Figure 1 - Implementation of the BigFrames submission method for dbt Python models
Configure Google Cloud
The dbt BigFrames submission method supports both service account and OAuth credentials. You will use a service account in the following steps; a sketch of how these pieces map onto a dbt profile follows the setup steps at the end of this section.
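The commands in the following steps reference two shell variables. Set them to match your environment before you begin (the values shown are placeholders; in Cloud Shell, GOOGLE_CLOUD_PROJECT is usually pre-set):
# Placeholders: replace with your project ID and preferred region
export GOOGLE_CLOUD_PROJECT="your-project-id"
export REGION="us-central1"

# Point gcloud at the same project
gcloud config set project "${GOOGLE_CLOUD_PROJECT}"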
- Create a new Google Cloud project
a. New projects come with a set of APIs already enabled; the BigQuery API is required for this guide.
b. If it isn't enabled yet, enable the BigQuery API; doing so also enables several related BigQuery APIs automatically.
- Create a service account and grant IAM permissions
This service account will be used by dbt to read and write data on BigQuery and to use BigQuery Studio Notebooks.
Create the service account and grant it the required IAM roles:
# Create the service account
gcloud iam service-accounts create dbt-bigframes-sa
# Grant the BigQuery User role
gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} --member=serviceAccount:dbt-bigframes-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role=roles/bigquery.user
# Grant the BigQuery Data Editor role. This can be restricted at the dataset level
gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} --member=serviceAccount:dbt-bigframes-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role=roles/bigquery.dataEditor
# Grant the Service Account User role
gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} --member=serviceAccount:dbt-bigframes-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role=roles/iam.serviceAccountUser
# Grant the Colab Enterprise User role
gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} --member=serviceAccount:dbt-bigframes-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role=roles/aiplatform.colabEnterpriseUser
- (Optional) Create a test BigQuery dataset
Create a new BigQuery dataset if you don't already have one. The dataset name below is derived from the project ID, with dashes replaced by underscores:
# Create the BigQuery dataset
bq mk --location=${REGION} "$(echo "${GOOGLE_CLOUD_PROJECT}" | tr '-' '_')_dataset"
- Create a GCS bucket to stage the Python code
For temporary code and log storage, create a GCS bucket and assign the required permissions:
# Create the GCS bucket
gcloud storage buckets create gs://${GOOGLE_CLOUD_PROJECT}-bucket --location=${REGION}
# Grant the Storage Admin role on the bucket to the service account
gcloud storage buckets add-iam-policy-binding gs://${GOOGLE_CLOUD_PROJECT}-bucket --member=serviceAccount:dbt-bigframes-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role=roles/storage.admin
- Create a GCS bucket to hold the logs
# Create the logs GCS bucket
gcloud storage buckets create gs://${GOOGLE_CLOUD_PROJECT}-bucket-logs --location=${REGION}
# Grant the Storage Admin role on the bucket to the service account
gcloud storage buckets add-iam-policy-binding gs://${GOOGLE_CLOUD_PROJECT}-bucket-logs --member=serviceAccount:dbt-bigframes-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role=roles/storage.admin
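With the service account, dataset, and buckets in place, dbt needs credentials and a BigQuery target. Below is a minimal sketch, assuming you export a JSON key for dbt-bigframes-sa and reuse the resources created above; the profile name, file paths, and region value are placeholders, and the gcs_bucket / dataproc_region keys used for staging Python models should be verified against your dbt-bigquery version.
# Assumption: key-based authentication rather than OAuth
# Export a JSON key for the service account
gcloud iam service-accounts keys create dbt-bigframes-sa-key.json --iam-account=dbt-bigframes-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com

# profiles.yml (sketch)
my_dbt_project:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your-project-id                      # placeholder: your GOOGLE_CLOUD_PROJECT
      dataset: your_project_id_dataset              # placeholder: the dataset created above
      keyfile: /path/to/dbt-bigframes-sa-key.json   # placeholder: path to the exported key
      location: us-central1                         # placeholder: same region as your resources
      threads: 4
      gcs_bucket: your-project-id-bucket            # assumption: staging bucket for Python models
      dataproc_region: us-central1                  # assumption: verify the exact key name for your adapter version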
Create, configure, and execute your Python models
- In your dbt project, create a SQL model in your models directory, ending in the .sql file extension. Name it my_sql_model.sql.
- In the file, copy this SQL into it:
select
  1 as foo,
  2 as bar
- Now create a new model file in the models directory, named my_first_python_model.py.
- In the my_first_python_model.py file, add this code:
def model(dbt, session):
    dbt.config(submission_method="bigframes")
    bdf = dbt.ref("my_sql_model")  # loads the model from the previous step
    return bdf
- Configure the BigFrames submission method by using either:
a. Project-level configuration via dbt_project.yml:
models:
  my_dbt_project:
    submission_method: bigframes
    python_models:
      +materialized: view
or
b. The Python code via dbt.config in the my_first_python_model.py file:
def model(dbt, session):
    dbt.config(submission_method="bigframes")
    # rest of the python code...
Run
dbt run
-
You can view the logs in dbt logs. You can optionally view the codes and logs (including previous executions) from the Colab Enterprise Executions tab and GCS bucket from the GCP console.
-
Congrats! You just created your first two python models to run on BigFrames!
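From here you can grow the Python model beyond a pass-through. The sketch below is a hedged illustration of a possible next step, assuming the foo and bar columns from the sample SQL model above (the derived column name is made up):
def model(dbt, session):
    dbt.config(submission_method="bigframes")
    bdf = dbt.ref("my_sql_model")

    # pandas-like column math; BigFrames transpiles it to BigQuery SQL
    bdf["foo_plus_bar"] = bdf["foo"] + bdf["bar"]

    # Return the BigFrames DataFrame; dbt materializes it per the configured materialization
    return bdf
To iterate on just these two models, you can run dbt run --select my_sql_model my_first_python_model.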