BigQuery ML with Scheduled Queries and Partitioned Tables

Google’s BigQuery is a perfect fit as Marketing Analytics Data Warehouse. Google Analytics 360 uses BigQuery. If you are using free Google Analytics and Google Ads, you can also import their data into BigQuery via Google App Script or Ads Script. External data can be imported to enhance this data and this sets no limits in building own machine learning models. Models include forecasting, regression, classification, clustering, recommendation systems and more.

BigQuery ML Business Uses Cases for Marketing, Sales and more.

Google ML enables you to apply machine learning to BigQuery data by using standard SQL syntax. This is groundbreaking for Marketing Analytics, since it increases model development speed by:

  • less programming need => no more Python code,
  • enabling to broader ML practitioners => standard SQL is enough to apply ML,
  • quick model building and availability => no more ETL jobs to move data.

Since September 2018 Google allows you to schedule queries. This is an important detail, because it allows us to save the predictions, which are made by SQL queries. You have to use partitioned tables if you want to use Google BigQuery Machine Learning (ML) by scheduling queries.

How BigQuery ML works

1. Create Model

Create a simple model with existing customers. As an example, let’s use some simple values to predict the sale of a new customer.

Schedule the query to run every month to update the model regularly.

2. Predict the future event

Predict the sale of new customers. New customers can be from yesterday, last hour, etc.

Schedule the query and save results in a destination table.

3. Save predictions by scheduling the predictions query

You can schedule the prediction query, daily, hourly, weekly, monthly, custom and on demand.

Destination table can be a partitioned table, partitioned by field DATE.

Machine Learning with BigQuery in the past

If you were using Google Cloud for Machine Learning in the past, you needed to write Python models in Google Functions and trigger the functions by new files in Google Cloud Storage. The file transfers from BigQuery to Cloud Storage were steered by Cloud Scheduler and Cloud Pub/Sub and the daily predictions transfers back to BigQuery were done with additional Python Code, since BigQuery transfers were not appending data into existing tables.

This overhead on programmatic architecture is now obsolete. Historical data to build the models are in BigQuery, ML is done within BigQuery and the predictions are available in BigQuery without the need of additional engineering work.

Comparison partitioned tables vs. fragmented tables

BigQuery was append-only by design. Deleting single rows was not possible in the past. Recommendation was to use daily fragmented tables. Most of data import in business are on daily basis, so this action solved the majority of the problems. By doing daily import to fragmented tables, you could easily replace one day as a standalone table by deleting the wrong day. This was handy to solve errors in daily import jobs. You did not have to delete everything from huge datasets and recreate again.

Fragmented Tables save daily data into single tables by naming them correspondingly.

[PREFIX]_YYYYMMDD

Google allows deleting rows from tables now. This enables the deletion of specific days of partitioned tables. This makes fragmented tables obsolete. Partitioned tables have more advantages.

Partitioned TablesFragmented Tables
Tables are partitioned based on
DATE, TIMESTAMP, or DATETIME as column
Ingestion time as load or arrival time
Integer range as column
Tables are partitioned based on
Table name [PREFIX]_YYYYMMDD
Fast in query performanceSlower performance
Each table has a copy of the schema and meta data
BigQuery has to confirm the access rights to each table within the given data range
Partition by number, ingestion time, day, week, month, yearFragmentation by day only
4.000 partitonsUnlimited fragments but query max 1.000 “dates”
Recommended best practice from Google

Cheat Sheet

  • BigQuery ML enables fast development for Marketing Analytics.
  • Use partitioned tables to build BigQuery ML to schedule the predictions query.
  • You can delete partitioned if errors occur.

Join the conversation on LinkedIn

More Similar Posts

Menu