How to use Google BigQuery Views in Data Pipelines

How to use Google BigQuery Views in Data Pipelines

Use views to load multiple tables with wildcards

In Data Pipelines it is possible to consume BigQuery views and combine them with other datasets. From the Data Pipelines user's perspective there is no difference between a view and a table.

It needs to be noted, however, that extra processing costs may be incurred on Google's side as the view needs to be recomputed each time it is accessed.

One of the most common use cases for views is reading and taking the union of multiple tables with the same schema but different names. For example, let's say a new BigQuery table is created every day called mydataset.user_data_<timestamp> where timestamp is the current date. To query all of these tables without having to explicitly reference the name of each table, the BigQuery SQL will be mydataset.user_data*. This will load and take the union of all tables starting with user_data (eg. user_data_2021_04_05, user_data_2021_04_06 and so on).

In this video we walk through such a scenario. The two tables are called sets_2020 and sets_2021 which are loaded using a wildcard: sets*. Wildcards can appear anywhere in the table name, so if the timestamp was a prefix rather than a postfix then we would use *sets.

So what is best practice for using views instead of tables? In general, if the data in the source tables is not going to change (eg. historic data) it is best to create a new table rather than a view to avoid the overhead. If the data is dynamic or new data is added regularly (eg. ad impressions collected on a daily basis), then using a view is necessary so all tables can be read at the same time, without having to manually update the pipeline every day just to add the latest table in the mix. Keeping this in mind can result in significant cost savings.