Use Google Sheets as Lookup Tables for Data Normalization in Google BigQuery

I’m a Google BigQuery fanboy. There are other cloud-based “big data” databases out there, but BigQuery’s set of unique features combined with an extremely easy to use interface and wide range of integrations have made it particularly useful as a self-taught, quasi-data engineer without a large technical team or CS degree to fall back on.

A feature that I’m continually surprised doesn’t get talked about more often is the ability to create a database table from a Google Sheets’ spreadsheet.

The best thing about the feature is that it’s not just a one-time import – it’s a live connection. The spreadsheet IS the database table. This means that any updates made in the sheet will be immediately available the next time the table is queried.

Lookup Tables

A lookup table doesn’t have a standardized definition, but you’ll often hear it described as a usually relatively small table with rarely changed descriptive data that can be joined on to other tables to provide greater context or two standardize values from another table: for example expanding two-letter state codes into the full state name or show the product name from a SKU.

Lookup tables can be a great way to normalizing data in a query for data with too many values or that changes too frequently to be hard-coded into a SQL CASE statement, but that would be too difficult to standardize at the point of data collection or before your main tables are loaded into your database. You don’t really want to have to write code that converts the SKU from every single transaction into the product name and price – it’s easier just to pass the SKU into the transaction table and use a product lookup table for the other, more static data.

Google Sheets provides an incredibly easy way to create easily-editable lookup tables for your BigQuery database.

Example: FEC Filing Data

I’m a politics nerd (yes even now. yes… something is definitely wrong with me). I was way too excited the day I realized that Google provides all of the campaign finance data from the FEC in a Google BigQuery dataset (it’s usually updated quite frequently, but the moron-in-chief’s government shutdown means, among many other things, FEC data isn’t being published).

Like all good government data sets… it’s not super easy to work with. Column names are all shortened (I’m sure from the days when column names had tiny length limits), and many columns have code values that you need to be referenced for their full meaning.

Committee Master File schema table `bigquery-public-data.fec.cm18`

One of the first things I did when I wanted to work with this data was create views that “cleaned up” these tables the way that they were easier for me to work with. This included joining multiple year’s tables together, expanding column names so they were more descriptive and expanding certain column values.

--  VIEW: Committee Master File
select cmte_id as id
      ,cmte_nm as name
      ,tres_nm as treasurer_name
      ,cmte_dsgn as designation
      ,cmte_city as city
      ,cmte_st as state
      ,cmte_zip as zip_code      
        when cmte_dsgn = "A" then "Authorized by a candidate"
        when cmte_dsgn = "B" then "Lobbyist/Registrant PAC"
        when cmte_dsgn = "D" then "Leadership PAC"
        when cmte_dsgn = "J" then "Joint fundraiser"
        when cmte_dsgn = "P" then "Principal campaign committee of a candidate"
        when cmte_dsgn = "U" then "Unauthorized"
       end as committee_designation      
        when cmte_tp = "C" then "Communication cost" 
        when cmte_tp = "D" then "Delegate committee" 
        when cmte_tp = "E" then "Electioneering communication" 
        when cmte_tp = "H" then "House" 
        when cmte_tp = "I" then "Independent expenditor (person or group)" 
        when cmte_tp = "N" then "PAC - nonqualified" 
        when cmte_tp = "O" then "Independent expenditure-only (Super PACs)" 
        when cmte_tp = "P" then "Presidential" 
        when cmte_tp = "Q" then "PAC - qualified" 
        when cmte_tp = "S" then "Senate" 
        when cmte_tp = "U" then "Single-candidate independent expenditure" 
        when cmte_tp = "V" then "PAC with non-contribution account - nonqualified" 
        when cmte_tp = "W" then "PAC with non-contribution account - qaualified " 
        when cmte_tp = "X" then "Party - nonqualified" 
        when cmte_tp = "Y" then "Party - qualified" 
        when cmte_tp = "Z" then "National party nonfederal account" 
       end as type
      ,cmte_pty_affiliation as party
        when cmte_filing_freq = "A" then "Administratively terminated"
        when cmte_filing_freq = "D" then "Debt"
        when cmte_filing_freq = "M" then "Monthly filer"
        when cmte_filing_freq = "Q" then "Quarterly filer"
        when cmte_filing_freq = "T" then "Terminated"
        when cmte_filing_freq = "W" then "Waived"
       end as committee_filing_frequency
        when org_tp = "C" then "Corporation"
        when org_tp = "L" then "Labor organization"
        when org_tp = "M" then "Membership organization"
        when org_tp = "T" then "Trade association"
        when org_tp = "V" then "Cooperative"
        when org_tp = "W" then "Corporation without capital stock"
       end as interest_group_category
      ,connected_org_nm as connected_organization_name
      ,cand_id as candidate_id
from  (
  select * from `bigquery-public-data.fec.cm18`
  union all
  select * from `bigquery-public-data.fec.cm16`

There are several long CASE statements where single-letter codes are expanded into their full definitions.

This is fine for a relatively small set of values (the longest here is 16 rows). But when we come to the “cmte_pty_affiliation” column – committee party affiliation, it’s a different story.

Despite our strong two-party system in the United States, there are actually 95 parties that appear in the FEC’s campaign finance filings. Each with a three-letter code that gets used in most FEC data tables. Now DEM and REP are pretty intuitive – but what about CIT? AMP? SEP? Not so much…

(Those were the Citizen’s Party, American Party and Socialist Equality Party… in case you’re interested).

95 values is a long CASE statement. And while new parties don’t crop up that often – it does happen. So are you going to have to dig into all your SQL code when the Pirate Party finally gets the recognition they deserve?

A lookup table makes way more sense. Fortunately, 95 rows is pretty easy in a spreadsheet. Here’s the reference table I created:

Here’s how to setup a Google Sheet as a BigQuery table:

Create a new Google Sheets spreadsheet.

BigQuery will only reference the first sheet in your Google Spreadsheet. If you have multiple sheets in your workbook (for instance you might have other data in other sheets that gets consolidated into a single sheet for your reference table), just make sure the one that you want BigQuery to point at is the first sheet in the workbook. You can drag and drop sheet taps to change order.

Quick Tip: you can create a new spreadsheet by going to the URL:

Column Headers

Typically when we create a table in spreadsheet we would want to have column headers in the first row. However sometimes BigQuery will treat your header row as a data row.

This appears to happen when all of the columns have the type STRING. Hopefully Google will add a checkbox at some point like they do with CSV imports to tell BigQuery to skip x number of rows, but until then, there are a few workarounds here:

Add Sheet as Table in BigQuery

In the BigQuery UI, select a dataset in a project that you own (when you’re working with public datasets in BigQuery, you still need to create your own dataset under a Google Cloud Project – TK).

When you’ve selected a dataset, click “Create Table”.

Select “Drive” from “Create table from”. Paste the URL to the Google Spreadsheet in the “Select Drive URI” field and select “Google Sheet” from File format.

Create a Google BigQuery table from a Google Sheet spreadsheet.
Configure a new external table to point at your Google Sheet.

Then enter a table name. I like to create namespaces in my table names, so I prepend all my reference tables with “ref_” – just so they all appear grouped together in my dataset.

Naming your BigQuery Table

Next we define the schema of the sheet. While there is an option to “auto detect” schema and input parameters, this doesn’t always work as expected, and I’ve often found it is a better practice to define the schema yourself. If auto detect gets something wrong, you’ll have to delete the table and recreate it.

Define your Google Sheet's schema in BigQuery. While you can use auto-detect, it's more reliable to just define your schema manually up front.
Define your table schema.

In this case our schema is straight forward – three columns: all strings.

BigQuery doesn’t check the schema for you – if there is invalid data in a cell, it won’t throw an error until you try to query it. This won’t be a problem for STRING columns, but it will be an issue if you have numeric fields.

Once your schema is set, click “Create table”.

Now we can include our lookup table in our previous view, converting our three-letter code to a party_name field.

   --- ...
      ,connected_org_nm as connected_organization_name
      ,cand_id as candidate_id
   --- new lookup field
      ,party.party_code_descriptor as party_name
from  (
  select * from `bigquery-public-data.fec.cm18`
  union all
  select * from `bigquery-public-data.fec.cm16`
) o
left join `political-research.fec_data.ref_party_codes` party
 on o.cmte_pty_affiliation = party.party_code

Data Normalization

“Ok, that was cool” I hear you saying. “Is that it?”

Not even close.

Perhaps the best application for the Google Sheets integration with BigQuery that I’ve found is the ability to create lookup tables that are dynamically populated from another query.

What’s more, you can write a query that self-references the Google Sheet table so that when you rerun the query, any changes you’ve made to the data will be incorporated.

But let’s try an example…

Example: Normalizing Expenditure Categories

One of the datasets available in the FEC data is expenditures – where campaigns are spending their money.

Here is the schema from my “operating_expenditures” view (I’ll include the SQL for all my views at the end):

Note: I’m using SuperQuery as my BigQuery interface here.

Say I wanted to see what the top expense categories by amount spent and the number of campaign committees who reported that expense.

Well, I could group by category_description:

Note: we put where amount > 0 to limit the query to just expenses paid and not those that were filed as refunds or credits.

You can see write a way we have a problem – the biggest category is Null – or uncategorized. So maybe we’ll have more luck grouping by the purpose column:

There’s no NULL value in our first 20 or so results, so that’s a good start. But immediately we can see the issue – the 1st row is “media buy” and the 3rd row is “media”. Those should probably be the same.

This is a super common issue with big messy datasets. This data is taking from raw FEC filings, and there isn’t any standardization as to how filers have to categorize the purpose of the expense.

In fact when we count the number of distinct values for purpose, we see that there are more than 186k different values for this field.

However, we could use the value of the purpose column to add categories to some of the records that don’t have them. We can do this by first importing data into a Google Sheet from a SQL query, then manually adding our new categories. Then that entire sheet can be treated as a table by BigQuery, allowing us to easily join the data back in to the database.

First, we need to install a Google Sheets extension that allows you to run SQL queries in BigQuery and display the results in a sheet. I use OWOX BI’s extension for this.

Once you have installed the add on, click create a report.

Next, we write a query that pulls the top 15k values for purpose, along with the category description, total $ spent and number of committees. However we are going to arrange the columns a little unusually – and we will pull in category_description twice (with different names), for reasons that will make sense soon.

select count(distinct exp.committee_id) as nm_cmte
      ,sum(exp.amount) as total
      ,exp.category_description as new_category
      ,exp.category_description as old_category
from fec_data.operating_expenditures exp
where amount > 0
group by 3,4,5
order by 2 desc
limit 15000

Write #standardSQL at the top so that the BigQuery API knows we’re using the Standard version of SQL – not the old legacy kind.

We need to title our query, and this title will become the name of the sheet where the data will be display (after this I won’t post screenshots of the query window – rather I’ll just post the SQL code so you can copy it).

Click “Save & Run”

This should give us a sheet that looks something like this:

Once you have this, you can delete “Sheet 1”, so that this sheet is now the first sheet in your workbook.

Next, we’re going to create a new query that will give us all of the possible values for category_description in order of total dollars spent. Call this query “Expenditure Categories”:

select category_description
      ,count(distinct committee_id) as num_committees
      ,sum(amount) as total
from fec_data.operating_expenditures exp
where amount > 0
    and category_description is not null
group by 1
order by 3 desc

After we run that query, we get a new sheet like this (the unneeded rows and columns have been removed)

Go back to our first sheet. Here we create a data validation rule in our new_category column, based on our second query, so that we can easily select from one of the standard values:

You can now select from one of the existing category_description values for each of the uncategorized values.

We placed new_category between purpose and old_category, so that it’s easier to scan across multiple values and let the purpose and old_category values inform your selection of a new category (originally I had new_category at the very end after nm_cmte and total and it was difficult to quickly associate values across that many rows).

Next, using the same method as before, we add our new spreadsheet as a table in BigQuery, specifying a new schema based on the columns of our spreadsheet:

(Because we have mixed column types, we don’t have to worry about the header row)

Categorizing Data

Obviously categorizing even 15k rows of data would take a lot of time – even if it is easier than editing a SQL table. But because this is a spreadsheet, we can avail ourselves of the functions a spreadsheet provides for organizing data to make the process easier.

Select the entire table range and hit the “filter” icon. This turns the range into a sortable, filterable table:

In this example, first I filter by the “new_category” column to show just the rows that are blank. If you were wanting to recategorize some existing expenses though you might not do this.

We the table to show results that have either the word “advertising” or “media” in the purpose column. Then we simply select the “Advertising Expenses” option from the drop down of valid values and the copy and paste that down the rest of the rows. Note, with this many rows it can sometimes take a few seconds for large copy and paste operations to work.

You can continue this process of filtering list to progressively categorize rows that don’t have a category – for example we can categorize any rows with “payroll” or “salary” in the purpose as “Administrative/Salary/Overhead Expenses”

Classify as many rows as you would like, depending on how deep you want to go. I spent about 10 minutes and categorized a few easy ones like “travel”, “fundraising” and “direct mail” that easily fit into the existing categories.

Note: there are a lot of entries where multiple categories apply – presumably vendors paid who might have done several types of work, for instance direct mail and polling. We’re just looking for the “best guess” to make aggregated analysis more meaningful.

Querying New Data

Now we need to incorporate the new categories into our original query.

select coalesce(ref.new_category, exp.category_description) category_description
      ,count(distinct exp.committee_id) as num_committees
      ,sum(exp.amount) as total
from fec_data.operating_expenditures exp
left join fec_data.ref_match_expenditure_type ref
on exp.purpose = ref.purpose
and ifnull(exp.category_description,"") = ifnull(ref.old_category,"")
where amount > 0
group by 1
order by 3 desc

We simply join our reference table on to the operating_expenditures table on two keys: purpose and category_description.

Because null isn’t equal to anything in SQL, we need to do a substitution on the category_description and old_category columns using IFNULL, where we substitute null with an empty string “”.

This gives us the following results:

We use the COALESCE function to check the new reference table new_category function first, then check the old category_description column from the operating_expenditures table.

It’s a little hard to see, but we have decreased the total uncategorized dollars by 55% ($3,420,221,504.14 from $7,765,284,100.84 previously). Not bad for 10 minutes of classifying work.

Incorporating Data Updates

“But wait!” I hear you say, “what happens if there is new data that comes in that we need to categorize – won’t all of our hard work tagging disappear when we refresh the query?”

This is where things get cool! We can modify the query to pull in our edits, so that the query can be rerun while persisting all of our changes.

select count(distinct exp.committee_id) as nm_cmte
      ,sum(exp.amount) as total
      ,coalesce(self.new_category, exp.category_description) as new_category
      ,exp.category_description as old_category
from fec_data.operating_expenditures exp
left join fec_data.ref_match_expenditure_type self
 on exp.purpose = self.purpose
and ifnull(exp.category_description,"") = ifnull(self.old_category,"")
where amount > 0
group by 3,4,5
order by 2 desc
limit 15000

We join the operating_expenditures table to the ref_match_expenditure_type table, and use the self alias to identify that this references the same sheet.

Again, use the COALESCE function to pull in the new_category value first if it exists, then the old category_description.

This means that the query can be re-run and all of our work will be preserved.

As new records are added to the table and we see different values for purpose, we are able to keep re-categorizing. Now, because we’re only looking at the top 15k values, it’s possible that some less-common values might “fall out” of the top 15k, and lose their classification.

However, the next iteration of this approach could periodically dump our reference table to another table, maybe using BigQuery’s new scheduled queries, in order to preserve all manual classification work.

Questions / Improvements

This post covered some ways you can leverage Google Sheets as lookup tables in Google BigQuery. Let me know if there are other applications of this technique you have tried or you have any suggested improvements!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.