The Data Analyst SQL Cookbook

Author

Paulius Alaburda

Published

October 18, 2023

Preface

I have a terrible memory for things that I use rarely. When it comes to SQL, there are certain code patterns that are extremely useful when doing data modelling. But there’s only so many data models you need to build before actually using them. I wrote this so that I have all of my useful patterns in one place with the benefit of sharing this with you.

This book was born out of a need to collect SQL queries that were too infrequently used to be memorized but too frequently used to be looked up. The queries are organized by topic and are meant to be used as a reference. The book is also meant to be a living document that will be updated as new queries are discovered and old queries are improved.

Furthermore, these SQL patterns are not universally known! I have found that many developers are not aware of these patterns and are reinventing the wheel when they encounter these problems. I hope that this book will help to spread these patterns and make them more widely known. My team has been victim to the same problem - we found a solution on StackOverflow only to find a better solution a year down the line.

The impetus to writing this book was that I couldn’t find a book that I could throw at a fresh data analyst and say “here, this will get you up to speed on SQL”. The book initially was supposed to be a collection of SQL queries that my team used frequently or infrequently. Think of it as a SQL cookbook of sorts - it contains recipes that are useful but not so common that you’d learn them by heart. However, I also realised that having a single resource for SQL is better than two so I added an introduction to SQL as well.

Part one of the book is exactly that - basics presented in a concise manner to get you up to speed on running queries as an analyst. Part two is the cookbook - the queries are loosely grouped around the types of problems they solve. For example, queries for working with time series will be in the same chapter. Loosely is a load-bearing word here - I would like to find a more natural way of organising the material but the recipes are going to be useful depending on your area. You might not even find a use for some of the queries in this book!

There’s also another, secret, part - building Data Models. A lot of Data Analysts forgo building a star schema and just whack the raw data with whatever tool they’re most comfortable with. It’s not necessarily a bad practice or anything but I prefer building out my data model so that it meets two criteria:

It’s expressive - relatively simple SQL queries are required to get different calculations, i.e. I don’t need to use exotic SQL when all I want is a count or a sum
It’s

If the model is expressive and X, then there’s a few neat things that get solved:

You no longer need to write complicated Calculated Fields or DAX queries - that means the calculations are simpler and more portable.
This means you have less lock-in - if for some reason you need to switch BI vendors, your data model drives most of the logic.
An expressive data model also lets you double check if the result in your Data Warehouse and your BI tool matches

Whenever I build a data model (or semantic layer whatever you wanna call it), I try to build it in a way that my main metrics can be calculated using simple SQL queries like counting rows or SUM’ing a column. Achieving this nets you a few neat things:

You can check your metric definition inside your Data Warehouse and your BI tool
You have less vendor lock-in and can switch BI tools if need be

My favourite example is how Power BI has an unwieldy query for calculating the number of subscriptions for each month, i.e. showing the change of active SCD2 type rows. In a data model, this is solved by building a factless fact table. And you really really need to know this as a data analyst - I strongly believe that only the analyst can build the semantic layer because the semantic layer is driven by the metrics and questions that need answering. If your tables in the Data Warehouse are not built in a way that helps you answer questions and build your main KPIs, your data analysts are going to build out their logic in the BI tool or somewhere else entirely. This chapter is recipes targeted towards building tables inside of a Data Warehouse.

To make this book really useful, the book uses the interactive SQL extension for Quarto. All the code chunks you find in the book are executable. Have fun!

I like data modeling because it introduces clarity and a common language between data analysts. If all reports follow some sane dimension and fact structure, you can probably onboard a new data analyst faster than to reports that do not. However, I understand why people despise Kimball’s techniques and embrace stuff like one-big-table (OBTs). Have you seen the list of techniques listed on KImball’s website? Half of these names don’t make sense to me. If I have a calendar table that is used as a filter on two fact tables, suddenly it’s called a “conformed” dimension? If I define multiple relationships between my dimension and fact tables, it’s called a “role playing” dimension now? That’s too much lingo for my tastes. The data modeling chapter contains techniques that I have found useful with sample SQL code to show how they could be applied.