A very small change leads to a very small improvement
Author
Paulius Alaburda
Published
January 19, 2025
The other day I was fiddling with a silver data layer at our data warehouse to ensure that the tables are built reasonably fast. We have one table that is made up of a few joined tables and a few window functions. It was clear that building a joined table and then adding the windowed columns afterwards was faster. What surprised me was that adding the window functions was faster by building them in a CTE and joining to the original table. Apparently you can replicate this in DuckDB:
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tidyr)library(lubridate)
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
library(duckdb)
Warning: package 'duckdb' was built under R version 4.4.3
Loading required package: DBI
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.4.3
interval_benchmark <- bench::mark(window_joined =dbGetQuery(con, "with tmp as ( select id, lead(date_from,1) over (partition by group_id order by id) as lead_date_from from intervals) select intervals.*, tmp.lead_date_from from intervals left join tmp on tmp.id = intervals.id"),window_together =dbGetQuery(con, "select intervals.*, lead(date_from,1) over (partition by group_id order by id) as lead_date_from from intervals"),check =FALSE)
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
autoplot(interval_benchmark)
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the bench package.
Please report the issue at <https://github.com/r-lib/bench/issues>.