Handling post hook vacuums with dbt and Redshift
When using Redshift as your data warehouse with dbt, you'll likely want to vacuum your tables after incremental models run. This is especially important if you're relying on sort keys to maintain query performance: Redshift doesn't automatically sort new rows as they're inserted, so to keep queries efficient you need to run VACUUM periodically to re-sort the data and reclaim space from deleted rows.
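For example, re-sorting a table after an incremental load is a single statement (the table name here is hypothetical):

vacuum sort only analytics.events;

A plain VACUUM also reclaims space from deleted rows; SORT ONLY is cheaper when you only need the sort order restored.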
The problem is: Redshift only allows one VACUUM operation at a time per cluster. If two vacuums run simultaneously, Redshift will throw errors like:
ERROR: VACUUM cannot be run concurrently with another VACUUM or COPY
This becomes a real issue when you add VACUUM as a post-hook on multiple models. If you're running dbt with more than one thread to parallelise model execution (which you probably are), there's a good chance that two post-hook vacuums will run at the same time and fail. That breaks your DAG, leaves downstream models unbuilt, and requires manual intervention.
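To make the failure mode concrete, the problematic setup is a per-model post-hook along these lines (the model config is illustrative; vacuuming {{ this }} in a post-hook is a pattern that appears in dbt's own hooks documentation):

{{ config(
    materialized='incremental',
    post_hook="vacuum {{ this }}"
) }}

With several threads, several of these hooks can fire at once, and any that overlap will fail.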
The naive approach: serial execution
One way to prevent this is to run:
dbt run --threads 1
This ensures only one model (and one vacuum) runs at a time. But it also slows everything down dramatically. You're now processing models serially, even though only the vacuum step needs to be serialised. Not a great tradeoff.
The better approach: centralised, serialised vacuums
Instead of tying vacuums to individual models, we can run them at the end of the run, after all models have been built, and process them one by one. Here’s the general idea:
- Track which incremental models were run successfully.
- At the end of the run, loop through those models and issue a VACUUM statement.
- Run that loop serially to avoid concurrency issues.
This can be implemented with a macro called from an on-run-end hook.
Step 1: Add an on-run-end hook
In your dbt_project.yml:
on-run-end:
  - "{{ vacuum_incremental_models(results) }}"
The results variable is available in the on-run-end context and contains one entry per node that ran; the macro uses it to identify which models to vacuum.
Step 2: Write the macro
In macros/vacuum_incremental_models.sql:
{% macro vacuum_incremental_models(results) %}
    {% for result in results %}
        {# Only consider models that actually built successfully #}
        {% if result.status == 'success' and result.node.resource_type == 'model' %}
            {% set materialized = result.node.config.materialized %}
            {% if materialized == 'incremental' %}
                {% set schema = result.node.schema %}
                {% set table = result.node.alias %}
                {% set sql %}
                    vacuum {{ schema }}.{{ table }};
                {% endset %}
                {{ log("Running vacuum on " ~ schema ~ "." ~ table, info=True) }}
                {# The loop issues one statement at a time, so vacuums are serialised #}
                {% do run_query(sql) %}
            {% endif %}
        {% endif %}
    {% endfor %}
{% endmacro %}
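One caveat: Redshift won't run VACUUM inside an explicit transaction block. Whether run_query wraps the statement in one depends on your adapter and its version; if you hit "VACUUM cannot run inside a transaction block", a commonly cited workaround (a sketch, not guaranteed for every adapter) is to commit before vacuuming:

{% set sql %}
    commit; vacuum {{ schema }}.{{ table }};
{% endset %}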
Benefits of this approach:
- You can still run dbt run with multiple threads
- Models execute in parallel
- Vacuums are executed safely at the end, one by one
- Your DAG stays healthy and error-free