Handling post hook vacuums with dbt and Redshift

When using Redshift as your data warehouse with dbt, you'll likely want to vacuum your tables after incremental models run. This is especially important if you're relying on sort keys to maintain query performance over time. Redshift doesn't automatically sort new rows as they’re inserted, so to keep things efficient, you need to run VACUUM to reorganise the data.

The problem is: Redshift only allows one VACUUM operation at a time per cluster. If two vacuums run simultaneously, Redshift will throw errors like:

ERROR:  VACUUM cannot be run concurrently with another VACUUM or COPY

This becomes a real issue when you add VACUUM as a post-hook on multiple models. If you run dbt with more than one thread to parallelise model execution (which you probably do), there's a good chance that two post-hook vacuums will run at the same time and fail. That breaks your DAG, leaves downstream models unbuilt, and requires manual intervention.
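For illustration, a per-model post-hook that triggers this failure mode might look like the following sketch (the model and source names are hypothetical):

```sql
-- models/orders_incremental.sql (hypothetical model)
{{ config(
    materialized='incremental',
    post_hook='vacuum {{ this }}'
) }}

select * from {{ source('app', 'orders') }}
```

Two models configured like this, built on separate threads, can finish at nearly the same moment and fire their vacuums concurrently.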

The naive approach: serial execution

One way to prevent this is to run:

dbt run --threads 1

This ensures only one model (and one vacuum) runs at a time. But it also slows everything down dramatically. You're now processing models serially, even though only the vacuum step needs to be serialised. Not a great tradeoff.
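The thread count can also be pinned in your profile rather than passed on the command line. A sketch, assuming a profile named my_redshift_profile (the name and target are placeholders):

```yaml
# profiles.yml (profile name is hypothetical)
my_redshift_profile:
  target: prod
  outputs:
    prod:
      type: redshift
      threads: 1  # forces serial execution: safe, but slow
      ...
```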

The better approach: centralised, serialised vacuums

Instead of tying vacuums to individual models, we can run them at the end of the run, after all models have been built, and process them one by one. Because a single macro loops over the models sequentially, no two vacuums can overlap.

This can be implemented with a macro invoked from a run-level post-hook.

Step 1: Add a run-level post-hook

In your dbt_project.yml:

on-run-end:
    - "{{ vacuum_incremental_models(results) }}"

This macro will receive the full results object from the dbt run and can use it to identify which models to vacuum.

Step 2: Write the macro

In macros/vacuum_incremental_models.sql:

{% macro vacuum_incremental_models(results) %}
    {# The loop runs serially, so only one vacuum executes at a time #}
    {% for result in results %}
        {# Only vacuum models that actually built successfully #}
        {% if result.status == 'success' and result.node.resource_type == 'model' %}
            {% set materialized = result.node.config.materialized %}
            {% if materialized == 'incremental' %}
                {% set schema = result.node.schema %}
                {% set table = result.node.alias %}
                {% set sql %}
                    vacuum {{ schema }}.{{ table }};
                {% endset %}
                {{ log("Running vacuum on " ~ schema ~ "." ~ table, info=True) }}
                {% do run_query(sql) %}
            {% endif %}
        {% endif %}
    {% endfor %}
{% endmacro %}
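If a full vacuum is too heavy for your largest tables, Redshift also supports VACUUM SORT ONLY and VACUUM DELETE ONLY, and swapping one in is a one-line change inside the macro. A sketch of the sort-only variant:

```sql
{# Lighter alternative: re-sort rows without reclaiming deleted space #}
{% set sql %}
    vacuum sort only {{ schema }}.{{ table }};
{% endset %}
```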

Benefits of this approach:

- Models still build in parallel; only the vacuums are serialised.
- Vacuums run one at a time, so the concurrency error can't occur.
- Only successfully built incremental models are vacuumed, and there are no per-model post-hooks to maintain.

Phil Spiess

Data professional by day, sailing and windsurfing enthusiast by weekend. Phil combines his passion for technology and the outdoors in his work and writing.