Monitor table statistics using dbt on Snowflake
Snowflake exposes some table statistics for the tables dbt creates, but if we want data observability metrics that show what happened over time, we have to implement some custom SQL.
In this blog post we will set up a dbt incremental model that tracks changes in these table statistics, using the modification time as part of the unique key.
To get table information in Snowflake you execute SHOW TABLES. However, the output of this command is not directly queryable as a table, so after executing SHOW TABLES you can use RESULT_SCAN to collect the response into a table:
SHOW TABLES;
SELECT *
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
Because these are two separate statements, you could use a dbt pre-hook to execute SHOW TABLES and make the RESULT_SCAN query the body of the model. Don't do this:
{{
  config(
    pre_hook='show tables'
  )
}}
SELECT
"created_on" AS created_on,
"name" AS table_name,
"database_name" AS database_name,
"schema_name" AS schema_name,
"kind" AS kind,
"comment" AS comment,
"cluster_by" AS cluster_by,
"rows" AS row_count,
"bytes" AS bytes,
"owner" AS owner,
"retention_time" AS retention_time,
"automatic_clustering" AS automatic_clustering,
"change_tracking" AS change_tracking,
"search_optimization" AS search_optimization,
"search_optimization_progress" AS search_optimization_progress,
"search_optimization_bytes" AS search_optimization_bytes,
"is_external" AS is_external
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
This works, but it completely ignores the TABLES view that is already available in the INFORMATION_SCHEMA of your Snowflake database. Querying that view gives a much cleaner way to collect this data in an incremental model:
{{
  config(
    materialized='incremental',
    unique_key='unique_key'
  )
}}
SELECT
table_catalog,
table_schema,
table_name,
table_owner,
table_type,
is_transient,
clustering_key,
row_count,
bytes,
retention_time,
created,
last_altered,
auto_clustering_on,
comment,
MD5(CONCAT(last_altered, table_catalog, table_schema, table_name)) AS unique_key
FROM INFORMATION_SCHEMA.TABLES
WHERE row_count IS NOT NULL
Every time a table is altered and you run this incremental model, a row will be added to allow you to track changes to the tables. Make sure you execute this dbt model as often as you see fit, for example after each dbt run.
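For example, assuming the model file is named table_monitoring_inc (the name used in the dashboard query below), a scheduled job could refresh it right after the regular run:

# build your regular models first
dbt run
# then capture the current table statistics
dbt run --select table_monitoring_inc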
Now that we have the data ready, let's create a small Snowflake dashboard to investigate this data:
- From the model we created earlier we select the relevant columns and introduce a schema.table name field we can later use to label each specific table.
- We select only tables created in the last month (keep in mind that dbt will CREATE OR REPLACE the tables on every run): created > dateadd(month, -1, current_date())
- We check which of these tables changed the most by ordering by stddev(row_count) desc and picking the top 10.
- For those tables we only need the daily average in the graph, so we group by created_date and name.
Combining these steps we end up with two CTEs and a final grouping:
with row_counts as (
select
concat(table_schema, '.', table_name) as name,
created,
date(created) as created_date,
row_count
from dbt.table_monitoring_inc
where row_count is not null
and created > dateadd(month, -1, current_date())
),
interesting_tables as (
select
name,
ifnull(stddev(row_count), 0) as change_rate
from row_counts
group by name
order by change_rate desc
limit 10
)
select
created_date,
name,
avg(row_count) as avg_row_count
from row_counts
inner join interesting_tables using (name)
group by created_date, name
This gives you a daily graph of the 10 tables whose row counts changed the most over the last month.
Side notes
- Why not a snapshot? Well, because of the constant influx of data, I expect to see row_count change almost every run. We want to keep every combination of last_altered and row_count there is to create a nice graph, so it is not really a slowly changing dimension.
- If you really want (or need for some reason) the pre-hook approach, you will have to create two models: one with the pre-hook and one with the incremental approach. If you do not, the incremental query and the result_scan will interfere with each other.
- If you create an incremental model, always name all the columns explicitly. This allows you to monitor for changes in the underlying table. Also consider adding on_schema_change='fail' (see the first sketch below).
- Should you not register the INFORMATION_SCHEMA.TABLES table as a source and reference that? Well, yes. That would allow you to easily track whether multiple models use this same information schema table, collect freshness, add tests, etc. For this blog post I left that out, but a minimal example follows at the end.
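For the on_schema_change suggestion, the config block of the incremental model only needs one extra line. A minimal sketch:

{{
  config(
    materialized='incremental',
    unique_key='unique_key',
    on_schema_change='fail'
  )
}}

With 'fail', dbt aborts the incremental run when the columns of the query no longer match the existing table, instead of silently ignoring the difference.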
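And as a minimal sketch of registering INFORMATION_SCHEMA.TABLES as a source (account_information_schema and my_database are placeholder names), you could add something like this to a sources .yml file:

version: 2

sources:
  - name: account_information_schema
    database: my_database
    schema: information_schema
    tables:
      - name: tables

The FROM clause of the incremental model would then become FROM {{ source('account_information_schema', 'tables') }}, which also gives you a place to attach freshness checks and tests.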