Monitor table statistics using dbt on Snowflake
Snowflake exposes some table statistics for the tables dbt creates, but if we want data observability metrics that show what happened over time, we have to implement some custom SQL.
In this blog post we will set up a dbt incremental model that tracks changes in these table statistics, using the modification time as part of the unique key.
To get table information in Snowflake you execute SHOW TABLES. However, the output of this command is not directly queryable as a table, so after executing SHOW TABLES you can use RESULT_SCAN to collect the response into a table:
SHOW TABLES;
SELECT *
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
Because these are two separate statements, you could use a dbt pre-hook to execute SHOW TABLES and make the RESULT_SCAN query the body of the model. Don't do this:
{{
  config(
    pre_hook='show tables'
  )
}}
SELECT
"created_on" AS created_on,
"name" AS table_name,
"database_name" AS database_name,
"schema_name" AS schema_name,
"kind" AS kind,
"comment" AS comment,
"cluster_by" AS cluster_by,
"rows" AS row_count,
"bytes" AS bytes,
"owner" AS owner,
"retention_time" AS retention_time,
"automatic_clustering" AS automatic_clustering,
"change_tracking" AS change_tracking,
"search_optimization" AS search_optimization,
"search_optimization_progress" AS search_optimization_progress,
"search_optimization_bytes" AS search_optimization_bytes,
"is_external" AS is_external
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
This works, but it completely ignores the TABLES view that is already available in the INFORMATION_SCHEMA of your Snowflake database. Querying that view gives a much cleaner way to collect this data in an incremental model:
{{
  config(
    materialized='incremental',
    unique_key='unique_key'
  )
}}
SELECT
table_catalog,
table_schema,
table_name,
table_owner,
table_type,
is_transient,
clustering_key,
row_count,
bytes,
retention_time,
created,
last_altered,
auto_clustering_on,
comment,
MD5(CONCAT(last_altered, table_catalog, table_schema, table_name)) AS unique_key
FROM INFORMATION_SCHEMA.TABLES
WHERE row_count IS NOT NULL
Every time a table is altered and you run this incremental model, a row will be added to allow you to track changes to the tables. Make sure you execute this dbt model as often as you see fit, for example after each dbt run.
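For example, assuming the model file is named table_monitoring_inc (the name used in the dashboard query below), a scheduled job could refresh it right after the regular run:

# build your regular models first
dbt run
# then capture the current table statistics
dbt run --select table_monitoring_inc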
Now that we have the data ready, let's create a small Snowflake dashboard to investigate this data:
- From the model we created earlier we select the relevant columns and introduce a schema.table name field we can later use to label each specific table.
- We select only tables created in the last month (keep in mind that dbt will CREATE OR REPLACE the tables on every run): created > dateadd(month, -1, current_date())
- We check which of these tables changed the most by ordering by stddev(row_count) desc and picking the top 10.
- For those tables we only need the daily average in the graph, so we group by created_date and name.
Combining these steps we end up with two CTEs and a final grouping:
with row_counts as (
select
concat(table_schema, '.', table_name) as name,
created,
date(created) as created_date,
row_count
from dbt.table_monitoring_inc
where row_count is not null
and created > dateadd(month, -1, current_date())
),
interesting_tables as (
select
name,
ifnull(stddev(row_count), 0) as change_rate
from row_counts
group by name
order by change_rate desc
limit 10
)
select
created_date,
name,
avg(row_count) as avg_row_count
from row_counts
inner join interesting_tables using (name)
group by created_date, name
This gives you a daily graph of the 10 tables whose row counts changed the most over the last month.
Side notes
- Why not a snapshot? Well, because of the constant influx of data, I expect to see row_count change almost every run. We want to keep every combination of last_altered and row_count there is to create a nice graph, so it is not really a slowly changing dimension.
- If you really want (or need for some reason) the pre-hook approach, you will have to create two models: one with the pre-hook and one with the incremental approach. If you do not, the incremental query and the result_scan will interfere with each other.
- If you create an incremental model, always name all the columns explicitly. This allows you to monitor for changes in the underlying table. Also consider adding on_schema_change='fail' (see the first sketch below).
- Should you not register the INFORMATION_SCHEMA.TABLES table as a source and reference that? Well, yes. That would allow you to easily track whether multiple models use this same information schema table, collect freshness, add tests, etc. For this blog post I left that out, but a minimal example follows at the end.
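For the on_schema_change suggestion, the config block of the incremental model only needs one extra line. A minimal sketch:

{{
  config(
    materialized='incremental',
    unique_key='unique_key',
    on_schema_change='fail'
  )
}}

With 'fail', dbt aborts the incremental run when the columns of the query no longer match the existing table, instead of silently ignoring the difference.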
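And as a minimal sketch of registering INFORMATION_SCHEMA.TABLES as a source (account_information_schema and my_database are placeholder names), you could add something like this to a sources .yml file:

version: 2

sources:
  - name: account_information_schema
    database: my_database
    schema: information_schema
    tables:
      - name: tables

The FROM clause of the incremental model would then become FROM {{ source('account_information_schema', 'tables') }}, which also gives you a place to attach freshness checks and tests.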