When working with tables …

posted on 2022-12-11

There is a large number of opinions about how to deal with data and organize tables. But I think there are a few things that I end up doing for all table related information, that I might as well try to formalize in a blog post.

Be explicit when naming things. This will make working with the tables later easier because you will end up with fewer name collisions when joining and fewer mistakes. This means avoiding context dependent generic names, like id: favor account_id over simply id in the accounts table.

This also means being explicit about the type information you are not able to convey in the schema, always postfix the timezone on dates and timestamps. It is quite possible you will encounter dates encoded as stings, so you might as well add date and timestamp to the column name. If you later transport the date across some JSON or CSV, you don't have to worry about adding that information. Also, when it comes to date vs timestamp, the only difference is the granularity of the information. Both point to an interval on a timeline.

This means avoiding creation in favor of creation_timestamp_utc or creation_date_utc. As a side-note, avoid created because it could also be seen as a boolean referring to was this entry created or derived from other information.

One could say that date should not be added, because the schema of a database most certainly has support to expose if this is a date or not (you are not going to name all your tables something_table, right?). However, I find the same information in multiple resolutions in the same table to either support partitioning or improve performance on columnar operations that don't require the other fields. This would create a collusion in the names if you don't add that date and timestamp postfix: creation_timestamp_utc and creation_date_utc.

Enforce the implicit when storing things. This will reduce errors. This means avoiding having a creation_date_utc encoded as a DATE in one table, and a VARCHAR in another table. If you are working with timestamps, make sure you decide if you are going to use the TIMESTAMP type or not. Then when you decided upon this, check schema's and enforce them where possible. If a column is not the correct type, consider it a bug.

Enforcing up front is relevant and required to make sure you don't end up with corrupted data. Look for tools that help you do that, like running a layered data warehouse using dbt to build some data in the base layers, run tests on that layer and only if all tests succeed, continue to the next layer. No matter how many layers you have, this also means you will probably want the last layer to be fully materialized to ensure that you can try new data at a lower level without disrupting service to other systems.

Happy hacking!