Our models (typically) fit into three main categories: staging, warehouse, marts. For more detail about aspects of this structure, check out the dbt best practices.
The file and naming structures are as follows (example):
ssp_analytics
├── .github
│ ├── workflows
│ │ ├── ci.yml
│ │ ├── daily_refresh.yml
│ │ └── post_merge_deploy.yml
│ └── pull_request_template.md
├── _project_docs
│ ├── automation
│ │ │ └── profiles.yml
│ └── style_guide.md
├── analyses
├── seeds
│ └── some_data.csv
├── snapshots
├── tests
│ └── assert_some_test_scenario.sql
├── macros
│ ├── _macros__definitions.yml
│ ├── _macros__docs.md
│ └── generate_schema_name.sql
├── models
│ ├── marts
│ │ ├── _marts__docs.md
│ │ ├── _marts__models.yml
│ │ └── nba_games_detail.sql
│ ├── staging
│ │ ├── nba
│ │ │ ├── _nba__docs.md
│ │ │ ├── _nba__models.yml
│ │ │ ├── _nba__sources.yml
│ │ │ ├── stg_nba__games.sql
│ │ │ └── stg_nba__teams.sql
│ │ └── gsheets
│ │ ├── _gsheets__models.yml
│ │ ├── _gsheets__sources.yml
│ │ ├── stg_gsheets__franchise_actives.yml
│ │ ├── stg_gsheets__franchise_general_managers.yml
│ │ └── stg_gsheets__franchise_head_coaches.sql
│ ├── warehouse
│ │ ├── dimensions
│ │ │ ├── _dimensions__docs.md
│ │ │ ├── _dimensions__models.yml
│ │ │ ├── dim_calendar_dates_.sql
│ │ │ ├── dim_games.sql
│ │ │ └── dim_teams.sql
│ │ └── facts
│ │ ├── _facts__docs.yml
│ │ ├── _facts__models.yml
│ │ └── fct_games_played.sql
├── README.md
├── dbt_project.yml
├── packages.yml
└── requirements.txt
- All objects should be plural, such as:
stg_nba__teams - Staging models are 1:1 with each source table and named with the following convention:
stg_<source>__<table_name>.sql - Marts contain all of the useful data about a particular entity at a granular level and should lean towards being wide and denormalized.
- Intermediate tables (if needed) should help break apart complex or lengthy logic and follow the following convention:
int_[entity]s_[verb]s.sql
- Model-specific attributes (like sort/dist keys) should be specified in the model.
- If a particular configuration applies to all models in a directory, it should be specified in the
dbt_project.ymlfile. - In-model configurations should be specified like this:
{{
config(
materialized = 'table',
sort = 'id',
dist = 'id'
)
}}- Marts should always be configured as tables
- Only
stg_models (orbase_models if your project requires them) should select fromsources. - All other models should only select from other models.
- Every subdirectory should contain a
.ymlfile, in which each model in the subdirectory is tested. For staging folders, there will be both_sourcename__sources.ymlas well as_sourcename__models.yml. For other folders, the structure should be_foldername__models.yml(example_finance__models.yml). - At a minimum, unique and not_null tests should be applied to the primary key of each model.
- Schema, table and column names should be in
snake_case. - Use names based on the business terminology, rather than the source terminology.
- Each model should have a primary key.
- The primary key of a model should be named
<object>_id, e.g.account_id– this makes it easier to know whatidis being referenced in downstream joined models. - For base/staging models, fields should be ordered in categories, where identifiers are first and timestamps are at the end.
- Timestamp columns should be named
<event>_at, e.g.created_at, and should be in UTC. If a different timezone is being used, this should be indicated with a suffix, e.gcreated_at_pt. - Booleans should be prefixed with
is_orhas_. - Price/revenue fields should be in decimal currency (e.g.
19.99for $19.99; many app databases store prices as integers in cents). If non-decimal currency is used, indicate this with suffix, e.g.price_in_cents. - Avoid reserved words as column names
- Consistency is key! Use the same field names across models where possible, e.g. a key to the
customerstable should be namedcustomer_idrather thanuser_id.
For more information about why we use so many CTEs, check out this discourse post.
- All
{{ ref('...') }}statements should be placed in CTEs at the top of the file - Where performance permits, CTEs should perform a single, logical unit of work.
- CTE names should be as verbose as needed to convey what they do
- CTEs with confusing or noteable logic should be commented
- CTEs that are duplicated across models should be pulled out into their own models
- create a
finalor similar CTE that you select from as your last line of code. This makes it easier to debug code within a model (without having to comment out code!) - CTEs should be formatted like this:
with
events as (
...
),
-- CTE comments go here
filtered_events as (
...
)
select * from filtered_events-
Use trailing commas
-
Indents should be four spaces (except for predicates, which should line up with the
wherekeyword) -
Lines of SQL should be no longer than 80 characters
-
Field names and function names should all be lowercase
-
The
askeyword should be used when aliasing a field or table -
Fields should be stated before aggregates / window functions
-
Aggregations should be executed as early as possible before joining to another table.
-
Ordering and grouping by a number (eg. group by 1, 2) is preferred over listing the column names (see this rant for why). Note that if you are grouping by more than a few columns, it may be worth revisiting your model design.
-
Prefer
union alltounion* -
Avoid table aliases in join conditions (especially initialisms) – it's harder to understand what the table called "c" is compared to "customers".
-
If joining two or more tables, always prefix your column names with the table alias. If only selecting from one table, prefixes are not needed.
-
Be explicit about your join (i.e. write
inner joininstead ofjoin).left joinsare normally the most useful,right joinsoften indicate that you should change which table you selectfromand which one youjointo. -
DO NOT OPTIMIZE FOR A SMALLER NUMBER OF LINES OF CODE. NEWLINES ARE CHEAP, BRAIN TIME IS EXPENSIVE
with
my_data as (
select * from {{ ref('my_data') }}
),
some_cte as (
select * from {{ ref('some_cte') }}
),
some_cte_agg as (
select
id,
sum(field_4) as total_field_4,
max(field_5) as max_field_5
from some_cte
group by 1
),
final as (
select [distinct]
my_data.field_1,
my_data.field_2,
my_data.field_3,
-- use line breaks to visually separate calculations into blocks
case
when my_data.cancellation_date is null
and my_data.expiration_date is not null
then expiration_date
when my_data.cancellation_date is null
then my_data.start_date + 7
else my_data.cancellation_date
end as cancellation_date,
some_cte_agg.total_field_4,
some_cte_agg.max_field_5
from my_data
left join some_cte_agg
on my_data.id = some_cte_agg.id
where my_data.field_1 = 'abc'
and (
my_data.field_2 = 'def' or
my_data.field_2 = 'ghi'
)
having count(*) > 1
)
select * from final
- Your join should list the "left" table first (i.e. the table you are selecting
from):
select
trips.*,
drivers.rating as driver_rating,
riders.rating as rider_rating
from trips
left join users as drivers
on trips.driver_id = drivers.user_id
left join users as riders
on trips.rider_id = riders.user_id
- Indents should be two spaces
- List items should be indented
- Use a new line to separate list items that are dictionaries where appropriate
- Lines of YAML should be no longer than 80 characters.
version: 2
models:
- name: events
columns:
- name: event_id
description: This is a unique identifier for the event
tests:
- unique
- not_null
- name: event_time
description: "When the event occurred in UTC (eg. 2018-01-01 12:00:00)"
tests:
- not_null
- name: user_id
description: The ID of the user who recorded the event
tests:
- not_null
- relationships:
to: ref('users')
field: id- When using Jinja delimiters, use spaces on the inside of your delimiter, like
{{ this }}instead of{{this}} - Use newlines to visually indicate logical blocks of Jinja
- https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview
- https://discourse.getdbt.com/t/why-the-fishtown-sql-style-guide-uses-so-many-ctes/1091
- https://blog.getdbt.com/write-better-sql-a-defense-of-group-by-1/
- https://docs.getdbt.com/docs/about/viewpoint
- https://github.com/dbt-labs/corp/blob/main/dbt_style_guide.md