One of the most important principles in software engineering, in my opinion, is single source of truth. It applies up and down the chain, from databases to React front ends to data engineering and analytics. The Semantic Layer is a relatively new concept (outside of very large organizations that were able to create their own custom solutions) that allows us to apply the single source of truth construct to data analytics: define your data objects in one centralized place and reuse them everywhere.

A Simple Example

I’m not quite happy with that definition above; it’s a little convoluted for my taste, so let’s look at a simple example. Business leaders in your organization probably want some sort of dashboard where they can see how many active subscribers your service currently has. We might build a database table subscribers something like this:

customer_id subscribed trialing
uuid1 0 0
uuid2 1 0
uuid3 1 1

If subscribed = 1 and trialing = 0 , the user is actively subscribed, and we want to show them in our dashboard’s total count of subscribers. The query is simple:

select count(*)
from subscribers
where subscribed = 1 and trialing = 0;

As your organization grows, you’ll inevitably create multiple dashboards, ad hoc analyses, Jupyter notebooks, and even end up using multiple data analytics tools.

What happens when someone who’s less familiar with your database forgets the and trialing = 0 bit? Now they’re looking at, and potentially making important business decisions based on an incorrectly high number.

Potentially worse, consider the very real possibility of making some change in the future that requires an update to the definition of “is subscribed.” Maybe you add a new column provided_free that indicates subscriptions that are provided to users for free as part of marketing initiatives. You now have a major, major headache: you’ll need to find every usage of this query in every analytics tool, notebook, and analysis, and carefully update each one to include and provided_free = 0. Yikes.

Single Source of Truth to the Rescue

A Semantic Layer allows you to centralize the “users subscribed” definition in one single, easy-to-update place. dbt’s Semantic Layer, formerly a separate project named MetricFlow, is the only tool I’ve personally used in production, but there are a number of options on the market today. I don’t want to get bogged down in code as each Semantic Layer tool is different, but in dbt we’d do something like this:

dimensions:
	- name: is_subscribed
		description: 'True for each paying subscriber, false otherwise'
		expr: case when subscribed = 1 and trialing = 0 then true else false end
	  type: categorical

measures: 
  - name: all_customers
    description: 'The distinct count of paying customers by account creation date'
    agg: count_distinct
		filter: {{Dimension('is_subscribed')}} is true
    agg_time_dimension: created_at

dbt’s Semantic Layer capabilities take this code and make it queryable in your analytics tools. Instead of the raw select count(*) SQL, we query the Semantic Layer all_customers measure.

In the future when our back-end engineers add the provided_free field, they’ll ping us in analytics: “bad news, we need to update every count of subscribers everywhere to include provided_free = 0" Rather than a week or more of a nightmare digging through all of our analytics tools, we can simply head into dbt and update the is_subscribed dimension!

case when subscribed = 1 and trialing = 0 and provided_free = 0 then true else false end