Name conventions for temporal columns

2023-12-08

Data engineering Data cleansing Best practices

Contents


For goodness' sake, can we please all agree on some naming conventions for temporal (date, time, datetime etc.) columns?

1 The problem

Suppose in some tables, you see this:

DTWEEKDAYMTHX
2024-01-011142
2024-01-022142
2024-01-033142
2024-01-044142
2024-01-055142

And in other tables, you see this:

DATEWEEKDAYMNTHX
2024-01-01MondayJanuary42
2024-01-02TuesdayJanuary42
2024-01-03WednesdayJanuary42
2024-01-04ThursdayJanuary42
2024-01-05FridayJanuary42

Now, you can't really label this incorrect. From the lens of a single table, anyone could argue that 'date', 'day' and 'mnth' are acceptable names to refer to a date, day of the week and month of the year. And anyone could argue that weekdays can be represented as both a string ('Monday') and an integer (1) and are therefore acceptable to place in a column called 'weekday'.

However, it is inconsistent. There are inconsistencies in name and inconsistencies in meaning. From the lens of multiple tables, no one could rationally argue why, in principle, there should be:

Moreover, it isn't visible in the above, but in some projects I've worked on, a column called 'date' might be of type `DATE` in some tables and of type `VARCHAR(255)` in other tables, requiring a cast to `DATE`.

Maybe the above is acceptable if there are only a handful of tables, in which case the data scientist can clean it up themselves with only a modicum of annoyance. But the problem gets worse the more tables this applies to. What if you're working on a project with tens or hundreds of tables, with no rhyme or reason whatsoever as to the naming of temporal columns, so that there is no way to efficiently surmise from the column name alone what data it contains, and perhaps forcing you to create lots of your own views like the following:


create or replace view schema_y.table_x_cleaned as
select
date(dt) as date,
weekday,
mnth as mth,
x
from schema_y.table_x;

Where you've renamed 'mnth' to 'mth' for no other reason than that is the name of the month column in most of the tables you're working with in that project.

It's high entropy. It's chaos.

2 My solution

Maybe by the year 2030 we will all have gotten our act together and this problem will no longer persist in corporate settings.

In the meantime, here are the conventions I use in my personal projects, so that I don't even have to think about it anymore:

AbbreviationMeaningExample
ddDay as a date2000-01-01
dsDay as a stringmon
dowDay of week as an integer7
woyWeek of year as an integer52
yYear as an integer2000
mdMonth as a date2000-01
msMonth as a stringjan
hHour of day as an integer23
mMinute as an integer59
sSecond as an integer59
tTime23:59:59
moyMonth of year as an integer12
dtDatetime2000-01-01 23:59

So the above example would be saved in the database as:

DTDSMSX
2023-01-01monjan42
2023-01-02tuejan42
2023-01-03wedjan42
2023-01-04thujan42
2023-01-05frijan42

3 Rationale

I will now present my rationale for the above conventions.

Because there is no standard English word for them, in what follows, define a time unit as referring to any of the well-known units of time which are commonly used in practice in day-to-day data science projects. These are the year, quarter, month, week, day, hour, minute and the second.

(Depending on the business domain, in a lot of projects, you can safely assume that you will never, ever care about milliseconds, microseconds, nanoseconds etc. The proposed conventions apply only to those projects for which this is the case, although I don't doubt they could be extended to handle these fractions-of-a-second time units. This would apply to, say, real-time streaming analytics for a high-frequency trading or engineering application involving sensor data and IoT, in which these units become important).

3.1 Base representations of time units

Observation: Conveniently, almost all time units have unique initials.

The only exception is month/minute. For convenience, let's call this the 'month/minute ambiguity'.

Time unitInitial
yeary
quarterq
monthm
weekw
dayd
hourh
minutem
seconds

Call these the base representations of these time units.

3.2 Temporal representations of time units

Every time unit - surprise, surprise - has a temporal representation.

Proposal: Let's refer to string representations by appending 't' to their base representations.

AbbreviationMeaningExamples
ytYear as a temporal2000
mtMonth as a temporalJan-00
dtDay as a temporal2000-01-01
htHour as a temporal2000-01-01T11
mtMinute as a temporal2000-01-01T11:59
stSecond as a temporal2000-01-01T11:59:59

3.3 Integer representations of time units

Every time unit has at least one integer representation. Most of them have multiple. They are all defined WRT a 'bigger' unit. For example, you can talk about the day of the week (1, ..., 7), day of the month (1, ..., 28/29/30/31) or day of the year (1, ..., 365).

The time unit with only one integer representation is the year, which is for practical purposes the biggest time unit (I haven't ever heard of anyone talking about year of the decade (1, ..., 10) or year of the century (1, ..., 100) in day-to-day English, let alone in a data science project). Its integer representation is therefore simply referred to as 'y'.

Proposal: Let's refer to these integer representations using the abbreviation: [base representation of the small unit]o[base representation of the bigger unit].

So we have:

AbbreviationMeaningAKAExamples
yYearNaN2000, ..., 2023
qoyQuarter of yearNaN1, ..., 4
moyMonth of yearNaN1, ..., 12
woyWeek of yearNaN1, ..., 52
doyDay of yearNaN1, ..., 365
domDay of monthNaN1, ..., 31
dowDay of weekWeekday1, ..., 7
hodHour of dayNaN1, ..., 24
mohMinute of hourNaN1, ..., 60
somSecond of minuteNaN1, ..., 60

Strictly speaking, the month/minute ambiguity causes the following two cases to fail:

From the above conventions it is unclear whether, say, the first one refers to the 'minute of the quarter' or the 'month of the quarter'.

Practically speaking, however, it's amusing to me the idea of anyone seriously talking about the 'minute of the quarter', (1, ..., 129600, ...) and I can't see how it could be a real issue.

Note: Year is the only time unit whose integer and temporal representations 'look' the same. The difference would be the data types of their columns in a database table.

3.3.1 Canonical integer representations

Furthermore, every time unit has an 'obvious' integer representation which most people would agree is its 'standard' or 'canonical' integer representation. Therefore, refer to these as canonical integer representations (CIRs).

Time unitCIR
yy
qqoy
mmoy
wwoy
ddow
hhod
mmoh
ssom

Note that the CIR isn't simply [unit] of [next bigger unit], because this isn't the case for the week or the month. i.e. we don't often speak about the 'week of the month' or the 'month of the quarter'.

In cases when there is no possibility of ambiguity, a time unit's CIR can be used. Otherwise, a full integer representation must be used.

3.4 String representations of time units

Observation: The only calendar units that have English, and thus the only calendar units that have string representations, are months and weekdays. The rest only have integer representations.

Proposal: Let's refer to string representations by appending 's' to their base representations.

AbbreviationMeaningExamples
msMonth as a stringmon, ..., sun
dsDay as a stringjan, ..., dec

4 Summary

In summary, my above naming conventions for temporal columns:

I use these and other of my own conventions extensively throughout my personal data science projects. I find them indispensable to boosting my productivity.

5 Endnote: Weekpart

One final thought: For goodness' sake, can we please all agree to a common word to refer to the 'part of the week' (weekday or weekend) - perhaps the 'weekpart'?

Back to top