2023-12-08
Data engineering Data cleansing Best practicesFor goodness' sake, can we please all agree on some naming conventions for temporal (date, time, datetime etc.) columns?
Suppose in some tables, you see this:
DT | WEEKDAY | MTH | X |
---|---|---|---|
2024-01-01 | 1 | 1 | 42 |
2024-01-02 | 2 | 1 | 42 |
2024-01-03 | 3 | 1 | 42 |
2024-01-04 | 4 | 1 | 42 |
2024-01-05 | 5 | 1 | 42 |
And in other tables, you see this:
DATE | WEEKDAY | MNTH | X |
---|---|---|---|
2024-01-01 | Monday | January | 42 |
2024-01-02 | Tuesday | January | 42 |
2024-01-03 | Wednesday | January | 42 |
2024-01-04 | Thursday | January | 42 |
2024-01-05 | Friday | January | 42 |
Now, you can't really label this incorrect. From the lens of a single table, anyone could argue that 'date', 'day' and 'mnth' are acceptable names to refer to a date, day of the week and month of the year. And anyone could argue that weekdays can be represented as both a string ('Monday') and an integer (1) and are therefore acceptable to place in a column called 'weekday'.
However, it is inconsistent. There are inconsistencies in name and inconsistencies in meaning. From the lens of multiple tables, no one could rationally argue why, in principle, there should be:
Moreover, it isn't visible in the above, but in some projects I've worked on, a column called 'date' might be of type `DATE` in some tables and of type `VARCHAR(255)` in other tables, requiring a cast to `DATE`.
Maybe the above is acceptable if there are only a handful of tables, in which case the data scientist can clean it up themselves with only a modicum of annoyance. But the problem gets worse the more tables this applies to. What if you're working on a project with tens or hundreds of tables, with no rhyme or reason whatsoever as to the naming of temporal columns, so that there is no way to efficiently surmise from the column name alone what data it contains, and perhaps forcing you to create lots of your own views like the following:
create or replace view schema_y.table_x_cleaned as
select
date(dt) as date,
weekday,
mnth as mth,
x
from schema_y.table_x;
Where you've renamed 'mnth' to 'mth' for no other reason than that is the name of the month column in most of the tables you're working with in that project.
It's high entropy. It's chaos.
Maybe by the year 2030 we will all have gotten our act together and this problem will no longer persist in corporate settings.
In the meantime, here are the conventions I use in my personal projects, so that I don't even have to think about it anymore:
Abbreviation | Meaning | Example |
---|---|---|
dd | Day as a date | 2000-01-01 |
ds | Day as a string | mon |
dow | Day of week as an integer | 7 |
woy | Week of year as an integer | 52 |
y | Year as an integer | 2000 |
md | Month as a date | 2000-01 |
ms | Month as a string | jan |
h | Hour of day as an integer | 23 |
m | Minute as an integer | 59 |
s | Second as an integer | 59 |
t | Time | 23:59:59 |
moy | Month of year as an integer | 12 |
dt | Datetime | 2000-01-01 23:59 |
So the above example would be saved in the database as:
DT | DS | MS | X |
---|---|---|---|
2023-01-01 | mon | jan | 42 |
2023-01-02 | tue | jan | 42 |
2023-01-03 | wed | jan | 42 |
2023-01-04 | thu | jan | 42 |
2023-01-05 | fri | jan | 42 |
I will now present my rationale for the above conventions.
Because there is no standard English word for them, in what follows, define a time unit as referring to any of the well-known units of time which are commonly used in practice in day-to-day data science projects. These are the year, quarter, month, week, day, hour, minute and the second.
(Depending on the business domain, in a lot of projects, you can safely assume that you will never, ever care about milliseconds, microseconds, nanoseconds etc. The proposed conventions apply only to those projects for which this is the case, although I don't doubt they could be extended to handle these fractions-of-a-second time units. This would apply to, say, real-time streaming analytics for a high-frequency trading or engineering application involving sensor data and IoT, in which these units become important).
Observation: Conveniently, almost all time units have unique initials.
The only exception is month/minute. For convenience, let's call this the 'month/minute ambiguity'.
Time unit | Initial |
---|---|
year | y |
quarter | q |
month | m |
week | w |
day | d |
hour | h |
minute | m |
second | s |
Call these the base representations of these time units.
Every time unit - surprise, surprise - has a temporal representation.
Proposal: Let's refer to string representations by appending 't' to their base representations.
Abbreviation | Meaning | Examples |
---|---|---|
yt | Year as a temporal | 2000 |
mt | Month as a temporal | Jan-00 |
dt | Day as a temporal | 2000-01-01 |
ht | Hour as a temporal | 2000-01-01T11 |
mt | Minute as a temporal | 2000-01-01T11:59 |
st | Second as a temporal | 2000-01-01T11:59:59 |
Every time unit has at least one integer representation. Most of them have multiple. They are all defined WRT a 'bigger' unit. For example, you can talk about the day of the week (1, ..., 7), day of the month (1, ..., 28/29/30/31) or day of the year (1, ..., 365).
The time unit with only one integer representation is the year, which is for practical purposes the biggest time unit (I haven't ever heard of anyone talking about year of the decade (1, ..., 10) or year of the century (1, ..., 100) in day-to-day English, let alone in a data science project). Its integer representation is therefore simply referred to as 'y'.
Proposal: Let's refer to these integer representations using the abbreviation: [base representation of the small unit]o[base representation of the bigger unit].
So we have:
Abbreviation | Meaning | AKA | Examples |
---|---|---|---|
y | Year | NaN | 2000, ..., 2023 |
qoy | Quarter of year | NaN | 1, ..., 4 |
moy | Month of year | NaN | 1, ..., 12 |
woy | Week of year | NaN | 1, ..., 52 |
doy | Day of year | NaN | 1, ..., 365 |
dom | Day of month | NaN | 1, ..., 31 |
dow | Day of week | Weekday | 1, ..., 7 |
hod | Hour of day | NaN | 1, ..., 24 |
moh | Minute of hour | NaN | 1, ..., 60 |
som | Second of minute | NaN | 1, ..., 60 |
Strictly speaking, the month/minute ambiguity causes the following two cases to fail:
From the above conventions it is unclear whether, say, the first one refers to the 'minute of the quarter' or the 'month of the quarter'.
Practically speaking, however, it's amusing to me the idea of anyone seriously talking about the 'minute of the quarter', (1, ..., 129600, ...) and I can't see how it could be a real issue.
Note: Year is the only time unit whose integer and temporal representations 'look' the same. The difference would be the data types of their columns in a database table.
Furthermore, every time unit has an 'obvious' integer representation which most people would agree is its 'standard' or 'canonical' integer representation. Therefore, refer to these as canonical integer representations (CIRs).
Time unit | CIR |
---|---|
y | y |
q | qoy |
m | moy |
w | woy |
d | dow |
h | hod |
m | moh |
s | som |
Note that the CIR isn't simply [unit] of [next bigger unit], because this isn't the case for the week or the month. i.e. we don't often speak about the 'week of the month' or the 'month of the quarter'.
In cases when there is no possibility of ambiguity, a time unit's CIR can be used. Otherwise, a full integer representation must be used.
Observation: The only calendar units that have English, and thus the only calendar units that have string representations, are months and weekdays. The rest only have integer representations.
Proposal: Let's refer to string representations by appending 's' to their base representations.
Abbreviation | Meaning | Examples |
---|---|---|
ms | Month as a string | mon, ..., sun |
ds | Day as a string | jan, ..., dec |
In summary, my above naming conventions for temporal columns:
I use these and other of my own conventions extensively throughout my personal data science projects. I find them indispensable to boosting my productivity.
One final thought: For goodness' sake, can we please all agree to a common word to refer to the 'part of the week' (weekday or weekend) - perhaps the 'weekpart'?