A dense percentile rank in Snowflake

2023-10-27

Snowflake SQL

Contents

1 Background
2 The task
3 The problem
4 My solution
5 Summary using a toy example
6 Endnote: How this is done in Pandas
7 References

1 Background

I was once asked to develop a model that produces scores for over a million sites in Australia. The sites included certain residential street addresses and points of interest (POIs) such as shopping centres, supermarkets, quick service restaurants (QSRs), hotels, universities and parking lots. Each site had a set of features at different levels of geographic granularity. For example, dist_to_supermarket is a site-level feature, whereas prop_apartment is a postcode-level feature shared by every site in the same postcode.

In short, a few rows and columns of the data looked something like this:

| site_id | site_kind | poi_kind | street_address | postcode | dist_to_supermarket | other site-level features... | prop_apartment | other postcode-level features... |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D990748 | RESIDENTIAL | NaN | ... | 2428 | 7.4 | ... | 0.1559 | ... |
| 0A9AD384 | POI | SUPERMARKET | ... | 2428 | 0.0 | ... | 0.1559 | ... |
| 38990C94 | POI | QSR | ... | 2428 | 3.4 | ... | 0.1559 | ... |
| 85E00EFA | RESIDENTIAL | NaN | ... | 2428 | 0.8 | ... | 0.1559 | ... |
| 467AF263 | RESIDENTIAL | NaN | ... | 2428 | 2.0 | ... | 0.1559 | ... |
| 92E1C7AB | RESIDENTIAL | NaN | ... | 2429 | 7.7 | ... | 0.0151 | ... |
| 434F0F78 | RESIDENTIAL | NaN | ... | 2429 | 6.7 | ... | 0.0151 | ... |
| C0DCBD14 | POI | PARKING | ... | 2429 | 3.9 | ... | 0.0151 | ... |
| B3BC2EAD | RESIDENTIAL | NaN | ... | 2429 | 0.8 | ... | 0.0151 | ... |
| 0B48D43C | RESIDENTIAL | NaN | ... | 2429 | 8.3 | ... | 0.0151 | ... |

(Note: These data are fabricated).

To help them understand how the features were driving the final model score, the business asked me to provide an extract of the scores, the features, and the features' ordinal statistics (rank and percentile rank across all of Australia) for a small subset of these sites.

Now before I go on, there is a key detail I need to mention: For several reasons, the table of 1M+ sites contains not only the site-level features but also the postcode-level features left-joined onto it. So there is a lot of redundancy in the postcode-level features for this table, but this is acceptable.
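
Concretely, the construction looked something like the sketch below (the source table and column names here are hypothetical):

create or replace table site_features as
select
s.*, -- site-level columns: site_id, site_kind, poi_kind, street_address, postcode, dist_to_supermarket, ...
p.prop_apartment, -- postcode-level columns, repeated for every site in the postcode
p.other_postcode_feature -- hypothetical stand-in for the remaining postcode-level features
from site_level_features s
left join postcode_level_features p
on s.postcode = p.postcode;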

2 The task

For each site, suppose I want the rank of its value for a postcode-level feature, e.g. prop_apartment, in descending order (1st is better than 2nd, which is better than 3rd, and so on), as well as the percentile rank of the same value in ascending order (100% is better than 99%, which is better than 98%, and so on). The natural way to do this in Snowflake is something like:


select
site_id,
postcode,
prop_apartment,
rank() over(order by prop_apartment desc) as prop_apartment_rnk,
percent_rank() over(order by prop_apartment asc) as prop_apartment_pct
from site_features;

This returns:

| site_id | postcode | prop_apartment | prop_apartment_rnk | prop_apartment_pct |
| --- | --- | --- | --- | --- |
| 3D990748 | 2428 | 0.1559 | 62559 | 0.862697 |
| 0A9AD384 | 2428 | 0.1559 | 62559 | 0.862697 |
| 38990C94 | 2428 | 0.1559 | 62559 | 0.862697 |
| 85E00EFA | 2428 | 0.1559 | 62559 | 0.862697 |
| 467AF263 | 2428 | 0.1559 | 62559 | 0.862697 |
| 92E1C7AB | 2429 | 0.0151 | 271273 | 0.409068 |
| 434F0F78 | 2429 | 0.0151 | 271273 | 0.409068 |
| C0DCBD14 | 2429 | 0.0151 | 271273 | 0.409068 |
| B3BC2EAD | 2429 | 0.0151 | 271273 | 0.409068 |
| 0B48D43C | 2429 | 0.0151 | 271273 | 0.409068 |

3 The problem

But this isn't what I want. Remember that by design there is a lot of redundancy in the postcode-level features in this table, and there are only 2000 distinct values for prop_apartment (because there are 2000 postcodes in the site universe). Naturally I would like the percentile ranks to step through integer multiples of 1 / (2000 - 1) ≈ 0.0005, yet that isn't what I'm getting. percent_rank takes into account the number of sites in each postcode when calculating the percentile, which is irrelevant for the purpose of ranking postcode-level values.
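
To see this concretely, a quick inspection query (a sketch, not part of the final extract) lists the distinct value-percentile pairs. The percentiles land wherever the site counts put them (e.g. 0.862697 above), not on an evenly spaced grid:

select distinct
prop_apartment,
prop_apartment_pct
from (
select
prop_apartment,
-- percent_rank ties each distinct value to the number of *rows* below it,
-- so the spacing between consecutive percentiles varies with postcode size
percent_rank() over(order by prop_apartment asc) as prop_apartment_pct
from site_features
)
order by prop_apartment_pct;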

4 My solution

For the rank, Snowflake's inbuilt dense_rank function comes to the rescue.


select
site_id,
postcode,
prop_apartment,
dense_rank() over(order by prop_apartment desc) as prop_apartment_drnk
from site_features;

But for some reason, at the time of writing Snowflake does not have a built-in equivalent of percent_rank based on the dense rank (by analogy, it might be called 'dense_percent_rank'), so it must be implemented manually. From Snowflake's documentation:

PERCENT_RANK is calculated as:

If n is 1: PERCENT_RANK = 0

If n is greater than 1: PERCENT_RANK = (r - 1) / (n - 1)

where r is the RANK of the row and n is the number of rows in the window partition.
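
For example, in a window of n = 10 rows, a row whose value has rank r = 4 receives PERCENT_RANK = (4 - 1) / (10 - 1) = 1/3 ≈ 0.333333; this is exactly the x_pct value for x = 2 in the toy example of section 5 below.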

Inspired by this, I arrived at the following rather verbose final solution. For the sake of illustration, assume there is now one extra postcode in the site universe, giving 2001 distinct postcodes in total.


select
site_id,
postcode,
prop_apartment,
dense_rank() over(order by prop_apartment desc) as prop_apartment_drnk,
(dense_rank() over(order by prop_apartment asc) - 1)
    / (count(distinct postcode) over(partition by 1) - 1) as prop_apartment_dpct
from site_features;

This returns:

| site_id | postcode | prop_apartment | prop_apartment_drnk | prop_apartment_dpct |
| --- | --- | --- | --- | --- |
| 3D990748 | 2428 | 0.1559 | 295 | 0.5255 |
| 0A9AD384 | 2428 | 0.1559 | 295 | 0.5255 |
| 38990C94 | 2428 | 0.1559 | 295 | 0.5255 |
| 85E00EFA | 2428 | 0.1559 | 295 | 0.5255 |
| 467AF263 | 2428 | 0.1559 | 295 | 0.5255 |
| 92E1C7AB | 2429 | 0.0151 | 906 | 0.2200 |
| 434F0F78 | 2429 | 0.0151 | 906 | 0.2200 |
| C0DCBD14 | 2429 | 0.0151 | 906 | 0.2200 |
| B3BC2EAD | 2429 | 0.0151 | 906 | 0.2200 |
| 0B48D43C | 2429 | 0.0151 | 906 | 0.2200 |

Note that the 'dense' percentile ranks are now integer multiples of 1 / (2001 - 1) = 0.0005, confirming that the calculation ignores the number of sites within each postcode. This is what I want.

(Note: The seemingly gratuitous over(partition by 1) in the denominator is needed because without an OVER clause, count(distinct postcode) is parsed as an ordinary aggregate, which would require a GROUP BY; Snowflake therefore throws a 'not a valid group by expression' error. Adding the OVER clause turns it into a window function evaluated over the whole table.)
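
Incidentally, if the over(partition by 1) trick feels too hacky, the denominator can instead be computed with an uncorrelated scalar subquery; a minimal sketch, assuming the same site_features table:

select
site_id,
postcode,
prop_apartment,
-- the scalar subquery is evaluated once over the whole table, so no
-- window clause is needed on the count
(dense_rank() over(order by prop_apartment asc) - 1) / (select count(distinct postcode) - 1 from site_features) as prop_apartment_dpct
from site_features;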

5 Summary using a toy example

To summarise the above as simply as possible, I have created the following toy example showing my thought process leading to the final solution.


create or replace table toy (
     id number,
     k text,
     x number
);
    
insert into toy values
     (1, 'a', 1),
     (2, 'a', 1),
     (3, 'a', 1),
     (4, 'b', 2),
     (5, 'b', 2),
     (6, 'b', 2),
     (7, 'c', 3),
     (8, 'd', 4),
     (9, 'd', 4),
     (10, 'e', 5);
    
select
*,
rank() over(order by x desc) as x_rnk,
dense_rank() over(order by x desc) as x_drnk,
percent_rank() over(order by x asc) as x_pct,
(dense_rank() over(order by x asc) - 1) / (count(distinct k) over(partition by 1) - 1) as x_dpct
from toy;

Running these queries returns the following table:

| ID | K | X | X_RNK | X_DRNK | X_PCT | X_DPCT |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | a | 1 | 8 | 5 | 0.000000 | 0.00 |
| 2 | a | 1 | 8 | 5 | 0.000000 | 0.00 |
| 3 | a | 1 | 8 | 5 | 0.000000 | 0.00 |
| 4 | b | 2 | 5 | 4 | 0.333333 | 0.25 |
| 5 | b | 2 | 5 | 4 | 0.333333 | 0.25 |
| 6 | b | 2 | 5 | 4 | 0.333333 | 0.25 |
| 7 | c | 3 | 4 | 3 | 0.666667 | 0.50 |
| 8 | d | 4 | 2 | 2 | 0.777778 | 0.75 |
| 9 | d | 4 | 2 | 2 | 0.777778 | 0.75 |
| 10 | e | 5 | 1 | 1 | 1.000000 | 1.00 |

Description of columns:

- x_rnk: rank() of x in descending order; tied rows share a rank and subsequent ranks are skipped.
- x_drnk: dense_rank() of x in descending order; tied rows share a rank and no ranks are skipped.
- x_pct: the built-in percent_rank() of x in ascending order, which counts individual rows and so depends on the number of ties.
- x_dpct: the manual 'dense' percentile rank of x in ascending order, which steps through integer multiples of 1 / (5 - 1) = 0.25 because there are 5 distinct values of k.
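
As a final sanity check (my own addition, not part of the original derivation), x_dpct should coincide with percent_rank applied to just the distinct values of x, since each k in the toy table maps to exactly one value of x:

select
x,
-- percent_rank over the 5 distinct values reproduces x_dpct
percent_rank() over(order by x asc) as x_dpct_check
from (select distinct x from toy);

This returns 0.00, 0.25, 0.50, 0.75 and 1.00 for x = 1 through 5, matching the x_dpct column above.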

6 Endnote: How this is done in Pandas

As an endnote, in case you're a Python user and are lucky enough to have your data fit in memory on a single machine, this is all handled very nicely in Pandas. pd.DataFrame.rank allows you to specify which method to use. The options are 'first', 'min', 'max', 'average', and 'dense'. The default is 'average'.


import pandas as pd
pdf = pd.DataFrame(
     data=[1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
     columns=["x"],
)
meth_lst = [
     "first",
     "min",
     "max",
     "average",
     "dense",
]
for meth in meth_lst:
     pdf[f"rnk_{meth}"] = pdf["x"].rank(method=meth)
     pdf[f"pct_rnk_{meth}"] = pdf["x"].rank(method=meth, pct=True)

The resulting dataframe looks like this:

| x | rnk_first | pct_rnk_first | rnk_min | pct_rnk_min | rnk_max | pct_rnk_max | rnk_average | pct_rnk_average | rnk_dense | pct_rnk_dense |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.0 | 0.1 | 1.0 | 0.1 | 1.0 | 0.1 | 1.0 | 0.10 | 1.0 | 0.25 |
| 2 | 2.0 | 0.2 | 2.0 | 0.2 | 3.0 | 0.3 | 2.5 | 0.25 | 2.0 | 0.50 |
| 2 | 3.0 | 0.3 | 2.0 | 0.2 | 3.0 | 0.3 | 2.5 | 0.25 | 2.0 | 0.50 |
| 3 | 4.0 | 0.4 | 4.0 | 0.4 | 6.0 | 0.6 | 5.0 | 0.50 | 3.0 | 0.75 |
| 3 | 5.0 | 0.5 | 4.0 | 0.4 | 6.0 | 0.6 | 5.0 | 0.50 | 3.0 | 0.75 |
| 3 | 6.0 | 0.6 | 4.0 | 0.4 | 6.0 | 0.6 | 5.0 | 0.50 | 3.0 | 0.75 |
| 4 | 7.0 | 0.7 | 7.0 | 0.7 | 10.0 | 1.0 | 8.5 | 0.85 | 4.0 | 1.00 |
| 4 | 8.0 | 0.8 | 7.0 | 0.7 | 10.0 | 1.0 | 8.5 | 0.85 | 4.0 | 1.00 |
| 4 | 9.0 | 0.9 | 7.0 | 0.7 | 10.0 | 1.0 | 8.5 | 0.85 | 4.0 | 1.00 |
| 4 | 10.0 | 1.0 | 7.0 | 0.7 | 10.0 | 1.0 | 8.5 | 0.85 | 4.0 | 1.00 |

It is clear what is being done here: 'first' splits ties by the order in which they appear in the dataframe, giving a unique percentile per row. 'min' takes the minimum of these per unique value. 'max' takes the maximum. 'average' takes the average of these last two. And finally 'dense' is what I showed before (a percentile applied to the unique values).

N.B. in all cases this is the 'inclusive' definition of percentile rather than the 'exclusive' definition; Pandas does not provide an option to calculate the 'exclusive' version.

7 References

https://docs.snowflake.com/en/sql-reference/functions/dense_rank

https://docs.snowflake.com/en/sql-reference/functions/percent_rank
