A dense percentile rank in Snowflake

2023-10-27

Snowflake SQL

Contents

1 Background
2 The task
3 The problem
4 My solution
5 Summary using a toy example
6 Endnote: How this is done in Pandas
7 References

1 Background

I was once asked to develop a model that produces scores for over a million sites in Australia. The sites included certain residential street addresses and points of interest (POIs) such as shopping centres, supermarkets, quick service restaurants (QSRs), hotels, universities and parking lots. Each site had a set of features at different levels of geographic granularity. For example, dist_to_supermarket is a site-level feature, whereas prop_apartment is a postcode-level feature shared by every site in the same postcode.

In short, a few rows and columns of the data looked something like this:

| site_id | site_kind | poi_kind | street_address | postcode | dist_to_supermarket | other site-level features... | prop_apartment | other postcode-level features... |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D990748 | RESIDENTIAL | NaN | ... | 2428 | 7.4 | ... | 0.1559 | ... |
| 0A9AD384 | POI | SUPERMARKET | ... | 2428 | 0.0 | ... | 0.1559 | ... |
| 38990C94 | POI | QSR | ... | 2428 | 3.4 | ... | 0.1559 | ... |
| 85E00EFA | RESIDENTIAL | NaN | ... | 2428 | 0.8 | ... | 0.1559 | ... |
| 467AF263 | RESIDENTIAL | NaN | ... | 2428 | 2.0 | ... | 0.1559 | ... |
| 92E1C7AB | RESIDENTIAL | NaN | ... | 2429 | 7.7 | ... | 0.0151 | ... |
| 434F0F78 | RESIDENTIAL | NaN | ... | 2429 | 6.7 | ... | 0.0151 | ... |
| C0DCBD14 | POI | PARKING | ... | 2429 | 3.9 | ... | 0.0151 | ... |
| B3BC2EAD | RESIDENTIAL | NaN | ... | 2429 | 0.8 | ... | 0.0151 | ... |
| 0B48D43C | RESIDENTIAL | NaN | ... | 2429 | 8.3 | ... | 0.0151 | ... |

(Note: These data are fabricated).

To help them understand how the features were driving the final model score, the business asked me to provide an extract of the scores, the features, and the features' ordinal statistics (rank and percentile rank across all of Australia) for a small subset of these sites.

Now before I go on, there is a key detail I need to mention: For several reasons, the table of 1M+ sites contains not only the site-level features but also the postcode-level features left-joined onto it. So there is a lot of redundancy in the postcode-level features for this table, but this is acceptable.
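
Concretely, the construction looked something like the sketch below (the source table and column names here are hypothetical):

create or replace table site_features as
select
s.*, -- site-level columns: site_id, site_kind, poi_kind, street_address, postcode, dist_to_supermarket, ...
p.prop_apartment, -- postcode-level columns, repeated for every site in the postcode
p.other_postcode_feature -- hypothetical stand-in for the remaining postcode-level features
from site_level_features s
left join postcode_level_features p
on s.postcode = p.postcode;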

2 The task

For each site, suppose I want the rank of its value for a postcode-level feature, e.g. prop_apartment, in descending order (1st is better than 2nd, which is better than 3rd, and so on), as well as the percentile rank of the same value in ascending order (100% is better than 99%, which is better than 98%, and so on). The natural way to do this in Snowflake is something like:


select
site_id,
postcode,
prop_apartment,
rank() over(order by prop_apartment desc) as prop_apartment_rnk,
percent_rank() over(order by prop_apartment asc) as prop_apartment_pct
from site_features;

This returns:

| site_id | postcode | prop_apartment | prop_apartment_rnk | prop_apartment_pct |
| --- | --- | --- | --- | --- |
| 3D990748 | 2428 | 0.1559 | 62559 | 0.862697 |
| 0A9AD384 | 2428 | 0.1559 | 62559 | 0.862697 |
| 38990C94 | 2428 | 0.1559 | 62559 | 0.862697 |
| 85E00EFA | 2428 | 0.1559 | 62559 | 0.862697 |
| 467AF263 | 2428 | 0.1559 | 62559 | 0.862697 |
| 92E1C7AB | 2429 | 0.0151 | 271273 | 0.409068 |
| 434F0F78 | 2429 | 0.0151 | 271273 | 0.409068 |
| C0DCBD14 | 2429 | 0.0151 | 271273 | 0.409068 |
| B3BC2EAD | 2429 | 0.0151 | 271273 | 0.409068 |
| 0B48D43C | 2429 | 0.0151 | 271273 | 0.409068 |

3 The problem

But this isn't what I want. Remember that by design there is a lot of redundancy in the postcode-level features in this table, and there are only 2000 distinct values for prop_apartment (because there are 2000 postcodes in the site universe). Naturally I would like the percentile ranks to step through integer multiples of 1 / (2000 - 1) ≈ 0.0005, yet that isn't what I'm getting. percent_rank takes into account the number of sites in each postcode when calculating the percentile, which is irrelevant for the purpose of ranking postcode-level values.
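
To see this concretely, a quick inspection query (a sketch, not part of the final extract) lists the distinct value-percentile pairs. The percentiles land wherever the site counts put them (e.g. 0.862697 above), not on an evenly spaced grid:

select distinct
prop_apartment,
prop_apartment_pct
from (
select
prop_apartment,
-- percent_rank ties each distinct value to the number of *rows* below it,
-- so the spacing between consecutive percentiles varies with postcode size
percent_rank() over(order by prop_apartment asc) as prop_apartment_pct
from site_features
)
order by prop_apartment_pct;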

4 My solution

For the rank, Snowflake's inbuilt dense_rank function comes to the rescue.


select
site_id,
postcode,
prop_apartment,
dense_rank() over(order by prop_apartment desc) as prop_apartment_drnk
from site_features;

But for some reason, at the time of writing Snowflake does not have a built-in equivalent of percent_rank based on the dense rank (by analogy, it might be called 'dense_percent_rank'), so it must be implemented manually. From Snowflake's documentation:

PERCENT_RANK is calculated as:

If n is 1: PERCENT_RANK = 0

If n is greater than 1: PERCENT_RANK = (r - 1) / (n - 1)

where r is the RANK of the row and n is the number of rows in the window partition.
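
For example, in a window of n = 10 rows, a row whose value has rank r = 4 receives PERCENT_RANK = (4 - 1) / (10 - 1) = 1/3 ≈ 0.333333; this is exactly the x_pct value for x = 2 in the toy example of section 5 below.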

Inspired by this, I arrived at the following rather verbose final solution. For the sake of illustration, assume there is now one extra postcode in the site universe, giving 2001 distinct postcodes in total.


select
site_id,
postcode,
prop_apartment,
dense_rank() over(order by prop_apartment desc) as prop_apartment_drnk,
(dense_rank() over(order by prop_apartment asc) - 1)
    / (count(distinct postcode) over(partition by 1) - 1) as prop_apartment_dpct
from site_features;

This returns:

| site_id | postcode | prop_apartment | prop_apartment_drnk | prop_apartment_dpct |
| --- | --- | --- | --- | --- |
| 3D990748 | 2428 | 0.1559 | 295 | 0.5255 |
| 0A9AD384 | 2428 | 0.1559 | 295 | 0.5255 |
| 38990C94 | 2428 | 0.1559 | 295 | 0.5255 |
| 85E00EFA | 2428 | 0.1559 | 295 | 0.5255 |
| 467AF263 | 2428 | 0.1559 | 295 | 0.5255 |
| 92E1C7AB | 2429 | 0.0151 | 906 | 0.2200 |
| 434F0F78 | 2429 | 0.0151 | 906 | 0.2200 |
| C0DCBD14 | 2429 | 0.0151 | 906 | 0.2200 |
| B3BC2EAD | 2429 | 0.0151 | 906 | 0.2200 |
| 0B48D43C | 2429 | 0.0151 | 906 | 0.2200 |

Note that the 'dense' percentile ranks are now integer multiples of 1 / (2001 - 1) = 0.0005, confirming that the calculation ignores the number of sites within each postcode. This is what I want.

(Note: The seemingly gratuitous over(partition by 1) in the denominator is needed because without an OVER clause, count(distinct postcode) is parsed as an ordinary aggregate, which would require a GROUP BY; Snowflake therefore throws a 'not a valid group by expression' error. Adding the OVER clause turns it into a window function evaluated over the whole table.)
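
Incidentally, if the over(partition by 1) trick feels too hacky, the denominator can instead be computed with an uncorrelated scalar subquery; a minimal sketch, assuming the same site_features table:

select
site_id,
postcode,
prop_apartment,
-- the scalar subquery is evaluated once over the whole table, so no
-- window clause is needed on the count
(dense_rank() over(order by prop_apartment asc) - 1) / (select count(distinct postcode) - 1 from site_features) as prop_apartment_dpct
from site_features;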

5 Summary using a toy example

To summarise the above as simply as possible, I have created the following toy example showing my thought process leading to the final solution.


create or replace table toy (
     id number,
     k text,
     x number
);
    
insert into toy values
     (1, 'a', 1),
     (2, 'a', 1),
     (3, 'a', 1),
     (4, 'b', 2),
     (5, 'b', 2),
     (6, 'b', 2),
     (7, 'c', 3),
     (8, 'd', 4),
     (9, 'd', 4),
     (10, 'e', 5);
    
select
*,
rank() over(order by x desc) as x_rnk,
dense_rank() over(order by x desc) as x_drnk,
percent_rank() over(order by x asc) as x_pct,
(dense_rank() over(order by x asc) - 1) / (count(distinct k) over(partition by 1) - 1) as x_dpct
from toy;

Running these queries returns the following table:

| ID | K | X | X_RNK | X_DRNK | X_PCT | X_DPCT |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | a | 1 | 8 | 5 | 0.000000 | 0.00 |
| 2 | a | 1 | 8 | 5 | 0.000000 | 0.00 |
| 3 | a | 1 | 8 | 5 | 0.000000 | 0.00 |
| 4 | b | 2 | 5 | 4 | 0.333333 | 0.25 |
| 5 | b | 2 | 5 | 4 | 0.333333 | 0.25 |
| 6 | b | 2 | 5 | 4 | 0.333333 | 0.25 |
| 7 | c | 3 | 4 | 3 | 0.666667 | 0.50 |
| 8 | d | 4 | 2 | 2 | 0.777778 | 0.75 |
| 9 | d | 4 | 2 | 2 | 0.777778 | 0.75 |
| 10 | e | 5 | 1 | 1 | 1.000000 | 1.00 |

Description of columns:

- x_rnk: rank() of x in descending order; tied rows share a rank and subsequent ranks are skipped.
- x_drnk: dense_rank() of x in descending order; tied rows share a rank and no ranks are skipped.
- x_pct: the built-in percent_rank() of x in ascending order, which counts individual rows and so depends on the number of ties.
- x_dpct: the manual 'dense' percentile rank of x in ascending order, which steps through integer multiples of 1 / (5 - 1) = 0.25 because there are 5 distinct values of k.
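
As a final sanity check (my own addition, not part of the original derivation), x_dpct should coincide with percent_rank applied to just the distinct values of x, since each k in the toy table maps to exactly one value of x:

select
x,
-- percent_rank over the 5 distinct values reproduces x_dpct
percent_rank() over(order by x asc) as x_dpct_check
from (select distinct x from toy);

This returns 0.00, 0.25, 0.50, 0.75 and 1.00 for x = 1 through 5, matching the x_dpct column above.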

6 Endnote: How this is done in Pandas

As an endnote, in case you're a Python user and are lucky enough to have your data fit in memory on a single machine, this is all handled very nicely in Pandas. pd.DataFrame.rank allows you to specify which method to use. The options are 'first', 'min', 'max', 'average', and 'dense'. The default is 'average'.


import pandas as pd
pdf = pd.DataFrame(
     data=[1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
     columns=["x"],
)
meth_lst = [
     "first",
     "min",
     "max",
     "average",
     "dense",
]
for meth in meth_lst:
     pdf[f"rnk_{meth}"] = pdf["x"].rank(method=meth)
     pdf[f"pct_rnk_{meth}"] = pdf["x"].rank(method=meth, pct=True)

The resulting dataframe looks like this:

| x | rnk_first | pct_rnk_first | rnk_min | pct_rnk_min | rnk_max | pct_rnk_max | rnk_average | pct_rnk_average | rnk_dense | pct_rnk_dense |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.0 | 0.1 | 1.0 | 0.1 | 1.0 | 0.1 | 1.0 | 0.10 | 1.0 | 0.25 |
| 2 | 2.0 | 0.2 | 2.0 | 0.2 | 3.0 | 0.3 | 2.5 | 0.25 | 2.0 | 0.50 |
| 2 | 3.0 | 0.3 | 2.0 | 0.2 | 3.0 | 0.3 | 2.5 | 0.25 | 2.0 | 0.50 |
| 3 | 4.0 | 0.4 | 4.0 | 0.4 | 6.0 | 0.6 | 5.0 | 0.50 | 3.0 | 0.75 |
| 3 | 5.0 | 0.5 | 4.0 | 0.4 | 6.0 | 0.6 | 5.0 | 0.50 | 3.0 | 0.75 |
| 3 | 6.0 | 0.6 | 4.0 | 0.4 | 6.0 | 0.6 | 5.0 | 0.50 | 3.0 | 0.75 |
| 4 | 7.0 | 0.7 | 7.0 | 0.7 | 10.0 | 1.0 | 8.5 | 0.85 | 4.0 | 1.00 |
| 4 | 8.0 | 0.8 | 7.0 | 0.7 | 10.0 | 1.0 | 8.5 | 0.85 | 4.0 | 1.00 |
| 4 | 9.0 | 0.9 | 7.0 | 0.7 | 10.0 | 1.0 | 8.5 | 0.85 | 4.0 | 1.00 |
| 4 | 10.0 | 1.0 | 7.0 | 0.7 | 10.0 | 1.0 | 8.5 | 0.85 | 4.0 | 1.00 |

It is clear what is being done here: 'first' splits ties by the order in which they appear in the dataframe, giving a unique percentile per row. 'min' takes the minimum of these per unique value. 'max' takes the maximum. 'average' takes the average of these last two. And finally 'dense' is what I showed before (a percentile applied to the unique values).

N.B. in all cases this is the 'inclusive' definition of percentile rather than the 'exclusive' definition; Pandas does not provide an option to calculate the 'exclusive' version.

7 References

https://docs.snowflake.com/en/sql-reference/functions/dense_rank

https://docs.snowflake.com/en/sql-reference/functions/percent_rank
