In my daily work, I was asked to produce some statistics over groups of data, so I learned to use Spark window functions to simplify the job.

What is a window function in Spark

Window functions operate on a group of rows, referred to as a window, and calculate a return value for each row based on the group of rows. Window functions are useful for processing tasks such as calculating a moving average, computing a cumulative statistic, or accessing the value of rows given the relative position of the current row.

In other words, you can apply a calculation to data grouped in whatever way you define.

There are three types of window functions:

Ranking Functions

RANK: As the name suggests, it assigns a ranking number to each value. Equal values get the same rank, and the rank after a tie skips ahead: if n rows come before a value, that value gets rank n + 1. E.g., for the values 1, 1, 2, the ranks are 1, 1, 3.

DENSE_RANK: It works like RANK, except that the rank increases by one for each distinct value, with no gaps, regardless of how many tied values come before. E.g., for the values 1, 1, 2, the ranks are 1, 1, 2.

PERCENT_RANK: Same as PERCENT_RANK in SQL. It gives the relative rank of the current row within the window, computed as (rank - 1) / (number of rows in the window partition - 1).

NTILE: It splits the rows of the window into n groups and returns the group id of each row, where n is the parameter passed to NTILE. If n is 4, the first quarter of the rows gets value 1, the second quarter gets 2, the third quarter gets 3, and the last quarter gets 4.

ROW_NUMBER: As the name suggests, it gives each row a unique, sequential row number within the window, even when values are tied; the sketch below compares it with RANK and DENSE_RANK.
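To make the differences concrete, here is a minimal Spark SQL sketch. The table exam_scores and its columns student, class and score are hypothetical, purely to illustrate the output of the three functions.

-- Hypothetical table: exam_scores(student, class, score)
-- RANK leaves gaps after ties, DENSE_RANK does not, ROW_NUMBER is always unique.
SELECT
  student,
  class,
  score,
  RANK()       OVER (PARTITION BY class ORDER BY score DESC) AS rnk,
  DENSE_RANK() OVER (PARTITION BY class ORDER BY score DESC) AS dense_rnk,
  ROW_NUMBER() OVER (PARTITION BY class ORDER BY score DESC) AS row_num
FROM exam_scores;

Two students with the same score in the same class will share rnk and dense_rnk but still get different row_num values.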

Analytic Functions

CUME_DIST: Get the cumulative distribution of values within a window partition, i.e. the fraction of rows whose value is less than or equal to the current row's value.

LAG: Get the value that is offset rows before the current row, or default if there are fewer than offset rows before the current row (a sketch using LAG and LEAD follows this list).

LEAD: Get the value that is offset rows after the current row, or default if there are fewer than offset rows after the current row.

NTH_VALUE: Get the value of the offset-th row from the beginning of the window frame, skipping null values when ignoreNulls is set to true. If there is no such row, null is returned.

FIRST_VALUE: Get the first value in the window frame, or the first non-null value when ignoreNulls is set to true. If all values are null, then null is returned.

LAST_VALUE: Get the last value in the window frame, or the last non-null value when ignoreNulls is set to true. If all values are null, then null is returned.
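As promised above, here is a minimal sketch of LAG and LEAD. The table daily_sales with columns day and amount is hypothetical; the query compares each day's amount with the previous and the next day, falling back to 0 when there is no such row.

-- Hypothetical table: daily_sales(day, amount)
-- LAG looks one row back, LEAD one row forward; 0 is the default when no such row exists.
SELECT
  day,
  amount,
  LAG(amount, 1, 0)  OVER (ORDER BY day) AS prev_amount,
  LEAD(amount, 1, 0) OVER (ORDER BY day) AS next_amount
FROM daily_sales;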

Aggregate Functions

There are a lot of aggregate functions; this article only lists the most common ones. For more details, please visit: https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#aggregate-functions

MAX: Get the max value in the window.

MIN: Get the minimum value in the window.

COUNT: Get the total number of values in the window.

SUM: Get the sum of all values in the window.

AVG: Get the average (mean) of the values in the window (a running-total sketch using SUM and AVG follows this list).
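Combined with an ORDER BY inside the OVER clause, an aggregate function turns into a running statistic. This sketch reuses the hypothetical daily_sales table; with no explicit frame, the window runs from the first row of the partition up to the current row, so SUM gives a running total and AVG a running mean.

-- Hypothetical table: daily_sales(day, amount)
-- With ORDER BY and no explicit frame, the window spans from the start up to the current row.
SELECT
  day,
  amount,
  SUM(amount) OVER (ORDER BY day) AS running_total,
  AVG(amount) OVER (ORDER BY day) AS running_avg
FROM daily_sales;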

Using Window Functions

Window functions can be used with DataFrames or with Spark SQL. We will only discuss how to use them with Spark SQL here.

Syntax

window_function [ nulls_option ] OVER
( [  { PARTITION | DISTRIBUTE } BY partition_col_name = partition_col_val ( [ , ... ] ) ]
  { ORDER | SORT } BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ]
  [ window_frame ] )

Window_function

Any of the window functions listed above.

nulls_option

Specifies whether or not to skip null values when evaluating the window function. RESPECT NULLS means not skipping null values, while IGNORE NULLS means skipping. If not specified, the default is RESPECT NULLS.

available options: { IGNORE | RESPECT } NULLS

Note: Only LAG | LEAD | NTH_VALUE | FIRST_VALUE | LAST_VALUE can be used with IGNORE NULLS.
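As a quick sketch of IGNORE NULLS, the query below carries the last known non-null value forward. The table sensor_readings with columns ts and temperature (which may be null) is hypothetical.

-- Hypothetical table: sensor_readings(ts, temperature), where temperature may be NULL
-- LAST_VALUE ... IGNORE NULLS returns the most recent non-null reading up to the current row.
SELECT
  ts,
  temperature,
  LAST_VALUE(temperature) IGNORE NULLS OVER (
    ORDER BY ts
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS last_known_temperature
FROM sensor_readings;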

Window_frame

Specifies which row to start the window on and where to end it.

available options: { RANGE | ROWS } { frame_start | BETWEEN frame_start AND frame_end }

frame_start and frame_end: UNBOUNDED PRECEDING | offset PRECEDING | CURRENT ROW | offset FOLLOWING | UNBOUNDED FOLLOWING

offset: specifies the offset from the position of the current row.

Note: if frame_end is omitted it defaults to CURRENT ROW.
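Putting the pieces together, a moving average is a typical use of an explicit frame. This sketch, again over the hypothetical daily_sales table, averages the current row and the two preceding rows.

-- Hypothetical table: daily_sales(day, amount)
-- ROWS BETWEEN 2 PRECEDING AND CURRENT ROW defines a 3-row moving window ending at the current row.
SELECT
  day,
  amount,
  AVG(amount) OVER (
    ORDER BY day
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS moving_avg_3_rows
FROM daily_sales;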

Other options

I haven't figured out the difference between PARTITION and DISTRIBUTE yet; I will update this post later.

References:

Window Functions - Spark 3.4.0 Documentation

Spark Window Functions