SQL Aggregate Window Functions | Advanced SQL
Mastering SQL Aggregation and Window Functions for Data Analysis
SQL aggregation is a powerful tool for analyzing data in databases, offering functions like SUM, AVG, and COUNT to distill large datasets into meaningful insights on trends and patterns.
When combined with window functions, the analytical potential expands significantly. The common aggregate window functions are: COUNT(), SUM(), AVG(), MIN(), MAX(), FIRST_VALUE(), and LAST_VALUE()
Window functions operate on a set of rows related to the current row, called a “window,” and allow to perform calculations across these rows while still retaining the individual row-level details. This means we can compute aggregated values like sums, averages, counts, and more, but instead of collapsing the entire dataset into a single result, you can maintain the granularity of your data while applying these aggregate functions selectively.
In this article, we will discuss the basics of SQL aggregation paired with window functions, offering a comprehensive exploration of their combined capabilities. I will leverage the small datasets provided below for the analysis.
Aggregation: Total Sales Amount Calculation
SELECT
SUM(amount) AS total_amount
FROM sales;
Result:
This query calculates the total sales amount by summing up the amount column from the Sales table. It returns a single value representing the overall sales revenue across all transactions in the dataset.
Exploring Aggregate Data with GROUP BY
-- Using Group by
SELECT
salesperson_id
, SUM(amount) AS total_amount
FROM sales
GROUP BY
salesperson_id;
Result:
The GROUP BY clause in SQL is used to group rows with the same values into summary rows. By employing the GROUP BY clause, this query organizes sales orders based on the salesperrson_id, allowing us to calculate the total sales amount for each unique salesperson.
The limitation of this query is that it restricts the output to only two columns — salesperson_id and SUM(amount) — thus excluding additional information from the result set.
Aggregation Combined with Window function
Window functions offer a way to compute aggregates across a set of rows while still retaining individual row details.
Query without Partition
In this query, as we haven’t defined any partitioning columns, we are calculating the total sales amount by summing the amount column for the entire Sales table.
-- Query without Partition
SELECT
salesperson_id,
order_date,
product_id,
amount,
SUM(amount) OVER() AS total_amount
FROM
sales;
Result:
This query presents individual sales order details alongside aggregated information such as the total sales amounts, considering the entire dataset as the window for calculation.
Partition by a Column: Calculating Total Amount Sold by a Salesperson
-- total amount sold by salesperson_id
SELECT
salesperson_id,
order_date,
product_id,
amount,
SUM(amount) OVER (PARTITION BY salesperson_id) AS Total_sales
FROM
sales;
In this query, the SUM function is applied to the amount column, with partitioning based on salesperson ID. This partitioning helps calculate the total sales for each salesperson, giving a detailed breakdown of their individual sales performance. This query computes the total sales attributed to each individual salesperson for every row in the result set.
Calculating Lowest Sales Amount by a Salesperson
-- lowest amount by a salesperson
SELECT
salesperson_id,
order_date,
product_id,
amount,
MIN(amount) OVER (PARTITION BY salesperson_id) AS Total_sales
FROM
sales;
Calculating Percent of Total
The query calculates the percentage of each sale amount against the overall total and the total for each salesperson using window functions.
-- Calculating a percent of total
SELECT
salesperson_id,
order_date,
product_id,
amount,
ROUND(amount * 100.0 / SUM(amount) OVER () ,2) AS pct_of_total,
ROUND(amount * 100.0 / SUM(amount)
OVER (PARTITION BY salesperson_id) ,2) AS pct_of_sp_total
FROM
sales;
Running Total of a Column — Order By
-- Running sum Order by
SELECT
salesperson_id,
order_date,
product_id,
amount,
SUM(amount) OVER (ORDER BY order_date) AS total_amount
FROM
sales;
This query selects salesperson_id, order_date, and product_id from the sales table. Additionally, it calculates the cumulative sum of the amount column ordered by the order_date using the SUM window function. The resulting total_amount column represents the running total of sales amount up to the current row.
Calculating Running Total Within a Window
-- Calculating Running Total Within a Window
SELECT
salesperson_id,
order_date,
product_id,
amount,
SUM(amount)
OVER (
PARTITION BY salesperson_id
ORDER BY order_date) AS total_amount
FROM
sales;
This query computes the cumulative sum of the amount column partitioned by salesperson ID and ordered by order date, generating a running total of sales for each salesperson.
For example, for salesperson_id of 103, the total_amount (cumulative sum of sales amount) column is calculated as follows:
Row 7: 240
Row 8: 240 +480 = 720
Row 9: 240+480 +1950 = 2670
Row 10: 240+480 +1950+720 = 3390
Analyzing Rolling Sales Totals with SQL Window Functions
Rows between 2 preceding and current row
The rows between 2 preceding and current row clause calculates a rolling sum of the amount column, including the current row and the two preceding rows, ordered by the order_date.
-- Rolling sales of Rows between 2 preceding and current row
SELECT
salesperson_id,
order_date,
product_id,
amount,
SUM(amount)
OVER (ORDER BY order_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS total_amount
FROM
sales;
Moving average of daily sales over the last 3 days
-- moving average of sales over the last 3 days
SELECT
salesperson_id,
order_date,
product_id,
amount,
ROUND(AVG(amount)
OVER (ORDER BY order_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),2) AS total_amount
FROM
sales;
Rows between 1 preceding and 1 following
It calculates a rolling sum of the amount column, considering the current row along with its immediate preceding and following rows, ordered by order_date.
-- Rows between 1 preceding and 1 following
SELECT
salesperson_id,
order_date,
product_id,
amount,
SUM(amount) OVER (
ORDER BY order_date
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS total_amount
FROM
sales;
Rows between 2 preceding and 1 preceding
The window frame is specified with the rows between 2 preceding and 1 preceding clause, which includes the two preceding rows but not the current row, ordered by the order_date. This means that for each row, the total_amount column will display the sum of the amount for the two preceding rows.
Rows between Unbounded Preceding and Current Row
The sum should include all rows from the beginning of the data (unbounded preceding) up to the current row.
For each row, the total_amount column will display the sum of all preceding amount values in the dataset (a running total of sales amount), ordered by the order_date.
-- rows between unbounded preceding and current row
SELECT
salesperson_id,
order_date,
product_id,
amount,
SUM(amount)
OVER(ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS total_amount
FROM sales;
GROUP BY: When to Use It
Use Case:
If you need to aggregate data into summary rows based on a specific grouping (like total sales per salesperson), then GROUP BY is the appropriate choice.
Performance:
GROUP BY is efficient for aggregating data into smaller result sets (e.g., one row per group) and uses less memory by collapsing rows into a single result.
Limitations:
GROUP BY reduces the result set to one row per group, so it can’t provide individual row details along with the aggregation.
Total Sales per Salesperson:
SELECT salesperson_id,
SUM(amount) AS total_sales
FROM sales
GROUP BY salesperson_id;
This is the optimal approach when you only need the total sales by salesperson.
Window Functions: When to Use Them
Use Case:
Use window functions for calculations like running totals or moving averages that require row-level details with aggregation.
Performance:
Window functions can be resource-intensive as they process the entire dataset and add columns for each row, but they are ideal for retaining row-level details while applying aggregations across a defined window.
Limitations:
Window functions can be slower for large datasets, especially with partitions or ordering, and they don’t reduce row count, making them less efficient than GROUP BY for high-level summaries like total sales per salesperson.
Running Total of Sales for Each Salesperson:
SELECT salesperson_id,
order_date,
amount,
SUM(amount) OVER (
PARTITION BY salesperson_id
ORDER BY order_date) AS running_total
FROM sales;
This is a classic use case for window functions as they preserve row-level details while calculating a running total.
Tweets’ Rolling Averages [Twitter SQL Interview Question]
This question is sourced from datalemur.com
Given a table of tweet data over a specified time period, calculate the 3-day rolling average of tweets for each user. Output the user ID, tweet date, and rolling averages rounded to 2 decimal places.
CREATE TABLE tweets (
user_id INT,
tweet_date DATETIME,
tweet_count INT
);
INSERT INTO tweets (user_id, tweet_date, tweet_count) VALUES
(111, '2022-06-01 00:00:00', 2),
(111, '2022-06-02 00:00:00', 1),
(111, '2022-06-03 00:00:00', 3),
(111, '2022-06-04 00:00:00', 4),
(111, '2022-06-05 00:00:00', 5),
(111, '2022-06-06 00:00:00', 4),
(111, '2022-06-07 00:00:00', 6),
(199, '2022-06-01 00:00:00', 7),
(199, '2022-06-02 00:00:00', 5),
(199, '2022-06-03 00:00:00', 9),
(199, '2022-06-04 00:00:00', 1),
(199, '2022-06-05 00:00:00', 8),
(199, '2022-06-06 00:00:00', 2),
(199, '2022-06-07 00:00:00', 2),
(254, '2022-06-01 00:00:00', 1),
(254, '2022-06-02 00:00:00', 1),
(254, '2022-06-03 00:00:00', 2),
(254, '2022-06-04 00:00:00', 1),
(254, '2022-06-05 00:00:00', 3),
(254, '2022-06-06 00:00:00', 1),
(254, '2022-06-07 00:00:00', 3);
3-day Rolling Average of Tweet Count for Each User
-- 3 day rolling average
SELECT user_id
, tweet_date
,ROUND(AVG(tweet_count)
OVER(partition by user_id
ORDER BY tweet_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2)
as rolling_avg_3d
FROM tweets;
In summary, this article explores SQL aggregation along with GROUP BY, partitioning, ordering, and window function which enable the extraction of valuable insights from data with both granularity and aggregate perspectives.
Github Link: