Recently I have some friends asking me what is the best way to learn SQL. They are both experienced Excel users, writing formula like breathing. However, they are not trained to use SQL to manipulate data. My initial thinking was: If you know Excel that well, SQL is just another “dialect” to you. Let me explain why.
Table of Contents
Basic Structure of SQL
To understand why I say that, let’s look at the structure of SQL. Below is one of the best visualisations by ByteByteGo about how SQL query is being executed. SQL statements are simple. Here are the steps:
Steps | Meaning |
1. SELECT | Select the column that you are interested in |
2. FROM | From where the table(s) sits |
3. JOIN (Optional) / ON | There are some information that is not in the table that I selected from. I need reference. (i.e. vlookup). To join two tables together, I need a key that can map two tables together. To specify that, we can use JOIN another_table ON key |
4. WHERE (Optional) | Filtering. I only want the data WHERE it fits my requirement |
5. GROUP BY (Optional) | Build a summary. Similar to build a pivot table with rows only. |
6. HAVING (Optional) | Filter based on summary. We will skip this first |
7. LIMIT (Optional) | Select first n rows. |
Viola! SQL is not that hard, isn’t it?
Case Study
Let’s look at a few examples. Here we will cover
- Select
- Joining
- Group By (Aggregation)
Consider a hypothetical case study, where we have 2 tables. Our task is to understand the demographics of the customers who are converted.
Conversion
table: The data where each row represents a customer and whether they have converted [to whatever].Customer
table: The table that holds the information of each customer.
1. Select something from Table
Let’s say we want to select the customer_id
column and the conversion_date
column from the Conversion
table. Here is how we will write the SELECT
statement.
- We specify which column to be selected, i.e.
customer_id
,conversion_date
- We specify which table we are getting the data from.
SELECT customer_id,
conversion_date
FROM Conversion
2. Joining
Suppose our task is now to select customer_id
, converted
, and their household_size
, as we want to understand what is the conversion for each household.
As we can see we don’t have all the information in the conversion
table. The household_size
data is stored in the customer
table. We need to join these two table together to obtain the result.
In excel, we will use vlookup
to lookup the household_size
by using the customer_id
. The vlookup
formula specify which table the data that we need is being stored, and how can we join two tables together by using a key (customer_id
)
In SQL, we can do the same operation by using the LEFT JOIN
syntax.
SELECT conversion.customer_id,
conversion.converted,
customer.household_size
FROM conversion
LEFT JOIN customer
ON conversion.customer_id = customer.customer_id
- In the above code, I have highlighted the conversion table in yellow and the customer table in red
- The
SELECT
part specifies what columns do we need from each of the table - The
LEFT JOIN
PART specifies where can we find extra data that we need, and theON
part specifies what is the relationship between these two tables.
While there are different joining method in SQL language, the basic principle is the same. In most of the case, it is easier to use LEFT JOIN
as the rationale is same as vlookup
in Excel.
3. GROUP BY
The purpose of data analysis is to understand whether there are any patterns in data. The usual way to do this is Aggregation. Here are the common aggregation use case:
- What is the number of conversion per household group?
- What is the number of conversion per age?
Let’s take the first one as an example.
In order to understand what is the conversion per household group, we can do a simple count from the table below.
- For each customer, we look at how big is the household size for each of them, and whether he/she has converted.
- If he/she has converted, we add 1 to the count.
- After we count all the customers, then we can have a view on what is the count for each of the household size.
Advanced Excel user may immediately think of Pivot Table. By using pivot table, we can specify the row is the household size,
and the values is the sum of converted
. In fact, the GROUP BY
syntax is doing the exact same thing.
To produce an aggregation table, we can use the below query
SELECT household_size,
sum(converted) as converted_count
FROM customer_table_with_household_size
GROUP BY household_size
- We use the
GROUP BY
statement to specify the dimension that we are interested in after aggregation. In this case, it ishousehold_size
- The
SUM(converted)
means we will do a sum of the valueconverted
at eachhousehold_size
level.- i.e. For household size = 3, there are 2 customers have converted, and 1 has not. As the
converted
flag is either1
or0
, if we do a sum,1 + 1 + 0 = 2
customers have converted.
- i.e. For household size = 3, there are 2 customers have converted, and 1 has not. As the
Summary
In this article, we have walked through the basic structure of SQL. We have also demonstrated how common is SQL syntax to Excel operations. I hope this article can help those who are good at Excel but still find SQL difficult to learn!
November 1, 2024
Great work!