SQL is easy, if you already know Excel

Recently I have some friends asking me what is the best way to learn SQL. They are both experienced Excel users, writing formula like breathing. However, they are not trained to use SQL to manipulate data. My initial thinking was: If you know Excel that well, SQL is just another “dialect” to you. Let me explain why.

Basic Structure of SQL

To understand why I say that, let’s look at the structure of SQL. Below is one of the best visualisations by ByteByteGo about how SQL query is being executed. SQL statements are simple. Here are the steps:

StepsMeaning
1. SELECTSelect the column that you are interested in
2. FROMFrom where the table(s) sits
3. JOIN (Optional) / ONThere are some information that is not in the table that I selected from. I need reference. (i.e. vlookup).
To join two tables together, I need a key that can map two tables together. To specify that, we can use
JOIN another_table ON key
4. WHERE (Optional)Filtering. I only want the data WHERE it fits my requirement
5. GROUP BY (Optional)Build a summary. Similar to build a pivot table with rows only.
6. HAVING (Optional)Filter based on summary. We will skip this first
7. LIMIT (Optional)Select first n rows.
Reference: https://blog.bytebytego.com/p/ep50-visualizing-a-sql-query

Viola! SQL is not that hard, isn’t it?

Case Study

Let’s look at a few examples. Here we will cover

  • Select
  • Joining
  • Group By (Aggregation)

Consider a hypothetical case study, where we have 2 tables. Our task is to understand the demographics of the customers who are converted.

  • Conversion table: The data where each row represents a customer and whether they have converted [to whatever].
  • Customer table: The table that holds the information of each customer.

1. Select something from Table

Let’s say we want to select the customer_id column and the conversion_date column from the Conversion table. Here is how we will write the SELECT statement.

  • We specify which column to be selected, i.e. customer_id, conversion_date
  • We specify which table we are getting the data from.
SELECT customer_id,
       conversion_date
FROM   Conversion

2. Joining

Suppose our task is now to select customer_id, converted, and their household_size, as we want to understand what is the conversion for each household.

As we can see we don’t have all the information in the conversion table. The household_size data is stored in the customer table. We need to join these two table together to obtain the result.

In excel, we will use vlookup to lookup the household_size by using the customer_id. The vlookup formula specify which table the data that we need is being stored, and how can we join two tables together by using a key (customer_id)

In SQL, we can do the same operation by using the LEFT JOIN syntax.

SELECT    conversion.customer_id,
          conversion.converted,
          customer.household_size
FROM      conversion
LEFT JOIN customer 
ON        conversion.customer_id = customer.customer_id
  • In the above code, I have highlighted the conversion table in yellow and the customer table in red
  • The SELECT part specifies what columns do we need from each of the table
  • The LEFT JOIN PART specifies where can we find extra data that we need, and the ON part specifies what is the relationship between these two tables.

While there are different joining method in SQL language, the basic principle is the same. In most of the case, it is easier to use LEFT JOIN as the rationale is same as vlookup in Excel.

3. GROUP BY

The purpose of data analysis is to understand whether there are any patterns in data. The usual way to do this is Aggregation. Here are the common aggregation use case:

  • What is the number of conversion per household group?
  • What is the number of conversion per age?

Let’s take the first one as an example.

In order to understand what is the conversion per household group, we can do a simple count from the table below.

  • For each customer, we look at how big is the household size for each of them, and whether he/she has converted.
  • If he/she has converted, we add 1 to the count.
  • After we count all the customers, then we can have a view on what is the count for each of the household size.

Advanced Excel user may immediately think of Pivot Table. By using pivot table, we can specify the row is the household size, and the values is the sum of converted. In fact, the GROUP BY syntax is doing the exact same thing.

To produce an aggregation table, we can use the below query

SELECT    household_size,
          sum(converted) as converted_count  
FROM      customer_table_with_household_size
GROUP BY  household_size
  • We use the GROUP BY statement to specify the dimension that we are interested in after aggregation. In this case, it is household_size
  • The SUM(converted) means we will do a sum of the value converted at each household_size level.
    • i.e. For household size = 3, there are 2 customers have converted, and 1 has not. As the converted flag is either 1 or 0, if we do a sum, 1 + 1 + 0 = 2 customers have converted.

Summary

In this article, we have walked through the basic structure of SQL. We have also demonstrated how common is SQL syntax to Excel operations. I hope this article can help those who are good at Excel but still find SQL difficult to learn!

1 Comment

  1. Ivan
    November 1, 2024

    Great work!

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *