Data Analysis using SQL – Part 3 – Advanced SQL

Jeswin Augustine October 16, 2017

You now know two types of SQL commands, namely:

Data Definition Language
Data Manipulation Language

The Data Definition Language (DDL) is used to create and modify the schema of the database. Commands like CREATE, ALTER and DROP are part of this language.

As a data analyst, you would always be actively involved in data retrieval activities. Here, the Data Manipulation Language (DML) commands would come in handy, e.g. the DML command SELECT, its purpose, various clauses and filtering operations.

Order by Clause

The SQL ORDER BY clause is used to sort the data in ascending or descending order, based on one or more columns. Some databases sort the query results in an ascending order by default.

The basic syntax of the ORDER BY clause is as follows −

SELECT column-list 
FROM table_name 
[WHERE condition] 
[ORDER BY column1, column2, .. columnN] [ASC | DESC];

The order in which it appears is “select * from table where some_variable = x order by some_variable”.

Example 
SELECT * 
FROM CUSTOMERS
WHERE AGE >32    
ORDER BY NAME, SALARY;

Aggregate Functions

An SQL aggregate function calculates on a set of values and returns a single value. For example, the average function ( AVG) takes a list of values and returns the average.

The following are the commonly used SQL aggregate functions:

AVG() – returns the average of a set.
COUNT() – returns the number of items in a set.
MAX() – returns the maximum value in a set.
MIN() – returns the minimum value in a set
SUM() – returns the sum of all or distinct values in a set

Except for the COUNT() function, SQL aggregate functions ignore null.

You can use aggregate functions as expressions only in the following:

The select list of a SELECT statement, either a subquery or an outer query.
A HAVING clause

Group by Clause

Many times as an analyst, you would have to generate reports related to specific departments. In such scenarios, you would collect information on departments, products, assembly lines, vendors, etc. SQL provides a special clause called ‘Group by’ for collecting facts about certain categories.

The SQL GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups. This GROUP BY clause follows the WHERE clause in a SELECT statement and precedes the ORDER BY clause.

The basic syntax of a GROUP BY clause is shown in the following code block. The GROUP BY clause must follow the conditions in the WHERE clause and must precede the ORDER BY clause if one is used.

SELECT column1, column2
FROM table_name
WHERE [ conditions ]
GROUP BY column1, column2
ORDER BY column1, column2

Example 
SELECT COUNT(CustomerID), Country
FROM Customers
GROUP BY Country;

Having Clause

Suppose your manager asks you to count all the employees whose salary is more than the average salary in that particular department. Now, intuitively, you know that two aggregate functions would be used here — count() and avg(). You decide to apply the where condition on the average salary of the department, but to your surprise, the query fails. In fact, you should try writing this query before moving ahead.

The HAVING Clause enables you to specify conditions that filter which group results appear in the results.

The WHERE clause places conditions on the selected columns, whereas the HAVING clause places conditions on groups created by the GROUP BY clause.

The following code block shows the position of the HAVING Clause in a query.

SELECT column1, column2
FROM table1, table2
WHERE [ conditions ]
GROUP BY column1, column2
HAVING [ conditions ]
ORDER BY column1, column2

Example
SELECT ID, NAME, AGE, ADDRESS, SALARY
FROM CUSTOMERS
GROUP BY age
HAVING COUNT(age) >= 2;

The having clause is typically used when you have to apply a filter condition on an ‘aggregated value’. This is because the ‘where’ clause is applied before the aggregation takes place, and thus it is not useful when you want to apply a filter on an aggregated value.

In other words, the having clause is equivalent to a where clause after the group by has been executed but before the select part is executed.

Nested Queries

You know that a database is a collection of multiple related tables. While generating insights from the data, you may need to refer to these multiple tables in a query. There are two ways to deal with such types of queries:

Joins
Nested queries/Subqueries

A subquery is used to return data that will be used in the main query as a condition to further restrict the data to be retrieved.

Subqueries can be used with the SELECT, INSERT, UPDATE, and DELETE statements along with the operators like =, <, >, >=, <=, IN, BETWEEN, etc.

There are a few rules that subqueries must follow −

Subqueries must be enclosed within parentheses.
A subquery can have only one column in the SELECT clause, unless multiple columns are in the main query for the subquery to compare its selected columns.
An ORDER BY command cannot be used in a subquery, although the main query can use an ORDER BY. The GROUP BY command can be used to perform the same function as the ORDER BY in a subquery.
Subqueries that return more than one row can only be used with multiple value operators such as the IN operator.
The SELECT list cannot include any references to values that evaluate to a BLOB, ARRAY, CLOB, or NCLOB.
A subquery cannot be immediately enclosed in a set function.
The BETWEEN operator cannot be used with a subquery. However, the BETWEEN operator can be used within the subquery.

Subqueries with the SELECT Statement

Subqueries are most frequently used with the SELECT statement. The basic syntax is as follows −

SELECT column_name [, column_name ]
FROM   table1 [, table2 ]
WHERE  column_name OPERATOR
   (SELECT column_name [, column_name ]
   FROM table1 [, table2 ]
   [WHERE])

Inner Join

The most important and frequently used of the joins is the INNER JOIN. They are also referred to as an EQUIJOIN.

The INNER JOIN creates a new result table by combining column values of two tables (table1 and table2) based upon the join-predicate. The query compares each row of table1 with each row of table2 to find all pairs of rows which satisfy the join-predicate. When the join-predicate is satisfied, column values for each matched pair of rows of A and B are combined into a result row.

Syntax

The basic syntax of the INNER JOIN is as follows.

SELECT table1.column1, table2.column2...
FROM table1
INNER JOIN table2
ON table1.common_field = table2.common_field;

Example

SELECT  ID, NAME, AMOUNT, DATE
   FROM CUSTOMERS
   INNER JOIN ORDERS
   ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID;

Multi Join

Multi-joins, you can join multiple tables using the common attributes between pairs of tables. This is possible because the result of a join is also a table which you can join further to another table (with a common attribute).

Example

SELECT customerName, customercity, customermail, salestotal
FROM onlinecustomers AS oc
   INNER JOIN
   orders AS o
   ON oc.customerid = o.customerid
   INNER JOIN
   sales AS s
   ON o.orderId = s.orderId

Outer Join

In an inner join, all the records from the first table are referred to and joined together. However, there may be situations when you need to join the records based on a second table.

There are three kinds of OUTER JOIN: left outer join, right outer join and full outer join. Let’s examine each kind of join in more detail.

SQL OUTER JOIN – left outer join

SQL left outer join is also known as SQL left join. Suppose, we want to join two tables: A and B. SQL left outer join returns all rows in the left table (A) and all the matching rows found in the right table (B). It means the result of the SQL left join always contains the rows in the left table.

The following illustrate SQL left outer syntax of joining 2 tables: table_A and table_B:

1234	SELECT column1, column2… FROM table_ALEFT JOIN table_B ON join_conditionWHERE row_condition

SQL OUTER JOIN – left outer join example

The following query selects all customers and their orders:

123456

SELECT c.customerid, c.companyName, orderidFROM customers cLEFT JOIN orders o ON o.customerid = c.customeridORDER BY orderid

All rows in the customers table are listed. In case, there is no matching row in the orders table found for the row in the customers table, the orderid column in the orders table is populated with NULL values.

We can use Venn diagram to visualize how SQL LEFT OUTER JOIN works.

SQL OUTER JOIN – right outer join

SQL right outer join returns all rows in the right table and all the matching rows found in the left table. The syntax of the SQL right outer join is as follows:

1234	SELECT column1, column2… FROM table_ARIGHT JOIN table_B ON join_conditionWHERE row_condition

SQL right outer join is also known as SQL right join.

SQL OUTER JOIN – right outer join example

The following example demonstrates the SQL right outer join:

123456

SELECT c.customerid, c.companyName, orderidFROM customers cRIGHT JOIN orders o ON o.customerid = c.customeridORDER BY orderid

The query returns all rows in the orders table and all matching rows found in the customers table.

The following Venn diagram illustrates how the SQL right outer join works:

SQL OUTER JOIN – full outer join

The syntax of the SQL full outer join is as follows:

1234	SELECT column1, column2… FROM table_AFULL OUTER JOIN table_B ON join_conditionWHERE row_condition

SQL full outer join returns:

all rows in the left table table_A.
all rows in the right table table_B.
and all matching rows in both tables.

Some database management systems do not support SQL full outer join syntax e.g., MySQL. Because SQL full outer join returns a result set that is a combined result of both SQL left join and SQL right join. Therefore you can easily emulate the SQL full outer join using SQL left join and SQL right join with UNION operator as follows:

1234567

SELECT column1, column2… FROM table_ALEFT JOIN table_B ON join_conditionUNIONSELECT column1, column2… FROM table_ARIGHT JOIN table_B ON join_condition

SQL OUTER JOIN – full outer join example

The following query demonstrates the SQL full outer join:

123456

SELECT c.customerid, c.companyName, orderidFROM customers cFULL OUTER JOIN orders o ON o.customerid = c.customeridORDER BY orderid

SQL OUTER JOIN - full outer join example

The following Venn diagram illustrates how SQL full outer join works:

What is the difference between “INNER JOIN” and “OUTER JOIN”?

Assuming you’re joining on columns with no duplicates, which is a very common case:

An inner join of A and B gives the result of A intersect B, i.e. the inner part of a Venn diagram intersection.
An outer join of A and B gives the results of A union B, i.e. the outer parts of a Venn diagram union.

Examples

Suppose you have two tables, with a single column each, and data as follows:

Note that (1,2) are unique to A, (3,4) are common, and (5,6) are unique to B.

Inner join

An inner join using either of the equivalent queries gives the intersection of the two tables, i.e. the two rows they have in common.

select * from a INNER JOIN b on a.a = b.b;
select a.*, b.*  from a,b where a.a = b.b;

a | b
--+--
3 | 3
4 | 4

Left outer join

A left outer join will give all rows in A, plus any common rows in B.

select * from a LEFT OUTER JOIN b on a.a = b.b;
select a.*, b.*  from a,b where a.a = b.b(+);

a |  b
--+-----
1 | null
2 | null
3 |    3
4 |    4

Right outer join

A right outer join will give all rows in B, plus any common rows in A.

select * from a RIGHT OUTER JOIN b on a.a = b.b;
select a.*, b.*  from a,b where a.a(+) = b.b;

a    |  b
-----+----
3    |  3
4    |  4
null |  5
null |  6

Full outer join

A full outer join will give you the union of A and B, i.e. all the rows in A and all the rows in B. If something in A doesn’t have a corresponding datum in B, then the B portion is null, and vice versa.

select * from a FULL OUTER JOIN b on a.a = b.b;

 a   |  b
-----+-----
   1 | null
   2 | null
   3 |    3
   4 |    4
null |    6
null |    5

Summary

We began this session with a discussion on the ‘Order by’ clause. This clause is used to order the records retrieved through a query. The ‘order by’ clause has a constraint in that it is always the last clause in any query.

The session also covered grouped aggregation and other aggregation functions. You learnt the ‘group by’ and other aggregation functions such as max(), min(), count(), etc. The ‘where’ clause can’t be used on aggregate functions, so the ‘having’ clause comes handy here.

Both “Having” and “Where” clauses are similar in nature and have a functionality of “FILTERING” the data with a slight difference. The WHERE clause filters on individual rows and the HAVING clause filters on the aggregated values, it applies only to groups as a whole.

Further, the session covered the concepts of nested queries and joins, including inner join, outer join, left outer join, right outer join, among others.

Categories: Machine Learning Tags: Tags: Data analysis, machine learning, sql