ADVANCED ⏱ 50 minutes

Subqueries: Queries Within Queries

A subquery (also called an inner query or nested query) is a query embedded inside another query. Subqueries let you break complex problems into smaller, logical steps: "Find customers who ordered products that cost more than the average price." Instead of calculating the average separately, you nest that calculation inside your main query. Subqueries can appear in SELECT, WHERE, FROM, and HAVING clauses, and mastering them unlocks elegant solutions to problems that would otherwise require multiple queries or complex joins. They're the bridge between basic SQL and truly advanced data analysis.

📚 What You'll Learn

What subqueries are and when to use them
How to write subqueries in WHERE clause
How to use subqueries in SELECT clause
How to use subqueries in FROM clause (derived tables)
How to use IN, EXISTS, ANY, and ALL operators
The difference between correlated and non-correlated subqueries
How subqueries compare to JOINs
Performance considerations for subqueries
Common subquery patterns and use cases

1. What is a Subquery?

A subquery is a SELECT statement nested inside another SQL statement. The inner query executes first, and its results are used by the outer query.

Basic Structure

SELECT column
FROM table
WHERE column operator (
    SELECT column        ← Subquery (inner query)
    FROM table
    WHERE condition
);
                        ← Main query (outer query)

Simple Example

-- Find products more expensive than the average
SELECT 
    product_id,
    name,
    price
FROM products
WHERE price > (
    SELECT AVG(price)    -- Subquery calculates average
    FROM products
);

-- Process:
-- 1. Subquery runs: SELECT AVG(price) FROM products → returns 49.99
-- 2. Main query runs: WHERE price > 49.99

2. Subqueries in WHERE Clause

Single-Value Subqueries

-- Find customers who placed orders today
SELECT 
    customer_id,
    name,
    email
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE order_date = CURRENT_DATE
);

-- Subquery returns a list of customer IDs
-- Main query filters customers by that list

Comparison Operators

-- Find products in categories with more than 5 products
SELECT 
    name,
    category_id
FROM products
WHERE category_id IN (
    SELECT category_id
    FROM products
    GROUP BY category_id
    HAVING COUNT(*) > 5
);

-- Subquery finds "popular" categories
-- Main query filters products to those categories

3. Using IN with Subqueries

IN operator checks if a value matches any value in a list returned by the subquery. It's the most common way to use subqueries in WHERE.

Basic IN Pattern

-- Find customers who have placed orders
SELECT 
    customer_id,
    name,
    email
FROM customers
WHERE customer_id IN (
    SELECT DISTINCT customer_id
    FROM orders
);

-- Returns customers who appear in the orders table

NOT IN Pattern

-- Find customers who have NEVER placed an order
SELECT 
    customer_id,
    name,
    email
FROM customers
WHERE customer_id NOT IN (
    SELECT customer_id
    FROM orders
    WHERE customer_id IS NOT NULL  -- Important: avoid NULL issues
);

-- Returns customers who don't appear in orders table

⚠️ NULL Warning: NOT IN can behave unexpectedly with NULLs. If the subquery returns any NULL values, NOT IN returns no rows! Always filter out NULLs or use NOT EXISTS instead.

4. EXISTS - Testing for Existence

EXISTS returns TRUE if the subquery returns any rows, FALSE otherwise. It's often more efficient than IN because it stops as soon as it finds a match.

Basic EXISTS

-- Find customers who have placed orders (using EXISTS)
SELECT 
    c.customer_id,
    c.name,
    c.email
FROM customers c
WHERE EXISTS (
    SELECT 1                    -- Can use 1, *, or any value
    FROM orders o
    WHERE o.customer_id = c.customer_id
);

-- EXISTS just checks if rows exist, doesn't return values
-- More efficient than IN for large datasets

NOT EXISTS

-- Find customers who have NEVER placed an order (safer than NOT IN)
SELECT 
    c.customer_id,
    c.name,
    c.email
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
);

-- NOT EXISTS handles NULLs properly, unlike NOT IN

EXISTS vs IN

-- These are equivalent, but EXISTS is often faster:

-- Using IN
SELECT name FROM customers
WHERE customer_id IN (
    SELECT customer_id FROM orders
);

-- Using EXISTS (preferred for large datasets)
SELECT name FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
);

5. Correlated Subqueries

Correlated subqueries reference columns from the outer query. They execute once for each row processed by the outer query. Non-correlated subqueries execute once and return a result used by the outer query.

Non-Correlated (Independent)

-- Non-correlated: Subquery runs ONCE
SELECT name, price
FROM products
WHERE price > (
    SELECT AVG(price) FROM products  -- Runs once, returns one value
);

-- Process:
-- 1. Subquery executes: AVG(price) = 49.99
-- 2. Main query uses that value

Correlated (Dependent)

-- Correlated: Subquery runs for EACH row
SELECT 
    p1.product_id,
    p1.name,
    p1.price,
    p1.category_id
FROM products p1
WHERE p1.price > (
    SELECT AVG(p2.price)
    FROM products p2
    WHERE p2.category_id = p1.category_id  -- References outer query!
);

-- Process:
-- 1. Main query looks at product #1 (category_id = 5)
-- 2. Subquery calculates AVG(price) for category 5
-- 3. Compare p1.price to that average
-- 4. Repeat for each product

-- Finds products more expensive than their category average

Another Correlated Example

-- Find each customer's most recent order
SELECT 
    c.name,
    o.order_id,
    o.order_date,
    o.total
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date = (
    SELECT MAX(order_date)
    FROM orders o2
    WHERE o2.customer_id = c.customer_id  -- Correlated
);

6. Subqueries in SELECT Clause

Subqueries in SELECT must return a single value (scalar subquery). They execute once per row in the result set.

Basic Pattern

-- Show each customer with their order count
SELECT 
    c.customer_id,
    c.name,
    (
        SELECT COUNT(*)
        FROM orders o
        WHERE o.customer_id = c.customer_id
    ) AS order_count
FROM customers c
ORDER BY order_count DESC;

-- Subquery runs once per customer

Multiple Scalar Subqueries

-- Customer summary with multiple calculations
SELECT 
    c.customer_id,
    c.name,
    (
        SELECT COUNT(*)
        FROM orders o
        WHERE o.customer_id = c.customer_id
    ) AS total_orders,
    (
        SELECT SUM(total)
        FROM orders o
        WHERE o.customer_id = c.customer_id
    ) AS total_spent,
    (
        SELECT MAX(order_date)
        FROM orders o
        WHERE o.customer_id = c.customer_id
    ) AS last_order_date
FROM customers c;

⚠️ Performance Note: Multiple subqueries in SELECT can be slow because each executes for every row. Consider using JOINs with GROUP BY instead for better performance.

7. Subqueries in FROM (Derived Tables)

A subquery in FROM creates a temporary result set (derived table) that you can query like a regular table. You must give it an alias.

Basic Derived Table

-- Find categories with average price > $50
SELECT 
    category_name,
    avg_price,
    product_count
FROM (
    SELECT 
        c.name AS category_name,
        AVG(p.price) AS avg_price,
        COUNT(*) AS product_count
    FROM categories c
    INNER JOIN products p ON c.category_id = p.category_id
    GROUP BY c.name
) AS category_stats        -- Must have alias!
WHERE avg_price > 50
ORDER BY avg_price DESC;

-- Process:
-- 1. Subquery creates temporary result set
-- 2. Main query filters and orders that result

Complex Derived Table

-- Find customers whose average order is above the global average
SELECT 
    customer_name,
    avg_order_value,
    order_count
FROM (
    SELECT 
        c.name AS customer_name,
        AVG(o.total) AS avg_order_value,
        COUNT(o.order_id) AS order_count
    FROM customers c
    INNER JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.name
) AS customer_stats
WHERE avg_order_value > (
    SELECT AVG(total) FROM orders
)
ORDER BY avg_order_value DESC;

8. ANY and ALL Operators

ANY (or SOME)

-- Find products more expensive than ANY product in category 5
SELECT 
    product_id,
    name,
    price
FROM products
WHERE price > ANY (
    SELECT price
    FROM products
    WHERE category_id = 5
);

-- Returns TRUE if price is greater than at least one value
-- Equivalent to: price > MIN(subquery)

ALL

-- Find products more expensive than ALL products in category 5
SELECT 
    product_id,
    name,
    price
FROM products
WHERE price > ALL (
    SELECT price
    FROM products
    WHERE category_id = 5
);

-- Returns TRUE if price is greater than every value
-- Equivalent to: price > MAX(subquery)

Equivalents

-- These pairs are equivalent:

price > ANY (subquery)  =  price > (SELECT MIN(...) FROM subquery)
price > ALL (subquery)  =  price > (SELECT MAX(...) FROM subquery)
price < ANY (subquery)  =  price < (SELECT MAX(...) FROM subquery)
price < ALL (subquery)  =  price < (SELECT MIN(...) FROM subquery)

9. Common Subquery Patterns

Pattern 1: Top N per Group

-- Find the most expensive product in each category
SELECT 
    p1.category_id,
    p1.name,
    p1.price
FROM products p1
WHERE p1.price = (
    SELECT MAX(p2.price)
    FROM products p2
    WHERE p2.category_id = p1.category_id
);

Pattern 2: Difference from Average

-- Show how much each product differs from category average
SELECT 
    p1.name,
    p1.price,
    (
        SELECT AVG(p2.price)
        FROM products p2
        WHERE p2.category_id = p1.category_id
    ) AS category_avg_price,
    p1.price - (
        SELECT AVG(p2.price)
        FROM products p2
        WHERE p2.category_id = p1.category_id
    ) AS price_difference
FROM products p1
ORDER BY price_difference DESC;

Pattern 3: Ranking Without Window Functions

-- Rank customers by total spending
SELECT 
    c.name,
    (
        SELECT SUM(o.total)
        FROM orders o
        WHERE o.customer_id = c.customer_id
    ) AS total_spent,
    (
        SELECT COUNT(*) + 1
        FROM customers c2
        WHERE (
            SELECT SUM(o2.total)
            FROM orders o2
            WHERE o2.customer_id = c2.customer_id
        ) > (
            SELECT SUM(o3.total)
            FROM orders o3
            WHERE o3.customer_id = c.customer_id
        )
    ) AS rank
FROM customers c
ORDER BY total_spent DESC;

10. When to Use Subqueries vs JOINs

Scenario	Use Subquery	Use JOIN
Filtering based on aggregate	✅ Good	❌ Complex
Checking existence (IN/EXISTS)	✅ Good	✅ Also good
Need columns from both tables	❌ Awkward	✅ Better
Multiple aggregations per row	⚠️ Can be slow	✅ Usually faster
Readability matters	✅ Sometimes clearer	✅ Sometimes clearer

Same Query: Subquery vs JOIN

-- Using subquery
SELECT name
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE order_date >= '2024-01-01'
);

-- Using JOIN (often faster)
SELECT DISTINCT c.name
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';

11. 🎯 Practice Exercises

Exercise 1: Basic Subqueries

Given this schema:

products (product_id, name, price, category_id, stock)
categories (category_id, name)
orders (order_id, customer_id, order_date, total)
order_items (item_id, order_id, product_id, quantity)

Tasks:

Find products priced above the average price
Find products that have never been ordered (using NOT IN)
Find products that have never been ordered (using NOT EXISTS)

Show Solution

-- Task 1: Products above average price
SELECT 
    product_id,
    name,
    price,
    price - (SELECT AVG(price) FROM products) AS above_avg
FROM products
WHERE price > (
    SELECT AVG(price)
    FROM products
)
ORDER BY price DESC;


-- Task 2: Never ordered (NOT IN)
SELECT 
    product_id,
    name,
    price,
    stock
FROM products
WHERE product_id NOT IN (
    SELECT product_id
    FROM order_items
    WHERE product_id IS NOT NULL
)
ORDER BY price DESC;


-- Task 3: Never ordered (NOT EXISTS - preferred)
SELECT 
    p.product_id,
    p.name,
    p.price,
    p.stock
FROM products p
WHERE NOT EXISTS (
    SELECT 1
    FROM order_items oi
    WHERE oi.product_id = p.product_id
)
ORDER BY p.price DESC;

Key Concepts:

✅ Non-correlated subquery for average (runs once)
✅ NOT IN requires NULL check
✅ NOT EXISTS is safer and often faster
✅ Correlated subquery in EXISTS

Exercise 2: Correlated Subqueries

Given this schema:

employees (employee_id, name, salary, department_id, hire_date)
departments (department_id, name, location)

Tasks:

Find employees earning more than their department's average
Find the highest-paid employee in each department
Show each employee with their department's employee count

Show Solution

-- Task 1: Above department average
SELECT 
    e1.employee_id,
    e1.name,
    e1.salary,
    e1.department_id,
    (
        SELECT AVG(e2.salary)
        FROM employees e2
        WHERE e2.department_id = e1.department_id
    ) AS dept_avg_salary
FROM employees e1
WHERE e1.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.department_id = e1.department_id
)
ORDER BY e1.department_id, e1.salary DESC;


-- Task 2: Highest paid per department
SELECT 
    e1.employee_id,
    e1.name,
    e1.salary,
    d.name AS department_name
FROM employees e1
INNER JOIN departments d ON e1.department_id = d.department_id
WHERE e1.salary = (
    SELECT MAX(e2.salary)
    FROM employees e2
    WHERE e2.department_id = e1.department_id
)
ORDER BY e1.salary DESC;


-- Task 3: Employee count per department
SELECT 
    e.employee_id,
    e.name,
    e.salary,
    d.name AS department_name,
    (
        SELECT COUNT(*)
        FROM employees e2
        WHERE e2.department_id = e.department_id
    ) AS dept_employee_count
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
ORDER BY dept_employee_count DESC, e.name;

Correlated Subquery Techniques:

✅ Subquery references outer table (e1.department_id)
✅ Executes once per outer row
✅ Useful for per-group comparisons
✅ Can be used in SELECT and WHERE

Exercise 3: Derived Tables

Given this schema:

customers (customer_id, name, country, registration_date)
orders (order_id, customer_id, order_date, total, status)

Tasks:

Find countries with average customer spending over $500
Find customers whose total spending is in the top 10%
Create a summary showing customer tier (based on spending)

Show Solution

-- Task 1: Countries with high average spending
SELECT 
    country,
    customer_count,
    avg_spending
FROM (
    SELECT 
        c.country,
        COUNT(DISTINCT c.customer_id) AS customer_count,
        AVG(o.total) AS avg_spending
    FROM customers c
    INNER JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.status = 'completed'
    GROUP BY c.country
) AS country_stats
WHERE avg_spending > 500
ORDER BY avg_spending DESC;


-- Task 2: Top 10% spenders
SELECT 
    customer_id,
    name,
    total_spent
FROM (
    SELECT 
        c.customer_id,
        c.name,
        SUM(o.total) AS total_spent
    FROM customers c
    INNER JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.status = 'completed'
    GROUP BY c.customer_id, c.name
) AS customer_spending
WHERE total_spent >= (
    SELECT PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY total_spent)
    FROM (
        SELECT SUM(total) AS total_spent
        FROM orders
        WHERE status = 'completed'
        GROUP BY customer_id
    ) AS spending_dist
)
ORDER BY total_spent DESC;


-- Task 3: Customer tiers
SELECT 
    customer_id,
    name,
    total_spent,
    CASE 
        WHEN total_spent >= 5000 THEN 'Platinum'
        WHEN total_spent >= 2000 THEN 'Gold'
        WHEN total_spent >= 500 THEN 'Silver'
        ELSE 'Bronze'
    END AS customer_tier
FROM (
    SELECT 
        c.customer_id,
        c.name,
        COALESCE(SUM(o.total), 0) AS total_spent
    FROM customers c
    LEFT JOIN orders o 
        ON c.customer_id = o.customer_id 
        AND o.status = 'completed'
    GROUP BY c.customer_id, c.name
) AS customer_totals
ORDER BY total_spent DESC;

Derived Table Patterns:

✅ Subquery in FROM creates temporary table
✅ Must have alias (country_stats, customer_spending)
✅ Can filter/sort derived results
✅ Useful for multi-step calculations

🔥 Challenge: Complex Analysis with Multiple Subqueries

You have an e-commerce database:

products (product_id, name, price, category_id)
categories (category_id, name)
orders (order_id, customer_id, order_date, status)
order_items (item_id, order_id, product_id, quantity, price)
reviews (review_id, product_id, customer_id, rating, review_date)

Challenge: Create a product performance report showing: products with above-average ratings, sales metrics, comparison to category average, and identify top performers.

Show Solution

-- Comprehensive product performance analysis
SELECT 
    p.product_id,
    p.name AS product_name,
    cat.name AS category_name,
    p.price,
    
    -- Review metrics
    (
        SELECT COUNT(*)
        FROM reviews r
        WHERE r.product_id = p.product_id
    ) AS review_count,
    (
        SELECT AVG(rating)
        FROM reviews r
        WHERE r.product_id = p.product_id
    ) AS avg_rating,
    
    -- Sales metrics
    (
        SELECT COUNT(DISTINCT oi.order_id)
        FROM order_items oi
        INNER JOIN orders o ON oi.order_id = o.order_id
        WHERE oi.product_id = p.product_id
            AND o.status = 'completed'
    ) AS orders_containing_product,
    (
        SELECT SUM(oi.quantity)
        FROM order_items oi
        INNER JOIN orders o ON oi.order_id = o.order_id
        WHERE oi.product_id = p.product_id
            AND o.status = 'completed'
    ) AS total_units_sold,
    
    -- Category comparison
    (
        SELECT AVG(price)
        FROM products p2
        WHERE p2.category_id = p.category_id
    ) AS category_avg_price,
    p.price - (
        SELECT AVG(price)
        FROM products p2
        WHERE p2.category_id = p.category_id
    ) AS price_vs_category_avg,
    
    -- Performance indicator
    CASE 
        WHEN (
            SELECT AVG(rating)
            FROM reviews r
            WHERE r.product_id = p.product_id
        ) >= 4.5 
        AND (
            SELECT SUM(oi.quantity)
            FROM order_items oi
            INNER JOIN orders o ON oi.order_id = o.order_id
            WHERE oi.product_id = p.product_id
                AND o.status = 'completed'
        ) > (
            SELECT AVG(units_sold)
            FROM (
                SELECT SUM(oi2.quantity) AS units_sold
                FROM order_items oi2
                INNER JOIN orders o2 ON oi2.order_id = o2.order_id
                WHERE o2.status = 'completed'
                GROUP BY oi2.product_id
            ) AS product_sales
        ) THEN 'Top Performer'
        WHEN (
            SELECT AVG(rating)
            FROM reviews r
            WHERE r.product_id = p.product_id
        ) >= 4.0 THEN 'Strong'
        ELSE 'Needs Improvement'
    END AS performance_category

FROM products p
INNER JOIN categories cat ON p.category_id = cat.category_id

WHERE EXISTS (
    -- Only products with reviews
    SELECT 1
    FROM reviews r
    WHERE r.product_id = p.product_id
)
AND (
    SELECT AVG(rating)
    FROM reviews r
    WHERE r.product_id = p.product_id
) > (
    SELECT AVG(rating)
    FROM reviews
)

ORDER BY avg_rating DESC, total_units_sold DESC;

Advanced Subquery Techniques:

✅ Multiple correlated subqueries in SELECT
✅ Nested subqueries (subquery within subquery)
✅ EXISTS for filtering
✅ Subqueries in CASE expressions
✅ Derived tables for intermediate calculations
💡 Combines everything learned in this lesson!

Performance Note:

This query has many subqueries and could be slow. In production, consider: - Using JOINs with GROUP BY instead - Creating indexed views for frequently-used aggregations - Caching results if data doesn't change frequently

📝 Key Takeaways

Subqueries are queries nested inside other queries
Subqueries can appear in SELECT, WHERE, FROM, and HAVING
Non-correlated subqueries run once, correlated run per row
Use IN to check if value exists in subquery result
Use EXISTS to test if subquery returns any rows (faster than IN)
NOT EXISTS is safer than NOT IN (handles NULLs properly)
Scalar subqueries in SELECT return single values
Derived tables (subqueries in FROM) must have aliases
ANY returns TRUE if condition is true for any value
ALL returns TRUE if condition is true for all values
Subqueries can be nested multiple levels deep
JOINs are often faster than subqueries for large datasets
Use subqueries for filtering and comparisons
Use JOINs when you need columns from multiple tables
Subqueries enable elegant solutions to complex problems