Database Analysis

Advanced SQL Querying: Baseball Database Analysis

This project demonstrates advanced SQL techniques through comprehensive analysis of a baseball database using PostgreSQL. The analysis covers four key domains: school player production, team salary dynamics, player career trajectories, and comparative player statistics.

The complete SQL code for this project is available at GitHub Repository

Analysis Overview

The project analyzes a multi-table baseball database containing schools, players, salaries, and team information. The analysis demonstrates:

  • Complex JOIN operations across multiple tables
  • Window functions for ranking and cumulative calculations
  • Common Table Expressions (CTEs) for query organization
  • Date manipulation and temporal analysis
  • Aggregation and grouping strategies

Part I: School Analysis

The school analysis examines which educational institutions produced the most professional baseball players across different decades.

Key Queries

Decade-by-Decade School Count

Calculates the number of schools producing players in each decade using integer division for decade grouping:


SELECT (yearid/10) * 10 AS decade,
       COUNT(DISTINCT(schoolid)) AS num_schools
FROM schools
GROUP BY decade
ORDER BY decade;
								

Top Schools by Player Production

Identifies the five schools that produced the most players using LEFT JOIN to preserve school detail information:


SELECT sd.name_full,
       COUNT(DISTINCT(s.playerid)) AS num_players
FROM schools s LEFT JOIN school_details sd
ON s.schoolid = sd.schoolid
GROUP BY sd.name_full
ORDER BY num_players DESC
LIMIT 5;
								

Top 3 Schools Per Decade

Uses CTEs and window functions to rank schools within each decade:


WITH school_counts AS (
    SELECT 
        (s.yearid/10) * 10 AS decade,
        sd.name_full AS school_name,
        COUNT(distinct(s.playerid)) AS num_players
    FROM schools s LEFT JOIN school_details sd 
    ON s.schoolid = sd.schoolid
    GROUP BY decade, sd.name_full
),
ranked AS (
    SELECT 
        decade,
        school_name,
        num_players,
        ROW_NUMBER() OVER (PARTITION BY decade ORDER BY num_players DESC, school_name) AS rn
    FROM school_counts
)
SELECT decade, school_name, num_players
FROM ranked
WHERE rn <= 3
ORDER BY decade DESC, rn;
								

Part II: Salary Analysis

The salary analysis examines team spending patterns over time, including identifying high-spending teams and tracking cumulative expenditures.

Key Queries

Top 20% Spending Teams

Uses NTILE window function to segment teams into quintiles based on average annual spending:


WITH ts AS (
    SELECT yearid, teamid, SUM(salary) AS total_spend
    FROM salaries
    GROUP BY yearid, teamid
),
pct AS (
    SELECT teamid,
           AVG(total_spend) AS avg_spend,
           NTILE(5) OVER(ORDER BY AVG(total_spend) DESC) AS spend_pct
    FROM ts
    GROUP BY teamid
)
SELECT teamid,
       ROUND(avg_spend / 1000000, 1) AS avg_spend_millions
FROM pct
WHERE spend_pct = 1;
								

Cumulative Spending Over Time

Calculates running totals of team spending using window functions with PARTITION BY:


WITH ts AS (
    SELECT yearid, teamid, SUM(salary) AS total_spend
    FROM salaries
    GROUP BY yearid, teamid
)
SELECT yearid, teamid,
       ROUND(SUM(total_spend / 1000000.0) 
             OVER(ORDER BY teamid, yearid), 1) AS cumulative_sum_millions
FROM ts;
								

First Year Exceeding $1 Billion

Identifies when each team's cumulative spending first surpassed $1 billion using multiple CTEs and ROW_NUMBER:


WITH ts AS (
    SELECT yearid, teamid, SUM(salary) AS total_spend
    FROM salaries
    GROUP BY yearid, teamid
),
csb AS (
    SELECT yearid, teamid,
           ROUND(SUM(total_spend / 1000000000.0) 
                 OVER(PARTITION BY teamid ORDER BY yearid), 2) AS cumulative_sum_billions
    FROM ts
),
srnk AS (
    SELECT teamid, yearid, cumulative_sum_billions,
           ROW_NUMBER() OVER(PARTITION BY teamid 
                             ORDER BY cumulative_sum_billions) AS rnk
    FROM csb
    WHERE cumulative_sum_billions >= 1
)
SELECT teamId, yearid, cumulative_sum_billions
FROM srnk
WHERE rnk = 1
ORDER BY teamid, yearid DESC;
								

Part III: Player Career Analysis

The career analysis examines player longevity, debut and retirement ages, and career trajectories.

Key Queries

Career Length Calculation

Calculates player ages at debut and retirement using PostgreSQL date functions:


WITH fullbd AS (
    SELECT playerid,
           MAKE_DATE(birthyear, birthmonth, birthday) AS bday,
           namegiven, debut, finalgame
    FROM players
)
SELECT namegiven,
       EXTRACT(YEAR FROM AGE(debut, bday))::INT AS starting_age,
       EXTRACT(YEAR FROM AGE(finalgame, bday))::INT AS ending_age,
       EXTRACT(YEAR FROM AGE(finalgame, debut))::INT AS career_length
FROM fullbd
WHERE debut IS NOT NULL
ORDER BY career_length DESC;
								

Same-Team Career Players

Identifies players who started and ended with the same team after 10+ year careers using self-joins on the salaries table:


SELECT  
    p.nameGiven,
    s.yearID AS starting_year, 
    s.teamID AS starting_team,
    e.yearID AS ending_year, 
    e.teamID AS ending_team
FROM players p
INNER JOIN salaries s
    ON p.playerID = s.playerID
   AND EXTRACT(YEAR FROM p.debut) = s.yearID
INNER JOIN salaries e
    ON p.playerID = e.playerID
   AND EXTRACT(YEAR FROM p.finalGame) = e.yearID
WHERE s.teamID = e.teamID 
  AND e.yearID - s.yearID > 10;
								

Part IV: Player Comparison Analysis

The comparison analysis examines player attributes including birthdays, batting preferences, and physical characteristics over time.

Key Queries

Shared Birthdays

Identifies players born on the same date using STRING_AGG for aggregation:


WITH bd AS (
    SELECT CAST(CONCAT(birthyear, '-', birthmonth, '-', birthday) AS DATE) AS birthdate,
           namegiven
    FROM players
    WHERE birthyear IS NOT NULL
      AND birthmonth IS NOT NULL
      AND birthday IS NOT NULL
)
SELECT birthdate,
       STRING_AGG(namegiven, ',') AS players
FROM bd
WHERE EXTRACT(YEAR FROM birthdate) BETWEEN 1980 AND 1990
GROUP BY birthdate
ORDER BY birthdate;
								

Batting Preference Distribution

Calculates percentage of right, left, and switch batters per team using CASE statements:


SELECT 
    s.teamid,
    ROUND((SUM(CASE WHEN p.bats = 'R' THEN 1 ELSE 0 END) * 100.0) / COUNT(*), 1) AS pct_r,
    ROUND((SUM(CASE WHEN p.bats = 'L' THEN 1 ELSE 0 END) * 100.0) / COUNT(*), 1) AS pct_l,
    ROUND((SUM(CASE WHEN p.bats = 'B' THEN 1 ELSE 0 END) * 100.0) / COUNT(*), 1) AS pct_b
FROM players p
JOIN salaries s ON p.playerid = s.playerid
GROUP BY s.teamid
ORDER BY s.teamid;
								

Physical Characteristics Over Time

Tracks decade-over-decade changes in average player height and weight using LAG window function:


WITH awh AS (
    SELECT (EXTRACT(YEAR FROM debut)::int/10) * 10 AS decade,
           AVG(weight) AS avg_weight,
           AVG(height) AS avg_height
    FROM players
    WHERE debut IS NOT NULL 
      AND weight IS NOT NULL 
      AND height IS NOT NULL
    GROUP BY (EXTRACT(YEAR FROM debut)::int/10) * 10
)
SELECT decade,
       ROUND(avg_height - LAG(avg_height) OVER(ORDER BY decade), 4) AS avg_height_diff,
       ROUND(avg_weight - LAG(avg_weight) OVER(ORDER BY decade), 4) AS avg_weight_diff
FROM awh;
								

Technical Implementation

The project demonstrates several advanced PostgreSQL features:

  1. Window Functions
  2. Uses ROW_NUMBER, NTILE, and LAG for ranking and comparative analysis across partitioned data sets.

  3. Common Table Expressions (CTEs)
  4. Employs WITH clauses to organize complex queries into logical, readable steps, particularly for multi-stage aggregations.

  5. Date Manipulation
  6. Leverages PostgreSQL's MAKE_DATE, EXTRACT, and AGE functions for temporal calculations and decade grouping.

  7. String Aggregation
  8. Uses STRING_AGG to concatenate multiple values into single result strings for grouped data.

  9. Conditional Aggregation
  10. Implements CASE statements within SUM and COUNT functions for category-specific calculations.

  11. Self-Joins
  12. Performs multiple joins on the same table with different temporal conditions to compare player career start and end points.

Database Schema

The analysis works with the following tables:

  • schools: Player-school associations with years
  • school_details: School names and information
  • salaries: Player salaries by year and team
  • players: Player biographical and career information

Key Techniques

The queries demonstrate:

  • Integer division for decade calculation: (yearid/10) * 10
  • Multi-level CTEs for step-by-step query construction
  • Partitioned window functions for within-group rankings
  • Type casting for date construction and calculations
  • Running totals with cumulative window aggregations
  • Filtering ranked results to extract top N per category

Development Environment

  • PostgreSQL
  • SQL