Project 03 · SQL · Data Engineering

Nashville Housing Data —
SQL Data Cleaning & Analysis

An end-to-end SQL pipeline transforming 56,477 raw property records into a clean, analysis-ready dataset

Dataset: Nashville Housing (Davidson County, TN)
Records: 56,477 rows · 19 columns (raw)
Period: January 2013 – December 2019
Tools: MySQL · Python · PyMySQL · Jupyter Notebook
Records cleaned: 56,477
Rows lost in pipeline: 0
Quality issues resolved: 6
Atomic columns (from 2 compound): 5
Integrity preserved: 100%
Why this matters

Raw Data Is Never Analysis-Ready

In real-world data pipelines, data cleaning is often the most critical, and most underestimated, phase. A single null field, inconsistent category label, or compound address column can silently corrupt every downstream analysis, dashboard, or model built on top of it. This project demonstrates professional-grade SQL data engineering by transforming a publicly available but genuinely messy real estate dataset into a clean, normalised, analysis-ready table, following the same four-phase approach used in production data environments.

Phase 1
Assess
  • Schema review
  • Null & blank audit
  • Value distributions
  • Data type verification
Phase 2
Clean
  • Date standardisation
  • Null imputation
  • Flag normalisation
  • PII removal
Phase 3
Enrich
  • Address decomposition
  • SaleMonth engineering
  • Atomic column creation
  • Type optimisation
Phase 4
Validate
  • Row count checks
  • Distinct value audits
  • Range verification
  • Null re-checks
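
A Phase 1 audit of this kind can be captured in a single aggregate pass. The sketch below runs against the raw schema (PropertyAddress and SoldAsVacant are the columns the later findings name); it is illustrative, not the project's exact audit query:

-- Pre-clean audit sketch: blanks and category spread in one pass
-- (SaleDate is still a string at this stage, so MIN/MAX on it
--  would be lexicographic and is deliberately omitted)
SELECT
    COUNT(*)                                              AS total_rows,
    SUM(PropertyAddress = '' OR PropertyAddress IS NULL)  AS blank_PropertyAddress,
    COUNT(DISTINCT SoldAsVacant)                          AS sold_as_vacant_distinct_vals
FROM housing_data;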
Audit findings

6 Data Quality Issues Identified

A structured pre-cleaning audit is non-negotiable. Understanding the nature and scale of each issue determines the right technique, and prevents over-engineering.

Issue 01

Date Format Inconsistency

SaleDate stored as natural-language strings ("April 9, 2013") rather than ISO 8601 dates. Without explicit standardisation, date arithmetic and filtering fail silently.

STR_TO_DATE() + ALTER TABLE MODIFY
Issue 02

29 Missing Property Addresses

29 rows with blank PropertyAddress values due to data entry gaps in the source system. Without imputation, city-level analysis would silently exclude these properties.

Self-join on ParcelID + CREATE VIEW
Issue 03

Compound Address Columns (1NF Violation)

PropertyAddress and OwnerAddress each stored multiple pieces of data in a single field, violating First Normal Form. City-level aggregations and geographic joins were impossible.

SUBSTRING_INDEX() + ALTER TABLE ADD
Issue 04

Inconsistent Categorical Encoding

SoldAsVacant had 4 distinct values: No (51,403), Yes (4,623), N (399), Y (52), the result of multiple data entry operators using different conventions. GROUP BY produces 4 groups instead of 2, and boolean logic breaks.

CASE WHEN + UPDATE
Issue 05

PII Column Present

OwnerName column contained individual owner names, personal identifiers with no analytical value for market-level housing analysis. Retaining them unnecessarily increases privacy risk.

ALTER TABLE DROP COLUMN
Issue 06

No Engineered Time Features

SaleDate exists but no SaleMonth, SaleYear, or SaleQuarter. Monthly analysis required an inline EXTRACT() in every query, hurting performance and readability at scale.

EXTRACT(MONTH FROM date) + UPDATE
The pipeline

Step-by-Step Cleaning

Every step follows the same pattern: audit → transform → validate. No step is assumed to have worked; each is verified with a query before moving on.

1

Date Standardisation

STR_TO_DATE() · ALTER TABLE MODIFY

The raw CSV contained dates in natural language format. An explicit re-cast ensures no edge cases in date arithmetic, filtering, or time-series aggregations.

-- Convert 'April 9, 2013' → ISO 8601 DATE
UPDATE housing_data
SET SaleDate = STR_TO_DATE(SaleDate, '%M %d, %Y');

-- Change column type to DATE
ALTER TABLE housing_data
MODIFY COLUMN SaleDate DATE;

Result: All 56,477 SaleDate values converted to ISO 8601 YYYY-MM-DD format. Date range confirmed: 2013-01-02 → 2019-12-13.

2

Property Address Imputation via Self-Join

Self-join · CREATE VIEW · INNER JOIN UPDATE

The key insight: properties on the same land parcel (same ParcelID) share a physical address. By self-joining the table on ParcelID (excluding same-row matches), we can find a "donor" row with a known address to fill in the blank.

-- Preview: find donor addresses for blank rows
SELECT a.UniqueID, a.ParcelID,
       a.PropertyAddress AS current_address,
       b.PropertyAddress AS donor_address
FROM housing_data a
JOIN housing_data b
    ON  a.ParcelID  = b.ParcelID
    AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress = '';

-- Materialise as a view for reusability
CREATE VIEW updated_address AS
SELECT a.UniqueID, a.ParcelID,
    IF(a.PropertyAddress = '',
       b.PropertyAddress,
       a.PropertyAddress) AS resolved_address
FROM housing_data a
JOIN housing_data b
    ON a.ParcelID = b.ParcelID
    AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress = '';
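
The view above only identifies donor addresses; the write-back itself is the "INNER JOIN UPDATE" named in the step header. A sketch of that statement, joining the base table to the view:

-- Apply the imputation: write donor addresses back to the blank rows
-- (sketch of the INNER JOIN UPDATE; uses MySQL multi-table UPDATE syntax)
UPDATE housing_data h
JOIN updated_address ua
    ON h.UniqueID = ua.UniqueID
SET h.PropertyAddress = ua.resolved_address
WHERE h.PropertyAddress = '';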

Result: All 29 missing PropertyAddress values imputed from sibling records on the same parcel. blank_PropertyStreet = 0, blank_PropertyCity = 0 confirmed in final validation query.

3

Address Decomposition (1NF Normalisation)

SUBSTRING_INDEX() · TRIM() · ALTER TABLE ADD/DROP

Both address columns stored multiple values in one field. Decomposing them into atomic columns enables city-level aggregations, state-level joins, and geographic analysis, none of which were possible on the raw compound fields.

-- Add atomic columns
ALTER TABLE housing_data
  ADD PropertyStreet VARCHAR(255),
  ADD PropertyCity   VARCHAR(100);

-- Parse using SUBSTRING_INDEX on comma delimiter
UPDATE housing_data SET
  PropertyStreet = TRIM(SUBSTRING_INDEX(PropertyAddress, ',', 1)),
  PropertyCity   = TRIM(SUBSTRING_INDEX(PropertyAddress, ',', -1));

-- Drop original compound column
ALTER TABLE housing_data DROP COLUMN PropertyAddress;

-- Same for OwnerAddress → Street + City + State
ALTER TABLE housing_data
  ADD OwnerStreet VARCHAR(255),
  ADD OwnerCity   VARCHAR(100),
  ADD OwnerState  VARCHAR(50);

UPDATE housing_data SET
  OwnerStreet = TRIM(SUBSTRING_INDEX(OwnerAddress, ',', 1)),
  OwnerCity   = TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(OwnerAddress, ',', -2), ',', 1)),
  OwnerState  = TRIM(SUBSTRING_INDEX(OwnerAddress, ',', -1));

-- Drop original compound column
ALTER TABLE housing_data DROP COLUMN OwnerAddress;

Result: 2 compound address columns decomposed into 5 atomic columns: PropertyStreet, PropertyCity, OwnerStreet, OwnerCity, OwnerState. City-level and state-level analysis is now possible without parsing in every downstream query.

4

SoldAsVacant Flag Normalisation

CASE WHEN · UPDATE · GROUP BY validation

The SoldAsVacant column had 4 distinct values from multiple data entry operators using different conventions. GROUP BY on this column was producing 4 groups instead of 2, breaking boolean logic in every downstream filter.

-- Audit before transformation
SELECT SoldAsVacant, COUNT(*) AS count
FROM housing_data
GROUP BY SoldAsVacant;
-- Returns: 'No' (51,403) | 'N' (399) | 'Y' (52) | 'Yes' (4,623)

-- Normalise all variants to Y / N
UPDATE housing_data
SET SoldAsVacant = CASE
    WHEN SoldAsVacant IN ('Yes', 'Y') THEN 'Y'
    ELSE 'N'
END;

Result: 4 inconsistent values collapsed to exactly 2 canonical values (Y / N). Validation confirmed sold_as_vacant_distinct_vals = 2 across all 56,477 rows.

5

PII Removal & Feature Engineering

ALTER TABLE DROP COLUMN · EXTRACT() · TINYINT

Owner names are personal identifiers with no analytical value for market-level housing analysis. Retaining them unnecessarily increases privacy risk. Simultaneously, a pre-computed SaleMonth column eliminates repeated EXTRACT() calls in downstream queries.

-- Remove PII: owner name has no analytical value
ALTER TABLE housing_data DROP COLUMN OwnerName;

-- Engineer SaleMonth as a stored TINYINT column
ALTER TABLE housing_data ADD SaleMonth TINYINT;

UPDATE housing_data
SET SaleMonth = EXTRACT(MONTH FROM SaleDate);

Result: PII eliminated. SaleMonth populated for all 56,477 rows, validated with GROUP BY confirming exactly 12 distinct month values (1–12). null_SaleMonth = 0.

Impact

Before vs After: Pipeline Results

Before cleaning
  • Missing PropertyAddress rows: 29
  • Distinct SoldAsVacant values: 4
  • Compound address columns: 2
  • PII columns: 1 (OwnerName)
  • Engineered time features: 0
  • Date format: Mixed / informal
  • Total rows: 56,477
After cleaning
  • Missing PropertyAddress rows: 0
  • Distinct SoldAsVacant values: 2
  • Atomic address columns created: 5
  • PII columns: 0
  • Engineered features added: 1 (SaleMonth)
  • Date format: ISO 8601 YYYY-MM-DD
  • Total rows (integrity): 56,477 (none lost)
Analytical output

Post-Clean Exploratory Analysis

With a clean, normalised dataset, city-level pricing, seasonal patterns, and property type analysis become reliable. These results were impossible on the raw data.

Query 01, City Price Analysis
Average Sale Price by City (Top 10)
Which cities command the highest average property prices in the Nashville metro area?
Finding: Nashville dominates transaction volume with 40,275 sales at an average of $366,292, significantly higher than the surrounding suburbs. Brentwood ($312K) and Goodlettsville ($290K) are the next most premium markets. This analysis was only possible after PropertyCity was decomposed from the compound address field; it could not have been run on the raw data.
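A query of roughly this shape produces the ranking. SalePrice is assumed here as the name of the sale amount column; this is a sketch, not the exact project query:

-- Avg sale price by city, enabled by the atomic PropertyCity column
SELECT PropertyCity,
       COUNT(*)                 AS sales,
       ROUND(AVG(SalePrice), 0) AS avg_price
FROM housing_data
GROUP BY PropertyCity
ORDER BY avg_price DESC
LIMIT 10;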
Query 02, Seasonal Pattern
Monthly Transaction Volume & Avg Price
Which months drive peak transaction activity, and does price vary seasonally?
Finding: Transaction volume peaks sharply in May–June (spring buying season), consistent with US housing market norms. February is the slowest month. Notably, January shows an unusually high average price ($667K), driven by a small number of high-value commercial transactions, a pattern only detectable after the SaleMonth feature was engineered. This query ran directly off the pre-computed SaleMonth column with no inline EXTRACT() needed.
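The seasonal breakdown groups directly on the stored column. Again a sketch, with SalePrice assumed as the sale amount column:

-- Monthly volume and average price off the pre-computed SaleMonth
SELECT SaleMonth,
       COUNT(*)                 AS transactions,
       ROUND(AVG(SalePrice), 0) AS avg_price
FROM housing_data
GROUP BY SaleMonth
ORDER BY SaleMonth;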
Query 03, Property Type Analysis
Avg Price by Property Type (Top Residential)
What are the price differentials between residential property types in Nashville?
Finding: Single Family homes (21,378 transactions) average $284,265, the most liquid segment of the market. Multi-unit properties (Duplex, Triplex, Quadplex) range $220K–$297K. The Quadplex category, at $297K with 6.1 average bedrooms, represents the highest-density value in the residential market. The normalised SoldAsVacant flag enables clean segmentation of vacant vs occupied transactions, which was unreliable with the original 4-value encoding.
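A sketch of the property type breakdown; LandUse, SalePrice, and Bedrooms are assumed column names from the raw schema, not confirmed by the listing above:

-- Avg price and bedroom count by property type
SELECT LandUse,
       COUNT(*)                 AS transactions,
       ROUND(AVG(SalePrice), 0) AS avg_price,
       ROUND(AVG(Bedrooms), 1)  AS avg_bedrooms
FROM housing_data
GROUP BY LandUse
ORDER BY transactions DESC
LIMIT 10;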
Final Validation Query
Post-Clean Data Quality Dashboard
SELECT
    COUNT(*)                                           AS total_rows,
    SUM(PropertyStreet = '' OR PropertyStreet IS NULL) AS blank_PropertyStreet,
    SUM(PropertyCity   = '' OR PropertyCity   IS NULL) AS blank_PropertyCity,
    COUNT(DISTINCT SoldAsVacant)                        AS sold_as_vacant_distinct_vals,
    SUM(SaleMonth IS NULL)                              AS null_SaleMonth,
    MIN(SaleDate)                                        AS date_min,
    MAX(SaleDate)                                        AS date_max
FROM housing_data;
total_rows | blank_PropertyStreet | blank_PropertyCity | sold_as_vacant_distinct_vals | null_SaleMonth | date_min   | date_max
56,477     | 0                    | 0                  | 2                            | 0              | 2013-01-02 | 2019-12-13

All validation checks pass. 56,477 rows intact, zero blanks, exactly 2 SoldAsVacant values, zero null SaleMonth entries, ISO 8601 date range confirmed. The dataset is analysis-ready.

Skills demonstrated

SQL Skills Applied

DDL, Schema Management

CREATE TABLE · ALTER TABLE ADD · ALTER TABLE DROP · MODIFY COLUMN · CREATE VIEW · DROP TABLE IF EXISTS · DROP VIEW IF EXISTS

DML, Data Transformation

UPDATE ... SET · INNER JOIN UPDATE · CASE WHEN ... END · IF() · Self-join imputation · Multi-column SET

Functions

STR_TO_DATE() · SUBSTRING_INDEX() · TRIM() · EXTRACT(MONTH FROM) · DATE() · YEAR() · ROUND()

Analysis & Validation

GROUP BY audits · COUNT DISTINCT checks · NULL detection · Row integrity checks · AVG / SUM / MIN / MAX · LIMIT + ORDER BY