Raw Data Is Never Analysis-Ready
In real-world data pipelines, data cleaning is often the most critical, and most underestimated, phase. A single null field, inconsistent category label, or compound address column can silently corrupt every downstream analysis, dashboard, or model built on top of it. This project demonstrates professional-grade SQL data engineering by transforming a publicly available but genuinely messy real estate dataset into a clean, normalised, analysis-ready table, following the same four-phase approach used in production data environments.
Phase 1: Audit
- Schema review
- Null & blank audit
- Value distributions
- Data type verification

Phase 2: Clean
- Date standardisation
- Null imputation
- Flag normalisation
- PII removal

Phase 3: Normalise & Engineer
- Address decomposition
- SaleMonth engineering
- Atomic column creation
- Type optimisation

Phase 4: Validate
- Row count checks
- Distinct value audits
- Range verification
- Null re-checks
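The validation checks in the final phase can be combined into one summary query. A sketch, assuming the cleaned schema used throughout this write-up (SaleDate, SoldAsVacant, SaleMonth, PropertyStreet, PropertyCity); the alias names mirror the validation metrics reported in the step results:

```sql
-- Post-clean validation summary (sketch; MySQL treats booleans as 0/1 in SUM)
SELECT
    COUNT(*)                                           AS row_count,
    COUNT(DISTINCT SoldAsVacant)                       AS sold_as_vacant_distinct_vals,
    MIN(SaleDate)                                      AS earliest_sale,
    MAX(SaleDate)                                      AS latest_sale,
    SUM(SaleMonth IS NULL)                             AS null_SaleMonth,
    SUM(PropertyStreet IS NULL OR PropertyStreet = '') AS blank_PropertyStreet,
    SUM(PropertyCity IS NULL OR PropertyCity = '')     AS blank_PropertyCity
FROM housing_data;
```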
6 Data Quality Issues Identified
A structured pre-cleaning audit is non-negotiable. Understanding the nature and scale of each issue determines the right technique, and prevents over-engineering.
Date Format Inconsistency
SaleDate was stored as natural-language strings such as "April 9, 2013" rather than ISO 8601. Without explicit standardisation, date arithmetic and time-series filters fail with silent type errors.
29 Missing Property Addresses
29 rows had blank PropertyAddress values due to data entry gaps in the source system. Without imputation, city-level analysis would silently exclude these properties.
Fix: Self-join on ParcelID + CREATE VIEW

Compound Address Columns (1NF Violation)
PropertyAddress and OwnerAddress each stored multiple pieces of data in a single field, violating First Normal Form. City-level aggregations and geographic joins were impossible.
Fix: SUBSTRING_INDEX() + ALTER TABLE ADD

Inconsistent Categorical Encoding
SoldAsVacant had 4 distinct values: No (51,403), Yes (4,623), Y (52), N (399), from multiple data entry operators using different conventions. GROUP BY produces 4 groups instead of 2; boolean logic breaks.
Fix: CASE WHEN + UPDATE

PII Column Present
OwnerName column contained individual owner names, personal identifiers with no analytical value for market-level housing analysis. Retaining them unnecessarily increases privacy risk.
Fix: ALTER TABLE DROP COLUMN

No Engineered Time Features
SaleDate exists but no SaleMonth, SaleYear, or SaleQuarter. Every monthly analysis required EXTRACT() inline in every query, hurting performance and readability at scale.
Fix: EXTRACT(MONTH FROM date) + UPDATE

Step-by-Step Cleaning
Every step follows the same pattern: audit → transform → validate. No step is assumed to have worked, each is verified with a query before moving on.
Date Standardisation
Techniques: STR_TO_DATE() · ALTER TABLE MODIFY

The raw CSV contained dates in natural-language format. An explicit re-cast to a true DATE type prevents edge cases in date arithmetic, filtering, and time-series aggregations.
```sql
-- Convert 'April 9, 2013' → ISO 8601 DATE
UPDATE housing_data
SET SaleDate = STR_TO_DATE(SaleDate, '%M %d, %Y');

-- Change column type to DATE
ALTER TABLE housing_data
MODIFY COLUMN SaleDate DATE;
```
Result: All 56,477 SaleDate values converted to ISO 8601 YYYY-MM-DD format. Date range confirmed: 2013-01-02 → 2019-12-13.
Property Address Imputation via Self-Join
Techniques: Self-join · CREATE VIEW · INNER JOIN UPDATE

The key insight: properties on the same land parcel (same ParcelID) share a physical address. By self-joining the table on ParcelID (excluding same-row matches), we can find a "donor" row with a known address to fill in the blank.
```sql
-- Preview: find donor addresses for blank rows
SELECT
    a.UniqueID,
    a.ParcelID,
    a.PropertyAddress AS current_address,
    b.PropertyAddress AS donor_address
FROM housing_data a
JOIN housing_data b
    ON a.ParcelID = b.ParcelID
   AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress = '';

-- Materialise as a view for reusability
CREATE VIEW updated_address AS
SELECT
    a.UniqueID,
    a.ParcelID,
    IF(a.PropertyAddress = '', b.PropertyAddress, a.PropertyAddress) AS resolved_address
FROM housing_data a
JOIN housing_data b
    ON a.ParcelID = b.ParcelID
   AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress = '';
```
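The queries above only preview and materialise the donor addresses; the write itself requires a multi-table UPDATE. A minimal sketch using MySQL's UPDATE ... JOIN syntax (the exact statement is not shown in the source):

```sql
-- Apply the imputation: copy a donor's address into each blank row
UPDATE housing_data a
JOIN housing_data b
    ON a.ParcelID = b.ParcelID
   AND a.UniqueID <> b.UniqueID
SET a.PropertyAddress = b.PropertyAddress
WHERE a.PropertyAddress = ''
  AND b.PropertyAddress <> '';
```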
Result: All 29 missing PropertyAddress values imputed from sibling records on the same parcel. blank_PropertyStreet = 0, blank_PropertyCity = 0 confirmed in final validation query.
Address Decomposition (1NF Normalisation)
Techniques: SUBSTRING_INDEX() · TRIM() · ALTER TABLE ADD/DROP

Both address columns stored multiple values in one field. Decomposing them into atomic columns enables city-level aggregations, state-level joins, and geographic analysis, none of which were possible on the raw compound fields.
```sql
-- Add atomic columns
ALTER TABLE housing_data
    ADD PropertyStreet VARCHAR(255),
    ADD PropertyCity VARCHAR(100);

-- Parse using SUBSTRING_INDEX on the comma delimiter
UPDATE housing_data
SET PropertyStreet = TRIM(SUBSTRING_INDEX(PropertyAddress, ',', 1)),
    PropertyCity   = TRIM(SUBSTRING_INDEX(PropertyAddress, ',', -1));

-- Drop the original compound column
ALTER TABLE housing_data DROP COLUMN PropertyAddress;

-- Same for OwnerAddress → Street + City + State
ALTER TABLE housing_data
    ADD OwnerStreet VARCHAR(255),
    ADD OwnerCity VARCHAR(100),
    ADD OwnerState VARCHAR(50);

UPDATE housing_data
SET OwnerStreet = TRIM(SUBSTRING_INDEX(OwnerAddress, ',', 1)),
    OwnerCity   = TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(OwnerAddress, ',', -2), ',', 1)),
    OwnerState  = TRIM(SUBSTRING_INDEX(OwnerAddress, ',', -1));
```
Result: 2 compound address columns decomposed into 5 atomic columns, PropertyStreet, PropertyCity, OwnerStreet, OwnerCity, OwnerState. City-level and state-level analysis is now possible without parsing in every downstream query.
SoldAsVacant Flag Normalisation
Techniques: CASE WHEN · UPDATE · GROUP BY validation

The SoldAsVacant column had 4 distinct values from multiple data entry operators using different conventions. GROUP BY on this column was producing 4 groups instead of 2, breaking boolean logic in every downstream filter.
```sql
-- Audit before transformation
SELECT SoldAsVacant, COUNT(*) AS count
FROM housing_data
GROUP BY SoldAsVacant;
-- Returns: 'No' (51,403) | 'N' (399) | 'Y' (52) | 'Yes' (4,623)

-- Normalise all variants to Y / N
UPDATE housing_data
SET SoldAsVacant = CASE
    WHEN SoldAsVacant IN ('Yes', 'Y') THEN 'Y'
    ELSE 'N'
END;
```
Result: 4 inconsistent values collapsed to exactly 2 canonical values (Y / N). Validation confirmed: sold_as_vacant_distinct_vals = 2. All 56,477 rows reclassified correctly.
PII Removal & Feature Engineering
Techniques: ALTER TABLE DROP COLUMN · EXTRACT() · TINYINT

Owner names are personal identifiers with no analytical value for market-level housing analysis. Retaining them unnecessarily increases privacy risk. Simultaneously, a pre-computed SaleMonth column eliminates repeated EXTRACT() calls in downstream queries.
```sql
-- Remove PII: owner name has no analytical value
ALTER TABLE housing_data DROP COLUMN OwnerName;

-- Engineer SaleMonth as a stored TINYINT column
ALTER TABLE housing_data ADD SaleMonth TINYINT;

UPDATE housing_data
SET SaleMonth = EXTRACT(MONTH FROM SaleDate);
```
Result: PII eliminated. SaleMonth populated for all 56,477 rows, validated with GROUP BY confirming exactly 12 distinct month values (1–12). null_SaleMonth = 0.
Before vs After: Pipeline Results
Post-Clean Exploratory Analysis
With a clean, normalised dataset, city-level pricing, seasonal patterns, and property type analysis become reliable. These results were impossible on the raw data.
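As an illustration of what the cleaned schema now supports, two sketch queries; they assume the dataset also carries a SalePrice column, which is not shown in the steps above:

```sql
-- City-level pricing (assumes a SalePrice column in the source data)
SELECT
    PropertyCity,
    COUNT(*)                 AS sales,
    ROUND(AVG(SalePrice), 0) AS avg_price
FROM housing_data
GROUP BY PropertyCity
ORDER BY avg_price DESC;

-- Seasonal pattern using the engineered SaleMonth column
SELECT SaleMonth, COUNT(*) AS sales
FROM housing_data
GROUP BY SaleMonth
ORDER BY SaleMonth;
```

Both queries read the atomic PropertyCity and pre-computed SaleMonth columns directly, with no inline parsing or EXTRACT() calls.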