Raw Data Is Never Analysis-Ready
In real-world data pipelines, data cleaning is often the most critical, and most underestimated, phase. A single null field, inconsistent category label, or compound address column can silently corrupt every downstream analysis, dashboard, or model built on top of it. This project demonstrates professional-grade SQL data engineering by transforming a publicly available but genuinely messy real estate dataset into a clean, normalised, analysis-ready table, following the same four-phase approach used in production data environments.
Phase 1: Audit
- Schema review
- Null & blank audit
- Value distributions
- Data type verification

Phase 2: Clean
- Date standardisation
- Null imputation
- Flag normalisation
- PII removal

Phase 3: Normalise & Engineer
- Address decomposition
- SaleMonth engineering
- Atomic column creation
- Type optimisation

Phase 4: Validate
- Row count checks
- Distinct value audits
- Range verification
- Null re-checks
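The validation checks in the final phase can be combined into one summary query. A sketch, assuming the cleaned schema used throughout this write-up (SaleDate, SoldAsVacant, SaleMonth, PropertyStreet, PropertyCity); the alias names mirror the validation metrics reported in the step results:

```sql
-- Post-clean validation summary (sketch; MySQL treats booleans as 0/1 in SUM)
SELECT
    COUNT(*)                                           AS row_count,
    COUNT(DISTINCT SoldAsVacant)                       AS sold_as_vacant_distinct_vals,
    MIN(SaleDate)                                      AS earliest_sale,
    MAX(SaleDate)                                      AS latest_sale,
    SUM(SaleMonth IS NULL)                             AS null_SaleMonth,
    SUM(PropertyStreet IS NULL OR PropertyStreet = '') AS blank_PropertyStreet,
    SUM(PropertyCity IS NULL OR PropertyCity = '')     AS blank_PropertyCity
FROM housing_data;
```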
6 Data Quality Issues Identified
A structured pre-cleaning audit is non-negotiable. Understanding the nature and scale of each issue determines the right technique, and prevents over-engineering.
Date Format Inconsistency
SaleDate was stored as natural-language strings such as "April 9, 2013" rather than ISO 8601. Without explicit standardisation, date arithmetic and time-series filters fail with silent type errors.
29 Missing Property Addresses
29 rows had blank PropertyAddress values due to data entry gaps in the source system. Without imputation, city-level analysis would silently exclude these properties.
Fix: Self-join on ParcelID + CREATE VIEW

Compound Address Columns (1NF Violation)
PropertyAddress and OwnerAddress each stored multiple pieces of data in a single field, violating First Normal Form. City-level aggregations and geographic joins were impossible.
Fix: SUBSTRING_INDEX() + ALTER TABLE ADD

Inconsistent Categorical Encoding
SoldAsVacant had 4 distinct values: No (51,403), Yes (4,623), Y (52), N (399), from multiple data entry operators using different conventions. GROUP BY produces 4 groups instead of 2; boolean logic breaks.
Fix: CASE WHEN + UPDATE

PII Column Present
OwnerName column contained individual owner names, personal identifiers with no analytical value for market-level housing analysis. Retaining them unnecessarily increases privacy risk.
Fix: ALTER TABLE DROP COLUMN

No Engineered Time Features
SaleDate exists but no SaleMonth, SaleYear, or SaleQuarter. Every monthly analysis required EXTRACT() inline in every query, hurting performance and readability at scale.
Fix: EXTRACT(MONTH FROM date) + UPDATE

Step-by-Step Cleaning
Every step follows the same pattern: audit → transform → validate. No step is assumed to have worked, each is verified with a query before moving on.
Date Standardisation
Techniques: STR_TO_DATE() · ALTER TABLE MODIFY

The raw CSV contained dates in natural-language format. An explicit re-cast to a true DATE type prevents edge cases in date arithmetic, filtering, and time-series aggregations.
```sql
-- Convert 'April 9, 2013' → ISO 8601 DATE
UPDATE housing_data
SET SaleDate = STR_TO_DATE(SaleDate, '%M %d, %Y');

-- Change column type to DATE
ALTER TABLE housing_data
MODIFY COLUMN SaleDate DATE;
```
Result: All 56,477 SaleDate values converted to ISO 8601 YYYY-MM-DD format. Date range confirmed: 2013-01-02 → 2019-12-13.
Property Address Imputation via Self-Join
Techniques: Self-join · CREATE VIEW · INNER JOIN UPDATE

The key insight: properties on the same land parcel (same ParcelID) share a physical address. By self-joining the table on ParcelID (excluding same-row matches), we can find a "donor" row with a known address to fill in the blank.
```sql
-- Preview: find donor addresses for blank rows
SELECT
    a.UniqueID,
    a.ParcelID,
    a.PropertyAddress AS current_address,
    b.PropertyAddress AS donor_address
FROM housing_data a
JOIN housing_data b
    ON a.ParcelID = b.ParcelID
   AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress = '';

-- Materialise as a view for reusability
CREATE VIEW updated_address AS
SELECT
    a.UniqueID,
    a.ParcelID,
    IF(a.PropertyAddress = '', b.PropertyAddress, a.PropertyAddress) AS resolved_address
FROM housing_data a
JOIN housing_data b
    ON a.ParcelID = b.ParcelID
   AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress = '';
```
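The queries above only preview and materialise the donor addresses; the write itself requires a multi-table UPDATE. A minimal sketch using MySQL's UPDATE ... JOIN syntax (the exact statement is not shown in the source):

```sql
-- Apply the imputation: copy a donor's address into each blank row
UPDATE housing_data a
JOIN housing_data b
    ON a.ParcelID = b.ParcelID
   AND a.UniqueID <> b.UniqueID
SET a.PropertyAddress = b.PropertyAddress
WHERE a.PropertyAddress = ''
  AND b.PropertyAddress <> '';
```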
Result: All 29 missing PropertyAddress values imputed from sibling records on the same parcel. blank_PropertyStreet = 0, blank_PropertyCity = 0 confirmed in final validation query.
Address Decomposition (1NF Normalisation)
Techniques: SUBSTRING_INDEX() · TRIM() · ALTER TABLE ADD/DROP

Both address columns stored multiple values in one field. Decomposing them into atomic columns enables city-level aggregations, state-level joins, and geographic analysis, none of which were possible on the raw compound fields.
```sql
-- Add atomic columns
ALTER TABLE housing_data
    ADD PropertyStreet VARCHAR(255),
    ADD PropertyCity VARCHAR(100);

-- Parse using SUBSTRING_INDEX on the comma delimiter
UPDATE housing_data
SET PropertyStreet = TRIM(SUBSTRING_INDEX(PropertyAddress, ',', 1)),
    PropertyCity   = TRIM(SUBSTRING_INDEX(PropertyAddress, ',', -1));

-- Drop the original compound column
ALTER TABLE housing_data DROP COLUMN PropertyAddress;

-- Same for OwnerAddress → Street + City + State
ALTER TABLE housing_data
    ADD OwnerStreet VARCHAR(255),
    ADD OwnerCity VARCHAR(100),
    ADD OwnerState VARCHAR(50);

UPDATE housing_data
SET OwnerStreet = TRIM(SUBSTRING_INDEX(OwnerAddress, ',', 1)),
    OwnerCity   = TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(OwnerAddress, ',', -2), ',', 1)),
    OwnerState  = TRIM(SUBSTRING_INDEX(OwnerAddress, ',', -1));
```
Result: 2 compound address columns decomposed into 5 atomic columns, PropertyStreet, PropertyCity, OwnerStreet, OwnerCity, OwnerState. City-level and state-level analysis is now possible without parsing in every downstream query.
SoldAsVacant Flag Normalisation
Techniques: CASE WHEN · UPDATE · GROUP BY validation

The SoldAsVacant column had 4 distinct values from multiple data entry operators using different conventions. GROUP BY on this column was producing 4 groups instead of 2, breaking boolean logic in every downstream filter.
```sql
-- Audit before transformation
SELECT SoldAsVacant, COUNT(*) AS count
FROM housing_data
GROUP BY SoldAsVacant;
-- Returns: 'No' (51,403) | 'N' (399) | 'Y' (52) | 'Yes' (4,623)

-- Normalise all variants to Y / N
UPDATE housing_data
SET SoldAsVacant = CASE
    WHEN SoldAsVacant IN ('Yes', 'Y') THEN 'Y'
    ELSE 'N'
END;
```
Result: 4 inconsistent values collapsed to exactly 2 canonical values (Y / N). Validation confirmed: sold_as_vacant_distinct_vals = 2. All 56,477 rows reclassified correctly.
PII Removal & Feature Engineering
Techniques: ALTER TABLE DROP COLUMN · EXTRACT() · TINYINT

Owner names are personal identifiers with no analytical value for market-level housing analysis. Retaining them unnecessarily increases privacy risk. Simultaneously, a pre-computed SaleMonth column eliminates repeated EXTRACT() calls in downstream queries.
```sql
-- Remove PII: owner name has no analytical value
ALTER TABLE housing_data DROP COLUMN OwnerName;

-- Engineer SaleMonth as a stored TINYINT column
ALTER TABLE housing_data ADD SaleMonth TINYINT;

UPDATE housing_data
SET SaleMonth = EXTRACT(MONTH FROM SaleDate);
```
Result: PII eliminated. SaleMonth populated for all 56,477 rows, validated with GROUP BY confirming exactly 12 distinct month values (1–12). null_SaleMonth = 0.
Before vs After: Pipeline Results
Post-Clean Exploratory Analysis
With a clean, normalised dataset, city-level pricing, seasonal patterns, and property type analysis become reliable. These results were impossible on the raw data.
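As an illustration of what the cleaned schema now supports, two sketch queries; they assume the dataset also carries a SalePrice column, which is not shown in the steps above:

```sql
-- City-level pricing (assumes a SalePrice column in the source data)
SELECT
    PropertyCity,
    COUNT(*)                 AS sales,
    ROUND(AVG(SalePrice), 0) AS avg_price
FROM housing_data
GROUP BY PropertyCity
ORDER BY avg_price DESC;

-- Seasonal pattern using the engineered SaleMonth column
SELECT SaleMonth, COUNT(*) AS sales
FROM housing_data
GROUP BY SaleMonth
ORDER BY SaleMonth;
```

Both queries read the atomic PropertyCity and pre-computed SaleMonth columns directly, with no inline parsing or EXTRACT() calls.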