Urban Space: real estate pricing and lead mining for São Paulo.

Three integrated subsystems built to price, recommend, and prospect São Paulo properties. XGBoost regression with 10 macroeconomic time series, KNN recommendation over 100k+ active listings, SARIMAX 60-month price projection, and automated owner contact mining from 11M+ public IPTU records.

Built as a research project: can freely available public data (IPTU, FipeZAP, BCB, IBGE, FGV) produce actionable price estimates at the neighborhood level? The answer is yes. The pricing model achieves a median absolute percentage error (MdAPE) of 14.9% on held-out data, with full macro context attached to every estimate.
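For reference, MdAPE is the median of |predicted - actual| / |actual| over the held-out set, expressed as a percentage. A minimal sketch of the metric follows; the function name and the sample prices are illustrative assumptions, not project code or project results.

```python
from statistics import median


def mdape(actual, predicted):
    """Median absolute percentage error, returned in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * median(errors)


# Made-up prices per m² (BRL), used only to exercise the function.
actual = [8000.0, 10000.0, 12000.0, 9000.0, 15000.0]
predicted = [7200.0, 10500.0, 13800.0, 9090.0, 12000.0]
print(round(mdape(actual, predicted), 1))  # 10.0
```

Because it uses the median, the metric is less sensitive than a mean-based error to a few badly mispriced outliers.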

Python, XGBoost, SARIMAX (statsmodels), scikit-learn, KNN, StandardScaler, PCA, cKDTree, ThreadPoolExecutor, pandas / BigQuery, yfinance, numpy-financial, Selenium, unidecode

Three subsystems, one data foundation

Recommendation, pricing, and lead mining share the same ingestion layer but run independently. All results land in BigQuery or Google Sheets for downstream use.

1

XGBoost pricing model

Predicts price per m² from property-level and macroeconomic features. Property features: type, zoning class, FAR, built area, lot area, rooms, suites, bathrooms, parking, IPTU, condo fee, amenities. Macro features: 10 monthly indicators (Selic, IPCA, IGP-M, INCC, IBC-BR, IGMI-R, IVG-R, IIE-BR, CubSP, FipeZAP). FipeZAP is matched on exact city/type/bedrooms first, then through a 4-level fallback: bedroom aggregate, residential type, state capital, and national average. Spatial joins via cKDTree attach census and zoning data. Training setup: 3,000 estimators, learning rate 0.01, L1+L2 regularization, GPU acceleration. Held-out MdAPE: 14.9%.
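The FipeZAP fallback described above can be sketched as an ordered list of lookup keys that is tried until one hits. This is a minimal sketch under stated assumptions: the table layout, the "ALL" aggregate marker, the `STATE_CAPITAL` mapping, and every value are hypothetical, and the exact meaning of each level is an interpretation of the text rather than the project's code.

```python
# Hypothetical index table: (city, property_type, bedrooms) -> price per m² (BRL).
# "ALL" marks an aggregate row. All keys and values are illustrative.
FIPEZAP = {
    ("Sao Paulo", "apartment", "2"): 11200.0,
    ("Sao Paulo", "apartment", "ALL"): 10900.0,
    ("Sao Paulo", "residential", "ALL"): 10500.0,
    ("BRAZIL", "residential", "ALL"): 8800.0,
}

# Hypothetical city -> state capital mapping.
STATE_CAPITAL = {"Sao Paulo": "Sao Paulo", "Cotia": "Sao Paulo"}


def fipezap_reference(city, property_type, bedrooms):
    """Return (level_name, price_per_m2) from the first level that matches."""
    capital = STATE_CAPITAL.get(city)
    chain = [
        ("exact", (city, property_type, str(bedrooms))),
        ("bedroom_aggregate", (city, property_type, "ALL")),
        ("residential_type", (city, "residential", "ALL")),
        ("state_capital", (capital, "residential", "ALL")),
        ("national_average", ("BRAZIL", "residential", "ALL")),
    ]
    for level, key in chain:
        if key in FIPEZAP:
            return level, FIPEZAP[key]
    raise KeyError("no FipeZAP reference available")


print(fipezap_reference("Sao Paulo", "apartment", 2))  # ('exact', 11200.0)
print(fipezap_reference("Cotia", "house", 3))          # ('state_capital', 10500.0)
print(fipezap_reference("Manaus", "house", 3))         # ('national_average', 8800.0)
```

Returning the level name alongside the price keeps track of how coarse the reference was for each estimate.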

XGBoost, cKDTree spatial join, FipeZAP fallback chain, 10 macro indicators, GPU hist

2

SARIMAX 60-month price projection

Projects price per m² 60 months forward using SARIMAX(1,1,1)(1,1,1,12) with 9 macroeconomic exogenous variables (IGP-M, Selic, IPCA, dollar, INCC, IVG-R, IPAM, IIE-BR, CubSP). Series are stratified by bedroom count (1, 2, 3, and 4+ bedrooms, labeled 1q, 2q, 3q, 4q+) to match the FipeZAP index structure. The projected index variation is applied against a Nov/2023 base to produce per-property price forecasts. Output is written to Google Sheets (dbPredicao): 60 rows per property, one per month.
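The last step, applying the projected index variation to a Nov/2023 base, is plain arithmetic and can be sketched without the model itself. In the sketch below the projected index path is a stand-in for a SARIMAX forecast, and the function names, row fields, and sample values are assumptions for illustration; the actual dbPredicao columns are not given in the text.

```python
from datetime import date

BASE_MONTH = date(2023, 11, 1)  # Nov/2023 base named in the text


def add_months(start, months):
    """Shift a first-of-month date forward by a number of months."""
    total = start.year * 12 + (start.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def project_prices(property_id, base_price_m2, base_index, projected_index):
    """Turn a projected index path into one price row per future month."""
    rows = []
    for step, index_value in enumerate(projected_index, start=1):
        # Variation of the projected index relative to the base month.
        variation = index_value / base_index - 1.0
        rows.append({
            "property_id": property_id,
            "month": add_months(BASE_MONTH, step).strftime("%Y-%m"),
            "price_m2": round(base_price_m2 * (1.0 + variation), 2),
        })
    return rows


# Stand-in for a 60-step forecast: a flat 0.5% monthly rise from index 100.
projected = [100.0 * 1.005 ** k for k in range(1, 61)]
rows = project_prices("lot-001", 10_000.0, 100.0, projected)
print(len(rows), rows[0]["month"], rows[-1]["month"])  # 60 2023-12 2028-11
```

With one index path per bedroom stratum, each property would use the path of its own stratum.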

SARIMAX, 9 exogenous vars, 60-month horizon, bedroom stratification, Google Sheets

3

KNN recommendation engine

Matches buyer intent profiles against active VivaReal listings. Parallel BigQuery ingestion (ThreadPoolExecutor) for listings and buyer forms. Hard pre-filters on type, subtype, municipality, and neighborhood run before KNN. 80+ Portuguese amenity strings are normalized to a canonical vocabulary via unidecode and a fixed mapping table (BARBECUE_GRILL, POOL, FITNESS_ROOM, HELIPAD, etc.). Continuous features are scaled with StandardScaler. Up to 100k candidates per query. Results are written to Warehouse.Recomendacoes with a Google Maps link per match.
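The filter-then-rank flow described above can be sketched in a few lines. To stay dependency-free, this sketch swaps scikit-learn for the standard library: it applies the same z-score scaling that StandardScaler performs (mean and population standard deviation) and a brute-force Euclidean nearest-neighbour search. Field names, feature choice, and listings are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev

HARD_KEYS = ("type", "subtype", "municipality", "neighborhood")
FEATURES = ("price", "area_m2", "bedrooms")  # illustrative feature set


def recommend(buyer, listings, k=3):
    """Return the ids of the k listings closest to the buyer, nearest first."""
    # 1. Hard pre-filter: categorical fields must match exactly.
    candidates = [
        row for row in listings
        if all(row[key] == buyer[key] for key in HARD_KEYS)
    ]
    if not candidates:
        return []

    # 2. Z-score statistics fitted on the surviving candidates.
    stats = {}
    for name in FEATURES:
        values = [row[name] for row in candidates]
        stats[name] = (mean(values), pstdev(values) or 1.0)

    def scale(row):
        return [(row[name] - stats[name][0]) / stats[name][1] for name in FEATURES]

    # 3. Brute-force nearest neighbours by Euclidean distance.
    target = scale(buyer)
    ranked = sorted(
        candidates,
        key=lambda row: sqrt(sum((a - b) ** 2 for a, b in zip(scale(row), target))),
    )
    return [row["id"] for row in ranked[:k]]


base = {"type": "residential", "subtype": "apartment", "municipality": "Sao Paulo"}
listings = [
    {**base, "id": "A", "neighborhood": "Pinheiros", "price": 790_000, "area_m2": 68, "bedrooms": 2},
    {**base, "id": "B", "neighborhood": "Pinheiros", "price": 1_500_000, "area_m2": 140, "bedrooms": 4},
    {**base, "id": "C", "neighborhood": "Pinheiros", "price": 820_000, "area_m2": 75, "bedrooms": 2},
    {**base, "id": "D", "neighborhood": "Moema", "price": 800_000, "area_m2": 70, "bedrooms": 2},
]
buyer = {**base, "neighborhood": "Pinheiros", "price": 800_000, "area_m2": 70, "bedrooms": 2}
print(recommend(buyer, listings, k=2))  # ['A', 'C']
```

Listing D matches the buyer's numbers exactly but is dropped by the neighborhood pre-filter, which is the point of filtering before distances are computed.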

KNN, ThreadPoolExecutor, StandardScaler, 80+ amenity normalization, BigQuery

4

Owner contact mining

Queries 11M+ GeoSampa IPTU records for underpriced lots. Scrapes Notcertiptu (Prefeitura SP) via Selenium to extract owner name and CPF from public fiscal records, filtering out banks, developers, and corporations to target individuals (pessoas físicas) only. CPFs are fuzzy-matched via difflib when partially masked. The SeekLoc API is called per CPF to retrieve phone numbers (landline and mobile, separated via regex) and email. Output: ranked contact list with lot ID, owner, CPF, phone, email, fiscal value, estimated market value, and pricing gap.
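One way the masked-CPF matching could work with difflib is to score each candidate against the masked string and keep the best one above a threshold. This is a sketch under assumptions: the mask format, the threshold, the function name, and the fictitious CPFs are illustrative and not taken from the project.

```python
from difflib import SequenceMatcher


def best_cpf_match(masked, candidates, min_ratio=0.6):
    """Return the candidate most similar to a masked CPF, or None."""
    scored = [
        (SequenceMatcher(None, masked, candidate).ratio(), candidate)
        for candidate in candidates
    ]
    ratio, cpf = max(scored, default=(0.0, None))
    return cpf if ratio >= min_ratio else None


# Fictitious CPFs, for illustration only.
candidates = ["987.654.321-00", "123.456.789-09", "123.456.780-09"]
print(best_cpf_match("***.456.789-**", candidates))  # 123.456.789-09
print(best_cpf_match("***.111.222-**", candidates))  # None
```

The threshold matters: masked characters never match, so even a correct candidate scores well below 1.0, and a cutoff set too high would reject every match.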

GeoSampa IPTU 11M+, Selenium scraping, CPF fuzzy match, SeekLoc API

Data sources

Seven data sources assembled into a single pricing and recommendation layer. Most are public; VivaReal and SeekLoc require scraping and API access respectively.

GeoSampa / PMSP

11M+ IPTU records. Zoning, FAR, land use, fiscal value per lot.

IPTU, zoning, FAR, land use

VivaReal

Scraped listings with 80+ amenities per property, normalized from Portuguese free-text.
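Normalizing free-text amenities comes down to stripping accents, case, and stray whitespace before a dictionary lookup. The sketch below uses the standard library's unicodedata in place of unidecode so that it runs without extra packages. The Portuguese keys and the GOURMET_SPACE code are assumptions; only BARBECUE_GRILL, POOL, FITNESS_ROOM, and HELIPAD are canonical names that appear elsewhere on this page.

```python
import unicodedata

# Hypothetical slice of the mapping table (the full table has 80+ entries).
CANONICAL = {
    "churrasqueira": "BARBECUE_GRILL",
    "piscina": "POOL",
    "academia": "FITNESS_ROOM",
    "heliponto": "HELIPAD",
    "espaco gourmet": "GOURMET_SPACE",  # assumed code, for illustration
}


def strip_accents(text):
    """Drop combining marks after Unicode decomposition (ç -> c, ã -> a)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_amenities(raw_items):
    """Map raw amenity strings to canonical codes, deduplicated, in order."""
    codes = []
    for item in raw_items:
        key = " ".join(strip_accents(item).lower().split())
        code = CANONICAL.get(key)
        if code and code not in codes:
            codes.append(code)
    return codes


raw = ["Churrasqueira", " PISCINA ", "Espaço  Gourmet", "piscina", "desconhecido"]
print(normalize_amenities(raw))  # ['BARBECUE_GRILL', 'POOL', 'GOURMET_SPACE']
```

Unknown strings are dropped silently here; logging them instead would make gaps in the mapping table visible.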

listings, amenities, price/m²

FipeZAP

300+ cities, 50+ Excel sheets. Price per m² by city, type, and bedroom count. Used as the reference in the pricing fallback chain.

price/m², 300+ cities, Excel

Macro indicators

Monthly series from BCB, IBGE, FGV, and SINDUSCON, all assembled into a single macro feature matrix.

Selic, IPCA, IGP-M, INCC, IBC-BR, IGMI-R, IVG-R, IIE-BR, CubSP

RAIS + UDH

Labor market by district (RAIS/MTE) and sub-municipal HDI from the UN/IPEA Human Development Units (UDH).

RAIS, UDH, sub-municipal HDI

SeekLoc API

Owner contact extraction from name + CPF. Returns phone numbers and emails for outreach.
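Separating landline from mobile numbers with a regex can be sketched as follows. The response format of the SeekLoc API is not shown here, so the sketch works on hypothetical free text. It assumes the usual Brazilian layout: a 2-digit area code followed by either a 9-digit mobile number starting with 9 or an 8-digit landline starting with 2 to 5.

```python
import re

# Area code, then a mobile prefix (9 + 4 digits) or a landline prefix
# (first digit 2-5 + 3 digits), then the final 4 digits. The lookarounds
# keep longer digit runs, such as an unformatted CPF, from matching.
PHONE_RE = re.compile(
    r"(?<!\d)\(?(\d{2})\)?\s*(9\d{4}|[2-5]\d{3})[-\s]?(\d{4})(?!\d)"
)


def classify_phones(text):
    """Return digits-only phone numbers grouped as mobile or landline."""
    found = {"mobile": [], "landline": []}
    for area, prefix, suffix in PHONE_RE.findall(text):
        kind = "mobile" if len(prefix) == 5 else "landline"
        found[kind].append(f"{area}{prefix}{suffix}")
    return found


# Fictitious contact text, for illustration only.
sample = "Tel: (11) 3456-7890 / cel 11 98765-4321, CPF 12345678909"
print(classify_phones(sample))
# {'mobile': ['11987654321'], 'landline': ['1134567890']}
```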

name, CPF, phone, email

Outputs

Each pipeline stage produces a concrete, queryable artifact. Nothing lives only in memory: results land in BigQuery or structured files for downstream use.

Warehouse.Recomendacoes (BigQuery)

One row per buyer-property match. Fields: buyer_id, ranked property features, similarity score, Google Maps link. Queryable by buyer or by neighborhood cluster.

Pricing model output

Estimated market value per lot, with the macro snapshot used at time of calculation. Deviation from FipeZAP reference is surfaced as a pricing gap flag.

Lead contact lists

Structured CSV: lot ID, owner name, CPF, phone, email, fiscal value, estimated market value, gap estimate. Ranked by opportunity size.
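A ranked export of this kind can be sketched with the standard csv module. The sketch assumes the gap is the estimated market value minus the fiscal value and that "opportunity size" means that gap; neither definition is confirmed by the text, and the column names and sample rows are illustrative.

```python
import csv
import io

FIELDS = ["lot_id", "owner", "cpf", "phone", "email",
          "fiscal_value", "market_value", "gap"]


def rank_leads(leads):
    """Attach the assumed gap (market - fiscal) and sort largest gap first."""
    ranked = [
        dict(lead, gap=lead["market_value"] - lead["fiscal_value"])
        for lead in leads
    ]
    ranked.sort(key=lambda row: row["gap"], reverse=True)
    return ranked


def to_csv(rows):
    """Serialize ranked rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Fictitious leads, for illustration only.
leads = [
    {"lot_id": "L-1", "owner": "Owner A", "cpf": "000.000.000-00", "phone": "",
     "email": "", "fiscal_value": 400_000.0, "market_value": 520_000.0},
    {"lot_id": "L-2", "owner": "Owner B", "cpf": "000.000.000-00", "phone": "",
     "email": "", "fiscal_value": 300_000.0, "market_value": 600_000.0},
]
ranked = rank_leads(leads)
print([row["lot_id"] for row in ranked])  # ['L-2', 'L-1']
print(to_csv(ranked).splitlines()[0])
```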

Personal research project.