Three integrated subsystems built to price, recommend, and prospect São Paulo properties. XGBoost regression with 10 macroeconomic time series, KNN recommendation over 100k+ active listings, SARIMAX 60-month price projection, and automated owner contact mining from 11M+ public IPTU records.
Built as a research project: can freely available public data (IPTU, FipeZAP, BCB, IBGE, FGV) produce actionable price estimates at the neighborhood level? The answer is yes. The pricing model achieves a median absolute percentage error (MdAPE) of 14.9% on held-out data, with full macro context attached to every estimate.
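For readers unfamiliar with the metric, MdAPE is the median of the per-property absolute percentage errors, which makes it less sensitive to a few badly mispriced outliers than a mean-based error. A minimal sketch, with made-up prices and a hypothetical `mdape` helper (not the project's evaluation code):

```python
from statistics import median

def mdape(actual, predicted):
    """Median absolute percentage error, returned in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * median(errors)

# Hypothetical held-out prices per m² (BRL) and model estimates.
actual = [10000.0, 8000.0, 12000.0, 5000.0]
predicted = [11000.0, 7000.0, 12600.0, 6500.0]
score = mdape(actual, predicted)
```

In this toy sample the 30% miss on the last property barely moves the score, because the median ignores the size of the worst errors.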
Recommendation, pricing, and lead mining share the same ingestion layer but run independently. All results land in BigQuery or Google Sheets for downstream use.
Predicts price per m² using property-level and macroeconomic features. Property features: type, zoning class, FAR, built area, lot area, rooms, suites, bathrooms, parking, IPTU, condo fee, amenities. Macro features: 10 monthly indicators (Selic, IPCA, IGP-M, INCC, IBC-BR, IGMI-R, IVG-R, IIE-BR, CubSP, FipeZAP). FipeZAP is matched on exact city/type/bedrooms where available, otherwise through a 4-level fallback: bedroom aggregate, then residential type, then state capital, then national average. Spatial joins via cKDTree attach census and zoning data. Model configuration: 3,000 estimators, learning rate 0.01, L1+L2 regularization, GPU acceleration. Held-out MdAPE: 14.9%.
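The FipeZAP fallback can be read as an ordered list of lookup keys, tried from most to least specific. The sketch below is one possible shape for it, not the project's code: the `INDEX` table, the `"ALL"` aggregate marker, the `STATE_CAPITAL` map, and the `fipezap_reference` function are all hypothetical, and the values are invented.

```python
# Hypothetical FipeZAP lookup keyed by (city, property_type, bedrooms).
# "ALL" marks an aggregate over bedrooms; values are BRL per m² (invented).
INDEX = {
    ("São Paulo", "apartment", 2): 11200.0,
    ("São Paulo", "apartment", "ALL"): 10800.0,
    ("São Paulo", "residential", "ALL"): 10500.0,
    ("BRASIL", "residential", "ALL"): 8900.0,
}
# Assumed city -> state capital map, used only for the fourth fallback.
STATE_CAPITAL = {"Campinas": "São Paulo", "São Paulo": "São Paulo"}

def fipezap_reference(city, ptype, bedrooms):
    """Return (level_name, price_m2) from the most specific key available."""
    capital = STATE_CAPITAL.get(city)
    candidates = [
        ("exact", (city, ptype, bedrooms)),
        ("bedroom_aggregate", (city, ptype, "ALL")),
        ("residential_type", (city, "residential", "ALL")),
        ("state_capital", (capital, "residential", "ALL")),
        ("national_average", ("BRASIL", "residential", "ALL")),
    ]
    for level, key in candidates:
        if key in INDEX:
            return level, INDEX[key]
    raise KeyError(f"no FipeZAP reference for {city}/{ptype}/{bedrooms}")

exact = fipezap_reference("São Paulo", "apartment", 2)
via_capital = fipezap_reference("Campinas", "apartment", 1)
national = fipezap_reference("Manaus", "house", 3)
```

Returning the level name alongside the value keeps the match quality visible downstream, so an estimate backed only by the national average can be treated with more caution.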
Projects price per m² 60 months forward using SARIMAX(1,1,1)(1,1,1,12) with 9 macroeconomic exogenous variables (IGP-M, Selic, IPCA, dollar, INCC, IVG-R, IPAM, IIE-BR, CubSP). Series stratified by bedroom count (1q, 2q, 3q, 4q+) to match FipeZAP index structure. Projected index variation applied against a Nov/2023 base to produce per-property price forecasts. Output written to Google Sheets (dbPredicao): 60 rows per property, one per month.
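The last step, applying projected index variation to a base-month price, is simple proportional scaling. The sketch below shows only that step and leaves out the SARIMAX fit. The function name `project_prices`, the row fields, and the 0.4% monthly growth path are illustrative assumptions, not the dbPredicao schema or real forecasts.

```python
def project_prices(base_price_m2, base_index, projected_index):
    """Scale a base-month price by each projected index value relative to the base."""
    if base_index <= 0:
        raise ValueError("base_index must be positive")
    return [
        {"month_ahead": i, "price_m2": round(base_price_m2 * idx / base_index, 2)}
        for i, idx in enumerate(projected_index, start=1)
    ]

# Stand-in for a model forecast: 0.4% compounded monthly growth over 60 months.
base_index = 100.0
projected = [base_index * 1.004 ** m for m in range(1, 61)]

# One property priced at BRL 10,000/m² in the base month -> 60 rows, one per month.
rows = project_prices(10000.0, base_index, projected)
```

Working with index ratios rather than absolute forecasts lets one fitted series per bedroom stratum serve every property in that stratum.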
Matches buyer intent profiles against active VivaReal listings. Parallel BigQuery ingestion (ThreadPoolExecutor) for listings and buyer forms. Hard pre-filters on type, subtype, municipality, and neighborhood before KNN runs. 80+ Portuguese amenity strings normalized to a canonical vocabulary via unidecode and a fixed mapping table (BARBECUE_GRILL, POOL, FITNESS_ROOM, HELIPAD, etc.). Continuous features scaled with StandardScaler. Up to 100k candidates per query. Results written to Warehouse.Recomendacoes with Google Maps links per match.
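A dependency-free sketch of the filter-then-rank flow follows. To stay self-contained it swaps in stdlib `unicodedata` for unidecode and hand-written z-scoring and Euclidean distance for StandardScaler and the KNN search. The Portuguese keys, the `PARTY_HALL` label, the field names, and the sample listings are assumptions for illustration.

```python
import math
import unicodedata
from statistics import mean, pstdev

# Tiny illustrative mapping; source strings and PARTY_HALL are assumed.
AMENITY_MAP = {"churrasqueira": "BARBECUE_GRILL", "piscina": "POOL",
               "academia": "FITNESS_ROOM", "salao de festas": "PARTY_HALL"}
HARD_FIELDS = ("type", "subtype", "municipality", "neighborhood")

def normalize_amenity(raw):
    """Strip accents, case, and padding, then map to a canonical label (None if unknown)."""
    text = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode()
    return AMENITY_MAP.get(text.strip().lower())

def recommend(buyer, listings, features, k=3):
    """Hard-filter listings, z-score the continuous features, return the k nearest."""
    pool = [l for l in listings if all(l[f] == buyer[f] for f in HARD_FIELDS)]
    if not pool:
        return []
    # Mean and population std per feature, as StandardScaler would compute them.
    stats = {f: (mean(l[f] for l in pool), pstdev([l[f] for l in pool]) or 1.0)
             for f in features}

    def distance(listing):
        return math.sqrt(sum(((listing[f] - buyer[f]) / stats[f][1]) ** 2
                             for f in features))

    return sorted(pool, key=distance)[:k]

common = {"type": "apartment", "subtype": "standard", "municipality": "São Paulo"}
buyer = {**common, "neighborhood": "Pinheiros", "area": 70, "price": 800000}
listings = [
    {**common, "id": "A", "neighborhood": "Pinheiros", "area": 72, "price": 820000},
    {**common, "id": "B", "neighborhood": "Pinheiros", "area": 120, "price": 1500000},
    {**common, "id": "C", "neighborhood": "Moema", "area": 70, "price": 800000},
    {**common, "id": "D", "neighborhood": "Pinheiros", "area": 65, "price": 760000},
]
top = [l["id"] for l in recommend(buyer, listings, ["area", "price"], k=2)]
```

Listing C matches the buyer's area and price exactly but never reaches the distance step, which is the point of filtering hard constraints before the similarity search.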
Queries 11M+ GeoSampa IPTU records for underpriced lots. Scrapes Notcertiptu (Prefeitura SP) via Selenium to extract owner name and CPF from public fiscal records, filtering out banks, developers, and corporations to target physical persons only. CPF fuzzy-matched via difflib when partially masked. SeekLoc API called per CPF to retrieve phone numbers (landline and mobile via regex) and email. Output: ranked contact list with lot ID, owner, CPF, phone, email, fiscal value, estimated market value, and pricing gap.
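The landline/mobile split by regex can rely on Brazilian numbering conventions: a two-digit area code (DDD) followed by either a 9-digit mobile number starting with 9 or an 8-digit landline starting with 2 to 5. The pattern and the `classify_phones` helper below are an illustrative guess at that step, not the project's actual expression, and the sample numbers are fabricated.

```python
import re

# DDD, then either 9XXXX (mobile) or [2-5]XXX (landline), then four digits.
# The lookarounds stop the pattern from matching inside longer digit runs.
PHONE_RE = re.compile(
    r"(?<!\d)\(?(\d{2})\)?[\s-]?(9\d{4}|[2-5]\d{3})[\s-]?(\d{4})(?!\d)"
)

def classify_phones(text):
    """Extract phone numbers from free text and bucket them by kind."""
    result = {"mobile": [], "landline": []}
    for ddd, prefix, suffix in PHONE_RE.findall(text):
        kind = "mobile" if len(prefix) == 5 else "landline"
        result[kind].append(f"{ddd}{prefix}{suffix}")
    return result

# Fabricated sample response text.
sample = "Tel: (11) 3333-4444 / Cel: (11) 98888-7777 / Ref: 12345678901"
phones = classify_phones(sample)
```

The trailing 11-digit reference in the sample is skipped because it fits neither shape exactly, which is why the digit lookarounds matter.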
Seven data sources assembled into a single pricing and recommendation layer. Most are public; VivaReal and SeekLoc require scraping and API access respectively.
11M+ IPTU records. Zoning, FAR, land use, fiscal value per lot.
Scraped listings with 80+ amenities per property, normalized from Portuguese free-text.
300+ cities, 50+ Excel sheets. Price/m² by city, type, and bedroom count. Used as the pricing fallback chain.
Monthly series from BCB, IBGE, FGV, and SINDUSCON, all assembled into a single macro feature matrix.
Labor market by district (RAIS/MTE) and sub-municipal HDI from the UN/IPEA Urban Development Units.
Owner contact extraction from name + CPF. Returns phone numbers and emails for outreach.
Each pipeline stage produces a concrete, queryable artifact. Nothing lives only in memory: results land in BigQuery or structured files for downstream use.
One row per buyer-property match. Fields: buyer_id, ranked property features, similarity score, Google Maps link. Queryable by buyer or by neighborhood cluster.
Estimated market value per lot, with the macro snapshot used at time of calculation. Deviation from FipeZAP reference is surfaced as a pricing gap flag.
Structured CSV: lot ID, owner name, CPF, phone, email, fiscal value, estimated market value, gap estimate. Ranked by opportunity size.
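A rough sketch of the ranking and CSV step, assuming the gap is estimated market value minus fiscal value (the text does not spell out the formula). The helpers `rank_by_gap` and `to_csv` and the sample lots are hypothetical; owner and contact columns are omitted for brevity.

```python
import csv
import io

def rank_by_gap(lots):
    """Attach gap_estimate to each lot and sort with the largest gap first."""
    ranked = [
        {**lot, "gap_estimate": lot["estimated_market_value"] - lot["fiscal_value"]}
        for lot in lots
    ]
    return sorted(ranked, key=lambda row: row["gap_estimate"], reverse=True)

def to_csv(rows):
    """Serialize rows to CSV text, taking the header from the first row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Invented lots; values in BRL.
lots = [
    {"lot_id": "LOT-001", "fiscal_value": 400000.0, "estimated_market_value": 650000.0},
    {"lot_id": "LOT-002", "fiscal_value": 300000.0, "estimated_market_value": 900000.0},
    {"lot_id": "LOT-003", "fiscal_value": 500000.0, "estimated_market_value": 520000.0},
]
ranked = rank_by_gap(lots)
csv_text = to_csv(ranked)
```

An absolute gap favors expensive lots; a relative gap (gap divided by fiscal value) would rank differently, so the choice depends on how "opportunity size" is defined.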
Personal research project.