São Paulo's urban core is paradoxically underpopulated: large, outdated properties inflate land prices and push residents to the periphery. The 2023 Master Plan unlocks higher density near transit axes. This system identifies the lots that benefit from the new rules before the market prices them in.
It replaces the manual work of "perdigueiros" (professionals who comb Google Maps for lots) with a pipeline that analyzes 11M+ IPTU records, zoning shapefiles, and legal documents to generate a georeferenced PDF viability report per lot. The work is split across two Jupyter notebooks: one for data engineering, one for ML and RAG.
Two notebooks run sequentially. The first handles all geospatial data engineering and produces normalized parquets. The second ingests those parquets, applies ML for lot scoring, and invokes GPT-4o with hybrid retrieval to generate the final PDF report per lot.
Shapefiles are ingested in parallel with ThreadPoolExecutor and GeoPandas. Lot geometries are merged with IPTU records on sector + block + lot. Dual zoning is normalized across Lei 18177 (current) and Lei 16402 (legacy), covering ZEU, ZEM, ZC, ZM, ZEIS, ZDE, ZPI, ZPR, and ZER. The notebook applies flood risk overlays at three return periods (5, 25, and 100 years), plus geological risk and watershed restrictions. It builds a cKDTree (scipy.spatial) over five transit modalities (Metro, Train, Bus Stop, Terminal, UDH) and computes the Euclidean distance in meters from each sector-block to the nearest node per modality. Area, obsolescence, owner age, and UDH features are scaled with MinMaxScaler (range 0 to 1). IPTU and UDH data are merged by Lat/Long. Output: normalized parquets ready for ML.
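The nearest-node distance step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `nearest_distance_by_mode`, the toy coordinates, and the assumption that all points are already in a projected metric CRS (so Euclidean distance comes out in meters) are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree


def nearest_distance_by_mode(block_xy, nodes_by_mode):
    """Distance from each sector-block centroid to its nearest node, per modality.

    block_xy: (n, 2) array of projected x/y coordinates in meters (assumed).
    nodes_by_mode: dict mapping a modality name to an (m, 2) array of node coordinates.
    """
    block_xy = np.asarray(block_xy, dtype=float)
    distances = {}
    for mode, node_xy in nodes_by_mode.items():
        # One tree per modality, so every block gets a distance for each mode.
        tree = cKDTree(np.asarray(node_xy, dtype=float))
        dist, _ = tree.query(block_xy, k=1)  # k=1 returns the nearest neighbor only
        distances[mode] = dist
    return distances


# Toy data: two block centroids and two modalities (illustrative values only).
blocks = np.array([[0.0, 0.0], [1000.0, 0.0]])
nodes = {
    "metro": np.array([[300.0, 400.0], [1000.0, 50.0]]),
    "train": np.array([[0.0, 2000.0]]),
}
distances = nearest_distance_by_mode(blocks, nodes)
```

Building one tree per modality keeps each query independent, so the resulting distance columns can be attached to the sector-block table as separate features.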
PCA + KNN in normalized feature space to surface lots similar to high-potential parcels. FAISS vector store indexes unstructured legal documents (escrituras, environmental reports) with OpenAI text-embedding-ada-002. Hybrid retrieval combines dense FAISS (cosine similarity) with sparse BM25 for better recall on Brazilian zoning terminology. GPT-4o receives structured lot data from SQL alongside k=3 FAISS excerpts and outputs a structured report. GeoPandas + Contextily render a situation map (OpenStreetMap basemap) as PNG. ReportLab composes the final A4 PDF: three-bullet summary, lot data, zoning classification, conclusions, regulatory inconsistencies, regularization paths, and the georeferenced site plan.
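One way to picture the hybrid retrieval step is the pure-Python sketch below. The source does not say how the dense and sparse scores are fused, so the min-max normalization and the weighted sum with `alpha` are assumptions, as are the function names, the toy documents, and the two-dimensional stand-in embeddings (a real run would use FAISS and OpenAI embeddings instead).

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    doc_freq = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * c for a, c in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(c * c for c in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs, alpha=0.5, k=3):
    """Return indices of the top-k documents by a weighted dense + sparse score."""
    dense = min_max([cosine(query_vec, v) for v in doc_vecs])
    sparse = min_max(bm25_scores(query_tokens, docs_tokens))
    fused = [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order[:k]


# Toy corpus with made-up 2-d vectors standing in for real embeddings.
docs = [
    ["zona", "eixo", "estruturacao", "urbana", "zeu"],
    ["laudo", "ambiental", "area", "contaminada"],
    ["escritura", "lote", "zeu", "coeficiente", "aproveitamento"],
    ["matricula", "imovel", "rural"],
]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3], [0.2, 0.8]]
top = hybrid_rank(["zeu", "coeficiente"], [1.0, 0.0], docs, doc_vecs, alpha=0.5, k=3)
```

In this toy case the dense side alone would rank the first document highest, but the exact term matches on "zeu" and "coeficiente" lift the third document to the top. That is the kind of recall gain on domain-specific terminology the sparse component is meant to provide.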
Six primary data sources: federal rural registries, municipal urban records, geospatial shapefiles, and unstructured legal documents from Google Drive. The pipeline ingests all of them and normalizes them into a unified lot-level feature space.
Federal rural property registry: owner, area, location.
Cadastro Ambiental Rural: environmental restrictions per parcel.
Certified rural properties with INCRA-validated polygon boundaries.
11M+ urban IPTU records. Zoning, FAR, use, lot geometry.
Zoning laws, geological risk, watershed areas, and flood extent.
Unstructured legal documents indexed into FAISS for RAG.
Every run produces normalized parquets for the full lot universe, and a georeferenced PDF report for each lot flagged by the ML scoring step.
Intermediate datasets produced by the data engineering notebook and consumed by the ML notebook. One file per geospatial layer.
A4 document generated by ReportLab for each flagged lot. Structured sections produced by GPT-4o with hybrid retrieval context.
GeoPandas renders the lot polygon with surrounding context. Contextily adds an OpenStreetMap basemap fetched at render time, and Matplotlib composes the final image. The PNG is embedded directly in the PDF report.
Personal research project.