Solo Inteligente: automated land prospecting via an AI pipeline.

São Paulo's urban core is paradoxically underpopulated: large, outdated properties inflate land prices and push residents to the periphery. The 2023 Master Plan unlocks higher density near transit axes. This system identifies the affected lots before the market prices in the new rules.

Replaces the manual work of "perdigueiros" (professionals who comb Google Maps for lots) with a pipeline that analyzes 11M+ IPTU records, zoning shapefiles, and legal documents to generate a georeferenced PDF viability report per lot. Two Jupyter notebooks: one for data engineering, one for ML and RAG.

Python, GeoPandas, pandas / numpy, scipy (cKDTree), scikit-learn (PCA / KNN / MinMaxScaler), FAISS, OpenAI ada-002 + GPT-4o, BM25, ReportLab, Contextily, Matplotlib, Google Drive API, Google Sheets API

Pipeline

Two notebooks run sequentially. The first handles all geospatial data engineering and produces normalized parquets. The second ingests those parquets, applies ML for lot scoring, and invokes GPT-4o with hybrid retrieval to generate the final PDF report per lot.

01
organiza_terrenos.ipynb: Data Engineering

Geospatial normalization and feature engineering

Ingests shapefiles in parallel with ThreadPoolExecutor and GeoPandas, then merges lot geometries with IPTU records on the sector + block + lot key. Normalizes dual zoning, Lei 18177 (current) vs. Lei 16402 (legacy), covering ZEU, ZEM, ZC, ZM, ZEIS, ZDE, ZPI, ZPR, and ZER. Applies flood-risk overlays at three return periods (5, 25, and 100 years), plus geological-risk and watershed restrictions. Builds a cKDTree (scipy.spatial) over five transit modalities (Metro, Train, Bus Stop, Terminal, UDH) and computes the Euclidean distance in meters from each sector-block to the nearest node per modality. Scales area, obsolescence, owner age, and UDH features to the 0 to 1 range with MinMaxScaler. Merges IPTU + UDH by Lat/Long. Output: normalized parquets ready for ML.
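The parallel ingestion step can be pictured with a minimal sketch like the one below. It is not the notebook's code: `load_layers` and its `reader` parameter are hypothetical names, and the reader is passed in so the sketch runs without any shapefiles on disk. In the real pipeline the reader would presumably be `geopandas.read_file`.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def load_layers(
    paths: Iterable[str],
    reader: Callable[[str], T],
    max_workers: int = 4,
) -> Dict[str, T]:
    """Read every path with `reader` in a thread pool.

    Returns a dict keyed by file stem (e.g. "lei18" for "data/lei18.shp").
    `load_layers` and `reader` are illustrative names, not the project's API.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip() pairs
        # each path with the layer that was read from it.
        results = list(pool.map(reader, paths))
    return {Path(p).stem: layer for p, layer in zip(paths, results)}


# Stand-in reader so the example is self-contained.
demo = load_layers(["data/lei16.shp", "data/lei18.shp"], lambda p: f"loaded:{p}")
```

Threads suit this step because reading files is mostly I/O-bound; passing `geopandas.read_file` as `reader` would return one GeoDataFrame per layer.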

GeoPandas, ThreadPoolExecutor, cKDTree O(log N), MinMaxScaler, GeoSampa (PMSP), Lei 16402 / 18177
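As an illustration of the nearest-node distance step, here is a minimal sketch using `scipy.spatial.cKDTree`. It assumes one tree per modality and coordinates already projected to a metric CRS (for example UTM), since Euclidean distances on raw latitude/longitude would not be in meters. The function name, the sample coordinates, and the modality keys are made up for the example.

```python
import numpy as np
from scipy.spatial import cKDTree


def nearest_node_distances(block_xy, nodes_by_modality):
    """Distance (in CRS units) from each block to its nearest node, per modality.

    block_xy: (n, 2) array of projected x/y coordinates.
    nodes_by_modality: dict of modality name -> (m, 2) array of node coordinates.
    Assumption: one cKDTree is built per modality.
    """
    out = {}
    for name, node_xy in nodes_by_modality.items():
        tree = cKDTree(node_xy)
        dist, _ = tree.query(block_xy, k=1)  # nearest neighbor only
        out[name] = dist
    return out


# Toy coordinates in meters (hypothetical, not real São Paulo data).
blocks = np.array([[300.0, 400.0], [1000.0, 250.0]])
nodes = {
    "metro": np.array([[0.0, 0.0], [1000.0, 0.0]]),
    "terminal": np.array([[300.0, 1400.0]]),
}
distances = nearest_node_distances(blocks, nodes)
```

Each query against a built tree is roughly logarithmic in the number of nodes, which is what makes the approach practical for millions of lots.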
02
rag_terrenos.ipynb: ML + RAG

Lot scoring, semantic retrieval, and PDF generation

PCA + KNN in normalized feature space to surface lots similar to high-potential parcels. FAISS vector store indexes unstructured legal documents (escrituras, environmental reports) with OpenAI text-embedding-ada-002. Hybrid retrieval combines dense FAISS (cosine similarity) with sparse BM25 for better recall on Brazilian zoning terminology. GPT-4o receives structured lot data from SQL alongside k=3 FAISS excerpts and outputs a structured report. GeoPandas + Contextily render a situation map (OpenStreetMap basemap) as PNG. ReportLab composes the final A4 PDF: three-bullet summary, lot data, zoning classification, conclusions, regulatory inconsistencies, regularization paths, and the georeferenced site plan.
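The hybrid retrieval idea can be sketched in pure Python. The source does not say how the dense and sparse scores are fused, so the weighted sum of min-max-normalized scores below (with a hypothetical `alpha` weight) is an assumption. The three-dimensional vectors are toy stand-ins for ada-002 embeddings, and a plain list scan replaces the FAISS index.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document (Lucene-style IDF)."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t in tf:
                idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
                norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
                s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def minmax(xs):
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]


def hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs, alpha=0.5, k=3):
    """Top-k document indices by alpha * dense + (1 - alpha) * sparse (assumed fusion)."""
    sparse = minmax(bm25_scores(query_tokens, docs_tokens))
    dense = minmax([cosine(query_vec, v) for v in doc_vecs])
    fused = [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:k]


docs_tokens = [
    "lote em zeu com coeficiente de aproveitamento maximo".split(),
    "relatorio ambiental area de manancial".split(),
    "escritura do imovel registrada em cartorio".split(),
]
doc_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]  # toy embeddings
top = hybrid_rank(["zeu", "coeficiente"], [1.0, 0.0, 0.0], docs_tokens, doc_vecs)
```

The sparse side rewards exact matches on terms such as "zeu" that a general-purpose embedding may not represent well, which is the recall benefit the paragraph above describes.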

PCA + KNN, FAISS (cosine), BM25 sparse, GPT-4o, ada-002, ReportLab A4, Contextily + OSM
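A minimal sketch of the lot-similarity step, assuming scikit-learn's `MinMaxScaler`, `PCA`, and `NearestNeighbors`. The feature columns, the number of components, and the function name `similar_lots` are illustrative choices, not the notebook's actual configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler


def similar_lots(features, seed_idx, n_components=2, k=2):
    """Indices of the k lots closest to a seed lot in PCA-reduced space."""
    scaled = MinMaxScaler().fit_transform(features)  # each column to [0, 1]
    reduced = PCA(n_components=n_components).fit_transform(scaled)
    knn = NearestNeighbors(n_neighbors=k + 1).fit(reduced)
    _, idx = knn.kneighbors(reduced[[seed_idx]])
    # The seed is its own nearest neighbor, so drop it from the result.
    return [int(i) for i in idx[0] if i != seed_idx][:k]


# Hypothetical columns: area_m2, obsolescence score, distance to metro (m).
X = np.array([
    [1200.0, 0.90, 150.0],   # seed: known high-potential lot
    [1100.0, 0.85, 200.0],
    [300.0, 0.10, 2500.0],
    [1250.0, 0.80, 180.0],
    [250.0, 0.20, 3000.0],
])
matches = similar_lots(X, seed_idx=0, k=2)
```

Scaling before PCA keeps a large-valued column such as area from dominating the components; here the two lots resembling the seed (rows 1 and 3) are returned.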

Data sources

Six primary data sources: federal rural registries, municipal urban records, geospatial shapefiles, and unstructured legal documents from Google Drive. The pipeline ingests all of them and normalizes them into a unified lot-level feature space.

INCRA / SNCR

Federal rural property registry: owner, area, location.

owner, area_ha, location

CAR

Cadastro Ambiental Rural (Rural Environmental Registry): environmental restrictions per parcel.

APP, reserva_legal, uso_restrito

SIGEF

Certified rural properties with INCRA-validated polygon boundaries.

geometry, INCRA cert

GeoSampa / PMSP

11M+ urban IPTU records. Zoning, FAR, use, lot geometry.

IPTU, zoneamento, FAR, area_construida

Shapefiles

Zoning laws, geological risk, watershed areas, and flood extent.

Lei 18177, Lei 16402, risco_geologico, manancial, inundação 5/25/100a

Google Drive PDFs

Unstructured legal documents indexed into FAISS for RAG.

escrituras (deeds), environmental reports, INCRA records

Outputs

Every run produces normalized parquets for the full lot universe, and a georeferenced PDF report for each lot flagged by the ML scoring step.

Normalized parquets

Intermediate datasets produced by the data engineering notebook and consumed by the ML notebook. One file per geospatial layer.

lotes_iptu, lei16, lei18, mancha_inundacao, manancial_completo, uso_predominante, risco_geologico

PDF viability report (per lot)

A4 document generated by ReportLab for each flagged lot. Structured sections produced by GPT-4o with hybrid retrieval context.

3 opening bullets, property data, zoning, conclusions, inconsistencies, regularization, site plan PNG

Situation map (PNG)

GeoPandas renders the lot polygon with its surrounding context. Contextily adds a live OpenStreetMap basemap beneath it, and Matplotlib composes the final image. The PNG is embedded directly in the PDF report.

GeoPandas, Contextily, OSM basemap, Matplotlib, PIL

Personal research project.