ML pipeline built for a São Paulo factoring company. Scrapes company data from the Receita Federal public database, combines it with portfolio concentration metrics, and runs an XGBoost model to score the default probability of each exposure daily. In production.
Factoring companies buy receivables (invoices, bills of exchange) from cedentes (assignors) and collect from sacados (drawees). Every credit decision depends on knowing who you're exposed to: fiscal status, capital structure, company age, concentration in the portfolio. This pipeline automates that entire assessment, from raw CNPJ to a scored, auditable BigQuery table, without any manual lookup.
risco_cicor.py runs on a daily cron. Three data sources feed the model: portfolio concentration from bank PDFs, company data scraped from the Receita Federal, and historical default labels from the internal warehouse. XGBoost scores each exposure and writes results back to BigQuery. No human steps in the loop.
Three inputs -> feature matrix -> XGBoost -> BigQuery. Runs unattended via Google Service Account.
Automated CNPJ lookup for every cedente and sacado in the portfolio. Extracts: company status (Ativa/Inapta/Baixada), opening date, capital social, legal nature, partners, and fiscal regime. Signals that indicate distressed or shell companies are flagged immediately.
Bank concentration report extracted via PyMuPDF. Positional column parsing yields sacado-level exposure: number of titles, total value, and share of total portfolio per counterparty.
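A minimal sketch of positional column parsing, assuming the words arrive as the tuples PyMuPDF returns from `page.get_text("words")`: `(x0, y0, x1, y1, text, block_no, line_no, word_no)`. The column names and x-boundaries in `COLUMNS` are hypothetical; the real report layout is not described here and would need its own measurements.

```python
# Hypothetical column layout: (name, left edge in PDF points).
# These boundaries are illustrative, not taken from the actual bank report.
COLUMNS = [("sacado", 0.0), ("titles", 300.0), ("value", 380.0), ("share", 480.0)]


def column_for(x0):
    """Return the name of the right-most column whose left edge is <= x0."""
    name = COLUMNS[0][0]
    for col, left in COLUMNS:
        if x0 >= left:
            name = col
    return name


def rows_from_words(words, y_tolerance=2.0):
    """Group word tuples into text lines by y0, then into columns by x0."""
    lines = []  # each entry: [anchor_y, [word tuples on that line]]
    for w in sorted(words, key=lambda w: w[1]):
        if lines and abs(w[1] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(w)
        else:
            lines.append([w[1], [w]])

    rows = []
    for _, line_words in lines:
        cells = {}
        # Left-to-right order keeps multi-word names readable.
        for w in sorted(line_words, key=lambda w: w[0]):
            cells.setdefault(column_for(w[0]), []).append(w[4])
        rows.append({col: " ".join(parts) for col, parts in cells.items()})
    return rows
```

With a real document, the `words` argument would come from something like `fitz.open(path)[0].get_text("words")`. Header and footer lines would still need to be filtered out before the rows are loaded.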
Gradient boosted trees trained on labeled historical portfolio data. Feature importance: Receita Federal fiscal flags, company age, capital tier, portfolio concentration ratio, and cedente advance rate. Output: prob_default score per exposure (0 to 1).
Two tables updated daily: company enrichment from Receita Federal and scored risk per exposure. Source of truth for credit committee dashboards and limit review.
Company age in days from data_abertura, capital tier buckets, binary fiscal_flag from situação cadastral, concentration percentile rank within portfolio, and cedente advance rate from the contract register.
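The features listed above could be derived roughly as follows. This is a sketch under stated assumptions, not the production code: the tier edges in `CAPITAL_TIER_EDGES` are invented for illustration, `fiscal_flag` is assumed to be 1 for any status other than Ativa, and the percentile rank is assumed to mean the share of counterparties with exposure at or below the given one.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical tier edges in BRL; the real cut-offs are not given in this document.
CAPITAL_TIER_EDGES = [10_000, 100_000, 1_000_000]


def company_age_days(data_abertura: date, as_of: date) -> int:
    """Days elapsed between the company's opening date and the scoring date."""
    return (as_of - data_abertura).days


def capital_tier(capital_social: float) -> int:
    """Bucket index: 0 below the first edge, len(edges) at or above the last."""
    return bisect_right(CAPITAL_TIER_EDGES, capital_social)


def fiscal_flag(situacao_cadastral: str) -> int:
    """Assumed rule: 1 when the status is anything other than Ativa."""
    return int(situacao_cadastral.strip().upper() != "ATIVA")


def percentile_ranks(exposures: dict[str, float]) -> dict[str, float]:
    """Fraction of counterparties whose exposure is <= each counterparty's own."""
    values = sorted(exposures.values())
    n = len(values)
    return {key: bisect_right(values, v) / n for key, v in exposures.items()}
```

For example, `capital_tier(50_000)` falls in bucket 1 under these illustrative edges, and the largest exposure in a portfolio always gets a percentile rank of 1.0.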
Google Service Account with Drive, Sheets, and BigQuery scopes. JSON keyfile loaded at runtime, with no OAuth user flow. Runs unattended via Windows Task Scheduler / cron.
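Loading the keyfile might look like the sketch below, which assumes the `google-auth` package and its `service_account.Credentials.from_service_account_file` constructor. The function name `load_credentials` and the choice of full read/write scopes are assumptions; the actual script may request narrower scopes.

```python
# Scope URLs for Drive, Sheets, and BigQuery (full-access variants assumed).
SCOPES = [
    "https://www.googleapis.com/auth/drive",
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/bigquery",
]


def load_credentials(keyfile_path: str):
    """Build service-account credentials from a JSON keyfile, with no user prompt."""
    # Imported here so the module can be loaded without google-auth installed.
    from google.oauth2 import service_account

    return service_account.Credentials.from_service_account_file(
        keyfile_path, scopes=SCOPES
    )
```

Because no browser consent step is involved, the same call works from a scheduled task as long as the keyfile path is readable by the account running the job.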
Client project, not open source.