Cicor: credit risk scoring for factoring operations.

ML pipeline built for a São Paulo factoring company. Scrapes company data from the Receita Federal public database, combines it with portfolio concentration metrics, and runs an XGBoost model to score the default probability of each exposure daily. In production.

Factoring companies buy receivables (invoices, bills of exchange) from cedentes (assignors) and collect from sacados (drawees). Every credit decision depends on knowing who you're exposed to: fiscal status, capital structure, company age, concentration in the portfolio. This pipeline automates that entire assessment, from raw CNPJ to a scored, auditable BigQuery table, without any manual lookup.

Python · XGBoost · scikit-learn · Receita Federal scraping · pandas · PyMuPDF (fitz) · Google Drive API v3 · Google Sheets API v4 · BigQuery · Service Account auth · Cron

Risk scoring pipeline `risco_cicor.py`

Runs on a daily cron. Three data sources feed the model: portfolio concentration from bank PDFs, company data scraped from the Receita Federal, and historical default labels from the internal warehouse. XGBoost scores each exposure and writes results back to BigQuery. No human steps in the loop.

Daily scoring pipeline · Scheduled · cron

Three inputs -> feature matrix -> XGBoost -> BigQuery. Runs unattended via Google Service Account.

Input 1

Receita Federal scraping

Automated CNPJ lookup for every cedente and sacado in the portfolio. Extracts: company status (Ativa/Inapta/Baixada), opening date, capital social, legal nature, partners, and fiscal regime. Signals that indicate distressed or shell companies are flagged immediately.

-> features: age, capital, fiscal_flag, status_code
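A minimal sketch of how a scraped record could be turned into the four features above. The record keys, the status encoding, and the rule that any status other than Ativa raises the flag are assumptions for illustration; the source does not specify them.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical encoding of the Receita Federal registration status.
STATUS_CODES = {"ATIVA": 0, "SUSPENSA": 1, "INAPTA": 2, "BAIXADA": 3, "NULA": 4}


@dataclass
class CompanyFeatures:
    age: int          # days since the opening date
    capital: float    # capital social in BRL
    fiscal_flag: int  # 1 when the status is anything other than Ativa (assumed rule)
    status_code: int  # -1 when the status is not in STATUS_CODES


def parse_brl(text: str) -> float:
    """Convert a Brazilian-formatted amount such as '1.250.000,00' to a float."""
    cleaned = text.replace("R$", "").strip()
    return float(cleaned.replace(".", "").replace(",", "."))


def build_features(record: dict, today: date) -> CompanyFeatures:
    """Map one scraped record (hypothetical keys) to model features."""
    opened = date.fromisoformat(record["data_abertura"])
    status = record["situacao_cadastral"].strip().upper()
    return CompanyFeatures(
        age=(today - opened).days,
        capital=parse_brl(record["capital_social"]),
        fiscal_flag=int(status != "ATIVA"),
        status_code=STATUS_CODES.get(status, -1),
    )
```

Passing `today` explicitly keeps the function deterministic, which helps when a daily run has to be reproduced for audit.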

Input 2

Portfolio concentration (PDF)

Bank concentration report extracted via PyMuPDF. Positional column parsing yields sacado-level exposure: number of titles, total value, and share of total portfolio per counterparty.

-> features: concentration_ratio, n_titles, exposure_pct
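Positional parsing can be sketched as: read word boxes from each page, group them into rows by vertical position, then assign each word to a column by its left edge. The column boundaries, the row tolerance, and the function names below are hypothetical; the real report layout is not given in the source. PyMuPDF is imported lazily so the grouping logic can be exercised without a PDF.

```python
from collections import defaultdict

# Hypothetical column boundaries in PDF points: left inclusive, right exclusive.
COLUMNS = {"sacado": (0, 300), "n_titles": (300, 380), "total_value": (380, 480)}


def column_for(x0):
    """Return the column whose x-range contains the word's left edge."""
    for name, (left, right) in COLUMNS.items():
        if left <= x0 < right:
            return name
    return None


def rows_from_words(words, y_tolerance=2.0):
    """Group PyMuPDF word tuples into rows by y, then into columns by x.

    Bucketing y is a simplification: words near a bucket edge may split.
    """
    lines = defaultdict(list)
    for x0, y0, x1, y1, text, *_ in words:
        lines[round(y0 / y_tolerance)].append((x0, text))
    rows = []
    for key in sorted(lines):
        cells = defaultdict(list)
        for x0, text in sorted(lines[key]):
            col = column_for(x0)
            if col:
                cells[col].append(text)
        rows.append({col: " ".join(parts) for col, parts in cells.items()})
    return rows


def parse_brl(text):
    """Convert '7.500,00' to 7500.0."""
    return float(text.replace(".", "").replace(",", "."))


def to_exposures(rows):
    """Keep numeric data rows and add each sacado's share of the total value."""
    parsed = [
        {"sacado": r["sacado"], "n_titles": int(r["n_titles"]),
         "total_value": parse_brl(r["total_value"])}
        for r in rows
        if {"sacado", "n_titles", "total_value"} <= r.keys() and r["n_titles"].isdigit()
    ]
    total = sum(p["total_value"] for p in parsed)
    for p in parsed:
        p["exposure_pct"] = p["total_value"] / total if total else 0.0
    return parsed


def extract_rows(pdf_path):
    """Read word boxes page by page with PyMuPDF and parse them into rows."""
    import fitz  # PyMuPDF, imported lazily
    rows = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            rows.extend(rows_from_words(page.get_text("words")))
    return rows
```

Header lines fall out naturally here because their title-count cell is not numeric. A report with wrapped names or shifted columns would need more careful row detection.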

Model

XGBoost: default probability

Gradient-boosted trees trained on labeled historical portfolio data. Feature importance: Receita Federal fiscal flags, company age, capital tier, portfolio concentration ratio, and cedente advance rate. Output: prob_default score per exposure (0 to 1).

-> BigQuery: Tab_Score_Risco

Output

BigQuery warehouse

Two tables updated daily: company enrichment from Receita Federal and scored risk per exposure. Source of truth for credit committee dashboards and limit review.

-> Tab_CNPJ_RF · Tab_Score_Risco
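A hedged sketch of the write-back step using the `google-cloud-bigquery` client. The row schema, the append disposition, and the fully qualified table id are assumptions; the source names only the tables. The client library is imported lazily so the row builder can be checked on its own.

```python
from datetime import date


def score_rows(scores, run_date):
    """Turn a {cnpj: probability} mapping into JSON-serializable rows.

    Column names are hypothetical.
    """
    return [
        {
            "cnpj": cnpj,
            "prob_default": round(float(p), 6),
            "score_date": run_date.isoformat(),
        }
        for cnpj, p in scores.items()
    ]


def write_scores(rows, table_id, credentials=None):
    """Load rows into BigQuery and return the number of rows written.

    table_id is a full id such as 'project.dataset.Tab_Score_Risco'
    (project and dataset names are placeholders).
    """
    from google.cloud import bigquery  # imported lazily

    client = bigquery.Client(credentials=credentials)
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # assumed
        autodetect=True,
    )
    job = client.load_table_from_json(rows, table_id, job_config=job_config)
    job.result()  # block until the load job finishes
    return job.output_rows
```

Stamping each row with the run date keeps daily scores distinguishable when appending, which suits an auditable table. Overwriting per day would be an equally plausible design.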

Feature engineering

Company age in days from data_abertura, capital tier buckets, binary fiscal_flag from situação cadastral, concentration percentile rank within portfolio, and cedente advance rate from the contract register.
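Two of these transforms, tier bucketing and percentile rank, can be sketched with the standard library alone. The tier edges are hypothetical, and the advance rate is taken to mean amount advanced over face value, which is an assumption about how the contract register is read.

```python
from bisect import bisect_right

# Hypothetical capital tier edges in BRL; tier 0 is below the first edge.
CAPITAL_EDGES = [10_000, 100_000, 1_000_000]


def capital_tier(capital):
    """Return the tier index (0..3) for a capital social amount."""
    return bisect_right(CAPITAL_EDGES, capital)


def percentile_rank(values):
    """For each value, the fraction of the portfolio at or below it."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]


def advance_rate(advanced, face_value):
    """Amount advanced divided by receivable face value (assumed definition)."""
    return advanced / face_value if face_value else 0.0
```

In a pandas workflow, `pd.cut` and `Series.rank(pct=True, method="max")` would cover the same ground.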

Auth & scheduling

Google Service Account with Drive, Sheets, and BigQuery scopes. JSON keyfile loaded at runtime, with no OAuth user flow. Runs unattended via Windows Task Scheduler / cron.
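Loading a keyfile and building the three clients could look roughly like this. The exact scopes (read-only Drive in particular) and the function name are assumptions; the library calls come from `google-auth`, `google-api-python-client`, and `google-cloud-bigquery`, imported lazily inside the function.

```python
# Assumed scope set: the source says only "Drive, Sheets, and BigQuery scopes".
SCOPES = [
    "https://www.googleapis.com/auth/drive.readonly",
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/bigquery",
]


def build_clients(keyfile_path):
    """Create Drive, Sheets, and BigQuery clients from one service account key."""
    from google.cloud import bigquery
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    creds = service_account.Credentials.from_service_account_file(
        keyfile_path, scopes=SCOPES
    )
    return {
        "drive": build("drive", "v3", credentials=creds),
        "sheets": build("sheets", "v4", credentials=creds),
        "bigquery": bigquery.Client(credentials=creds, project=creds.project_id),
    }
```

Because the credentials come from a file rather than a browser consent screen, the same call works under cron or Task Scheduler with no interactive session.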

Stack

ML model

XGBoost · scikit-learn · prob_default score

Data collection

Receita Federal scraping · PyMuPDF (fitz) · positional parsing

Warehouse

BigQuery · Drive API v3 · Service Account

Client project, not open source.