Cicero F.

About Me

Senior Data Engineer and AI Consultant with 13+ years of experience designing production-grade data platforms centered on data integrity, entity resolution, and master data management across insurance, fintech, and media domains. Deep expertise in Python and SQL for building high-throughput record linkage systems, probabilistic matching pipelines, and Data Vault-style architectures that consolidate inconsistent, multi-source data into authoritative canonical records with end-to-end lineage tracking. Proven ability to architect AI-augmented extraction workflows combining LLM-based structured parsing with deterministic validation layers, and to serve as a technical thought partner to leadership teams on platform decisions with significant commercial and regulatory consequences.

Skills

Apache Airflow, Airbyte, Dagster, FiveTran, Debezium, Apache Kafka, Apache NiFi, Apache Spark, Databricks, EMR, Lambda, Scala, Shell Scripting, pgvector, Scikit-learn, Hugging Face, Snowflake, Amazon Redshift, Amazon S3, Google BigQuery, Git, GitHub Actions

Work History

Target Work
Senior Data Engineer | AI Consultant
2019 - 2026 (7 years)
Remote
  • Architected an end-to-end master data platform for a property and contractor data aggregation system, designing a Data Vault 2.0 backbone that ingested registry exports, crawled PDFs, and third-party APIs into Hub, Link, and Satellite structures with full historical lineage tracking across 9,000+ entities and 80 TB of raw source data (hub-loading sketch after these bullets).

  • Designed a probabilistic entity resolution engine in Python combining phonetic normalization, token-sorted fuzzy matching, and address-parsing heuristics to deduplicate contractor and property records with conflicting name and address representations, achieving 94% precision on a labeled validation corpus and consolidating 40M candidate pairs into 8M authoritative master records (matching sketch below).

  • Engineered an LLM-assisted structured extraction pipeline using the Anthropic API and a Dagster orchestration graph to parse unstructured construction permit PDFs and registry documents, implementing a deterministic post-processing validation layer with rule-based checks that flagged low-confidence extractions for human review and maintained 97% field-level accuracy in production (extraction sketch below).
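
A minimal sketch of the hub-loading convention behind the first bullet, assuming MD5-style hash keys over normalized business keys; the entity, columns, and source name are illustrative, not taken from the actual platform.

    import hashlib

    def hub_hash_key(*business_key_parts: str) -> str:
        # Data Vault convention: hash a normalized, delimiter-joined business
        # key so the same real-world entity always maps to the same hub row.
        normalized = "||".join(part.strip().upper() for part in business_key_parts)
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    # Illustrative hub row keyed on license number + state (hypothetical fields).
    hub_contractor_row = {
        "contractor_hk": hub_hash_key("LIC-00481", "CA"),
        "license_number": "LIC-00481",
        "license_state": "CA",
        "record_source": "state_registry_export",
    }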
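
A condensed sketch of the matching logic in the second bullet, assuming the rapidfuzz and jellyfish libraries; the field weights and 0.9 threshold are illustrative rather than the tuned production values.

    import jellyfish                      # phonetic encodings
    from rapidfuzz import fuzz            # token-sorted fuzzy similarity

    def name_score(a: str, b: str) -> float:
        # Token-sort ratio ignores word order ("Smith, John" vs "John Smith").
        fuzzy = fuzz.token_sort_ratio(a, b) / 100.0
        # Phonetic agreement catches spelling variants ("Stephen" / "Steven").
        phonetic = 1.0 if jellyfish.metaphone(a) == jellyfish.metaphone(b) else 0.0
        return 0.7 * fuzzy + 0.3 * phonetic

    def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.9) -> bool:
        # Weighted combination of name and address similarity for a candidate pair.
        address = fuzz.token_sort_ratio(rec_a["address"], rec_b["address"]) / 100.0
        score = 0.6 * name_score(rec_a["name"], rec_b["name"]) + 0.4 * address
        return score >= threshold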
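
A minimal sketch of the extract-then-validate pattern in the third bullet, using the public Anthropic Python SDK; the model name, prompt, fields, and rules are placeholders rather than the production configuration.

    import json
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def extract_permit_fields(permit_text: str) -> dict:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model choice
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": "Return only JSON with keys permit_number, issue_date, "
                           "and contractor_name for this permit:\n" + permit_text,
            }],
        )
        return json.loads(response.content[0].text)

    def validate(fields: dict) -> list[str]:
        # Deterministic rule-based checks; any failure routes the record
        # to the human-review queue instead of the master data store.
        problems = []
        if not fields.get("permit_number", "").strip():
            problems.append("missing permit_number")
        if not str(fields.get("issue_date", ""))[:4].isdigit():
            problems.append("unparseable issue_date")
        return problems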

Python, SQL, Data Vault, LLM, Anthropic API, Dagster, pgvector, PostgreSQL
Seno
Senior AI | Data Engineer
2016 - 2019 (3 years)
Remote
  • Designed a greenfield MDM platform on Amazon Redshift consolidating insurance policy, claims, and financial records from 30+ carrier systems, implementing a canonical customer entity model that resolved policyholder identities across conflicting source representations and unified 2.8B monthly records into a single authoritative customer view for underwriting and actuarial teams.

  • Built a metadata-driven record linkage framework in Python using configurable blocking strategies and weighted field similarity scoring to match customer entities across Guidewire PolicyCenter, Oracle Financials, and legacy mainframe extracts, reducing duplicate policyholder records by 38% and improving loss reserve calculation accuracy by eliminating cross-system double-counting artifacts (linkage sketch after these bullets).

  • Engineered CDC streaming pipelines using Debezium on Amazon MSK to capture real-time policy state changes from Guidewire PostgreSQL databases with sub-60-second end-to-end latency, feeding a stream processing layer that kept the canonical customer entity model current and improved underwriting decisioning responsiveness by 50% through continuously refreshed queue metrics (consumer sketch below).
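
A compact sketch of the configurable blocking-and-scoring framework from the second bullet, assuming rapidfuzz; the blocking key, fields, and weights are illustrative.

    from collections import defaultdict
    from rapidfuzz import fuzz

    CONFIG = {
        # Cheap composite key: records in different blocks are never compared.
        "blocking_key": lambda r: (r["last_name"][:4].upper(), r["zip"][:5]),
        "weights": {"first_name": 0.4, "dob": 0.4, "address": 0.2},
    }

    def candidate_pairs(records):
        # Blocking keeps the comparison space far below the full cross product.
        blocks = defaultdict(list)
        for rec in records:
            blocks[CONFIG["blocking_key"](rec)].append(rec)
        for group in blocks.values():
            for i, a in enumerate(group):
                for b in group[i + 1:]:
                    yield a, b

    def pair_score(a: dict, b: dict) -> float:
        # Weighted field similarity, driven entirely by the config.
        return sum(weight * fuzz.ratio(str(a[field]), str(b[field])) / 100.0
                   for field, weight in CONFIG["weights"].items())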
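
A minimal consumer sketch for the Debezium stream in the third bullet, assuming the confluent-kafka client and Debezium's default JSON envelope; the broker address, topic, and downstream handler are placeholders.

    import json
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "msk-broker:9092",   # placeholder broker
        "group.id": "canonical-entity-updater",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["pg.public.policy"])       # placeholder Debezium topic

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Debezium envelope: "op" is c/u/d, "after" holds the new row state.
        if event["payload"]["op"] in ("c", "u"):
            row = event["payload"]["after"]
            # apply_to_canonical_model(row)  # hypothetical downstream handler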

Amazon Redshift, Python, Oracle Financials, Debezium, Amazon MSK, Airflow, DBT, AWS CloudWatch, SSIS, Power BI
Tecno IT
Backend Developer | Data Engineer
2012 - 2016 (4 years)
Remote
  • Developed backend services and data pipelines for large-scale sports streaming and radio content platforms, building high-throughput RESTful APIs in Java Spring Boot and Python Django that supported 100K+ active users with reliable real-time content delivery and engagement tracking at peak sporting event concurrency.

  • Reduced content delivery latency by 30% by integrating Apache Kafka event streams, WebSockets, and AWS Lambda-based triggers for real-time broadcast updates, designing a fan-out architecture that decoupled live event ingestion from downstream analytics consumers and enabled concurrent audience scaling during high-traffic periods (fan-out sketch after these bullets).

  • Improved database performance across PostgreSQL, MongoDB, and SQL Server through schema redesign, composite index tuning, and query plan optimization, achieving up to 30% faster response times for high-volume concurrent read/write workloads during live event traffic spikes exceeding 50K simultaneous requests (index-tuning sketch below).
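
A simplified sketch of the fan-out idea in the second bullet: live events are published once, and each downstream group (WebSocket pushers, analytics jobs) consumes independently. Assumes the confluent-kafka client; the broker, topic, and event key are illustrative.

    import json
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "kafka:9092"})  # placeholder broker

    def publish_broadcast_update(event: dict) -> None:
        # One write to the topic; WebSocket gateways and analytics consumers
        # read it as separate consumer groups, so neither blocks the other.
        producer.produce(
            "live-broadcast-updates",                 # placeholder topic
            key=str(event["match_id"]).encode(),
            value=json.dumps(event).encode(),
        )
        producer.poll(0)  # serve delivery callbacks without blocking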
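
A small illustration of the composite-index tuning in the third bullet, using psycopg2 against PostgreSQL; the DSN, table, and column names are hypothetical.

    import psycopg2

    conn = psycopg2.connect("dbname=streaming")   # placeholder DSN
    with conn, conn.cursor() as cur:
        # Composite index matching the hot query's filter + sort columns,
        # so the planner can satisfy it with a single index scan.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS idx_events_match_ts
            ON engagement_events (match_id, created_at DESC)
        """)
        cur.execute("""
            EXPLAIN ANALYZE
            SELECT * FROM engagement_events
            WHERE match_id = %s
            ORDER BY created_at DESC LIMIT 100
        """, (42,))
        for line in cur.fetchall():
            print(line[0])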

Education

B.S. in Computer Science
Federal University of Santa Maria