Insurance Database
Overview
My role: Architecture to deployed system, run and maintain.
Challenge: Provide timely information about insurance producers to management organisations through a convenient web portal.
Solution: Established an extraction-transformation-loading (ETL) pipeline from the National Insurance Producer Registry (NIPR) to the information portal.
Impact: Significantly improved error rates, deduplication and matching, while establishing timely updates from the gold-standard data source.
Technologies: Quarkus, Java 21, WordPress
Insurance Producers
Insurance producers are companies and individuals who sell insurance. They are licensed by states to sell certain kinds of insurance products. Knowing which insurance producer holds which licences in which states is key information for organisations managing insurance producers.
Records and Alerts
The main national body that tracks insurance producers and their licenses is the NIPR. They provide full records about insurance producers showing every possible state and license they have or had in the past. They also provide daily incremental updates of any changes. The challenge is to manage the list of insurance producers that the management organisations track, ensure that we have the full record and then keep it up to date through the daily alerts.
XML and SOAP
The information we need to display to the management organisations is not presented in a directly consumable format for the portal. The ETL pipeline maps existing data, newly downloaded full records, as well as daily alerts, and ensures consistent, reliable updates. For full transparency, the system shows each version and changes detected on a timeline for administrators to verify that all mapping and changes are correct. For the end users, all this is transparent as they can use the modern, convenient web UI to access, search, and read the information.
Delivery and Technical Approach
Backfill and Data Quality
A full NIPR record snapshot per producer is a paid download, and with ~20,000 existing records the cost would have been substantial. The legacy database also suffered from misalignment and poor matching inherited from an upstream provider, so we started from imperfect data quality.
The approach was to:
- perform an initial scrub of the existing dataset
- backfill into the new data model
- progressively replace backfilled records with fresh, authoritative data from NIPR as daily alerts arrived
Matching, Deduplication, and Review Workflow
Many fields that need to be aligned (e.g., lines of authority, producer types, and licence descriptions) are free-text rather than stable codes. The matching layer therefore combines:
- high-confidence textual matching
- additional heuristics to detect likely equivalents
- a daily review report for administrators when the system cannot match with sufficient confidence
As the backfill data was replaced by authoritative NIPR-derived records, manual review volume reduced significantly (from ~10–15 reviews per day after go-live to a few per week).
Versioning and Audit Trail
Each producer record is fully versioned from its starting point (backfill or full record download). From that point onward, every daily alert is stored and can be inspected, so administrators can see what changed and when, going back in time.
Each ingestion activity (full downloads and daily alerts) also produces a detailed log linked to the corresponding update for traceability and troubleshooting.
Operations, Resilience, and Security
Operational events and failures are surfaced via Google Chat notifications. The system automatically retries NIPR API calls when the API is unavailable (aligned to the NIPR publishing schedule).
The integration between the WordPress frontend and the Quarkus backend is intentionally split:
- Quarkus performs all ETL, matching, versioning, and database management
- WordPress provides the portal UI and administrative experience
The Quarkus→WordPress synchronisation uses an HMAC-secured connection and publishes full data snapshots after each update. WordPress detects updates and replaces its frontend dataset entirely, ensuring it always serves a consistent copy. This design allows the portal to continue serving the latest available data even if the backend is temporarily unavailable.