Gilang Adrian

— Case study

Multi-channel OTP — from one SMS provider to four

Expanded OTP delivery from a single SMS vendor to two SMS + two WhatsApp providers with failover. Critical authentication flows stopped breaking when one upstream rate-limited.

Role
Product Engineer (Notification platform)
Period
Oct 2022 – Jul 2023
Stack
Java · Spring Boot · Crafter CMS · SMS / WhatsApp APIs

Context

Authentication OTPs ran through one SMS provider. When that provider rate-limited or had a regional issue, login and high-value transaction confirmation broke for everyone, and support tickets piled up faster than they could be triaged. The fix had been in the backlog for months. The work wasn’t hard; the political question of which providers to pick was.

I owned the CMS Notification backlog at the time, wrote the technical RFC, defined acceptance criteria, and ran vendor integration.

Decisions

I built the abstraction first and chose providers second. The routing layer ranks providers by channel, region, cost, and recent delivery success. Adding a new provider is a config change, not a code change, which means we can swap or extend without re-shipping.
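A minimal sketch of what such a ranking step could look like, not the production code. The `Provider` fields, the provider names, and the ordering rule (recent success first, cost as tie-breaker) are assumptions made for illustration; the real weighting is not described here. In practice the provider list would be loaded from config rather than built in code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

class OtpRouter {

    /** One configured provider (hypothetical shape; real entries would come from config). */
    static final class Provider {
        final String name;              // illustrative identifier, e.g. "sms-a"
        final String channel;           // "SMS" or "WHATSAPP"
        final List<String> regions;     // regions this provider serves
        final double costPerMessage;    // in any consistent currency unit
        final double recentSuccessRate; // 0.0 to 1.0, from rolling delivery data

        Provider(String name, String channel, List<String> regions,
                 double costPerMessage, double recentSuccessRate) {
            this.name = name;
            this.channel = channel;
            this.regions = regions;
            this.costPerMessage = costPerMessage;
            this.recentSuccessRate = recentSuccessRate;
        }
    }

    /**
     * Returns the providers that serve the region on a channel the user can
     * receive, best candidate first. SMS and WhatsApp compete as peers.
     */
    static List<Provider> rank(List<Provider> providers, String region,
                               Set<String> reachableChannels) {
        List<Provider> eligible = new ArrayList<>();
        for (Provider p : providers) {
            if (p.regions.contains(region) && reachableChannels.contains(p.channel)) {
                eligible.add(p);
            }
        }
        // Assumed ordering: highest recent success first, then cheapest.
        eligible.sort(Comparator
                .comparingDouble((Provider p) -> p.recentSuccessRate).reversed()
                .thenComparingDouble(p -> p.costPerMessage));
        return eligible;
    }

    public static void main(String[] args) {
        List<Provider> configured = List.of(
                new Provider("sms-a", "SMS", List.of("ID"), 0.02, 0.90),
                new Provider("wa-a", "WHATSAPP", List.of("ID"), 0.01, 0.97));
        for (Provider p : rank(configured, "ID", Set.of("SMS", "WHATSAPP"))) {
            System.out.println(p.name);
        }
    }
}
```

The caller tries the first entry and falls through the list on failure, so failover order and routing order come from the same ranking.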

WhatsApp came in as a peer channel rather than a fallback. In several of our segments, WhatsApp delivery actually beats SMS. Treating both as equals lets the routing layer pick the best channel per user and region instead of always trying SMS first.

Health is scored from per-message delivery outcomes, not from provider-level heartbeats. Vendor status pages miss partial degradation, so heartbeat checks aren’t enough. A rolling delivery success rate catches the cases where a provider returns 200s but messages aren’t actually arriving.
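One way to sketch a rolling delivery success score is a fixed-size window over the most recent message outcomes for a provider. The class name, the window size, and the choice to treat an empty window as healthy are all assumptions for illustration; a time-based window or decayed average would work as well.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Rolling success rate over the last N message outcomes for one provider. */
class RollingDeliveryHealth {
    private final int windowSize;
    private final Deque<Boolean> outcomes = new ArrayDeque<>();
    private int delivered = 0;

    RollingDeliveryHealth(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /**
     * Records one outcome. Pass true only when a delivery receipt confirmed
     * arrival; an HTTP 200 from the provider alone does not count.
     */
    synchronized void record(boolean wasDelivered) {
        outcomes.addLast(wasDelivered);
        if (wasDelivered) {
            delivered++;
        }
        // Evict the oldest outcome once the window is full.
        if (outcomes.size() > windowSize && outcomes.removeFirst()) {
            delivered--;
        }
    }

    /** Share of delivered messages in the window; 1.0 when no data yet (assumption). */
    synchronized double successRate() {
        return outcomes.isEmpty() ? 1.0 : (double) delivered / outcomes.size();
    }

    synchronized boolean isHealthy(double threshold) {
        return successRate() >= threshold;
    }

    public static void main(String[] args) {
        RollingDeliveryHealth health = new RollingDeliveryHealth(4);
        health.record(true);
        health.record(false);
        System.out.println(health.successRate());
    }
}
```

Because the score is fed by delivery receipts rather than HTTP status, a provider that keeps answering 200 while dropping messages still sees its rate fall.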

Results

  • 1 → 4 providers (2 SMS + 2 WhatsApp), routed dynamically per request.
  • Auth-OTP delivery during a vendor incident: from “outage” to “no customer-visible impact.”
  • Same OTP path now reused by transaction confirmations and high-risk action flows.

Stack notes

Acceptance criteria were the hardest part. “Failover works” is not testable. We landed on specific scenarios: provider returns 5xx, provider returns 200 but nothing arrives, provider rate-limits with 429, provider goes silent for N seconds. Each one had a runbook entry and a synthetic monitor. The first three months in production found two real edge cases that the synthetic suite didn’t cover; both became new test scenarios.
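The four scenarios above can be sketched as a classifier that turns one send attempt into a named outcome, which is what makes each scenario individually testable. The enum values, the `classify` signature, and the use of a null status to represent "no response within N seconds" are hypothetical choices for this sketch, not the project's actual types.

```java
/** Maps one send attempt to a named outcome so each failover scenario can be asserted on. */
class FailoverScenarios {

    enum Outcome {
        DELIVERED,
        SERVER_ERROR,            // provider returned 5xx
        RATE_LIMITED,            // provider returned 429
        ACCEPTED_NOT_DELIVERED,  // provider returned 2xx but no receipt arrived
        SILENT,                  // no response within the timeout
        OTHER_ERROR              // anything else, e.g. a 4xx other than 429
    }

    /**
     * @param httpStatus     provider response code, or null if nothing came back
     *                       within the timeout (the "silent for N seconds" case)
     * @param receiptArrived whether a delivery receipt was seen in the receipt window
     */
    static Outcome classify(Integer httpStatus, boolean receiptArrived) {
        if (httpStatus == null) {
            return Outcome.SILENT;
        }
        if (httpStatus == 429) {
            return Outcome.RATE_LIMITED;
        }
        if (httpStatus >= 500 && httpStatus <= 599) {
            return Outcome.SERVER_ERROR;
        }
        if (httpStatus >= 200 && httpStatus <= 299) {
            return receiptArrived ? Outcome.DELIVERED : Outcome.ACCEPTED_NOT_DELIVERED;
        }
        return Outcome.OTHER_ERROR;
    }

    /** In this sketch every outcome except DELIVERED moves on to the next provider. */
    static boolean shouldFailover(Outcome outcome) {
        return outcome != Outcome.DELIVERED;
    }
}
```

A synthetic monitor for each scenario then reduces to injecting the matching input and asserting that the next provider in the ranking was tried.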