Race Conditions in Reactive Microservices: Beyond the Sleep Workaround
A hard-coded sleep in a Lambda callback handler was masking a timing race between a DB write and an external notification. Here are three ways to actually fix it, and why adding a database index doesn't help.
The Symptom
After a payment submission succeeds, the partner fires a callback notification almost immediately. The backend's callback handler throws an exception because it can't find the transaction record. The workaround in the notification handler:
DELAY_SECONDS = 4
if notification_type in {"BIS018", "BIS007"}:
    time.sleep(DELAY_SECONDS)
This masks the problem but doesn't fix it. Under load, when the database is busy with concurrent writes, the write can take longer than 4 seconds and the race fires again.
The Exact Race
[Backend async thread]                            [External partner]
callSubmitPayment(request)
    │                                             Partner processes payment
    │◄──── response with transaction ID ───────   Partner returns reference ID
    │
    ▼
handleSubmitSuccess()
  → entity.setPartnerTransactionId(...)
  → entity.setStatus(COMPLETED)
  → repository.save(entity)  ← DB write in progress    Partner fires callback immediately
                                                        │
                                                        ▼
                                                  notification-handler receives callback
                                                    time.sleep(4s)  ← workaround
                                                        │
                                                        ▼
                                                  call handleCallback(partnerTxnId)
                                                    findByPartnerTransactionId(id)
                                                        │
  DB write still running ───────────────────────►  ROW NOT FOUND → exception
The critical insight: the partner fires its callback the moment their processing completes — which is the same moment your backend begins writing the partner's transaction ID to the database. The 4-second sleep is a bet that your DB write finishes within 4 seconds, which is not guaranteed under load.
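To make the bet concrete, here is a small deterministic sketch. It is a simplified model, not the production code: the function name is hypothetical, the write durations are made-up illustrative numbers rather than measurements from the incident, and it assumes the DB write and the callback start at the same instant.

```python
def callback_finds_row(write_seconds: float, sleep_seconds: float = 4.0) -> bool:
    """Return True if the row is committed by the time the lookup runs.

    Simplifying assumption: the DB write and the callback start at the
    same instant, so the lookup happens `sleep_seconds` into the write.
    """
    return write_seconds <= sleep_seconds


# Hypothetical write durations in seconds: quiet database vs. busy database.
quiet = [0.05, 0.2, 0.8]
busy = [0.3, 2.5, 6.0]

quiet_failures = [w for w in quiet if not callback_finds_row(w)]
busy_failures = [w for w in busy if not callback_finds_row(w)]

print(quiet_failures)  # []
print(busy_failures)   # [6.0]
```

The sleep only moves the threshold. Any write slower than the chosen delay still loses the race, and every callback pays the full delay even when the write finished in milliseconds.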
Why a Database Index Doesn't Help
The proposed fix in this case was to add an index on partner_transaction_id to speed up the lookup. This is the wrong diagnosis.
A database index improves SELECT performance when the row exists. The race condition is a write timing problem — the row doesn't exist yet when the callback arrives.
| | Index | Race Condition |
|---|---|---|
| What it solves | Slow SELECT when row exists | Row not yet committed when SELECT runs |
| Effect here | None: faster lookup on a missing row still returns empty | Requires the write to complete before the read |
Adding an index on partner_transaction_id is a valid general performance improvement — but it is a completely separate concern from this issue. The exception is thrown not because the lookup is slow, but because the row isn't there.
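The point can be checked with a minimal sketch using Python's built-in sqlite3 module. The table layout, the index name and the ID "PTX-123" are assumptions for illustration; the real schema is not shown in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    " id INTEGER PRIMARY KEY,"
    " partner_transaction_id TEXT,"
    " status TEXT)"
)


def find_by_partner_id(partner_id: str):
    """Look up a transaction row by the partner's ID; None if absent."""
    return conn.execute(
        "SELECT id, status FROM transactions WHERE partner_transaction_id = ?",
        (partner_id,),
    ).fetchone()


# The callback arrives before the row is written: nothing to find.
before_index = find_by_partner_id("PTX-123")

# Adding the index changes how fast the lookup runs, not what it returns.
conn.execute(
    "CREATE INDEX idx_partner_txn ON transactions (partner_transaction_id)"
)
after_index = find_by_partner_id("PTX-123")

# Only the write makes the lookup succeed.
conn.execute(
    "INSERT INTO transactions (partner_transaction_id, status) VALUES (?, ?)",
    ("PTX-123", "COMPLETED"),
)
after_write = find_by_partner_id("PTX-123")

print(before_index, after_index, after_write)  # None None (1, 'COMPLETED')
```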
Three Actual Solutions
Option 1 — Retry with backoff in the callback handler (backend)
Add repeatWhenEmpty inside handleCallback so it waits adaptively until the record appears, instead of failing immediately:
// Before: fails immediately if row not found
transactionRepository.findByPartnerTransactionId(request.getTransactionId())
    .switchIfEmpty(Mono.error(new NotFoundException(...)));

// After: retries with incremental backoff
transactionRepository.findByPartnerTransactionId(request.getTransactionId())
    .repeatWhenEmpty(flux -> flux
        .zipWith(Flux.range(1, 5))
        .flatMap(t -> Mono.delay(Duration.ofSeconds(t.getT2())))
    )
    .switchIfEmpty(Mono.error(new NotFoundException(...)));
The retry stops the moment the row appears. No fixed sleep anywhere. The Lambda handler can be simplified to a single call with no delay.
| Aspect | Assessment |
|---|---|
| Fixes root cause | Partially: tolerates the race |
| Lambda change needed | Remove the sleep |
| Risk | Holds a reactive subscription during retry; needs a timeout cap |
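The timeout cap mentioned under Risk can be sketched in plain Python. This is an analog of the retry-until-found idea, not the Reactor code itself: find_with_backoff, its parameters and the 10-second cap are all hypothetical, and the sleep function is injectable only so the example runs instantly.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def find_with_backoff(
    lookup: Callable[[], Optional[T]],
    delays: Sequence[float] = (1, 2, 3, 4, 5),
    max_total_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `lookup` until it returns a value, waiting between attempts.

    Stops as soon as a value appears, when the delays run out, or when the
    next wait would push the total waiting time past `max_total_seconds`.
    Returns None if the value never appeared.
    """
    waited = 0.0
    result = lookup()
    for delay in delays:
        if result is not None:
            break
        if waited + delay > max_total_seconds:
            break  # timeout cap: give up rather than hold the request open
        sleep(delay)
        waited += delay
        result = lookup()
    return result


# Usage: the row becomes visible on the third lookup.
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    return {"status": "COMPLETED"} if attempts["count"] >= 3 else None


slept = []
row = find_with_backoff(fake_lookup, sleep=slept.append)
print(row, slept)  # {'status': 'COMPLETED'} [1, 2]
```

With the cap in place, a row that never appears costs a bounded wait (here at most 10 seconds of sleeping) before the caller falls through to the not-found path.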
Option 2 — Save the lookup key before processing (root-cause fix)
The race exists because the partner's transaction ID is written inside handleSubmitSuccess, the same operation that competes with the callback. If you write just the lookup key to the DB immediately after receiving the partner's response (before any status update), the row becomes findable as soon as that one small write commits, instead of only after the full update:
callSubmitPayment(request)
    │◄──── response with partnerTransactionId ──── Partner returns ID
    │
    ▼
savePartnerTransactionId(entity)   ← minimal write, blocking, completes fast
    │
    ▼
handleSubmitSuccess()              ← full status update, can be async
  → entity.setStatus(COMPLETED)
  → repository.save(entity)
The callback handler finds the record immediately. handleCallback needs to handle the case where status is still PENDING — either wait, return 200 without processing, or return 202 Accepted.
| Aspect | Assessment |
|---|---|
| Fixes root cause | Yes: eliminates the race |
| Lambda sleep needed | No |
| Extra DB writes | One additional minimal write per submission |
| Complexity | Callback handler must tolerate intermediate status |
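A minimal sketch of the two-step write, again using sqlite3 as a stand-in for the real database. It assumes the transaction row already exists with status PENDING before the partner call; the function names, the schema and the choice of 202 for the intermediate state are illustrative, not the production implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    " id INTEGER PRIMARY KEY,"
    " partner_transaction_id TEXT UNIQUE,"
    " status TEXT NOT NULL)"
)
# Assumption: the row is created as PENDING before the partner is called.
conn.execute("INSERT INTO transactions (id, status) VALUES (1, 'PENDING')")
conn.commit()


def save_partner_transaction_id(txn_id: int, partner_id: str) -> None:
    """Step 1: commit only the lookup key, before any other processing."""
    conn.execute(
        "UPDATE transactions SET partner_transaction_id = ? WHERE id = ?",
        (partner_id, txn_id),
    )
    conn.commit()


def handle_submit_success(txn_id: int) -> None:
    """Step 2: the full status update, which may run later."""
    conn.execute(
        "UPDATE transactions SET status = 'COMPLETED' WHERE id = ?", (txn_id,)
    )
    conn.commit()


def handle_callback(partner_id: str) -> int:
    """Return an HTTP-style status code for the partner callback."""
    row = conn.execute(
        "SELECT status FROM transactions WHERE partner_transaction_id = ?",
        (partner_id,),
    ).fetchone()
    if row is None:
        return 404  # lookup key not written yet
    if row[0] != "COMPLETED":
        return 202  # row exists, status update still in flight
    return 200


save_partner_transaction_id(1, "PTX-123")
early = handle_callback("PTX-123")  # callback lands between the two writes
handle_submit_success(1)
late = handle_callback("PTX-123")
print(early, late)  # 202 200
```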
Option 3 — Replace fixed sleep with retry loop in the notification handler
A lower-risk immediate fix: replace time.sleep(4) with a retry loop that calls the backend immediately and retries on 404:
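import time

MAX_RETRIES = 5
BACKOFF_SECONDS = [1, 2, 3, 4, 5]

for attempt in range(MAX_RETRIES):
    response = call_backend_callback_api(callback_data)
    if response.status_code != 404:
        break  # found, or an error that is not the race: stop retrying
    # 404 means the transaction row is not visible yet: wait, then retry
    time.sleep(BACKOFF_SECONDS[attempt])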
Requires: the backend to return 404 specifically when the transaction is not found, rather than 500. If the backend conflates "not found" with "internal error", the retry loop can't distinguish them.
| Aspect | Assessment |
|---|---|
| Eliminates fixed delay | Yes: only waits when actually needed |
| Backend change needed | Must distinguish 404 from 500 |
| Risk | Partner callback has a response timeout; retries must complete within it |
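The backend-side requirement can be sketched as a small mapping from outcomes to status codes. TransactionNotFound, to_status_code and the sample handler are hypothetical names; in the real service this mapping would live in whatever error-handling layer the backend already has.

```python
class TransactionNotFound(Exception):
    """Raised when no row matches the partner transaction ID (hypothetical)."""


def to_status_code(handler, partner_id: str) -> int:
    """Run `handler` and translate its outcome into an HTTP status code."""
    try:
        handler(partner_id)
    except TransactionNotFound:
        return 404  # retryable: the row may simply not be committed yet
    except Exception:
        return 500  # anything else is a real error, not the race
    return 200


known = {"PTX-123"}


def handler(partner_id: str) -> None:
    """Stand-in for the real callback logic."""
    if partner_id == "BROKEN":
        raise RuntimeError("simulated internal error")
    if partner_id not in known:
        raise TransactionNotFound(partner_id)


print(to_status_code(handler, "PTX-123"))  # 200
print(to_status_code(handler, "PTX-999"))  # 404
print(to_status_code(handler, "BROKEN"))   # 500
```

On the timeout risk: the backoff list [1, 2, 3, 4, 5] adds up to 15 seconds of sleeping in the worst case, plus the time of each backend call, so that total has to fit inside the partner's response timeout.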
Recommended Phasing
Immediate: Implement Option 3 in the notification handler — remove the fixed sleep, add retry-on-404. Also ensure the backend returns 404 for "not found" vs 500 for real errors.
Short-term: Implement Option 1 in handleCallback — move retry logic to the backend. The notification handler becomes a simple forwarding call with no retry logic.
Long-term: Implement Option 2 — write the partner transaction ID as a separate early commit. This eliminates the race entirely and makes both the notification handler and the callback handler simple again.
Broader Pattern: Application-Layer Timing Problems
Race conditions between services are not database problems. They're timing problems that require application-layer solutions:
| Pattern | When to use |
|---|---|
| Retry with exponential backoff | The race window is short; you can afford to retry |
| Early write of lookup key | You control the write ordering on the originating side |
| Idempotent upsert | INSERT ... ON CONFLICT DO UPDATE — safe regardless of which operation arrives first |
| Optimistic locking + retry | Concurrent updates on the same row; catch OptimisticLockingFailureException and reload |
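As an illustration of the idempotent upsert row, here is a sketch in which the submit path and the callback path each upsert only their own column, so the final row is the same whichever arrives first. The schema, the callback_received flag and the function names are assumptions for the example. It uses SQLite's ON CONFLICT ... DO UPDATE, which needs SQLite 3.24 or newer.

```python
import sqlite3


def new_db() -> sqlite3.Connection:
    """Create a fresh in-memory database with an illustrative schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        " partner_transaction_id TEXT PRIMARY KEY,"
        " status TEXT,"
        " callback_received INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def record_submit_success(conn, partner_id: str) -> None:
    """Writer on the submit path: sets the status, keeps the callback flag."""
    conn.execute(
        "INSERT INTO transactions (partner_transaction_id, status)"
        " VALUES (?, 'COMPLETED')"
        " ON CONFLICT (partner_transaction_id)"
        " DO UPDATE SET status = excluded.status",
        (partner_id,),
    )


def record_callback(conn, partner_id: str) -> None:
    """Writer on the callback path: sets the flag, keeps the status."""
    conn.execute(
        "INSERT INTO transactions (partner_transaction_id, callback_received)"
        " VALUES (?, 1)"
        " ON CONFLICT (partner_transaction_id)"
        " DO UPDATE SET callback_received = 1",
        (partner_id,),
    )


def final_row(first, second):
    """Apply the two writers in the given order and return the resulting row."""
    conn = new_db()
    first(conn, "PTX-123")
    second(conn, "PTX-123")
    return conn.execute("SELECT * FROM transactions").fetchone()


print(final_row(record_submit_success, record_callback))
print(final_row(record_callback, record_submit_success))
# Both orders print ('PTX-123', 'COMPLETED', 1)
```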
Adding a database index to solve a write timing problem is category confusion. Indexes help reads. Write ordering is an architectural decision.
Based on a production incident in a Spring Boot WebFlux + AWS Lambda callback architecture. Company-specific identifiers removed.