Microservice Timeouts

Published: Tuesday, Feb 9, 2021 Last modified: Wednesday, Feb 24, 2021

I’m still cynical after my earlier poor experience with the microservice pattern. Nonetheless, I see the pattern used increasingly often.

Microservices are attractive to people familiar with the Unix philosophy:

However…

Microservices fail often

basic microservice sequence diagram
@startuml
"Purchase" -> "Auth" : Check user
"Auth" --> "Purchase" : 2s
"Purchase" -> "Payment" : Deduct balance
"Payment" --> "Purchase" : 2s
"Purchase" -> "Provision" : Supply
"Provision" --> "Purchase" : 3s
@enduml

APIs without an SLO and a strict response-time budget inevitably have response times that grow. And don’t forget unavoidable network issues!

So in the above case, where the three sequential calls already add up to 7s, we can easily time out and return a 504 Gateway Timeout.
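To make the arithmetic concrete, here is a minimal Python sketch that adds up the latencies from the diagram above and checks them against a gateway timeout. The 5-second timeout and the function name are assumptions for illustration only; the real limit depends on your gateway.

```python
# Latencies (seconds) are taken from the sequence diagram above.
CALLS = [("Auth", 2.0), ("Payment", 2.0), ("Provision", 3.0)]

# Hypothetical gateway timeout; not a value from the original setup.
GATEWAY_TIMEOUT_S = 5.0


def first_call_over_budget(calls, budget_s):
    """Return the name of the call during which the budget runs out, or None."""
    elapsed = 0.0
    for name, latency in calls:
        elapsed += latency
        if elapsed > budget_s:
            return name
    return None


total = sum(latency for _, latency in CALLS)
print(total)  # 7.0
print(first_call_over_budget(CALLS, GATEWAY_TIMEOUT_S))  # Provision
```

Under these assumptions the budget runs out during Provision, which is after Payment has already deducted the balance. That is the uncomfortable case: the caller sees a timeout, yet money has moved.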

What are the approaches to handle this case?

  1. Introduce a queue to retry
  2. Make the purchase API asynchronous
  3. Monitor the APIs to start strictly enforcing low response times
  4. Introduce caching
  5. Make the API idempotent

Each of these solutions is actually very hard to implement!

  1. Queues sound trivial, but they are not. You need something like AWS SQS, with support for dead-letter queues (DLQs)
  2. Asynchronous APIs often require a callback, which in turn requires an addressable endpoint
  3. Monitoring each API via Prometheus / Grafana is non-trivial, and you will go down a rabbit hole when it comes to distributed tracing
  4. Caching is hard
  5. To make an API idempotent, aka “retryable”, you often need to store state so you know where you left off
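To show why even the queue option carries hidden work, here is a minimal in-memory sketch of retrying with a dead-letter queue. It is not the SQS API: the `drain` function, the retry limit of 3 and the message names are all hypothetical. A real queue also has to deal with backoff, visibility timeouts and duplicate delivery, which this sketch leaves out.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry limit


def drain(queue, dead_letters, handler, max_attempts=MAX_ATTEMPTS):
    """Process every message; requeue failures, dead-letter the exhausted ones."""
    while queue:
        message, attempts = queue.popleft()
        try:
            handler(message)
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                # Give up and park the message for a human (or a job) to inspect.
                dead_letters.append(message)
            else:
                queue.append((message, attempts))


processed = []


def flaky_handler(message):
    """Stand-in for a downstream call; 'order-2' always times out."""
    if message == "order-2":
        raise TimeoutError("Provision timed out")
    processed.append(message)


queue = deque([("order-1", 0), ("order-2", 0)])
dead_letters = []
drain(queue, dead_letters, flaky_handler)

print(processed)     # ['order-1']
print(dead_letters)  # ['order-2']
```

Note that the handler may run several times for the same message, so whatever it calls has to tolerate retries. That is where point 5 comes in.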

Let’s imagine the Payment API also has microservice dependencies:

Microservices calling microservices
@startuml
"Purchase" -> "Auth" : Check user
"Auth" --> "Purchase" : 2s
"Purchase" -> "Payment" : Take payment
"Payment" -> "Account" : Check balance
"Account" --> "Payment" : 1s
"Payment" -> "Account" : Deduct balance
"Account" --> "Payment" : 1s
"Payment" --> "Purchase" : 2s
"Purchase" -> "Provision" : Supply
"Provision" --> "Purchase" : 3s
@enduml
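One way to keep a strict response-time budget across nested calls is to pass the remaining time downstream, so each service can fail fast instead of starting work it cannot finish. The sketch below simulates that for the diagram above. The `Budget` class, the budget values and the reading that Payment's 2s excludes its two Account calls are assumptions, not part of the original design.

```python
class DeadlineExceeded(Exception):
    pass


class Budget:
    """Tracks the time left for one request. Latencies are simulated, not measured."""

    def __init__(self, seconds):
        self.remaining = seconds

    def spend(self, step, seconds):
        # Refuse to start a step that cannot finish within the remaining time.
        if seconds > self.remaining:
            raise DeadlineExceeded(step)
        self.remaining -= seconds


def payment(budget):
    budget.spend("Account: check balance", 1.0)
    budget.spend("Account: deduct balance", 1.0)
    # Assumption: Payment's own 2s comes on top of the Account calls.
    budget.spend("Payment: own work", 2.0)


def purchase(total_budget_s):
    budget = Budget(total_budget_s)
    try:
        budget.spend("Auth: check user", 2.0)
        payment(budget)  # the same budget is handed to the downstream service
        budget.spend("Provision: supply", 3.0)
    except DeadlineExceeded as exc:
        return f"504 at {exc}"
    return "200"


print(purchase(10.0))  # 200
print(purchase(5.0))   # 504 at Payment: own work
```

With the 5-second budget the request fails inside Payment after the balance has already been deducted, so propagating a deadline limits wasted work but does not remove the need for idempotent retries.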

Distributed systems are really hard

  1. Microservices delegate things outside their domain, and introduce interdependencies
  2. There isn’t a universal interface. REST/HTTP is great for synchronous calls, but what happens when you need to go asynchronous?

Rebuttal via a YouTube comment

Luis Santos correctly points out the bad practices here:

  1. Using queues without proper monitoring or a dead-letter strategy
  2. Not having proper monitoring
  3. Neglecting performance
  4. Sharing a database between services
  5. Distributed transactions
  6. Incorrect service boundaries (this is the cause of the previous two bad practices)

Regarding the idempotence problem: you don’t need a cache. You just need to take advantage of your database’s optimistic locking mechanisms. You could use something like an upsert, an insert ignore or a conditional insert.
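As a rough illustration of the insert-ignore idea, here is a sketch using Python's built-in `sqlite3` module, where the equivalent statement is `INSERT OR IGNORE`. The table layout, the `request_id` key and the `deduct_once` helper are hypothetical; the point is only that a unique key lets the database decide whether a retry has already been applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute(
    "CREATE TABLE payments ("
    "request_id TEXT PRIMARY KEY, account_id TEXT NOT NULL, amount INTEGER NOT NULL)"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.commit()


def deduct_once(conn, request_id, account_id, amount):
    """Deduct at most once per request_id. Returns True if this call applied it."""
    with conn:  # one transaction: commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO payments (request_id, account_id, amount) "
            "VALUES (?, ?, ?)",
            (request_id, account_id, amount),
        )
        if cur.rowcount == 0:
            return False  # a retry: this request was already recorded
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )
        return True


def balance(conn, account_id):
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]


print(deduct_once(conn, "req-1", "alice", 30))  # True
print(deduct_once(conn, "req-1", "alice", 30))  # False (retry is a no-op)
print(balance(conn, "alice"))                   # 70
```

The caller has to send the same `request_id` on every retry, so some state is still stored. It just lives in the database that already holds the balance rather than in a separate cache.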

I’m not sure “Sharing a database between services” (read-only) is such a bad practice, since duplicating data across several databases can be far worse.