Kafka Consumer Error Handling and Retries - Part 1

One of the aspects I truly love about my job as a Software Developer is the constant opportunity to learn new things. Even after years of working with Kafka, I continue to encounter new concepts that deepen my understanding of its inner workings. Through this blog series, I aim to share my insights on Kafka error handling and retries. Join me on this exciting journey as we unravel the intricacies of Kafka's error handling mechanisms. This is a three-part series: in this first installment you will learn the fundamentals of error handling in Kafka along with basic error handling scenarios; the second and third parts will walk through these concepts with working examples.

In this article I will work with a fictitious online store; let’s call it Gir (fun fact: Gir is a forest, just like Amazon, known for its majestic lions). Gir is the most sustainable shop on the planet, selling only ethically sourced products handled in the most sustainable way possible.

Gir Architecture

The following describes the different components in Gir's architecture and how they interact with each other.


Components

  • eCommerce Engine is responsible for the order checkout experience for customers.
  • Order Management System (OMS) is responsible for managing the life cycle of the order after order submission.
  • Warehouse Management System (WMS) is responsible for shipping/fulfilling the ordered items to customers.
  • Payment Gateway is responsible for processing order payments and refunds.

Create Order Flow

The following describes what happens when an order is submitted.

  1. Customer places an order
  2. The eCommerce application (which exposes the checkout functionality) publishes a Create Order event to a Kafka topic named "digital_orders"
  3. Order Management System (OMS), one of the consumers of the "digital_orders" topic, stores the order information in its database (a minimal consumer sketch follows)
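To make the flow concrete, here is a minimal sketch of what the OMS consumer loop might look like using the plain Java Kafka client. The topic name "digital_orders" comes from the flow above; the OrderService class and its saveOrder method are hypothetical stand-ins for the OMS database logic, and the connection settings are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OmsOrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "oms");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // commit manually, only after successful processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("digital_orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Hypothetical OMS persistence step; any of the failures listed
                    // below (DB down, network error, bad payload) can surface here.
                    OrderService.saveOrder(record.value());
                }
                consumer.commitSync(); // offsets are committed only after processing succeeds
            }
        }
    }
}
```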

This might seem like a straightforward use case—what could possibly go wrong? Well, as it turns out, a lot more than you'd expect!

  • Database down/failure
  • Kafka broker down/failure
  • Network failures preventing the application from communicating with downstream systems
  • Authentication/authorization errors
  • Kafka offset commit failures
  • Kafka rebalancing
  • Serialization/deserialization errors because of a message format mismatch between producer and consumer
  • Duplicate messages
  • Messages processed in the wrong order

Categorization of Consumer Errors

Now let’s discuss these errors and how we can handle them. Some of these errors are recoverable: a DB outage or a network failure could be an intermittent issue, and we just have to retry and hope the dependent systems self-heal. Other errors are not recoverable, like serialization/deserialization errors or duplicate messages (unless your application is idempotent); no matter how many times you retry, they will continue to fail indefinitely.

Non-Recoverable Errors

Non-recoverable errors are errors that cannot be handled programmatically; this includes unknown errors or failures that require manual intervention. Serialization/deserialization errors are non-recoverable: if a message is invalid (e.g. a schema mismatch between producer and consumer), no matter how many times you retry, it will always fail. Authentication and authorization errors due to an expired password or revoked permissions are also non-recoverable. The only way to fix them is to reconfigure your application with a new password or grant the required permission to the service account.

Handling Non-Recoverable Errors

The following describes the handling of non-recoverable errors.

  1. Kafka Consumer message processing fails with a non-recoverable error (i.e. an error that cannot be resolved by retrying)
  2. If the message failure is tolerable (e.g. a duplicate order message causes OMS processing to fail because the order already exists in the OMS database), send the message to a DLQ topic.
  3. If the failure is not tolerable (a poison pill), stop processing messages and generate alerts (see the sketch below).
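Here is a sketch of steps 2 and 3, continuing the consumer loop from earlier. The DLQ topic name "digital_orders_dlq", the NonRecoverableException type with its isTolerable method, the dlqProducer (a plain KafkaProducer), and the Alerting helper are all hypothetical names, not part of the Kafka API.

```java
// Fragment of the poll loop: route non-recoverable failures.
try {
    OrderService.saveOrder(record.value());
} catch (NonRecoverableException e) {
    if (e.isTolerable()) {
        // e.g. duplicate order: park the message on the DLQ for later analysis,
        // then let the normal offset commit skip past it
        dlqProducer.send(new ProducerRecord<>("digital_orders_dlq",
                record.key(), record.value()));
    } else {
        // poison pill: alert and stop consuming so no further messages are lost
        Alerting.page("Poison pill on digital_orders at offset " + record.offset());
        consumer.close();
        throw e;
    }
}
```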

Recoverable Errors

Recoverable errors are intermittent errors that are temporary and usually resolve themselves. For example, a call to the Payment Gateway might fail due to maintenance activity; the Kafka consumer has no control over this, and all it can do is keep retrying and hope the Payment Gateway recovers. This can be considered a recoverable error.

Handling Recoverable Errors

The following describes the handling of recoverable errors.


  1. Kafka Consumer message processing fails with a recoverable error (i.e. an error that may be resolved by retrying)
  2. Wait for some time before retrying the operation that resulted in the failure
  3. Keep retrying after a delay, potentially using exponential backoff (keep incrementing the delay on subsequent failures)
  4. After a certain number of retries, consider the message a failure and stop retrying (see the sketch below).
  5. Decide what to do with the failed message: if the failure can be tolerated, publish the message to a Dead Letter Topic (DLT/DLQ) for further investigation.
  6. If the failure cannot be tolerated, for example when messages are ordered and the failure of one message could cause subsequent messages to fail, stop processing and generate alerts.
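Steps 2 through 4 can be implemented as a simple retry loop with exponential backoff. Below is a minimal sketch; the maxRetries and initial delay values are purely illustrative, and processOrder and RecoverableException are hypothetical stand-ins for whatever operation failed (a DB write, a Payment Gateway call) and its error type.

```java
// Retry with exponential backoff: wait, retry, double the delay,
// and give up after a fixed number of attempts (values are illustrative).
void processWithRetry(ConsumerRecord<String, String> record)
        throws InterruptedException, RecoverableException {
    int maxRetries = 5;
    long delayMs = 1000; // initial backoff of 1 second
    for (int attempt = 1; ; attempt++) {
        try {
            processOrder(record.value());
            return; // success: stop retrying
        } catch (RecoverableException e) {
            if (attempt >= maxRetries) {
                throw e; // step 4: give up; the caller decides DLQ vs. stop
            }
            Thread.sleep(delayMs); // step 2: wait before retrying
            delayMs *= 2;          // step 3: exponential backoff
        }
    }
}
```

One caveat worth noting: sleeping like this inside the poll loop delays subsequent poll() calls. If the total retry time exceeds the consumer's max.poll.interval.ms setting, Kafka will consider the consumer dead and trigger a rebalance, so the retry budget should stay well below that limit.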

In some cases, determining whether a failure can be tolerated may not be straightforward and may require implementing a circuit breaker pattern. For example, if one call to write to the DB fails, it may be tolerable, but if 10 back-to-back calls fail, it may be a signal of some system-wide failure, and hence consumption of messages should be stopped.
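A minimal version of that idea is a consecutive-failure counter; the threshold of 10 mirrors the example above and is purely illustrative. Production systems would more likely reach for a library such as Resilience4j, but the core mechanism looks like this:

```java
// Naive circuit breaker: tolerate isolated failures, but open the circuit
// after too many consecutive ones (threshold is illustrative).
class ConsecutiveFailureBreaker {
    private final int threshold;
    private int consecutiveFailures = 0;

    ConsecutiveFailureBreaker(int threshold) { this.threshold = threshold; }

    void recordSuccess() { consecutiveFailures = 0; } // any success resets the count

    void recordFailure() { consecutiveFailures++; }

    // true when failures look systemic and consumption should pause
    boolean isOpen() { return consecutiveFailures >= threshold; }
}
```

In the consumer loop you would call recordFailure() on each error and, once isOpen() returns true, stop fetching, for example via consumer.pause(consumer.assignment()), until the downstream system is healthy again.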

Determining whether an error is recoverable or not can sometimes be tricky. A failing authorization request could mean the authorization server is temporarily down, in which case it could recover in time (recoverable), or the permissions may have been revoked or misconfigured, which may require the consumer to be reconfigured and redeployed (non-recoverable).
In this particular scenario, if you are making a REST API call, you may be able to differentiate the two by the HTTP response status code: treat 4XX status codes (e.g. 401 Unauthorized, 403 Forbidden) as non-recoverable errors and 5XX (e.g. 503 Service Unavailable) as recoverable.
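In code, that distinction might look like the small sketch below, classifying the status code of a downstream HTTP call into recoverable vs. non-recoverable:

```java
// Classify an HTTP status code from a downstream call (sketch).
// 4XX means the request itself is bad, so retrying won't help;
// 5XX suggests a server-side problem that may clear up on its own.
boolean isRecoverable(int statusCode) {
    if (statusCode >= 400 && statusCode < 500) {
        return false; // e.g. 401 Unauthorized, 403 Forbidden: non-recoverable
        // (real systems often special-case 429 Too Many Requests as recoverable)
    }
    if (statusCode >= 500) {
        return true;  // e.g. 503 Service Unavailable: retry with backoff
    }
    return false; // 2XX/3XX responses should not reach error handling
}
```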

Having covered the theoretical aspects, we will now move on to practical implementation. In the upcoming two parts of this series, we will explore various error handling scenarios with hands-on examples, addressing both recoverable and non-recoverable errors.