Recently I had an interesting case where I found duplicate messages in the Kafka topics of a data pipeline.
Duplicate records in Kafka topics can appear for many different reasons, but most of the causes you will find described relate to the Kafka settings themselves.
In this article you will see another, quite frequent cause that is hard to detect: the TCP retransmission rate.
Network errors, and especially the relation between the network and duplicate messages, are difficult to troubleshoot for several reasons:
- People who know every segment of the IT infrastructure that might impact your system, and who also have the analytic and troubleshooting skills, are difficult to find.
- The network and Kafka administrators are usually different people sitting in different departments (inter-departmental information flow is usually slow and inefficient).
- Monitoring the whole stack, with all of its components, is needed to analyze and correlate information from the different subsystems. Having an excellent tool for one component while another component is not monitored at all is a very common case. Another case I see very often is incorrectly set up monitoring tools, in which case you will most probably miss some important information.
- Permissions to view and analyze information from different subsystems are difficult to obtain.
- Duplicate messages in Kafka topics occur only from time to time, and errors of this type are often difficult to reproduce.
- Error type. Usually when troubleshooting network issues you look for dropped packets. Retransmission is what provides reliability in the TCP protocol, and some amount of it is normal. The problem arises when you have spikes like the ones in the following two graphs (a couple of quick checks are shown after this list).
- Affected subsystem: network errors are usually the most difficult to find. You might have heard the joke: if you don't know what the problem is, just blame the network.
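Before reaching for graphs, two quick command-line checks can help tell packet drops and TCP retransmissions apart. This is a minimal sketch assuming a Linux host with ethtool and iproute2 installed; enp0s8 is just an example interface name:
# Cumulative TCP retransmitted segments since boot (iproute2's nstat):
nstat -az TcpRetransSegs
# Per-NIC drop counters (counter names vary by driver):
ethtool -S enp0s8 | grep -i drop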
On the following picture you can see the TCP Retrans Segs graph:
and on the next one the same period as a TCP Retrans Error Rate graph:
If you can correlate the duplicate messages with the time intervals of the spikes on these graphs, you are on the right track to resolving your issue.
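If you do not have such graphs at hand, a rough way to build the same picture is to sample the kernel's TCP counters together with a timestamp and match the spikes against the timestamps of the duplicate records. A minimal sketch, assuming a Linux host and an arbitrary 10-second interval:
while true; do
  # In /proc/net/snmp, field 12 of the Tcp: value line is OutSegs and field 13 is RetransSegs.
  awk -v ts="$(date -Is)" '/^Tcp:/ && $13 ~ /^[0-9]+$/ {print ts, "OutSegs=" $12, "RetransSegs=" $13}' /proc/net/snmp
  sleep 10
done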
Kafka's default "at least once" delivery strategy, combined with the spikes in network packet retransmission, is the real cause of the duplicate messages: the broker writes a batch, the acknowledgement is delayed or lost during a retransmission spike, the producer retries, and the records end up written twice.
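The retry behaviour behind at-least-once delivery is driven by a few producer settings. The snippet below is only an illustration using the standard Java producer's property names; the values are examples, not a recommendation for your cluster:
# producer configuration (illustrative values)
acks=all
retries=2147483647            # with at-least-once semantics, a missed acknowledgement means a re-send
delivery.timeout.ms=120000    # how long the producer keeps retrying
# The idempotent producer lets the broker de-duplicate such retries:
enable.idempotence=true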
The next step is to resolve this issue.
First you need to find the current and maximum ring buffer values for your NIC:
jp@jp.localdomain:~>ethtool -g enp0s8
Ring parameters for enp0s8:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 256
The next step is to increase the RX and TX values (first temporarily; if it fixes the problem, you can make it permanent, as shown after the verification output below) by executing the following commands:
jp@jp.localdomain:~>sudo ethtool -G enp0s8 rx 4096
jp@jp.localdomain:~>sudo ethtool -G enp0s8 tx 4096
and finally check to confirm that the new values have been applied:
jp@jp.localdomain:~>ethtool -g enp0s8
Ring parameters for enp0s8:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
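A setting applied with ethtool -G does not survive a reboot. How to persist it depends on the distribution (for example, ETHTOOL_OPTS in the interface's ifcfg file on RHEL-style systems); one generic option is a small systemd oneshot unit such as the sketch below, where the unit name is just an example:
# /etc/systemd/system/nic-ring-buffers.service (example name; adjust the ethtool path and interface)
[Unit]
Description=Set NIC ring buffer sizes
After=network-pre.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -G enp0s8 rx 4096 tx 4096

[Install]
WantedBy=multi-user.target
Enable it with: sudo systemctl daemon-reload && sudo systemctl enable --now nic-ring-buffers.service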
If you are using VMware ESX, you can change these settings from the GUI instead of the terminal.
Summary:
Hopefully, if your analysis is correct, duplicate messages will no longer occur after changing the NIC ring buffer parameters.