I am getting some random timeout errors while publishing messages using Confluent.Kafka. The application runs in a Kubernetes cluster and is built using the .NET 6 framework. When the default timeout (60000ms) is reached, the message is successfully published as a result of the configured retries. These are the logs:
[08:03:34 INF] Publishing message with key dada33d9-067c-4528-8e85-b9f50858fb1b and version v1 to topic topic.name ...
[08:04:34 INF] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out ProduceRequest in flight (after 60018ms, timeout #0)
[08:04:34 WRN] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
[08:04:34 ERR] rdkafka#producer-1: FAIL - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 1247331ms in state UP)
[08:04:34 INF] A non fatal error with error code Local_TimedOut occurred: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 1247331ms in state UP). The client will automatically try to recover from this error.
[08:04:34 INF] Message with key dada33d9-067c-4528-8e85-b9f50858fb1b and version v1 successfully published to topic topic.name in partition [4]:4765 .
And by changing SocketTimeoutMs
from 60000ms to 5000ms, the message is published after 5 seconds:
[13:45:46 INF] Publishing message with key db380009-a910-43e4-8a1b-457e5d163f9a and version v1 to topic topic.name ...
[13:45:51 INF] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out ProduceRequest in flight (after 5103ms, timeout #0)
[13:45:51 WRN] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
[13:45:51 ERR] rdkafka#producer-1: FAIL - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 309267ms in state UP)
[13:45:51 INF] A non fatal error with error code Local_TimedOut occurred: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 309267ms in state UP). The client will automatically try to recover from this error.
[13:45:52 INF] Message with key db380009-a910-43e4-8a1b-457e5d163f9a and version v1 successfully published to topic topic.name in partition [7]:14638 .
I can t find a pattern for this behavior, but it happens pretty often I would say, both when the application is deployed and on my local environment.
Based on the logs from App Insights for one of our environments:
- Last 30 Days: 305 timeouts out of 55.72k messages published => 0.55%
- Last 7 Days: 92 timeouts out of 14.60k messages published => 0.62%
And from another environment where we don t have to many messages published:
- Last 30 Days: 63 timeouts out of 467 messages published => 13.49%
- Last 7 Days: 20 timeouts out of 244 messages published => 8.2%
I tried changing the default values for some of the configuration settings, like the default value for LingerMs
from 5ms to 100ms, but didn t help.
Does anyone have any idea what could be the reason behind this? I don t know if there is something wrong with my code/configuration, or something wrong with the Kafka broker. The broker is not owned or maintained by our team, but I know it is built on top of Apache Kafka, using Strimzi, running in Kubernetes.
Client configuration:
- CompressionType: CompressionType.Snappy
- Acks: Acks.Leader
- SecurityProtocol: SecurityProtocol.SaslSsl
- EnableSslCertificateVerification: true
- MessageMaxBytes: 100000000
- SocketKeepaliveEnable: true
- MessageSendMaxRetries: 10
- RetryBackoffMs: 100
- LingerMs: 100
- SocketTimeoutMs: 5000
Edit:
More logs with Debug = "broker,topic,msg"
[12:24:24 INF] Publishing message with key cd23c386-0855-4207-9868-c1ef0acf059c and version v2 to topic topic.name ...
[12:24:24 INF] rdkafka#producer-1: PRODUCE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7]: Produce MessageSet with 1 message(s) (1041 bytes, ApiVersion 7, MsgVersion 2, MsgId 0, BaseSeq -1, PID{Invalid}, snappy)
[12:24:27 INF] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out ProduceRequest in flight (after 3016ms, timeout #0)
[12:24:27 INF] rdkafka#producer-1: REQERR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: ProduceRequest failed: Local: Timed out: explicit actions Retry,MsgPossiblyPersisted
[12:24:27 INF] rdkafka#producer-1: MSGSET - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7]: MessageSet with 1 message(s) (MsgId 0, BaseSeq -1) encountered error: Local: Timed out (actions Retry,MsgPossiblyPersisted)
[12:24:27 INF] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
[12:24:27 INF] rdkafka#producer-1: FAIL - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 298222ms in state UP) (_TIMED_OUT)
[12:24:27 INF] rdkafka#producer-1: FAIL - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 298222ms in state UP)
[12:24:27 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state UP -> DOWN
[12:24:29 INF] rdkafka#producer-1: QRYLEADER - [thrd:main]: Topic topic.name [1]: broker is down: re-query
[12:24:29 INF] A non fatal error with error code Local_TimedOut occurred: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 298222ms in state UP). The client will automatically try to recover from this error.
[12:24:29 INF] rdkafka#producer-1: QRYLEADER - [thrd:main]: Topic topic.name [4]: broker is down: re-query
[12:24:29 INF] rdkafka#producer-1: QRYLEADER - [thrd:main]: Topic topic.name [7]: broker is down: re-query
[12:24:29 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state DOWN -> INIT
[12:24:29 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:29 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state INIT -> TRY_CONNECT
[12:24:29 INF] rdkafka#producer-1: CONNECT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: broker in state TRY_CONNECT connecting
[12:24:29 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state TRY_CONNECT -> CONNECT
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 0 Leader 0
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 1 Leader 1
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 2 Leader 2
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 3 Leader 0
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 4 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 5 Leader 2
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 6 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 7 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 8 Leader 2
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 9 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: sasl_ssl://oauth-2.test.bin.az.company.tech:9094/2: 1/1 requested topic(s) seen in metadata
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 0 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 1 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 2 Leader 2
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 3 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 4 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 5 Leader 2
[12:24:30 INF] rdkafka#producer-1: CONNECT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Connecting to ipv4#13.95.89.147:9094 (sasl_ssl) with socket 5896
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 6 Leader 0
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 7 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 8 Leader 2
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: Topic topic.name partition 9 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: sasl_ssl://oauth-2.test.bin.az.company.tech:9094/2: 1/1 requested topic(s) seen in metadata
[12:24:30 INF] rdkafka#producer-1: CONNECT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Connected to ipv4#13.95.89.147:9094
[12:24:30 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state CONNECT -> SSL_HANDSHAKE
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: CONNECTED - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Connected (#2)
[12:24:30 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state SSL_HANDSHAKE -> APIVERSION_QUERY
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: AUTH - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Auth in state APIVERSION_QUERY (handshake supported)
[12:24:30 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state APIVERSION_QUERY -> AUTH_HANDSHAKE
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:31 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:31 INF] rdkafka#producer-1: SASLMECHS - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker supported SASL mechanisms: OAUTHBEARER
[12:24:31 INF] rdkafka#producer-1: AUTH - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Auth in state AUTH_HANDSHAKE (handshake supported)
[12:24:31 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state AUTH_HANDSHAKE -> AUTH_REQ
[12:24:31 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:31 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:31 INF] rdkafka#producer-1: OAUTHBEARER - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: SASL OAUTHBEARER authentication successful (principal=)
[12:24:31 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state AUTH_REQ -> UP
[12:24:31 INF] rdkafka#producer-1: PRODUCE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7]: Produce MessageSet with 1 message(s) (1041 bytes, ApiVersion 7, MsgVersion 2, MsgId 0, BaseSeq -1, PID{Invalid}, snappy)
[12:24:31 INF] rdkafka#producer-1: MSGSET - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7]: MessageSet with 1 message(s) (MsgId 0, BaseSeq -1) delivered
[12:24:31 INF] Message with key cd23c386-0855-4207-9868-c1ef0acf059c and version v2 successfully published to topic topic.name in partition [7]:18182 .