English 中文(简体)
Random "Timed out ProduceRequest in flight" messages
原标题:

I am getting some random timeout errors while publishing messages using Confluent.Kafka. The application runs in a Kubernetes cluster and is built using the .NET 6 framework. When the default timeout (60000ms) is reached, the message is successfully published as a result of the configured retries. These are the logs:

[08:03:34 INF] Publishing message with key  dada33d9-067c-4528-8e85-b9f50858fb1b  and version  v1  to topic  topic.name ...
[08:04:34 INF] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out ProduceRequest in flight (after 60018ms, timeout #0)
[08:04:34 WRN] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
[08:04:34 ERR] rdkafka#producer-1: FAIL - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 1247331ms in state UP)
[08:04:34 INF] A non fatal error with error code  Local_TimedOut  occurred: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 1247331ms in state UP). The client will automatically try to recover from this error.
[08:04:34 INF] Message with key  dada33d9-067c-4528-8e85-b9f50858fb1b  and version  v1  successfully published to topic  topic.name  in partition  [4]:4765 .

And by changing SocketTimeoutMs from 60000ms to 5000ms, the message is published after 5 seconds:

[13:45:46 INF] Publishing message with key  db380009-a910-43e4-8a1b-457e5d163f9a  and version  v1  to topic  topic.name ...
[13:45:51 INF] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out ProduceRequest in flight (after 5103ms, timeout #0)
[13:45:51 WRN] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
[13:45:51 ERR] rdkafka#producer-1: FAIL - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 309267ms in state UP)
[13:45:51 INF] A non fatal error with error code  Local_TimedOut  occurred: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 309267ms in state UP). The client will automatically try to recover from this error.
[13:45:52 INF] Message with key  db380009-a910-43e4-8a1b-457e5d163f9a  and version  v1  successfully published to topic  topic.name  in partition  [7]:14638 .

I can t find a pattern for this behavior, but it happens pretty often I would say, both when the application is deployed and on my local environment.
Based on the logs from App Insights for one of our environments:

  • Last 30 Days: 305 timeouts out of 55.72k messages published => 0.55%
  • Last 7 Days: 92 timeouts out of 14.60k messages published => 0.62%

And from another environment where we don t have to many messages published:

  • Last 30 Days: 63 timeouts out of 467 messages published => 13.49%
  • Last 7 Days: 20 timeouts out of 244 messages published => 8.2%

I tried changing the default values for some of the configuration settings, like the default value for LingerMs from 5ms to 100ms, but didn t help.

Does anyone have any idea what could be the reason behind this? I don t know if there is something wrong with my code/configuration, or something wrong with the Kafka broker. The broker is not owned or maintained by our team, but I know it is built on top of Apache Kafka, using Strimzi, running in Kubernetes.

Client configuration:

  • CompressionType: CompressionType.Snappy
  • Acks: Acks.Leader
  • SecurityProtocol: SecurityProtocol.SaslSsl
  • EnableSslCertificateVerification: true
  • MessageMaxBytes: 100000000
  • SocketKeepaliveEnable: true
  • MessageSendMaxRetries: 10
  • RetryBackoffMs: 100
  • LingerMs: 100
  • SocketTimeoutMs: 5000

Edit: More logs with Debug = "broker,topic,msg"

[12:24:24 INF] Publishing message with key  cd23c386-0855-4207-9868-c1ef0acf059c  and version  v2  to topic  topic.name ...
[12:24:24 INF] rdkafka#producer-1: PRODUCE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7]: Produce MessageSet with 1 message(s) (1041 bytes, ApiVersion 7, MsgVersion 2, MsgId 0, BaseSeq -1, PID{Invalid}, snappy)
[12:24:27 INF] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out ProduceRequest in flight (after 3016ms, timeout #0)
[12:24:27 INF] rdkafka#producer-1: REQERR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: ProduceRequest failed: Local: Timed out: explicit actions Retry,MsgPossiblyPersisted
[12:24:27 INF] rdkafka#producer-1: MSGSET - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7]: MessageSet with 1 message(s) (MsgId 0, BaseSeq -1) encountered error: Local: Timed out (actions Retry,MsgPossiblyPersisted)
[12:24:27 INF] rdkafka#producer-1: REQTMOUT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
[12:24:27 INF] rdkafka#producer-1: FAIL - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 298222ms in state UP) (_TIMED_OUT)
[12:24:27 INF] rdkafka#producer-1: FAIL - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 298222ms in state UP)
[12:24:27 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state UP -> DOWN
[12:24:29 INF] rdkafka#producer-1: QRYLEADER - [thrd:main]: Topic topic.name [1]: broker is down: re-query
[12:24:29 INF] A non fatal error with error code  Local_TimedOut  occurred: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: 1 request(s) timed out: disconnect (after 298222ms in state UP). The client will automatically try to recover from this error.
[12:24:29 INF] rdkafka#producer-1: QRYLEADER - [thrd:main]: Topic topic.name [4]: broker is down: re-query
[12:24:29 INF] rdkafka#producer-1: QRYLEADER - [thrd:main]: Topic topic.name [7]: broker is down: re-query
[12:24:29 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state DOWN -> INIT
[12:24:29 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:29 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state INIT -> TRY_CONNECT
[12:24:29 INF] rdkafka#producer-1: CONNECT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: broker in state TRY_CONNECT connecting
[12:24:29 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state TRY_CONNECT -> CONNECT
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 0 Leader 0
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 1 Leader 1
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 2 Leader 2
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 3 Leader 0
[12:24:29 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 4 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 5 Leader 2
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 6 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 7 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 8 Leader 2
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 9 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: sasl_ssl://oauth-2.test.bin.az.company.tech:9094/2: 1/1 requested topic(s) seen in metadata
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 0 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 1 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 2 Leader 2
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 3 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 4 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 5 Leader 2
[12:24:30 INF] rdkafka#producer-1: CONNECT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Connecting to ipv4#13.95.89.147:9094 (sasl_ssl) with socket 5896
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 6 Leader 0
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 7 Leader 1
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 8 Leader 2
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]:   Topic topic.name partition 9 Leader 0
[12:24:30 INF] rdkafka#producer-1: METADATA - [thrd:main]: sasl_ssl://oauth-2.test.bin.az.company.tech:9094/2: 1/1 requested topic(s) seen in metadata
[12:24:30 INF] rdkafka#producer-1: CONNECT - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Connected to ipv4#13.95.89.147:9094
[12:24:30 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state CONNECT -> SSL_HANDSHAKE
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: CONNECTED - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Connected (#2)
[12:24:30 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state SSL_HANDSHAKE -> APIVERSION_QUERY
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:30 INF] rdkafka#producer-1: AUTH - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Auth in state APIVERSION_QUERY (handshake supported)
[12:24:30 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state APIVERSION_QUERY -> AUTH_HANDSHAKE
[12:24:30 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:31 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:31 INF] rdkafka#producer-1: SASLMECHS - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker supported SASL mechanisms: OAUTHBEARER
[12:24:31 INF] rdkafka#producer-1: AUTH - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Auth in state AUTH_HANDSHAKE (handshake supported)
[12:24:31 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state AUTH_HANDSHAKE -> AUTH_REQ
[12:24:31 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:31 INF] rdkafka#producer-1: TOPPAR - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7] 1 message(s) queued but broker not up
[12:24:31 INF] rdkafka#producer-1: OAUTHBEARER - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: SASL OAUTHBEARER authentication successful (principal=)
[12:24:31 INF] rdkafka#producer-1: STATE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: Broker changed state AUTH_REQ -> UP
[12:24:31 INF] rdkafka#producer-1: PRODUCE - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7]: Produce MessageSet with 1 message(s) (1041 bytes, ApiVersion 7, MsgVersion 2, MsgId 0, BaseSeq -1, PID{Invalid}, snappy)
[12:24:31 INF] rdkafka#producer-1: MSGSET - [thrd:sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1]: sasl_ssl://oauth-1.test.bin.az.company.tech:9094/1: topic.name [7]: MessageSet with 1 message(s) (MsgId 0, BaseSeq -1) delivered
[12:24:31 INF] Message with key  cd23c386-0855-4207-9868-c1ef0acf059c  and version  v2  successfully published to topic  topic.name  in partition  [7]:18182 .
问题回答

I have the same issue with Kafka consumer Java. I have tried to change linger.ms but it doesn t work.





相关问题
Random "Timed out ProduceRequest in flight" messages

I am getting some random timeout errors while publishing messages using Confluent.Kafka. The application runs in a Kubernetes cluster and is built using the .NET 6 framework. When the default timeout (...

如何将.Net 6.0的dll引用到.Net Framework 4.8中

大家好! 我有一个类库项目,目标是.NET Framework 6.0。当我将此DLL引用到目标为.NET Framework 4.8的另一个项目中时,我会收到以下错误消息 ...

热门标签