basicPublish can freeze for very long time on network interface removal #994

@sebek64

  • RabbitMQ version: 3.9.21
  • Erlang version: 12.3.2.2
  • Client library version: 5.16.0
  • Operating system, version, and patch level: Linux, kernel 5.10.0
  • Java: openjdk version "17.0.5" 2022-10-18 LTS

The RabbitMQ Java client can freeze while writing to the socket when the network interface is removed. For example, run an app in Docker and disconnect its network with the docker network disconnect ... command. If the connection is in the middle of a basicPublish at that moment, the call is very likely to get stuck for a long time. No timeout configuration seems to help (SO_TIMEOUT, heartbeats, SO_KEEPALIVE, ...).

The thread is stuck with this stacktrace:

"DefaultDispatcher-worker-5" #315 daemon prio=5 os_prio=0 cpu=64.27ms elapsed=120.00s tid=0x00007fe9ecb2c650 nid=0x201 runnable  [0x00007fe9d74f6000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.Net.poll(java.base@17.0.5/Native Method)
        at sun.nio.ch.NioSocketImpl.park(java.base@17.0.5/NioSocketImpl.java:181)
        at sun.nio.ch.NioSocketImpl.park(java.base@17.0.5/NioSocketImpl.java:190)
        at sun.nio.ch.NioSocketImpl.implWrite(java.base@17.0.5/NioSocketImpl.java:415)
        at sun.nio.ch.NioSocketImpl.write(java.base@17.0.5/NioSocketImpl.java:440)
        at sun.nio.ch.NioSocketImpl$2.write(java.base@17.0.5/NioSocketImpl.java:826)
        at java.net.Socket$SocketOutputStream.write(java.base@17.0.5/Socket.java:1045)
        at java.io.BufferedOutputStream.flushBuffer(java.base@17.0.5/BufferedOutputStream.java:81)
        at java.io.BufferedOutputStream.flush(java.base@17.0.5/BufferedOutputStream.java:142)
        - locked <0x00000000c8b84988> (a java.io.BufferedOutputStream)
        at java.io.DataOutputStream.flush(java.base@17.0.5/DataOutputStream.java:128)
        at com.rabbitmq.client.impl.SocketFrameHandler.flush(SocketFrameHandler.java:197)
        at com.rabbitmq.client.impl.AMQConnection.flush(AMQConnection.java:636)
        at com.rabbitmq.client.impl.AMQCommand.transmit(AMQCommand.java:134)
        at com.rabbitmq.client.impl.AMQChannel.quiescingTransmit(AMQChannel.java:455)
        - locked <0x00000000c8b2b308> (a java.lang.Object)
        at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:428)
        - locked <0x00000000c8b2b308> (a java.lang.Object)
        at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:710)
        at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:685)
        at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:675)
...
   Locked ownable synchronizers:
        - <0x00000000c8b820b8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

In the netstat output, the socket's send buffer is non-empty, so data is queued but not leaving the host.

From reading the source code of this library and of NioSocketImpl, the socket appears to still be in a "recoverable" state. The flush call is blocked, and implWrite keeps waiting on the assumption that the socket will become writable again, which it does not.

Ideally, either the flush would throw an exception (which doesn't happen), or the library would detect heartbeat timeouts and close the connection from outside the blocked thread.
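
To illustrate what "close the connection from outside" could mean at the socket level, here is a minimal sketch that uses only JDK classes and no API of this library. It simulates a stuck write with a loopback peer that never reads (not with a real interface removal), sets SO_TIMEOUT to show that it only bounds reads, and lets a separate watchdog thread close the socket, which makes the blocked write fail with an IOException. The class name WriteWatchdogSketch and the method writeUntilClosed are hypothetical, and whether the library could apply the same idea to its own socket is an open question here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of the RabbitMQ client.
final class WriteWatchdogSketch {

    /**
     * Writes to a peer that never reads, so the write eventually blocks.
     * A watchdog closes the socket after watchdogMillis. Returns true once
     * a write has failed with an IOException as a result of that close.
     */
    static boolean writeUntilClosed(long watchdogMillis) throws IOException {
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) { // accepted, but never read from

            // SO_TIMEOUT bounds blocking reads only; it does not limit writes.
            client.setSoTimeout(100);

            // Close the socket from another thread after the deadline.
            watchdog.schedule(() -> {
                try {
                    client.close();
                } catch (IOException ignored) {
                    // nothing useful to do in this sketch
                }
            }, watchdogMillis, TimeUnit.MILLISECONDS);

            OutputStream out = client.getOutputStream();
            byte[] chunk = new byte[64 * 1024];
            try {
                while (true) {
                    out.write(chunk); // blocks once the socket buffers are full
                }
            } catch (IOException e) {
                return true; // the close from the watchdog ended the write
            }
        } finally {
            watchdog.shutdownNow();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("write ended by close: " + writeUntilClosed(500));
    }
}
```

The key design point is that the watchdog closes the raw socket instead of performing a protocol-level close, so it does not need to write anything to the stuck connection.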

Implementing this kind of behavior in the application itself fails. For example, if we time out the basicPublish call and then try to close or abort the connection, the close/abort always tries to write something to the socket, so it blocks as well.

For this reason, we believe this is a bug in the library itself, although a very subtle one that is hard to fix.
