Akka shard region buffering messages #7411

giovanni-ferrari · 2024-12-18T08:29:55Z


Scenario:

An actor (called MqttActor) receives messages from MQTT. The actor is part of a shard, with only one instance running at a time (one instance per MQTT configuration, so only one in this scenario).
MqttActor forwards messages to the shard region that hosts the ForwardActor(s).
N actors (called ForwardActor) in the shard receive the messages, process them (according to their own internal state), and forward them to different buses.

Problem:
The application is stable for days, consuming and forwarding ~2500 msg/s. From time to time, without any change in message volume or size, I see CPU spikes and increasing memory usage. The CPU goes to 100% (and stays there for about 1 h), and memory slowly increases until the application stops (on one node of the cluster, the one where the MqttActor is running); the cluster is then rebalanced.

When the CPU is at 100%, I see logs related to "heartbeat interval is growing too large" and "Scheduled sending of heartbeat was delayed".
When the CPU is at 100%, I see that the rate of messages forwarded by ForwardActor shows drops and spikes on the cluster nodes where the MqttActor is not running, while it stays stable on the node where the MqttActor is running.

My first hypothesis is that the memory increase is due to the shard region buffering messages, because the other nodes cannot be considered reachable while the heartbeat is unstable.
The drops in messages for ForwardActors that are not running on the same node as the MqttActor could be because their messages were buffered. The spikes could be because the node was briefly considered reachable again, so the buffered messages were delivered to the actor.
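
To make the buffering hypothesis concrete, here is a minimal toy model of a bounded buffer in front of a destination that is sometimes unable to accept messages. It is only a sketch under assumptions: the function `simulate`, the tick-based timing, and the drop-when-full policy are hypothetical and are not taken from Akka's implementation; it only illustrates how a gap followed by a burst, plus drops once the buffer is full, could arise.

```python
from collections import deque


def simulate(reachable, rate=5, buffer_size=12):
    """Toy model: one sender, one destination, one bounded buffer.

    reachable   -- list of booleans, one per tick; True means the
                   destination can accept messages during that tick
    rate        -- messages produced per tick (hypothetical value)
    buffer_size -- maximum number of buffered messages (hypothetical value)

    Returns three lists with one entry per tick:
    delivered, dropped, and buffer length at the end of the tick.
    """
    buffer = deque()
    delivered, dropped, buffered = [], [], []
    next_id = 0

    for ok in reachable:
        sent = 0
        lost = 0
        if ok:
            # Destination is back: flush the whole backlog in one burst.
            sent += len(buffer)
            buffer.clear()
        for _ in range(rate):
            if ok:
                sent += 1                 # delivered directly
            elif len(buffer) < buffer_size:
                buffer.append(next_id)    # held until the destination returns
            else:
                lost += 1                 # buffer full: message is dropped
            next_id += 1
        delivered.append(sent)
        dropped.append(lost)
        buffered.append(len(buffer))

    return delivered, dropped, buffered


if __name__ == "__main__":
    pattern = [True, True, False, False, False, False, True, True]
    d, x, b = simulate(pattern)
    print("delivered per tick:", d)
    print("dropped per tick:  ", x)
    print("buffered per tick: ", b)
```

In this model the delivered count falls to zero while the destination is unavailable, jumps well above the normal rate on the first tick after it returns, and messages are lost once the buffer reaches its limit. Memory held by the buffer is bounded by `buffer_size`, so in this model unbounded growth would have to come from somewhere else.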

My second hypothesis is that the CPU at 100% is due to the shard region having to buffer messages and then forward them when the other nodes become available.

This scenario would then be triggered by an "initial" CPU spike (for which I do not know the reason), after which the application is unable to recover until it stops.

Are these hypotheses plausible? Does anyone have an idea why this could happen, or how to prevent it?

Aaronontheweb · 2024-12-20T19:42:08Z

Maintainer

Hi @giovanni-ferrari

Do you see any warnings or info logs about buffered messages in the logs prior to the CPU hitting 100%? If the system was operating normally and then suddenly stopped working, that could be the result of shard message buffering, but you'd be seeing movement in the cluster first (e.g. scaling up, down, etc.). Does this issue occur when there's no movement first?
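
For orientation, the settings most closely tied to the symptoms described in this thread are the shard region buffer limit and the cluster failure detector timings. The fragment below is a sketch only, assuming the standard `akka.cluster` HOCON layout; the values are placeholders for illustration, not recommendations and not confirmed defaults for this application.

```hocon
akka.cluster {
  sharding {
    # Maximum number of messages a shard region will buffer
    # (placeholder value, check your own configuration)
    buffer-size = 100000
  }
  failure-detector {
    # How often heartbeats are sent (placeholder value)
    heartbeat-interval = 1s
    # How long missing heartbeats are tolerated before a node
    # is suspected to be unreachable (placeholder value)
    acceptable-heartbeat-pause = 3s
  }
}
```

Comparing the effective values on the affected node with the timestamps of the heartbeat warnings may help show whether buffering started before or after the CPU reached 100%.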
