Question: Actor Persistence / Stateful actors #254
Replies: 2 comments 1 reply
Great question. This is a deliberate design decision - Ergo Framework follows the Erlang/OTP philosophy where actor processes are ephemeral and state lives outside the actor. There is no built-in "Persistent Actor" pattern (like Akka Persistence), and this is intentional. Here's why, and what to do instead.

### Why there's no built-in event sourcing / persistent actors

The Erlang/OTP philosophy (which Ergo follows) treats actors as computation units, not storage units. The "let it crash" model assumes that a process can be killed and restarted with a clean slate at any moment, which only works if any state worth keeping lives outside the process.
In Akka, persistent actors are convenient but come with well-known pain points - schema evolution of persisted events, journal compaction, snapshot management, recovery time growing with event count. Ergo avoids inheriting these problems by keeping the actor model clean.

### The recommended approach for your use case

For long-running computations that must survive restarts/deployments in K8s, the pattern is external queue + checkpointing.

### Pattern 1: External queue with explicit acknowledgment

Instead of sending messages directly to the worker actor, put work items into an external durable queue (NATS JetStream, Kafka, Redis Streams, PostgreSQL with FOR UPDATE SKIP LOCKED, etc.). The actor pulls work from the queue and only acknowledges completion after processing finishes. On restart (after a deploy), the supervisor restarts the worker, and in Init() it simply reconnects to the queue; any unacknowledged items are redelivered. This is the simplest and most reliable pattern. It completely avoids the "how do we replay stored messages" problem. (A minimal sketch of this pattern appears at the end of this answer.)

### Pattern 2: Checkpoint-based recovery for multi-step computation

If the computation is genuinely long (seconds to minutes) and can be broken into steps, checkpoint after each step and resume from the last checkpoint on restart.
```go
type AnalyticsWorker struct {
    act.Actor
    store CheckpointStore // your backend (Redis, PostgreSQL, etc.)
}

func (w *AnalyticsWorker) Init(args ...any) error {
    // On restart, check if there's an incomplete computation
    checkpoint, err := w.store.LoadCheckpoint(w.Name())
    if err != nil {
        return err
    }
    if checkpoint != nil {
        // Resume from where we left off by sending ourselves a message
        w.Send(w.PID(), MessageResumeComputation{
            JobID:       checkpoint.JobID,
            Step:        checkpoint.Step,
            PartialData: checkpoint.Data,
        })
    }
    return nil
}
func (w *AnalyticsWorker) HandleMessage(from gen.PID, message any) error {
    switch msg := message.(type) {
    case MessageStartComputation:
        // Step 1: save that we accepted the job (check the error so we
        // never run work we can't recover)
        if err := w.store.SaveCheckpoint(w.Name(), Checkpoint{
            JobID: msg.JobID, Step: 1, Data: nil,
        }); err != nil {
            return err
        }
        // ... do step 1 ...
        result1 := computeStep1(msg.Input)
        // Step 2: checkpoint after step 1
        if err := w.store.SaveCheckpoint(w.Name(), Checkpoint{
            JobID: msg.JobID, Step: 2, Data: result1,
        }); err != nil {
            return err
        }
        // ... do step 2 ...
        result2 := computeStep2(result1)
        // Done — clear checkpoint
        w.store.DeleteCheckpoint(w.Name())
        publishResult(result2)
    case MessageResumeComputation:
        // Pick up from the checkpointed step
        switch msg.Step {
        case 2:
            result2 := computeStep2(msg.PartialData)
            w.store.DeleteCheckpoint(w.Name())
            publishResult(result2)
        default:
            // Step 1 wasn't completed - restart from scratch
            // (the original input would also need to be checkpointed)
            w.Send(w.PID(), MessageStartComputation{JobID: msg.JobID /* ...original input... */})
        }
    }
    return nil
}
```

The key insight: the self-send in Init() turns recovery into ordinary message handling - the resume message flows through the same sequential mailbox as everything else, so no special replay machinery is needed.

### Pattern 3: Inbox pattern with external persistence (closest to Akka Persistence)

If you really want an Akka-like "persist-then-process" guarantee, persist every message before processing it and mark it processed afterwards.
```go
type DurableInbox struct {
    act.Actor
    store    MessageStore
    sequence uint64
}

func (d *DurableInbox) Init(args ...any) error {
    // Load all unprocessed messages from the store
    stored, err := d.store.LoadUnprocessed(d.Name())
    if err != nil {
        return err
    }
    // Replay them in order; StoredMessage is distinct from MessageNewWork,
    // so replayed work is not persisted a second time
    for _, s := range stored {
        d.Send(d.PID(), s)
    }
    d.sequence = d.store.LastSequence(d.Name())
    return nil
}

func (d *DurableInbox) HandleMessage(from gen.PID, message any) error {
    switch msg := message.(type) {
    case MessageNewWork:
        // Persist first, then process (Akka-style guarantee)
        d.sequence++
        if err := d.store.Persist(d.Name(), d.sequence, msg); err != nil {
            // Can't persist — reject the message; the returned error
            // terminates the actor and the supervisor will restart it
            return err
        }
        // Now process
        publishResult(doWork(msg))
        // Mark as processed only after the work is done
        d.store.MarkProcessed(d.Name(), d.sequence)
    case StoredMessage:
        // Replayed after a restart: already persisted, just process
        publishResult(doWork(msg.Work))
        d.store.MarkProcessed(d.Name(), msg.Seq)
    }
    return nil
}
```

### Answering your specific questions

**"How do we handover to stored messages after persisting?"**

You don't need a special handover. Once state is persisted synchronously in the handler, the actor continues processing the next message from its mailbox naturally. The sequential, single-goroutine nature of Ergo actors gives you this guarantee for free - HandleMessage finishes (including the persist call) before the next message is dequeued.

**"On Init() how do we replay stored messages?"**

Use the self-send approach shown in Patterns 2 and 3: load whatever is outstanding from the store and send it to your own PID; those messages are then handled in order like any other mailbox traffic.

**"The processing of the next command will not start until the state has been successfully stored"**

This is automatically guaranteed by the actor model. Since all message handling is sequential (single goroutine), if your HandleMessage does store.Save() synchronously before returning, the next message won't be dequeued until that call has completed.

### Recommendation for K8s specifically

For K8s rolling deployments, Pattern 1 (external queue) is strongly recommended: the queue outlives any individual pod, unacknowledged work is redelivered automatically after a restart, and the actor itself stays stateless.
The external queue approach is what most production Erlang/Elixir systems use too (with RabbitMQ, Kafka, etc.) - persistent actors in those ecosystems are typically reserved for aggregate roots in event-sourced domains, not for computation pipelines.
---
Thanks for these great insights, I'll play around with the options and see which best suits our use case (we can even mix and match them, depending on what's at hand). Great library, and I am looking forward to the release of v3.3.0.
---
What is the recommended way of implementing stateful actors that can recover their state after an event such as an application upgrade?
Example use case:
A long-running task (maybe a heavy computation that takes several seconds, e.g. part of an analytics pipeline); a message has been received by the worker/actor, but in the middle of processing a new version of the app has been deployed (picture this happening in a K8s environment). Naturally, the actor's state will be lost.
What would be the best way to go about this?
Weird workarounds (not yet tested):
- Persist the message as soon as the HandleMessageName callback is invoked, then split the handling ("Split-Handle") - still not that hard. But this presents a problem, as in classic Akka: in such a case, how do we handover to stored messages after persisting?
Also, on Init() how do we replay stored messages to the actor for processing after a fresh start?
I am sure that you've thought about this and decided not to implement it for a reason, but I would be glad if you pointed me in the right direction so that I can implement a solution that would help with this.