AgentVault’s messaging reliability system is inspired by Signal’s delivery architecture. It ensures that encrypted messages reach their recipients even when WebSocket connections die silently, devices go offline, or networks switch mid-conversation.

The Problem

When a client’s WebSocket connection dies silently (iOS backgrounding, idle desktop, network switch), the backend may still consider the device “connected.” Without reliability measures, messages are stored but never delivered — the server believes the device received them, and push notifications are suppressed.

Architecture Overview

Owner App                    Backend                        Agent Plugin
    |                           |                               |
    |--- WS message ----------->|                               |
    |                           |-- store ciphertext            |
    |                           |-- create delivery record      |
    |                           |-- publish to Redis            |
    |                           |-- direct WS forward --------->|
    |                           |-- dispatch push (always-on)   |
    |                           |                               |
    |                           |<--- ACK ----------------------|
    |                           |-- mark DELIVERED              |
    |                           |                               |
    |                   [60s escalation loop]                    |
    |                           |-- query stale PENDING rows    |
    |                           |-- re-trigger push             |
    |                           |-- mark FAILED after 3 retries |
The reliability system has four interlocking phases: push always-on, delivery receipts, connection liveness detection, and push notification handling.

Phase 1: Push Always-On

Push notifications fire on every message regardless of WebSocket connection state. A redundant push notification is harmless; a missed one is a delivery failure.
Traditional systems suppress push notifications when the backend believes a device is connected via WebSocket. This breaks when the WebSocket dies silently (no close frame is sent). AgentVault removes this suppression entirely. Push notifications are safe to send unconditionally because:
  • They show an OS-level alert (service worker on web, Expo Push on native)
  • They do not inject messages into chat state — that comes from WebSocket or history sync
  • Duplicate alerts are a minor UX inconvenience; missed messages are a reliability failure
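
The rule above can be sketched as a small fan-out function. This is an illustrative sketch only: the `Device` and `Transports` types and the `dispatch` function are hypothetical names, not AgentVault's actual API, and TypeScript is used purely for consistency with the client snippets on this page.

```typescript
// Hypothetical types for illustration; not AgentVault's real API.
interface Device {
  id: string;
  wsConnected: boolean; // what the backend *believes*, which may be stale
}

interface Transports {
  forwardOverWebSocket(deviceId: string, ciphertext: string): void;
  sendPush(deviceId: string, messageId: string): void;
}

function dispatch(
  device: Device,
  messageId: string,
  ciphertext: string,
  transports: Transports,
): void {
  if (device.wsConnected) {
    // Best-effort real-time path; goes nowhere if the socket is a zombie.
    transports.forwardOverWebSocket(device.id, ciphertext);
  }
  // Deliberately no `else`: push is sent whether or not the socket looks alive.
  transports.sendPush(device.id, messageId);
}
```

The point of the sketch is the missing `else` branch: the push call never depends on the connection flag.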

Phase 2: Delivery Receipts (ACK System)

Every message creates a delivery tracking record for each recipient device. The ACK system closes the loop on whether a message was actually received and decrypted.

Delivery Flow

  1. Message sent. After storing ciphertext in the messages table, the backend creates a MessageDelivery row for each recipient device (excluding the sender) with status PENDING.
  2. Client receives and decrypts. After successful decryption, the client sends an ACK event over WebSocket.
  3. ACK processed. The backend marks the delivery as DELIVERED with a timestamp.
  4. Escalation (if no ACK). A background task checks for stale PENDING rows every 60 seconds and re-triggers delivery attempts.
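
The first three steps of the flow can be modeled with an in-memory list of delivery rows. This is a sketch under stated assumptions: the field names (`messageId`, `deviceId`, `createdAt`, `deliveredAt`) and the functions `createDeliveries` and `applyAck` are hypothetical, and a real backend would persist these rows in a database rather than an array.

```typescript
type DeliveryStatus = "pending" | "delivered" | "read" | "failed";

// Hypothetical row shape; the real MessageDelivery schema may differ.
interface MessageDelivery {
  messageId: string;
  deviceId: string;
  status: DeliveryStatus;
  createdAt: number; // epoch milliseconds
  deliveredAt?: number;
}

// Step 1: one PENDING row per recipient device, excluding the sender.
function createDeliveries(
  messageId: string,
  recipientDeviceIds: string[],
  senderDeviceId: string,
  now: number,
): MessageDelivery[] {
  return recipientDeviceIds
    .filter((deviceId) => deviceId !== senderDeviceId)
    .map((deviceId): MessageDelivery => ({
      messageId,
      deviceId,
      status: "pending",
      createdAt: now,
    }));
}

// Step 3: mark the ACKing device's PENDING rows as DELIVERED.
// Returns how many rows changed, so repeated ACKs are harmless no-ops.
function applyAck(
  rows: MessageDelivery[],
  deviceId: string,
  messageIds: string[],
  now: number,
): number {
  const acked = new Set(messageIds);
  let updated = 0;
  for (const row of rows) {
    if (row.deviceId === deviceId && row.status === "pending" && acked.has(row.messageId)) {
      row.status = "delivered";
      row.deliveredAt = now;
      updated += 1;
    }
  }
  return updated;
}
```

Only rows still in `pending` are touched, so a duplicate ACK for the same message does not overwrite the original delivery timestamp.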

ACK Wire Protocol

After decrypting one or more messages, the client sends a batch ACK:
{
  "event": "ack",
  "data": {
    "message_ids": ["uuid-1", "uuid-2", "uuid-3"]
  }
}

| Behavior | Value |
| --- | --- |
| Max IDs per ACK | 50 |
| Client debounce | 500ms |
| Server processing | Fire-and-forget (asyncio.create_task) |

Both the web app and the @agentvault/agentvault plugin implement ACK batching and debouncing automatically. You do not need to manage this manually unless building a custom client.
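
For anyone building a custom client, the batching and debouncing described above might look roughly like the following. The `AckBatcher` class is a hypothetical sketch, not the plugin's actual implementation; in particular, flushing immediately once 50 IDs are queued is an assumption about how the cap and the debounce interact.

```typescript
const MAX_IDS_PER_ACK = 50;
const ACK_DEBOUNCE_MS = 500;

// Hypothetical helper: collects decrypted message IDs and emits batch ACK frames.
class AckBatcher {
  private pending: string[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined;
  private readonly send: (frame: string) => void;

  constructor(send: (frame: string) => void) {
    this.send = send;
  }

  // Call after a message has been successfully decrypted.
  add(messageId: string): void {
    this.pending.push(messageId);
    if (this.pending.length >= MAX_IDS_PER_ACK) {
      this.flush(); // assumption: a full batch is sent without waiting
      return;
    }
    // Debounce: restart the 500ms window on every new ID.
    clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flush(), ACK_DEBOUNCE_MS);
  }

  // Sends everything queued, at most 50 IDs per frame.
  flush(): void {
    clearTimeout(this.timer);
    this.timer = undefined;
    while (this.pending.length > 0) {
      const message_ids = this.pending.splice(0, MAX_IDS_PER_ACK);
      this.send(JSON.stringify({ event: "ack", data: { message_ids } }));
    }
  }
}
```

In a real client, `send` would typically be `(frame) => ws.send(frame)`, and `flush()` could also be called right before an intentional disconnect so queued ACKs are not lost.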

Delivery Statuses

CREATE TYPE delivery_status AS ENUM (
  'pending',
  'delivered',
  'read',
  'failed'
);

| Status | Meaning |
| --- | --- |
| pending | Message stored, delivery not yet confirmed |
| delivered | Client sent ACK after successful decryption |
| read | Client confirmed the message was displayed (future) |
| failed | Delivery failed after 3 escalation attempts |

Escalation Task

A background task runs every 60 seconds to catch undelivered messages.

| Parameter | Value | Description |
| --- | --- | --- |
| Interval | 60s | How often the escalation loop runs |
| Stale threshold | 120s | Deliveries older than 2 minutes are considered stale |
| Max per cycle | 100 | Bounds work per loop iteration |
| Max escalations | 3 | Retries before marking as FAILED |

Without a retry cap, stale deliveries cause infinite escalation loops that hammer failed endpoints every 60 seconds. The 3-attempt cap prevents this.
The escalation task:
  1. Queries deliveries where status = 'pending', created_at < 2 minutes ago, and escalation_count < 3
  2. Re-triggers push notification for owner devices
  3. Increments escalation_count on each attempt
  4. Marks delivery as FAILED after 3 unsuccessful escalations
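
One loop iteration of the task above can be sketched as a pure function over in-memory rows. The names (`EscalationRow`, `runEscalationCycle`, `retriggerPush`) are hypothetical, and the timing of the FAILED transition is an assumption: here a row is marked failed on the first cycle that finds it still pending after its third escalation, which gives the last push a chance to be ACKed.

```typescript
const STALE_THRESHOLD_MS = 120_000;
const MAX_PER_CYCLE = 100;
const MAX_ESCALATIONS = 3;

// Hypothetical row shape for illustration.
interface EscalationRow {
  messageId: string;
  deviceId: string;
  status: "pending" | "delivered" | "read" | "failed";
  createdAt: number; // epoch milliseconds
  escalationCount: number;
}

// Runs one cycle; returns how many rows were escalated (pushes re-triggered).
function runEscalationCycle(
  rows: EscalationRow[],
  now: number,
  retriggerPush: (row: EscalationRow) => void,
): number {
  let escalated = 0;
  for (const row of rows) {
    const stale = row.status === "pending" && now - row.createdAt > STALE_THRESHOLD_MS;
    if (!stale) continue;
    if (row.escalationCount >= MAX_ESCALATIONS) {
      row.status = "failed"; // retry cap reached: stop escalating this row
      continue;
    }
    if (escalated >= MAX_PER_CYCLE) continue; // bound the work per iteration
    retriggerPush(row);
    row.escalationCount += 1;
    escalated += 1;
  }
  return escalated;
}
```

A scheduler would call this every 60 seconds, for example from a `setInterval` that passes `Date.now()` as `now`.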

Phase 3: Connection Liveness

AgentVault detects zombie WebSocket connections through a two-sided heartbeat system.

Server-Side: Pong Timeout + Redis TTL

| Constant | Value | Purpose |
| --- | --- | --- |
| HEARTBEAT_INTERVAL | 30s | How often server sends ping |
| HEARTBEAT_TIMEOUT | 90s | Max time without pong before disconnect |
| REDIS_ALIVE_TTL | 60s | Redis key auto-expiry |

The server tracks the last pong timestamp for each device. If no pong is received within 90 seconds, the connection is forcibly closed and the device is removed from the active connections map. Redis liveness keys provide crash resilience:
device:{uuid}:alive  ->  "1"  TTL=60s
  • SET on connect, EXPIRE refreshed on each pong, DEL on disconnect
  • Auto-expires if the backend process crashes (no explicit cleanup needed)
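
The server-side bookkeeping can be sketched as follows. This is not the backend's code: `LivenessTracker` and the `TtlStore` interface are hypothetical, with `TtlStore` standing in for the Redis SET (with expiry), EXPIRE, and DEL commands so the example stays self-contained. TypeScript is used only for consistency with the client snippets on this page.

```typescript
const HEARTBEAT_TIMEOUT_MS = 90_000;
const REDIS_ALIVE_TTL_S = 60;

// Hypothetical stand-in for the Redis commands named above.
interface TtlStore {
  set(key: string, value: string, ttlSeconds: number): void;
  expire(key: string, ttlSeconds: number): void;
  del(key: string): void;
}

const aliveKey = (deviceId: string): string => `device:${deviceId}:alive`;

class LivenessTracker {
  private lastPong = new Map<string, number>();
  private readonly store: TtlStore;

  constructor(store: TtlStore) {
    this.store = store;
  }

  onConnect(deviceId: string, now: number): void {
    this.lastPong.set(deviceId, now);
    this.store.set(aliveKey(deviceId), "1", REDIS_ALIVE_TTL_S);
  }

  onPong(deviceId: string, now: number): void {
    if (!this.lastPong.has(deviceId)) return; // pong from an unknown connection
    this.lastPong.set(deviceId, now);
    this.store.expire(aliveKey(deviceId), REDIS_ALIVE_TTL_S); // refresh the TTL
  }

  onDisconnect(deviceId: string): void {
    this.lastPong.delete(deviceId);
    this.store.del(aliveKey(deviceId));
  }

  // Drops every device whose last pong is older than the timeout.
  // Returns the dropped device IDs so the caller can close their sockets.
  sweep(now: number): string[] {
    const zombies: string[] = [];
    this.lastPong.forEach((seenAt, deviceId) => {
      if (now - seenAt > HEARTBEAT_TIMEOUT_MS) zombies.push(deviceId);
    });
    for (const deviceId of zombies) this.onDisconnect(deviceId);
    return zombies;
  }
}
```

With pings every 30 seconds and a 60-second TTL, a healthy connection refreshes its key about twice per TTL window, so a single lost pong does not let the key expire.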

Client-Side: Ping Watchdog

Both the web app and plugin run a 45-second watchdog timer (1.5x the server’s ping interval).
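// Close the socket if no server traffic (including pings) arrives in time.
// `ws` is the client's existing WebSocket instance.
const WATCHDOG_TIMEOUT = 45_000; // 1.5x the server's 30s ping interval

// ReturnType<typeof setTimeout> works in both browsers and Node.js
let watchdog: ReturnType<typeof setTimeout> | undefined;

function resetWatchdog(): void {
  clearTimeout(watchdog);
  watchdog = setTimeout(() => {
    ws.close(4000, "Ping timeout");
    // Reconnect with exponential backoff
  }, WATCHDOG_TIMEOUT);
}

// Arm on open so a socket that never receives anything still times out
ws.onopen = () => resetWatchdog();
// Any inbound frame, including a server ping, proves the link is alive
ws.onmessage = () => resetWatchdog();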
When the watchdog fires, the client closes the WebSocket with code 4000 and triggers the standard reconnect flow with exponential backoff.

Phase 4: Push Notification Handling

iOS and Android

Push notifications carry the conversation_id in their data payload. Tapping a notification routes the user directly to the relevant chat.
  • Cold start: Notifications.getLastNotificationResponseAsync() checks for a pending notification on app launch
  • Background: Notifications.addNotificationResponseReceivedListener handles taps while the app is backgrounded

Foreground Display

When the app is in the foreground, push notifications are still displayed as alerts:
Notifications.setNotificationHandler({
  handleNotification: async () => ({
    shouldShowAlert: true,
    shouldPlaySound: true,
    shouldSetBadge: true,
  }),
});

Push Payload

Each push notification includes enriched metadata for reliable delivery tracking:

| Field | Value | Purpose |
| --- | --- | --- |
| sound | "default" | Audible alert |
| priority | "high" | Bypass battery optimization |
| channelId | "messages" | Android notification channel |
| message_id | UUID | Delivery tracking correlation |
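
As a rough illustration, the payload fields above and the tap-side lookup of conversation_id could be modeled like this. The shape is an assumption: the `to` field and the nesting of `message_id` and `conversation_id` under `data` follow the general Expo push message format, and `buildPushPayload` and `conversationIdFromTap` are hypothetical helper names, not AgentVault functions.

```typescript
// Assumed payload shape; the real backend payload may place fields differently.
interface PushPayload {
  to: string; // recipient push token
  sound: "default";
  priority: "high";
  channelId: "messages";
  data: { message_id: string; conversation_id: string };
}

function buildPushPayload(
  pushToken: string,
  messageId: string,
  conversationId: string,
): PushPayload {
  return {
    to: pushToken,
    sound: "default",
    priority: "high",
    channelId: "messages",
    data: { message_id: messageId, conversation_id: conversationId },
  };
}

// Client side: defensively read conversation_id from a tapped notification's data.
// Returns null when the payload is missing or malformed, so the app can fall
// back to its default screen instead of navigating to a broken route.
function conversationIdFromTap(data: unknown): string | null {
  if (typeof data !== "object" || data === null) return null;
  const id = (data as Record<string, unknown>)["conversation_id"];
  return typeof id === "string" && id.length > 0 ? id : null;
}
```

Note that the payload carries only identifiers, which fits the rule that push notifications alert the user but never inject message content into chat state.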

Agent Delivery Model

Agents connect via the @agentvault/agentvault npm plugin over WebSocket. There is no webhook delivery — agent gateways typically run locally without a public URL.
  1. Real-time delivery. The plugin maintains a persistent WebSocket connection. The server forwards ciphertext directly over the socket.
  2. History sync on reconnect. On reconnect, the plugin fetches missed messages via the history endpoint:
     GET /api/v1/devices/{id}/messages?since={last_seen_timestamp}
  3. ACK after decryption. After decrypting each message, the plugin batches ACKs and sends them back to the server.

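
The history sync step can be sketched with two small helpers. Both are hypothetical: `buildHistoryUrl` assumes the timestamp is passed through as an opaque string (its exact format is not specified here), and `selectUnseen` assumes the client de-duplicates by message ID, since the same message may arrive over both the live socket and a history fetch.

```typescript
// Builds the history endpoint URL for a device.
// `baseUrl` and the timestamp format are assumptions for illustration.
function buildHistoryUrl(baseUrl: string, deviceId: string, lastSeenTimestamp: string): string {
  const id = encodeURIComponent(deviceId);
  const since = encodeURIComponent(lastSeenTimestamp);
  return `${baseUrl}/api/v1/devices/${id}/messages?since=${since}`;
}

// Hypothetical minimal message shape.
interface HistoryMessage {
  id: string;
  ciphertext: string;
}

// Returns only messages not processed before and records their IDs,
// so overlap between real-time delivery and history sync is harmless.
function selectUnseen(seenIds: Set<string>, fetched: HistoryMessage[]): HistoryMessage[] {
  const fresh: HistoryMessage[] = [];
  for (const message of fetched) {
    if (seenIds.has(message.id)) continue;
    seenIds.add(message.id);
    fresh.push(message);
  }
  return fresh;
}
```

Each message returned by `selectUnseen` would then be decrypted and ACKed like any message received in real time.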
After updating the @agentvault/agentvault npm package, the OpenClaw gateway must be restarted (openclaw gateway restart) for new code to take effect. Node.js caches modules in memory at import time.

Reconnection Strategy

Both the web app and plugin use exponential backoff with jitter for reconnection.

| Attempt | Base Delay | Max Delay |
| --- | --- | --- |
| 1 | 1s | |
| 2 | 2s | |
| 3 | 4s | |
| 4 | 8s | |
| 5+ | 16s | 30s |

Additionally, the web app listens for visibilitychange events to detect when a browser tab returns to the foreground and immediately attempts reconnection, bypassing the backoff timer.
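
The backoff schedule can be sketched as a single delay function. The base delays and the 30-second cap come from the table; the jitter formula (a random amount between zero and the base delay, added on top) is an assumption, since the exact jitter strategy is not specified. `reconnectDelay` is a hypothetical name.

```typescript
const BASE_DELAYS_MS = [1_000, 2_000, 4_000, 8_000, 16_000];
const MAX_DELAY_MS = 30_000;

// Returns the wait before reconnect attempt `attempt` (1-based).
// `random` is injectable so the function can be tested deterministically.
function reconnectDelay(attempt: number, random: () => number = Math.random): number {
  // Attempts 5 and beyond all use the last base delay (16s).
  const index = Math.min(Math.max(attempt, 1), BASE_DELAYS_MS.length) - 1;
  const base = BASE_DELAYS_MS[index] ?? MAX_DELAY_MS;
  // Assumed jitter: up to one extra base delay, to spread out reconnect storms.
  const jitter = random() * base;
  return Math.min(base + jitter, MAX_DELAY_MS);
}
```

A reconnect loop would call `setTimeout(connect, reconnectDelay(attempt))` and reset `attempt` to 1 after a successful connection. A foreground `visibilitychange` handler can skip the delay by calling `connect()` directly.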
The plugin includes a wake detector that monitors gaps in the event loop. If no events are processed for more than 120 seconds (indicating the host machine was asleep or the LLM was running a long inference), the plugin immediately reconnects and performs a history sync to catch any missed messages.
The 120-second threshold is intentionally generous: LLM inference can block the Node.js event loop for 30-60 seconds, and a shorter threshold would trigger false wake detections.
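
A wake detector of this kind is commonly built from a periodic timer that compares wall-clock time between ticks. The sketch below shows that technique; `WakeDetector` is a hypothetical class and the plugin's real detector may work differently.

```typescript
const WAKE_THRESHOLD_MS = 120_000;

// Hypothetical sketch: detects long gaps between ticks of a periodic timer.
class WakeDetector {
  private lastTick: number;
  private readonly onWake: (gapMs: number) => void;

  constructor(now: number, onWake: (gapMs: number) => void) {
    this.lastTick = now;
    this.onWake = onWake;
  }

  // Call from a short periodic timer. If the process was suspended or the
  // event loop was blocked, the gap between ticks exceeds the threshold.
  tick(now: number): boolean {
    const gap = now - this.lastTick;
    this.lastTick = now;
    if (gap > WAKE_THRESHOLD_MS) {
      this.onWake(gap);
      return true;
    }
    return false;
  }
}
```

In use, something like `setInterval(() => detector.tick(Date.now()), 5_000)` would drive it, where the 5-second interval is an arbitrary choice, and `onWake` would trigger the reconnect and history sync. A 60-second stall from inference stays under the threshold and does not fire.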

Design Principles

These principles guided the reliability system’s design:
  1. Never suppress push based on WS state. Silent disconnections are common on mobile. Push is cheap; missed messages are not.
  2. ACK means decrypted, not received. The delivery receipt is sent only after the client successfully decrypts the ciphertext, not merely when the WebSocket frame arrives.
  3. Retry caps are essential. Without them, stale deliveries cause infinite escalation loops.
  4. Redundancy over elegance. It is better to deliver a message twice than to miss it entirely.