The Problem
When a client’s WebSocket connection dies silently (iOS backgrounding, idle desktop, network switch), the backend may still consider the device “connected.” Without reliability measures, messages are stored but never delivered: the server believes the device received them, and push notifications are suppressed.
Architecture Overview
Phase 1: Push Always-On
Push notifications fire on every message regardless of WebSocket connection state.
A redundant push notification is harmless; a missed one is a delivery failure.
- Push notifications show an OS-level alert (service worker on web, Expo Push on native)
- They do not inject messages into chat state — that comes from WebSocket or history sync
- Duplicate alerts are a minor UX inconvenience; missed messages are a reliability failure
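The always-on rule can be illustrated with a minimal fan-out sketch. The function name, the device dictionaries, and the `send_ws` / `send_push` callables are hypothetical stand-ins, not AgentVault's actual code; the only point shown is that the push call is never gated on the WebSocket state.

```python
def dispatch_message(message_id, devices, send_ws, send_push):
    """Fan a stored message out to every recipient device (illustrative only)."""
    for device in devices:
        if device["ws_connected"]:
            # The socket may be a zombie, so this send is best-effort.
            send_ws(device["id"], message_id)
        # Push is unconditional: it is never skipped because ws_connected is True.
        send_push(device["id"], message_id)
```

A device that looks connected therefore receives both a WebSocket frame and a push alert, which matches the trade-off described above.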
Phase 2: Delivery Receipts (ACK System)
Every message creates a delivery tracking record for each recipient device. The ACK system closes the loop on whether a message was actually received and decrypted.
Delivery Flow
Message sent
After storing ciphertext in the `messages` table, the backend creates a `MessageDelivery` row for each recipient device (excluding the sender) with status `PENDING`.
Client receives and decrypts
After successful decryption, the client sends an ACK event over WebSocket.
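The first step of the flow might look like the sketch below. The dataclass fields and the `create_deliveries` helper are assumptions made for illustration; the real `MessageDelivery` model and its persistence layer are not shown in this document.

```python
from dataclasses import dataclass
from enum import Enum


class DeliveryStatus(str, Enum):
    PENDING = "pending"
    DELIVERED = "delivered"
    READ = "read"
    FAILED = "failed"


@dataclass
class MessageDelivery:
    # Hypothetical in-memory shape of a delivery tracking row.
    message_id: str
    device_id: str
    status: DeliveryStatus = DeliveryStatus.PENDING
    escalation_count: int = 0


def create_deliveries(message_id, sender_device_id, recipient_device_ids):
    """Build one PENDING row per recipient device, skipping the sender's device."""
    return [
        MessageDelivery(message_id=message_id, device_id=device_id)
        for device_id in recipient_device_ids
        if device_id != sender_device_id
    ]
```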
ACK Wire Protocol
After decrypting one or more messages, the client sends a batch ACK with the following limits:
| Behavior | Value |
|---|---|
| Max IDs per ACK | 50 |
| Client debounce | 500ms |
| Server processing | Fire-and-forget (asyncio.create_task) |
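As a rough sketch of the 50-ID limit, the helper below splits a list of decrypted message IDs into ACK payloads. The payload keys (`event`, `message_ids`) are hypothetical, since the actual wire format is not reproduced here; only the cap of 50 comes from the table. The 500ms debounce is left out and would sit in front of this call on the client.

```python
MAX_IDS_PER_ACK = 50  # limit taken from the table above


def build_ack_batches(message_ids, max_ids=MAX_IDS_PER_ACK):
    """Split decrypted message IDs into ACK payloads of at most max_ids each."""
    # De-duplicate while keeping first-seen order, so an ID is ACKed only once.
    unique_ids = list(dict.fromkeys(message_ids))
    return [
        # Hypothetical payload shape, for illustration only.
        {"event": "ack", "message_ids": unique_ids[i:i + max_ids]}
        for i in range(0, len(unique_ids), max_ids)
    ]
```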
Delivery Statuses
| Status | Meaning |
|---|---|
| `pending` | Message stored, delivery not yet confirmed |
| `delivered` | Client sent ACK after successful decryption |
| `read` | Client confirmed the message was displayed (future) |
| `failed` | Delivery failed after 3 escalation attempts |
Escalation Task
A background task runs every 60 seconds to catch undelivered messages.
| Parameter | Value | Description |
|---|---|---|
| Interval | 60s | How often the escalation loop runs |
| Stale threshold | 120s | Deliveries older than 2 minutes are considered stale |
| Max per cycle | 100 | Bounds work per loop iteration |
| Max escalations | 3 | Retries before marking as FAILED |
- Queries deliveries where `status = 'pending'`, `created_at < 2 minutes ago`, and `escalation_count < 3`
- Re-triggers push notification for owner devices
- Increments `escalation_count` on each attempt
- Marks delivery as `FAILED` after 3 unsuccessful escalations
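One pass of the escalation loop could be sketched as follows. The `Delivery` dataclass and `send_push` callable are hypothetical, and the sketch assumes a delivery is marked failed on the first cycle after its third push went unacknowledged; the source does not say exactly when that transition happens.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

STALE_THRESHOLD = timedelta(seconds=120)
MAX_PER_CYCLE = 100
MAX_ESCALATIONS = 3


@dataclass
class Delivery:
    # Hypothetical in-memory stand-in for a delivery tracking row.
    message_id: str
    device_id: str
    created_at: datetime
    status: str = "pending"
    escalation_count: int = 0


def run_escalation_cycle(deliveries, send_push, now=None):
    """Re-push stale pending deliveries and fail those that exhausted retries."""
    now = now or datetime.now(timezone.utc)
    stale = [
        d for d in deliveries
        if d.status == "pending" and now - d.created_at > STALE_THRESHOLD
    ]
    escalated = []
    for delivery in stale[:MAX_PER_CYCLE]:  # bound the work per iteration
        if delivery.escalation_count >= MAX_ESCALATIONS:
            # Assumption: three pushes went unacknowledged, so give up.
            delivery.status = "failed"
            continue
        send_push(delivery)
        delivery.escalation_count += 1
        escalated.append(delivery)
    return escalated
```

In a real backend this function would be called from a loop that sleeps for the 60-second interval between passes.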
Phase 3: Connection Liveness
AgentVault detects zombie WebSocket connections through a two-sided heartbeat system.
Server-Side: Pong Timeout + Redis TTL
| Constant | Value | Purpose |
|---|---|---|
| `HEARTBEAT_INTERVAL` | 30s | How often the server sends a ping |
| `HEARTBEAT_TIMEOUT` | 90s | Max time without a pong before disconnect |
| `REDIS_ALIVE_TTL` | 60s | Redis key auto-expiry |
- The Redis alive key is SET on connect, refreshed with EXPIRE on each pong, and removed with DEL on disconnect
- The key auto-expires if the backend process crashes (no explicit cleanup needed)
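The interaction between the pong timeout and the expiring key can be sketched without a Redis server. The `LivenessTracker` class below is a hypothetical in-memory stand-in that mimics SET with a TTL, EXPIRE, and DEL using plain dictionaries; only the three constants come from the table above.

```python
HEARTBEAT_INTERVAL = 30.0  # seconds between server pings
HEARTBEAT_TIMEOUT = 90.0   # seconds without a pong before disconnect
REDIS_ALIVE_TTL = 60.0     # seconds until the alive key expires


class LivenessTracker:
    """In-memory imitation of the Redis alive key plus the pong timeout check."""

    def __init__(self):
        self._expires_at = {}  # device_id -> expiry time (mimics a key TTL)
        self._last_pong = {}   # device_id -> time of the last pong

    def on_connect(self, device_id, now):
        self._expires_at[device_id] = now + REDIS_ALIVE_TTL  # like SET with a TTL
        self._last_pong[device_id] = now

    def on_pong(self, device_id, now):
        self._expires_at[device_id] = now + REDIS_ALIVE_TTL  # like EXPIRE
        self._last_pong[device_id] = now

    def on_disconnect(self, device_id):
        self._expires_at.pop(device_id, None)  # like DEL
        self._last_pong.pop(device_id, None)

    def is_alive(self, device_id, now):
        expiry = self._expires_at.get(device_id)
        return expiry is not None and now < expiry

    def should_disconnect(self, device_id, now):
        last = self._last_pong.get(device_id)
        return last is not None and now - last > HEARTBEAT_TIMEOUT
```

Because the key simply lapses when nothing refreshes it, a crashed process needs no cleanup step, which is the property the second bullet describes.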
Client-Side: Ping Watchdog
Both the web app and plugin run a 45-second watchdog timer (1.5x the server’s ping interval). If no ping arrives within that window, the client closes the socket with code 4000 and triggers the standard reconnect flow with exponential backoff.
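A watchdog of this kind can be sketched as below. The class and its `close` callback are hypothetical, and the sketch assumes that `4000` is the WebSocket close code used when the watchdog fires. The clock is injectable so the logic can be exercised without waiting 45 seconds.

```python
import time

PING_WATCHDOG_SECONDS = 45.0  # 1.5x the server's 30s ping interval
WATCHDOG_CLOSE_CODE = 4000    # assumed to be the close code sent on timeout


class PingWatchdog:
    def __init__(self, close, timeout=PING_WATCHDOG_SECONDS, clock=time.monotonic):
        self._close = close      # callable that closes the socket with a code
        self._timeout = timeout
        self._clock = clock
        self._last_ping = clock()
        self.tripped = False

    def on_ping(self):
        """Record that a server ping just arrived."""
        self._last_ping = self._clock()

    def check(self):
        """Call periodically; closes the socket once if pings have stopped."""
        if not self.tripped and self._clock() - self._last_ping > self._timeout:
            self.tripped = True
            self._close(WATCHDOG_CLOSE_CODE)
        return self.tripped
```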
Phase 4: Push Notification Handling
iOS and Android
Push notifications carry the `conversation_id` in their data payload. Tapping a notification routes the user directly to the relevant chat.
- Cold start: `Notifications.getLastNotificationResponseAsync()` checks for a pending notification on app launch
- Background: `Notifications.addNotificationResponseReceivedListener` handles taps while the app is backgrounded
Foreground Display
When the app is in the foreground, push notifications are still displayed as alerts.
Push Payload
Each push notification includes enriched metadata for reliable delivery tracking:
| Field | Value | Purpose |
|---|---|---|
| `sound` | "default" | Audible alert |
| `priority` | "high" | Bypass battery optimization |
| `channelId` | "messages" | Android notification channel |
| `message_id` | UUID | Delivery tracking correlation |
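A server-side helper that assembles these fields might look like the sketch below. The function name is hypothetical, and placing `message_id` next to `conversation_id` inside `data` is an assumption; the table does not say where that field sits in the message.

```python
def build_push_message(push_token, conversation_id, message_id):
    """Assemble a push message dict carrying the fields from the table above."""
    return {
        "to": push_token,
        "sound": "default",       # audible alert
        "priority": "high",       # bypass battery optimization
        "channelId": "messages",  # Android notification channel
        # Assumed placement: both IDs travel in the data payload.
        "data": {
            "conversation_id": conversation_id,
            "message_id": message_id,
        },
    }
```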
Agent Delivery Model
Agents connect via the `@agentvault/agentvault` npm plugin over WebSocket. There is no webhook delivery, because agent gateways typically run locally without a public URL.
Real-time delivery
The plugin maintains a persistent WebSocket connection. The server forwards ciphertext
directly over the socket.
History sync on reconnect
On reconnect, the plugin fetches missed messages via the history endpoint.
After updating the `@agentvault/agentvault` npm package, the OpenClaw gateway must be restarted (`openclaw gateway restart`) for new code to take effect. Node.js caches modules in memory at import time.
Reconnection Strategy
Both the web app and plugin use exponential backoff with jitter for reconnection.
| Attempt | Base Delay | Max Delay |
|---|---|---|
| 1 | 1s | — |
| 2 | 2s | — |
| 3 | 4s | — |
| 4 | 8s | — |
| 5+ | 16s | 30s |
The web app also listens for `visibilitychange` events to detect when a browser tab returns to the foreground and immediately attempts reconnection, bypassing the backoff timer.
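The delay schedule in the table can be reproduced with a small function. The base delays and the 30-second cap come from the table; the jitter formula (a random extra of up to 100% of the base) is an assumption, since the exact jitter rule is not documented here.

```python
import random

BASE_DELAY_SECONDS = 1.0
MAX_DELAY_SECONDS = 30.0


def reconnect_delay(attempt, rng=random.random):
    """Delay in seconds before reconnect attempt N (1-based)."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    # The base doubles per attempt and stops growing at attempt 5: 1, 2, 4, 8, 16.
    base = BASE_DELAY_SECONDS * 2 ** (min(attempt, 5) - 1)
    # Assumed jitter: add a random amount between 0 and the base itself.
    jitter = base * rng()
    return min(base + jitter, MAX_DELAY_SECONDS)
```

Passing a fixed `rng` makes the schedule deterministic for testing, while the default spreads reconnecting clients apart so they do not all retry at the same instant.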
Wake Detection
The plugin includes a wake detector that monitors gaps in the event loop. If no events are processed for more than 120 seconds (indicating the host machine was asleep or the LLM was running a long inference), the plugin immediately reconnects and performs a history sync to catch any missed messages.
The 120-second threshold is intentionally generous: LLM inference can block the Node.js event loop for 30-60 seconds, and a shorter threshold would trigger false wake detections.
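The gap check itself is simple enough to sketch. The `WakeDetector` class and its `on_wake` callback are hypothetical; in the plugin, the callback would start the reconnect and history sync. The sketch defaults to wall-clock time because monotonic clocks may not advance while the host is suspended on some platforms.

```python
import time

WAKE_GAP_SECONDS = 120.0


class WakeDetector:
    def __init__(self, on_wake, threshold=WAKE_GAP_SECONDS, clock=time.time):
        self._on_wake = on_wake      # called with the observed gap in seconds
        self._threshold = threshold
        self._clock = clock          # wall clock, so time asleep is counted
        self._last_tick = clock()

    def tick(self):
        """Call on a short timer; reports a wake if the gap exceeds the threshold."""
        now = self._clock()
        gap = now - self._last_tick
        self._last_tick = now
        if gap > self._threshold:
            self._on_wake(gap)
            return True
        return False
```

A 60-second stall from a long inference stays under the threshold and is ignored, while a laptop that slept for ten minutes produces a single wake event on the next tick.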
Design Principles
These principles guided the reliability system’s design:
- Never suppress push based on WS state. Silent disconnections are common on mobile. Push is cheap; missed messages are not.
- ACK means decrypted, not received. The delivery receipt is sent only after the client successfully decrypts the ciphertext, not merely when the WebSocket frame arrives.
- Retry caps are essential. Without them, stale deliveries cause infinite escalation loops.
- Redundancy over elegance. It is better to deliver a message twice than to miss it entirely.