Herald & Orra: A Rust-Native Approach to AI Agents That Actually Makes Sense
There's a category of AI library that I mentally file under "prompt engineering with extra steps." You get model abstraction, maybe some chain-of-thought helpers, perhaps a vector store wrapper — and then you're on your own the moment a second user shows up and their conversation starts leaking into someone else's context. Or the token count explodes and the library just errors out. Or you need to actually control which tools a specific user can call.
Orra starts from a different premise. Written in Rust, it's a library for embedding AI into existing applications — not for building standalone chatbot prototypes. It's designed around the problems that appear when you're running agents for real users in production: session isolation, context budget management, tool access policies, and lifecycle observability.
Herald is the reference implementation built on top of it — a self-hostable AI assistant with a web UI, Discord bot, cron scheduling, memory persistence, and multi-agent delegation. It's what orra looks like when you actually ship something.
The Problems Orra Is Actually Solving
The orra README opens with three questions that most libraries dodge entirely:
"How do I run this for multiple users without their conversations bleeding into each other? How do I control which tools each user can access? How do I keep context from blowing past the token limit?"
These aren't exotic edge cases. They're the first three walls you hit when you try to put an AI agent in front of real users. The fact that they're the stated design motivation is the most honest thing about the project.
Namespaced Sessions: Isolation Done Right
Orra's session model uses a hierarchical namespace system. Every conversation lives under
a path like tenant:acme:user:bob. Sessions are fully isolated. Policies
cascade from parent namespaces — so you can configure tool access at the organization
level and have it apply automatically to every user under that org.
This is a clean solution to a genuinely annoying problem. Most frameworks treat the session
store as an afterthought — a key-value map you thread through yourself. Orra makes it a
first-class structural primitive. The Namespace type expresses your tenant
hierarchy directly:
```rust
let ns = Namespace::new("tenant").child("acme").child("user").child("bob");
let result = runtime.run(&ns, Message::user("Hello!")).await?;
```
Access control at the namespace level means you can express "org A gets web search, org B doesn't" in the same place you express the hierarchy — not in a separate ACL table bolted on later.
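The README doesn't show the policy API itself, but the cascade is easy to picture: resolve a policy by walking from the most specific namespace segment up toward the root. A std-only sketch of that idea — `PolicyStore` and `tools_for` are hypothetical names for illustration, not orra's API:

```rust
use std::collections::HashMap;

/// Hypothetical policy store (not orra's API): maps a namespace path to
/// the tools it allows.
struct PolicyStore {
    allowed_tools: HashMap<String, Vec<String>>,
}

impl PolicyStore {
    /// Walk from the most specific path up to the root, so a policy set at
    /// "tenant:acme" applies to "tenant:acme:user:bob" automatically.
    fn tools_for(&self, namespace: &str) -> Option<&Vec<String>> {
        let mut path = namespace;
        loop {
            if let Some(tools) = self.allowed_tools.get(path) {
                return Some(tools);
            }
            match path.rfind(':') {
                Some(idx) => path = &path[..idx], // strip the last :segment
                None => return None,              // checked the root, give up
            }
        }
    }
}

fn main() {
    let mut allowed_tools = HashMap::new();
    // Org-level policy: everyone under tenant:acme gets web search.
    allowed_tools.insert("tenant:acme".to_string(), vec!["web_search".to_string()]);
    let store = PolicyStore { allowed_tools };

    // The user-level namespace inherits the org-level policy.
    assert_eq!(
        store.tools_for("tenant:acme:user:bob"),
        Some(&vec!["web_search".to_string()])
    );
}
```

The point of the walk-up lookup is that "org A gets web search" is one entry, not one entry per user.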
Token-Aware Context Management
The runtime tracks token budgets and auto-truncates conversation history when context gets too long. Crucially, it preserves the system prompt and the most recent messages — the parts that actually matter — rather than blindly chopping from the end or the beginning.
This matters more than it sounds. The naive approach is to let the context grow until the API returns a 400, then tell the developer to handle it. Orra treats the token budget as a first-class constraint that the runtime manages on your behalf. In herald's config, it's two lines:
```toml
[context]
max_tokens = 200000
reserved_for_output = 4096
```
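What "preserve the system prompt and the most recent messages" means mechanically can be sketched in a few lines of self-contained Rust. This illustrates the strategy, not orra's actual implementation, and it uses a crude chars/4 token estimate:

```rust
/// Crude token estimate: ~4 characters per token (a CharEstimator-style
/// heuristic; swap in a real tokenizer for exact counts).
fn estimate_tokens(text: &str) -> usize {
    text.chars().count() / 4 + 1
}

/// Keep the system prompt plus as many of the newest messages as fit.
/// `history[0]` is assumed to be the system prompt, the rest oldest-first.
fn truncate(history: &[String], max_tokens: usize) -> Vec<String> {
    let system = &history[0];
    let mut budget = max_tokens.saturating_sub(estimate_tokens(system));
    let mut kept: Vec<String> = Vec::new();
    // Walk newest-first so recent turns win over old ones.
    for msg in history[1..].iter().rev() {
        let cost = estimate_tokens(msg);
        if cost > budget {
            break; // everything older than this is dropped
        }
        budget -= cost;
        kept.push(msg.clone());
    }
    kept.push(system.clone());
    kept.reverse(); // chronological order: system prompt first
    kept
}

fn main() {
    let history = vec![
        "You are a helpful assistant.".to_string(),
        "user: a long older question that no longer fits".to_string(),
        "assistant: short answer".to_string(),
    ];
    let kept = truncate(&history, 16);
    // The system prompt and the most recent turn survive truncation.
    assert_eq!(kept, vec![history[0].clone(), history[2].clone()]);
}
```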
The CharEstimator in the quick-start example hints at something interesting:
the estimator is pluggable. If you need exact token counts for a specific model, you can
swap in a proper tokenizer without touching the rest of the runtime.
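In trait form, that pluggability looks something like the following sketch. The trait name and the word-based variant are illustrative stand-ins, not orra's actual types:

```rust
/// Illustrative trait name — orra's actual estimator abstraction may differ.
trait TokenEstimator {
    fn estimate(&self, text: &str) -> usize;
}

/// Cheap default: roughly four characters per token, rounded up.
struct CharEstimator;

impl TokenEstimator for CharEstimator {
    fn estimate(&self, text: &str) -> usize {
        (text.chars().count() + 3) / 4
    }
}

/// A model-specific tokenizer would implement the same trait, so the
/// runtime's budget logic never has to change. This word-based variant
/// is a made-up example (~0.75 tokens per word, rounded up).
struct WordEstimator;

impl TokenEstimator for WordEstimator {
    fn estimate(&self, text: &str) -> usize {
        (text.split_whitespace().count() * 3 + 3) / 4
    }
}

fn main() {
    // Swapping estimators changes only the arithmetic, not the interface.
    let estimators: Vec<Box<dyn TokenEstimator>> =
        vec![Box::new(CharEstimator), Box::new(WordEstimator)];
    for est in &estimators {
        assert!(est.estimate("hello world") > 0);
    }
}
```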
The Hook System: Lifecycle Observability Without Patching
This is the part that separates orra from most of its peers. The Hook trait
lets you intercept any point in the agent lifecycle — before and after LLM calls, before
and after tool execution, session load and save. Three hooks ship out of the box:
- hooks::logging — logs runtime activity and tracks token usage per call
- hooks::approval — gates tool execution on user approval, with a per-session "chaos mode" to auto-approve in development
- hooks::working_directory — injects a working directory into exec and claude_code tool calls from session metadata, so you can scope file operations per-user without the agent needing to know about it
The approval hook pattern is worth dwelling on. It communicates with the UI via a tokio
channel — your WebSocket handler receives an ApprovalRequest, asks the user,
and sends back a boolean via a oneshot channel. The agent loop blocks until it gets an answer.
No polling, no database rows, no special protocol. Just a channel.
```rust
let (tx, rx) = tokio::sync::mpsc::channel::<ApprovalRequest>(32);
hooks.register(Arc::new(ApprovalHook::new(tx)));
// Your WebSocket handler receives from rx, sends back true/false
```
That's clean design. The hook system doesn't mandate how you build your UI; it gives you a typed channel and gets out of the way.
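The same request/response shape can be demonstrated with std channels alone. This is a synchronous analogue of the pattern, with std::sync::mpsc and a worker thread standing in for tokio's async channels and the WebSocket handler:

```rust
use std::sync::mpsc;
use std::thread;

/// A tool-approval request: the tool name plus a channel on which the
/// agent side blocks waiting for a yes/no answer.
struct ApprovalRequest {
    tool_name: String,
    respond: mpsc::Sender<bool>,
}

fn main() {
    let (tx, rx) = mpsc::channel::<ApprovalRequest>();

    // Stand-in for the UI / WebSocket handler: approve everything except exec.
    let ui = thread::spawn(move || {
        for req in rx {
            let approved = req.tool_name != "exec";
            req.respond.send(approved).unwrap();
        }
    });

    // Agent side: send the request, then block until the "user" answers.
    let (answer_tx, answer_rx) = mpsc::channel();
    tx.send(ApprovalRequest {
        tool_name: "web_search".into(),
        respond: answer_tx,
    })
    .unwrap();
    assert!(answer_rx.recv().unwrap());

    drop(tx); // close the channel so the UI thread's loop ends
    ui.join().unwrap();
}
```

The blocking recv on the answer channel is exactly the "agent loop blocks until it gets an answer" behavior described above, minus the async machinery.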
Pluggable Everything: Traits All the Way Down
The architecture is genuinely trait-first. Provider wraps any LLM — Claude,
OpenAI, or any OpenAI-compatible endpoint (Ollama, vLLM, local servers). Tool
exposes operations. SessionStore handles persistence. There's even a
providers::dynamic module for hot-swappable providers at runtime.
The feature flag system is well-considered. You opt in to what you need:
- claude / openai — provider backends
- discord — Discord gateway channel and bot tools
- mcp — Model Context Protocol client for external tool servers
- documents — document knowledge store with TF-IDF search
- github — GitHub issue and PR tools
- claude-code — Claude Code CLI delegation
- web-fetch — web page fetching with HTML-to-text extraction
- web-search — Brave Search API tool
- browser — web page reading with readability extraction
- image-gen — DALL-E image generation
- auth — OAuth2 token management
- voice — TTS and STT traits
- parallel-tools — concurrent tool execution
- file-store — file-based session persistence
- gateway — HTTP/WebSocket gateway channel
You compile in exactly what you use. No dragging in the full dependency graph because someone thought image generation should be a core feature.
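In Cargo.toml terms, a minimal build might look like the fragment below. The crate name and default-feature behavior are assumptions based on the README, so check the published crate before copying:

```toml
[dependencies]
# Pull in only the Claude provider and the HTTP/WebSocket gateway;
# everything else (discord, voice, image-gen, ...) stays out of the build.
orra = { version = "0.0.2", default-features = false, features = ["claude", "gateway"] }
```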
MCP Support: The Right Extensibility Primitive
Model Context Protocol support is included as a first-class feature flag. This means you
can connect orra agents to any external MCP tool server — language servers, database
connectors, custom APIs — without implementing the Tool trait yourself.
It's the right call for extensibility: rather than inventing a plugin system, adopt
the protocol the ecosystem is standardizing on.
Herald: What Production Looks Like
Herald is where orra's abstractions get assembled into something you'd actually run. It's a self-hostable AI assistant configured via a single TOML file, and the design choices reflect real deployment experience.
Zero-config startup is a genuinely nice touch. If you have the Claude
CLI installed, Herald auto-detects your credentials from the system keychain — no API
key configuration required. Failing that, it falls back to ANTHROPIC_API_KEY,
and then to a web UI setup screen. The happy path is one command.
The tool surface is production-scoped. Shell execution (exec)
is disabled by default. When you enable it, you configure an explicit allowlist of
permitted commands. Claude Code delegation requires explicit configuration including
which tools the sub-agent can use. This is the right set of defaults for something
you're going to run persistently on a real machine.
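As a sketch of that shape — the exact key names here are hypothetical, and herald's README is the authority on the real ones:

```toml
[tools.exec]
enabled = true
# Hypothetical allowlist: anything not listed here is refused.
allowed_commands = ["ls", "cat", "git"]
```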
Discord integration has two modes: mentions (only respond
when @mentioned, the sane default) and all (respond to everything). The
namespace prefix for Discord sessions is configurable — you can run multiple Herald
instances against the same Discord server and keep their session spaces separate.
Multi-agent delegation is a first-class feature, not an afterthought.
The delegation tool lets the agent spawn independent sub-agents for complex
subtasks. Combined with claude_code delegation for coding work, you get a
capable orchestration setup out of the box.
The cron scheduler integrates with the agent runtime — scheduled tasks run in the same session context, with memory and tools available. This isn't just a crontab wrapper; it's AI-managed scheduling where the agent can introspect and reason about its own scheduled work.
What Sets It Apart
Most AI agent frameworks are built by people whose primary experience is Python and whose primary concern is getting an impressive demo working quickly. Orra reads like it was built by someone who had to maintain an agentic system through production incidents.
The session isolation model actually handles multi-tenancy. The token budget management actually handles long-running conversations. The hook system actually gives you observability without patching library internals. The tool access policies actually let you express per-user permissions without a separate ACL system.
None of these things are technically difficult. But most libraries don't do them, because the path of least resistance is to solve the demo problem, not the production problem. Orra solves the production problem and uses Rust to do it — which means you get memory safety, predictable performance, and a binary that doesn't require a Python virtualenv to run.
It's early (v0.0.2), the docs are sparse outside the README, and the surface area is large enough that there are plenty of rough edges still to polish. But the core design decisions are sound, and herald demonstrates that it actually composes into something useful. Worth watching.