Building an AI Agent on Cloudflare
This article is also available in Simplified Chinese.
On New Year's Eve 2026, while fireworks lit up the sky outside, a thought struck me: could my nearly ten-year-old blog finally get a genuinely useful AI assistant?
Useful — meaning it can't just be a chatbot that rambles about anything. It needs to be an AI that truly understands me and my content. It should know every article I've written, understand my views, positions, and experiences, and be able to find the most relevant content from a knowledge base in response to a visitor's question — then synthesize a natural, meaningful answer from that context.
It should be a living, breathing personal calling card.
This isn't a particularly complex requirement, and the open-source tools and commercial infrastructure are already mature. But once I actually started building, I hit plenty of pitfalls, took a lot of wrong turns, and scrapped more than a few "elegant-looking" approaches. This article is the complete record of Surmon.me's AI Agent — from the initial idea to a fully working implementation.
Breaking Down the Requirements
Within this blog ecosystem, I divided the AI capabilities into two parts:
- Content generation for admins. Mainly: generating article summaries, writing article reviews, and auto-replying to user comments.
- Intelligent conversation for frontend users. Users should be able to get most of the information that exists on the site through an Agent window — not just articles, but also personal bios from static pages, social updates, community activity, and so on.
The admin-side AI capabilities are essentially tool invocation at their core. Input an article, output a summary or review — short context, defined inputs and outputs, no state storage required. Routing this through the Cloudflare AI Gateway API to reach the LLM is sufficient, and integrating it directly into NodePress (the blog's backend service) is the most natural approach.
The frontend user-facing AI conversation is a completely different scenario: it needs a RAG knowledge base, persistent conversation history, rate limiting, and an admin interface to view all users' chat records. The infrastructure involved is entirely different.
So I split it into two projects:
- NodePress AI Assistant: Integrated directly into NodePress, accessing Gemini / DeepSeek indirectly via Cloudflare AI Gateway, handling summary generation, article reviews, and comment auto-replies for admins. The key characteristic: short context, stateless — once the API call completes, the job is done.
- Surmon.me AI Service: A standalone AI Agent service focused entirely on intelligent conversation for frontend users. Full-site article data is vectorized via RAG for the Agent to retrieve, a set of tools is integrated, HTTP streaming responses are supported, conversation history is persisted to a database, and an admin interface is provided for conversation management.
The benefit of this split is clear: the two parts have no relationship with each other, each iterates independently, zero coupling. Surmon.me AI Service is an AI Agent application that serves only frontend user interactions. NodePress remains the content management system that serves admins. There is no intertwining of authentication or business logic between them.
Implementing the NodePress AI Assistant
Integrating an AI request service based on Cloudflare AI Gateway directly into NodePress is all it takes. Usage and logs are available in the AI Gateway dashboard.
NodePress endpoints:
- `/ai/generate-article-summary`: generate an article summary (input: full article + prompt)
- `/ai/generate-article-review`: generate an article review (input: full article + prompt)
- `/ai/generate-comment-reply`: reply to a user comment (input: article summary or excerpt + comment context + prompt)
- `/ai/config`: fetch the preset models / prompts config; the frontend can override and cache locally.
This part is straightforward — the server is stateless, logging and ops are fully delegated to AI Gateway, and rate limiting isn't even needed. The code lives in the AI module of the NodePress project.
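To make the shape of these endpoints concrete, here is a minimal sketch of routing one such call through AI Gateway from Node.js. The environment variable names, the DeepSeek provider path, and the handler itself are assumptions for illustration, not the actual NodePress code; check the AI Gateway documentation for your provider's exact route.

```typescript
// Minimal sketch: proxying an OpenAI-compatible chat completion through
// Cloudflare AI Gateway. Env var names and the provider path are assumptions.
const GATEWAY_BASE = `https://gateway.ai.cloudflare.com/v1/${process.env.CF_ACCOUNT_ID}/${process.env.CF_GATEWAY_ID}`

export async function generateArticleSummary(article: string, prompt: string): Promise<string> {
  // DeepSeek exposes an OpenAI-compatible API; AI Gateway routes it by provider slug.
  const response = await fetch(`${GATEWAY_BASE}/deepseek/chat/completions`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.DEEPSEEK_API_KEY}`
    },
    body: JSON.stringify({
      model: 'deepseek-chat',
      messages: [
        { role: 'system', content: prompt },
        { role: 'user', content: article }
      ]
    })
  })
  const data = (await response.json()) as { choices: { message: { content: string } }[] }
  return data.choices[0].message.content
}
```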
Here's what the final result looks like:
Building the RAG Knowledge Base
The core capability of an AI Agent is RAG search — it's also the primary knowledge source for answering questions. To implement RAG, the first question is: where does the knowledge base data come from? And how do data cleaning and vectorized storage get handled?
Simple Approach: Keyword Search as a RAG Stand-in
If you're cost-conscious and want to move fast, there's a simpler approach: use Algolia plus LLM-driven keyword decomposition to simulate RAG.
Traditional web systems either support keyword search natively (like NodePress) or integrate a third-party search engine like Algolia. Once a user's question reaches the LLM, the model can be instructed to extract explicit keywords when calling a tool function. The flow looks roughly like this:
- User asks: Has the author written anything about Vue's reactivity system?
- LLM decomposes this into: `["Vue", "reactivity", "reactivity system", "响应式"]`
- Multi-keyword queries run against Algolia or the system's built-in search.
- The resulting snippets are passed back to the LLM to synthesize a final answer.
The keyword decomposition step matters — you can't just throw the user's natural language directly at Algolia. Traditional search engines match text fragments; they can't understand semantic intent. But for simple scenarios, this is good enough.
This approach is highly cost-effective. In a well-structured traditional web system, keyword coverage tends to be much higher than in general-purpose scenarios, and semantic drift is less painful than you'd expect. The minimum implementation cost is a single API endpoint that calls an LLM; that alone gives you single-turn intelligent conversation (see the sketch below).
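A rough sketch of that flow, with hypothetical helpers (`extractKeywords`, `keywordSearch`, `synthesizeAnswer`) standing in for the LLM keyword-extraction call, the Algolia / built-in search query, and the final answer synthesis:

```typescript
// Hypothetical helpers: an LLM call with a keyword-extraction prompt, a query
// against Algolia / the system's built-in search, and a final LLM synthesis call.
declare function extractKeywords(question: string): Promise<string[]>
declare function keywordSearch(keyword: string): Promise<string[]>
declare function synthesizeAnswer(question: string, snippets: string[]): Promise<string>

async function answerWithKeywordSearch(question: string): Promise<string> {
  // 1. Have the LLM decompose the natural-language question into explicit keywords
  const keywords = await extractKeywords(question) // e.g. ["Vue", "reactivity", "响应式"]

  // 2. Query the traditional search engine with each keyword and merge the snippets
  const results = await Promise.all(keywords.map((kw) => keywordSearch(kw)))
  const snippets = [...new Set(results.flat())].slice(0, 10)

  // 3. Pass the snippets back to the LLM to synthesize the final answer
  return synthesizeAnswer(question, snippets)
}
```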
If your scenario is simple enough, starting here is entirely reasonable. But be clear about its limitations: vector RAG excels at semantic understanding — synonyms, near-synonyms, cross-language queries, and fuzzy intent all get handled naturally. Keyword search is simpler and lower-latency, but synonym coverage depends on search engine configuration, and cross-language queries are essentially a non-starter.
If you want high-quality question answering, you'll eventually need a RAG vector database.
Standard Approach: Conventional RAG
The ideal RAG workflow is: obtain structured raw data → clean it → embed it and store it in a vector database.
There are many mature products and platforms available. Weighing operational overhead, stability, and cost-effectiveness, I landed on Cloudflare AI Search. It's an integrated wrapper over several Cloudflare primitives — raw data gets vectorized via an embedding model and stored in Vectorize (a vector database running on Cloudflare's global network), then Workers can access the RAG service directly via env.AI.search() or the REST API. The entire pipeline stays within the Cloudflare ecosystem.
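As a minimal sketch of the retrieval step inside a Worker (the binding name and the search call below simply follow the env.AI.search() form described above; the result shape is an assumption, so treat this as typed pseudocode rather than the finished service):

```typescript
// Typed pseudocode for RAG retrieval inside a Worker. The exact binding
// method and result shape depend on the AI Search instance configuration.
interface Env {
  AI: {
    search(params: { query: string }): Promise<{ data: { content: string }[] }>
  }
}

export async function retrieveContext(env: Env, question: string): Promise<string> {
  // Query the AI Search index built from the R2 bucket of article Markdown files
  const result = await env.AI.search({ query: question })
  // Join the recalled chunks into one context block for the LLM prompt
  return result.data.map((chunk) => chunk.content).join('\n\n---\n\n')
}
```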
AI Search supports two data source types: crawler (Sitemap/Crawler) and R2 bucket.
I started with the crawler approach — just provide a sitemap URL and it automatically fetches and vectorizes the whole site. Simple. But after testing for a while, I hit a critical problem: the crawler fetches HTML and converts it to Markdown, and it can only capture above-the-fold content.
What does that mean in practice? Some of my longer articles run tens of thousands of words. The frontend uses progressive rendering for long content, so the crawler only captures the first few thousand characters. Worse, it can't distinguish between body text and other UI elements — related article links, AI Review blocks, and so on all get mixed in and pushed into the vector database as noise.
This noise directly pollutes the embedding vector space, meaning a user's question might surface irrelevant non-body fragments in the recall results. It's not catastrophic, but if you want the highest possible answer quality, this approach falls short.
After switching to the R2 bucket approach, these problems disappeared:
- 100% content control: I maintain the Markdown file for each article myself — no UI noise, only the core content.
- No length ceiling: Full long-form articles go in as-is; AI Search handles chunking internally based on the configured chunk size.
- Structured metadata: Via Markdown Frontmatter, each article can carry tags, publish dates, and other metadata — giving the model more structured context to work with during retrieval.
Data in R2 is organized one file per article, named article-<id>.md. The file structure looks roughly like:
```markdown
---
id: article ID
title: "Article Title"
summary: "Article summary"
categories: ["Category One", "Category Two"]
tags: ["Tag One", "Tag Two"]
date: "Publication date"
url: "Article URL"
---

# Article Title

Article body…
```
The same R2 bucket also holds static files like /static/author_info.md, which may contain the author's basic info, site declarations, or other low-churn data. This content gets injected directly into each conversation's System Prompt. (These static files need to be excluded from RAG indexing in the AI Search configuration.)
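For illustration, a sketch of injecting that static file into the System Prompt; the R2 binding name (`AI_BUCKET`) and the base instruction text are placeholders, not the actual service's prompt.

```typescript
// Sketch: read low-churn author/site info from R2 and prepend it to every
// conversation's System Prompt. AI_BUCKET is an assumed binding name.
interface Env {
  AI_BUCKET: R2Bucket // R2 binding (types from @cloudflare/workers-types)
}

export async function buildSystemPrompt(env: Env): Promise<string> {
  const object = await env.AI_BUCKET.get('static/author_info.md')
  const authorInfo = object ? await object.text() : ''
  // Placeholder base instruction; the real prompt is maintained separately.
  return ['You are the AI assistant for Surmon.me.', authorInfo].join('\n\n')
}
```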
I deliberately exclude comment data from the RAG knowledge base. RAG should only contain content the author produced — user comments should be fetched on demand via tool calls.
Recall quality in the RAG knowledge base can be tested in the Cloudflare AI Search product's Playground — simple and easy to use.
Webhook-Driven Knowledge Base Sync
With the knowledge base in place, the next question: how do article updates get synced to R2?
I briefly considered adding a "manual sync" button to the admin panel — but that's inelegant, and things inevitably get forgotten. I also considered having the admin panel call an AI service endpoint on every article publish, but that would create direct coupling between the admin panel and the AI service in terms of communication and authentication.
Is there a more elegant approach? One where neither side depends on the other, and updates happen automatically without any manual intervention?
Yes. The solution I designed: NodePress notifies the AI service via webhook.
The flow: when NodePress creates, updates, or deletes an article, or when key site configuration changes, it sends a webhook request signed with HMAC-SHA256 to the AI service. The AI service verifies the signature (with 5-minute replay protection), then directly consumes the latest data from the request payload, generates the corresponding Markdown file, and writes it to R2. Once R2 changes, AI Search automatically handles incremental indexing.
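Here is a sketch of the receiving side of that webhook in the Worker, for illustration only: the header names, the timestamp-plus-body signing string, the payload shape, and `renderArticleMarkdown()` are all assumptions standing in for the real implementation.

```typescript
// Sketch: verify an HMAC-SHA256 signed webhook (with replay protection),
// then regenerate the article's Markdown file and write it to R2.
interface Env {
  WEBHOOK_SECRET: string
  AI_BUCKET: R2Bucket
}

// Hypothetical Frontmatter + body renderer for the article payload
declare function renderArticleMarkdown(article: { id: number }): string

const MAX_SKEW_MS = 5 * 60 * 1000 // reject requests older than 5 minutes

async function hmacSha256Hex(secret: string, message: string): Promise<string> {
  const key = await crypto.subtle.importKey(
    'raw',
    new TextEncoder().encode(secret),
    { name: 'HMAC', hash: 'SHA-256' },
    false,
    ['sign']
  )
  const signature = await crypto.subtle.sign('HMAC', key, new TextEncoder().encode(message))
  return [...new Uint8Array(signature)].map((b) => b.toString(16).padStart(2, '0')).join('')
}

export async function handleWebhook(request: Request, env: Env): Promise<Response> {
  const timestamp = request.headers.get('x-webhook-timestamp') ?? ''
  const signature = request.headers.get('x-webhook-signature') ?? ''
  const body = await request.text()

  // Replay protection: reject stale or missing timestamps (assumed ms epoch)
  if (!timestamp || Math.abs(Date.now() - Number(timestamp)) > MAX_SKEW_MS) {
    return new Response('Stale webhook', { status: 401 })
  }
  // Verify the HMAC-SHA256 signature over `${timestamp}.${body}`
  const expected = await hmacSha256Hex(env.WEBHOOK_SECRET, `${timestamp}.${body}`)
  if (expected !== signature) {
    return new Response('Invalid signature', { status: 401 })
  }

  // Consume the payload directly and write the regenerated Markdown to R2;
  // AI Search picks up the change through incremental indexing.
  const article = JSON.parse(body) as { id: number }
  await env.AI_BUCKET.put(`article-${article.id}.md`, renderArticleMarkdown(article))
  return new Response('ok')
}
```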
This design has several advantages: NodePress doesn't need to know R2 exists — it just fires events. The AI service has zero dependency on NodePress. AI tasks are async and don't affect NodePress's main process at all. Even if an admin publishes directly via API, the webhook fires as expected, with no sync gaps.
The full knowledge base data flow is complete: admins do normal CRUD on blog content upstream, and all changes automatically flow into the RAG knowledge base in the background — no manual maintenance required.
With the RAG architecture in place, the core Agent logic became the focus: use a framework? Which one? Where to store data? What storage type — KV or a database?
While I was puzzling over this, Cloudflare Agents SDK caught my eye.
Pitfall One: Cloudflare Agents SDK
Bottom line up front: Cloudflare Agents SDK looks great and the name sounds impressive, but it's not the right fit for most AI Agent applications.
Before writing any code for the user-facing conversation, I spent time studying Cloudflare's official Agents SDK .
The Agents SDK is built on Durable Objects — a Cloudflare primitive that's genuinely interesting: a persistent JS runtime object, with a built-in micro SQLite database, deployed at the edge, with native support for WebSockets, state persistence, and lifecycle management.
In short: a globally unique, stateful Serverless Actor, where writing the JS class is effectively defining the data. The storage structure and logic live in the class itself; developers write business code directly without worrying about any infrastructure.
AIChatAgent is a third-layer abstraction built on top of the Agents SDK, specifically for AI chat. Because it's backed by Durable Objects, it naturally supports:
- Automatic message persistence (no need to define tables, no need to write D1 code)
- Streaming recovery after client disconnects
- Multi-client WebSocket broadcast sync
- Tool system (server tool / client tool / approval tool)
Looking at that feature list alone, it seems powerful, comprehensive, and perfectly tailored. Then I looked closely at the Durable Object design philosophy. The core assumption of Durable Objects is: one DO instance = one isolated data island (Data Isolation).
Under the Cloudflare Agents architecture, each user's Agent instance is essentially an independent micro-server with its own private micro SQLite database. With 1,000 users, there are 1,000 completely separate databases under the hood — not one central database holding 1,000 users' records.
This is elegant for "real-time multi-user collaboration" scenarios. But my use case doesn't need any of that — I have one conversation window, a one-to-many AI chat relationship, and no interaction needed between users.
The fatal problem: I need admins to be able to view all users' conversation records. Under the DO architecture, doing this would require simultaneously waking up 1,000 DO instances, sending each one an RPC request to pull data into memory, then assembling everything. That's a classic antipattern — completely unworkable.
Final verdict: my needs don't fit Agents SDK. What I need is the traditional Workers + D1 centralized database architecture.
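For contrast with the DO antipattern above, a sketch of what the admin view costs under a centralized D1 database: one SQL query over one table. The table and column names are illustrative, not the actual schema.

```typescript
// Sketch: with one central D1 database, "view all users' conversations"
// is a single paged query instead of waking up N Durable Objects.
interface Env {
  DB: D1Database // D1 binding (types from @cloudflare/workers-types)
}

export async function listAllMessages(env: Env, page = 1, perPage = 50) {
  const { results } = await env.DB.prepare(
    `SELECT session_id, user_id, role, content, created_at
       FROM messages
      ORDER BY created_at DESC
      LIMIT ? OFFSET ?`
  )
    .bind(perPage, (page - 1) * perPage)
    .all()
  return results
}
```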
This is the first lesson from this project: an architecturally elegant solution doesn't mean it's appropriate for the business scenario. Durable Objects aren't a "premium architecture" — they're a tool for specific scenarios. Blunt, straightforward centralized CRUD is the right answer for my needs.




