The Intelligent Edge Architecture
A walkthrough of our strategic plan to build a hyper-performant AI agent by combining a specialized knowledge base with state-of-the-art, "always-warm" models on Cloudflare's edge.
The Blueprint
An automated, resilient pipeline for transforming knowledge into intelligence.
1. Ingestion into R2
Our bespoke knowledge base (exported from Google Docs as PDFs) is uploaded to Cloudflare R2, our permanent, raw data store. This triggers the entire automated workflow.
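Here's a minimal sketch of this ingestion step as a Worker, assuming an R2 bucket bound as KNOWLEDGE_BASE in wrangler.toml (the binding name and route are illustrative, not part of Cloudflare's API):

```typescript
export interface Env {
  KNOWLEDGE_BASE: R2Bucket;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== "PUT") {
      return new Response("Use PUT to upload a document.", { status: 405 });
    }
    // Store the raw PDF in R2; AutoRAG watches the bucket and picks it up,
    // so this single put() kicks off the whole indexing workflow.
    const key = new URL(request.url).pathname.slice(1); // e.g. "docs/guide.pdf"
    await env.KNOWLEDGE_BASE.put(key, request.body, {
      httpMetadata: { contentType: "application/pdf" },
    });
    return new Response(`Stored ${key}; indexing will follow automatically.`);
  },
};
```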
2. AutoRAG Processing
Cloudflare's AutoRAG pipeline takes over. It extracts text, intelligently splits it into semantic chunks, and stores the plain text in our D1 database—the ground truth for our AI.
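AutoRAG's chunker is fully managed, so we never write this code ourselves. Purely as an illustration of what "semantic chunking" means in practice, here is a toy splitter that breaks a document on paragraph boundaries under a size cap:

```typescript
// Illustration only: AutoRAG's real chunker is managed by Cloudflare and not
// exposed. This toy version splits on paragraph breaks with a size cap to
// show the general idea of turning one document into retrievable chunks.
function chunkText(text: string, maxChars = 1000): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const para of text.split(/\n\s*\n/)) {
    if (current && current.length + para.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += para + "\n\n";
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```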
3. Vectorize Indexing
Each text chunk is converted into a numerical vector by our Embedding Model. These vectors are stored in Vectorize, creating a high-speed semantic search index.
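A sketch of this indexing step, assuming a Workers AI binding named AI and a Vectorize index bound as VECTOR_INDEX, created with 1024 dimensions to match bge-large-en-v1.5 (the binding names and metadata layout are our illustrative choices):

```typescript
export interface Env {
  AI: Ai;
  VECTOR_INDEX: VectorizeIndex;
}

async function indexChunks(env: Env, docId: string, chunks: string[]) {
  // One call can embed a whole batch; data[i] lines up with chunks[i].
  const { data } = await env.AI.run("@cf/baai/bge-large-en-v1.5", {
    text: chunks,
  });
  await env.VECTOR_INDEX.upsert(
    data.map((values, i) => ({
      id: `${docId}#${i}`,           // lets us trace a vector back to its chunk
      values,                        // 1024-dimensional embedding
      metadata: { docId, chunk: i }, // used later to fetch the plain text
    }))
  );
}
```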
The Two-Model Strategy
We use two distinct, specialized AI models. This is the core of our efficient and high-quality architecture.
The Librarian: Embedding Model
This model's only job is to understand meaning. It reads our knowledge chunks and user questions, then converts them into vectors. It's the "librarian" that knows where to find the most relevant information in our vast library instantly.
@cf/baai/bge-large-en-v1.5
We chose the 'large' version for its superior ability to capture the nuance of our specialized content, ensuring the highest quality search results.
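A minimal sketch of the Librarian at query time, assuming a Workers AI binding named AI. The same model embeds both chunks and questions, so the query vector lands in the same 1024-dimensional space as the index:

```typescript
// Embed a user question with the same model used for the knowledge chunks.
async function embedQuestion(env: { AI: Ai }, question: string): Promise<number[]> {
  const { data } = await env.AI.run("@cf/baai/bge-large-en-v1.5", {
    text: [question],
  });
  return data[0]; // 1024 numbers capturing the question's meaning
}
```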
The Synthesizer: Generation Model
This is the "brain" of the operation. After the Librarian finds the right information, this model's job is to read it, understand the user's original question, and synthesize a coherent, intelligent answer. It doesn't need to know everything; it just needs to be an expert reasoner.
@cf/meta/llama-3.1-8b-instruct-fast
We chose the highly efficient 8B model because it delivers strong, fast reasoning without the latency and cost overhead of a massive model.
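A minimal sketch of the Synthesizer step, assuming a Workers AI binding named AI and a `context` string already assembled from the retrieved chunks:

```typescript
// Ask the reasoning model to answer strictly from the retrieved context,
// using the standard Workers AI chat messages format.
async function synthesize(env: { AI: Ai }, question: string, context: string) {
  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct-fast", {
    messages: [
      { role: "system", content: "Answer using only the provided context." },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
  return result.response;
}
```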
Llama Showdown: 8B vs. 70B
Choosing the right reasoning engine is a strategic decision. The comparison below shows why.
| Feature | Llama 3.1 8B (Our Choice) | Llama 3.3 70B |
|---|---|---|
| Best For | High-speed, cost-effective, RAG-based reasoning. | Complex, multi-step logic and general knowledge tasks. |
| Performance | Extremely fast inference, ideal for real-time user interaction. | Slightly higher latency, but superior raw reasoning power. |
| Analogy | A brilliant specialist who can instantly synthesize an answer from provided notes. | A university professor who can reason deeply on any topic from first principles. |
| Our Use Case | We provide the notes (from D1), so we need the fast specialist. | Overkill for our needs, as we don't rely on its internal knowledge base. |
Our Winning Strategy
Efficiency over brute force. Our RAG architecture is smarter, not just bigger.
Why We Don't Need the 405B Behemoth
A common misconception is that bigger is always better. A model like Meta's 405-billion-parameter Llama 3.1 is a marvel of engineering, but using it for our task would be like using a sledgehammer to crack a nut. It's a generalist designed to know everything.
Our approach is more surgical. We don't need the AI to already know our niche subject. We need it to be an expert at understanding and synthesizing the precise, high-quality information we feed it in real-time.
By combining a high-quality knowledge base with a fast, "always-warm" reasoning engine, we achieve the answer quality of a much larger model at a fraction of the cost and latency.
Live RAG Simulator
See our architecture in action. Ask a question about our chosen models.
Ask a Question
Try asking: "Why is the 8B model a good choice?" or "What is quantization?"
1. Generate Query Embedding
The user's question is converted into a vector.
2. Semantic Search
The query vector is used to find the most relevant text chunks from our knowledge base.
3. Augment Prompt
The retrieved chunks are combined with the original question to form a detailed prompt for the LLM.
4. Generate Final Answer
The Synthesizer reads the augmented prompt and produces a coherent answer grounded in the retrieved context, as the end-to-end sketch below shows.
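To make the four steps concrete, here's an end-to-end sketch of the pipeline in a single Worker function. The bindings (AI, VECTOR_INDEX, DB) and the `chunks` table schema are our illustrative assumptions, not AutoRAG's internal layout; the query options follow the Vectorize v2 API:

```typescript
interface Env {
  AI: Ai;
  VECTOR_INDEX: VectorizeIndex;
  DB: D1Database;
}

async function answerQuestion(env: Env, question: string): Promise<string> {
  // 1. Generate the query embedding with the Librarian.
  const { data } = await env.AI.run("@cf/baai/bge-large-en-v1.5", {
    text: [question],
  });

  // 2. Semantic search: find the closest chunks in Vectorize.
  const { matches } = await env.VECTOR_INDEX.query(data[0], {
    topK: 3,
    returnMetadata: "all",
  });

  // 3. Augment: look up the plain text of each match in D1 and build a prompt.
  const rows = await Promise.all(
    matches.map((m) =>
      env.DB.prepare("SELECT text FROM chunks WHERE doc_id = ? AND idx = ?")
        .bind(m.metadata?.docId, m.metadata?.chunk)
        .first<{ text: string }>()
    )
  );
  const context = rows.map((r) => r?.text ?? "").join("\n---\n");

  // 4. Generate: the Synthesizer reasons over the retrieved notes.
  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct-fast", {
    messages: [
      { role: "system", content: "Answer using only the provided context." },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
  return result.response ?? "";
}
```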
AI Cost Calculator
Interactively explore the cost and performance trade-offs of different models.
The calculator takes an estimated daily request volume and reports three figures: total neuron cost, estimated cost in USD, and the percentage of the daily free tier consumed.
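The arithmetic behind the calculator is simple. The rates below are our assumptions based on Cloudflare's published Workers AI pricing at the time of writing ($0.011 per 1,000 neurons beyond a 10,000-neuron daily free allocation); check the current pricing page before relying on them:

```typescript
// Back-of-envelope version of the calculator above. Rates are assumptions
// from Cloudflare's published Workers AI pricing; verify before budgeting.
const USD_PER_1K_NEURONS = 0.011;    // assumed paid rate
const FREE_NEURONS_PER_DAY = 10_000; // assumed daily free allocation

function estimateDailyCost(neuronsPerRequest: number, requestsPerDay: number) {
  const neurons = neuronsPerRequest * requestsPerDay;
  const billable = Math.max(0, neurons - FREE_NEURONS_PER_DAY);
  return {
    neurons,
    usd: (billable / 1_000) * USD_PER_1K_NEURONS,
    freeTierUsedPct: Math.min(100, (neurons / FREE_NEURONS_PER_DAY) * 100),
  };
}

// e.g. 25 neurons per answer, 2,000 questions a day:
console.log(estimateDailyCost(25, 2_000)); // { neurons: 50000, usd: 0.44, freeTierUsedPct: 100 }
```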