Project: Sentinel

The Intelligent Edge Architecture

A walkthrough of our strategic plan to build a hyper-performant AI agent by combining a specialized knowledge base with state-of-the-art, "always-warm" models on Cloudflare's edge.


The Blueprint

An automated, resilient pipeline for transforming knowledge into intelligence.

1. Ingestion into R2

Our bespoke knowledge base (exported from Google Docs as PDFs) is uploaded to Cloudflare R2, our permanent, raw data store. This triggers the entire automated workflow.
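
In practice, the upload can be as simple as a PUT from a Worker with an R2 binding. A minimal sketch, assuming a hypothetical bucket binding named KNOWLEDGE_BASE configured in wrangler.toml:

```ts
// Ingestion sketch: store an uploaded PDF in R2. The KNOWLEDGE_BASE binding
// name is an assumption for this example; the upload itself is what kicks
// off the downstream indexing.
export interface Env {
  KNOWLEDGE_BASE: R2Bucket; // type from @cloudflare/workers-types
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== "PUT" || !request.body) {
      return new Response("PUT a PDF body to /<key>", { status: 400 });
    }
    const key = new URL(request.url).pathname.slice(1); // e.g. "docs/guide.pdf"
    await env.KNOWLEDGE_BASE.put(key, request.body, {
      httpMetadata: { contentType: "application/pdf" },
    });
    return new Response(`Stored ${key}; indexing will pick it up.`);
  },
};
```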

2. AutoRAG Processing

Cloudflare's AutoRAG pipeline takes over. It extracts text, intelligently splits it into semantic chunks, and stores the plain text in our D1 database—the ground truth for our AI.
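
For illustration, here is roughly what a chunk record could look like in D1. The table name and columns below are assumptions for this sketch, not AutoRAG's actual internal schema:

```ts
// Storing a processed chunk as ground truth in D1. Illustrative schema:
//   CREATE TABLE chunks (id TEXT PRIMARY KEY, doc_key TEXT, content TEXT);
export interface Env {
  DB: D1Database;
}

export async function storeChunk(
  env: Env,
  id: string,
  docKey: string,
  content: string,
): Promise<void> {
  await env.DB
    .prepare("INSERT OR REPLACE INTO chunks (id, doc_key, content) VALUES (?1, ?2, ?3)")
    .bind(id, docKey, content)
    .run();
}
```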

3. Vectorize Indexing

Each text chunk is converted into a numerical vector by our Embedding Model. These vectors are stored in Vectorize, creating a high-speed semantic search index.
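
The indexing step itself is two Workers AI calls: embed, then upsert. A hedged sketch, with the AI and VECTOR_INDEX binding names assumed:

```ts
// Index one chunk: embed it with the Librarian model, then upsert the
// resulting vector (with the source text as metadata) into Vectorize.
export interface Env {
  AI: Ai;
  VECTOR_INDEX: VectorizeIndex;
}

export async function indexChunk(env: Env, id: string, content: string): Promise<void> {
  // bge-large-en-v1.5 produces 1024-dimensional embeddings.
  const embedding = await env.AI.run("@cf/baai/bge-large-en-v1.5", {
    text: [content],
  });
  await env.VECTOR_INDEX.upsert([
    { id, values: embedding.data[0], metadata: { content } },
  ]);
}
```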

The Two-Model Strategy

We use two distinct, specialized AI models. This is the core of our efficient and high-quality architecture.

The Librarian: Embedding Model

This model's only job is to understand meaning. It reads our knowledge chunks and user questions, then converts them into vectors. It's the "librarian" that knows where to find the most relevant information in our vast library instantly.

Our Choice: @cf/baai/bge-large-en-v1.5

We chose the 'large' version for its superior ability to capture the nuance of our specialized content, ensuring the highest quality search results.
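
Crucially, the same model embeds user questions at query time, so questions and chunks share one vector space. A minimal sketch (the AI binding name is again an assumption):

```ts
// Embed a user question into the same 1024-dimensional space as the chunks.
export async function embedQuestion(env: { AI: Ai }, question: string): Promise<number[]> {
  const { data } = await env.AI.run("@cf/baai/bge-large-en-v1.5", {
    text: [question],
  });
  return data[0];
}
```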

The Synthesizer: Generation Model

This is the "brain" of the operation. After the Librarian finds the right information, this model's job is to read it, understand the user's original question, and synthesize a coherent, intelligent answer. It doesn't need to know everything; it just needs to be an expert reasoner.

Our Choice: @cf/meta/llama-3.1-8b-instruct-fast

We chose the highly efficient 8B model because it delivers strong reasoning over retrieved context without the cost and latency overhead of a massive model.
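
Calling the Synthesizer is a single Workers AI invocation. A sketch, assuming an AI binding and a context string retrieved earlier:

```ts
// Ask the 8B model to answer strictly from the supplied notes.
export async function synthesize(
  env: { AI: Ai },
  context: string,
  question: string,
): Promise<string> {
  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct-fast", {
    messages: [
      { role: "system", content: `Answer using only this context:\n${context}` },
      { role: "user", content: question },
    ],
  });
  return result.response ?? "";
}
```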

Llama Showdown: 8B vs. 70B

Choosing the right reasoning engine is a strategic decision. This comparison shows why.

| Feature | Llama 3.1 8B (Our Choice) | Llama 3.3 70B |
| --- | --- | --- |
| Best For | High-speed, cost-effective, RAG-based reasoning. | Complex, multi-step logic and general knowledge tasks. |
| Performance | Extremely fast inference, ideal for real-time user interaction. | Slightly higher latency, but superior raw reasoning power. |
| Analogy | A brilliant specialist who can instantly synthesize an answer from provided notes. | A university professor who can reason deeply on any topic from first principles. |
| Our Use Case | Perfect fit: we provide the notes (from D1), so we need the fast specialist. | Overkill: we don't rely on its internal knowledge base. |

Our Winning Strategy

Efficiency over brute force. Our RAG architecture is smarter, not just bigger.

Why We Don't Need the 405B Behemoth

A common misconception is that bigger is always better. A model like Meta's 405-billion-parameter Llama 3.1 is a marvel of engineering, but using it for our task would be like using a sledgehammer to crack a nut. It's a generalist designed to know everything.

Our approach is more surgical. We don't need the AI to already know our niche subject. We need it to be an expert at understanding and synthesizing the precise, high-quality information we feed it in real-time.

By combining a high-quality knowledge base with a fast, "always-warm" reasoning engine, we achieve state-of-the-art quality with best-in-class performance and efficiency.

Live RAG Simulator

See our architecture in action. Ask a question about our chosen models.


Try asking: "Why is the 8B model a good choice?" or "What is quantization?"

1. Generate Query Embedding

The user's question is converted into a vector.

2. Semantic Search

The query vector is used to find the most relevant text chunks from our knowledge base.

3. Augment Prompt

The retrieved chunks are combined with the original question to form a detailed prompt for the LLM.

4. Generate Final Answer

The augmented prompt is sent to the generation model, which synthesizes a coherent answer grounded in the retrieved context.
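
Taken together, the four steps fit in a single Worker handler. A minimal end-to-end sketch, reusing the assumed bindings (AI, VECTOR_INDEX, DB) and the illustrative chunks table from the earlier snippets:

```ts
export interface Env {
  AI: Ai;
  VECTOR_INDEX: VectorizeIndex;
  DB: D1Database;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const question = new URL(request.url).searchParams.get("q") ?? "";

    // 1. Generate query embedding.
    const { data } = await env.AI.run("@cf/baai/bge-large-en-v1.5", {
      text: [question],
    });

    // 2. Semantic search: top 3 nearest chunks in Vectorize.
    const { matches } = await env.VECTOR_INDEX.query(data[0], { topK: 3 });
    const ids = matches.map((m) => m.id);

    // Pull the ground-truth text for those chunks out of D1.
    let context = "";
    if (ids.length > 0) {
      const placeholders = ids.map(() => "?").join(",");
      const rows = await env.DB
        .prepare(`SELECT content FROM chunks WHERE id IN (${placeholders})`)
        .bind(...ids)
        .all<{ content: string }>();
      context = rows.results.map((r) => r.content).join("\n---\n");
    }

    // 3. Augment the prompt with the retrieved chunks.
    // 4. Generate the final answer, grounded in that context.
    const answer = await env.AI.run("@cf/meta/llama-3.1-8b-instruct-fast", {
      messages: [
        { role: "system", content: `Answer using only this context:\n${context}` },
        { role: "user", content: question },
      ],
    });

    return Response.json({ answer: answer.response, sources: ids });
  },
};
```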

AI Cost Calculator

Interactively explore the cost and performance trade-offs of different models.

The calculator takes a daily request volume (from 1k up to 1M requests) and reports the estimated Neuron usage, the estimated cost in USD, and the percentage of the daily free tier consumed.
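
The arithmetic behind the calculator is simple. The constants below are assumptions based on Cloudflare's published Workers AI pricing at the time of writing (Neurons are Cloudflare's abstract usage unit); check the current pricing page before relying on them:

```ts
// Back-of-envelope Neuron math. Both constants are assumptions; verify
// against Cloudflare's current Workers AI pricing.
const USD_PER_1K_NEURONS = 0.011;    // assumed paid rate
const FREE_NEURONS_PER_DAY = 10_000; // assumed daily free allocation

export function estimateDailyCost(requestsPerDay: number, neuronsPerRequest: number) {
  const totalNeurons = requestsPerDay * neuronsPerRequest;
  const billableNeurons = Math.max(0, totalNeurons - FREE_NEURONS_PER_DAY);
  return {
    totalNeurons,
    // Can exceed 100% once usage spills past the free tier.
    freeTierUsagePct: (totalNeurons / FREE_NEURONS_PER_DAY) * 100,
    estimatedCostUsd: (billableNeurons / 1_000) * USD_PER_1K_NEURONS,
  };
}

// e.g. 100,000 requests/day at a hypothetical 2 Neurons each:
// estimateDailyCost(100_000, 2) -> { totalNeurons: 200000,
//   freeTierUsagePct: 2000, estimatedCostUsd: 2.09 }
```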