AI education | Intermediate

RAG Pipeline for Student Documents

Retrieval-augmented generation turns messy course files into ranked evidence before an AI tutor answers.

RAG | Embeddings | Hybrid search | Lykke

Site connection

Lykke describes document ingestion, embeddings, hybrid search, reranking, and Canvas-connected study workflows.

Visual model

Ranked evidence before generation

Move from a student question to candidate chunks, hybrid scores, and the evidence that should enter the answer context.

Hybrid retrieval turns a vague study question into ranked evidence

Chunk	Keyword score	Vector score	Final score
Lecture: embeddings	0.34	0.94	0.87
Canvas calendar export	0.76	0.42	0.79
Syllabus policies	0.38	0.62	0.72
Study guide draft	0.28	0.68	0.61

What RAG Adds

A language model on its own answers only from what it learned during training, stored in its parameters. A RAG system first retrieves source material, then asks the model to answer with that material in context.

For course tools, the source material is unusually fragmented: syllabi, lecture notes, Canvas pages, assignment text, PDFs, calendar events, and sometimes handwritten study notes.

The retrieval step is the difference between an AI tutor that sounds plausible and one that can point back to the class material.

The Working Pipeline

At ingestion time, each document is parsed into chunks, each chunk is embedded, and metadata such as course, week, file name, and assignment date is attached to it.
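The ingestion step can be sketched as follows. This is a minimal illustration, not Lykke's implementation: the `Chunk` class, the `ingest` function, and the paragraph-based split are assumptions, and `toy_embed` is a deterministic stand-in for a real embedding model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """One retrievable unit plus the metadata used for filtering and citation."""
    text: str
    course: str
    week: int
    file_name: str
    assignment_date: Optional[str] = None  # ISO date string, if relevant
    embedding: Optional[List[float]] = None


def toy_embed(text: str, dims: int = 8) -> List[float]:
    """Stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dims
    for token in text.lower().split():
        # Deterministic bucket per token (assumption: illustrative only).
        vec[sum(ord(c) for c in token) % dims] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec


def ingest(document: str, course: str, week: int, file_name: str,
           assignment_date: Optional[str] = None) -> List[Chunk]:
    """Split on blank lines, embed each chunk, and attach metadata."""
    chunks = []
    for block in document.split("\n\n"):
        text = block.strip()
        if not text:
            continue  # skip empty blocks left by extra blank lines
        chunks.append(Chunk(text=text, course=course, week=week,
                            file_name=file_name,
                            assignment_date=assignment_date,
                            embedding=toy_embed(text)))
    return chunks
```

Splitting on blank lines is the simplest structure-aware choice; a real parser would more likely follow headings, slides, or Canvas page sections.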

At question time, keyword search catches exact terms while vector search catches semantic matches. A reranker can then reorder the mixed candidate set before the model sees it.
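One common way to mix the two signals is a weighted sum, sketched below. The function names, the term-overlap keyword score, and the `alpha` weight are assumptions chosen for illustration; production systems often use BM25 for the keyword side and may fuse ranks instead of raw scores.

```python
import math
from typing import List, Sequence, Tuple


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear verbatim in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_rank(query: str, query_vec: Sequence[float],
                candidates: List[Tuple[str, Sequence[float]]],
                alpha: float = 0.5) -> List[Tuple[float, str]]:
    """Score each (text, vector) candidate and sort best-first.

    alpha weights the keyword signal; (1 - alpha) weights the vector signal.
    """
    scored = []
    for text, vec in candidates:
        score = (alpha * keyword_score(query, text)
                 + (1 - alpha) * cosine(query_vec, vec))
        scored.append((score, text))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Raising `alpha` favors chunks that contain the exact query terms; lowering it favors chunks that are close in embedding space even when the wording differs.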

Stage	Job	Failure if skipped
Chunking	Split source material into retrievable units	The model gets either too little context or a huge noisy passage
Embedding	Represent meaning as vectors	Semantic questions miss relevant notes
Keyword search	Preserve exact names, formulas, and course terms	Acronyms and proper nouns disappear
Reranking	Put the best evidence first	The answer uses convenient but weak context
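The reranking stage in the table can be sketched as a second pass over the first-stage candidates. The `rerank` function and its `score_fn` parameter are hypothetical names; `toy_relevance` is only a placeholder for the trained relevance model (often a cross-encoder) that a real reranker would use.

```python
from typing import Callable, List


def toy_relevance(query: str, passage: str) -> float:
    """Placeholder scorer: share of passage words that are query terms.

    Long passages with few matching words score low, which mimics how a
    reranker demotes convenient but weak context.
    """
    q_terms = set(query.lower().split())
    p_words = passage.lower().split()
    if not p_words:
        return 0.0
    hits = sum(1 for word in p_words if word in q_terms)
    return hits / len(p_words)


def rerank(query: str, candidates: List[str],
           score_fn: Callable[[str, str], float] = toy_relevance,
           top_n: int = 3) -> List[str]:
    """Rescore first-stage candidates and keep the top_n best passages."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]
```

Because the reranker only sees the candidates that first-stage retrieval returned, it can improve the order of the context set but cannot recover a chunk that was never retrieved.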

Worked Example

A query like 'what should I study before the vector search quiz?' should retrieve the quiz date, the relevant lecture, and the study guide. Those chunks probably live in different files.

A strong answer should synthesize them without hiding the evidence chain: quiz timing from Canvas, core terms from lecture notes, and practice prompts from the study guide.
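A sketch of how that evidence chain might be kept visible in the prompt: each retrieved chunk carries a numbered source label the model can cite. The `build_context` function, the instruction wording, and the placeholder chunk texts are assumptions, not content from a real course.

```python
from typing import Dict, List


def build_context(question: str, evidence: List[Dict[str, str]]) -> str:
    """Assemble a prompt in which every chunk is labeled with its source."""
    lines = [
        "Answer using only the sources below. "
        "Cite sources by their bracketed number.",
        "",
    ]
    for number, item in enumerate(evidence, start=1):
        lines.append(f"[{number}] {item['source']}: {item['text']}")
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Placeholder evidence: the texts stand in for real retrieved chunks.
evidence = [
    {"source": "Canvas calendar export", "text": "(quiz date and time)"},
    {"source": "Lecture: embeddings", "text": "(core terms from the lecture)"},
    {"source": "Study guide draft", "text": "(practice prompts)"},
]

prompt = build_context("What should I study before the vector search quiz?",
                       evidence)
```

With the labels in place, the tutor's answer can point to `[1]` for timing, `[2]` for terms, and `[3]` for practice, and a student can check each claim against its file.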

Common Pitfalls

Treating embedding search as enough when exact course terms matter.
Chunking by arbitrary character count instead of document structure.
Letting retrieved chunks into the prompt without source labels.
Using old course files after Canvas content changes.
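The last pitfall, stale course files, can be caught with a content fingerprint stored at ingestion time. This is a minimal sketch under assumptions: the `fingerprint` and `stale_files` helpers and the dictionary layout are hypothetical, and fetching the current Canvas content is left out.

```python
import hashlib
from typing import Dict, List


def fingerprint(text: str) -> str:
    """Stable content hash used to detect changes between syncs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_files(indexed: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return names of files that are new or changed since ingestion.

    indexed maps file name -> fingerprint saved when the file was ingested.
    current maps file name -> the latest text of that file.
    """
    stale = []
    for name, text in current.items():
        if indexed.get(name) != fingerprint(text):
            stale.append(name)
    return sorted(stale)
```

Files returned by `stale_files` would be re-chunked and re-embedded before their chunks are allowed back into retrieval.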

Quick check

Why combine keyword and vector search?

To make retrieval slower
To balance exact term matching with semantic matching
To avoid storing metadata
To replace reranking entirely

Answer: To balance exact term matching with semantic matching. Keyword search preserves exact terms; vector search captures meaning. Hybrid retrieval uses both signals.

What should a reranker do?

Generate the final answer
Delete all low-frequency words
Reorder candidate chunks by relevance
Parse PDFs into text

Answer: Reorder candidate chunks by relevance. A reranker scores candidate passages after first-stage retrieval and improves the final context set.

Sources and Further Reading

Google Cloud: hybrid search overview
Pinecone: rerankers and two-stage retrieval