AI education | Intermediate

RAG Pipeline for Student Documents

Retrieval-augmented generation turns messy course files into ranked evidence before an AI tutor answers.

Tags: RAG, Embeddings, Hybrid search, Lykke

Site connection

Lykke describes document ingestion, embeddings, hybrid search, reranking, and Canvas-connected study workflows.

Visual model

Ranked evidence before generation

Move from a student question to candidate chunks, hybrid scores, and the evidence that should enter the answer context.

Hybrid retrieval turns a vague study question into ranked evidence

Source | Keyword | Vector | Score
Lecture: embeddings | 0.34 | 0.94 | 0.87
Canvas calendar export | 0.76 | 0.42 | 0.79
Syllabus policies | 0.38 | 0.62 | 0.72
Study guide draft | 0.28 | 0.68 | 0.61

What RAG Adds

A language model on its own answers only from what its parameters encode. A RAG system first retrieves source material, then asks the model to answer with that material in its context.

For course tools, the source material is unusually fragmented: syllabi, lecture notes, Canvas pages, assignment text, PDFs, calendar events, and sometimes handwritten study notes.

The retrieval step is the difference between an AI tutor that sounds plausible and one that can point back to the class material.

The Working Pipeline

At ingestion time, each document is parsed into chunks, each chunk is embedded, and metadata such as course, week, file name, and assignment date is attached.
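The chunk-plus-metadata step can be sketched as follows. This is a minimal illustration, not Lykke's implementation: the `Chunk` class, the `chunk_by_headings` function, and the convention that a section starts at a line beginning with `# ` are all assumptions made for the example, and the embedding call is left out because it depends on the model in use.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One retrievable unit of text plus the metadata attached to it."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_headings(raw: str, base_metadata: dict) -> list[Chunk]:
    """Split plain text into chunks at lines that start with '# ' (assumed heading marker)."""
    chunks: list[Chunk] = []
    heading = "Untitled"
    body: list[str] = []

    def flush() -> None:
        # Emit the section collected so far, skipping empty sections.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(text=text, metadata={**base_metadata, "section": heading}))

    for line in raw.splitlines():
        if line.startswith("# "):
            flush()
            heading = line[2:].strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries the shared file-level metadata (course, week, file name) plus its own section title, so a retrieved chunk can later be traced back to where it came from.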

At question time, keyword search catches exact terms while vector search catches semantic matches. A reranker can then reorder the mixed candidate set before the model sees it.
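One common way to merge a keyword ranking and a vector ranking is reciprocal rank fusion (RRF). The sketch below is an assumption about how the mixing could be done, not a description of the method behind the scores shown earlier; the function name, the chunk ids, and the constant `k = 60` (a conventional default) are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk ids into one list sorted by RRF score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk earns more credit the higher it appears in each list.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical orderings: keyword search favors the calendar export,
# vector search favors the lecture notes.
keyword_ranking = ["calendar", "syllabus", "lecture", "study_guide"]
vector_ranking = ["lecture", "study_guide", "syllabus", "calendar"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
fused_ids = [chunk_id for chunk_id, _ in fused]
```

RRF uses only rank positions, so it sidesteps the problem that keyword and vector scores live on different scales. The fused list is a candidate set; a reranker can still reorder it afterwards.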

Stage | Job | Failure if skipped
Chunking | Split source material into retrievable units | The model gets either too little context or a huge noisy passage
Embedding | Represent meaning as vectors | Semantic questions miss relevant notes
Keyword search | Preserve exact names, formulas, and course terms | Acronyms and proper nouns disappear
Reranking | Put the best evidence first | The answer uses convenient but weak context
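The reranking stage can be sketched as a function that rescores first-stage candidates against the query and keeps the top few. Production rerankers typically use a learned relevance model; the `token_overlap` scorer here is a deliberately crude stand-in so the example runs without any model, and both function names are hypothetical.

```python
from typing import Callable


def token_overlap(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float] = token_overlap,
    top_k: int = 3,
) -> list[str]:
    """Reorder candidate passages by score_fn, best first, and keep top_k."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]
```

Because `score_fn` is a parameter, the toy scorer can be swapped for a real relevance model without changing the surrounding pipeline.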

Worked Example

A query like 'what should I study before the vector search quiz?' should retrieve the quiz date, the relevant lecture, and the study guide. Those chunks probably live in different files.

A strong answer should synthesize them without hiding the evidence chain: quiz timing from Canvas, core terms from lecture notes, and practice prompts from the study guide.
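Keeping the evidence chain visible mostly comes down to how the retrieved chunks are written into the prompt. The sketch below numbers each chunk and labels it with its source; the `build_prompt` function, the dictionary keys, the instruction wording, and the sample evidence text are all illustrative assumptions.

```python
def build_prompt(question: str, evidence: list[dict]) -> str:
    """Assemble a prompt in which every retrieved chunk carries a numbered source label."""
    lines = ["Answer using only the numbered sources below. Cite sources like [1]."]
    for index, item in enumerate(evidence, start=1):
        lines.append(f"[{index}] ({item['source']}) {item['text']}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Hypothetical evidence drawn from three different files.
evidence = [
    {"source": "Canvas calendar", "text": "Vector search quiz is scheduled for week 5."},
    {"source": "Lecture notes", "text": "Core terms: embedding, cosine similarity, top-k."},
    {"source": "Study guide", "text": "Practice: explain why hybrid search helps with acronyms."},
]
prompt = build_prompt("What should I study before the vector search quiz?", evidence)
```

With labels in the context, the model can cite `[1]` for quiz timing and `[2]` for core terms, and a student can check each claim against the original file.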

Common Pitfalls

  • Treating embedding search as enough when exact course terms matter.
  • Chunking by arbitrary character count instead of document structure.
  • Letting retrieved chunks into the prompt without source labels.
  • Using old course files after Canvas content changes.
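The stale-content pitfall can be guarded against by comparing when each source was indexed with when it was last changed. This sketch assumes both timestamps are already available as dictionaries keyed by a source id; how they are obtained from Canvas is outside its scope, and the `stale_sources` name is hypothetical.

```python
from datetime import datetime


def stale_sources(indexed: dict[str, datetime], current: dict[str, datetime]) -> list[str]:
    """Return ids of indexed sources that were deleted or modified after indexing."""
    stale = []
    for source_id, indexed_at in indexed.items():
        updated_at = current.get(source_id)
        # Missing from the current listing means the source was removed.
        if updated_at is None or updated_at > indexed_at:
            stale.append(source_id)
    return stale
```

Sources flagged here would be re-chunked and re-embedded, or dropped from the index, before the next retrieval.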

Quick check

Why combine keyword and vector search?
  1. To make retrieval slower
  2. To balance exact term matching with semantic matching
  3. To avoid storing metadata
  4. To replace reranking entirely

Answer: 2. Keyword search preserves exact terms; vector search captures meaning. Hybrid retrieval uses both signals.

What should a reranker do?
  1. Generate the final answer
  2. Delete all low-frequency words
  3. Reorder candidate chunks by relevance
  4. Parse PDFs into text

Answer: 3. A reranker scores candidate passages after first-stage retrieval and reorders them, which improves the final context set.
