AI-Powered Search for Healthcare LMS: A Proof of Concept Journey

Marta Kozłowska - Product Delivery Manager
7 minute read

The Challenge We Set Out to Solve


I've been working with organizations that manage health-related learning content, and I've consistently seen the same problem arise: their Learning Management Systems are packed with valuable educational materials (PDFs, videos, SCORM packages), but finding specific information in them is increasingly difficult.

The existing search mechanisms weren't cutting it. Users would know exactly what information they needed, but had no idea where to find it. Traditional keyword search didn't understand context, couldn't handle natural language queries, and didn't provide clear sources for the information it returned. For healthcare educators and learners dealing with complex medical content, this created real difficulties in their daily work.

That's when we decided to explore whether AI-powered natural language search could actually solve these challenges.

What We Were Trying to Prove

I want to be clear upfront: this was a Proof of Concept, not a production-ready system. We weren't trying to build the perfect solution; we were trying to validate whether this approach could work at all.

Our main questions were:

  1. Can we actually ingest and process all these different content formats (PDFs, SCORM packages, videos)?

  2. Will the answers be accurate and contextually relevant?

  3. Can we reliably show users where the information came from?

  4. What technical challenges and risks should we expect?

We kept the scope intentionally narrow. This was a low-risk feasibility exercise designed to help our client make informed decisions about whether to invest further.

How We Scoped the Work

We built the PoC as a standalone application, completely separate from the existing LMS and databases. Only selected stakeholders had access, and we used simple authentication to keep things secure enough for testing. The scope covered everything we needed to validate the concept:

  • Getting content into the system and preprocessing it

  • Extracting text and transcribing videos

  • Creating semantic indexes

  • Handling natural language queries

  • Retrieving relevant content

  • Generating answers with proper source references

What we deliberately left out: advanced security features, personalization, monitoring systems, direct integration with the existing LMS database, and anything related to long-term maintenance. Those would come later if the PoC proved successful.

Our Technical Approach: RAG-Based Search

We based our solution on large language models using a retrieval-augmented generation (RAG) approach. Here's how it worked in practice:

Users could ask questions in plain English, like "What are the contraindications for this treatment?" and get contextual answers drawn exclusively from the indexed learning materials. Crucially, every answer came with explicit references to its sources.

The key advantage of this approach? We could dramatically reduce AI hallucinations by constraining answers to actual retrieved content fragments. No making things up—everything had to be traceable back to a real source document.


The Architecture We Built

We designed the PoC with a modular, cloud-based architecture that follows the RAG pattern.
Here's the end-to-end flow we implemented:

  • Learning content gets ingested and converted into a unified text representation

  • Text is segmented into chunks and indexed in a vector database

  • When a user asks a question, we use semantic retrieval to find relevant fragments

  • Those fragments get passed to the language model as context

  • The model generates a natural language answer with source attribution
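
To make the segmentation step above concrete, here's a minimal chunking helper in the spirit of what our Node.js pipelines do; the chunk size and overlap values are illustrative, not the exact settings we used.

```typescript
// Split extracted text into overlapping chunks for embedding.
// chunkSize and overlap are illustrative values, not the PoC's exact settings.
export function chunkText(
  text: string,
  chunkSize = 1000,
  overlap = 200
): { id: string; text: string }[] {
  const chunks: { id: string; text: string }[] = [];
  let start = 0;
  let index = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({ id: `chunk-${index++}`, text: text.slice(start, end) });
    if (end === text.length) break;
    start = end - overlap; // keep some context across chunk boundaries
  }
  return chunks;
}
```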

The key components we selected:

  • Workflow orchestration: We used n8n to manage the entire ingestion pipeline, covering parsing, transformation, and downstream processing.

  • Data ingestion and parsing: We built Node.js pipelines to handle various content types, including PDFs, SCORM packages, and videos.

  • Vector storage and retrieval: We chose Pinecone as our vector database, with embeddings generated using OpenAI models.

  • Answer generation: At query time, we provide the retrieved fragments as context to the language model.
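
As a rough sketch of how the Pinecone and OpenAI pieces fit together (the index name, embedding model, and lack of error handling are simplifications for illustration, not our exact setup):

```typescript
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI();
const pinecone = new Pinecone();
// "lms-content" is a hypothetical index name.
const index = pinecone.index("lms-content");

// Embed chunks and store them with source metadata for later attribution.
async function indexChunks(chunks: { id: string; text: string; source: string }[]) {
  const embeddings = await openai.embeddings.create({
    model: "text-embedding-3-small", // illustrative embedding model
    input: chunks.map((c) => c.text),
  });
  await index.upsert(
    chunks.map((c, i) => ({
      id: c.id,
      values: embeddings.data[i].embedding,
      metadata: { text: c.text, source: c.source },
    }))
  );
}

// Retrieve the most relevant fragments for a user question.
async function retrieve(question: string, topK = 5) {
  const queryEmbedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  });
  const results = await index.query({
    vector: queryEmbedding.data[0].embedding,
    topK,
    includeMetadata: true,
  });
  return results.matches ?? [];
}
```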

Processing Different Content Types

This is where things got interesting. Each content type required its own ingestion pipeline:

PDF Documents

These were the easiest. We extracted textual content directly and prepared it for normalization and chunking. PDFs with predominantly textual content gave us the most reliable input for semantic indexing.
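
A minimal extraction step can look like the sketch below; pdf-parse is just one option for the job, not necessarily what a production pipeline would settle on.

```typescript
import fs from "node:fs/promises";
import pdf from "pdf-parse";

// Extract raw text from a PDF so it can be normalized and chunked.
async function extractPdfText(path: string): Promise<string> {
  const buffer = await fs.readFile(path);
  const parsed = await pdf(buffer);
  return parsed.text;
}
```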

Images

We processed images with OpenAI’s vision capabilities to extract textual content and generate a semantic description. To keep processing costs down, images were resized and quality-reduced before analysis.
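
Here's a hedged sketch of that idea; sharp for downscaling and gpt-4o-mini as the vision model are illustrative choices rather than a statement of our exact stack.

```typescript
import sharp from "sharp";
import OpenAI from "openai";

const openai = new OpenAI();

// Downscale and recompress the image before sending it to the vision model,
// then ask for extracted text plus a short semantic description.
async function describeImage(path: string): Promise<string> {
  const resized = await sharp(path)
    .resize({ width: 1024, withoutEnlargement: true })
    .jpeg({ quality: 70 })
    .toBuffer();

  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini", // illustrative vision-capable model
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Extract any text in this image and briefly describe its content." },
          {
            type: "image_url",
            image_url: { url: `data:image/jpeg;base64,${resized.toString("base64")}` },
          },
        ],
      },
    ],
  });
  return response.choices[0].message.content ?? "";
}
```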

PPTX and DOCX Files

Presentation slides and text documents were parsed and transformed into structured text representations.
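
For DOCX, a library like mammoth can produce plain text directly; PPTX needs more bespoke handling of the slide XML, so the sketch below only covers the DOCX side.

```typescript
import mammoth from "mammoth";

// Convert a DOCX file into plain text suitable for chunking.
async function extractDocxText(path: string): Promise<string> {
  const result = await mammoth.extractRawText({ path });
  return result.value;
}
```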

Video Files

Video processing was more involved. We had to:

  • Convert video to a compressed mono audio track optimized for transcription

  • Remove silent segments (to save on transcription costs)

  • Speed up the audio without impacting transcription accuracy

  • Split into segments for transcription
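
This kind of preprocessing maps naturally onto ffmpeg. The sketch below shows the idea (mono downmix, silence removal, mild speed-up); the filter values are illustrative rather than our exact tuning.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Convert a video into a compressed mono audio track optimized for transcription:
// drop the video stream, downmix to mono, remove long silences, and speed up slightly.
async function prepareAudio(videoPath: string, audioPath: string): Promise<void> {
  await run("ffmpeg", [
    "-i", videoPath,
    "-vn",          // drop the video stream
    "-ac", "1",     // mono
    "-ar", "16000", // 16 kHz is plenty for speech models
    "-af", "silenceremove=stop_periods=-1:stop_duration=1:stop_threshold=-45dB,atempo=1.25",
    audioPath,
  ]);
}
```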

SCORM Packages

These were by far the most complex. SCORM ingestion required:

  • Parsing the manifest to understand the package structure

  • Identifying all contained assets

  • Analyzing code files to extract learning-relevant text

  • Processing embedded PDFs, images, and videos using our existing pipelines
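
A reasonable starting point is reading imsmanifest.xml and listing the files each resource references, so every asset can be routed to the right pipeline. The sketch below uses fast-xml-parser as one possible option.

```typescript
import fs from "node:fs/promises";
import path from "node:path";
import { XMLParser } from "fast-xml-parser";

// List the asset files referenced by a SCORM package's imsmanifest.xml
// so each one can be routed to the matching ingestion pipeline.
async function listScormAssets(packageDir: string): Promise<string[]> {
  const xml = await fs.readFile(path.join(packageDir, "imsmanifest.xml"), "utf8");
  const manifest = new XMLParser({ ignoreAttributes: false }).parse(xml);

  const resources = manifest?.manifest?.resources?.resource ?? [];
  const resourceList = Array.isArray(resources) ? resources : [resources];

  const files: string[] = [];
  for (const resource of resourceList) {
    const fileEntries = resource?.file ?? [];
    for (const file of Array.isArray(fileEntries) ? fileEntries : [fileEntries]) {
      if (file?.["@_href"]) files.push(file["@_href"]);
    }
  }
  return files;
}
```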

How We Generated Answers with Source Attribution

The answer generation followed a clear retrieval-augmented generation flow:

  • User submits a query in natural language

  • We resolve it through semantic retrieval

  • Relevant fragments are provided as context to the language model

  • The model generates a response constrained to the retrieved content

  • We include attribution at the source file or module level
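
Conceptually, the generation step looks something like the sketch below; the Fragment shape, prompt wording, and model choice are illustrative assumptions rather than our exact implementation.

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

type Fragment = { text: string; source: string };

// Generate an answer constrained to the retrieved fragments,
// asking the model to cite the source of each statement.
async function answerQuestion(question: string, fragments: Fragment[]): Promise<string> {
  const context = fragments
    .map((f, i) => `[${i + 1}] (source: ${f.source})\n${f.text}`)
    .join("\n\n");

  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini", // illustrative model choice
    messages: [
      {
        role: "system",
        content:
          "Answer only using the provided context. Cite sources as [n]. " +
          "If the context does not contain the answer, say so.",
      },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
  return response.choices[0].message.content ?? "";
}
```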

One thing we noted for future improvement: we implemented attribution at the file level (like "PDF document X" or "Video Y"), but finer-grained attribution - including specific page numbers or timestamps - would be even more valuable for users.

How We Evaluated the Results

We took a qualitative, exploratory approach to evaluation. We assembled a dataset including PDF documents, video materials, and SCORM packages, then randomly selected several dozen concrete information points to test. We also deliberately included unanswerable questions to see how the system would handle them.

I evaluated based on:

  • Answer accuracy and factual correctness

  • How well the retrieval actually worked

  • Response clarity and coherence

  • Whether attribution was correct

  • Consistency across different content formats

  • Proper handling of unanswerable questions

What We Found

The good news:

  • Text extraction from PDFs was consistently high-quality

  • Answers were generally clear, accurate, and aligned with the source material

  • The system correctly indicated when information wasn't available

The challenges:

  • Video transcription occasionally had issues

  • SCORM package processing varied widely depending on how they were structured

  • We saw some inconsistencies that would need addressing in production

Performance, Costs, and Limitations

The longest processing times happened during initial ingestion, primarily because of video transcription. Once content was indexed, query response times typically ranged from 3 to 5 seconds - fast enough for a good user experience.

Cost drivers I identified:

  • Ingestion and indexing (especially video transcription)

  • Query-time inference costs

  • Vector database infrastructure

  • Compute resources required for media preprocessing (audio/video transcoding and optimization)

Key limitations we need to address:

  • Heavy dependency on input data quality and structure

  • High variability in SCORM package composition

  • No access control mechanisms yet

  • Reliance on third-party AI services (vendor lock-in risk)

What We Learned and What's Next

The bottom line: this Proof of Concept confirmed that AI-powered natural language search can absolutely support information retrieval across diverse healthcare LMS content types. We now have a validated technical foundation to build on.

Our recommendations for next steps:

  1. Evaluate the actual content structure to see how well it fits this approach.

  2. Implement proper cost estimation based on real data volumes.

  3. Further decouple ingestion and indexing pipelines for better scalability.

  4. Introduce query logging and analytics to understand usage patterns.

  5. Conduct a limited user rollout before broader deployment.

I'm excited about where this could go. The PoC proved the concept works - now it's about refining the approach and scaling it responsibly.

