
AI-Powered Search for Healthcare LMS: A Proof of Concept Journey
The Challenge We Set Out to Solve
I've been working with organizations that manage health-related learning content, and I've consistently seen the same problem arise: their Learning Management Systems are packed with valuable educational materials (PDFs, videos, SCORM packages), but finding specific information is becoming increasingly difficult.
The existing search mechanisms weren't cutting it. Users would know exactly what information they needed, but had no idea where to find it. Traditional keyword search didn't understand context, couldn't handle natural language queries, and didn't provide clear sources for the information it returned. For healthcare educators and learners dealing with complex medical content, this created real difficulties in their daily work.
That's when we decided to explore whether AI-powered natural language search could actually solve these challenges.
What We Were Trying to Prove
I want to be clear upfront: this was a Proof of Concept, not a production-ready system. We weren't trying to build the perfect solution; we were trying to validate whether this approach could work at all.
Our main questions were:
Can we actually ingest and process all these different content formats (PDFs, SCORM packages, videos)?
Will the answers be accurate and contextually relevant?
Can we reliably show users where the information came from?
What technical challenges and risks should we expect?
We kept the scope intentionally narrow. This was a low-risk feasibility exercise designed to help our client make informed decisions about whether to invest further.
How We Scoped the Work
We built the PoC as a standalone application, completely separate from the existing LMS and databases. Only selected stakeholders had access, and we used simple authentication to keep things secure enough for testing. The scope covered everything we needed to validate the concept:
Getting content into the system and preprocessing it
Extracting text and transcribing videos
Creating semantic indexes
Handling natural language queries
Retrieving relevant content
Generating answers with proper source references
What we deliberately left out: advanced security features, personalization, monitoring systems, direct integration with the existing LMS database, and anything related to long-term maintenance. Those would come later if the PoC proved successful.
Our Technical Approach: RAG-Based Search
We based our solution on large language models using a retrieval-augmented generation (RAG) approach. Here's how it worked in practice:
Users could ask questions in plain English, like "What are the contraindications for this treatment?" and get contextual answers drawn exclusively from the indexed learning materials. Crucially, every answer came with explicit references to its sources.
The key advantage of this approach? We could dramatically reduce AI hallucinations by constraining answers to actual retrieved content fragments. No making things up—everything had to be traceable back to a real source document.
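To make that constraint concrete, here is a minimal sketch, in TypeScript (matching our Node.js pipelines), of how an answer prompt can be restricted to retrieved fragments. The Fragment shape and the exact instructions are illustrative, not our production prompt:

```typescript
// Illustrative only: one way to constrain the model to retrieved fragments.
interface Fragment {
  sourceId: string; // e.g. "PDF document X" or "Video Y"
  text: string;
}

function buildGroundedPrompt(question: string, fragments: Fragment[]): string {
  // Number each fragment so the model can cite it in the answer.
  const context = fragments
    .map((f, i) => `[${i + 1}] (source: ${f.sourceId})\n${f.text}`)
    .join("\n\n");

  return [
    "Answer the question using ONLY the context fragments below.",
    "If the answer is not in the context, say the information is not available.",
    "Cite the fragment numbers you used.",
    "",
    `Context:\n${context}`,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```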
The Architecture We Built
We designed the PoC with a modular, cloud-based architecture that follows the RAG pattern.
Here's the end-to-end flow we implemented:
Learning content gets ingested and converted into a unified text representation
Text is segmented into chunks and indexed in a vector database
When a user asks a question, we use semantic retrieval to find relevant fragments
Those fragments get passed to the language model as context
The model generates a natural language answer with source attribution
The key components we selected:
Workflow orchestration: We used n8n to manage the entire ingestion pipeline, including parsing, transformation, and downstream processing.
Data ingestion and parsing: We built Node.js pipelines to handle various content types, including PDFs, SCORM packages, and videos.
Vector storage and retrieval: We chose Pinecone as our vector database, with embeddings generated using OpenAI models (see the indexing sketch below).
Answer generation: At query time, we provide the retrieved fragments as context to the language model.
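To show how these components fit together at indexing time, here is a minimal TypeScript sketch that embeds text chunks with OpenAI and upserts them into Pinecone. The index name "lms-poc" and the metadata fields are assumptions for the example, not our actual schema:

```typescript
// A minimal indexing sketch; index name and metadata schema are hypothetical.
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("lms-poc"); // hypothetical index name

async function indexChunks(sourceId: string, chunks: string[]): Promise<void> {
  // Embed all chunks in one call; OpenAI returns one vector per input string.
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: chunks,
  });

  // Store each vector with enough metadata to attribute answers later.
  await index.upsert(
    res.data.map((d, i) => ({
      id: `${sourceId}-${i}`,
      values: d.embedding,
      metadata: { sourceId, text: chunks[i] },
    }))
  );
}
```

Storing the source identifier in each vector's metadata is what makes file-level attribution possible at answer time.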
Processing Different Content Types
This is where things got interesting. Each content type required its own ingestion pipeline:
PDF Documents
These were the easiest. We extracted textual content directly and prepared it for normalization and chunking. PDFs with predominantly textual content gave us the most reliable input for semantic indexing.
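As a rough illustration, the sketch below extracts text with the pdf-parse npm package and splits it into overlapping chunks. Both the package choice and the chunk sizes are assumptions for the example, not the exact values we tuned:

```typescript
// A sketch of the PDF step, assuming the pdf-parse package; sizes are illustrative.
import { readFile } from "node:fs/promises";
import pdf from "pdf-parse";

// Split normalized text into overlapping chunks so that semantically
// related sentences are less likely to be cut at a boundary.
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  const normalized = text.replace(/\s+/g, " ").trim();
  const chunks: string[] = [];
  for (let start = 0; start < normalized.length; start += size - overlap) {
    chunks.push(normalized.slice(start, start + size));
  }
  return chunks;
}

async function extractPdfChunks(path: string): Promise<string[]> {
  const data = await pdf(await readFile(path));
  return chunkText(data.text);
}
```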
Images
We processed images with OpenAI’s vision capabilities to extract textual content and generate a semantic description. To optimize processing cost, images were resized and quality-reduced before analysis.
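A minimal sketch of that step, assuming the sharp package for resizing; the model name and prompt are illustrative:

```typescript
// Preprocess an image and ask a vision model to describe it for indexing.
import sharp from "sharp";
import OpenAI from "openai";

const openai = new OpenAI();

async function describeImage(path: string): Promise<string> {
  // Downscale and re-encode at lower quality to cut vision-token costs.
  const jpeg = await sharp(path)
    .resize({ width: 1024, withoutEnlargement: true })
    .jpeg({ quality: 60 })
    .toBuffer();

  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini", // illustrative model choice
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Extract all text and describe this image for search indexing." },
          { type: "image_url", image_url: { url: `data:image/jpeg;base64,${jpeg.toString("base64")}` } },
        ],
      },
    ],
  });

  return res.choices[0].message.content ?? "";
}
```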
PPTX and DOCX Files
Presentation slides and text documents were parsed and transformed into structured text representations.
Video Files
Video processing was more involved. As sketched below, we had to:
Convert video to a compressed mono audio track optimized for transcription
Remove silent segments (to save on transcription costs)
Speed up the audio without impacting transcription accuracy
Split into segments for transcription
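Here is a rough sketch of the audio preparation, shelling out to ffmpeg from Node.js. The silence threshold, target bitrate, and 1.5x speed-up are illustrative values, not the ones we settled on:

```typescript
// A sketch of the audio preparation step; assumes ffmpeg is installed locally.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function prepareAudio(videoPath: string, outPath: string): Promise<void> {
  await run("ffmpeg", [
    "-y",             // overwrite the output file if it exists
    "-i", videoPath,
    "-vn",            // drop the video stream
    "-ac", "1",       // downmix to mono
    "-ar", "16000",   // 16 kHz is plenty for speech transcription
    // Strip silent segments, then speed up playback without changing pitch.
    "-af", "silenceremove=stop_periods=-1:stop_duration=1:stop_threshold=-40dB,atempo=1.5",
    "-b:a", "48k",    // a low bitrate keeps transcription costs down
    outPath,
  ]);
}
```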
SCORM Packages
These were by far the most complex. SCORM ingestion required the following steps; a sketch of the manifest step follows the list:
Parsing the manifest to understand the package structure
Identifying all contained assets
Analyzing code files to extract learning-relevant text
Processing embedded PDFs, images, and videos using our existing pipelines
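For the manifest step, a minimal sketch using the fast-xml-parser package might look like this. Real packages vary widely, so this covers only the happy path:

```typescript
// A sketch of SCORM manifest parsing, assuming the fast-xml-parser package.
import { readFile } from "node:fs/promises";
import { XMLParser } from "fast-xml-parser";

// Return the file paths declared by each <resource> in imsmanifest.xml,
// so every asset can be routed to the matching ingestion pipeline.
async function listScormAssets(manifestPath: string): Promise<string[]> {
  const xml = await readFile(manifestPath, "utf8");
  const parser = new XMLParser({ ignoreAttributes: false });
  const manifest = parser.parse(xml);

  // fast-xml-parser returns a single object when there is only one element.
  const asArray = <T>(x: T | T[] | undefined): T[] =>
    x === undefined ? [] : Array.isArray(x) ? x : [x];

  return asArray(manifest.manifest?.resources?.resource).flatMap((r: any) =>
    asArray(r.file).map((f: any) => f["@_href"])
  );
}
```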
How We Generated Answers with Source Attribution
The answer generation followed a clear retrieval-augmented generation flow, sketched in code after this list:
User submits a query in natural language
We resolve it through semantic retrieval
Relevant fragments are provided as context to the language model
The model generates a response constrained to the retrieved content
We include attribution at the source file or module level
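Putting the flow together, here is a minimal query-time sketch that reuses the hypothetical "lms-poc" index from the indexing example; the topK value and model names are illustrative choices:

```typescript
// A query-time sketch: embed the question, retrieve fragments, answer with sources.
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI();
const index = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! }).index("lms-poc");

async function answer(question: string): Promise<{ text: string; sources: string[] }> {
  // 1. Embed the query and retrieve the closest fragments.
  const emb = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  });
  const hits = await index.query({
    vector: emb.data[0].embedding,
    topK: 5,
    includeMetadata: true,
  });

  const fragments = hits.matches.map(
    (m) => `(source: ${m.metadata?.sourceId})\n${m.metadata?.text}`
  );

  // 2. Generate an answer constrained to the retrieved fragments.
  const res = await openai.chat.completions.create({
    model: "gpt-4o", // illustrative model choice
    messages: [
      { role: "system", content: "Answer only from the provided context. If the context does not contain the answer, say so." },
      { role: "user", content: `Context:\n${fragments.join("\n\n")}\n\nQuestion: ${question}` },
    ],
  });

  // 3. Return deduplicated file-level attribution alongside the answer.
  const sources = [...new Set(hits.matches.map((m) => String(m.metadata?.sourceId)))];
  return { text: res.choices[0].message.content ?? "", sources };
}
```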
One thing we noted for future improvement: we implemented attribution at the file level (like "PDF document X" or "Video Y"), but finer-grained attribution, including specific page numbers or timestamps, would be even more valuable for users.
How We Evaluated the Results
We took a qualitative, exploratory approach to evaluation. We assembled a dataset including PDF documents, video materials, and SCORM packages, then randomly selected several dozen concrete information points to test. We also deliberately included unanswerable questions to see how the system would handle them.
I evaluated based on:
Answer accuracy and factual correctness
How well the retrieval actually worked
Response clarity and coherence
Whether attribution was correct
Consistency across different content formats
Proper handling of deliberately unanswerable questions
What We Found
The good news:
Text extraction from PDFs was consistently high-quality
Answers were generally clear, accurate, and aligned with the source material
The system correctly indicated when information wasn't available
The challenges:
Video transcription occasionally had issues
SCORM package processing varied widely depending on how the packages were structured
We saw some inconsistencies that would need addressing in production
Performance, Costs, and Limitations
The longest processing times happened during initial ingestion, primarily because of video transcription. Once content was indexed, query response times typically ranged from 3 to 5 seconds, fast enough for a good user experience.
Cost drivers I identified:
Ingestion and indexing (especially video transcription)
Query-time inference costs
Vector database infrastructure
Compute resources required for media preprocessing (audio/video transcoding and optimization)
Key limitations we need to address:
Heavy dependency on input data quality and structure
High variability in SCORM package composition
No access control mechanisms yet
Reliance on third-party AI services (vendor lock-in risk)
What We Learned and What's Next
The bottom line: this Proof of Concept confirmed that AI-powered natural language search can absolutely support information retrieval across diverse healthcare LMS content types. We now have a validated technical foundation to build on.
Our recommendations for next steps:
Evaluate the actual content structure to see how well it fits this approach.
Implement proper cost estimation based on real data volumes.
Further decouple ingestion and indexing pipelines for better scalability.
Introduce query logging and analytics to understand usage patterns.
Conduct a limited user rollout before broader deployment.
I'm excited about where this could go. The PoC proved the concept works; now it's about refining the approach and scaling it responsibly.