Dear learner,
Introducing Multimodal RAG: Chat with Videos, a short course made in collaboration with Intel!
Taught by Vasudev Lal, Principal AI Research Scientist at Intel Labs, this course teaches you to build an interactive system for querying video content using multimodal AI.
You'll learn to create a Q&A system that interacts with a collection of videos. You'll use multimodal transformer models, such as BridgeTower, to map visual and textual data into a unified semantic space. You will generate embeddings from text and images and store them in a vector database. Then, you'll build a RAG pipeline that retrieves relevant content and uses a Large Vision-Language Model (LVLM) to generate responses.
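To make the idea of a unified semantic space concrete, here is a minimal sketch of retrieval by cosine similarity. It is not the course's code: the three-dimensional vectors are made-up stand-ins for the embeddings a model such as BridgeTower would produce, the in-memory list stands in for a vector database, and the names `store`, `retrieve`, and `query_embedding` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Assumption: hand-written toy embeddings. A real system would obtain
# these from a multimodal embedding model applied to frame + caption pairs.
store = [
    {"id": "frame_001", "caption": "an astronaut floats inside the station",
     "embedding": [0.9, 0.1, 0.0]},
    {"id": "frame_002", "caption": "a chef chops vegetables",
     "embedding": [0.0, 0.2, 0.9]},
    {"id": "frame_003", "caption": "an astronaut waves at the camera",
     "embedding": [0.7, 0.6, 0.1]},
]


def retrieve(query_embedding, store, k=1):
    """Return the k stored items most similar to the query embedding."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_embedding, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Assumption: a toy embedding for a text query about astronauts.
query_embedding = [0.8, 0.2, 0.1]
top = retrieve(query_embedding, store, k=2)
for item in top:
    print(item["id"], item["caption"])
```

Because text and images share one space, the same similarity function can rank video frames against a text query; a vector database performs the same ranking at scale.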
In this course, you will make API calls to access multimodal models hosted by Prediction Guard on Intel’s cloud.
By the end, you'll have the expertise to create AI systems that can intelligently interact with video content.
Throughout the course, you'll get hands-on and build a complete multimodal RAG system that:
- Processes and embeds video content (frames, transcripts, and captions)
- Stores multimodal data in a vector database
- Retrieves relevant video segments given text queries
- Generates contextual responses using LVLMs
- Maintains multi-turn conversations about video content
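The last two capabilities above, generating a response from a retrieved segment and keeping a multi-turn conversation, can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the course's implementation: `VideoSegment`, `Conversation`, `build_prompt`, `ask`, and `stub_lvlm` are hypothetical names, the file path is invented, and `stub_lvlm` is a placeholder where a real system would call a hosted LVLM.

```python
from dataclasses import dataclass, field


@dataclass
class VideoSegment:
    """One retrieved unit of video: a frame plus the transcript around it."""
    video_id: str
    timestamp_s: float
    frame_path: str
    transcript: str


@dataclass
class Conversation:
    """Ordered (role, text) turns kept across questions."""
    turns: list = field(default_factory=list)

    def add(self, role, text):
        self.turns.append((role, text))


def build_prompt(question, segment, conversation):
    """Combine the segment transcript, prior turns, and the new question."""
    lines = [f"Transcript of the retrieved video segment: {segment.transcript}"]
    lines += [f"{role}: {text}" for role, text in conversation.turns]
    lines.append(f"user: {question}")
    return "\n".join(lines)


def stub_lvlm(prompt, image_path):
    # Assumption: placeholder for a real LVLM call that would receive
    # the prompt text together with the frame image.
    return f"[answer based on {image_path} and {len(prompt.splitlines())} prompt lines]"


def ask(question, segment, conversation, lvlm=stub_lvlm):
    """Answer one question and record both sides of the exchange."""
    prompt = build_prompt(question, segment, conversation)
    answer = lvlm(prompt, segment.frame_path)
    conversation.add("user", question)
    conversation.add("assistant", answer)
    return answer


segment = VideoSegment(
    video_id="demo_video",
    timestamp_s=12.5,
    frame_path="frames/frame_0012.jpg",  # hypothetical path
    transcript="The astronaut describes the view of Earth.",
)
chat = Conversation()
first = ask("What is happening here?", segment, chat)
second = ask("What did they say about Earth?", segment, chat)
print(first)
print(second)
```

The second call sees the first question and answer in its prompt, which is what lets a follow-up question refer back to earlier turns.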
Whether you're looking to enhance content management systems, improve accessibility features, or push the boundaries of human-AI interaction, the techniques learned in this course will provide a solid foundation for innovation in multimodal AI applications.
- Create a sophisticated question-answering system that processes, understands, and interacts with complex multimodal data.
- Explore the concept of multimodal semantic space and its importance in AI.
- Learn the differences between traditional RAG and multimodal RAG systems, focusing on the complexities of integrating different models.
Lesson | Video | Code |
---|---|---|
Introduction | video | |
Interactive Demo and Multimodal RAG System Architecture | video | code |
Multimodal Embeddings | video | code |
Preprocessing Videos for Multimodal RAG | video | code |
Multimodal Retrieval from Vector Stores | video | code |
Large Vision-Language Models (LVLMs) | video | code |
Multimodal RAG with Multimodal LangChain | video | code |
Conclusion | video | |