Multimedia systems combine intelligent processing algorithms with audio–video modeling to interpret and manipulate complex media streams. By leveraging machine learning, pattern recognition, and modern media representations, they can classify events, track objects, enhance signals, and support interactive, content-aware applications.
Modern deep generative models, trained on large multimodal datasets, can synthesize realistic samples conditioned on chosen attributes. They can also produce diverse, labeled datasets for training machine-learning systems when annotated data are scarce or unavailable.
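As one illustration of conditioning and synthetic labeling, the sketch below generates images from class-specific prompts and keeps each prompt's class as the label. It assumes the diffusers library and a public Stable Diffusion checkpoint, neither of which is named above; any conditional generative model would serve the same role.

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint (an assumption, not course material); any
# text-conditioned generative model fills the same role.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Condition generation on chosen attributes via the prompt, and reuse the
# prompt's class as the label, yielding a small synthetic, labeled dataset.
dataset = []
for label in ["acoustic guitar", "grand piano"]:
    image = pipe(f"a studio photo of a {label}").images[0]
    dataset.append((image, label))
```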
The course takes a code-first approach, pairing each concept with extensive, hands-on examples built on modern libraries, including PyTorch, librosa, and HuggingFace Transformers, and equips students with a practical toolbox of ideas and techniques for rapidly building GenAI-based models and applications.
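For a taste of that code-first style, here is a minimal, self-contained sketch: librosa turns a bundled audio clip into a log-mel spectrogram, and a toy PyTorch network maps it to class logits. The clip, layer sizes, and four-class head are illustrative assumptions rather than course code.

```python
import librosa
import numpy as np
import torch
from torch import nn

# Load a clip that ships with librosa (resampled to 22,050 Hz by default).
y, sr = librosa.load(librosa.example("trumpet"))

# Compute a log-mel spectrogram, a standard front end for audio models.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
x = torch.from_numpy(librosa.power_to_db(mel, ref=np.max)).float()
x = x.unsqueeze(0).unsqueeze(0)  # (batch, channels, mels, frames)

# A toy convolutional classifier over the spectrogram; the four output
# classes are a placeholder for, e.g., audio event categories.
net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 4),
)
print(net(x).shape)  # torch.Size([1, 4])
```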
The syllabus is designed so that students can begin their projects while still learning the material; as the course progresses, they enrich their projects with the concepts they acquire. Each team will give several in-class presentations for discussion and feedback.
As standard tasks are increasingly handled by AI and mature libraries, the expectations placed on professional developers shift toward innovation and rapid integration. Accordingly, a key requirement for student projects is to tackle a new use case by generating unique data and training or fine-tuning a task-specific audio or image processing model, along the lines of the sketch below.
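A minimal sketch of that generate-then-fine-tune workflow, assuming torchvision (not named above) and using random tensors as a stand-in for a synthetic dataset: only a new task-specific head on a pretrained backbone is trained.

```python
import torch
from torch import nn
from torchvision import models

# Stand-in for a synthetic dataset: random images whose labels would, in a
# real project, come from the generation prompts.
images = torch.randn(16, 3, 224, 224)
labels = torch.randint(0, 2, (16,))

# Freeze a pretrained backbone and attach a new task-specific head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)

# Fine-tune only the head on the synthetic data.
opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(3):
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```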
The list below presents the complete set of subjects; individual course instances may vary depending on the course format, students’ backgrounds, and class dynamics.