Building with Deep Generative Models: Hands-on Science
GenAI Course Page
Dr. Alexander (Sasha) Apartsin
Multimedia processing refers to the analysis, generation, and manipulation of multiple data modalities, such as audio, images, and video, using computational techniques. It integrates signal processing, machine learning, and computer vision to enable understanding of, and interaction with, complex multimedia content.
Modern deep generative models, trained on large multimodal datasets, can synthesize realistic samples conditioned on chosen attributes. They can also create diverse, labeled datasets that enable training machine-learning systems when annotated data are scarce or unavailable.
The course takes a code-first approach, pairing every concept with extensive hands-on examples that use modern libraries, including PyTorch, librosa, and Hugging Face Transformers. It equips students with a practical toolbox of ideas and techniques for rapidly building GenAI-based models and applications.
The syllabus is designed so that students can begin their projects while still learning the material, enriching them with new concepts as the course progresses. Each team will give several in-class presentations for discussion and feedback.
As standard tasks are increasingly handled by AI and mature libraries, expectations of professional developers shift toward innovation and rapid integration. Accordingly, a key requirement for student projects is to tackle new use cases by generating unique data and training or fine-tuning a task-specific audio or image processing model.
The list below presents the complete set of subjects; the coverage in any individual course instance may vary with the course format, students’ backgrounds, and class dynamics.