Authors: Pereira, Francisco José Batista; Marouvo, Gonçalo Ventura Lourenço
Dates: 2025-03-17; 2025-03-17; 2025-02-03
Handle: http://hdl.handle.net/10400.26/57302

Abstract: Comics represent a complex way in which humans communicate and expose ideas, which poses additional challenges for image-to-text deep learning models. In this project, we investigate how multimodal deep learning architectures perform in describing a comics vignette. We examine how current state-of-the-art models (GIT and BLIP-2) are able to describe the narrative in four-image comics sequences from a dataset we created. We find that some prompting can produce acceptable results. We also assess how to propagate information across the sequence's images by adding the outputs generated for previous images in the same sequence to the prompts. The results show limited improvements from this strategy. While the overall meaning of the predicted descriptions is close to the semantic space of the real descriptions, they are still far from human-level descriptions. Therefore, we propose several future experiments, in which we highlight reinforcement learning to train a large language model as a policy function for prompt generation.

Language: eng
Keywords: Comics; Computer vision; Image captioning; Multimodal deep learning models; Prompt engineering
Title: Capturing the narrative: deep learning models for comics sequences
Type: master thesis
203894898
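
The prompt-propagation strategy described in the abstract can be illustrated with a minimal sketch: caption each panel with BLIP-2 (via the Hugging Face transformers library) and feed the captions of earlier panels back into the prompt for later panels. The checkpoint name, the prompt wording, and the describe_sequence helper are assumptions for illustration only, not the thesis' actual code.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Checkpoint name is an assumption; any BLIP-2 checkpoint would work here.
model_name = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_name)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=dtype
).to(device)

def describe_sequence(image_paths):
    """Caption each panel in order, carrying earlier captions into later prompts."""
    captions = []
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        # Hypothetical prompt wording: prepend the captions of previous panels
        # so the model can condition on the narrative so far.
        context = " ".join(captions)
        prompt = (f"Previous panels: {context} Describe this panel:"
                  if context else "Describe this panel:")
        inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
        output_ids = model.generate(**inputs, max_new_tokens=60)
        captions.append(processor.decode(output_ids[0], skip_special_tokens=True).strip())
    return captions

# Example usage on a four-panel sequence (file names are placeholders):
# captions = describe_sequence(["panel1.png", "panel2.png", "panel3.png", "panel4.png"])
```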