| Name: | Description: | Size: | Format: |
|---|---|---|---|
| | | 8.93 MB | Adobe PDF |
Abstract(s)
Comics represent a complex way in which humans communicate and express ideas, and this poses additional challenges for image-to-text deep learning models. In this project, we investigate how multimodal deep learning architectures perform at describing a comic vignette. Specifically, we examine how current state-of-the-art models (GIT and BLIP-2) describe the narrative in 4-image comic sequences from a dataset we created. We find that some prompting can produce acceptable results.
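As a concrete illustration, a single prompted-captioning call to BLIP-2 through the Hugging Face `transformers` library might look like the sketch below. The checkpoint name, file name, prompt wording, and decoding settings are assumptions for illustration; the abstract does not fix them.

```python
# A minimal sketch of prompted captioning with BLIP-2 via Hugging Face
# transformers. The checkpoint, prompt, and generation settings are
# illustrative assumptions, not the thesis's exact configuration.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

image = Image.open("panel_1.png").convert("RGB")  # hypothetical panel file
prompt = "Question: what is happening in this comic panel? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
ids = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(ids, skip_special_tokens=True)[0].strip())
```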
We also assess how to propagate information across a sequence's images by adding the model's outputs for earlier images in the same sequence to the prompt. The results show only limited improvements from this strategy.
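A minimal sketch of this propagation strategy, reusing the `processor` and `model` from the previous snippet, is the loop below. The prompt template and the way earlier captions are concatenated are assumptions, since the abstract does not specify them.

```python
import torch
from PIL import Image

def describe_sequence(panel_paths, processor, model, device="cuda"):
    """Caption each panel, feeding earlier captions back into the prompt.

    Illustrative sketch of the propagation strategy; the actual prompt
    template used in the project is an assumption here.
    """
    captions = []
    for i, path in enumerate(panel_paths):
        image = Image.open(path).convert("RGB")
        # Prepend the captions already produced for this sequence.
        context = " ".join(f"Panel {j + 1}: {c}" for j, c in enumerate(captions))
        prompt = (f"{context} Question: what happens in panel {i + 1} "
                  "of this comic strip? Answer:")
        inputs = processor(images=image, text=prompt, return_tensors="pt").to(
            device, torch.float16
        )
        ids = model.generate(**inputs, max_new_tokens=60)
        captions.append(
            processor.batch_decode(ids, skip_special_tokens=True)[0].strip()
        )
    return captions

# Hypothetical 4-panel sequence:
# describe_sequence([f"panel_{k}.png" for k in range(1, 5)], processor, model)
```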
While the overall meaning of the predicted descriptions is close to the semantic space of the reference descriptions, it still falls far short of human-level descriptions. We therefore propose several future experiments, highlighting reinforcement learning to train a large language model as a policy function for prompt generation.
Keywords
Comics; Computer vision; Image captioning; Multimodal deep learning models; Prompt engineering
