A vision-language question answering system by Ming, Ye, and Yuan, completed in April 2023.
明夜缘
—
This is an image-based question answering system. It combines Vision Transformers with pretrained language models to understand both an input image and a natural language question, then generate a meaningful answer.
- Image + Question Input – Users upload an image and type a question about it.
- Answer Generation – The system processes both modalities and outputs a natural language answer.
- Demo Example – Given a photo and the question “What is the girl’s hair color?”, the system may answer “Her hair color is blue.”
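
To make the input side of this pipeline concrete, here is a minimal sketch of how an image and a question might be turned into tokens before a model sees them. It is illustrative only and is not the project's code. The patch-splitting step follows the general Vision Transformer idea of cutting an image into fixed-size patches. The function names (`split_into_patches`, `tokenize_question`, `build_model_input`), the whitespace tokenizer, and the way the two token lists are packaged together are all assumptions made for this example.

```python
from typing import Dict, List

# Simplification for this sketch: a grayscale image as rows of pixel values.
Image = List[List[float]]


def split_into_patches(image: Image, patch_size: int) -> List[List[float]]:
    """Cut the image into non-overlapping square patches and flatten each one.

    This mirrors how a Vision Transformer prepares its input: every patch
    becomes one vector ("image token"), read row by row.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


def tokenize_question(question: str) -> List[str]:
    """Hypothetical stand-in for a real subword tokenizer: lowercase,
    drop trailing question marks, split on whitespace."""
    return question.lower().rstrip("?").split()


def build_model_input(image: Image, question: str, patch_size: int) -> Dict[str, list]:
    """Assumed packaging: image tokens and text tokens side by side,
    ready to be embedded and passed to a vision encoder and a language model."""
    return {
        "image_tokens": split_into_patches(image, patch_size),
        "text_tokens": tokenize_question(question),
    }
```

In a real system each patch vector and each text token would be mapped to a learned embedding, and a pretrained language model would generate the answer conditioned on both. Those stages are omitted here because the description above does not specify them.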
The project won second prize in the university competition.