VQA — 明夜缘 · Vision-Language Question Answering System

A vision-language question answering system by Ming, Ye, and Yuan, completed in April 2023.

The system answers natural language questions about images. It combines Vision Transformers with pretrained language models to interpret the input image and the question together, then generates an answer in natural language.

Image + Question Input – Users upload an image and type a question about it.

Answer Generation – The system processes both modalities and outputs a natural language answer.

Demo Examples – For instance, given a photo and the question “What is the girl’s hair color?”, it may answer “Her hair color is blue.”
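
To make the two-modality input more concrete, here is a minimal sketch of how an image and a question might be turned into one sequence for a multimodal encoder. It is an illustration only, not the project's actual code: the helper names (patchify, tokenize, build_joint_sequence), the 2x2 patch size, the whitespace tokenizer, and the simple concatenation with a separator are all assumptions. The only idea taken from Vision Transformers is splitting the image into non-overlapping patches.

```python
from typing import List, Optional, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel values


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into flattened, non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one tile in row-major order.
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches


def tokenize(question: str) -> List[str]:
    """Hypothetical whitespace tokenizer standing in for a real subword tokenizer."""
    return question.lower().replace("?", " ?").split()


def build_joint_sequence(
    image: Image, question: str, patch: int = 2
) -> List[Tuple[str, Optional[object]]]:
    """Return image patches, a separator, then question tokens as one sequence."""
    sequence: List[Tuple[str, Optional[object]]] = [
        ("img", p) for p in patchify(image, patch)
    ]
    sequence.append(("sep", None))  # assumed boundary marker between modalities
    sequence += [("txt", t) for t in tokenize(question)]
    return sequence


# Toy 4x4 image whose pixel values are 0.0 .. 15.0 in row-major order.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
seq = build_joint_sequence(image, "What is the girl's hair color?")
```

In a real system each patch and each token would be mapped to an embedding vector before being fed to the model, and the answer would be produced by a decoder or a classification head. This sketch stops at the input side.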

The project won second prize in the university competition.
