Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs