Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?