Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

arXiv · June 28, 2024 · Significant research

Summary

MBZUAI researchers introduce Web2Code, a new large-scale dataset and evaluation framework for training and benchmarking multimodal LLMs on webpage understanding and HTML code generation. The dataset includes webpage images, HTML code, and QA pairs about webpage content. Experiments demonstrate the dataset's utility in webpage understanding, code generation, and general visual domain tasks. Code and data are available on GitHub.

Keywords

MLLM · Web2Code · HTML · MBZUAI · Dataset


Related

Web2Code: A new dataset to enhance multimodal LLM performance presented at NeurIPS

MBZUAI

MBZUAI researchers introduced Web2Code, a new dataset suite, at NeurIPS to enhance multimodal LLM performance in webpage analysis and HTML generation. The suite includes a fine-tuning dataset and two benchmark datasets. Instruction tuning with Web2Code improved performance on specialized tasks without affecting general capabilities. Why it matters: This contribution addresses a key limitation in current multimodal LLMs, potentially boosting productivity in web design and development by providing targeted training data.

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv · Jun 8

Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.
