Survey - Multimodal LLM fine-tuning dataset

03 Jul 2024

Following up on the Multimodal LLM survey, I’ve organized the details and links of the datasets used.

Category	Name	Count	Models used	Notes	Link
Instruction	LLaVA-Instruct-150K	150K	LLaVA-1.5, InstructBLIP, MiniGPT4-v2	Composed of conversation, detail description, complex reasoning	https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K
	M4-Instruct	Multi-image 594K Single-image 307K Video 262K 3D 99.5K	LLaVA-NeXT	SOTA model’s dataset	https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data
	Shikra (GPT4Gen_BoxCoT)	6K	Shikra		https://github.com/shikras/shikra/tree/main?tab=readme-ov-file
	MiniGPT4	3.5K	MiniGPT4	Simple caption describing instruction	https://github.com/Vision-CAIR/MiniGPT-4/blob/main/MiniGPT4_Train.md
	MiniGPT4-v2 multitask conversation	12K	MiniGPT4-v2	Multi-turn multi-task	https://github.com/Vision-CAIR/MiniGPT-4/blob/main/dataset/README_MINIGPTv2_FINETUNE.md
	VQA	4.4M		Short answer question	https://visualqa.org/download.html
	OK-VQA	14K		VQA requiring external knowledge	https://okvqa.allenai.org/
Image-caption-grounded pair	GRIT	20M	Ferret	Bounding box link to both “noun phrase” and “image segment”	https://huggingface.co/datasets/zzliang/GRIT?row=0
	VG (Visual Genome)	1.7M		Specifies object region, name, bounding box	https://huggingface.co/datasets/ranjaykrishna/visual_genome
Image-answer pair	CLEVR	16K		Visual reasoning (Color, size, count…)	https://cs.stanford.edu/people/jcjohns/clevr/
Image-caption pair	Cap COCO	330K			https://huggingface.co/datasets/HuggingFaceM4/COCO
	SBU captions	860K			https://huggingface.co/datasets/vicenteor/sbu_captions
	Flickr30K	30K			https://huggingface.co/datasets/nlphuji/flickr30k

Category	Name	Count	Models used	Notes	Link
Instruction	KoLLaVA-Instruct-150K	150K	koLLaVA	LLaVA-Instruct-150K translated with DeepL	https://huggingface.co/datasets/tabtoyou/KoLLaVA-Instruct-150k
	M3IT-80	1K		M3IT translation	https://huggingface.co/datasets/MMInstruction/M3IT-80
	Visual information-based Q&A	7.5M		AI hub	https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=104
	KVQA	10K		SKT dataset (there may be more)	https://github.com/SKTBrain/KVQA?tab=readme-ov-file
Image-caption pair	KoCC12M	12M		CC12M translation	https://huggingface.co/datasets/QuoQA-NLP/KoCC12M

Annotated: a form comparing two answers
Synthetic: a form where, based on one answer, the other is generated via a rule-based/LLM-based method

Category	Name	Count	Models used	Notes	Link
Human Annotated	LLaVA-Human-Preference-10K	9.42K			https://huggingface.co/datasets/zhiqings/LLaVA-Human-Preference-10K
	RLHF-V	5.73	RLHF-V		https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset
AI annotated	RLAIF-V	33.8K	RLAIF-V	Divide & Conquer style LLaVA-NeXT annotation	https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset
	Silkie	80K	Silkie, mDPO	GPT-4V annotation	https://huggingface.co/datasets/MMInstruction/VLFeedback
	LLaVA-Hound-DPO	17K	LLaVA-Hound-DPO	Video support GPT-4V video captioning chatGPT annotation	https://huggingface.co/ShareGPTVideo/LLaVA-Hound-DPO
Synthetic modified	STIC-coco-preference-6k	6K	STIC	Only image description instruction	https://huggingface.co/datasets/STIC-LVLM/stic-coco-preference-6k
	POVID-preference-data	17.2K	POVID	Based on LLaVA-Instruct	https://huggingface.co/datasets/YiyangAiLab/POVID_preference_data_for_VLLMs
	BPO	188K	BPO	Based on ShareGPT4V, COCO, LLaVA-Instruct	https://huggingface.co/datasets/renjiepi/BPO?row=0
	HALVA	21.7K	HALVA	Based on Visual Genome	https://github.com/FuxiaoLiu/LRV-Instruction/blob/main/download.txt#L28

Related posts