Paper survey - Multimodal LLM

02 Jul 2024

Since the era of LLMs is gradually shifting into the era of Multimodal LLMs, I think it’s time for a follow-up.

I briefly organized

what kinds of Multimodal LLM architectures there are,
how instruction tuning for usability is carried out, and
how RLHF, which sparked the chatGPT hype, is applied.

Summary

Base multimodal LLM architectures

Category	Architecture	Method
Projection matrix	Frozen	trains only the vision encoder separately
	Kosmos	trains the visual encoder last layer and the LLM
	FROMAGe	trains only the projection matrix embedding to image is also possible
	LLaVA	trains the projection layer and the LLM 2-stage training
Cross attention	Flamingo	trains only the visual feature resampler and the LLM cross attention
	BLIP	trains a Q-former that converts visual features into LLM embeddings
Adaptation prompt	LLaMA-Adapter	trains the projection matrix and prompt embedding
Image tokenizer	Chameleon	tokenizes the image latent vector using a Codebook

Instruction-tuning methods

Performs SFT of a vision-text instruction dataset based on a pre-trained mLLM
Differs in multi-stage training / freezing modules / architecture, etc.
For image grounded tasks, methods that support bounding box (or free-form) information

Category	Name	Architecture	Method	Dataset / Quantity
Projection matrix	LLaVA-1.5	LLaVA	MLP projection layer High resolution (336x336)	LLaVA-Instruct-150K / 150K
	LLaVA-NeXT	LLaVA	supports video input Backbone upgrade (supports up to Qwen1.5 110B)	M4Instruct / 1M
	Shikra	LLaVA	focuses on image grounded tasks	LLaVA-Instruct-150K + Shikra RD 6K / 156K
	Ferret (Apple)	LLaMA + visual encoder	focuses on free-form region image grounded tasks	GRIT / 34K
	Fuyu-8B	Fuyu	handles arbitrary image resolution	Unknown
Cross attention	OpenFlamingo (Deepmind)	Flamingo	open-source replica of Flamingo	LAION-2B / 2B Multimodal C4 / 101M Synthetic / 417K
	InstructBLIP (Salesforce)	BLIP	Q-former instruct tuning	COCO caption / 82K Web CapFilt / 14M TextCaps / 21K VQAv2 / 82K OKVQA / 9K A-OKVQA / 17K OCR-VQA / 800K
	MiniGPT-4	BLIP	Q-former freeze, additional linear layer training	Curated image-description pair / 3.5K
	MiniGPT-v2 (Meta)	BLIP	enhances task-specific performance through 7 task identifier tokens	LLaVA-Instruct / 81K Filcker / 5.5K
	Qwen-VL	BLIP (similar)	3 stage training	Custom / 350K
Adaptation prompt	LLaMA-Adapter V2	LLaMA-Adapter	2 stage training separates visual / instruction learning parameters	Text-only instruction / 52K COCO caption / 567K

Preference alignment methods

The alignment methods are all used as ways to reduce hallucination
All use LLaVA-based models (it’s unclear whether the reason for not using other architectures is an engineering issue or because LLaVA’s performance is overwhelming)
LoRA finetuning is commonly used

Training-method category	Subcategory	Name	Method	Dataset / Quantity
PPO		LLaVA-RLHF (Berkeley)	10K Human annotated preference dataset ($3000) performs RLHF pipeline	<SFT> Conversation / 98K VQA-v2 / 83K A-OKVQA / 16K Flicker / 23K <RLHF> Human annotated / 10K
DPO		RLHF-V (Tsinghua)	1.4K Human modified preference dataset directly fixes only the parts where hallucination occurred DDPO (DPO with higher weights on changed tokens)	Human modified / 1.4K
		Silkie (CUHK)	80K GPT-4V annotation → 380K pairs (VLFeedback dataset)	GPT-4V annotated / 80K
		LLaVA-Hound-DPO (CMU, Bytedance)	LLaVA-NeXT supporting video 80K GPT-4V video captioning 240K chatGPT question gen 240K chatGPT reward tagging (replace video to caption) → 20$	chatGPT generated / 17K
		mDPO (MS)	solves the problem that existing DPO can’t do image conditioning adds a conditioning loss using image / non-image pairs	Silkie / 10K
		RLAIF-V (Tsinghua)	shows that using Divide & Conquer for reward tagging makes LLaVA-NeXT sufficient instead of GPT-4V	Diverse set / 4K
	Synthetic pair data	STIC (UCLA)	Good prompt → generates chosen Bad prompt / corrupt image → generates rejected SFT with a normal prompt after DPO	<DPO> MSCOCO / 6K <SFT> LLaVA’s SFT / 5K
		POVID (Stanford)	Dis-preferred prompt / noisy image → generates rejected Text Hal DPO (3 epochs) → Noisy image Hal DPO (1 epoch)	LLaVA-Instruct / 17K
		BPO (HKUST)	Image gaussian noise / text error injection → generates rejected making the negative differ only in the hallucination part like this gives the highest performance	ShareGPT-V / 58K LLaVAR / 55K LLaVA-Instruct / 54K
		HSA-DPO (Zhejiang Univ.)	hallucination detection and scoring using 6K GPT-4V trains a detection and scoring model performs HSA-DPO (DPO with weights increased when the hallucination score is high)	Visual Genome / 8K
Contrastive learning		HALVA (Google)	swaps out the object words of the chosen (using an LLM) → generates rejected uses contrastive loss	Visual Genome / 21.5K

Comments