Diary - LLM distillation evidence!!
08 Aug 2024
I asked various LLMs (Large Language Models) for a BFS (Breadth-First Search) example in Korean.
“Python으로 너비우선탐색(BFS) 함수 작성해줘. (=Write a BFS function in Python.)”
The responses from the models varied slightly, but the example graph provided was identical:
graph = {
    'A': ['B', 'C'],
    'B': ['A', 'D', 'E'],
    'C': ['A', 'F'],
    'D': ['B'],
    'E': ['B', 'F'],
    'F': ['C', 'E']
}
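For context, the BFS function the models wrap around this graph is the standard queue-based traversal. Here is a typical textbook version (a sketch of the common pattern, not any single model's verbatim output):

```python
from collections import deque

# The graph shared verbatim by the models listed below.
graph = {
    'A': ['B', 'C'],
    'B': ['A', 'D', 'E'],
    'C': ['A', 'F'],
    'D': ['B'],
    'E': ['B', 'F'],
    'F': ['C', 'E']
}

def bfs(graph, start):
    """Return nodes in breadth-first order starting from `start`."""
    visited = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.append(neighbor)
                queue.append(neighbor)
    return visited

print(bfs(graph, 'A'))  # ['A', 'B', 'C', 'D', 'E', 'F']
```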
The models confirmed so far are:
gpt-3.5-turbo-0125
gpt-4o-2024-08-06
gpt-4o-mini-2024-07-18
Llama-3.1-405B
llama-3.1-70b-instruct
llama-3.1-8b-instruct
llama-3-70b-instruct
gemma-2-2b-it
gemma-2-9b-it
gemma-2-27b-it
gemini-1.5-flash-api-0514
gemini-1.5-pro-api-0514
claude-3-5-sonnet-20240620
claude-3-opus-20240229
yi-large
yi-large-preview
phi-3-mini-4k-instruct-june-2024
phi-3-medium-4k-instruct
This totals 18 models. There might be more, but it’s challenging to check each one on LMSys. However, this sample likely covers the most popular models.
The Qwen series produced a different graph structure, but it’s essentially the same example.
graph = {
    'A': ['B', 'C'],
    'B': ['D', 'E'],
    'C': ['F'],
    'D': [],
    'E': ['F'],
    'F': []
}
qwen-max-0428
qwen2-72b-instruct
qwen1.5-110b-chat
Some models did produce different examples:
llama-3-8b-instruct
yi-1.5-34b-chat
mixtral-8x22b-instruct-v0.1
phi-3-small-8k-instruct
What does it mean?
What could be causing this? (At least 18 of the most well-known models provided the same response.)
I think there are two hypotheses:
- There is a famous code dataset that includes this BFS example.
- Data augmentation using a specific model was performed (distillation).
I checked various datasets to confirm the first hypothesis, but I couldn’t find this example. Of course, it might still exist somewhere. However, even if such a dataset exists, why is the same example being used? Does this imply some form of overfitting?
Additionally, the same example appears for DFS (Depth-First Search) as well. When I asked GPT-4o for DFS instead of BFS, it returned the same graph as the Qwen series!
So, they’re essentially sharing the same examples.
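To make the DFS observation concrete, here is a typical recursive DFS over the Qwen-style graph that GPT-4o returned, again a textbook sketch rather than any model's verbatim output:

```python
# The directed graph from the Qwen series, which GPT-4o
# also returned when asked for DFS instead of BFS.
graph = {
    'A': ['B', 'C'],
    'B': ['D', 'E'],
    'C': ['F'],
    'D': [],
    'E': ['F'],
    'F': []
}

def dfs(graph, start, visited=None):
    """Return nodes in depth-first (preorder) traversal from `start`."""
    if visited is None:
        visited = []
    visited.append(start)
    for neighbor in graph[start]:
        if neighbor not in visited:
            dfs(graph, neighbor, visited)
    return visited

print(dfs(graph, 'A'))  # ['A', 'B', 'D', 'E', 'F', 'C']
```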
What is the root model?
If the second hypothesis is true, what is the root model? If the root model is ChatGPT, does this reveal that GPT distillation has been happening behind the scenes?
The results of the models