Tian Ji's Horse Race (田忌赛马)

Goodhart's Law on Benchmarks

| Capability | Description | miniG | Gemini-Flash | GLM-4-9B-Chat | Llama 3.1 8B Instruct |
| --- | --- | --- | --- | --- | --- |
| MMLU | Multiple-choice questions across 57 subjects (incl. STEM, humanities, and others) | 85.45 | 78.9 | 72.4 | 69.4 |
| IFEval | Evaluation of instruction-following using verifiable prompts | 74.22 | - | 69 | 80.4 |
| GSM8K | Challenging math problems (5-shot unless noted) | 75.89 (5-shot) | 86.2 (11-shot) | 79.6 | 84.5 (8-shot CoT) |
| HumanEval | Python code generation on a held-out dataset (0-shot) | 79.88 | 74.3 | 71.8 | 72.6 |
| GPQA | Challenging questions from biology, physics, and chemistry | 37.37 | 39.5 | 34.3 (base) | 34.2 |
| Context Window | Maximum context length the model can handle | 1M | 1M | 128K | 128K |
| Input | Supported input modalities | Text, image | Text, image, audio, video | Text only | Text only |
1. miniG is a 14B-parameter model derived from the weights of the 9B-parameter glm-4-9b-chat-1m. It was further pre-trained on a selected corpus of 20B tokens while retaining long-context capability, then fine-tuned on a dataset of 120M+ conversation entries synthesized from that corpus through RAG-style cross-page clustering. Additionally, miniG underwent two-stage multimodal training for single-image input, with the second stage reinitializing 5B parameters of a Vision Transformer from glm-4v-9b for Locked-Image Tuning (see the sketch after these notes).
2. miniG's outputs are formatted similarly to Gemini 1.5 Flash's, but the model was not trained on data generated by the Gemini models.
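
For readers unfamiliar with Locked-Image Tuning, the PyTorch sketch below shows the core idea under stated assumptions: the vision tower is initialized from pretrained weights and frozen ("locked"), so training updates only the projector and the language-model side. The names here (`TinyVLM`, `vision_tower`, `projector`, `lock_image_tower`) are hypothetical placeholders for illustration, not miniG's actual architecture or code.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Toy vision-language model: locked vision tower + trainable LM side."""
    def __init__(self, vision_tower: nn.Module, language_model: nn.Module, dim: int):
        super().__init__()
        self.vision_tower = vision_tower      # e.g. a ViT reinitialized from glm-4v-9b
        self.projector = nn.Linear(dim, dim)  # maps image features into the LM embedding space
        self.language_model = language_model

    def forward(self, image_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        img_tokens = self.projector(self.vision_tower(image_feats))
        # Prepend projected image tokens to the text sequence (heavily simplified).
        return self.language_model(torch.cat([img_tokens, text_embeds], dim=1))

def lock_image_tower(model: TinyVLM) -> None:
    # Locked-Image Tuning: freeze every vision-tower parameter so that
    # gradients flow only into the projector and the language model.
    for p in model.vision_tower.parameters():
        p.requires_grad = False

# Usage: stand-in linear layers play the two towers; only the non-locked
# parameters are handed to the optimizer.
dim = 64
model = TinyVLM(nn.Linear(dim, dim), nn.Linear(dim, dim), dim)
lock_image_tower(model)
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
loss = model(torch.randn(2, 4, dim), torch.randn(2, 8, dim)).mean()
loss.backward()
optimizer.step()  # vision-tower weights stay exactly as initialized
```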