Opening up ChatGPT: LLM openness leaderboard

⚡FAccT'24 paper⚡ Liesenfeld, Andreas, and Mark Dingemanse. 2024. ‘Rethinking Open Source Generative AI: Open-Washing and the EU AI Act’. In The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24). Rio de Janeiro, Brazil: ACM. (PDF).

There is a growing number of instruction-tuned text generators billing themselves as 'open source'. How open are they really? 🔗CUI'23 🔗PDF 🔗repo

Project	Availability						Documentation						Access
(maker, bases, URL)	Open code	LLM data	LLM weights	RL data	RL weights	License	Code	Architecture	Preprint	Paper	Modelcard	Datasheet	Package	API
OLMo 7B Instruct	✔︎	✔︎	✔︎	✔︎	✔︎	✔︎	✔︎	✔︎	✔︎	✘	✔︎	✔︎	✔︎	~
AllenAI	LLM base: OLMo 7B			RL base: OpenInstruct										12.5
BLOOMZ	✔︎	✔︎	✔︎	✔︎	~	~	✔︎	✔︎	✔︎	✔︎	✔︎	✔︎	✘	✔︎
bigscience-workshop	LLM base: BLOOMZ, mT0			RL base: xP3										12.0
AmberChat	✔︎	✔︎	✔︎	✔︎	✔︎	✔︎	~	~	✔︎	✘	~	~	✘	✔︎
LLM360	LLM base: Amber			RL base: ShareGPT + Evol-Instruct (synthetic)										10.0
Open Assistant	✔︎	✔︎	✔︎	✔︎	✘	✔︎	✔︎	✔︎	~	✘	✘	✘	✔︎	✔︎
LAION-AI	LLM base: Pythia 12B			RL base: OpenAssistant Conversations										9.5
OpenChat 3.5 7B	✔︎	✘	✔︎	✘	✔︎	✔︎	~	✔︎	✔︎	✔︎	~	✘	✔︎	~
Tsinghua University	LLM base: Mistral 7B			RL base: ShareGPT with C-RLFT										9.5
Pythia-Chat-Base-7B-v0.16	✔︎	✔︎	✔︎	✔︎	✘	✔︎	✔︎	✔︎	~	✘	~	~	✔︎	✘
togethercomputer	LLM base: EleutherAI pythia			RL base: OIG										9.5
Cerebras GPT 111M Instruction	~	✔︎	✔︎	✔︎	✔︎	~	✘	✔︎	~	✘	✘	✔︎	✘	✔︎
Cerebras + Schramm	LLM base: Cerebras			RL base: Alpaca (synthetic)										8.5
RedPajama-INCITE-Instruct-7B	~	✔︎	✔︎	✔︎	✔︎	~	~	~	✘	✘	✔︎	✔︎	✘	~
TogetherComputer	LLM base: RedPajama-INCITE-7B-Base			RL base: various (GPT-JT recipe)										8.5
dolly	✔︎	✔︎	✔︎	✔︎	✘	✔︎	✔︎	✔︎	~	✘	✘	✘	✔︎	✘
databricks	LLM base: EleutherAI pythia			RL base: databricks-dolly-15k										8.5
Tulu V2 DPO 70B	✔︎	✘	~	✔︎	✔︎	~	~	~	✔︎	✘	~	~	✘	✔︎
AllenAI	LLM base: Llama2			RL base: Tulu SFT, Ultrafeedback										8.0
MPT-30B Instruct	✔︎	~	✔︎	~	✘	✔︎	✔︎	~	✘	✘	~	✘	✔︎	~
MosaicML	LLM base: MosaicML			RL base: dolly, anthropic										7.5
MPT-7B Instruct	✔︎	~	✔︎	~	✘	✔︎	✔︎	~	✘	✘	✔︎	✘	✔︎	✘
MosaicML	LLM base: MosaicML			RL base: dolly, anthropic										7.5
trlx	✔︎	✔︎	✔︎	~	✘	✔︎	✔︎	~	✘	✘	✘	✘	~	✔︎
carperai	LLM base: various (pythia, flan, OPT)			RL base: various										7.5
NeuralChat 7B	~	✘	✔︎	✔︎	✔︎	✔︎	~	~	✘	✘	~	~	~	✘
Intel	LLM base: Mistral 7B			RL base: Orca										7.0
Vicuna 13B v 1.3	✔︎	~	✔︎	✘	✘	~	✔︎	✘	✔︎	✘	~	✘	✔︎	~
LMSYS	LLM base: LLaMA			RL base: ShareGPT										7.0
minChatGPT	✔︎	✔︎	✔︎	~	✘	✔︎	✔︎	~	✘	✘	✘	✘	✘	✔︎
ethanyanjiali	LLM base: GPT2			RL base: anthropic										7.0
ChatRWKV	✔︎	~	✔︎	✘	✘	✔︎	~	~	~	✘	✘	✘	✔︎	~
BlinkDL/RWKV	LLM base: RWKV-LM			RL base: alpaca, shareGPT (synthetic)										6.5
BELLE	✔︎	~	~	~	~	✘	~	✔︎	✔︎	✘	✘	~	✘	✘
KE Technologies	LLM base: LLaMA & BLOOMZ			RL base: alpaca, shareGPT, Belle (synthetic)										6.0
Phi 3 Instruct	✘	✘	✘	✘	✔︎	✔︎	✘	✔︎	~	✘	✔︎	✘	~	✔︎
Microsoft	LLM base: Phi3			RL base: Unspecified										6.0
WizardLM 13B v1.2	~	✘	~	✔︎	✔︎	~	~	✔︎	✔︎	✘	✘	✘	✘	✘
Microsoft & Peking University	LLM base: LLaMA2-13B			RL base: Evol-Instruct (synthetic)										6.0
Airoboros L2 70B GPT4	~	✘	~	✔︎	✔︎	~	~	~	✘	✘	~	~	✘	✘
Jon Durbin	LLM base: Llama2			RL base: Airoboros (synthetic)										5.5
ChatGLM-6B	~	~	✔︎	✘	✘	✔︎	~	~	✘	~	✘	✘	✘	✔︎
THUDM	LLM base: GLM (own)			RL base: Unspecified										5.5
Mistral 7B-Instruct	~	✘	✔︎	✘	~	✔︎	✘	~	~	✘	✘	✘	~	✔︎
Mistral AI	LLM base: unclear			RL base: unspecified										5.5
WizardLM-7B	~	~	✘	✔︎	~	~	~	✔︎	✔︎	✘	✘	✘	✘	✘
Microsoft & Peking University	LLM base: LLaMA-7B			RL base: Evol-Instruct (synthetic)										5.5
Qwen 1.5	~	✘	✔︎	✘	✔︎	✘	~	~	✘	✘	✘	✘	~	✔︎
Alibaba Cloud	LLM base: QwenLM			RL base: Unspecified										5.0
StableVicuna-13B	~	✘	~	~	~	~	~	~	~	✘	~	✘	✘	~
CarperAI	LLM base: LLaMA			RL base: OASST1 (human), GPT4All (human), Alpaca (synthetic)										5.0
Falcon-40B-instruct	✘	~	✔︎	~	✘	✔︎	✘	~	~	✘	~	✘	✘	✘
Technology Innovation Institute	LLM base: Falcon 40B			RL base: Baize (synthetic)										4.5
UltraLM	✘	✘	~	✔︎	~	✘	✘	~	✔︎	✘	~	~	✘	✘
OpenBMB	LLM base: LLaMA2			RL base: UltraFeedback (part synthetic)										4.5
Yi 34B Chat	~	✘	✔︎	✘	✔︎	~	✘	✘	✔︎	✘	✘	✘	✘	~
01.AI	LLM base: Yi 34B			RL base: unspecified										4.5
Koala 13B	✔︎	~	~	~	✘	~	~	~	✘	✘	✘	✘	✘	✘
BAIR	LLM base: LLaMA 13B			RL base: HC3, ShareGPT, alpaca (synthetic)										4.0
Mixtral 8x7B Instruct	✘	✘	✔︎	✘	~	✔︎	✘	~	~	✘	✘	✘	~	✘
Mistral AI	LLM base: Mistral			RL base: Unspecified										4.0
Stable Beluga 2	✘	✘	~	✘	✔︎	~	✘	~	~	✘	~	✘	✘	~
Stability AI	LLM base: LLaMA2			RL base: Orca-style (synthetic)										4.0
Stanford Alpaca	✔︎	✘	~	~	~	✘	~	✔︎	✘	✘	✘	✘	✘	✘
Stanford University CRFM	LLM base: LLaMA			RL base: Self-Instruct (synthetic)										4.0
Falcon-180B-chat	✘	~	~	~	~	✘	✘	~	~	✘	~	✘	✘	✘
Technology Innovation Institute	LLM base: Falcon 180B			RL base: OpenPlatypus, Ultrachat, Airoboros (synthetic)										3.5
Orca 2	✘	✘	~	✘	✔︎	✘	✘	~	~	✘	~	✘	✘	~
Microsoft Research	LLM base: LLaMA2			RL base: FLAN, Math, undisclosed (synthetic)										3.5
Command R+	✘	✘	✘	✔︎	✔︎	~	✘	✘	✘	✘	~	✘	✘	✘
Cohere AI	LLM base:			RL base: Aya Collection										3.0
Gemma 7B Instruct	~	✘	~	✘	~	✘	✘	~	✘	✘	✔︎	✘	✘	✘
Google DeepMind	LLM base: Gemma			RL base: Unspecified										3.0
LLaMA2 Chat	✘	✘	~	✘	~	✘	✘	~	~	✘	~	✘	✘	~
Facebook Research	LLM base: LLaMA2			RL base: Meta, StackExchange, Anthropic										3.0
Nanbeige2-Chat	✔︎	✘	✘	✘	✔︎	~	✘	✘	✘	✘	✘	✘	✘	~
Nanbeige LLM lab	LLM base: Unknown			RL base: Unknown										3.0
Llama 3 Instruct	✘	✘	~	✘	~	✘	✘	~	✘	✘	~	✘	✘	~
Facebook Research	LLM base: Meta Llama 3			RL base: Meta, undocumented										2.5
Solar 70B	✘	✘	~	✘	~	✘	✘	✘	✘	✘	~	✘	✘	~
Upstage AI	LLM base: LLaMA2			RL base: Orca-style, Alpaca-style										2.0
Xwin-LM	✘	✘	~	✘	✘	✘	✘	✘	✘	✘	✘	✘	✘	~
Xwin-LM	LLM base: LLaMA2			RL base: unknown										1.0
ChatGPT	✘	✘	✘	✘	✘	✘	✘	✘	~	✘	✘	✘	✘	✘
OpenAI	LLM base: GPT 3.5			RL base: Instruct-GPT										0.5

How to use this table. Every cell records a three-level openness judgement (✔︎ open, ~ partial or ✘ closed) with a direct link to the available evidence; on hover, the cell will display the notes we have on file for that judgement. The name of each project is a direct link to source data. The table is sorted by cumulative openness, where ✔︎ is 1, ~ is 0.5 and ✘ is 0 points. Note that RL may refer to RLHF or other forms of fine-tuning aimed at fostering instruction-following behaviour.
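The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the project's own code: the names `POINTS`, `openness_score` and `rank` are hypothetical, and the rows are assumed to be available as lists of the 14 judgement symbols.

```python
# Points per judgement, as stated in the text: open = 1, partial = 0.5, closed = 0.
POINTS = {"✔": 1.0, "~": 0.5, "✘": 0.0}


def openness_score(cells):
    """Sum the points for one project's row of judgement symbols."""
    total = 0.0
    for cell in cells:
        # Check marks copied from the table may carry a trailing variation
        # selector (U+FE0E or U+FE0F); strip it before the lookup.
        symbol = cell.strip().replace("\ufe0e", "").replace("\ufe0f", "")
        total += POINTS[symbol]
    return total


def rank(projects):
    """Return (name, score) pairs sorted from most to least open."""
    scored = [(name, openness_score(cells)) for name, cells in projects.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Two rows transcribed from the table above (14 cells each).
projects = {
    "ChatGPT": ["✘"] * 8 + ["~"] + ["✘"] * 5,
    "OLMo 7B Instruct": ["✔"] * 9 + ["✘"] + ["✔"] * 3 + ["~"],
}

for name, score in rank(projects):
    print(f"{name}: {score}")
# OLMo 7B Instruct: 12.5
# ChatGPT: 0.5
```

With 14 cells per project, the maximum possible score under this rule is 14.0.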

Why is openness important?

Open research is the lifeblood of cumulative progress in science and engineering. Openness is key for fundamental research, for fostering critical computational literacy, and for making informed choices for or against deployment of instruction-tuned LLM architectures. The closed & proprietary nature of ChatGPT and kin makes them fundamentally unfit for responsible use in research and education.

Open alternatives provide ways to build reproducible workflows, chart resource costs, and lessen reliance on corporate whims. One aim of our work here is to provide tools to track openness, transparency and accountability in the fast-evolving landscape of instruction-tuned text generators. Read more in the paper (PDF) or contribute to the repo.

If you know a model that should be listed here or a data point that needs updating, please see guidelines for contributors. We welcome any contribution, whether it's a quick addition to our awesomelist or a more detail-oriented contribution to the metadata for a specific project.

TL;DR

Our paper makes the following contributions:

We review the risks of relying on proprietary software
We review best practices for open, transparent and accountable 'AI'
We find over 40 ChatGPT alternatives at varying degrees of openness, development and documentation
We argue that tech is never a fait accompli unless we make it so, and that openness enables critical computational literacy

We find the following recurrent patterns:

Many projects inherit data of dubious legality
Few projects share the all-important instruction-tuning data
Preprints are rare, peer-reviewed papers even rarer
Synthetic instruction-tuning data is on the rise, with unknown consequences that are in need of research

We conclude as follows:

Openness is not the full solution to the scientific and ethical challenges of conversational text generators. Open data will not mitigate the harmful consequences of thoughtless deployment of large language models, nor the questionable copyright implications of scraping all publicly available data from the internet. However, openness does make original research possible, including efforts to build reproducible workflows and understand the fundamentals of instruction-tuned LLM architectures. Openness also enables checks and balances, fostering a culture of accountability for data and its curation, and for models and their deployment. We hope that our work provides a small step in this direction.

Liesenfeld, Andreas, Alianda Lopez, and Mark Dingemanse. 2023. “Opening up ChatGPT: Tracking Openness, Transparency, and Accountability in Instruction-Tuned Text Generators.” In CUI '23: Proceedings of the 5th International Conference on Conversational User Interfaces. July 19-21, Eindhoven. doi: 10.1145/3571884.3604316 (PDF).
