Merge remote-tracking branch 'origin/preview'

DBlankvoort · DBlankvoort · commit a42b78525442 · 2025-06-09T10:18:14.000+02:00
diff --git a/.github/workflows/draft-pdf.yml b/.github/workflows/draft-pdf.yml
@@ -0,0 +1,24 @@
+name: Draft PDF
+on: [push]
+
+jobs:
+  paper:
+    runs-on: ubuntu-latest
+    name: Paper Draft
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      - name: Build draft PDF
+        uses: openjournals/openjournals-draft-action@master
+        with:
+          journal: joss
+          # This should be the path to the paper within your repo.
+          paper-path: paper.md
+      - name: Upload
+        uses: actions/upload-artifact@v4
+        with:
+          name: paper
+          # This is the output path where Pandoc will write the compiled
+          # PDF. Note, this should be the same directory as the input
+          # paper.md
+          path: paper.pdf
diff --git a/BitNet.yaml b/BitNet.yaml
@@ -10,7 +10,7 @@
 # Liesenfeld, A. and Dingemanse, M., 2024. Rethinking open source generative AI: open-washing and the EU AI Act. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (pp. 1774-1787).
 
 system:
-    name: BitNet b1.58 2B4T
+    name: BitNet
     link: https://huggingface.co/microsoft/bitnet-b1.58-2B-4T
     type: text
     performanceclass: full
diff --git a/DeepFloyd.yaml b/DeepFloyd.yaml
@@ -26,7 +26,6 @@ org:
     notes: Collaboration between various organizations.
 
 # availability:
-
 datasources_basemodel:
     class: partial
     link: https://huggingface.co/DeepFloyd/IF-I-XL-v1.0#training
diff --git a/FLUX.1.yaml b/FLUX.1.yaml
@@ -26,7 +26,6 @@ org:
     notes: Image-generation model start-up.
 
 # availability:
-
 datasources_basemodel:
     class: closed
     link:
diff --git a/Falcon.yaml b/Falcon.yaml
@@ -64,9 +64,7 @@ hardware_architecture:
 
 preprint:
     class: open
-    link: 
-        - https://arxiv.org/abs/2306.01116
-        - https://arxiv.org/abs/2311.16867
+    link: ["https://arxiv.org/abs/2306.01116", "https://arxiv.org/abs/2311.16867"]
     notes: First preprint covers the creation and curation of RefinedWeb dataset, but not other aspects of the model. The second preprint provides more details about the model architecture, implementation, evaluation results, and limitations.
 
 paper:
diff --git a/GLM.yaml b/GLM.yaml
@@ -28,9 +28,7 @@ org:
 # availability:
 datasources_basemodel:
     class: closed
-    link: 
-        - http://doi.org/10.18653/v1/2022.acl-long.26
-        - https://arxiv.org/abs/2406.12793
+    link: ["http://doi.org/10.18653/v1/2022.acl-long.26", "https://arxiv.org/abs/2406.12793"]
     notes: Training data not centrally made available, but described in 2022 ACL paper, appears to be mostly public datasets. Preprint also mentions "Our pre-training corpus consists of multilingual (mostly English and Chinese) documents from a mixture of different sources, including webpages, Wikipedia, books, code, and research papers", but does not go into more detail.
 
 datasources_endmodel:
diff --git a/Poro.yaml b/Poro.yaml
@@ -10,7 +10,7 @@
 # Liesenfeld, A. and Dingemanse, M., 2024. Rethinking open source generative AI: open-washing and the EU AI Act. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (pp. 1774-1787).
 
 system:
-    name: Poro-34B
+    name: Poro
     link: https://huggingface.co/LumiOpen/Poro-34B
     type: text
     performanceclass: full
@@ -26,11 +26,6 @@ org:
     notes: Silo AI was acquired by AMD in August 2024
 
 # availability:
-trainingcode:
-    class: open
-    link: https://github.com/LumiOpen/Megatron-DeepSpeed
-    notes: Custom fork of the Megatron-Deepspeed framework used for training Poro-34B.
-
 datasources_basemodel:
     class: partial
     link: https://arxiv.org/html/2404.01856v1
@@ -51,6 +46,11 @@ weights_endmodel:
     link: https://huggingface.co/LumiOpen/Poro-34B
     notes: Final model weights released under Apache 2.0 license.
 
+trainingcode:
+    class: open
+    link: https://github.com/LumiOpen/Megatron-DeepSpeed
+    notes: Custom fork of the Megatron-Deepspeed framework used for training Poro-34B.
+
 # documentation:
 code:
     class: open
diff --git a/Qwen.yaml b/Qwen.yaml
@@ -10,7 +10,7 @@
 # Liesenfeld, A. and Dingemanse, M., 2024. Rethinking open source generative AI: open-washing and the EU AI Act. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (pp. 1774-1787).
 
 system:
-    name: Qwen3-235B-A22B
+    name: Qwen
     link: https://huggingface.co/Qwen/Qwen3-235B-A22B
     type: text 
     performanceclass: latest
diff --git a/SDXL-Lightning.yaml b/SDXL-Lightning.yaml
@@ -26,11 +26,6 @@ org:
     notes: Chinese technology company that owns Tiktok
 
 # availability:
-trainingcode:
-    class: closed
-    link: 
-    notes: The training code for SDXL-Lightning has not been publicly released.
-
 datasources_basemodel:
     class: partial
     link: https://arxiv.org/pdf/2307.01952
@@ -51,6 +46,10 @@ weights_endmodel:
     link: https://huggingface.co/ByteDance/SDXL-Lightning
     notes: Available through HuggingFace in the form of SafeTensors
 
+trainingcode:
+    class: closed
+    link: 
+    notes: The training code for SDXL-Lightning has not been publicly released.
 
 # documentation:
 code:
diff --git a/Teuken.yaml b/Teuken.yaml
@@ -26,11 +26,6 @@ org:
     notes: Project aiming to develop LLMs in Germany.
 
 # availability:
-trainingcode:
-    class: closed
-    link:
-    notes:
-
 datasources_basemodel:
     class: partial
     link: https://arxiv.org/pdf/2410.08800
@@ -51,6 +46,11 @@ weights_endmodel:
     link: https://huggingface.co/openGPT-X/Teuken-7B-instruct-commercial-v0.4
     notes: Available via Huggingface repository.
 
+trainingcode:
+    class: closed
+    link:
+    notes:
+
 # documentation:
 code:
     class: closed
@@ -64,10 +64,7 @@ hardware_architecture:
 
 preprint:
     class: open
-    link: 
-        - https://arxiv.org/abs/2410.03730
-        - https://arxiv.org/abs/2410.08928
-        - https://arxiv.org/abs/2410.08800
+    link: ["https://arxiv.org/abs/2410.03730", "https://arxiv.org/abs/2410.08928", "https://arxiv.org/abs/2410.08800"]
     notes: Three corresponding preprints, detailing the models, data, and evaluation.
 
 paper:
diff --git a/Whisper.yaml b/Whisper.yaml
@@ -26,11 +26,6 @@ org:
     notes: American AI research organisation founded in 2015, widely known for ChatGPT.
 
 # availability:
-trainingcode:
-    class: partial
-    link: https://github.com/openai/whisper
-    notes: Inference code is released, but training code is not disclosed.
-
 datasources_basemodel:
     class: closed
     link:
@@ -51,6 +46,11 @@ weights_endmodel:
     link: https://huggingface.co/openai/whisper-large-v3-turbo/tree/main
     notes: 
 
+trainingcode:
+    class: partial
+    link: https://github.com/openai/whisper
+    notes: Inference code is released, but training code is not disclosed.
+
 # documentation:
 code:
     class: open
diff --git a/YuLan.yaml b/YuLan.yaml
@@ -26,11 +26,6 @@ org:
     notes: Gaoling School of Artificial Intelligence, a Chinese university organization.
 
 # availability:
-trainingcode:
-    class: open
-    link: https://github.com/RUC-GSAI/YuLan-Mini/
-    notes: Training code available on GitHub.
-
 datasources_basemodel:
     class: open
     link: ["https://huggingface.co/datasets/yulan-team/YuLan-Mini-Datasets", "https://arxiv.org/pdf/2412.17743"]
@@ -51,6 +46,11 @@ weights_endmodel:
     link: https://huggingface.co/yulan-team/YuLan-Mini-Instruct
     notes: Weights available through HuggingFace.
 
+trainingcode:
+    class: open
+    link: https://github.com/RUC-GSAI/YuLan-Mini/
+    notes: Training code available on GitHub.
+
 # documentation:
 code:
     class: open
diff --git a/Zephyr.yaml b/Zephyr.yaml
@@ -96,5 +96,5 @@ api:
 
 licenses:
     class: partial
-    link: 
+
     notes: "weights under MIT, datasets mixed"
diff --git a/_parameters-descriptions.yml b/_parameters-descriptions.yml
@@ -1,11 +1,11 @@
 datasources_basemodel:
-  en: Are datasources for training the base model comprehensively documented and freely made available? In case a distinction between base (foundation) and end (user) model is not applicable, this mirrors the end model data entries.
+  en: Are datasources for training the base model comprehensively documented and made available? In case a distinction between base (foundation) and end (user) model is not applicable, this mirrors the end model data entries.
 datasources_endmodel:
-  en: Are datasources for training the model that the enduser interacts with comprehensively documented made available? 
+  en: Are datasources for training the model that the end user interacts with comprehensively documented and made available? 
 weights_basemodel:
   en: Are the weights of the base models made freely available? In case a distinction between base (foundation) and end (user) model is not applicable, this mirrors the end model data entries.
 weights_endmodel:
-  en: Are the weights of the model that the enduser interacts with made freely available?
+  en: Are the weights of the model that the end user interacts with made freely available?
 trainingcode:
   en: Is the source code of dataset processing, model training and tuning comprehensively made available?
 code:
@@ -17,7 +17,7 @@ preprint:
 paper:
   en: Are peer-reviewed scientific publications available that detail all major parts of the system including datasource processing, model training and tuning steps?
 modelcard:
-  en: Is a model card in standardized format available that provides comprehensive insight on model architecture, training, fine-tuning, and evaluation are available?
+  en: Is a model card available in standardized format that provides comprehensive insight on model architecture, training, fine-tuning, and evaluation?
 datasheet:
   en: Is a datasheet as defined in "Datasheets for Datasets" (Gebru et al. 2021) available?
 package:
diff --git a/_parameters.yml b/_parameters.yml
@@ -1,9 +1,6 @@
 - name: Availability
   ref: availability
   params: 
-    - ref: trainingcode
-      name: Training Code
-      types: ['text', 'image','code','video','audio']
     - ref: datasources_basemodel
       name: Base Model Data
       types: ['text', 'image','code','video','audio']
@@ -16,18 +13,9 @@
     - ref: weights_endmodel
       name: End User Model Weights
       types: ['text', 'image','code','video','audio']
-    - ref: datasources
-      name: Model Data
-      types: []
-    - ref: weights
-      name: Model Weights 
-      types: []
-    - ref: watermarking
-      name: Watermarking
-      types: []
-    - ref: prompt_moderation
-      name: Prompt Moderation
-      types: []
+    - ref: trainingcode
+      name: Training Code
+      types: ['text', 'image','code','video','audio']
 - name: Documentation
   ref: documentation
   params: 
diff --git a/a_submission_template.yaml b/a_submission_template.yaml
@@ -26,11 +26,6 @@ org:
     notes:
 
 # availability:
-trainingcode:
-    class: closed
-    link:
-    notes:
-
 datasources_basemodel:
     class: closed
     link:
@@ -51,6 +46,11 @@ weights_endmodel:
     link:
     notes:
 
+trainingcode:
+    class: closed
+    link:
+    notes:
+
 # documentation:
 code:
     class: closed
diff --git a/command-r.yaml b/command-r.yaml
@@ -26,11 +26,6 @@ org:
     notes: Company developing an enterprise AI platform.
 
 # availability:
-trainingcode:
-    class: closed
-    link:
-    notes: No codebase available to study or adjust model architecture, training, or inner workings.
-
 datasources_basemodel:
     class: closed
     link: https://docs.cohere.com/docs/data-statement
@@ -51,6 +46,11 @@ weights_endmodel:
     link: https://huggingface.co/CohereForAI/c4ai-command-r-v01/tree/main
     notes: Fine-tuned model weights made available for download
 
+trainingcode:
+    class: closed
+    link:
+    notes: No codebase available to study or adjust model architecture, training, or inner workings.
+
 # documentation:
 code:
     class: closed
diff --git a/mistral-nemo.yaml b/mistral-nemo.yaml
@@ -74,9 +74,7 @@ paper:
 
 modelcard:
     class: partial
-    link:
-        - https://huggingface.co/mistralai/Mistral-Nemo-Base-2407
-        - https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
+    link: ["https://huggingface.co/mistralai/Mistral-Nemo-Base-2407", "https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407"]
     notes: Model cards available for both the base and end models, although they are both severely limited.
 
 datasheet:
diff --git a/openjourney.yaml b/openjourney.yaml
@@ -25,6 +25,7 @@ org:
     link: https://prompthero.com/
     notes: Prompt engineering site.
 
+# availability:
 datasources_basemodel:
     class: partial
     link: https://arxiv.org/abs/2210.08402
diff --git a/paper.md b/paper.md
@@ -0,0 +1,42 @@
+---
+title: Language Technology Assessment main database - OSAI Index
+tags:
+  - open-source
+  - generative AI
+  - catalogue
+  - index
+authors:
+  - name: Andreas Liesenfeld
+    orcid: 0000-0001-6076-4406
+    affiliation: 'Centre of Language and Speech Technology, Radboud University'
+  - name: Mark Dingemanse
+    orcid: 0000-0002-3290-5723
+    affiliation: 'Centre of Language and Speech Technology, Radboud University'
+  - name: Nityaa Kalra
+    orcid: 0009-0005-0958-553X
+    affiliation: 'Centre of Language and Speech Technology, Radboud University'
+  - name: Dick Blankvoort
+    orcid: 0009-0003-0766-4678
+    affiliation: 'Centre of Language and Speech Technology, Radboud University'
+
+date: 7 June 2025
+---
+
+# Summary
+
+The European Open Source AI Index is an EU-based, community-driven public resource on open-source generative AI systems, created to catalogue and scrutinize systems that claim to be open or open-source. This upload captures the index at a specific point in time (2025-05-12).
+
+The index is hosted by the Centre of Language and Speech Technology at Radboud University at [osai-index.eu](https://osai-index.eu) and is maintained by a small team of academics and community members.
+
+# Statement of need
+Recent AI models marketed as 'open-source' have been shown to be engaged in a practice known as [open-washing](https://doi.org/10.2139/ssrn.4543807), where only minimal information about a model is released (e.g. only its model weights) while open-source status is claimed. This practice has most notably been seen in Meta's [Llama](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) models, where open-source status is claimed based on gated access to model weights and the release of a non-peer-reviewed preprint. Open-washing risks diluting genuine efforts towards developing truly open AI models, such as those exemplified by [BLOOMZ](https://arxiv.org/abs/2211.01786).
+
+To elucidate and combat this practice, the European Open Source AI Index evaluates 'open' models along 14 dimensions of openness, seeking to establish to what degree these models live up to the ideal of open source. The index is based largely on the papers [Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators](https://dl.acm.org/doi/abs/10.1145/3571884.3604316) and [Rethinking open source generative AI: open-washing and the EU AI Act](https://dl.acm.org/doi/abs/10.1145/3630106.3659005).
+
+# Key References
+- Liesenfeld, A., & Dingemanse, M. (2024). Rethinking open source generative AI: open-washing and the EU AI Act. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24). doi: [10.1145/3630106.3659005](https://dl.acm.org/doi/abs/10.1145/3630106.3659005)
+- Liesenfeld, A., Lopez, A., & Dingemanse, M. (2023). Opening up ChatGPT: tracking openness, transparency, and accountability in instruction-tuned text generators. CUI ’23: Proceedings of the 5th International Conference on Conversational User Interfaces. doi: [10.1145/3571884.3604316](https://dl.acm.org/doi/abs/10.1145/3571884.3604316)
+- Solaiman, I. (2023). The Gradient of Generative AI Release: Methods and Considerations. Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 111–122. doi: [10.1145/3593013.3593981](https://dl.acm.org/doi/abs/10.1145/3593013.3593981)
+
+# Acknowledgements
+The European Open Source AI Index is supported by the Centre for Language Studies and the Dutch Research Council.
diff --git a/readme.md b/readme.md
diff --git a/stable-diffusion.yaml b/stable-diffusion.yaml
diff --git a/viking.yaml b/viking.yaml