GitHub Blog

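Accelerating researchers and developers building multilingual AI with a new open dataset

GitHub has released the GitHub Multilingual Repositories Dataset, a repository-level metadata dataset covering over 40 million repositories. It provides language classifications for READMEs, issues, and pull requests using three classifiers, enabling researchers and developers to discover non-English developer content. The dataset is available under CC0-1.0 and aims to support multilingual AI development, particularly for underrepresented European languages. It follows GitHub's commitment to open data and is designed as a transparent discovery tool, not a ground-truth benchmark.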

Verified summary

English summary

Accelerating researchers and developers building multilingual AI with a new open dataset

GitHub has released the GitHub Multilingual Repositories Dataset, a repository-level metadata dataset covering over 40 million repositories. It provides language classifications for READMEs, issues, and pull requests using three classifiers, enabling researchers and developers to discover non-English developer content. The dataset is available under CC0-1.0 and aims to support multilingual AI development, particularly for underrepresented European languages. It follows GitHub's commitment to open data and is designed as a transparent discovery tool, not a ground-truth benchmark.

  • GitHub released a new open metadata dataset for multilingual repositories.
  • The dataset covers over 40 million repositories with language classifications.
  • It uses three classifiers (fastText, gcld3, lingua-py) with confidence scores.
  • Focuses on non-English content in READMEs, issues, and pull requests.
  • Available under CC0-1.0 license on GitHub.
  • Designed for studying multilingual developer collaboration and building AI tools.
  • Part of Microsoft's European Digital Commitments to promote open data.

open dataset / multilingual AI / GitHub / repository metadata / developer collaboration / language classification / CC0-1.0 / European languages / open source / AI evaluation / fastText / gcld3

Full article

Kevin Xu · @khxu

June 15, 2026

5 minutes

Software may be written in programming languages, but human language is at the heart of developer collaboration. Developers explain how projects work in READMEs. They ask for help in issues. They review, debate, and improve code in pull requests. That collaboration often happens in English, but not always. As AI becomes a bigger part of how developers build software, multilingual developer content matters more than ever.

Today, GitHub is publishing the GitHub Multilingual Repositories Dataset, a repository-level metadata dataset designed to help researchers and developers discover public GitHub repositories with evidence of non-English natural-language content. When building the dataset, we found that language distribution differs across READMEs, issues, and pull requests: Korean is the most common non-English language in issue text, but only the fifth-most common in READMEs. Portuguese tops the non-English README list with more than 3 million repositories.

The dataset is now available on GitHub under CC0-1.0. It follows through on a commitment we made in 2025, as part of Microsoft’s European Digital Commitments, to make multilingual data more accessible, including to open source AI developers.

What’s in the dataset

The GitHub Multilingual Repositories Dataset is intentionally not a dump of repository content. Instead, it is a metadata dataset that helps developers and researchers find repositories where multilingual collaboration may be happening. The dataset covers over 80 million classification rows across more than 40 million repositories. For each public repository, we provide:

  • Language classifications of the README, the most-commented issue, and the most-commented pull request, with the first 150 characters of each used as the input sample. We exclude texts under 20 characters.
  • Classifications for each text source, from fastText, gcld3, and lingua-py, each with a confidence score. The dataset only includes classifications with >0.5 confidence.
  • Repository metadata: creation timestamp, disk usage, stars, forks, primary programming language, SPDX license, issue and pull request counts, and the snapshot date.

We deliberately did not collapse the three classifiers into a single label. Different classifiers have different coverage and confidence calibration, especially for lower-resource languages. By exposing all three, we let you decide how strict you want to be. Want a high-precision Greek subset? Require all three classifiers to agree above some confidence threshold. Want broad recall for an exploratory study of Romance languages? One classifier may be enough.

What you can build with it

The dataset is designed for the kind of work that’s hard to do with general web text:

  • Discover repositories likely to contain developer documentation or collaboration in specific languages.
  • Study how non-English developer communities use issues, pull requests, and READMEs.
  • Build evaluation sets for AI coding tools, doc generators, or review assistants that need to behave well across languages.
  • Encourage decision-makers to expand language coverage for new developer tools and AI features using data-backed arguments on the rich multilingual diversity of developers.
  • Measure representation of European and other underrepresented languages in open source.

Some caveats

Language identification is hard, especially in software repositories. Repository text is often short. It may include badges, templates, installation commands, code snippets, usernames, or mixed-language content. A 150-character sample may not represent the whole repository. Classifiers also vary in coverage and calibration, especially for lower-resource languages.

That is why the dataset should not be treated as a ground-truth benchmark for language identification. Instead, it is designed as a transparent discovery tool.
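
The strict-versus-broad choice described above (all three classifiers agreeing for a high-precision subset, a single classifier for broad recall) might be sketched roughly as follows. This is a minimal illustration, not the dataset's actual schema or tooling: the `Classification` fields, the `select_repos` helper, the sample rows, and the assumption that all three classifiers report comparable language codes are hypothetical.

```python
from dataclasses import dataclass


# Hypothetical row shape: these field names are assumptions made for
# illustration, not the dataset's published column names.
@dataclass(frozen=True)
class Classification:
    repo: str          # repository identifier, e.g. "owner/name"
    source: str        # "readme", "issue", or "pull_request"
    classifier: str    # "fasttext", "gcld3", or "lingua"
    language: str      # predicted language code (assumed normalized)
    confidence: float  # classifier confidence score


def select_repos(rows, language, source, min_agreeing=3, min_confidence=0.5):
    """Return repos where at least `min_agreeing` distinct classifiers label
    `source` as `language` with confidence above `min_confidence`."""
    agreeing = {}  # repo -> set of classifiers that voted for `language`
    for row in rows:
        if (row.source == source
                and row.language == language
                and row.confidence > min_confidence):
            agreeing.setdefault(row.repo, set()).add(row.classifier)
    return sorted(
        repo for repo, voters in agreeing.items() if len(voters) >= min_agreeing
    )


# Made-up sample rows, not real dataset values.
rows = [
    Classification("a/one", "readme", "fasttext", "el", 0.97),
    Classification("a/one", "readme", "gcld3", "el", 0.91),
    Classification("a/one", "readme", "lingua", "el", 0.88),
    Classification("b/two", "readme", "fasttext", "el", 0.62),
    Classification("b/two", "readme", "gcld3", "bg", 0.55),
    Classification("c/three", "issue", "lingua", "el", 0.99),
]

# High precision: all three classifiers must agree above 0.8.
strict = select_repos(rows, "el", "readme", min_agreeing=3, min_confidence=0.8)
# Broad recall: one classifier above the dataset's 0.5 floor is enough.
broad = select_repos(rows, "el", "readme", min_agreeing=1)

print(strict)  # ['a/one']
print(broad)   # ['a/one', 'b/two']
```

Raising `min_agreeing` and `min_confidence` trades recall for precision, and lowering them does the opposite. In practice the three classifiers may emit different label sets, so a real pipeline would likely need a code-normalization step before counting agreement.
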
Users can inspect classifications, confidence scores, and sources, then choose the precision and recall tradeoffs that fit their own research or development workflow.

The dataset also should not be used to infer sensitive attributes about repository owners, contributors, or communities. The signals are repository-level metadata, not person-level attributes.

Why open multilingual data matters

Today, many European languages remain underrepresented in the online text used to build and evaluate AI systems. That creates a risk that AI tools work well for some developers, languages, and communities, while leaving others behind. Open data can help close that gap. We built this dataset because developer content is different from general web text. READMEs, issues, and pull requests contain the language of software collaboration: installation instructions, bug reports, feature requests, review comments, and community norms. That context can help build AI systems that better understand how developers actually work.

By making multilingual developer-content signals easier to find and analyze, this dataset gives researchers, open source developers, and model builders another tool for studying language representation in software development. It can help identify gaps, support better evaluation, and inform more inclusive AI tools for developers across Europe and beyond. It also reflects a broader principle: Building AI for developers should include the communities, languages, and workflows developers actually use.

What’s next

We’ll be discussing the dataset, and the broader importance of open data for multilingual AI, at the Open Innovation Dialogue Hub in Strasbourg on June 16. The event is co-organized by the Microsoft Open Innovation Center, the Council of Europe, and GitHub, and will bring together policymakers, researchers, cultural institutions, and open innovation leaders to discuss AI, linguistic diversity, cultural heritage, and open data.

Multilingual AI needs multilingual developer communities. We hope this dataset helps more people study, support, and build for them. By releasing it under CC0-1.0 on GitHub, we’re inviting researchers, open source maintainers, and model builders to use it, critique it, extend it, and build evaluation sets and tools on top of it.

If you do something interesting with it, we’d love to hear about it.

Written by Kevin Xu, Staff Software Engineer, CELA

