32 datasets found

O
WanJuan1.0（书生·万卷）
opendatalab.com
zip
Updated Aug 14, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shanghai Artificial Intelligence Laboratory (2023). WanJuan1.0（书生·万卷） [Dataset]. https://opendatalab.com/OpenDataLab/WanJuan1_dot_0
Explore at:
zipAvailable download formats
Dataset updated
Aug 14, 2023
Dataset provided by
Corpus Data Alliance for Foudation Model
Shanghai Artificial Intelligence Laboratory
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Intern · Wanjuan 1.0 is the first open source version of the Intern · Wanjuan multimodal corpus, which includes three parts: NLP dataset, muti-modal dataset, and video dataset, with a total data volume of over 2TB.

At present, Intern · Wanjuan 1.0 has been applied to the training of InternMM and InternLM. By digesting high-quality corpus, the Intern Series model exhibits excellent performance in various generative tasks such as semantic understanding, knowledge Q&A, visual understanding, and visual Q&A.

(Email contact: OpenDataLab@pjlab.org.cn)
h
WanJuan-Thai
huggingface.co
Updated Feb 22, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
OpenDataLab (2025). WanJuan-Thai [Dataset]. https://huggingface.co/datasets/opendatalab/WanJuan-Thai
Explore at:
Dataset updated
Feb 22, 2025
Dataset authored and provided by
OpenDataLab
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
💡 Introduction

WanJuan-Thai (万卷丝路-泰语) corpus, with a volume exceeding 155GB, comprises 7 major categories and 34 subcategories. It covers a wide range of local-specific content, including history, politics, culture, real estate, shopping, weather, dining, encyclopedias, and professional knowledge. The rich thematic classification not only facilitates researchers in retrieving data according to specific needs but also ensures that the corpus can adapt to diverse research… See the full description on the dataset page: https://huggingface.co/datasets/opendatalab/WanJuan-Thai.
O
WanJuan2.0 (万卷-CC)
opendatalab.com
zip
Updated Mar 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shanghai Artificial Intelligence Laboratory (2024). WanJuan2.0 (万卷-CC) [Dataset]. https://opendatalab.com/OpenDataLab/WanJuanCC
Explore at:
zipAvailable download formats
Dataset updated
Mar 6, 2024
Dataset provided by
Corpus Data Alliance for Foudation Model
Shanghai Artificial Intelligence Laboratory
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
WanJuan2.0（万卷-CC）是从CommonCrawl获取的一个 1T Tokens 的高质量英文网络文本数据集。结果显示，与各类开源英文CC语料在 Perspective API 不同维度的评估上，WanJuan-CC都表现出更高的安全性。此外，通过在4个验证集上的困惑度（PPL）和6下游任务的准确率，也展示了WanJuan-CC的实用性。WanJuan-CC在各种验证集上的PPL表现出竞争力，特别是在要求更高语言流畅性的tiny-storys等集上。通过与同类型数据集进行1B模型训练对比，使用验证数据集的困惑度（perplexity）和下游任务的准确率作为评估指标，实验证明，WanJuan-CC显著提升了英文文本补全和通用英文能力任务的性能。
O
DrawBench
opendatalab.com
zip
Updated Jan 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google Research (2022). DrawBench [Dataset]. https://opendatalab.com/OpenDataLab/DrawBench
Explore at:
zipAvailable download formats
Dataset updated
Jan 1, 2022
Dataset provided by
Google Research
Description
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and imagetext alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, GLIDE and DALL-E 2, and find that human raters prefer Imagen over other models in side-byside comparisons, both in terms of sample quality and image-text alignment.
SciTail
opendatalab.com
paperswithcode.com
+2more
zip
Updated Sep 22, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Allen Institute for Artificial Intelligence (2022). SciTail [Dataset]. https://opendatalab.com/OpenDataLab/SciTail
Explore at:
zip(93006970 bytes)Available download formats
Dataset updated
Sep 22, 2022
Dataset provided by
艾伦人工智能研究院http://allenai.org/
Description
The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis. We use information retrieval to obtain relevant text from a large text corpus of web sentences, and use these sentences as a premise P. We crowdsource the annotation of such premise-hypothesis pair as supports (entails) or not (neutral), in order to create the SciTail dataset. The dataset contains 27,026 examples with 10,101 examples with entails label and 16,925 examples with neutral label.
Data from: SciQ
opendatalab.com
paperswithcode.com
+1more
zip
Updated Oct 6, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Allen Institute for Artificial Intelligence (2022). SciQ [Dataset]. https://opendatalab.com/OpenDataLab/SciQ
Explore at:
zip(11984582 bytes)Available download formats
Dataset updated
Oct 6, 2022
Dataset provided by
艾伦人工智能研究院http://allenai.org/
Description
The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.
O
SA-1B(segment anything)
opendatalab.com
zip
Updated May 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Meta AI Research (2023). SA-1B(segment anything) [Dataset]. https://opendatalab.com/OpenDataLab/SA-1B
Explore at:
zipAvailable download formats
Dataset updated
May 1, 2023
Dataset provided by
Meta AI Research
License
https://ai.facebook.com/datasets/segment-anything-downloads/https://ai.facebook.com/datasets/segment-anything-downloads/
Description
Segment Anything 1 Billion (SA-1B) is a dataset designed for training general-purpose object segmentation models from open world images.

SA-1B consists of 11M diverse, high-resolution, privacy protecting images and 1.1B high-quality segmentation masks that were collected with our data engine. It is intended to be used for computer vision research for the purposes permitted under our Data License.

The images are licensed from a large photo company. The 1.1B masks were produced using our data engine, all of which were generated fully automatically by the Segment Anything Model (SAM).
TrackML challenge Accuracy phase dataset (Tracking Machine Learning...
opendatalab.com
zip
Updated Aug 6, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bosch Center for Artificial Intelligence (2018). TrackML challenge Accuracy phase dataset (Tracking Machine Learning Challenge) [Dataset]. https://opendatalab.com/OpenDataLab/TrackML_challenge_Accuracy_phase_etc
Explore at:
zip(237421573480 bytes)Available download formats
Dataset updated
Aug 6, 2018
Dataset provided by
IBMhttp://ibm.com/
Geneva University
University of Massachusetts
University of California, Berkeley
Norwegian University of Science and Technology
University of Lisbon
Goethe University Frankfurt
California Institute of Technology
Bosch Center for Artificial Intelligence
Sorbonne University
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset comprises multiple independent events, where each event contains simulated measurements (essentially 3D points) of particles generated in a collision between proton bunches at the Large Hadron Collider at CERN. The goal of the tracking machine learning challenge is to group the recorded measurements or hit for each event into tracks, sets of hits that belong to the same initial particle. A solution must uniquely associate each hit to one track. The training dataset contains the recorded hit, their ground truth counterpart and their association to particles, and the initial parameters of those particles. The test dataset contains only the recorded hits. The dataset was used for the Accuracy Phase of the Tracking Machine Learning challenge on Kaggle. See more details in the home page url.
O
ego4d
opendatalab.com
huggingface.co
zip
Updated Dec 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
University of Minnesota (2023). ego4d [Dataset]. https://opendatalab.com/OpenDataLab/ego4d
Explore at:
zipAvailable download formats
Dataset updated
Dec 1, 2023
Dataset provided by
Facebook AI Research
University of Minnesota
University of Texas at Austin
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
EGO4D is the world's largest egocentric (first person) video ML dataset and benchmark suite, with 3,600 hrs (and counting) of densely narrated video and a wide range of annotations across five new benchmark tasks. It covers hundreds of scenarios (household, outdoor, workplace, leisure, etc.) of daily life activity captured in-the-wild by 926 unique camera wearers from 74 worldwide locations and 9 different countries. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. The approach to data collection was designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant.
AI2D (AI2 Diagrams)
opendatalab.com
zip
Updated Sep 20, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Allen Institute for Artificial Intelligence (2022). AI2D (AI2 Diagrams) [Dataset]. https://opendatalab.com/OpenDataLab/AI2D
Explore at:
zip(1171019159 bytes)Available download formats
Dataset updated
Sep 20, 2022
Dataset provided by
艾伦人工智能研究院http://allenai.org/
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
AI2 Diagrams (AI2D) is a dataset of over 5000 grade school science diagrams with over 150000 rich annotations, their ground truth syntactic parses, and more than 15000 corresponding multiple choice questions.
O
Enterprise-Driven Open Source Software
opendatalab.com
data.niaid.nih.gov
+1more
zip
Updated Apr 21, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Athens University of Economics and Business (2020). Enterprise-Driven Open Source Software [Dataset]. https://opendatalab.com/OpenDataLab/Enterprise-Driven_Open_Source_etc
Explore at:
zip(7896769 bytes)Available download formats
Dataset updated
Apr 21, 2020
Dataset provided by
Athens University of Economics and Business
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.
QASPER
opendatalab.com
huggingface.co
zip
Updated Sep 22, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Allen Institute for Artificial Intelligence (2022). QASPER [Dataset]. https://opendatalab.com/OpenDataLab/QASPER
Explore at:
zip(681163695 bytes)Available download formats
Dataset updated
Sep 22, 2022
Dataset provided by
艾伦人工智能研究院http://allenai.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A dataset containing 1585 papers with 5049 information-seeking questions asked by regular readers of NLP papers, and answered by a separate set of NLP practitioners.
O
MultiBench
opendatalab.com
paperswithcode.com
zip
Updated May 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tohoku University (2023). MultiBench [Dataset]. https://opendatalab.com/OpenDataLab/MultiBench
Explore at:
zipAvailable download formats
Dataset updated
May 2, 2023
Dataset provided by
Johns Hopkins University
Carnegie Mellon University
Tohoku University
Stanford University
University of Texas at Austin
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
MultiBench, a systematic and unified large-scale benchmark for multimodal learning spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. MultiBench provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, MultiBench offers evaluation methodology to study (1) generalization, (2) time and space complexity, and (3) modality robustness.
ai2-arc
opendatalab.com
tensorflow.org
+1more
zip
Updated Jan 8, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Allen Institute for Artificial Intelligence (2024). ai2-arc [Dataset]. https://opendatalab.com/OpenDataLab/ai2-arc
Explore at:
zipAvailable download formats
Dataset updated
Jan 8, 2024
Dataset provided by
艾伦人工智能研究院http://allenai.org/
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
A new dataset of 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering. The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. We are also including a corpus of over 14 million science sentences relevant to the task, and an implementation of three neural baseline models for this dataset. We pose ARC as a challenge to the community.
ARC-DA (ARC Direct Answer Questions)
opendatalab.com
zip
Updated Aug 30, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Allen Institute for Artificial Intelligence (2022). ARC-DA (ARC Direct Answer Questions) [Dataset]. https://opendatalab.com/OpenDataLab/ARC-DA
Explore at:
zip(864509 bytes)Available download formats
Dataset updated
Aug 30, 2022
Dataset provided by
艾伦人工智能研究院http://allenai.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ARC Direct Answer Questions (ARC-DA) dataset consists of 2,985 grade-school level, direct-answer ("open response", "free form") science questions derived from the ARC multiple-choice question set released as part of the AI2 Reasoning Challenge in 2018. How the dataset was built These questions were derived from the ARC multiple-choice question set released as part of the AI2 Reasoning Challenge in 2018. The ARC Easy and ARC Challenge set questions in the original dataset were combined and then filtered/modified by the following process:
TupleInf Open IE Dataset
opendatalab.com
huggingface.co
zip
Updated Apr 30, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Allen Institute for Artificial Intelligence (2023). TupleInf Open IE Dataset [Dataset]. https://opendatalab.com/OpenDataLab/TupleInf_Open_IE_Dataset
Explore at:
zip(69430390 bytes)Available download formats
Dataset updated
Apr 30, 2023
Dataset provided by
艾伦人工智能研究院http://allenai.org/
Description
The TupleInf Open IE dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in “Answering Complex Questions Using Open Information Extraction” (referred as Tuple KB, T). These sentences were collected from a large Web corpus using training questions from 4th and 8th grade as queries. This dataset contains 156K sentences collected for 4th grade questions and 107K sentences for 8th grade questions. Each sentence is followed by the Open IE v4 tuples using their simple format.
O
Paper2Fig100k
opendatalab.com
zip
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
École de technologie supérieure, Paper2Fig100k [Dataset]. https://opendatalab.com/OpenDataLab/Paper2Fig100k
Explore at:
zip(42542162538 bytes)Available download formats
Dataset provided by
ServiceNow Research
École de technologie supérieure
Computer Vision Center
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A dataset with over 100k images of figures and text captions from research papers. Images of figures display diagrams, methodologies, and architectures of research papers in arXiv.org. We provide also text captions for each figure, and OCR detections and recognitions on the figures (bounding boxes and texts).The dataset structure consists of a directory called "figures" and two JSON files (train and test), that contain data from each figure. Each JSON object contains the following information about a figure:figure_id: Figure identification based on the arXiv identifier:
comma2k19
opendatalab.com
zip
Updated Sep 21, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Comma AI (2022). comma2k19 [Dataset]. https://opendatalab.com/OpenDataLab/comma2k19
Explore at:
zip(107257029342 bytes)Available download formats
Dataset updated
Sep 21, 2022
Dataset provided by
comma
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0)https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Description
comma.ai presents comma2k19, a dataset of over 33 hours of commute in California's 280 highway. This means 2019 segments, 1 minute long each, on a 20km section of highway driving between California's San Jose and San Francisco. comma2k19 is a fully reproducible and scalable dataset.
O
MagicData-RAMC Conversational Speech Dataset
opendatalab.com
zip
Updated Mar 17, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
University of Chinese Academy of Sciences (2023). MagicData-RAMC Conversational Speech Dataset [Dataset]. https://opendatalab.com/OpenDataLab/MagicData-RAMC_Conversational_etc
Explore at:
zipAvailable download formats
Dataset updated
Mar 17, 2023
Dataset provided by
Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese
Magic Data Technology Co., Ltd.
University of Chinese Academy of Sciences
Chinese Academy of Sciences
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
The MagicData-RAMC corpus contains 180 hours of conversational speech data recorded from native speakers of Mandarin Chinese over mobile phones with a sampling rate of 16 kHz. The dialogs in the dialogs are classified into 15 diversified domains and tagged with topic labels, ranging from science and technology to ordinary life. Accurate transcription and precise speaker voice activity timestamps are manually labeled for each sample. Speakers' detailed information is also provided.
O
Negotiation Dialogues Dataset
opendatalab.com
zip
Updated Sep 22, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Facebook AI Research (2022). Negotiation Dialogues Dataset [Dataset]. https://opendatalab.com/OpenDataLab/Negotiation_Dialogues_Dataset
Explore at:
zip(9342406 bytes)Available download formats
Dataset updated
Sep 22, 2022
Dataset provided by
Facebook AI Research
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset consists of 5808 dialogues, based on 2236 unique scenarios. Each dialogue is converted into two training examples in the dataset, showing the complete conversation from the perspective of each agent. The perspectives differ on their input goals, output choice, and in special tokens marking whether a statement was read or written.

Facebook

Twitter

Click to copy link

Link copied

Cite

Shanghai Artificial Intelligence Laboratory (2023). WanJuan1.0（书生·万卷） [Dataset]. https://opendatalab.com/OpenDataLab/WanJuan1_dot_0

WanJuan1.0（书生·万卷）

OpenDataLab/WanJuan1_dot_0

Explore at:

zipAvailable download formats

Dataset updated

Aug 14, 2023

Dataset provided by

Corpus Data Alliance for Foudation Model
Shanghai Artificial Intelligence Laboratory

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Intern · Wanjuan 1.0 is the first open source version of the Intern · Wanjuan multimodal corpus, which includes three parts: NLP dataset, muti-modal dataset, and video dataset, with a total data volume of over 2TB.

At present, Intern · Wanjuan 1.0 has been applied to the training of InternMM and InternLM. By digesting high-quality corpus, the Intern Series model exhibits excellent performance in various generative tasks such as semantic understanding, knowledge Q&A, visual understanding, and visual Q&A.

(Email contact: OpenDataLab@pjlab.org.cn)

Clear search

Close search

Google apps

Main menu

WanJuan1.0（书生·万卷）

WanJuan-Thai

WanJuan2.0 (万卷-CC)

DrawBench

SciTail

Data from: SciQ

SA-1B(segment anything)

TrackML challenge Accuracy phase dataset (Tracking Machine Learning...

ego4d

AI2D (AI2 Diagrams)

Enterprise-Driven Open Source Software

QASPER

MultiBench

ai2-arc

ARC-DA (ARC Direct Answer Questions)

TupleInf Open IE Dataset

Paper2Fig100k

comma2k19

MagicData-RAMC Conversational Speech Dataset

Negotiation Dialogues Dataset

WanJuan1.0（书生·万卷）

OpenDataLab/WanJuan1_dot_0