32 datasets found
  1. WanJuan1.0(书生·万卷)

    • opendatalab.com
    zip
    Updated Aug 14, 2023
    Cite
    Shanghai Artificial Intelligence Laboratory (2023). WanJuan1.0(书生·万卷) [Dataset]. https://opendatalab.com/OpenDataLab/WanJuan1_dot_0
    Available download formats: zip
    Dataset updated
    Aug 14, 2023
    Dataset provided by
    Corpus Data Alliance for Foundation Model
    Shanghai Artificial Intelligence Laboratory
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Intern · Wanjuan 1.0 is the first open source version of the Intern · Wanjuan multimodal corpus. It includes three parts: an NLP dataset, a multi-modal dataset, and a video dataset, with a total data volume of over 2TB.

    At present, Intern · Wanjuan 1.0 has been applied to the training of InternMM and InternLM. By digesting this high-quality corpus, the Intern series models exhibit excellent performance in various generative tasks such as semantic understanding, knowledge Q&A, visual understanding, and visual Q&A.

    (Email contact: OpenDataLab@pjlab.org.cn)

  2. WanJuan-Thai

    • huggingface.co
    Updated Feb 22, 2025
    Cite
    OpenDataLab (2025). WanJuan-Thai [Dataset]. https://huggingface.co/datasets/opendatalab/WanJuan-Thai
    Dataset updated
    Feb 22, 2025
    Dataset authored and provided by
    OpenDataLab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    💡 Introduction

    The WanJuan-Thai (万卷丝路-泰语) corpus, with a volume exceeding 155GB, comprises 7 major categories and 34 subcategories. It covers a wide range of locally specific content, including history, politics, culture, real estate, shopping, weather, dining, encyclopedias, and professional knowledge. The rich thematic classification not only helps researchers retrieve data according to specific needs but also ensures that the corpus can adapt to diverse research… See the full description on the dataset page: https://huggingface.co/datasets/opendatalab/WanJuan-Thai.

  3. WanJuan2.0 (万卷-CC)

    • opendatalab.com
    zip
    Updated Mar 6, 2024
    Cite
    Shanghai Artificial Intelligence Laboratory (2024). WanJuan2.0 (万卷-CC) [Dataset]. https://opendatalab.com/OpenDataLab/WanJuanCC
    Available download formats: zip
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    Corpus Data Alliance for Foundation Model
    Shanghai Artificial Intelligence Laboratory
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

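    WanJuan2.0 (万卷-CC) is a high-quality English web-text dataset of 1T tokens derived from CommonCrawl. Evaluations with the Perspective API show that, across its different dimensions, WanJuan-CC is safer than other open-source English CC corpora. Its utility is further demonstrated by perplexity (PPL) on 4 validation sets and accuracy on 6 downstream tasks. WanJuan-CC achieves competitive PPL on the various validation sets, particularly on sets such as tiny-storys that demand greater language fluency. In comparisons against datasets of the same type using 1B-parameter models, with validation perplexity and downstream-task accuracy as the evaluation metrics, experiments show that WanJuan-CC significantly improves performance on English text completion and general English ability tasks.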

  4. DrawBench

    • opendatalab.com
    zip
    Updated Jan 1, 2022
    Cite
    Google Research (2022). DrawBench [Dataset]. https://opendatalab.com/OpenDataLab/DrawBench
    Available download formats: zip
    Dataset updated
    Jan 1, 2022
    Dataset provided by
    Google Research
    Description

    We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, GLIDE and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.

  5. SciTail

    • opendatalab.com
    • paperswithcode.com
    • +2 more
    zip
    Updated Sep 22, 2022
    Cite
    Allen Institute for Artificial Intelligence (2022). SciTail [Dataset]. https://opendatalab.com/OpenDataLab/SciTail
    Available download formats: zip (93006970 bytes)
    Dataset updated
    Sep 22, 2022
    Dataset provided by
    Allen Institute for Artificial Intelligence (http://allenai.org/)
    Description

    The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis. We use information retrieval to obtain relevant text from a large text corpus of web sentences, and use these sentences as a premise P. We crowdsource the annotation of each such premise-hypothesis pair as supports (entails) or not (neutral), in order to create the SciTail dataset. The dataset contains 27,026 examples: 10,101 with the entails label and 16,925 with the neutral label.

  6. Data from: SciQ

    • opendatalab.com
    • paperswithcode.com
    • +1 more
    zip
    Updated Oct 6, 2022
    + more versions
    Cite
    Allen Institute for Artificial Intelligence (2022). SciQ [Dataset]. https://opendatalab.com/OpenDataLab/SciQ
    Available download formats: zip (11984582 bytes)
    Dataset updated
    Oct 6, 2022
    Dataset provided by
    Allen Institute for Artificial Intelligence (http://allenai.org/)
    Description

    The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.

  7. SA-1B(segment anything)

    • opendatalab.com
    zip
    Updated May 1, 2023
    + more versions
    Cite
    Meta AI Research (2023). SA-1B(segment anything) [Dataset]. https://opendatalab.com/OpenDataLab/SA-1B
    Available download formats: zip
    Dataset updated
    May 1, 2023
    Dataset provided by
    Meta AI Research
    License

    https://ai.facebook.com/datasets/segment-anything-downloads/

    Description

    Segment Anything 1 Billion (SA-1B) is a dataset designed for training general-purpose object segmentation models from open world images.

    SA-1B consists of 11M diverse, high-resolution, privacy protecting images and 1.1B high-quality segmentation masks that were collected with our data engine. It is intended to be used for computer vision research for the purposes permitted under our Data License.

    The images are licensed from a large photo company. The 1.1B masks were produced using our data engine, all of which were generated fully automatically by the Segment Anything Model (SAM).

  8. TrackML challenge Accuracy phase dataset (Tracking Machine Learning...

    • opendatalab.com
    zip
    Updated Aug 6, 2018
    Cite
    Bosch Center for Artificial Intelligence (2018). TrackML challenge Accuracy phase dataset (Tracking Machine Learning Challenge) [Dataset]. https://opendatalab.com/OpenDataLab/TrackML_challenge_Accuracy_phase_etc
    Available download formats: zip (237421573480 bytes)
    Dataset updated
    Aug 6, 2018
    Dataset provided by
    IBM (http://ibm.com/)
    Geneva University
    University of Massachusetts
    University of California, Berkeley
    Norwegian University of Science and Technology
    University of Lisbon
    Goethe University Frankfurt
    California Institute of Technology
    Bosch Center for Artificial Intelligence
    Sorbonne University
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset comprises multiple independent events, where each event contains simulated measurements (essentially 3D points) of particles generated in a collision between proton bunches at the Large Hadron Collider at CERN. The goal of the tracking machine learning challenge is to group the recorded measurements, or hits, for each event into tracks: sets of hits that belong to the same initial particle. A solution must uniquely associate each hit to one track. The training dataset contains the recorded hits, their ground truth counterparts and their association to particles, and the initial parameters of those particles. The test dataset contains only the recorded hits. The dataset was used for the Accuracy Phase of the Tracking Machine Learning challenge on Kaggle. See more details on the home page.

  9. ego4d

    • opendatalab.com
    • huggingface.co
    zip
    Updated Dec 1, 2023
    Cite
    University of Minnesota (2023). ego4d [Dataset]. https://opendatalab.com/OpenDataLab/ego4d
    Available download formats: zip
    Dataset updated
    Dec 1, 2023
    Dataset provided by
    Facebook AI Research
    University of Minnesota
    University of Texas at Austin
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    EGO4D is the world's largest egocentric (first person) video ML dataset and benchmark suite, with 3,600 hrs (and counting) of densely narrated video and a wide range of annotations across five new benchmark tasks. It covers hundreds of scenarios (household, outdoor, workplace, leisure, etc.) of daily life activity captured in-the-wild by 926 unique camera wearers from 74 worldwide locations and 9 different countries. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. The approach to data collection was designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant.

  10. AI2D (AI2 Diagrams)

    • opendatalab.com
    zip
    Updated Sep 20, 2022
    + more versions
    Cite
    Allen Institute for Artificial Intelligence (2022). AI2D (AI2 Diagrams) [Dataset]. https://opendatalab.com/OpenDataLab/AI2D
    Available download formats: zip (1171019159 bytes)
    Dataset updated
    Sep 20, 2022
    Dataset provided by
    Allen Institute for Artificial Intelligence (http://allenai.org/)
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    AI2 Diagrams (AI2D) is a dataset of over 5000 grade school science diagrams with over 150000 rich annotations, their ground truth syntactic parses, and more than 15000 corresponding multiple choice questions.

  11. Enterprise-Driven Open Source Software

    • opendatalab.com
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Apr 21, 2020
    + more versions
    Cite
    Athens University of Economics and Business (2020). Enterprise-Driven Open Source Software [Dataset]. https://opendatalab.com/OpenDataLab/Enterprise-Driven_Open_Source_etc
    Available download formats: zip (7896769 bytes)
    Dataset updated
    Apr 21, 2020
    Dataset provided by
    Athens University of Economics and Business
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.

  12. QASPER

    • opendatalab.com
    • huggingface.co
    zip
    Updated Sep 22, 2022
    Cite
    Allen Institute for Artificial Intelligence (2022). QASPER [Dataset]. https://opendatalab.com/OpenDataLab/QASPER
    Available download formats: zip (681163695 bytes)
    Dataset updated
    Sep 22, 2022
    Dataset provided by
    Allen Institute for Artificial Intelligence (http://allenai.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset containing 1585 papers with 5049 information-seeking questions asked by regular readers of NLP papers, and answered by a separate set of NLP practitioners.

  13. MultiBench

    • opendatalab.com
    • paperswithcode.com
    zip
    Updated May 2, 2023
    Cite
    Tohoku University (2023). MultiBench [Dataset]. https://opendatalab.com/OpenDataLab/MultiBench
    Available download formats: zip
    Dataset updated
    May 2, 2023
    Dataset provided by
    Johns Hopkins University
    Carnegie Mellon University
    Tohoku University
    Stanford University
    University of Texas at Austin
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    MultiBench is a systematic and unified large-scale benchmark for multimodal learning spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. MultiBench provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, MultiBench offers evaluation methodology to study (1) generalization, (2) time and space complexity, and (3) modality robustness.

  14. ai2-arc

    • opendatalab.com
    • tensorflow.org
    • +1 more
    zip
    Updated Jan 8, 2024
    + more versions
    Cite
    Allen Institute for Artificial Intelligence (2024). ai2-arc [Dataset]. https://opendatalab.com/OpenDataLab/ai2-arc
    Available download formats: zip
    Dataset updated
    Jan 8, 2024
    Dataset provided by
    Allen Institute for Artificial Intelligence (http://allenai.org/)
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    A new dataset of 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering. The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. We are also including a corpus of over 14 million science sentences relevant to the task, and an implementation of three neural baseline models for this dataset. We pose ARC as a challenge to the community.

  15. ARC-DA (ARC Direct Answer Questions)

    • opendatalab.com
    zip
    Updated Aug 30, 2022
    + more versions
    Cite
    Allen Institute for Artificial Intelligence (2022). ARC-DA (ARC Direct Answer Questions) [Dataset]. https://opendatalab.com/OpenDataLab/ARC-DA
    Available download formats: zip (864509 bytes)
    Dataset updated
    Aug 30, 2022
    Dataset provided by
    Allen Institute for Artificial Intelligence (http://allenai.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The ARC Direct Answer Questions (ARC-DA) dataset consists of 2,985 grade-school level, direct-answer ("open response", "free form") science questions derived from the ARC multiple-choice question set released as part of the AI2 Reasoning Challenge in 2018. How the dataset was built: the ARC Easy and ARC Challenge set questions in the original dataset were combined and then filtered/modified by the following process:

  16. TupleInf Open IE Dataset

    • opendatalab.com
    • huggingface.co
    zip
    Updated Apr 30, 2023
    + more versions
    Cite
    Allen Institute for Artificial Intelligence (2023). TupleInf Open IE Dataset [Dataset]. https://opendatalab.com/OpenDataLab/TupleInf_Open_IE_Dataset
    Available download formats: zip (69430390 bytes)
    Dataset updated
    Apr 30, 2023
    Dataset provided by
    Allen Institute for Artificial Intelligence (http://allenai.org/)
    Description

    The TupleInf Open IE dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in “Answering Complex Questions Using Open Information Extraction” (referred to as Tuple KB, T). These sentences were collected from a large Web corpus using training questions from 4th and 8th grade as queries. This dataset contains 156K sentences collected for 4th grade questions and 107K sentences for 8th grade questions. Each sentence is followed by the Open IE v4 tuples using their simple format.

  17. Paper2Fig100k

    • opendatalab.com
    zip
    + more versions
    Cite
    École de technologie supérieure, Paper2Fig100k [Dataset]. https://opendatalab.com/OpenDataLab/Paper2Fig100k
    Available download formats: zip (42542162538 bytes)
    Dataset provided by
    ServiceNow Research
    École de technologie supérieure
    Computer Vision Center
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset with over 100k images of figures and text captions from research papers. Images of figures display diagrams, methodologies, and architectures of research papers in arXiv.org. We also provide text captions for each figure, and OCR detections and recognitions on the figures (bounding boxes and texts). The dataset structure consists of a directory called "figures" and two JSON files (train and test) that contain data from each figure. Each JSON object contains the following information about a figure: figure_id: Figure identification based on the arXiv identifier:

  18. comma2k19

    • opendatalab.com
    zip
    Updated Sep 21, 2022
    + more versions
    Cite
    Comma AI (2022). comma2k19 [Dataset]. https://opendatalab.com/OpenDataLab/comma2k19
    Available download formats: zip (107257029342 bytes)
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    comma
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    comma.ai presents comma2k19, a dataset of over 33 hours of commute on California's 280 highway. This means 2019 segments, 1 minute long each, on a 20km section of highway driving between California's San Jose and San Francisco. comma2k19 is a fully reproducible and scalable dataset.

  19. MagicData-RAMC Conversational Speech Dataset

    • opendatalab.com
    zip
    Updated Mar 17, 2023
    + more versions
    Cite
    University of Chinese Academy of Sciences (2023). MagicData-RAMC Conversational Speech Dataset [Dataset]. https://opendatalab.com/OpenDataLab/MagicData-RAMC_Conversational_etc
    Available download formats: zip
    Dataset updated
    Mar 17, 2023
    Dataset provided by
    Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese
    Magic Data Technology Co., Ltd.
    University of Chinese Academy of Sciences
    Chinese Academy of Sciences
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The MagicData-RAMC corpus contains 180 hours of conversational speech data recorded from native speakers of Mandarin Chinese over mobile phones with a sampling rate of 16 kHz. The dialogs are classified into 15 diversified domains and tagged with topic labels, ranging from science and technology to ordinary life. Accurate transcription and precise speaker voice activity timestamps are manually labeled for each sample. Speakers' detailed information is also provided.

  20. Negotiation Dialogues Dataset

    • opendatalab.com
    zip
    Updated Sep 22, 2022
    + more versions
    Cite
    Facebook AI Research (2022). Negotiation Dialogues Dataset [Dataset]. https://opendatalab.com/OpenDataLab/Negotiation_Dialogues_Dataset
    Available download formats: zip (9342406 bytes)
    Dataset updated
    Sep 22, 2022
    Dataset provided by
    Facebook AI Research
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset consists of 5808 dialogues, based on 2236 unique scenarios. Each dialogue is converted into two training examples in the dataset, showing the complete conversation from the perspective of each agent. The perspectives differ on their input goals, output choice, and in special tokens marking whether a statement was read or written.
