Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features: object segmentation; recognition in context; superpixel stuff segmentation; 330K images (>200K labeled); 1.5 million object instances; 80 object categories; 91 stuff categories; 5 captions per image; and 250,000 people with keypoints.
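The instance-level annotations listed above can be browsed with the pycocotools API. A minimal sketch, assuming the val2017 instance annotation file has already been downloaded (the local path is a placeholder):

```python
# Sketch: browsing COCO instance annotations with pycocotools.
# The annotation file path below is a placeholder, not part of this catalog entry.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")  # placeholder local path

# The 80 object categories
cats = coco.loadCats(coco.getCatIds())
print(len(cats), "categories, e.g.", [c["name"] for c in cats[:5]])

# Images containing the "person" category
person_ids = coco.getCatIds(catNms=["person"])
img_ids = coco.getImgIds(catIds=person_ids)
print(len(img_ids), "images with people")

# Instance annotations (segmentation, boxes, keypoints) for one of those images
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=person_ids)
anns = coco.loadAnns(ann_ids)
print(anns[0].keys())
```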
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
[📜 arXiv] | [Dataset (🤗Hugging Face)] | [Dataset (OpenDataLab)]
This repository contains the official code of OHR-Bench, a benchmark designed to evaluate the cascading impact of OCR on RAG.
Overview
PDFs, ground-truth (gt) structured data, and Q&A datasets: [🤗 Hugging Face] pdfs.zip, data/retrieval_base/gt. It includes 8,500+ unstructured PDF pages from various domains, including Textbook, Law… See the full description on the dataset page: https://huggingface.co/datasets/opendatalab/OHR-Bench.
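A minimal sketch of fetching the PDF archive from the Hugging Face Hub with huggingface_hub; the pdfs.zip filename comes from the description above, while the extraction directory is an assumption:

```python
# Sketch: download and unpack the OHR-Bench PDF archive.
from huggingface_hub import hf_hub_download
import zipfile

zip_path = hf_hub_download(
    repo_id="opendatalab/OHR-Bench",
    filename="pdfs.zip",      # filename mentioned in the description above
    repo_type="dataset",
)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall("ohr_bench_pdfs")  # placeholder target directory
```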
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
TAO is a federated dataset for Tracking Any Object, containing 2,907 high-resolution videos captured in diverse environments, averaging half a minute in length. We adopt a bottom-up approach for discovering a large vocabulary of 833 categories, an order of magnitude more than prior tracking benchmarks.
This dataset is for evaluating logical reasoning with large language models, as described in the paper Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation. Code: https://github.com/opendatalab/ProverGen
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The 蜜巢·花粉 1.0 open-source dataset is a text dataset. It was collected and curated from the 2022 historical data of publicly accessible websites, totaling more than 70 million entries. The dataset is characterized by reliable sources, high data quality, and sustained, stable updates. The 蜜巢·花粉 dataset has already been used to train several large models, providing the media vertical with intelligent generative services such as material-grounded knowledge Q&A and content generation, automatic generation of analysis reports, and manuscript proofreading, polishing, and rewriting.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WanJuan·SiLu Multimodal Multilingual Corpus
🌏Dataset Introduction
The newly upgraded "Wanjuan·Silk Road Multimodal Corpus" brings the following three core improvements:
The number of languages has been significantly expanded: building on the five languages open-sourced in "Wanjuan·Silk Road", namely Arabic, Russian, Korean, Vietnamese, and Thai, "Wanjuan·Silk Road Multimodal" adds corpus data for three low-resource languages, Serbian, Hungarian, and Czech, and uses the above eight key… See the full description on the dataset page: https://huggingface.co/datasets/opendatalab/WanJuanSiLu-Multimodal-5Languages.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Intern · Wanjuan 1.0 is the first open-source version of the Intern · Wanjuan multimodal corpus, which includes three parts: an NLP dataset, a multi-modal dataset, and a video dataset, with a total data volume of over 2TB.
At present, Intern · Wanjuan 1.0 has been applied to the training of InternMM and InternLM. By digesting this high-quality corpus, the Intern series models exhibit excellent performance in various generative tasks such as semantic understanding, knowledge Q&A, visual understanding, and visual Q&A.
(Email contact: OpenDataLab@pjlab.org.cn)
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PandaLM aims to provide reproducible and automated comparisons between different large language models (LLMs). Given the same context, PandaLM compares the responses of different LLMs and provides a reason for its decision, along with a reference answer. The target audience for PandaLM is organizations with confidential data and research labs with limited funds that seek reproducibility. These organizations may not want to disclose their data to third parties, and may not be able to afford the risk of data leakage through third-party APIs or the high cost of hiring human annotators. With PandaLM, they can perform evaluations without compromising data security or incurring high costs, and obtain reproducible results. To demonstrate the reliability and consistency of our tool, we have created a diverse human-annotated test dataset of approximately 1,000 samples, where the contexts and the labels are all created by humans. Our results indicate that PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability and 88.28% of GPT-4's in terms of F1-score on our test dataset. More papers and features are coming soon.
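A hedged sketch of the kind of agreement measurement described above: comparing a judge model's verdicts against human annotations with a macro F1 score. The label lists are hypothetical placeholders, not PandaLM's actual data or evaluation code:

```python
# Sketch: macro F1 between a judge model's verdicts and human gold labels.
from sklearn.metrics import f1_score

human_labels = ["win", "tie", "lose", "win", "lose"]   # hypothetical gold labels
judge_labels = ["win", "lose", "lose", "win", "lose"]  # hypothetical model verdicts

print(f1_score(human_labels, judge_labels, average="macro"))
```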
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MASSIVE 1.1 is a parallel dataset of > 1M utterances across 52 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions.
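A minimal sketch of loading one MASSIVE locale for intent prediction, assuming the copy published as "AmazonScience/massive" on the Hugging Face Hub (repository name and config are assumptions, not stated above):

```python
# Sketch: load the en-US locale of MASSIVE and inspect one annotated utterance.
from datasets import load_dataset

massive_en = load_dataset("AmazonScience/massive", "en-US", split="train")
print(massive_en)     # number of utterances and column names
print(massive_en[0])  # one utterance with its intent and slot annotations
```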
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WanJuan2.0 (WanJuan-CC) is a 1T-token high-quality English web-text dataset extracted from CommonCrawl. Evaluations across the different dimensions of the Perspective API show that WanJuan-CC is safer than various open-source English CC corpora. Its practical value is further demonstrated by perplexity (PPL) on 4 validation sets and accuracy on 6 downstream tasks. WanJuan-CC achieves competitive PPL on a range of validation sets, especially on sets such as tiny-storys that demand higher language fluency. In a comparison against datasets of the same type using 1B-parameter model training, with validation-set perplexity and downstream-task accuracy as evaluation metrics, experiments show that WanJuan-CC significantly improves performance on English text completion and general English capability tasks.
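A hedged sketch of the perplexity-style evaluation mentioned above: score held-out text with a small causal LM and report exp(mean NLL). The model name and text are placeholders, not the 1B models or validation sets used in the WanJuan-CC comparison:

```python
# Sketch: compute perplexity of a text snippet under a causal language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, not one of the evaluated 1B models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."  # placeholder validation text
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, labels=inputs["input_ids"])  # loss = mean token NLL
print("perplexity:", torch.exp(out.loss).item())
```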
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they are not included in the training sets of code generation models.
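Results on HumanEval are usually reported as pass@k. A sketch of the unbiased estimator introduced alongside the benchmark: given n sampled completions per problem, of which c pass the unit tests, estimate the probability that at least one of k samples passes:

```python
# Sketch: unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), computed stably.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n samples, c of which pass the unit tests; probability that >=1 of k passes."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=37, k=1))  # example: 37 of 200 samples passed
```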
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CLICK-ID dataset is a collection of Indonesian news headlines collected from 12 local online news publishers: detikNews, Fimela, Kapanlagi, Kompas, Liputan6, Okezone, Posmetro-Medan, Republika, Sindonews, Tempo, Tribunnews, and Wowkeren. The dataset comprises two main parts: (i) 46,119 raw article records, and (ii) 15,000 clickbait-annotated sample headlines. Annotation was conducted with three annotators examining each headline; judgments were based only on the headline, and the majority vote is taken as the ground truth. In the annotated sample, our annotation shows 6,290 clickbait and 8,710 non-clickbait headlines.
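A minimal sketch of the majority-vote aggregation described above, with three judgments per headline; the example judgments are hypothetical placeholders:

```python
# Sketch: take the majority label over three annotator judgments per headline.
from collections import Counter

annotations = {
    "headline_001": ["clickbait", "clickbait", "non-clickbait"],
    "headline_002": ["non-clickbait", "non-clickbait", "non-clickbait"],
}

ground_truth = {
    headline: Counter(votes).most_common(1)[0][0]
    for headline, votes in annotations.items()
}
print(ground_truth)
```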
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
JGLUE, Japanese General Language Understanding Evaluation, is built to measure the general NLU ability in Japanese. JGLUE has been constructed from scratch without translation. We hope that JGLUE will facilitate NLU research in Japanese. JGLUE has been constructed by a joint research project of Yahoo Japan Corporation and Kawahara Lab at Waseda University.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category. Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly. For certain categories contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the context field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. [42]) which we recommend users remove for downstream applications.
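A minimal sketch of the recommended clean-up above: stripping bracketed Wikipedia citation markers such as "[42]" from the context field before downstream use. The example record is a hypothetical placeholder:

```python
# Sketch: remove bracketed citation numbers from a record's context field.
import re

record = {
    "instruction": "Summarize the passage.",  # placeholder record
    "context": "The speed of light is exactly 299,792,458 m/s.[1][2]",
}
record["context"] = re.sub(r"\[\d+\]", "", record["context"])
print(record["context"])
```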
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
A new way of generating robot behavior by representing a robot's visuomotor policy as a conditional denoising diffusion process. We benchmark Diffusion Policy across 12 different tasks from 4 different robot manipulation benchmarks and find that it consistently outperforms existing state-of-the-art robot learning methods with an average improvement of 46.9%. Diffusion Policy learns the gradient of the action-distribution score function and iteratively optimizes with respect to this gradient field during inference via a series of stochastic Langevin dynamics steps. We find that the diffusion formulation yields powerful advantages when used for robot policies, including gracefully handling multimodal action distributions, being suitable for high-dimensional action spaces, and exhibiting impressive training stability. To fully unlock the potential of diffusion models for visuomotor policy learning on physical robots, this paper presents a set of key technical contributions including the incorporation of receding horizon control, visual conditioning, and the time-series diffusion transformer. We hope this work will help motivate a new generation of policy learning techniques that are able to leverage the powerful generative modeling capabilities of diffusion models.
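A schematic sketch of the conditional denoising loop the description refers to: starting from Gaussian noise, an action sequence is iteratively refined by a noise-prediction network conditioned on the observation. The network, noise schedule, and tensor shapes are placeholders, not the released Diffusion Policy implementation:

```python
# Sketch: DDPM-style ancestral sampling of an action sequence, conditioned on an
# observation embedding. All hyperparameters below are illustrative placeholders.
import torch

def sample_actions(noise_pred_net, obs_embedding, horizon=16, action_dim=7, steps=50):
    betas = torch.linspace(1e-4, 0.02, steps)        # placeholder noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    actions = torch.randn(1, horizon, action_dim)    # start from pure noise
    for t in reversed(range(steps)):
        eps = noise_pred_net(actions, t, obs_embedding)        # predicted noise
        coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t])
        mean = (actions - coef * eps) / torch.sqrt(alphas[t])  # posterior mean
        noise = torch.randn_like(actions) if t > 0 else 0.0
        actions = mean + torch.sqrt(betas[t]) * noise
    return actions

def dummy_net(actions, t, obs):  # stands in for a trained noise-prediction network
    return torch.zeros_like(actions)

print(sample_actions(dummy_net, obs_embedding=None).shape)
```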
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. GLUE consists of: A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty, A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, and A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set. The format of the GLUE benchmark is model-agnostic, so any system capable of processing sentence and sentence pairs and producing corresponding predictions is eligible to participate. The benchmark tasks are selected so as to favor models that share information across tasks using parameter sharing or other transfer learning techniques. The ultimate goal of GLUE is to drive research in the development of general and robust natural language understanding systems.
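A minimal sketch of pulling one GLUE task through the Hugging Face `datasets` library, assuming the copy hosted on the Hub; MRPC is used here only as an example of the nine tasks:

```python
# Sketch: load the MRPC task of GLUE and inspect one sentence pair.
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")
print(mrpc)              # train/validation/test splits
print(mrpc["train"][0])  # one sentence pair with its label
```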
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Multilingual dIalogAct benchMark is a collection of resources for training, evaluating, and analyzing natural language understanding systems specifically designed for spoken language. Datasets are in English, French, German, Italian and Spanish. They cover a variety of domains including spontaneous speech, scripted scenarios, and joint task completion. All datasets contain dialogue act labels.
Sentiment140 consists of Twitter messages with emoticons, which are used as noisy labels for sentiment classification. For more detailed information please refer to the paper.
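A hedged sketch of the distant-supervision idea behind Sentiment140: emoticons in a tweet serve as a noisy sentiment label and are then stripped from the text. The emoticon lists and example tweet are illustrative, not the exact rules from the paper:

```python
# Sketch: derive a noisy sentiment label from emoticons, then remove them.
POSITIVE = {":)", ":-)", ":D"}
NEGATIVE = {":(", ":-("}

def noisy_label(tweet: str):
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:          # none or both: skip as ambiguous
        return None
    label = "positive" if has_pos else "negative"
    cleaned = tweet
    for e in POSITIVE | NEGATIVE:   # remove emoticons so models cannot cheat
        cleaned = cleaned.replace(e, "")
    return cleaned.strip(), label

print(noisy_label("just finished the project :)"))
```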
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
To evaluate multi-lingual language models (ML-LMs) for commonsense reasoning in a cross-lingual zero-shot transfer setting (X-CSR), i.e., training in English and testing in other languages, we create two benchmark datasets, namely X-CSQA and X-CODAH. Specifically, we automatically translate the original CSQA and CODAH datasets, which only have English versions, into 15 other languages, forming development and test sets for studying X-CSR. As our goal is to evaluate different ML-LMs in a unified evaluation protocol for X-CSR, we argue that such translated examples, although they might contain noise, can serve as a starting benchmark for obtaining meaningful analysis until more human-translated datasets become available.
Inferring human-scene contact (HSC) is the first step toward understanding how humans interact with their surroundings. While detecting 2D human-object interaction (HOI) and reconstructing 3D human pose and shape (HPS) have enjoyed significant progress, reasoning about 3D human-scene contact from a single image is still challenging. Existing HSC detection methods consider only a few types of predefined contact, often reduce body and scene to a small number of primitives, and even overlook image evidence. To predict human-scene contact from a single image, we address the limitations above from both data and algorithmic perspectives. We capture a new dataset called RICH for “Real scenes, Interaction, Contact and Humans.” RICH contains multiview outdoor/indoor video sequences at 4K resolution, ground-truth 3D human bodies captured using markerless motion capture, 3D body scans, and high resolution 3D scene scans. A key feature of RICH is that it also contains accurate vertex-level contact labels on the body. Using RICH, we train a network that predicts dense body-scene contacts from a single RGB image. Our key insight is that regions in contact are always occluded so the network needs the ability to explore the whole image for evidence. We use a transformer to learn such non-local relationships and propose a new Body-Scene contact TRansfOrmer (BSTRO). Very few methods explore 3D contact; those that do focus on the feet only, detect foot contact as a post-processing step, or infer contact from body pose without looking at the scene. To our knowledge, BSTRO is the first method to directly estimate 3D body-scene contact from a single image. We demonstrate that BSTRO significantly outperforms the prior art.