11 datasets found
  1. dev-push-to-hub

    • huggingface.co
    Updated Aug 31, 2021
    Cite
    Ashim Mahara (2021). dev-push-to-hub [Dataset]. https://huggingface.co/datasets/ashim/dev-push-to-hub
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 31, 2021
    Authors
    Ashim Mahara
    Description

    ashim/dev-push-to-hub dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. push-to-hub

    • huggingface.co
    Updated Dec 25, 2021
    + more versions
    Cite
    Lysandre (2021). push-to-hub [Dataset]. https://huggingface.co/datasets/LysandreJik/push-to-hub
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 25, 2021
    Authors
    Lysandre
    Description

    LysandreJik/push-to-hub dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. subset-0-push-to-hub

    • huggingface.co
    Updated Sep 14, 2025
    Cite
    Sraghvi Anchaliya (2025). subset-0-push-to-hub [Dataset]. https://huggingface.co/datasets/Sraghvi/subset-0-push-to-hub
    Dataset updated
    Sep 14, 2025
    Authors
    Sraghvi Anchaliya
    Description

    Sraghvi/subset-0-push-to-hub dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. testing-distilabel-push-to-hub-2

    • huggingface.co
    + more versions
    Cite
    distilabel-internal-testing, testing-distilabel-push-to-hub-2 [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/testing-distilabel-push-to-hub-2
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    distilabel-internal-testing
    Description

    distilabel-internal-testing/testing-distilabel-push-to-hub-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. LinCE (Linguistic Code-switching Evaluation)

    • kaggle.com
    zip
    Updated Dec 1, 2022
    Cite
    The Devastator (2022). LinCE (Linguistic Code-switching Evaluation) [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlock-universal-language-with-the-lince-dataset
    Available download formats: zip (11,808,965 bytes)
    Dataset updated
    Dec 1, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    LinCE (Linguistic Code-switching Evaluation)

    Data for training and evaluating NLP systems on code-switching tasks

    By Huggingface Hub [source]

    About this dataset

    Do you want to explore language through analysis? The LinCE dataset is an expansive collection of code-switched language data that can be used for a wide range of purposes. It covers several code-switched language pairs: Spanish-English, Hindi-English, Nepali-English, and Modern Standard Arabic-Egyptian Arabic (MSA-EA). Annotations are provided for language identification (LID), part-of-speech (POS) tagging, named-entity recognition (NER), sentiment analysis (SA), and more. You can train models to automatically detect and classify labels such as POS tags or named entities for each variety, or build cross-lingual models that share knowledge between language pairs. LinCE's diversity makes it a rich resource for exploratory NLP research.


    How to use the dataset

    Are you looking to unlock the potential of multilingual natural language processing (NLP) with the LinCE dataset? If so, you’re in the right place! With multiple code-switched language pairs and training data for language identification (LID), part-of-speech (POS) tagging, named-entity recognition (NER), sentiment analysis (SA) and more, this is one of the most comprehensive datasets for code-switching research today.

    Understand what is included in this dataset. The dataset contains language technology data for several code-switched language pairs, including Spanish-English, Hindi-English, Nepali-English, and Modern Standard Arabic-Egyptian Arabic (MSA-EA). Each file is labelled according to its content; for example, lid_msaea_test.csv contains test data for language identification (LID), with five columns covering words, part-of-speech tags and sentiment analysis labels. A brief summary of each file's contents is shown when you pull the dataset up on Kaggle, or you can inspect a file yourself with calls such as head() or describe(), depending on your software preferences (see the sketch after these steps).

    Decide what kind of analysis you want to do. Once you are familiar with the data provided, decide which kind of model or analysis you want to build before writing any code for that task. For example, if you want to build a cross-lingual model for POS tagging, it helps to have training and validation sets from several language pairs so that knowledge can be shared between them during training; files such as pos_spaeng_train and pos_hineng_validation would come into play. When designing your model architecture, make sure the task-specific hyperparameters complement each other, and choose an appropriate feature representation strategy to improve performance.

    Run appropriate algorithms on the data provided in the dataset. Once you understand what the dataset contains, you can start running the appropriate algorithms, whichever tools you use, and tune your models against metrics such as accuracy and F1 score. After tuning, make sure the system works reliably by evaluating it on the unseen test set and checking that it produces the desired results. During optimization, hyperparameter tuning plays a significant role, and its impact depends on the algorithm chosen.
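
    As a minimal sketch of that inspection step, assuming the Kaggle archive has been downloaded and unzipped locally and using the lid_msaea_test.csv file name from the Columns listing below, a quick pandas session might look like this:

        import pandas as pd

        # Load one of the LinCE CSV files from the local download.
        df = pd.read_csv("lid_msaea_test.csv")

        print(df.shape)                     # number of rows and columns
        print(df.head())                    # first few rows: tokens and their labels
        print(df.describe(include="all"))   # per-column summary statistics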

    Research Ideas

    • Developing a multilingual sentiment analysis system that can analyze sentiment in any of the six languages.
    • Training a model to identify and classify named entities across multiple languages, such as identifying certain words for proper nouns or locations regardless of language or coding scheme.
    • Developing an AI-powered cross-lingual translator that is able to effectively translate text from one language to another with minimal errors and maximum accuracy

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: lid_msaea_test.csv...

  6. OpenBookQA (Multi-step Reasoning)

    • kaggle.com
    zip
    Updated Nov 21, 2022
    Cite
    The Devastator (2022). OpenBookQA (Multi-step Reasoning) [Dataset]. https://www.kaggle.com/datasets/thedevastator/openbookqa-a-new-dataset-for-advanced-question-a/discussion?sort=undefined
    Available download formats: zip (826,782 bytes)
    Dataset updated
    Nov 21, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    OpenBookQA: A New Dataset for Advanced Question-Answering

    Multi-step Reasoning, Commonsense Knowledge, and Rich Text Comprehension

    Source

    Hugging Face Hub

    About this dataset

    OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. OpenBookQA is a new kind of question-answering dataset modeled after open book exams for assessing human understanding of a subject.

    With OpenBookQA, we hope to push the boundaries of what current QA models can do and advance the state of the art in this field. In addition to providing a challenging benchmark for existing models, we hope that this dataset will encourage new model architectures that can better handle complex questions and reasoning.

    How to use the dataset

    Research Ideas

    • Questions that require multi-step reasoning,
    • Use of additional common and commonsense knowledge,
    • Rich text comprehension

    Acknowledgements

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

    Columns

    Files main_test.csv, main_train.csv, additional_train.csv, additional_test.csv and additional_validation.csv share the same schema:

    | Column name   | Description                                                     |
    |:--------------|:----------------------------------------------------------------|
    | question_stem | The stem of the question. (String)                              |
    | choices       | A list of answers to choose from. (List)                        |
    | answerKey     | The index of the correct answer in the choices list. (Integer)  |

    File: ...
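
    As a quick sketch of reading one of these files with pandas (local paths assumed; the choices column is assumed here to be a stringified list, which the parsing step would need adjusting for if the file stores it differently):

        import ast
        import pandas as pd

        # Load the training split described in the column table above (local path assumed).
        train = pd.read_csv("main_train.csv")

        # 'choices' is assumed to be a stringified list; parse it back into a Python list.
        train["choices"] = train["choices"].apply(ast.literal_eval)

        row = train.iloc[0]
        print(row["question_stem"])   # the question text
        print(row["choices"])         # the candidate answers
        print(row["answerKey"])       # per the table above, the index of the correct choice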

  7. IPCCBench-mini

    • huggingface.co
    Updated Aug 3, 2025
    Cite
    Patrick Fleith (2025). IPCCBench-mini [Dataset]. https://huggingface.co/datasets/patrickfleith/IPCCBench-mini
    Dataset updated
    Aug 3, 2025
    Authors
    Patrick Fleith
    Description

    Ipccbench Mini

    This dataset was generated using YourBench (v0.3.1), an open-source framework for generating domain-specific benchmarks from document collections.

      Pipeline Steps
    

    • ingestion: Read raw source documents, convert them to normalized markdown and save them for downstream steps
    • upload_ingest_to_hub: Package and push the ingested markdown dataset to the Hugging Face Hub, or save it locally with standardized fields
    • summarization: Perform hierarchical summarization: chunk-level…

    See the full description on the dataset page: https://huggingface.co/datasets/patrickfleith/IPCCBench-mini.
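
    To start working with the published data, a minimal sketch using the datasets library (assuming the default configuration and whatever splits the repository exposes):

        from datasets import load_dataset

        # Repo id taken from the citation above; loads all published splits.
        ds = load_dataset("patrickfleith/IPCCBench-mini")

        print(ds)                           # available splits and their features
        first_split = next(iter(ds.values()))
        print(first_split[0])               # a peek at one record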

  8. my_raft

    • huggingface.co
    Updated Aug 29, 2021
    + more versions
    Cite
    Moshe Wasserblat (2021). my_raft [Dataset]. https://huggingface.co/datasets/moshew/my_raft
    Dataset updated
    Aug 29, 2021
    Authors
    Moshe Wasserblat
    Description

    RAFT submissions for my_raft

      Submitting to the leaderboard
    

    To make a submission to the leaderboard, there are three main steps:

    1. Generate predictions on the unlabeled test set of each task
    2. Validate that the predictions are compatible with the evaluation framework
    3. Push the predictions to the Hub!

    See the instructions below for more details.
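
    Step 3 boils down to a datasets push_to_hub call. A minimal sketch of that general pattern follows; the repo id, column names and labels are illustrative placeholders, not the benchmark's actual submission layout:

        from datasets import Dataset

        # Hypothetical predictions for one task: example IDs plus predicted labels.
        predictions = Dataset.from_dict({
            "ID": [0, 1, 2],
            "Label": ["label_a", "label_b", "label_a"],
        })

        # Pushing requires an authenticated Hugging Face account
        # (e.g. log in first with `huggingface-cli login`).
        predictions.push_to_hub("your-username/my_raft")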

      Rules
    

    To prevent overfitting to the public leaderboard, we only evaluate one submission per week. You can push predictions to… See the full description on the dataset page: https://huggingface.co/datasets/moshew/my_raft.

  9. RaftSub

    • huggingface.co
    Updated Aug 29, 2021
    Cite
    László Hanzel (2021). RaftSub [Dataset]. https://huggingface.co/datasets/HLaci/RaftSub
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 29, 2021
    Authors
    László Hanzel
    Description

    RAFT submissions for RaftSub

      Submitting to the leaderboard
    

    To make a submission to the leaderboard, there are three main steps:

    1. Generate predictions on the unlabeled test set of each task
    2. Validate that the predictions are compatible with the evaluation framework
    3. Push the predictions to the Hub!

    See the instructions below for more details.

      Rules
    

    To prevent overfitting to the public leaderboard, we only evaluate one submission per week. You can push predictions to… See the full description on the dataset page: https://huggingface.co/datasets/HLaci/RaftSub.

  10. Kimono

    • huggingface.co
    Updated Dec 2, 2024
    Cite
    Hung Vu (2024). Kimono [Dataset]. https://huggingface.co/datasets/Hoshik/Kimono
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 2, 2024
    Authors
    Hung Vu
    Description

    import datasets

    ds = datasets.load_dataset("Hoshik/Kimono")     # load the dataset from the Hub
    ds = ds.rename_column("Url", "image")           # rename the "Url" column to "image" (feel free to skip this step)
    ds = ds.cast_column("image", datasets.Image())  # cast the "image" column from Value("string") to Image()
    ds.push_to_hub("Hoshik/Kimono")                 # push the "fixed" dataset to the Hub as Parquet

  11. Nepali-Text-Corpus

    • huggingface.co
    Updated Jul 5, 2023
    Cite
    Regan Maharjan (2023). Nepali-Text-Corpus [Dataset]. https://huggingface.co/datasets/raygx/Nepali-Text-Corpus
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2023
    Authors
    Regan Maharjan
    Area covered
    Nepal (नेपाल)
    Description

    For some reason I can't push the data to the Hub using the push_to_hub() method. I kept getting an identical_ok error, or sometimes the data didn't get uploaded even though the call reported success. Anyway, the data can be found on Kaggle.

