11 datasets found
  1. dev-push-to-hub

    • huggingface.co
    Updated Aug 31, 2021
    Cite
    Ashim Mahara (2021). dev-push-to-hub [Dataset]. https://huggingface.co/datasets/ashim/dev-push-to-hub
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 31, 2021
    Authors
    Ashim Mahara
    Description

    ashim/dev-push-to-hub dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. push-to-hub

    • huggingface.co
    Updated Dec 25, 2021
    + more versions
    Cite
    Lysandre (2021). push-to-hub [Dataset]. https://huggingface.co/datasets/LysandreJik/push-to-hub
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 25, 2021
    Authors
    Lysandre
    Description

    LysandreJik/push-to-hub dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. subset-0-push-to-hub

    • huggingface.co
    Updated Sep 14, 2025
    Cite
    Sraghvi Anchaliya (2025). subset-0-push-to-hub [Dataset]. https://huggingface.co/datasets/Sraghvi/subset-0-push-to-hub
    Dataset updated
    Sep 14, 2025
    Authors
    Sraghvi Anchaliya
    Description

    Sraghvi/subset-0-push-to-hub dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. testing-distilabel-push-to-hub-2

    • huggingface.co
    + more versions
    Cite
    distilabel-internal-testing, testing-distilabel-push-to-hub-2 [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/testing-distilabel-push-to-hub-2
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    distilabel-internal-testing
    Description

    distilabel-internal-testing/testing-distilabel-push-to-hub-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. LinCE (Linguistic Code-switching Evaluation)

    • kaggle.com
    zip
    Updated Dec 1, 2022
    Cite
    The Devastator (2022). LinCE (Linguistic Code-switching Evaluation) [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlock-universal-language-with-the-lince-dataset
    Available download formats: zip (11,808,965 bytes)
    Dataset updated
    Dec 1, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    LinCE (Linguistic Code-switching Evaluation)

    Data for training and evaluating NLP systems on code-switching tasks

    By Huggingface Hub [source]

    About this dataset

    Do you want to explore language through analysis? The LinCE dataset is an expansive collection of code-switched language data that can be used for a wide range of purposes. It covers several code-switched language pairs: Spanish-English, Hindi-English, Nepali-English, and Modern Standard Arabic-Egyptian Arabic (MSA-EA). Annotations are provided for language identification (LID), part-of-speech (POS) tagging, named-entity recognition (NER), sentiment analysis (SA), and more. You can train models to automatically detect and classify labels such as POS tags or named entities for each variety, or build cross-lingual models that share knowledge between language pairs. LinCE's diversity makes it a rich resource for exploratory NLP research.


    How to use the dataset

    Are you looking to unlock the potential of multilingual natural language processing (NLP) with the LinCE dataset? If so, you’re in the right place! With multiple code-switched language pairs and training data for language identification (LID), part-of-speech (POS) tagging, named-entity recognition (NER), sentiment analysis (SA) and more, this is one of the most comprehensive datasets for code-switching research today.

    Understand what is included in this dataset. The dataset contains language technology data for several code-switched language pairs, including Spanish-English, Hindi-English, Nepali-English, and Modern Standard Arabic-Egyptian Arabic (MSA-EA). Each file is labelled according to its content; for example, lid_msaea_test.csv contains test data for language identification (LID), with five columns covering words, part-of-speech tags and sentiment analysis labels. A brief summary of each file's contents is shown when you pull the dataset up on Kaggle, or you can inspect a file yourself with calls such as head() or describe(), depending on your software preferences (see the sketch after these steps).

    Decide what kind of analysis you want to do. Once you are familiar with the data provided, decide which kind of model or analysis you want to build before writing any code for that task. For example, if you want to build a cross-lingual model for POS tagging, it helps to have training and validation sets from several language pairs so that knowledge can be shared between them during training; files such as pos_spaeng_train and pos_hineng_validation would come into play. When designing your model architecture, make sure the task-specific hyperparameters complement each other, and choose an appropriate feature representation strategy to improve performance.

    Run appropriate algorithms on the data provided in the dataset. Once you understand what the dataset contains, you can start running the appropriate algorithms, whichever tools you use, and tune your models against metrics such as accuracy and F1 score. After tuning, make sure the system works reliably by evaluating it on the unseen test set and checking that it produces the desired results. During optimization, hyperparameter tuning plays a significant role, and its impact depends on the algorithm chosen.
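
    As a minimal sketch of that inspection step, assuming the Kaggle archive has been downloaded and unzipped locally and using the lid_msaea_test.csv file name from the Columns listing below, a quick pandas session might look like this:

        import pandas as pd

        # Load one of the LinCE CSV files from the local download.
        df = pd.read_csv("lid_msaea_test.csv")

        print(df.shape)                     # number of rows and columns
        print(df.head())                    # first few rows: tokens and their labels
        print(df.describe(include="all"))   # per-column summary statistics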

    Research Ideas

    • Developing a multilingual sentiment analysis system that can analyze sentiment in any of the six languages.
    • Training a model to identify and classify named entities across multiple languages, such as identifying certain words for proper nouns or locations regardless of language or coding scheme.
    • Developing an AI-powered cross-lingual translator that is able to effectively translate text from one language to another with minimal errors and maximum accuracy

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: lid_msaea_test.csv...

  6. OpenBookQA (Multi-step Reasoning)

    • kaggle.com
    zip
    Updated Nov 21, 2022
    Cite
    The Devastator (2022). OpenBookQA (Multi-step Reasoning) [Dataset]. https://www.kaggle.com/datasets/thedevastator/openbookqa-a-new-dataset-for-advanced-question-a/discussion?sort=undefined
    Available download formats: zip (826,782 bytes)
    Dataset updated
    Nov 21, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    OpenBookQA: A New Dataset for Advanced Question-Answering

    Multi-step Reasoning, Commonsense Knowledge, and Rich Text Comprehension

    Source

    Hugging Face Hub

    About this dataset

    OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. OpenBookQA is a new kind of question-answering dataset modeled after open book exams for assessing human understanding of a subject.

    With OpenBookQA, we hope to push the boundaries of what current QA models can do and advance the state of the art in this field. In addition to providing a challenging benchmark for existing models, we hope that this dataset will encourage new model architectures that can better handle complex questions and reasoning.

    How to use the dataset

    Research Ideas

    • Questions that require multi-step reasoning,
    • Use of additional common and commonsense knowledge,
    • Rich text comprehension

    Acknowledgements

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

    Columns

    Files main_test.csv, main_train.csv, additional_train.csv, additional_test.csv and additional_validation.csv share the same schema:

    | Column name   | Description                                                     |
    |:--------------|:----------------------------------------------------------------|
    | question_stem | The stem of the question. (String)                              |
    | choices       | A list of answers to choose from. (List)                        |
    | answerKey     | The index of the correct answer in the choices list. (Integer)  |

    File: ...
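
    As a quick sketch of reading one of these files with pandas (local paths assumed; the choices column is assumed here to be a stringified list, which the parsing step would need adjusting for if the file stores it differently):

        import ast
        import pandas as pd

        # Load the training split described in the column table above (local path assumed).
        train = pd.read_csv("main_train.csv")

        # 'choices' is assumed to be a stringified list; parse it back into a Python list.
        train["choices"] = train["choices"].apply(ast.literal_eval)

        row = train.iloc[0]
        print(row["question_stem"])   # the question text
        print(row["choices"])         # the candidate answers
        print(row["answerKey"])       # per the table above, the index of the correct choice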

  7. IPCCBench-mini

    • huggingface.co
    Updated Aug 3, 2025
    Cite
    Patrick Fleith (2025). IPCCBench-mini [Dataset]. https://huggingface.co/datasets/patrickfleith/IPCCBench-mini
    Dataset updated
    Aug 3, 2025
    Authors
    Patrick Fleith
    Description

    Ipccbench Mini

    This dataset was generated using YourBench (v0.3.1), an open-source framework for generating domain-specific benchmarks from document collections.

      Pipeline Steps
    

    • ingestion: Read raw source documents, convert them to normalized markdown and save them for downstream steps
    • upload_ingest_to_hub: Package and push the ingested markdown dataset to the Hugging Face Hub, or save it locally with standardized fields
    • summarization: Perform hierarchical summarization: chunk-level…

    See the full description on the dataset page: https://huggingface.co/datasets/patrickfleith/IPCCBench-mini.
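
    To start working with the published data, a minimal sketch using the datasets library (assuming the default configuration and whatever splits the repository exposes):

        from datasets import load_dataset

        # Repo id taken from the citation above; loads all published splits.
        ds = load_dataset("patrickfleith/IPCCBench-mini")

        print(ds)                           # available splits and their features
        first_split = next(iter(ds.values()))
        print(first_split[0])               # a peek at one record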

  8. my_raft

    • huggingface.co
    Updated Aug 29, 2021
    + more versions
    Cite
    Moshe Wasserblat (2021). my_raft [Dataset]. https://huggingface.co/datasets/moshew/my_raft
    Dataset updated
    Aug 29, 2021
    Authors
    Moshe Wasserblat
    Description

    RAFT submissions for my_raft

      Submitting to the leaderboard
    

    To make a submission to the leaderboard, there are three main steps:

    1. Generate predictions on the unlabeled test set of each task
    2. Validate that the predictions are compatible with the evaluation framework
    3. Push the predictions to the Hub!

    See the instructions below for more details.
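
    Step 3 boils down to a datasets push_to_hub call. A minimal sketch of that general pattern follows; the repo id, column names and labels are illustrative placeholders, not the benchmark's actual submission layout:

        from datasets import Dataset

        # Hypothetical predictions for one task: example IDs plus predicted labels.
        predictions = Dataset.from_dict({
            "ID": [0, 1, 2],
            "Label": ["label_a", "label_b", "label_a"],
        })

        # Pushing requires an authenticated Hugging Face account
        # (e.g. log in first with `huggingface-cli login`).
        predictions.push_to_hub("your-username/my_raft")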

      Rules
    

    To prevent overfitting to the public leaderboard, we only evaluate one submission per week. You can push predictions to… See the full description on the dataset page: https://huggingface.co/datasets/moshew/my_raft.

  9. RaftSub

    • huggingface.co
    Updated Aug 29, 2021
    Cite
    László Hanzel (2021). RaftSub [Dataset]. https://huggingface.co/datasets/HLaci/RaftSub
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 29, 2021
    Authors
    László Hanzel
    Description

    RAFT submissions for RaftSub

      Submitting to the leaderboard
    

    To make a submission to the leaderboard, there are three main steps:

    1. Generate predictions on the unlabeled test set of each task
    2. Validate that the predictions are compatible with the evaluation framework
    3. Push the predictions to the Hub!

    See the instructions below for more details.

      Rules
    

    To prevent overfitting to the public leaderboard, we only evaluate one submission per week. You can push predictions to… See the full description on the dataset page: https://huggingface.co/datasets/HLaci/RaftSub.

  10. Kimono

    • huggingface.co
    Updated Dec 2, 2024
    Cite
    Hung Vu (2024). Kimono [Dataset]. https://huggingface.co/datasets/Hoshik/Kimono
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 2, 2024
    Authors
    Hung Vu
    Description

    import datasets

    ds = datasets.load_dataset("Hoshik/Kimono")     # load the dataset from the Hub
    ds = ds.rename_column("Url", "image")           # rename the "Url" column to "image" (feel free to skip this step)
    ds = ds.cast_column("image", datasets.Image())  # cast the "image" column from Value("string") to Image()
    ds.push_to_hub("Hoshik/Kimono")                 # push the "fixed" dataset to the Hub as Parquet

  11. Nepali-Text-Corpus

    • huggingface.co
    Updated Jul 5, 2023
    Cite
    Regan Maharjan (2023). Nepali-Text-Corpus [Dataset]. https://huggingface.co/datasets/raygx/Nepali-Text-Corpus
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2023
    Authors
    Regan Maharjan
    Area covered
    Nepal (नेपाल)
    Description

    For some reason I can't push the data to the Hub using the push_to_hub() method. I kept getting an identical_ok error, or sometimes the data didn't get uploaded even though the call reported success. Anyway, the data can be found on Kaggle.

