ashim/dev-push-to-hub dataset hosted on Hugging Face and contributed by the HF Datasets community
LysandreJik/push-to-hub dataset hosted on Hugging Face and contributed by the HF Datasets community
Sraghvi/subset-0-push-to-hub dataset hosted on Hugging Face and contributed by the HF Datasets community
distilabel-internal-testing/testing-distilabel-push-to-hub-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
Do you want to uncover the power of language through analysis? The Lince Dataset is the answer! This expansive collection of language-technology data covers six language varieties: Spanish, Hindi, Nepali, Spanish-English, Hindi-English, and Modern Standard Arabic-Egyptian Arabic (MSA-EA). It provides data for language identification (LID), part-of-speech (POS) tagging, named-entity recognition (NER), sentiment analysis (SA), and more. Train models to automatically detect and classify tags such as POS or NER for each variety, or build cross-lingual models spanning multiple languages. Dive into exploratory research with this feast for NLP connoisseurs and unlock hidden opportunities today!
Are you looking to unlock the potential of multilingual natural language processing (NLP) with the Lince Dataset? If so, you’re in the right place! With six languages and training data for language identification (LID), part-of-speech (POS) tagging, Named-Entity Recognition (NER), sentiment analysis (SA) and more, this is one of the most comprehensive datasets for NLP today.
Understand what is included in this dataset. This dataset includes language-technology data from six varieties: Spanish, Hindi, Nepali, Spanish-English, Hindi-English, and Modern Standard Arabic-Egyptian Arabic (MSA-EA). Each file is labelled according to its content; e.g. lid_msaea_test.csv contains test data for language identification (LID), with 5 columns covering words, part-of-speech tags, and sentiment-analysis labels. A brief summary of each file's contents can be found when you pull this dataset up on Kaggle, or by running a command such as head() or describe(), depending on your software preferences.
Decide what kind of analysis you want to do. Once you are familiar with the data provided, decide which kind of model or analysis you want to build before coding any algorithms for that task. For example, if you want to build a cross-lingual model for POS tagging, it is ideal to have training and validation sets from three different languages so the model can exchange knowledge across domains during training; selecting files such as pos_spaeng_train and pos_hineng_validation comes into play here. While designing your model architecture, make sure task-specific hyperparameters complement each other, and choose an appropriate feature-vector representation strategy, which helps improve performance.
Run appropriate algorithms on the data provided in the dataset. With all the elements in place, you can start running the algorithms appropriate to your task, regardless of the tools used, tuning your models with metrics such as accuracy and F1 score. Once tuned, make sure the system works reliably by testing it on the unseen test set and checking for the desired results. During optimization, hyperparameter tuning plays a significant role, with specifics depending on the algorithm chosen.
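As a minimal sketch of the inspection step above, the snippet below loads one of the files with pandas and prints a quick summary. The local file path is an assumption (it presumes you have downloaded the CSVs from the dataset page), and column names should be checked against the actual file.

import pandas as pd

# Load the LID test split for the MSA-EA variety (assumes a local download).
df = pd.read_csv("lid_msaea_test.csv")

# Quick look at the structure: first rows, then per-column summary statistics.
print(df.head())
print(df.describe(include="all"))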
- Developing a multilingual sentiment analysis system that can analyze sentiment in any of the six languages.
- Training a model to identify and classify named entities across multiple languages, such as identifying certain words for proper nouns or locations regardless of language or coding scheme.
- Developing an AI-powered cross-lingual translator that can effectively translate text from one language to another with minimal errors and maximum accuracy.
If you use this dataset in your research, please credit the original authors.
Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: lid_msaea_test.csv...
https://creativecommons.org/publicdomain/zero/1.0/
Huggingface Hub: link
OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. OpenBookQA is a new kind of question-answering dataset modeled after open book exams for assessing human understanding of a subject.
With OpenBookQA, we hope to push the boundaries of what current QA models can do and advance the state of the art in this field. In addition to providing a challenging benchmark for existing models, we hope that this dataset will encourage new model architectures that can better handle complex questions and reasoning.
- Questions that require multi-step reasoning,
- Use of additional common and commonsense knowledge,
- Rich text comprehension
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Files: main_test.csv, main_train.csv, additional_train.csv, additional_test.csv, and additional_validation.csv all share the same schema:

| Column name   | Description                                                                                     |
|:--------------|:------------------------------------------------------------------------------------------------|
| question_stem | The column 'question_stem' contains the stem of the question. (String)                           |
| choices       | The column 'choices' contains a list of answers to choose from. (List)                           |
| answerKey     | The column 'answerKey' contains the index of the correct answer in the choices list. (Integer)   |
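As a rough sketch of how these files might be consumed, the snippet below loads main_train.csv with pandas and resolves each answerKey against its choices list. Two assumptions, flagged because the card does not spell them out: the file has been downloaded locally, and the choices column arrives as a string-encoded list that needs ast.literal_eval.

import ast
import pandas as pd

# Load the training split (assumes a local download of the dataset).
df = pd.read_csv("main_train.csv")

# 'choices' is described as a list; in a CSV it typically arrives as a string,
# so parse it back into a Python list (an assumption about the serialization).
df["choices"] = df["choices"].apply(ast.literal_eval)

# Per the column descriptions above, 'answerKey' is the integer index of the
# correct answer within 'choices'.
df["answer_text"] = [c[k] for c, k in zip(df["choices"], df["answerKey"])]

print(df[["question_stem", "answer_text"]].head())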
File: ...
IPCCBench Mini
This dataset was generated using YourBench (v0.3.1), an open-source framework for generating domain-specific benchmarks from document collections.
Pipeline Steps
- ingestion: Read raw source documents, convert them to normalized markdown, and save for downstream steps
- upload_ingest_to_hub: Package and push the ingested markdown dataset to the Hugging Face Hub or save locally with standardized fields
- summarization: Perform hierarchical summarization: chunk-level… See the full description on the dataset page: https://huggingface.co/datasets/patrickfleith/IPCCBench-mini.
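To inspect the generated benchmark itself, a load along these lines should work; it makes no assumption about split names, simply printing the layout that load_dataset returns.

from datasets import load_dataset

# Pull the YourBench-generated benchmark from the Hub.
ds = load_dataset("patrickfleith/IPCCBench-mini")
print(ds)  # shows the available splits and their columns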
RAFT submissions for my_raft
Submitting to the leaderboard
To make a submission to the leaderboard, there are three main steps:
1. Generate predictions on the unlabeled test set of each task
2. Validate that the predictions are compatible with the evaluation framework
3. Push the predictions to the Hub!
See the instructions below for more details.
Rules
To prevent overfitting to the public leaderboard, we only evaluate one submission per week. You can push predictions to… See the full description on the dataset page: https://huggingface.co/datasets/moshew/my_raft.
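As a rough illustration of the final step, the sketch below writes a predictions file and uploads it to the submission repo with huggingface_hub. The data/<task>/predictions.csv layout and the ID/Label columns are assumptions based on typical RAFT submission templates, not something this card specifies; check the full instructions before submitting.

import csv
from huggingface_hub import HfApi

# Write predictions for one task (ID/Label columns are an assumed layout).
with open("predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "Label"])
    writer.writerow([0, "some_label"])  # hypothetical prediction row

# Upload the file into the submission dataset repo on the Hub.
api = HfApi()
api.upload_file(
    path_or_fileobj="predictions.csv",
    path_in_repo="data/some_task/predictions.csv",  # assumed task folder
    repo_id="moshew/my_raft",
    repo_type="dataset",
)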
RAFT submissions for RaftSub
Submitting to the leaderboard
To make a submission to the leaderboard, there are three main steps:
1. Generate predictions on the unlabeled test set of each task
2. Validate that the predictions are compatible with the evaluation framework
3. Push the predictions to the Hub!
See the instructions below for more details.
Rules
To prevent overfitting to the public leaderboard, we only evaluate one submission per week. You can push predictions to… See the full description on the dataset page: https://huggingface.co/datasets/HLaci/RaftSub.
from datasets import load_dataset, Image

ds = load_dataset("Hoshik/Kimono")
ds = ds.rename_column("Url", "image")  # renames the "Url" column to "image" (feel free to skip this step)
ds = ds.cast_column("image", Image())  # casts the "image" column from Value("string") to Image()
ds.push_to_hub("Hoshik/Kimono")  # pushes the "fixed" dataset to the Hub as Parquet
For some reason I can't push the data to the Hub using the push_to_hub() method. I kept getting an identical_ok error, and sometimes the data didn't get uploaded even when the call reported success. In any case, the data can be found on Kaggle.