31 datasets found

h
kaggle-nlp-getting-start
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
hui, kaggle-nlp-getting-start [Dataset]. https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
hui
Description
Dataset Summary

Natural Language Processing with Disaster Tweets: https://www.kaggle.com/competitions/nlp-getting-started/data This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don’t have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks.

Columns

id - a unique identifier for each tweet… See the full description on the dataset page: https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start.
NLP with Disaster Tweets - cleaning data
kaggle.com
zip
Updated Sep 11, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vitalii Mokin (2021). NLP with Disaster Tweets - cleaning data [Dataset]. https://www.kaggle.com/vbmokin/nlp-with-disaster-tweets-cleaning-data
Explore at:
zip(1053715 bytes)Available download formats
Dataset updated
Sep 11, 2021
Authors
Vitalii Mokin
License
Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Description
Context

The data obtained by clearing the Getting Started Prediction Competition "Real or Not? NLP with Disaster Tweets" data is the result of a public notebook "NLP with Disaster Tweets - EDA and Cleaning data". In the future, I plan to improve cleaning and update the dataset

Content

id - a unique identifier for each tweet text - the text of the tweet location - the location the tweet was sent from (may be blank) keyword - a particular keyword from the tweet (may be blank) target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

Acknowledgements

Thanks to Kaggle team for this Competition "Real or Not? NLP with Disaster Tweets" and its datasets (this dataset was created by the company figure-eight and originally shared on their ‘Data For Everyone’ website here. Tweet source: https://twitter.com/AnyOtherAnnaK/status/629195955506708480).

Thanks to web-site Ambulance services drive, strive to keep you alive for your image, which is very similar to the image of the contest "Real or Not? NLP with Disaster Tweets" and which I used as the image of my dataset

Inspiration

You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.
NLP in Practice competition dataset
kaggle.com
zip
Updated Jun 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mark Baushenko (2023). NLP in Practice competition dataset [Dataset]. https://www.kaggle.com/datasets/e0xextazy/nlp-in-practice-competition-dataset
Explore at:
zip(36685070 bytes)Available download formats
Dataset updated
Jun 12, 2023
Authors
Mark Baushenko
Description
Dataset

This dataset was created by Mark Baushenko

Contents
Disaster Tweets Geodata
kaggle.com
zip
Updated Jan 3, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fred Navruzov (2020). Disaster Tweets Geodata [Dataset]. https://www.kaggle.com/datasets/frednavruzov/disaster-tweets-geodata
Explore at:
zip(74586 bytes)Available download formats
Dataset updated
Jan 3, 2020
Authors
Fred Navruzov
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Tweet's geodata, extracted from pre-cleaned location field in Real or Not? NLP with Disaster Tweets competition data to make geospatial analysis easier

Content

Simple geodata, based on Real or Not? NLP with Disaster Tweets competition.
The data was extracted with geopy atop of ArcGIS geocoding service.
h
kaggle_mnli
huggingface.co
Updated Apr 28, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chris Huber (2022). kaggle_mnli [Dataset]. https://huggingface.co/datasets/chrishuber/kaggle_mnli
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 28, 2022
Authors
Chris Huber
Description
Dataset Card for [Kaggle MNLI]

Dataset Summary

[These are the datasets posted to Kaggle for an inference detection NLP competition. Moving them here to use with Pytorch.]

Supported Tasks and Leaderboards

Provides train and validation data for sentence pairs with inference labels. [https://www.kaggle.com/competitions/multinli-matched-open-evaluation/leaderboard] [https://www.kaggle.com/competitions/multinli-mismatched-open-evaluation/leaderboard]… See the full description on the dataset page: https://huggingface.co/datasets/chrishuber/kaggle_mnli.
NLP Disaster Tweets competition data
kaggle.com
zip
Updated Sep 30, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ritin Nambiar (2022). NLP Disaster Tweets competition data [Dataset]. https://www.kaggle.com/datasets/ritinnambiar/nlp-disaster-tweets-competition-data
Explore at:
zip(607343 bytes)Available download formats
Dataset updated
Sep 30, 2022
Authors
Ritin Nambiar
Description
Dataset

This dataset was created by Ritin Nambiar

Contents
NLP Starter Test
kaggle.com
zip
Updated Jun 24, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
loryn808 (2018). NLP Starter Test [Dataset]. https://www.kaggle.com/loryn808/nlp-starter-test
Explore at:
zip(824521 bytes)Available download formats
Dataset updated
Jun 24, 2018
Authors
loryn808
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by loryn808

Released under CC0: Public Domain

Contents
Z
Navigating News Narratives: A Media Bias Analysis Dataset
data-staging.niaid.nih.gov
Updated Nov 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Raza, Shaina (2023). Navigating News Narratives: A Media Bias Analysis Dataset [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_10037860
Explore at:
Dataset updated
Nov 8, 2023
Dataset provided by
Vector Institute
Authors
Raza, Shaina
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distributions, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate against the spread of false or misleading information and restore public trust in the media. Data description: This is a dataset for news media bias covering different dimensions of the biases: political, hate speech, political, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII). Data Format: The format of data is:

ID: Numeric unique identifier. Text: Main content. Dimension: Categorical descriptor of the text. Biased_Words: List of words considered biased. Aspect: Specific topic within the text. Label: Bias True/False value Aggregate Label: Calculated through multiple weighted formulae Annotation Scheme: The annotation scheme is based on Active learning, which is Manual Labeling --> Semi-Supervised Learning --> Human Verifications (iterative process)

Bias Label: Indicate the presence/absence of bias (e.g., no bias, mild, strong). Words/Phrases Level Biases: Identify specific biased words/phrases. Subjective Bias (Aspect): Capture biases related to content aspects. List of datasets used : We curated different news categories like Climate crisis news summaries , occupational, spiritual/faith/ general using RSS to capture different dimensions of the news media biases. The annotation is performed using active learning to label the sentence (either neural/ slightly biased/ highly biased) and to pick biased words from the news. We also utilize publicly available data from the following links. Our Attribution to others. MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC--A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336
Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "Semeval-2019 task 4: Hyperpartisan news detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detection Toxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. 2017. "Toxic Comment Classification Challenge." Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge. Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. 2019. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification. Age Bias : Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing age-related bias in sentiment analysis." In Proceedings of the 2018 chi conference on human factors in computing systems, pp. 1-14. 2018. Age Bias Training and Testing Data - Age Bias and Sentiment Analysis Dataverse (harvard.edu) Multi-dimensional news Ukraine: Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A multidimensional dataset based on crowdsourcing for analyzing and detecting news bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtV Social biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social bias frames: Reasoning about social and power implications of language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/

Goal of this dataset :We want to offer open and free access to dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly to use and uploading and accessing data should be straightforward, to facilitate usage. If you use this dataset, please cite us. Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute is licensed under CC BY-NC 4.0
word_mapping
kaggle.com
zip
Updated Apr 6, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mathur (2019). word_mapping [Dataset]. https://www.kaggle.com/kshitij68/word-mapping
Explore at:
zip(2707 bytes)Available download formats
Dataset updated
Apr 6, 2019
Authors
Mathur
Description
Dataset

This dataset was created by Mathur

Contents
rucode_medium_data
kaggle.com
zip
Updated Apr 21, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
andrew_tep (2022). rucode_medium_data [Dataset]. https://www.kaggle.com/datasets/andrewteplov/rucode-medium-data
Explore at:
zip(12644916 bytes)Available download formats
Dataset updated
Apr 21, 2022
Authors
andrew_tep
Description
Dataset

This dataset was created by andrew_tep

Contents
Natural Language Processing with Disaster tweets
kaggle.com
zip
Updated Oct 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The citation is currently not available for this dataset.
Explore at:
zip(621497 bytes)Available download formats
Dataset updated
Oct 1, 2022
Authors
prahasith naru
Description
This repo contains an approach I implemented for the Disaster Tweets competition on Kaggle. This particular challenge is perfect for data scientists looking to get started with Natural Language Processing, and Kaggle in general. You can access the Kaggle competition.
CT-FAN-21 corpus: A dataset for Fake News Detection
zenodo.org
Updated Oct 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.4714517
Dataset updated
Oct 23, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
Description
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com .

Citation

Please cite our work as

@article{shahi2021overview, title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection}, author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas}, journal={Working Notes of CLEF}, year={2021} }

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

Subtask 3A: Multi-class fake news detection of news articles (English) Sub-task A would detect fake news designed as a four-class classification problem. The training data will be released in batches and roughly about 900 articles with the respective label. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other- An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Subtask 3B: Topical Domain Classification of News Articles (English) Fact-checkers require background expertise to identify the truthfulness of an article. The categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine the topical domain of the article (English). This is a classification problem. The task is to categorise fake news articles into six topical categories like health, election, crime, climate, election, education. This task will be offered for a subset of the data of Subtask 3A.

Input Data

The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

Task 3a

ID- Unique identifier of the news article

Title- Title of the news article

text- Text mentioned inside the news article

our rating - class of the news article as false, partially false, true, other

Task 3b

public_id- Unique identifier of the news article

Title- Title of the news article

text- Text mentioned inside the news article

domain - domain of the given news article(applicable only for task B)

Output data format

Task 3a

public_id- Unique identifier of the news article

predicted_rating- predicted class

Sample File

public_id, predicted_rating 1, false 2, true

Task 3b

public_id- Unique identifier of the news article

predicted_domain- predicted domain

Sample file

public_id, predicted_domain 1, health 2, crime

Additional data for Training

To train your model, the participant can use additional data with a similar format; some datasets are available over the web. We don't provide the background truth for those datasets. For testing, we will not use any articles from other datasets. Some of the possible source:

Fakenews Classification Datasets

Fake News Detection Challenge KDD 2020

FakeNewsNet

IMPORTANT!

Fake news article used for task 3b is a subset of task 3a.

We have used the data from 2010 to 2021, and the content of fake news is mixed up with several topics like election, COVID-19 etc.

Evaluation Metrics

This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.

Submission Link: https://competitions.codalab.org/competitions/31238

Related Work

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1.https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingualcross-domain fact check news dataset for covid-19,” inWorkshop Proceedings of the 14th International AAAIConference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
h
ukr-toxicity-dataset-translated-jigsaw
huggingface.co
Updated Feb 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ukrainian Texts Classification (2024). ukr-toxicity-dataset-translated-jigsaw [Dataset]. https://huggingface.co/datasets/ukr-detect/ukr-toxicity-dataset-translated-jigsaw
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 17, 2024
Dataset authored and provided by
Ukrainian Texts Classification
License
https://choosealicense.com/licenses/openrail++/https://choosealicense.com/licenses/openrail++/
Description
Ukrainian Toxicity Dataset (translated)

Additionaly to the twitter filtered data, we provide translated English Jigsaw Toxicity Classification Dataset to Ukrainian.

Dataset formation:

English data source: https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/ Working with data to get only two labels: a toxic and a non-toxic sentence. Translation into Ukrainian language using model: https://huggingface.co/Helsinki-NLP/opus-mt-en-uk

Labels: 0 -… See the full description on the dataset page: https://huggingface.co/datasets/ukr-detect/ukr-toxicity-dataset-translated-jigsaw.
Value Labs NLP Contest
kaggle.com
zip
Updated Sep 30, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
JeevaTS (2019). Value Labs NLP Contest [Dataset]. https://www.kaggle.com/datasets/jeevats/value-labs-nlp-contest
Explore at:
zip(3232405 bytes)Available download formats
Dataset updated
Sep 30, 2019
Authors
JeevaTS
Description
Dataset

This dataset was created by JeevaTS

Contents
jigsaw-curated-raw-datasets
kaggle.com
zip
Updated Nov 16, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Julián Peller (dataista0) (2021). jigsaw-curated-raw-datasets [Dataset]. https://www.kaggle.com/datasets/julian3833/jigsaw-curated-raw-datasets
Explore at:
zip(432511589 bytes)Available download formats
Dataset updated
Nov 16, 2021
Authors
Julián Peller (dataista0)
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Minimally curated datasets from the previous Jigsaw competitions

See ☣️ Jigsaw - Explore Previous Competitions Datasets
NLP competition assignment
kaggle.com
zip
Updated May 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Adel Sabboba (2024). NLP competition assignment [Dataset]. https://www.kaggle.com/datasets/adelsabboba/nlp-competition-assignment
Explore at:
zip(293217 bytes)Available download formats
Dataset updated
May 4, 2024
Authors
Adel Sabboba
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Adel Sabboba

Released under CC0: Public Domain

Contents
Disaster Tweets, geocoded locations
kaggle.com
zip
Updated Nov 30, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
herwinvw (2020). Disaster Tweets, geocoded locations [Dataset]. https://www.kaggle.com/herwinvw/disaster-tweets-geocoded-locations
Explore at:
zip(83085 bytes)Available download formats
Dataset updated
Nov 30, 2020
Authors
herwinvw
License
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
Context

Trying to make use of the location feature in the "Real or Not? NLP with Disaster Tweets" competition. I tried to geocode the locations, hoping that at least the difference between locations that can be geocoded (e.g. Birmingham) vs those that cannot be (e.g. "your sisters bedroom") would be a good feature. Additionally, geocoding provides longitude and latitude features that may be helpful.

Content

The dataset captures whether a location could be geocoded (that is: it is a valid location in the world).

Acknowledgements

Geocoding is done with Nominatim

Inspiration

Can you make better tweet classifications with geocoded locations?
Google - FastSlow dataset in parquet & csv
kaggle.com
zip
Updated Sep 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SebastianBarry55 (2023). Google - FastSlow dataset in parquet & csv [Dataset]. https://www.kaggle.com/datasets/sebastianbarry55/google-competition-dataset-in-parquet-and-csv
Explore at:
zip(722713540 bytes)Available download formats
Dataset updated
Sep 1, 2023
Authors
SebastianBarry55
Description
The dataset is organized into various folders in the directories, representing different configurations and features of NLP models:

config- This folder contains four subtypes of files: - features: Parquet files capturing various feature vectors. - ids: Parquet files containing unique identifiers for the configurations. - runtime: Parquet files detailing the runtime in different configurations. - .csv versions of the above files for easy accessibility.

'edge`- This folder contains parquet files representing the edge features of the NLP model graphs.

node/ - Nested within this folder are three sub-folders: - node_opcode: Parquet files capturing the operations at each node. - node_splits: Parquet files detailing how nodes are split in the graph. - node_feat: Parquet files containing node features.
Exploring transfer learning for NLP
kaggle.com
zip
Updated Jun 12, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yury Kashnitsky (2019). Exploring transfer learning for NLP [Dataset]. https://www.kaggle.com/kashnitsky/exploring-transfer-learning-for-nlp/tasks
Explore at:
zip(610568813 bytes)Available download formats
Dataset updated
Jun 12, 2019
Authors
Yury Kashnitsky
Description
Goal

This is a small project lead by Yury Kashnitsky within OpenDataScience and Amsterdam Data Science communities. We plan to explore transfer & semi-supervised learning techniques for NLP tasks, mainly for classification. The idea is to develop best practices for using such models as BERT & ULMFiT (maybe smth else as well) for production-grade usage. Possible outcomes of this collaboration: - primarily, shared experience within this group, and advance in our own projects - articles sharing our experience (ex. Medium) - shared models, ex. trained LM for ULMFiT in Dutch - small library, ex. to productionize ULMFiT models (if they turn out to work best)

Anybody is welcome to join and share findings via Kernels and Discussions.

Datasets

We are gathering several datasets in English, Russian and Dutch. Each of them addresses the general task - to utilize loads of unlabeled texts to improve classification of (scarce) labeled texts. So for each task we have the following files:

train.csv (small)

validation.csv (small)

unlabeled.csv (large)

test.csv (optionally, within competitions)

Current datasets are:

Amazon pet product reviews classification (English, 6 classes, 52k train, 17k valid, 17k test, 100k unlabeled), competition , see Kernels for baselines: logit-tfidf, ULMFiT & BERT

Amazon healthcare reviews (English) (6 classes, 7k train, 3k valid, 200k unlabeled )

Clickbait news detection (English, 3 classes, 25k train, 5.5k valid, 3.5k test, 80k unlabeled), competition, see Kernels for baselines: logit-tfidf, ULMFiT & BERT.

Dutch book reviews (Dutch, 2 classes, 14k train, 6k valid, 90k validation).

Acknowledgements

Thanks to Vladislav Lyalin for the clickbait news data (original competition by ipavlov) and to Benjamin van der Burgh for Dutch reviews data (source repository). Background image credit: Jeremy Howard, fast.ai Lesson 4
PoetryFoundation HF jnb666poems
kaggle.com
zip
Updated Dec 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Igor Lashkov (2024). PoetryFoundation HF jnb666poems [Dataset]. https://www.kaggle.com/datasets/igorlashkov/poetryfoundation-hf-jnb666poems
Explore at:
zip(64674599 bytes)Available download formats
Dataset updated
Dec 22, 2024
Authors
Igor Lashkov
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset contains most poems available on poetryfoundation.org.

The dataset was created as part of the Unlock Global Communication with Gemma competition.

Refer to the notebook for a detailed explanation of data creation, training methodology and evaluation Kaggle notebook.

Facebook

Twitter

Click to copy link

Link copied

Cite

hui, kaggle-nlp-getting-start [Dataset]. https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start

kaggle-nlp-getting-start

gdwangh/kaggle-nlp-getting-start

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Authors

hui

Description

Dataset Summary

Natural Language Processing with Disaster Tweets: https://www.kaggle.com/competitions/nlp-getting-started/data This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don’t have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks.

Columns

id - a unique identifier for each tweet… See the full description on the dataset page: https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start.

Clear search

Close search

Google apps

Main menu

kaggle-nlp-getting-start

NLP with Disaster Tweets - cleaning data

Context

Content

Acknowledgements

Inspiration

NLP in Practice competition dataset

Dataset

Contents

Disaster Tweets Geodata

Context

Content

kaggle_mnli

NLP Disaster Tweets competition data

Dataset

Contents

NLP Starter Test

Dataset

Contents

Navigating News Narratives: A Media Bias Analysis Dataset

word_mapping

Dataset

Contents

rucode_medium_data

Dataset

Contents

Natural Language Processing with Disaster tweets

CT-FAN-21 corpus: A dataset for Fake News Detection

ukr-toxicity-dataset-translated-jigsaw

Value Labs NLP Contest

Dataset

Contents

jigsaw-curated-raw-datasets

Minimally curated datasets from the previous Jigsaw competitions

See ☣️ Jigsaw - Explore Previous Competitions Datasets

NLP competition assignment

Dataset

Contents

Disaster Tweets, geocoded locations

Context

Content

Acknowledgements

Inspiration

Google - FastSlow dataset in parquet & csv

Exploring transfer learning for NLP

Goal

Datasets

Acknowledgements

PoetryFoundation HF jnb666poems

kaggle-nlp-getting-start

gdwangh/kaggle-nlp-getting-start