31 datasets found
  1. kaggle-nlp-getting-start

    • huggingface.co
    Cite
    hui, kaggle-nlp-getting-start [Dataset]. https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    hui
    Description

    Dataset Summary

    Natural Language Processing with Disaster Tweets: https://www.kaggle.com/competitions/nlp-getting-started/data This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don’t have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks.

    Columns

    id - a unique identifier for each tweet… See the full description on the dataset page: https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start.

  2. NLP with Disaster Tweets - cleaning data

    • kaggle.com
    zip
    Updated Sep 11, 2021
    Cite
    Vitalii Mokin (2021). NLP with Disaster Tweets - cleaning data [Dataset]. https://www.kaggle.com/vbmokin/nlp-with-disaster-tweets-cleaning-data
    Explore at:
    Available download formats: zip (1053715 bytes)
    Dataset updated
    Sep 11, 2021
    Authors
    Vitalii Mokin
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    Context

    The data, obtained by cleaning the data of the Getting Started prediction competition "Real or Not? NLP with Disaster Tweets", is the result of the public notebook "NLP with Disaster Tweets - EDA and Cleaning data". In the future, I plan to improve the cleaning and update the dataset.

    Content

    • id - a unique identifier for each tweet
    • text - the text of the tweet
    • location - the location the tweet was sent from (may be blank)
    • keyword - a particular keyword from the tweet (may be blank)
    • target - in train.csv only; denotes whether a tweet is about a real disaster (1) or not (0)
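
    A minimal loading sketch with pandas, assuming the usual Kaggle file names train.csv and test.csv for this cleaned version:

    import pandas as pd

    # Assumed file names; adjust to the files in the downloaded archive.
    train = pd.read_csv("train.csv")   # id, keyword, location, text, target
    test = pd.read_csv("test.csv")     # same columns without target

    print(train["target"].value_counts())                # 1 = real disaster, 0 = not
    print(train[["keyword", "location"]].isna().mean())  # share of blank fields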

    Acknowledgements

    Thanks to the Kaggle team for the competition "Real or Not? NLP with Disaster Tweets" and its datasets (this dataset was created by the company figure-eight and originally shared on their 'Data For Everyone' website. Tweet source: https://twitter.com/AnyOtherAnnaK/status/629195955506708480).

    Thanks to the website "Ambulance services drive, strive to keep you alive" for the image, which is very similar to the image of the "Real or Not? NLP with Disaster Tweets" competition and which I used as the image for my dataset.

    Inspiration

    You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.

  3. NLP in Practice competition dataset

    • kaggle.com
    zip
    Updated Jun 12, 2023
    Cite
    Mark Baushenko (2023). NLP in Practice competition dataset [Dataset]. https://www.kaggle.com/datasets/e0xextazy/nlp-in-practice-competition-dataset
    Explore at:
    Available download formats: zip (36685070 bytes)
    Dataset updated
    Jun 12, 2023
    Authors
    Mark Baushenko
    Description

    Dataset

    This dataset was created by Mark Baushenko

    Contents

  4. Disaster Tweets Geodata

    • kaggle.com
    zip
    Updated Jan 3, 2020
    Cite
    Fred Navruzov (2020). Disaster Tweets Geodata [Dataset]. https://www.kaggle.com/datasets/frednavruzov/disaster-tweets-geodata
    Explore at:
    Available download formats: zip (74586 bytes)
    Dataset updated
    Jan 3, 2020
    Authors
    Fred Navruzov
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Tweet geodata, extracted from the pre-cleaned location field of the "Real or Not? NLP with Disaster Tweets" competition data, to make geospatial analysis easier.

    Content

    Simple geodata based on the "Real or Not? NLP with Disaster Tweets" competition.
    The data was extracted with geopy on top of the ArcGIS geocoding service.
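
    As a rough illustration of the approach described above, geopy provides an ArcGIS geocoder; this is a minimal sketch, not the author's original pipeline:

    from geopy.geocoders import ArcGIS

    geolocator = ArcGIS()                        # public ArcGIS geocoding service
    location = geolocator.geocode("Birmingham")  # example value from the location field
    if location is not None:
        print(location.latitude, location.longitude)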

  5. kaggle_mnli

    • huggingface.co
    Updated Apr 28, 2022
    Cite
    Chris Huber (2022). kaggle_mnli [Dataset]. https://huggingface.co/datasets/chrishuber/kaggle_mnli
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 28, 2022
    Authors
    Chris Huber
    Description

    Dataset Card for [Kaggle MNLI]

      Dataset Summary
    

    [These are the datasets posted to Kaggle for an NLP inference-detection competition. Moving them here to use with PyTorch.]

      Supported Tasks and Leaderboards
    

    Provides train and validation data for sentence pairs with inference labels. [https://www.kaggle.com/competitions/multinli-matched-open-evaluation/leaderboard] [https://www.kaggle.com/competitions/multinli-mismatched-open-evaluation/leaderboard]… See the full description on the dataset page: https://huggingface.co/datasets/chrishuber/kaggle_mnli.

  6. NLP Disaster Tweets competition data

    • kaggle.com
    zip
    Updated Sep 30, 2022
    Cite
    Ritin Nambiar (2022). NLP Disaster Tweets competition data [Dataset]. https://www.kaggle.com/datasets/ritinnambiar/nlp-disaster-tweets-competition-data
    Explore at:
    Available download formats: zip (607343 bytes)
    Dataset updated
    Sep 30, 2022
    Authors
    Ritin Nambiar
    Description

    Dataset

    This dataset was created by Ritin Nambiar

    Contents

  7. NLP Starter Test

    • kaggle.com
    zip
    Updated Jun 24, 2018
    Cite
    loryn808 (2018). NLP Starter Test [Dataset]. https://www.kaggle.com/loryn808/nlp-starter-test
    Explore at:
    Available download formats: zip (824521 bytes)
    Dataset updated
    Jun 24, 2018
    Authors
    loryn808
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by loryn808

    Released under CC0: Public Domain

    Contents

  8. Navigating News Narratives: A Media Bias Analysis Dataset

    • data-staging.niaid.nih.gov
    Updated Nov 8, 2023
    Cite
    Raza, Shaina (2023). Navigating News Narratives: A Media Bias Analysis Dataset [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_10037860
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Vector Institute
    Authors
    Raza, Shaina
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distribution, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate the spread of false or misleading information and restore public trust in the media.

    Data description: This is a dataset for news media bias covering different dimensions of bias: political, hate speech, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, and spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII).

    Data format:

    • ID: numeric unique identifier
    • Text: main content
    • Dimension: categorical descriptor of the text
    • Biased_Words: list of words considered biased
    • Aspect: specific topic within the text
    • Label: bias true/false value
    • Aggregate Label: calculated through multiple weighted formulae

    Annotation scheme: based on active learning, i.e. Manual Labeling --> Semi-Supervised Learning --> Human Verification (an iterative process)
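
    A minimal sketch of reading the data with pandas and selecting the biased rows; the file name is hypothetical, and the column names follow the schema listed above:

    import pandas as pd

    df = pd.read_csv("news_media_bias.csv")  # hypothetical file name

    # Keep rows whose bias label is true, whatever the boolean/string encoding.
    biased = df[df["Label"].astype(str).str.lower() == "true"]
    print(biased[["Dimension", "Aspect", "Biased_Words"]].head())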

    Bias Label: indicates the presence/absence of bias (e.g., no bias, mild, strong). Words/Phrases Level Biases: identify specific biased words/phrases. Subjective Bias (Aspect): captures biases related to content aspects.

    List of datasets used: We curated different news categories (climate crisis news summaries, occupational, spiritual/faith, general) using RSS feeds to capture different dimensions of news media bias. The annotation is performed using active learning to label each sentence (neutral / slightly biased / highly biased) and to pick biased words from the news. We also utilize publicly available data from the following sources; our attribution to others:

    • MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC: A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336
    • Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "SemEval-2019 Task 4: Hyperpartisan News Detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detection
    • Toxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. 2017. "Toxic Comment Classification Challenge." Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
    • Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. 2019. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification
    • Age bias: Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing Age-Related Bias in Sentiment Analysis." In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1-14. 2018. Age Bias Training and Testing Data - Age Bias and Sentiment Analysis Dataverse (harvard.edu)
    • Multi-dimensional news (Ukraine): Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A Multidimensional Dataset Based on Crowdsourcing for Analyzing and Detecting News Bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtV
    • Social biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social Bias Frames: Reasoning about Social and Power Implications of Language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/

    Goal of this dataset: We want to offer open and free access to the dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly, and uploading and accessing the data should be straightforward, to facilitate usage. If you use this dataset, please cite us. Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute, is licensed under CC BY-NC 4.0.

  9. word_mapping

    • kaggle.com
    zip
    Updated Apr 6, 2019
    Cite
    Mathur (2019). word_mapping [Dataset]. https://www.kaggle.com/kshitij68/word-mapping
    Explore at:
    Available download formats: zip (2707 bytes)
    Dataset updated
    Apr 6, 2019
    Authors
    Mathur
    Description

    Dataset

    This dataset was created by Mathur

    Contents

  10. rucode_medium_data

    • kaggle.com
    zip
    Updated Apr 21, 2022
    Cite
    andrew_tep (2022). rucode_medium_data [Dataset]. https://www.kaggle.com/datasets/andrewteplov/rucode-medium-data
    Explore at:
    Available download formats: zip (12644916 bytes)
    Dataset updated
    Apr 21, 2022
    Authors
    andrew_tep
    Description

    Dataset

    This dataset was created by andrew_tep

    Contents

  11. Natural Language Processing with Disaster tweets

    • kaggle.com
    zip
    Updated Oct 1, 2022
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    Available download formats: zip (621497 bytes)
    Dataset updated
    Oct 1, 2022
    Authors
    prahasith naru
    Description

    This repo contains an approach I implemented for the Disaster Tweets competition on Kaggle. This particular challenge is perfect for data scientists looking to get started with Natural Language Processing, and with Kaggle in general. The competition is available on Kaggle.

  12. CT-FAN-21 corpus: A dataset for Fake News Detection

    • zenodo.org
    Updated Oct 23, 2022
    + more versions
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
    Explore at:
    Dataset updated
    Oct 23, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
    Description

    Data Access: The data in this research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please download the Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.

    Citation

    Please cite our work as

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

    Subtask 3A: Multi-class fake news detection of news articles (English). Subtask 3A is designed as a four-class classification problem. The training data will be released in batches of roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other - An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

    Subtask 3B: Topical Domain Classification of News Articles (English). Fact-checkers require background expertise to identify the truthfulness of an article, and categorisation helps to automate the sampling process from a stream of data. Given the text of a news article, determine its topical domain (English). This is a classification problem over six topical categories such as health, election, crime, climate, and education. This task is offered for a subset of the data of Subtask 3A.

    Input Data

    The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

    Task 3a

    • ID- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • our rating - class of the news article as false, partially false, true, other

    Task 3b

    • public_id- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • domain - domain of the given news article (applicable only for task B)

    Output data format

    Task 3a

    • public_id- Unique identifier of the news article
    • predicted_rating- predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    Task 3b

    • public_id- Unique identifier of the news article
    • predicted_domain- predicted domain

    Sample file

    public_id, predicted_domain
    1, health
    2, crime
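
    A minimal sketch of writing submission files in the formats shown above; the predictions and output file names are made up:

    import pandas as pd

    task3a = pd.DataFrame({"public_id": [1, 2],
                           "predicted_rating": ["false", "true"]})
    task3a.to_csv("subtask_3a_submission.csv", index=False)

    task3b = pd.DataFrame({"public_id": [1, 2],
                           "predicted_domain": ["health", "crime"]})
    task3b.to_csv("subtask_3b_submission.csv", index=False)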

    Additional data for Training

    To train your model, participants can use additional data in a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:

    IMPORTANT!

    1. The fake news articles used for Task 3b are a subset of Task 3a.
    2. We have used data from 2010 to 2021, and the content of the fake news spans several topics, such as elections, COVID-19, etc.

    Evaluation Metrics

    This task is evaluated as a classification task. We will use the macro-averaged F1 measure for the ranking of teams. There is a limit of 5 runs (in total, not per day), and only one person from a team is allowed to submit runs.
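
    The ranking metric can be reproduced locally with scikit-learn; the labels below are toy values, and the official score is computed by the organisers on the hidden test set:

    from sklearn.metrics import f1_score

    y_true = ["false", "true", "partially false", "other", "false"]
    y_pred = ["false", "true", "false", "other", "false"]
    print(f1_score(y_true, y_pred, average="macro"))  # macro-averaged F1 over the four classes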

    Submission Link: https://competitions.codalab.org/competitions/31238

    Related Work

    • Shahi, G. K. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
    • Shahi, G. K. and Nandini, D., "FakeCovid – a multilingual cross-domain fact check news dataset for covid-19," in Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of COVID-19 misinformation on Twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

  13. ukr-toxicity-dataset-translated-jigsaw

    • huggingface.co
    Updated Feb 17, 2024
    Cite
    Ukrainian Texts Classification (2024). ukr-toxicity-dataset-translated-jigsaw [Dataset]. https://huggingface.co/datasets/ukr-detect/ukr-toxicity-dataset-translated-jigsaw
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 17, 2024
    Dataset authored and provided by
    Ukrainian Texts Classification
    License

    OpenRAIL++: https://choosealicense.com/licenses/openrail++/

    Description

    Ukrainian Toxicity Dataset (translated)

    In addition to the Twitter-filtered data, we provide the English Jigsaw Toxicity Classification Dataset translated into Ukrainian.

      Dataset formation:
    

    1. English data source: https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/
    2. Process the data to keep only two labels: toxic and non-toxic sentences.
    3. Translate into Ukrainian using the model https://huggingface.co/Helsinki-NLP/opus-mt-en-uk

    Labels: 0 -… See the full description on the dataset page: https://huggingface.co/datasets/ukr-detect/ukr-toxicity-dataset-translated-jigsaw.
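
    A minimal sketch of the translation step with the Hugging Face transformers pipeline and the model named above; the example sentence is made up:

    from transformers import pipeline

    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-uk")
    print(translator("This comment is toxic.")[0]["translation_text"])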

  14. Value Labs NLP Contest

    • kaggle.com
    zip
    Updated Sep 30, 2019
    Cite
    JeevaTS (2019). Value Labs NLP Contest [Dataset]. https://www.kaggle.com/datasets/jeevats/value-labs-nlp-contest
    Explore at:
    Available download formats: zip (3232405 bytes)
    Dataset updated
    Sep 30, 2019
    Authors
    JeevaTS
    Description

    Dataset

    This dataset was created by JeevaTS

    Contents

  15. jigsaw-curated-raw-datasets

    • kaggle.com
    zip
    Updated Nov 16, 2021
    Cite
    Julián Peller (dataista0) (2021). jigsaw-curated-raw-datasets [Dataset]. https://www.kaggle.com/datasets/julian3833/jigsaw-curated-raw-datasets
    Explore at:
    Available download formats: zip (432511589 bytes)
    Dataset updated
    Nov 16, 2021
    Authors
    Julián Peller (dataista0)
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Minimally curated datasets from the previous Jigsaw competitions

    See ☣️ Jigsaw - Explore Previous Competitions Datasets

  16. NLP competition assignment

    • kaggle.com
    zip
    Updated May 4, 2024
    Cite
    Adel Sabboba (2024). NLP competition assignment [Dataset]. https://www.kaggle.com/datasets/adelsabboba/nlp-competition-assignment
    Explore at:
    Available download formats: zip (293217 bytes)
    Dataset updated
    May 4, 2024
    Authors
    Adel Sabboba
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Adel Sabboba

    Released under CC0: Public Domain

    Contents

  17. Disaster Tweets, geocoded locations

    • kaggle.com
    zip
    Updated Nov 30, 2020
    Cite
    herwinvw (2020). Disaster Tweets, geocoded locations [Dataset]. https://www.kaggle.com/herwinvw/disaster-tweets-geocoded-locations
    Explore at:
    Available download formats: zip (83085 bytes)
    Dataset updated
    Nov 30, 2020
    Authors
    herwinvw
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Context

    Trying to make use of the location feature in the "Real or Not? NLP with Disaster Tweets" competition. I tried to geocode the locations, hoping that at least the difference between locations that can be geocoded (e.g. Birmingham) vs those that cannot be (e.g. "your sisters bedroom") would be a good feature. Additionally, geocoding provides longitude and latitude features that may be helpful.

    Content

    The dataset captures whether a location could be geocoded (that is: it is a valid location in the world).

    Acknowledgements

    Geocoding is done with Nominatim
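
    A minimal sketch of the geocodable-or-not check with geopy's Nominatim wrapper; this is an illustration, not the author's exact code:

    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent="disaster-tweets-geocoding")  # user_agent is required
    print(geolocator.geocode("Birmingham"))            # resolves to a real place
    print(geolocator.geocode("your sisters bedroom"))  # typically None, i.e. not geocodable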

    Inspiration

    Can you make better tweet classifications with geocoded locations?

  18. Google - FastSlow dataset in parquet & csv

    • kaggle.com
    zip
    Updated Sep 1, 2023
    Cite
    SebastianBarry55 (2023). Google - FastSlow dataset in parquet & csv [Dataset]. https://www.kaggle.com/datasets/sebastianbarry55/google-competition-dataset-in-parquet-and-csv
    Explore at:
    Available download formats: zip (722713540 bytes)
    Dataset updated
    Sep 1, 2023
    Authors
    SebastianBarry55
    Description

    The dataset is organized into various folders, representing different configurations and features of NLP models:

    config/ - This folder contains four subtypes of files:
    • features: Parquet files capturing various feature vectors.
    • ids: Parquet files containing unique identifiers for the configurations.
    • runtime: Parquet files detailing the runtime in different configurations.
    • .csv versions of the above files for easy accessibility.

    edge/ - This folder contains Parquet files representing the edge features of the NLP model graphs.

    node/ - Nested within this folder are three sub-folders:
    • node_opcode: Parquet files capturing the operations at each node.
    • node_splits: Parquet files detailing how nodes are split in the graph.
    • node_feat: Parquet files containing node features.
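
    A minimal sketch of reading the Parquet files with pandas; the file paths are hypothetical, and only the folder layout above is assumed:

    import pandas as pd

    features = pd.read_parquet("config/features/example.parquet")  # feature vectors
    edges = pd.read_parquet("edge/example.parquet")                # edge features of the graphs
    node_feat = pd.read_parquet("node/node_feat/example.parquet")  # per-node features
    print(features.shape, edges.shape, node_feat.shape)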

  19. Exploring transfer learning for NLP

    • kaggle.com
    zip
    Updated Jun 12, 2019
    Cite
    Yury Kashnitsky (2019). Exploring transfer learning for NLP [Dataset]. https://www.kaggle.com/kashnitsky/exploring-transfer-learning-for-nlp/tasks
    Explore at:
    Available download formats: zip (610568813 bytes)
    Dataset updated
    Jun 12, 2019
    Authors
    Yury Kashnitsky
    Description

    Goal

    This is a small project led by Yury Kashnitsky within the OpenDataScience and Amsterdam Data Science communities. We plan to explore transfer and semi-supervised learning techniques for NLP tasks, mainly for classification. The idea is to develop best practices for using models such as BERT and ULMFiT (and maybe something else as well) for production-grade usage. Possible outcomes of this collaboration:
    • primarily, shared experience within this group, and progress in our own projects
    • articles sharing our experience (e.g., on Medium)
    • shared models, e.g., a trained language model for ULMFiT in Dutch
    • a small library, e.g., to productionize ULMFiT models (if they turn out to work best)

    Anybody is welcome to join and share findings via Kernels and Discussions.

    Datasets

    We are gathering several datasets in English, Russian, and Dutch. Each of them addresses the same general task: utilize large amounts of unlabeled text to improve classification of (scarce) labeled texts. So for each task we have the following files (a minimal loading sketch follows the list):

    • train.csv (small)
    • validation.csv (small)
    • unlabeled.csv (large)
    • test.csv (optionally, within competitions)
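
    A minimal loading sketch for the split files listed above; column names are not specified on this page, so only the file names are assumed:

    import pandas as pd

    train = pd.read_csv("train.csv")          # small labeled training set
    valid = pd.read_csv("validation.csv")     # small labeled validation set
    unlabeled = pd.read_csv("unlabeled.csv")  # large unlabeled corpus
    print(len(train), len(valid), len(unlabeled))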

    Current datasets are:

    • Amazon pet product reviews classification (English, 6 classes, 52k train, 17k valid, 17k test, 100k unlabeled), competition, see Kernels for baselines: logit-tfidf, ULMFiT & BERT
    • Amazon healthcare reviews (English, 6 classes, 7k train, 3k valid, 200k unlabeled)
    • Clickbait news detection (English, 3 classes, 25k train, 5.5k valid, 3.5k test, 80k unlabeled), competition, see Kernels for baselines: logit-tfidf, ULMFiT & BERT.
    • Dutch book reviews (Dutch, 2 classes, 14k train, 6k valid, 90k unlabeled).

    Acknowledgements

    Thanks to Vladislav Lyalin for the clickbait news data (original competition by ipavlov) and to Benjamin van der Burgh for Dutch reviews data (source repository). Background image credit: Jeremy Howard, fast.ai Lesson 4

  20. PoetryFoundation HF jnb666poems

    • kaggle.com
    zip
    Updated Dec 22, 2024
    Cite
    Igor Lashkov (2024). PoetryFoundation HF jnb666poems [Dataset]. https://www.kaggle.com/datasets/igorlashkov/poetryfoundation-hf-jnb666poems
    Explore at:
    Available download formats: zip (64674599 bytes)
    Dataset updated
    Dec 22, 2024
    Authors
    Igor Lashkov
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains most poems available on poetryfoundation.org.

    The dataset was created as part of the Unlock Global Communication with Gemma competition.

    Refer to the Kaggle notebook for a detailed explanation of data creation, training methodology, and evaluation.
