91 datasets found

Events blog
kaggle.com
zip
Updated Jun 11, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sahil Saxena (2020). Events blog [Dataset]. https://www.kaggle.com/sahilsaxenass/events-blog
Explore at:
zip(1448144 bytes)Available download formats
Dataset updated
Jun 11, 2020
Authors
Sahil Saxena
Description
Dataset

This dataset was created by Sahil Saxena

Released under Data files © Original Authors

Contents
Presidency in the middle of the pandemic
kaggle.com
zip
Updated Dec 17, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kurt F (2020). Presidency in the middle of the pandemic [Dataset]. https://www.kaggle.com/kurtfcelsius/presidency-in-the-middle-of-the-pandemic
Explore at:
zip(261605 bytes)Available download formats
Dataset updated
Dec 17, 2020
Authors
Kurt F
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Twitter is a good way to measure current reactions. And during the epidemic, Lockdown is frequently the subject of the platform. While almost every country in the world suffers heavy losses in this war, politicians are also exposed to harsh criticism. In this dataset, we would like to examine the comments on Twitter about German chancellor Angela Merkel, who ranks first in the list of the world's most powerful women by Forbes. So we are curious about the results of the Lockdown arguments.

Content

The data was created in December-2020 as 1500 train and 650 test files about German chancellor Angela Merkel. Each tweet in the train data set has been labeled as positive or negative. Those behind the negative tweets were categorized under three headings. These are: - Conspiracy theory - Insult - Political criticism.

Inspiration

Maybe you might below be wondering:

-In which language were the most positive or negative tweets? -What is the structure of the words used according to languages? -What are the reflections of the headings highlighted in negative comments according to languages?

And while answering questions like this, you can find graphical options suitable for your exploratory data analysis.

And a happy ending: You can develop a machine learning model for tweets that are not labeled in test data.
f
Detailed characterization of the dataset.
figshare.com
xls
Updated Sep 26, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). Detailed characterization of the dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t006
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0310707.t006
Dataset updated
Sep 26, 2024
Dataset provided by
PLOS ONE
Authors
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
NLP Research Papers Dataset
kaggle.com
zip
Updated May 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Subham Surana (2024). NLP Research Papers Dataset [Dataset]. https://www.kaggle.com/datasets/subhamjain/natural-language-processing-research-papers
Explore at:
zip(1074694 bytes)Available download formats
Dataset updated
May 1, 2024
Authors
Subham Surana
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

The dataset appears to be a collection of NLP research papers, with the full text available in the "article" column, abstract summaries in the "abstract" column, and information about different sections in the "section_names" column. Researchers and practitioners in the field of natural language processing can use this dataset for various tasks, including text summarization, document classification, and analysis of research paper structures.

Data Fields

Here's a short description of the Natural Language Processing Research Papers dataset: 1. Article: This column likely contains the full text or content of the research papers related to Natural Language Processing (NLP). Each entry in this column represents the entire body of a specific research article. 2. Abstract: This column is likely to contain the abstracts of the NLP research papers. The abstract provides a concise summary of the paper, highlighting its key objectives, methods, and findings. 3. Section Names: This column probably contains information about the section headings within each research paper. It could include the names or titles of different sections such as Introduction, Methodology, Results, Conclusion, etc. This information can be useful for structuring and organizing the content of the research papers.

File Description

Content Overview: The dataset is valuable for researchers, students, and practitioners in the field of Natural Language Processing. File format: This file is csv format.
Stop Words Hinglish for NLP
kaggle.com
zip
Updated Mar 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pranam Shetty (2024). Stop Words Hinglish for NLP [Dataset]. https://www.kaggle.com/datasets/prxshetty/stop-words-hinglish/suggestions
Explore at:
zip(2796 bytes)Available download formats
Dataset updated
Mar 13, 2024
Authors
Pranam Shetty
Description
The provided list contains common stop words used in natural language processing (NLP) tasks. Stop words are words that are filtered out before or after processing of natural language data. They are typically the most common words in a language and don't carry significant meaning, thus often removed to focus on the more important words or tokens in a text. This dataset can be used in various NLP applications such as text classification, sentiment analysis, and information retrieval to improve the accuracy and efficiency of text processing algorithms. By eliminating these stop words, the computational resources can be utilized more effectively, and the analysis can focus on the meaningful content of the text.
NLP EDA
kaggle.com
zip
Updated Feb 19, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Adithya Madhavan (2022). NLP EDA [Dataset]. https://www.kaggle.com/datasets/adithyamadhavan/nlp-eda
Explore at:
zip(1163131 bytes)Available download formats
Dataset updated
Feb 19, 2022
Authors
Adithya Madhavan
Description
Dataset

This dataset was created by Adithya Madhavan

Contents
Movies Rating Data
kaggle.com
zip
Updated Mar 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anand Shaw (2024). Movies Rating Data [Dataset]. https://www.kaggle.com/datasets/anandshaw2001/movie-rating-dataset
Explore at:
zip(1659098 bytes)Available download formats
Dataset updated
Mar 8, 2024
Authors
Anand Shaw
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This Data About Movie Voting and their best rating. This Data have 20 Columns and 4804 Rows. And In this dataset how was the popularity of a movie and their characters and how was the release date of the movie revenue , status , title , movie language , average vote ,id and more..

Column Names:

1.**Budget**

2.**Genres**

3.**Homepage**

4.**Id**

5.**Keywords**

6.**Original_language**

7.**Original_title**

8.**Overview**

9.**Popularity**

10.**Production_companies**

11.**Production_countries**

12.**Release_date**

13.**Revenue**

14.**Runtime**

15.**Spoken_languages**

16.**Status**

17.**Tagline**

18.**Title**

19.**Vote_average**

20.**Vote_count**
GLARE: Google Apps Arabic Reviews Dataset
zenodo.org
data.niaid.nih.gov
pdf, zip
Updated Jul 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fatima AlGhamdi; Reem Mohammed; Hend Al-Khalifa; Areeb Alowisheq; Fatima AlGhamdi; Reem Mohammed; Hend Al-Khalifa; Areeb Alowisheq (2024). GLARE: Google Apps Arabic Reviews Dataset [Dataset]. http://doi.org/10.5281/zenodo.6457824
Explore at:
zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6457824
Dataset updated
Jul 16, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Fatima AlGhamdi; Reem Mohammed; Hend Al-Khalifa; Areeb Alowisheq; Fatima AlGhamdi; Reem Mohammed; Hend Al-Khalifa; Areeb Alowisheq
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This paper introduces GLARE an Arabic Apps Reviews dataset collected from Saudi Google PlayStore. It consists of 76M reviews, 69M of which are Arabic reviews of 9,980 Android Applications. We present the data collection methodology, along with a detailed Exploratory Data Analysis (EDA) and Feature Engineering on the gathered reviews. We also highlight possible use cases and benefits of the dataset.
Amazon_Healing_Crystal_Dataset
kaggle.com
zip
Updated Aug 11, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arun_Vijay (2022). Amazon_Healing_Crystal_Dataset [Dataset]. https://www.kaggle.com/arunvj997/amazon-healing-crystal-dataset
Explore at:
zip(1813627 bytes)Available download formats
Dataset updated
Aug 11, 2022
Authors
Arun_Vijay
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Arun_Vijay

Released under CC0: Public Domain

Contents
m
Reddit r/AskScience Flair Dataset
data.mendeley.com
Updated Sep 10, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumit Mishra (2021). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.2
Explore at:
Unique identifier
https://doi.org/10.17632/k9r2d9z999.2
Dataset updated
Sep 10, 2021
Authors
Sumit Mishra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits and here We'll use the r/AskScience Subreddit.

The dataset is extracted from the subreddit /r/AskScience from Reddit. The data was collected between 01-01-2016 and 31-08-2021. It contains 547651 Datapoints and 25 Columns. The database contains a number of information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data is extracted using python and Pushshift's API. A little bit of cleaning is done using NumPy and pandas as well. (see the descriptions of individual columns below).

The dataset contains the following columns and descriptions: author - Redditor Name author_fullname - Redditor Full name contest_mode - Contest mode [implement obscured scores and randomized sorting]. created_utc - Time the submission was created, represented in Unix Time. domain - Domain of submission. edited - If the post is edited or not. full_link - Link of the post on the subreddit. id - ID of the submission. is_self - Whether or not the submission is a self post (text-only). link_flair_css_class - CSS Class used to identify the flair. link_flair_text - Flair on the post or The link flair’s text content. locked - Whether or not the submission has been locked. num_comments - The number of comments on the submission. over_18 - Whether or not the submission has been marked as NSFW. permalink - A permalink for the submission. retrieved_on - time ingested. score - The number of upvotes for the submission. description - Description of the Submission. spoiler - Whether or not the submission has been marked as a spoiler. stickied - Whether or not the submission is stickied. thumbnail - Thumbnail of Submission. question - Question Asked in the Submission. url - The URL the submission links to, or the permalink if a self post. year - Year of the Submission. banned - Banned by the moderator or not.

This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
Reviews Dataset
kaggle.com
zip
Updated Aug 20, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mahesh Chauhan (2022). Reviews Dataset [Dataset]. https://www.kaggle.com/datasets/maheshchauhan09/reviews-dataset
Explore at:
zip(587330 bytes)Available download formats
Dataset updated
Aug 20, 2022
Authors
Mahesh Chauhan
Description
Dataset

This dataset was created by Mahesh Chauhan

Contents
f
F1-score results [40].
plos.figshare.com
xls
Updated Sep 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). F1-score results [40]. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0310707.t002
Dataset updated
Sep 26, 2024
Dataset provided by
PLOS ONE
Authors
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
Research Paper
kaggle.com
zip
Updated Nov 10, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vijender Singh (2020). Research Paper [Dataset]. https://www.kaggle.com/vijendersingh412/research-paper
Explore at:
zip(56828595 bytes)Available download formats
Dataset updated
Nov 10, 2020
Authors
Vijender Singh
Description
Context

The dataset consists of top research papers in NLP domain with its metadata.xls file containing detailed information.

Content

The dataset contains description of research paper, its domain, its sub domain and link of origin to correct paper. Each research paper starts with unique number followed by underscore and name of research paper. The unique number is is assigned to Sno of metadata sheet.

Acknowledgements

This is just a start of making a dataset for research purpose and using this dataset for recommendation system or solving other problems. You are welcome to contribute in this. And can also share the problem you are solving and I can help without any cost.

Problem Use Case

Collaborating Filtering EDA on NLP research paper Document Classification Creating own Embedding for NLP domain applications

Inspiration

The data is open to the world's largest data science community. Please share your doubts, problems and how we can make this better. ✌️

Open to direct chat @ https://in.linkedin.com/in/vijendersingh412 🤝
Jigsaw Bias Toxicity EDA NLP aug16 alpha0.05
kaggle.com
zip
Updated Apr 14, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matt Yates (2019). Jigsaw Bias Toxicity EDA NLP aug16 alpha0.05 [Dataset]. https://www.kaggle.com/datasets/yeayates21/jigsaw-bias-toxicity-eda-nlp-aug16-alpha005
Explore at:
zip(1593446024 bytes)Available download formats
Dataset updated
Apr 14, 2019
Authors
Matt Yates
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Text Augmentation on Jigsaw Unintended Bias in Toxicity Classification competition training data using EDA_NLP.

Context

Code from https://github.com/jasonwei20/eda_nlp was run on the training dataset for the Jigsaw Unintended Bias in Toxicity Classification competition to create augmented training dataset. Number of augmentations was set to 16 and alpha value was set to 0.05.

Content

train_augmented1605.zip - augmented training dataset for Jigsaw Unintended Bias in Toxicity Classification competition.

Acknowledgements

Code provided by: https://github.com/jasonwei20/eda_nlp

Code for the paper: Easy data augmentation techniques for boosting performance on text classification tasks. https://arxiv.org/abs/1901.11196

Special thanks to ErvTong \ @papasmurfff for sharing the eda_nlp repo with me. https://www.kaggle.com/papasmurfff

Inspiration

https://mlwhiz.com/blog/2019/02/19/siver_medal_kaggle_learnings/

The above article talks about how the 1st place competitors for the Quora Insincere Question competition stated they:

"We do not pad sequences to the same length based on the whole data, but just on a batch level. That means we conduct padding and truncation on the data generator level for each batch separately, so that length of the sentences in a batch can vary in size. Additionally, we further improved this by not truncating based on the length of the longest sequence in the batch but based on the 95% percentile of lengths within the sequence. This improved runtime heavily and kept accuracy quite robust on single model level, and improved it by being able to average more models."

This got @papasmurfff and I thinking about text augmentation and from there @papasmurfff found the eda_nlp repo.
Constitution Dataset
kaggle.com
zip
Updated Apr 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Prasad Patil (2023). Constitution Dataset [Dataset]. https://www.kaggle.com/datasets/prasad22/constitution-dataset/code
Explore at:
zip(118484 bytes)Available download formats
Dataset updated
Apr 20, 2023
Authors
Prasad Patil
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Data is sourced from Comparative Constitutions Project (CCP). This dataset is useful for exploratory data analysis and NLP practices.

Content

Scope — This is drawn from Elkins, Ginsburg and Melton, The Endurance of National Constitutions (Cambridge University Press, 2009). It measures the percentage of 701 major topics from the CCP survey that are included in any given constitution.

Length (in Words) — This is simply a report of the total number of words in the Constitution as measured by Microsoft Word.

Executive Power— This is an additive index drawn from a working paper, Constitutional Constraints on Executive Lawmaking. The index ranges from 0-7 and captures the presence or absence of seven important aspects of executive lawmaking: (1) the power to initiate legislation; (2) the power to issue decrees; (3) the power to initiate constitutional amendments; (4) the power to declare states of emergency; (5) veto power; (6) the power to challenge the constitutionality of legislation; and (7) the power to dissolve the legislature.

The index score indicates the total number of these powers given to any national executive (president, prime minister, or assigned to the government) as a whole.

Legislative Power— This captures the formal degree of power assigned to the legislature by the Constitution. The indicator is drawn from Elkins, Ginsburg and Melton, The Endurance of National Constitutions (Cambridge University Press, 2009), in which we created a set of binary CCP variables to match the 32-item survey developed by M. Steven Fish and Mathew Kroenig in The Handbook of National Legislatures: A Global Survey (Cambridge University Press, 2009). The index score is simply the mean of the 32 binary elements, with higher numbers indicating more legislative power and lower numbers indicating less legislative power.

Judicial Independence — This index is drawn from a paper by Ginsburg and Melton, Does De Jure Judicial Independence Really Matter? A Reevaluation of Explanations for Judicial Independence. It is an additive index ranging from 0-6 that captures the constitutional presence or absence of six features thought to enhance judicial independence. The six features are: (1) whether the constitution contains an explicit statement of judicial independence; (2) whether the constitution provides that judges have lifetime appointments; (3) whether appointments to the highest court involve either a judicial council or two (or more) actors; (4) whether removal is prohibited or limited so that it requires the proposal of a supermajority vote in the legislature, or if only the public or judicial council can propose removal and another political actor is required to approve such a proposal; (5) whether removal explicitly limited to crimes and other issues of misconduct, treason, or violations of the constitution; and (6) whether judicial salaries are protected from reduction.

Number of Rights — In our ongoing book project on human rights, we analyze a set of 1172 different rights found in national constitutions. The rights index indicates the number of these rights found in any particular constitution.

Preamble - This is something I have extracted from the platform itself. It has the textual content of the preamble of every nation's Constitution.
Criteria for detailed characterization of the dataset.
plos.figshare.com
xls
Updated Sep 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). Criteria for detailed characterization of the dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t005
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0310707.t005
Dataset updated
Sep 26, 2024
Dataset provided by
PLOShttp://plos.org/
Authors
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Criteria for detailed characterization of the dataset.
Job Descriptions Dataset
kaggle.com
Updated May 12, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jayakishan Minnekanti (2025). Job Descriptions Dataset [Dataset]. https://www.kaggle.com/datasets/jayakishan225/job-descriptions-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 12, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Jayakishan Minnekanti
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset includes 521 real-world job descriptions for various data analyst roles, compiled solely for educational and research purposes. It was created to support natural language processing (NLP) and skill extraction tasks.

Each row represents a unique job posting with: - Job Title: The role being advertised - Description: The full-text job description

🔍 Use Case:
This dataset was used in the "Job Skill Analyzer" project, which applies NLP and multi-label classification to extract in-demand skills such as Python, SQL, Tableau, Power BI, Excel, and Communication.

🎯 Ideal For: - NLP-based skill extraction - Resume/job description matching - EDA on job market skill trends - Multi-label text classification projects

⚠️ Disclaimer:
- The job descriptions were collected from publicly available postings across multiple job boards.
- No logos, branding, or personally identifiable information is included.
- This dataset is not intended for commercial use.

License: CC BY-NC-SA 4.0
Suitable For: NLP, EDA, Job Market Analysis, Skill Mining, Text Classification
NLP with Disaster Tweets - cleaning data
kaggle.com
zip
Updated Sep 11, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vitalii Mokin (2021). NLP with Disaster Tweets - cleaning data [Dataset]. https://www.kaggle.com/vbmokin/nlp-with-disaster-tweets-cleaning-data
Explore at:
zip(1053715 bytes)Available download formats
Dataset updated
Sep 11, 2021
Authors
Vitalii Mokin
License
Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Description
Context

The data obtained by clearing the Getting Started Prediction Competition "Real or Not? NLP with Disaster Tweets" data is the result of a public notebook "NLP with Disaster Tweets - EDA and Cleaning data". In the future, I plan to improve cleaning and update the dataset

Content

id - a unique identifier for each tweet text - the text of the tweet location - the location the tweet was sent from (may be blank) keyword - a particular keyword from the tweet (may be blank) target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

Acknowledgements

Thanks to Kaggle team for this Competition "Real or Not? NLP with Disaster Tweets" and its datasets (this dataset was created by the company figure-eight and originally shared on their ‘Data For Everyone’ website here. Tweet source: https://twitter.com/AnyOtherAnnaK/status/629195955506708480).

Thanks to web-site Ambulance services drive, strive to keep you alive for your image, which is very similar to the image of the contest "Real or Not? NLP with Disaster Tweets" and which I used as the image of my dataset

Inspiration

You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.
Consumer Complaints Dataset for NLP
kaggle.com
zip
Updated May 24, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shashwat Tiwari (2021). Consumer Complaints Dataset for NLP [Dataset]. https://www.kaggle.com/shashwatwork/consume-complaints-dataset-fo-nlp
Explore at:
zip(20803633 bytes)Available download formats
Dataset updated
May 24, 2021
Authors
Shashwat Tiwari
Description
Context

The Consumer Financial Protection Bureau (CFPB) is a federal U.S. agency that acts as a mediator when disputes arise between financial institutions and consumers. Via a web form, consumers can send the agency a narrative of their dispute. An NLP model would make the classification of complaints and their routing to the appropriate teams more efficient than manually tagged complaints.

Content

A data file was downloaded directly from the CFPB website for training and testing the model. It included one year's worth of data (March 2020 to March 2021). Later in the project, I used an API to download up-to-the-minute data to verify the model's performance.

Each submission was tagged with one of nine financial product classes. Because of similarities between certain classes as well some class imbalances, I consolidated them into five classes:

credit reporting

debt collection

mortgages and loans (includes car loans, payday loans, student loans, etc.)

credit cards

retail banking (includes checking/savings accounts, as well as money transfers, Venmo, etc.)

After data cleaning, the dataset consisted of around 162,400 consumer submissions containing narratives. The dataset was still imbalanced, with 56% in the credit reporting class, and the remainder roughly equally distributed (between 8% and 14%) among the remaining classes.

Acknowledgements

Many thanks to Author halpert3 for providing us the Clean Dataset- Github URL.

Data is organised by CFPB.
Harry Potter Reviews
kaggle.com
zip
Updated Jan 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marc Paulo (2024). Harry Potter Reviews [Dataset]. https://www.kaggle.com/datasets/marcpaulo/harry-potter-reviews
Explore at:
zip(13097 bytes)Available download formats
Dataset updated
Jan 12, 2024
Authors
Marc Paulo
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset contains 491 synthetic reviews of the famous "Harry Potter and the Philosopher's Stone" movie. The reviews were generated using a LLM (Large Language Model).

Harry Potter is a series of seven fantasy novels written by British author J. K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's conflict with Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic, and subjugate all wizards and Muggles (non-magical people).

😉**Play with this data!**😉 - Exploratory-Data-Analysis [*EDA*] - NLP Sentiment Analysis [*NLP, Classification*] - Rating prediction using NLP and other features [*NLP, Regression | Classification*] - Favourite Character Prediction [*Multiclass Classification*] - And much more❗ ...

Harry Potter. (2024, January 10). In Wikipedia. https://en.wikipedia.org/wiki/Harry_Potter

Facebook

Twitter

Click to copy link

Link copied

Cite

Sahil Saxena (2020). Events blog [Dataset]. https://www.kaggle.com/sahilsaxenass/events-blog

Events blog

Big job related events to apply NLP, EDA (usage-recommendation system)

Explore at:

zip(1448144 bytes)Available download formats

Dataset updated

Jun 11, 2020

Authors

Sahil Saxena

Description

Dataset

This dataset was created by Sahil Saxena

Clear search

Close search

Google apps

Main menu

Events blog

Dataset

Contents

Presidency in the middle of the pandemic

Context

Content

Inspiration

Detailed characterization of the dataset.

NLP Research Papers Dataset

Context

Data Fields

File Description

Stop Words Hinglish for NLP

NLP EDA

Dataset

Contents

Movies Rating Data

GLARE: Google Apps Arabic Reviews Dataset

Amazon_Healing_Crystal_Dataset

Dataset

Contents

Reddit r/AskScience Flair Dataset

Reviews Dataset

Dataset

Contents

F1-score results [40].

Research Paper

Context

Content

Acknowledgements

Problem Use Case

Inspiration

Jigsaw Bias Toxicity EDA NLP aug16 alpha0.05

Text Augmentation on Jigsaw Unintended Bias in Toxicity Classification competition training data using EDA_NLP.

Context

Content

Acknowledgements

Inspiration

Constitution Dataset

Content

Criteria for detailed characterization of the dataset.

Job Descriptions Dataset

NLP with Disaster Tweets - cleaning data

Context

Content

Acknowledgements

Inspiration

Consumer Complaints Dataset for NLP

Context

Content

Acknowledgements

Harry Potter Reviews

Events blog

Big job related events to apply NLP, EDA (usage-recommendation system)

Dataset

Contents