Database Contents License (DbCL) v1.0 http://opendatacommons.org/licenses/dbcl/1.0/
This dataset is a processed version of the Sentiment140 corpus, containing 1.6 million tweets with binary sentiment labels. The original data has been cleaned, tokenized, and prepared for natural language processing (NLP) and machine learning tasks. It provides a rich resource for sentiment analysis, text classification, and other NLP applications. The dataset includes the full processed corpus (train-processed.csv) and a smaller sample of 10,000 tweets (train-processed-sample.csv) for quick experimentation and model prototyping. Key Features:
- 1.6 million labeled tweets
- Binary sentiment classification (0 for negative, 1 for positive)
- Preprocessed and tokenized text
- Balanced class distribution
- Suitable for various NLP tasks and model architectures
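A minimal loading sketch for quick prototyping with the sample file; the column names are assumptions, since the header is not listed here:

import pandas as pd

df = pd.read_csv("train-processed-sample.csv")
print(df.shape)
print(df.head())

# If the binary label column is named "sentiment" (an assumption), check class balance:
# print(df["sentiment"].value_counts(normalize=True))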
Citation If you use this dataset in your research or project, please cite the original Sentiment140 dataset: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. Its purpose is to store the datasets used in some of the studies that served as research material for the thesis, as well as the datasets used in its experimental part.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains ratings and reviews for 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from Amazon's official website in January 2023.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
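Since the classes are stored in contiguous blocks, a minimal shuffling sketch in pandas (column names taken from the description above):

import pandas as pd

df = pd.read_csv("data_rt.csv")  # columns: reviews, labels
df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle the rows
print(df["labels"].head())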
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
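For illustration, a minimal sketch of deriving a categorical label from a TextBlob polarity score; the cutoff values are assumptions, not necessarily the thresholds used to build the division column:

from textblob import TextBlob

def polarity_to_division(text, pos_cut=0.05, neg_cut=-0.05):
    # TextBlob polarity ranges from -1 (negative) to 1 (positive).
    score = TextBlob(text).sentiment.polarity
    if score > pos_cut:
        return "positive"
    if score < neg_cut:
        return "negative"
    return "neutral"

print(polarity_to_division("The Echo Dot sounds great and setup was easy"))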
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train machine learning models for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research): AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review, raw) and division (manually added - categorical label generated using overall score).
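A small pandas sketch for working with these columns: converting unixReviewTime to a datetime and deriving a division label from the overall rating. The rating thresholds are assumptions, not the exact rule used for the added column.

import pandas as pd

df = pd.read_csv("Musical_instruments_reviews.csv")

# unixReviewTime is in seconds since the epoch.
df["reviewDate"] = pd.to_datetime(df["unixReviewTime"], unit="s")

# Hypothetical mapping from the 1-5 star overall rating to a categorical division.
df["division"] = pd.cut(df["overall"], bins=[0, 2, 3, 5],
                        labels=["negative", "neutral", "positive"])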
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research): Musical_instruments_reviews2.csv (contains 7137 reviews)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Sentiment Analysis of Commodity News (Gold)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ankurzing/sentiment-analysis-in-commodity-market-gold on 14 February 2022.
--- Dataset description provided by original source is as follows ---
This is a news dataset for the commodity market where we have manually annotated 11,412 news headlines across multiple dimensions into various classes. The dataset has been sampled from a period of 20+ years (2000-2021).
The dataset has been collected from various news sources and annotated by three human annotators who were subject experts. Each news headline was evaluated on various dimensions, for instance: if a headline is price-related news, which direction of price movement it refers to; whether the news headline refers to the past or the future; whether the news item discusses asset comparison; etc.
Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." In Future of Information and Communication Conference, pp. 589-601. Springer, Cham, 2021.
https://arxiv.org/abs/2009.04202 Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." arXiv preprint arXiv:2009.04202 (2020)
We would like to acknowledge the financial support provided by the India Gold Policy Centre (IGPC).
Commodity prices are known to be quite volatile. Machine learning models that understand commodity news well will be able to provide an additional input to short-term and long-term price forecasting models. The dataset will also be useful in creating news-based indicators for commodities.
Apart from researchers and practitioners working in the area of news analytics for commodities, the dataset will also be useful for researchers looking to evaluate their models on classification problems in the context of text-analytics. Some of the classes in the dataset are highly imbalanced and may pose challenges to the machine learning algorithms.
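Because some classes are highly imbalanced, class weighting is one simple mitigation; a hedged sketch with scikit-learn (the label values are toy placeholders, not the dataset's actual class names):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy stand-in for the annotated headline classes.
labels = np.array(["price-up", "price-up", "price-up", "price-down", "neutral"])

classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
print(dict(zip(classes, weights)))  # rarer classes receive larger weights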
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains tweets labeled for sentiment analysis, categorized into Positive, Negative, and Neutral sentiments. The dataset includes tweet IDs, user metadata, sentiment labels, and tweet text, making it suitable for Natural Language Processing (NLP), machine learning, and AI-based sentiment classification research. Originally sourced from Kaggle, this dataset is curated for improved usability in social media sentiment analysis.
A Simple but Rich Dataset for Sentiment Analysis of Chat Messages
This dataset contains a collection of chat messages that can be used to develop a sentiment analysis machine learning model to classify messages into 3 sentiment classes - positive, negative, and neutral. The messages are diverse in nature, containing not only simple text but also special characters, numbers, emoji/emoticons, and URL addresses. The dataset can be used for various natural language processing tasks related to chat analysis.
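A minimal preprocessing sketch for messages containing URLs, numbers, and emoji; the placeholder tokens and patterns are illustrative assumptions, not the dataset authors' pipeline:

import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def clean_message(text: str) -> str:
    text = URL_RE.sub(" <url> ", text)        # replace URLs with a placeholder
    text = re.sub(r"\d+", " <num> ", text)    # replace standalone numbers
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(clean_message("Check https://example.com now!! 😊 order #1234"))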
Dataset with sentiment of Russian text
Contains an aggregated dataset of Russian texts from 6 datasets.
Labels meaning
0: NEUTRAL
1: POSITIVE
2: NEGATIVE
Datasets
Sentiment Analysis in Russian
Sentiments (positive, negative, or neutral) of news in the Russian language, from a Kaggle competition.
Russian Language Toxic Comments
Small dataset with labeled comments from 2ch.hk and pikabu.ru.
Dataset of car reviews for machine learning (sentiment analysis)
Glazkova A.… See the full description on the dataset page: https://huggingface.co/datasets/MonoHime/ru_sentiment_dataset.
Analyzing sentiments related to various products such as tablets, mobiles, and other gizmos can be fun and difficult, especially when the reviews are collected across various demographics around the world. With this dataset, develop a machine learning model to accurately classify product reviews into 4 different sentiment classes based on the raw text review provided by the user. Analyzing these sentiments will not only help serve the customers better but can also reveal a lot of customer traits present or hidden in the reviews.
Sentiment analysis requires a lot to be taken into account, mainly due to the preprocessing involved in representing raw text and making it machine-understandable. Usually, we stem and lemmatize the raw text and then represent it using TF-IDF, Word Embeddings, etc. However, with state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering like TF-IDF and Count Vectorizers.
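For comparison with the manual route described above, a minimal TF-IDF baseline sketch; the texts, labels, and class encoding below are toy assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for review texts and their 4 sentiment classes
# (0 - Cannot Say, 1 - Negative, 2 - Positive, 3 - No Sentiment).
texts = ["great tablet, love it", "battery died in a week",
         "no strong feelings either way", "arrived on monday"]
labels = [2, 1, 3, 0]

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)
print(baseline.predict(["decent phone for the price"]))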
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Deep-NLP’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/samdeeplearning/deepnlp on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Sheet_1.csv contains 80 user responses, in the response_text column, to a therapy chatbot. Bot said: 'Describe a time when you have acted as a resource for someone else'. User responded. If a response is 'not flagged', the user can continue talking to the bot. If it is 'flagged', the user is referred to help.
Sheet_2.csv contains 125 resumes, in the resume_text column. Resumes were queried from Indeed.com with keyword 'data scientist', location 'Vermont'. If a resume is 'not flagged', the applicant can submit a modified resume version at a later date. If it is 'flagged', the applicant is invited to interview.
Classify new resumes/responses as flagged or not flagged.
There are two sets of data here - resumes and responses. Split the data into a train set and a test set to test the accuracy of your classifier. Bonus points for using the same classifier for both problems.
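A minimal split sketch, assuming a label column named "class" (the actual header may differ):

import pandas as pd
from sklearn.model_selection import train_test_split

responses = pd.read_csv("Sheet_1.csv")  # response_text column
resumes = pd.read_csv("Sheet_2.csv")    # resume_text column

# "class" is an assumed name for the flagged/not-flagged column.
resp_train, resp_test = train_test_split(responses, test_size=0.2, random_state=0,
                                         stratify=responses["class"])
res_train, res_test = train_test_split(resumes, test_size=0.2, random_state=0,
                                       stratify=resumes["class"])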
Good luck.
Thank you to Parsa Ghaffari (Aylien), without whom these visuals (the cover photo is from Parsa Ghaffari's excellent LinkedIn article on English, Spanish, and German positive v. negative sentiment analysis) would not exist.
You can use any of the code in that kernel anywhere, on or off Kaggle. Ping me at @_samputnam for questions.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
If you use the dataset, cite the paper: https://doi.org/10.1016/j.eswa.2022.117541
The most comprehensive dataset to date regarding climate change and human opinions via Twitter. It has the heftiest temporal coverage, spanning over 13 years, includes over 15 million tweets spatially distributed across the world, and provides the geolocation of most tweets. Seven dimensions of information are tied to each tweet, namely geolocation, user gender, climate change stance and sentiment, aggressiveness, deviations from historic temperature, and topic modeling, while accompanied by environmental disaster events information. These dimensions were produced by testing and evaluating a plethora of state-of-the-art machine learning algorithms and methods, both supervised and unsupervised, including BERT, RNN, LSTM, CNN, SVM, Naive Bayes, VADER, Textblob, Flair, and LDA.
The following columns are in the dataset:
➡ created_at: The timestamp of the tweet.
➡ id: The unique id of the tweet.
➡ lng: The longitude the tweet was written.
➡ lat: The latitude the tweet was written.
➡ topic: Categorization of the tweet in one of ten topics namely, seriousness of gas emissions, importance of human intervention, global stance, significance of pollution awareness events, weather extremes, impact of resource overconsumption, Donald Trump versus science, ideological positions on global warming, politics, and undefined.
➡ sentiment: A score on a continuous scale. This scale ranges from -1 to 1 with values closer to 1 being translated to positive sentiment, values closer to -1 representing a negative sentiment while values close to 0 depicting no sentiment or being neutral.
➡ stance: That is if the tweet supports the belief of man-made climate change (believer), if the tweet does not believe in man-made climate change (denier), and if the tweet neither supports nor refuses the belief of man-made climate change (neutral).
➡ gender: Whether the user that made the tweet is male, female, or undefined.
➡ temperature_avg: The temperature deviation in Celsius and relative to the January 1951-December 1980 average at the time and place the tweet was written.
➡ aggressiveness: That is if the tweet contains aggressive language or not.
Since Twitter forbids making the text of the tweets public, you need to go through a process called hydrating to retrieve it. Tools such as Twarc or Hydrator can be used to hydrate tweets.
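A hydration sketch, assuming the twarc v2 Python client and a valid API bearer token (API access terms may have changed since the dataset was published):

from twarc import Twarc2

client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")  # placeholder credential

tweet_ids = ["1234567890123456789"]  # values from the dataset's id column
for page in client.tweet_lookup(tweet_ids):
    for tweet in page.get("data", []):
        print(tweet["id"], tweet["text"])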
The inspiration behind creating the OYO Review Dataset for sentiment analysis was to explore the sentiment and opinions expressed in hotel reviews on the OYO Hotels platform. Analyzing the sentiment of customer reviews can provide valuable insights into the overall satisfaction of guests, identify areas for improvement, and assist in making data-driven decisions to enhance the hotel experience. By collecting and curating this dataset, Deep Patel, Nikki Patel, and Nimil aimed to contribute to the field of sentiment analysis in the context of the hospitality industry. Sentiment analysis allows us to classify the sentiment expressed in textual data, such as reviews, into positive, negative, or neutral categories. This analysis can help hotel management and stakeholders understand customer sentiments, identify common patterns, and address concerns or issues that may affect the reputation and customer satisfaction of OYO Hotels. The dataset provides a valuable resource for training and evaluating sentiment analysis models specifically tailored to the hospitality domain. Researchers, data scientists, and practitioners can utilize this dataset to develop and test various machine learning and natural language processing techniques for sentiment analysis, such as classification algorithms, sentiment lexicons, or deep learning models. Overall, the goal of creating the OYO Review Dataset for sentiment analysis was to facilitate research and analysis in the area of customer sentiments and opinions in the hotel industry. By understanding the sentiment of hotel reviews, businesses can strive to improve their services, enhance customer satisfaction, and make data-driven decisions to elevate the overall guest experience.
Deep Patel: https://www.linkedin.com/in/deep-patel-55ab48199/
Nikki Patel: https://www.linkedin.com/in/nikipatel9/
Nimil Lathiya: https://www.linkedin.com/in/nimil-lathiya-059a281b1/
Overview
Analyzing sentiments related to various products such as tablets, mobiles, and other gizmos can be fun and difficult, especially when the reviews are collected across various demographics around the world. In this weekend hackathon, we challenge the machinehackers community to develop a machine learning model to accurately classify product reviews into 4 different sentiment classes based on the raw text review provided by the user. Analyzing these sentiments will not only help us serve the customers better but can also reveal a lot of customer traits present or hidden in the reviews.
Sentiment analysis requires a lot to be taken into account, mainly due to the preprocessing involved in representing raw text and making it machine-understandable. Usually, we stem and lemmatize the raw text and then represent it using TF-IDF, Word Embeddings, etc. However, with state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering like TF-IDF and Count Vectorizers.
In this short span of time, we would encourage you to leverage the ImageNet moment (Transfer Learning) in NLP using various pre-trained models.
Dataset Description:
Train.csv - 6364 rows x 4 columns (includes the Sentiment column as target)
Test.csv - 2728 rows x 3 columns
Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission
Attribute Description:
Text_ID - Unique Identifier
Product_Description - Description of the product review by a user
Product_Type - Different types of product (9 unique products)
Class - Represents various sentiments: 0 - Cannot Say, 1 - Negative, 2 - Positive, 3 - No Sentiment
Skills:
NLP, Sentiment Analysis
Feature extraction from raw text using TF-IDF, CountVectorizer
Using Word Embedding to represent words as vectors
Using Pretrained models like Transformers, BERT
Optimizing multi-class log loss to generalize well on unseen data
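A short sketch of the multi-class log loss metric mentioned above, using scikit-learn:

import numpy as np
from sklearn.metrics import log_loss

# Toy example: true classes (0-3) and predicted probabilities for three samples.
y_true = [2, 1, 3]
y_prob = np.array([
    [0.05, 0.05, 0.80, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
print(log_loss(y_true, y_prob, labels=[0, 1, 2, 3]))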
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This is the readme for the supplemental data for our ICDAR 2019 paper.
You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202
If you found this dataset useful, please consider citing our paper:
@inproceedings{DBLP:conf/icdar/MorrisTE19,
author = {David Morris and
Peichen Tang and
Ralph Ewerth},
title = {A Neural Approach for Text Extraction from Scholarly Figures},
booktitle = {2019 International Conference on Document Analysis and Recognition,
{ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
pages = {1438--1443},
publisher = {{IEEE}},
year = {2019},
url = {https://doi.org/10.1109/ICDAR.2019.00231},
doi = {10.1109/ICDAR.2019.00231},
timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).
We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset and used it as our validation dataset.
These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2
The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.
We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.
We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.
Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.
We used a tesseract script to run text extraction from detected text rows. This is inside our code code.tar as text_recognition_multipro.py.
We used a Java program provided by Falk Böschen and adapted it to our file structure. We included this as evaluator.jar.
Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
This dataset contains text samples labeled with sentiment categories, including positive, negative, and neutral sentiments. It is designed for sentiment analysis and can be used to train and evaluate machine learning models aimed at understanding the emotional tone of text data.
This dataset can be used for training sentiment analysis models, evaluating model performance, and conducting research on natural language processing (NLP) and sentiment classification. It is suitable for machine learning projects focusing on sentiment detection, opinion mining, and text classification.
The dataset is provided in CSV format with the following columns:
- tweet: the text of the post, e.g. "I'm so proud of my team for finishing the project ahead of schedule!"
- label: Positive, Negative, or Neutral
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study significantly contributes to the sphere of educational technology by deploying state-of-the-art machine learning and deep learning strategies for meaningful changes in education. The hybrid stacking approach did an excellent implementation using Decision Trees, Random Forest, and XGBoost as base learners with Gradient Boosting as a meta-learner, which managed to record an accuracy of 90%. That indeed puts into great perspective the huge potential it possesses for accuracy measures while predicting in educational setups. The CNN model, which predicted with an accuracy of 89%, showed quite impressive capability in sentiment analysis to acquire further insight into the emotional status of the students. RCNN, Random Forests, and Decision Trees contribute to the possibility of educational data complexity with valuable insight into the complex interrelationships within ML models and educational contexts. The application of the bagging XGBoost algorithm, which attained a high accuracy of 88%, further stamps its utility toward enhancement of academic performance through strong robust techniques of model aggregation.

The dataset that was used in this study was sourced from Kaggle, with 1205 entries of 14 attributes concerning adaptability, sentiment, and academic performance; the reliability and richness of the analytical basis are high. The dataset allows rigorous modeling and validation to be done to ensure the findings are considered robust.

This study has several implications for education and develops on the key dimensions: teacher effectiveness, educational leadership, and well-being of the students. From the obtained information about student adaptability and sentiment, the developed system helps educators to make modifications in instructional strategy more efficiently for a particular student to enhance effectiveness in teaching. All these aspects could provide critical insights for the educational leadership to devise data-driven strategies that would enhance the overall school-wide academic performance, as well as create a caring learning atmosphere. The integration of sentiment analysis within the structure of education brings an inclusive, responsive attitude toward ensuring students' well-being and, thus, a caring educational environment.

The study is closely aligned with sustainable ICT in education objectives and offers a transformative approach to integrating AI-driven insights with practice in this field. By integrating notorious ML and DL methodologies with educational challenges, the research puts the basis for future innovations and technology in this area. Ultimately, it contributes to sustainable improvement in the educational system.
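A minimal sketch of the stacking arrangement described above (Decision Tree, Random Forest, and XGBoost base learners with a Gradient Boosting meta-learner); hyperparameters and data handling are placeholders, not the study's exact configuration:

from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier  # requires the xgboost package

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=5)),
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("xgb", XGBClassifier(n_estimators=200, eval_metric="logloss")),
    ],
    final_estimator=GradientBoostingClassifier(),
    cv=5,
)
# stack.fit(X_train, y_train)      # X_train/y_train: features and labels from the Kaggle dataset
# print(stack.score(X_test, y_test))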
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data Source: https://www.kaggle.com/datasets/gufukuro/movie-scripts-corpus
Data Description: Movie Scripts Corpus
This corpus was collected for screenplay analysis with machine learning methods. The corpus includes movie scripts crawled from different sources, their annotations by script structural elements, and movie metadata.
Screenplay data consists of:
- Movie script TXT documents with raw full text (2858 docs)
- Movie script TXT documents with full-text lemmas (2858 docs)
- Manual annotation TXT documents for some movie scripts (33 docs, more than 6000 annotated rows)
- Movie script annotation TXT documents obtained by BERT
- Movie script annotation JSON documents obtained by the rule-based annotator ScreenPy
Movie metadata consists of:
- Cut versions of movie reviews and scores from Metacritic (number of reviews: 21025; number of movies with reviews: 2038)
- Metadata for movies, including: title, akas, launch year, score from Metacritic, IMDb user rating and number of votes from imdb.com, movie awards, opening weekend, producers, budget, script department, production companies, writers, directors, cast info, countries involved in production, age restriction, plot (with outline), keywords, genres, taglines, critics' synopsis
- Screenplay awards information: Academy Awards adapted screenplay, Academy Awards original screenplay, BAFTA, Golden Globe Award for Best Screenplay, Writers Guild Awards Winners & Nominees 2020-2013; nomination information for 462 movies in total
Movie character data consists of:
- Script text fragments with dialogs and scene descriptions for characters, gathered with annotators: 2153 movies and text fragments for 32114 characters in total
- Gender labels for 4792 characters
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: This resource consists of calibration sets (N=100) for 4 publicly available datasets, created using the LLM-Assisted Content Analysis (LACA) method. Each dataset contains the following columns:
- text_id: Unique ID for each text document
- code_id: Unique ID for each code category
- text: Document text that has been coded
- original_code: Coded response from the original datasets
- replicated_code: Coded response from an independent coding exercise by our study team
- model_code: Coded response generated by the LLM (GPT-3.5-turbo)
- reason: LLM-generated reason for the coding decision
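A minimal agreement check between the LLM codes and the original codes, assuming the column names above (the file name is hypothetical):

import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.read_csv("calibration_set.csv")  # hypothetical file name
kappa = cohen_kappa_score(df["original_code"], df["model_code"])
print(f"LLM vs. original coding agreement (Cohen's kappa): {kappa:.2f}")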
Additional details on methods and definitions of individual code categories are available in the following paper:
Chew, R., Bollenbacher, J., Speer, J., Wenger, M., & Kim, A. (2023). LLM-Assisted Content Analysis: Using Large Language Models to Support Deductive Coding.
Trump Tweets
Citation: Coe, Kevin, Berger, Julia, Blumling, Allison, Brooks, Katelyn, Giorgi, Elizabeth, Jackson, Jennifer, … Wellman, Mariah. Quantitative Content Analysis of Donald Trump’s Twitter, 2017-2019. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2020-04-01. https://doi.org/10.3886/E118603V1 Source: https://www.openicpsr.org/openicpsr/project/118603/version/V1/view
BBC News
Citation: Greene, D., & Cunningham, P. (2006). Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd international conference on Machine learning (pp. 377-384). Source: https://www.kaggle.com/datasets/shivamkushwaha/bbc-full-text-document-classification
Contrarian Claims
Citation: Coan, T. G., Boussalis, C., Cook, J., & Nanko, M. O. (2021). Computer-assisted classification of contrarian claims about climate change. Scientific reports, 11(1), 22320. Source: https://socialanalytics.ex.ac.uk/cards/data.zip
Ukraine Water Problems
Citation: Afanasyev S, N. B, Bodnarchuk T, S. V, M. V, T. V, Yu V, K. G, V. D, Konovalenko O, O. K, E. K, Lietytska O, O. L, V. M, Marushevska O, Mokin V, K. M, Osadcha N, O. I (2013) River Basin Management Plan for Pivdenny Bug: river basin analysis and measures Source: https://www.kaggle.com/datasets/vbmokin/nlp-reports-news-classification
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
This Stock Market Dataset is designed for predictive analysis and machine learning applications in financial markets. It includes 13647 records of simulated stock trading data with features commonly used in stock price forecasting.
🔹 Key Features
- Date – Trading day timestamps (business days only)
- Open, High, Low, Close – Simulated stock prices
- Volume – Trading volume per day
- RSI (Relative Strength Index) – Measures market momentum
- MACD (Moving Average Convergence Divergence) – Trend-following momentum indicator
- Sentiment Score – Simulated market sentiment from financial news & social media
- Target – Binary label (1: Price goes up, 0: Price goes down) for next-day prediction
This dataset is useful for training hybrid deep learning models such as LSTM, CNN, and Attention-based networks for stock market forecasting. It enables financial analysts, traders, and AI researchers to experiment with market trends, technical analysis, and sentiment-based predictions.
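For reference, one common way to compute a 14-period RSI in pandas; this is an illustrative convention, not necessarily how the dataset's RSI column was generated:

import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    delta = close.diff()
    gains = delta.clip(lower=0).rolling(period).mean()
    losses = (-delta.clip(upper=0)).rolling(period).mean()
    rs = gains / losses
    return 100 - 100 / (1 + rs)

# df = pd.read_csv("stock_dataset.csv")  # hypothetical file name with a Close column
# df["RSI_check"] = rsi(df["Close"])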
Airline Passenger Reviews Dataset
This dataset contains real-world airline passenger reviews gathered from various flights across different countries and airlines. Each row represents an individual passenger's experience and feedback on a specific flight.
Dataset Overview
The dataset includes reviews on several aspects of the airline experience, such as seat comfort, food, cabin crew service, value for money, and more. It can be used for sentiment analysis, NLP-based text classification, airline performance evaluation, and other machine learning or data visualization tasks.