https://creativecommons.org/publicdomain/zero/1.0/
Problem Statement You are working as a data scientist in a global finance company. Over the years, the company has collected basic bank details and gathered a lot of credit-related information. The management wants to build an intelligent system to segregate the people into credit score brackets to reduce the manual efforts.
Task Given a person’s credit-related information, build a machine learning model that can classify the credit score.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
The Yahoo! Answers topic classification dataset is constructed using 10 largest main categories. Each class contains 140,000 training samples and 6,000 testing samples. Therefore, the total number of training samples is 1,400,000 and testing samples 60,000 in this dataset. From all the answers and other meta-information, we only used the best answer content and the main category information.
The file classes.txt contains a list of classes corresponding to each label.
The files train.csv and test.csv contain all the training samples as comma-separated values. There are 4 columns in them, corresponding to class index (1 to 10), question title, question content and best answer. The text fields are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
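A minimal Python sketch of reading this format: the double-quote escaping described above is standard CSV quoting, which the stdlib csv module handles natively, so only the backslash-n newline escape needs undoing. The sample row below is made up; in practice you would pass an open handle to train.csv or test.csv.

```python
import csv
import io

def read_samples(lines):
    """Parse rows of (class_index, title, content, best_answer).

    The csv module handles the doubled-quote escaping; the literal
    backslash-n newline escape is undone manually.
    """
    rows = []
    for class_idx, title, content, answer in csv.reader(lines):
        rows.append((int(class_idx), title, content,
                     answer.replace("\\n", "\n")))
    return rows

# Hypothetical row in the documented 4-column format:
sample = '"5","Why is the sky blue?","","Rayleigh scattering.\\nShort answer."\n'
rows = read_samples(io.StringIO(sample))
```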
https://creativecommons.org/publicdomain/zero/1.0/
The original dataset consists of 2225 documents (as text files) from the BBC News website, corresponding to stories in five topical areas from 2004-2005. The files are segregated into 5 folders:
As part of data wrangling, the original dataset is pre-processed in three stages:
Note: Each stage persists and improves the data from the previous stage into a new CSV file.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is the CSV, cleaned version of this dataset: link_to_the_original_Data. You can use this data to train your NLP skills.
This dataset is based on the 300,000 training images from the Alaska2 Competition.
train: 225,000 observations labeled from 0 to 3. valid: 75,000 observations labeled from 0 to 3.
https://i.imgur.com/x6dsHc1.png
Paths are based on the image folders in Alaska2 Competition on Kaggle.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This is a dataset of movie reviews to be used for the NLP task of sentiment analysis. It is in the form of sentences, where every sentence is given a sentiment score from 0 to 4 (0 = Very Bad, 1 = Bad, 2 = Neutral, 3 = Good, 4 = Very Good).
https://creativecommons.org/publicdomain/zero/1.0/
By Philipp Schmid (from Hugging Face) [source]
The dataset is provided in two separate files: train.csv and test.csv. The train.csv file contains a substantial amount of labeled data with columns for the text data itself, as well as their corresponding binary and multi-class labels. This enables users to develop and train machine learning models effectively using this dataset.
Similarly, test.csv includes additional examples for evaluating pre-trained models or assessing model performance after training on train.csv. It follows a similar structure as train.csv with columns representing text data, binary labels, and multi-class labels.
Its rich content, its extensive labeling scheme for binary and multi-class classification tasks, and its ease of use thanks to the tabular CSV format make this dataset an excellent choice for anyone looking to advance their NLP capabilities through diverse text classification challenges.
How to Use this Dataset for Text Classification
This guide will provide you with useful information on how to effectively utilize this dataset for your text classification projects.
Understanding the Columns
The dataset consists of several columns, each serving a specific purpose:
text: This column contains the actual text data that needs to be classified. It is the primary feature for your modeling task.
binary: This column represents the binary classification label associated with each text entry. The label indicates whether the text belongs to one class or another. For example, it could be used to classify emails as either spam or not spam.
multi: This column represents the multi-class classification label associated with each text entry. The label indicates which class or category the text belongs to out of multiple possible classes. For instance, it can be used to categorize news articles into topics like sports, politics, entertainment, etc.
Dataset Files
The dataset is provided in two files:
train.csv and test.csv.
train.csv: This file contains a subset of labeled data specifically intended for training your models. It includes columns for both text data and their corresponding binary and multi-class labels.
test.csv: In order to evaluate your trained models' performance on unseen data, this file provides additional examples similar in structure and format to train.csv. It includes columns for both texts and their respective binary and multi-class labels as well.
Getting Started
To make use of this dataset effectively, here are some steps you can follow:
- Download both train.csv and test.csv files containing labeled examples.
- Load these datasets into your preferred machine learning environment (such as Python with libraries like Pandas or Scikit-learn).
- Explore the dataset by examining its structure, summary statistics, and visualizations.
- Preprocess the text data as needed, which may include techniques like tokenization, removing stop words, stemming/lemmatizing, and encoding text into numerical representations (such as bag-of-words or TF-IDF vectors).
- Consider splitting the train.csv data further into training and validation sets for model development and evaluation.
- Select appropriate machine learning algorithms for your text classification task (e.g., Naive Bayes, Logistic Regression, Support Vector Machines) and train them.
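The steps above can be sketched as follows, using TF-IDF features with Logistic Regression (one of the suggested algorithms). The tiny in-memory sample is a stand-in for the text and binary columns of train.csv.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-ins for the "text" and "binary" columns of train.csv.
texts = ["free prize click now", "win money fast",
         "meeting at noon", "lunch tomorrow?"]
binary = [1, 1, 0, 0]  # e.g. spam vs. not spam

# TF-IDF encoding followed by a linear classifier, as outlined above.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, binary)

pred = clf.predict(["claim your free money"])[0]
```

The same pipeline works unchanged for the multi column, since LogisticRegression handles multi-class targets.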
- Sentiment Analysis: The dataset can be used to classify text data into positive or negative sentiment, based on the binary classification label. This can be helpful in analyzing customer reviews, social media sentiment, and feedback analysis.
- Topic Categorization: The multi-class classification label can be used to categorize text into different topics or themes. This can be useful in organizing large amounts of text data, such as news articles or research papers.
- Spam Detection: The binary classification label can be used to identify whether a text message or email is spam or not. This can help users filter out unwanted messages and improve their overall communication experience.
Overall, this dataset provides an opportunity to create models for various applications of text classification such as sentiment analysis, topic categorization, and spam detection.
If you use this dataset in your research, please credit the original authors. [Data Source](https://huggingface.co/datase...
YouTube (the world-famous video sharing website) maintains a list of the top trending videos on the platform. According to Variety magazine, "To determine the year's top-trending videos, YouTube uses a combination of factors including measuring user interactions (number of views, shares, comments and likes). Note that they're not the most-viewed videos overall for the calendar year". Top performers on the YouTube trending list are music videos (such as the famously viral "Gangnam Style"), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well-known for.
This dataset contains video Title, Videourl, Category, and Description. Task: predict the Category using the Description or title of the video.
This dataset contains one file named Youtube Video Dataset.csv.
In this file there are 4 columns: Title, Videourl, Category, Description.
Title - Title or name of the video
Videourl - Unique video ID or URL
Category - Category of the video
Description - Description of the video
Possible uses for this dataset could include:
• Predicting the Category of a video from its Description or title
• Data visualization
For further inspiration, see the kernels on this dataset!
https://creativecommons.org/publicdomain/zero/1.0/
For the original dataset, see the link below:
+ original: https://www.kaggle.com/trolukovich/apparel-images-dataset
I added a csv file containing colors and labels. See data.
e.g. black_dress --> [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
Also, the image column of the csv file contains the full path where the image exists.
The dataset consists of 11,385 images and includes the following categories:
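A sketch of how a color_category folder name can be turned into the 11-element multi-hot vector shown above. The exact color and category lists are assumptions, chosen to be consistent with the black_dress example (6 colors followed by 5 garment categories, alphabetically ordered):

```python
# Assumed label order: 6 colors first, then 5 garment categories.
COLORS = ["black", "blue", "brown", "green", "red", "white"]
CATEGORIES = ["dress", "pants", "shirt", "shoes", "shorts"]

def encode(name):
    """Map e.g. 'black_dress' to its multi-hot label vector."""
    color, category = name.split("_")
    vec = [0] * (len(COLORS) + len(CATEGORIES))
    vec[COLORS.index(color)] = 1
    vec[len(COLORS) + CATEGORIES.index(category)] = 1
    return vec

label = encode("black_dress")  # [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
```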
AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
The AG's news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.
The file classes.txt contains a list of classes corresponding to each label.
The files train.csv and test.csv contain all the training samples as comma-separated values. There are 3 columns in them, corresponding to class index (1 to 4), title and description. The title and description are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
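A sketch of joining each row's 1-based class index with the label names from classes.txt. The four names below are the standard AG News classes; on disk they appear one per line in classes.txt, and the example row is made up:

```python
import csv
import io

# Contents of classes.txt, in label order (index 1 to 4).
classes = ["World", "Sports", "Business", "Sci/Tech"]

# Made-up row in the documented 3-column format.
row = '"3","Oil prices climb","Crude futures rose again today."\n'
for class_idx, title, description in csv.reader(io.StringIO(row)):
    label = classes[int(class_idx) - 1]  # class indices are 1-based
```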
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a large-scale collection of 241,000+ English-language comments sourced from various online platforms. Each comment is annotated with a sentiment label:
The data has been gathered from multiple websites such as:
Hugging Face: https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
Kaggle : https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset
https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
The goal is to enable training and evaluation of multi-class sentiment analysis models for real-world text data. The dataset is already preprocessed — lowercase, cleaned from punctuation, URLs, numbers, and stopwords — and is ready for NLP pipelines.
| Column | Description |
|---|---|
| Comment | User-generated text content |
| Sentiment | Sentiment label (0 = Negative, 1 = Neutral, 2 = Positive) |
Comment: "apple pay is so convenient secure and easy to use"
Sentiment: 2 (Positive)
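A small sketch of tallying the three classes while streaming the file. The column names follow the table above; the inline rows (including the example comment) are stand-ins for the real CSV:

```python
import csv
import io
from collections import Counter

LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}

# Made-up stand-in for the real CSV file.
data = io.StringIO(
    "Comment,Sentiment\n"
    "apple pay is so convenient secure and easy to use,2\n"
    "the update broke everything,0\n"
    "it arrived on time,1\n"
)
counts = Counter(LABELS[int(row["Sentiment"])] for row in csv.DictReader(data))
```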
The dataset has been built using the publicly available news data from the Hiru News website, which is a reputable news source in Sri Lanka.
Please cite the AdaptText research paper.
Format: CSV - Single File
The lack of proper Sinhala multiclass datasets inspired me to contribute a new dataset for the research community.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is tailored for multi-class prediction of cirrhosis outcomes, containing meticulously curated training and testing sets. The training set comprises a diverse array of patient data with associated cirrhosis outcomes, while the test set is prepared for model evaluation. Participants are challenged to predict outcomes for unseen data and submit their predictions in CSV format following the specified submission guidelines. Dive into this comprehensive dataset to advance predictive modeling in cirrhosis research.
This is Arabic news data with 9 categories, in CSV format.
original data link: https://www.kaggle.com/antcorpus/antcorpus
This dataset was specifically created for the Shopee Code League 2020 Product Detection competition. This competition lasted for 2 weeks and required all the teams and participants to come up with an image classification model. The purpose of creating this dataset is to resize the original dataset provided into 299x299 images to fit within the Kaggle Kernel limitations. The number of images is the same as the number of rows provided in train.csv and test.csv.
Please refer: https://www.kaggle.com/c/shopee-product-detection-open/overview
This dataset consists of 1 folder and 2 CSV files: an images folder, train.csv and test.csv.
We would like to thank Shopee for hosting a series of great competitions and giving chances for us to work with real world problems.
Train.csv - 6364 rows x 4 columns (includes Sentiment column as target)
Test.csv - 2728 rows x 3 columns
Sample Submission.csv
Text_ID - Unique identifier
Product_Description - Description of the product review by a user
Product_Type - Different types of product (9 unique products)
Class - Represents various sentiments: 0 - Cannot Say, 1 - Negative, 2 - Positive, 3 - No Sentiment
https://creativecommons.org/publicdomain/zero/1.0/
In India, every year lakhs of students sit for competitive examinations like JEE Advanced, JEE Mains, NEET, etc. These exams are said to be the gateway to admission into India's premier institutes such as IITs, NITs, AIIMS, etc. Since the competition is tough, with lakhs of students appearing for these examinations, there has been enormous development in the Ed Tech industry in India, helping fulfil the dreams of lakhs of aspirants by providing online as well as offline coaching, mentoring, etc. This particular dataset consists of questions/doubts raised by students preparing for such examinations.
The dataset contains 3 CSV files. All of them have the same columns, as there is no competition split; the dataset is split randomly across these 3 CSV files. Inside each CSV file, we have four columns:
q_id: Question ID, unique for every question
eng: The full question or description of the question
class: The class/grade in the Indian education system the question belongs to
chapter: Target classes
So it's basically an NLP problem where we have the question description and we need to find out which chapter this question belongs to. Note: More updates might be added in future versions.
Welcome to another exciting weekend hackathon to flex your machine learning classification skills by classifying various flowers into 8 different classes. To recognize the right flower you will be using 6 different attributes to classify them into the right set of classes (0-7). Using computer vision for such recognition has reached the state of the art, but collecting image data needs lots of human labor to annotate the images with labels/bounding boxes for detection/segmentation tasks. Hence, some generic attributes which can be collected easily from various Areas/Localities/Regions were captured for various species of flowers.
In this hackathon, we are challenging the MachineHack community to use classical machine learning classification techniques to come up with a model that can generalize well on unseen data, given explanatory attributes about the flower species instead of a picture.
In this competition, you will be learning advanced classification techniques, handling higher cardinality categorical variables, and much more.
Dataset Description:
Train.csv - 12666 rows x 7 columns (includes Class as target column)
Test.csv - 29555 rows x 6 columns
Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.
Attributes Description:
Area_Code - Generic area code the species were collected from
Locality_Code - Locality code the species were collected from
Region_Code - Region code the species were collected from
Height - Height collected from lab data
Diameter - Diameter collected from lab data
Species - Species of the flower
Class - Target Column (0-7) classes
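The attributes above can be wired into a standard tabular pipeline. The sketch below one-hot encodes the three code columns (one common way to handle the higher-cardinality categoricals mentioned earlier) and passes the lab measurements through unchanged; the tiny synthetic frame is a stand-in for Train.csv.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for Train.csv, with the documented columns.
train = pd.DataFrame({
    "Area_Code": ["A1", "A2", "A1", "A3"],
    "Locality_Code": ["L1", "L2", "L3", "L1"],
    "Region_Code": ["R1", "R1", "R2", "R2"],
    "Height": [4.2, 5.1, 3.9, 6.0],
    "Diameter": [1.1, 0.9, 1.4, 1.2],
    "Class": [0, 3, 0, 7],
})

# One-hot encode the categorical codes; keep Height/Diameter as-is.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"),
      ["Area_Code", "Locality_Code", "Region_Code"])],
    remainder="passthrough",
)
model = make_pipeline(pre, RandomForestClassifier(random_state=0))
model.fit(train.drop(columns=["Class"]), train["Class"])
preds = model.predict(train.drop(columns=["Class"]))
```

handle_unknown="ignore" matters here: Test.csv may contain area/locality/region codes never seen in training, and this setting encodes them as all-zeros instead of raising an error.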
This dataset is made up of images from 8 different environments. 37 video sources have been processed: every 1 second an image is extracted (a frame at 0.5s, 1.5s, 2.5s, and so on) and, to accompany that image, MFCC audio statistics are extracted from the relevant second of video.
In this dataset, you will notice some common errors from single classifiers. For example, in the video of London, the image classifier confuses the environment with "FOREST" when a lady walks past with flowing hair. Likewise, the audio classifier is misled into "RIVER" when we walk past a large fountain in Las Vegas, due to the sounds of flowing water. Both of these errors can be fixed by a multi-modal approach, where fusion allows for the correction of errors: in our study, both of these cases were correctly classified as "CITY", since multimodality can compensate for single-modal errors caused by anomalous data.
Look and Listen: A Multi-Modal Late Fusion Approach to Scene Classification for Autonomous Machines Jordan J. Bird, Diego R. Faria, Cristiano Premebida, Aniko Ekart, and George Vogiatzis
In this challenge, we can learn environments ("Where am I?") from either images, audio, or take a multimodal approach to fuse the data.
Multi-modal fusion often requires far fewer computing resources than temporal models, but sometimes at the cost of classification ability. Can a method of fusion overcome this? Let's find out!
Class data are given as strings in dataset.csv.
Each row of the dataset contains a path to the image, as well as the MFCC data extracted from the second of video that accompanies the frame.
(Copied and pasted from the paper:) we extract the Mel-Frequency Cepstral Coefficients (MFCC) of the audio clips through a set of sliding windows 0.25s in length (i.e., a frame size of 4K sampling points) and an additional set of overlapping windows, thus producing 8 sliding windows, 8 frames/sec. From each audio frame, we extract 13 MFCC attributes, producing 104 attributes per 1-second clip.
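The window arithmetic in the excerpt can be sanity-checked in a few lines: 0.25 s windows plus a second, 50%-offset set of windows amount to one window start every 0.125 s, i.e. 8 windows per second and 8 x 13 = 104 attributes per 1-second clip.

```python
WINDOW = 0.25   # window length in seconds
HOP = 0.125     # two interleaved window sets = one start every 0.125 s
N_MFCC = 13     # MFCC attributes extracted per window

# Window start times within one second of audio: 0.0, 0.125, ..., 0.875
starts = [i * HOP for i in range(int(1.0 / HOP))]
n_windows = len(starts)            # 8 windows/sec
n_attributes = n_windows * N_MFCC  # 104 attributes per 1-second clip
```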
These are numbered in sequence starting from MFCC_1.
The original study deals with Class 2 (the actual environment, 8 classes), but we have included Class 1 also. Class 1 is a much easier binary classification problem of "Outdoors" vs. "Indoors".
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Here are the 3 files required to take part in the Polar Expedition 1913 challenge (https://labyrinthinesecurity.github.io/aurora_1913/index.html):