https://creativecommons.org/publicdomain/zero/1.0/
Problem Statement You are working as a data scientist in a global finance company. Over the years, the company has collected basic bank details and gathered a lot of credit-related information. The management wants to build an intelligent system to segregate the people into credit score brackets to reduce the manual efforts.
Task Given a person’s credit-related information, build a machine learning model that can classify the credit score.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
The Yahoo! Answers topic classification dataset is constructed using 10 largest main categories. Each class contains 140,000 training samples and 6,000 testing samples. Therefore, the total number of training samples is 1,400,000 and testing samples 60,000 in this dataset. From all the answers and other meta-information, we only used the best answer content and the main category information.
The file classes.txt contains a list of classes corresponding to each label.
The files train.csv and test.csv contain all the training samples as comma-separated values. There are 4 columns in them, corresponding to class index (1 to 10), question title, question content and best answer. The text fields are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
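A minimal Python sketch of reading this format: the double-quote escaping described above is standard CSV quoting, which the stdlib csv module handles natively, so only the backslash-n newline escape needs undoing. The sample row below is made up; in practice you would pass an open handle to train.csv or test.csv.

```python
import csv
import io

def read_samples(lines):
    """Parse rows of (class_index, title, content, best_answer).

    The csv module handles the doubled-quote escaping; the literal
    backslash-n newline escape is undone manually.
    """
    rows = []
    for class_idx, title, content, answer in csv.reader(lines):
        rows.append((int(class_idx), title, content,
                     answer.replace("\\n", "\n")))
    return rows

# Hypothetical row in the documented 4-column format:
sample = '"5","Why is the sky blue?","","Rayleigh scattering.\\nShort answer."\n'
rows = read_samples(io.StringIO(sample))
```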
https://creativecommons.org/publicdomain/zero/1.0/
The original dataset consists of 2225 documents (as text files) from the BBC News website, corresponding to stories in five topical areas from 2004-2005. The files are segregated into 5 folders:
As part of data wrangling, the original dataset is pre-processed in three stages:
Note: Each stage persists and improves the data from the previous stage into a new CSV file.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is the CSV, cleaned version of this dataset: link_to_the_original_Data. You can use this data to train your NLP skills.
This dataset is based on the 300,000 training images from the Alaska2 Competition.
train: 225,000 observations labeled from 0 to 3. valid: 75,000 observations labeled from 0 to 3.
https://i.imgur.com/x6dsHc1.png
Paths are based on the image folders in Alaska2 Competition on Kaggle.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This is a dataset of movie reviews to be used for the NLP task of sentiment analysis. It is in the form of sentences, where every sentence is given a sentiment score from 0 to 4 (0 = Very Bad, 1 = Bad, 2 = Neutral, 3 = Good, 4 = Very Good).
https://creativecommons.org/publicdomain/zero/1.0/
By Philipp Schmid (from Hugging Face) [source]
The dataset is provided in two separate files: train.csv and test.csv. The train.csv file contains a substantial amount of labeled data with columns for the text data itself, as well as their corresponding binary and multi-class labels. This enables users to develop and train machine learning models effectively using this dataset.
Similarly, test.csv includes additional examples for evaluating pre-trained models or assessing model performance after training on train.csv. It follows a similar structure as train.csv with columns representing text data, binary labels, and multi-class labels.
Its rich content, its extensive labeling scheme for binary and multi-class classification tasks, and its ease of use thanks to the tabular CSV format make this dataset an excellent choice for anyone looking to advance their NLP capabilities through diverse text classification challenges.
How to Use this Dataset for Text Classification
This guide will provide you with useful information on how to effectively utilize this dataset for your text classification projects.
Understanding the Columns
The dataset consists of several columns, each serving a specific purpose:
text: This column contains the actual text data that needs to be classified. It is the primary feature for your modeling task.
binary: This column represents the binary classification label associated with each text entry. The label indicates whether the text belongs to one class or another. For example, it could be used to classify emails as either spam or not spam.
multi: This column represents the multi-class classification label associated with each text entry. The label indicates which class or category the text belongs to out of multiple possible classes. For instance, it can be used to categorize news articles into topics like sports, politics, entertainment, etc.
Dataset Files
The dataset is provided in two files:
train.csv and test.csv.
train.csv: This file contains a subset of labeled data specifically intended for training your models. It includes columns for both text data and their corresponding binary and multi-class labels.
test.csv: In order to evaluate your trained models' performance on unseen data, this file provides additional examples similar in structure and format to train.csv. It includes columns for both texts and their respective binary and multi-class labels as well.
Getting Started
To make use of this dataset effectively, here are some steps you can follow:
- Download both train.csv and test.csv files containing labeled examples.
- Load these datasets into your preferred machine learning environment (such as Python with libraries like Pandas or Scikit-learn).
- Explore the dataset by examining its structure, summary statistics, and visualizations.
- Preprocess the text data as needed, which may include techniques like tokenization, removing stop words, stemming/lemmatizing, and encoding text into numerical representations (such as bag-of-words or TF-IDF vectors).
- Consider splitting the train.csv data further into training and validation sets for model development and evaluation.
- Select appropriate machine learning algorithms for your text classification task (e.g., Naive Bayes, Logistic Regression, Support Vector Machines) and train them.
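The steps above can be sketched as follows, using TF-IDF features with Logistic Regression (one of the suggested algorithms). The tiny in-memory sample is a stand-in for the text and binary columns of train.csv.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-ins for the "text" and "binary" columns of train.csv.
texts = ["free prize click now", "win money fast",
         "meeting at noon", "lunch tomorrow?"]
binary = [1, 1, 0, 0]  # e.g. spam vs. not spam

# TF-IDF encoding followed by a linear classifier, as outlined above.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, binary)

pred = clf.predict(["claim your free money"])[0]
```

The same pipeline works unchanged for the multi column, since LogisticRegression handles multi-class targets.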
- Sentiment Analysis: The dataset can be used to classify text data into positive or negative sentiment, based on the binary classification label. This can be helpful in analyzing customer reviews, social media sentiment, and feedback analysis.
- Topic Categorization: The multi-class classification label can be used to categorize text into different topics or themes. This can be useful in organizing large amounts of text data, such as news articles or research papers.
- Spam Detection: The binary classification label can be used to identify whether a text message or email is spam or not. This can help users filter out unwanted messages and improve their overall communication experience.
Overall, this dataset provides an opportunity to create models for various applications of text classification such as sentiment analysis, topic categorization, and spam detection.
If you use this dataset in your research, please credit the original authors. [Data Source](https://huggingface.co/datase...
YouTube (the world-famous video sharing website) maintains a list of the top trending videos on the platform. According to Variety magazine, "To determine the year's top-trending videos, YouTube uses a combination of factors including measuring user interactions (number of views, shares, comments and likes). Note that they're not the most-viewed videos overall for the calendar year". Top performers on the YouTube trending list are music videos (such as the famously viral "Gangnam Style"), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well-known for.
This dataset contains video Title, Videourl, Category, and Description. Task: predict the Category using the Description or title of the video.
This dataset contains one file named Youtube Video Dataset.csv.
In this file there are 4 columns: Title, Videourl, Category, Description.
Title - Title or name of the video
Videourl - Unique video ID or URL
Category - Category of the video
Description - Description of the video
Possible uses for this dataset could include:
• Predicting the Category of a video from its Description or title
• Data visualization
For further inspiration, see the kernels on this dataset!
https://creativecommons.org/publicdomain/zero/1.0/
For the original dataset, see the link below:
+ original: https://www.kaggle.com/trolukovich/apparel-images-dataset
I added a csv file containing colors and labels. See data.
e.g. black_dress --> [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
Also, the image column of the csv file contains the full path where the image exists.
The dataset consists of 11,385 images and includes the following categories:
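A sketch of how a color_category folder name can be turned into the 11-element multi-hot vector shown above. The exact color and category lists are assumptions, chosen to be consistent with the black_dress example (6 colors followed by 5 garment categories, alphabetically ordered):

```python
# Assumed label order: 6 colors first, then 5 garment categories.
COLORS = ["black", "blue", "brown", "green", "red", "white"]
CATEGORIES = ["dress", "pants", "shirt", "shoes", "shorts"]

def encode(name):
    """Map e.g. 'black_dress' to its multi-hot label vector."""
    color, category = name.split("_")
    vec = [0] * (len(COLORS) + len(CATEGORIES))
    vec[COLORS.index(color)] = 1
    vec[len(COLORS) + CATEGORIES.index(category)] = 1
    return vec

label = encode("black_dress")  # [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
```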
AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
The AG's news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.
The file classes.txt contains a list of classes corresponding to each label.
The files train.csv and test.csv contain all the training samples as comma-separated values. There are 3 columns in them, corresponding to class index (1 to 4), title and description. The title and description are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
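A sketch of joining each row's 1-based class index with the label names from classes.txt. The four names below are the standard AG News classes; on disk they appear one per line in classes.txt, and the example row is made up:

```python
import csv
import io

# Contents of classes.txt, in label order (index 1 to 4).
classes = ["World", "Sports", "Business", "Sci/Tech"]

# Made-up row in the documented 3-column format.
row = '"3","Oil prices climb","Crude futures rose again today."\n'
for class_idx, title, description in csv.reader(io.StringIO(row)):
    label = classes[int(class_idx) - 1]  # class indices are 1-based
```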
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a large-scale collection of 241,000+ English-language comments sourced from various online platforms. Each comment is annotated with a sentiment label:
The data has been gathered from multiple websites such as:
Hugging Face: https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
Kaggle : https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset
https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
The goal is to enable training and evaluation of multi-class sentiment analysis models for real-world text data. The dataset is already preprocessed — lowercase, cleaned from punctuation, URLs, numbers, and stopwords — and is ready for NLP pipelines.
| Column | Description |
|---|---|
| Comment | User-generated text content |
| Sentiment | Sentiment label (0 = Negative, 1 = Neutral, 2 = Positive) |
Comment: "apple pay is so convenient secure and easy to use"
Sentiment: 2 (Positive)
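A small sketch of tallying the three classes while streaming the file. The column names follow the table above; the inline rows (including the example comment) are stand-ins for the real CSV:

```python
import csv
import io
from collections import Counter

LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}

# Made-up stand-in for the real CSV file.
data = io.StringIO(
    "Comment,Sentiment\n"
    "apple pay is so convenient secure and easy to use,2\n"
    "the update broke everything,0\n"
    "it arrived on time,1\n"
)
counts = Counter(LABELS[int(row["Sentiment"])] for row in csv.DictReader(data))
```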
The dataset has been built using the publicly available news data from the Hiru News website, which is a reputable news source in Sri Lanka.
Please cite the AdaptText research paper.
Format: CSV - Single File
The lack of proper Sinhala multiclass datasets inspired me to contribute a new dataset for the research community.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is tailored for multi-class prediction of cirrhosis outcomes, containing meticulously curated training and testing sets. The training set comprises a diverse array of patient data with associated cirrhosis outcomes, while the test set is prepared for model evaluation. Participants are challenged to predict outcomes for unseen data and submit their predictions in CSV format following the specified submission guidelines. Dive into this comprehensive dataset to advance predictive modeling in cirrhosis research.
This is Arabic news data with 9 categories, in CSV format.
original data link: https://www.kaggle.com/antcorpus/antcorpus
This dataset was specifically created for the Shopee Code League 2020 Product Detection competition. This competition lasted for 2 weeks and required all the teams and participants to come up with an image classification model. The purpose of creating this dataset is to resize the original dataset provided into 299x299 images to fit within the Kaggle Kernel limitations. The number of images is the same as the number of rows provided in train.csv and test.csv.
Please refer: https://www.kaggle.com/c/shopee-product-detection-open/overview
This dataset consists of 1 folder and 2 CSV files: an images folder, train.csv and test.csv.
We would like to thank Shopee for hosting a series of great competitions and giving chances for us to work with real world problems.
Train.csv - 6364 rows x 4 columns (includes Sentiment column as target)
Test.csv - 2728 rows x 3 columns
Sample Submission.csv
Text_ID - Unique identifier
Product_Description - Description of the product review by a user
Product_Type - Different types of product (9 unique products)
Class - Represents various sentiments: 0 - Cannot Say, 1 - Negative, 2 - Positive, 3 - No Sentiment
https://creativecommons.org/publicdomain/zero/1.0/
In India, every year lakhs of students sit for competitive examinations like JEE Advanced, JEE Mains, NEET, etc. These exams are said to be the gateway to admission into India's premier institutes such as IITs, NITs, AIIMS, etc. Since the competition is tough, with lakhs of students appearing for these examinations, there has been enormous development in the Ed Tech industry in India, helping fulfil the dreams of lakhs of aspirants by providing online as well as offline coaching, mentoring, etc. This particular dataset consists of questions/doubts raised by students preparing for such examinations.
The dataset contains 3 CSV files. All of them have the same columns, as there is no competition split; the dataset is split randomly across these 3 CSV files. Inside each CSV file, we have four columns:
q_id: Question ID, unique for every question
eng: The full question or description of the question
class: The class/grade in the Indian education system the question belongs to
chapter: Target classes
So it's basically an NLP problem where we have the question description and we need to find out which chapter this question belongs to. Note: More updates might be added in future versions.
Welcome to another exciting weekend hackathon to flex your machine learning classification skills by classifying various flowers into 8 different classes. To recognize the right flower you will be using 6 different attributes to classify them into the right set of classes (0-7). Using computer vision for such recognition has reached the state of the art, but collecting image data needs lots of human labor to annotate the images with labels/bounding boxes for detection/segmentation tasks. Hence, some generic attributes which can be collected easily from various Areas/Localities/Regions were captured for various species of flowers.
In this hackathon, we are challenging the MachineHack community to use classical machine learning classification techniques to come up with a model that can generalize well on unseen data, given explanatory attributes about the flower species instead of a picture.
In this competition, you will be learning advanced classification techniques, handling higher cardinality categorical variables, and much more.
Dataset Description:
Train.csv - 12666 rows x 7 columns (includes Class as target column)
Test.csv - 29555 rows x 6 columns
Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.
Attributes Description:
Area_Code - Generic area code the species were collected from
Locality_Code - Locality code the species were collected from
Region_Code - Region code the species were collected from
Height - Height collected from lab data
Diameter - Diameter collected from lab data
Species - Species of the flower
Class - Target Column (0-7) classes
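The attributes above can be wired into a standard tabular pipeline. The sketch below one-hot encodes the three code columns (one common way to handle the higher-cardinality categoricals mentioned earlier) and passes the lab measurements through unchanged; the tiny synthetic frame is a stand-in for Train.csv.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for Train.csv, with the documented columns.
train = pd.DataFrame({
    "Area_Code": ["A1", "A2", "A1", "A3"],
    "Locality_Code": ["L1", "L2", "L3", "L1"],
    "Region_Code": ["R1", "R1", "R2", "R2"],
    "Height": [4.2, 5.1, 3.9, 6.0],
    "Diameter": [1.1, 0.9, 1.4, 1.2],
    "Class": [0, 3, 0, 7],
})

# One-hot encode the categorical codes; keep Height/Diameter as-is.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"),
      ["Area_Code", "Locality_Code", "Region_Code"])],
    remainder="passthrough",
)
model = make_pipeline(pre, RandomForestClassifier(random_state=0))
model.fit(train.drop(columns=["Class"]), train["Class"])
preds = model.predict(train.drop(columns=["Class"]))
```

handle_unknown="ignore" matters here: Test.csv may contain area/locality/region codes never seen in training, and this setting encodes them as all-zeros instead of raising an error.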
This dataset is made up of images from 8 different environments. 37 video sources have been processed: every 1 second an image is extracted (a frame at 0.5s, 1.5s, 2.5s, and so on) and, to accompany that image, MFCC audio statistics are extracted from the relevant second of video.
In this dataset, you will notice some common errors from single classifiers. For example, in the video of London, the image classifier confuses the environment with "FOREST" when a lady walks past with flowing hair. Likewise, the audio classifier is misled into "RIVER" when we walk past a large fountain in Las Vegas, due to the sounds of flowing water. Both of these errors can be fixed by a multi-modal approach, where fusion allows for the correction of errors: in our study, both of these cases were correctly classified as "CITY", since multimodality can compensate for single-modal errors caused by anomalous data.
Look and Listen: A Multi-Modal Late Fusion Approach to Scene Classification for Autonomous Machines Jordan J. Bird, Diego R. Faria, Cristiano Premebida, Aniko Ekart, and George Vogiatzis
In this challenge, we can learn environments ("Where am I?") from either images, audio, or take a multimodal approach to fuse the data.
Multi-modal fusion often requires far fewer computing resources than temporal models, but sometimes at the cost of classification ability. Can a method of fusion overcome this? Let's find out!
Class data are given as strings in dataset.csv.
Each row of the dataset contains a path to the image, as well as the MFCC data extracted from the second of video that accompanies the frame.
(Copied and pasted from the paper:) we extract the Mel-Frequency Cepstral Coefficients (MFCC) of the audio clips through a set of sliding windows 0.25s in length (i.e., a frame size of 4K sampling points) and an additional set of overlapping windows, thus producing 8 sliding windows, 8 frames/sec. From each audio frame, we extract 13 MFCC attributes, producing 104 attributes per 1-second clip.
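The window arithmetic in the excerpt can be sanity-checked in a few lines: 0.25 s windows plus a second, 50%-offset set of windows amount to one window start every 0.125 s, i.e. 8 windows per second and 8 x 13 = 104 attributes per 1-second clip.

```python
WINDOW = 0.25   # window length in seconds
HOP = 0.125     # two interleaved window sets = one start every 0.125 s
N_MFCC = 13     # MFCC attributes extracted per window

# Window start times within one second of audio: 0.0, 0.125, ..., 0.875
starts = [i * HOP for i in range(int(1.0 / HOP))]
n_windows = len(starts)            # 8 windows/sec
n_attributes = n_windows * N_MFCC  # 104 attributes per 1-second clip
```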
These are numbered in sequence starting from MFCC_1.
The original study deals with Class 2 (the actual environment, 8 classes), but we have included Class 1 also. Class 1 is a much easier binary classification problem of "Outdoors" vs. "Indoors".
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Here are the 3 files required to take part in the Polar Expedition 1913 challenge (https://labyrinthinesecurity.github.io/aurora_1913/index.html):