100+ datasets found
  1. Multi-Class Classification Problem

    • kaggle.com
    zip
    Updated Apr 14, 2023
    Cite
    Sudhanshu Rastogi (2023). Multi-Class Classification Problem [Dataset]. https://www.kaggle.com/datasets/sudhanshu2198/processed-data-credit-score
    Explore at:
    Available download formats: zip (4317766 bytes)
    Dataset updated
    Apr 14, 2023
    Authors
    Sudhanshu Rastogi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Problem Statement: You are working as a data scientist in a global finance company. Over the years, the company has collected basic bank details and a large amount of credit-related information. Management wants to build an intelligent system that sorts people into credit score brackets to reduce manual effort.

    Task: Given a person’s credit-related information, build a machine learning model that can classify the credit score.

    • Age: Represents the age of the person
    • Annual_Income: Represents the annual income of the person
    • Monthly_Inhand_Salary: Represents the monthly base salary of a person
    • Num_Bank_Accounts: Represents the number of bank accounts a person holds
    • Num_Credit_Card: Represents the number of other credit cards held by a person
    • Interest_Rate: Represents the interest rate on credit card
    • Num_of_Loan: Represents the number of loans taken from the bank
    • Delay_from_due_date: Represents the average number of days delayed from the payment date
    • Num_of_Delayed_Payment: Represents the average number of payments delayed by a person
    • Changed_Credit_Limit: Represents the percentage change in credit card limit
    • Num_Credit_Inquiries: Represents the number of credit card inquiries
    • Credit_Mix: Represents the classification of the mix of credits
    • Outstanding_Debt: Represents the remaining debt to be paid (in USD)
    • Credit_Utilization_Ratio: Represents the utilization ratio of credit card
    • Credit_History_Age: Represents the age of credit history of the person
    • Payment_of_Min_Amount: Represents whether only the minimum amount was paid by the person
    • Total_EMI_per_month: Represents the monthly EMI payments (in USD)
    • Amount_invested_monthly: Represents the monthly amount invested by the customer (in USD)
    • Monthly_Balance: Represents the monthly balance amount of the customer (in USD)
    • Credit_Score: Represents the bracket of credit score (Poor, Standard, Good)
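    As a rough illustration of preparing the target column for a multi-class model, the sketch below maps the three Credit_Score brackets listed above to integer class ids. The integer codes and the sample rows are assumptions made for the example; they are not part of the dataset.

```python
# Hypothetical label encoding for the Credit_Score target column.
# The integer codes are an illustrative choice, not defined by the dataset.
CREDIT_SCORE_TO_ID = {"Poor": 0, "Standard": 1, "Good": 2}
ID_TO_CREDIT_SCORE = {v: k for k, v in CREDIT_SCORE_TO_ID.items()}


def encode_labels(labels):
    """Map bracket names to integer class ids, rejecting unknown values."""
    encoded = []
    for label in labels:
        if label not in CREDIT_SCORE_TO_ID:
            raise ValueError(f"Unexpected Credit_Score value: {label!r}")
        encoded.append(CREDIT_SCORE_TO_ID[label])
    return encoded


sample = ["Good", "Poor", "Standard", "Good"]  # made-up rows
print(encode_labels(sample))  # [2, 0, 1, 2]
```

    Rejecting unknown values early is one way to catch misspelled or missing brackets before training.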
  2. 📊 Yahoo Answers 10 categories for NLP CSV

    • kaggle.com
    zip
    Updated Apr 7, 2023
    Cite
    Yassir Acharki (2023). 📊 Yahoo Answers 10 categories for NLP CSV [Dataset]. https://www.kaggle.com/datasets/yacharki/yahoo-answers-10-categories-for-nlp-csv
    Explore at:
    Available download formats: zip (324009471 bytes)
    Dataset updated
    Apr 7, 2023
    Authors
    Yassir Acharki
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    The Yahoo! Answers topic classification dataset is constructed using the 10 largest main categories. Each class contains 140,000 training samples and 6,000 testing samples, so the dataset has 1,400,000 training samples and 60,000 testing samples in total. From all the answers and other meta-information, only the best answer content and the main category information were used.

    The file classes.txt contains a list of classes corresponding to each label.

    The files train.csv and test.csv contain the training and testing samples as comma-separated values. There are 4 columns in them, corresponding to class index (1 to 10), question title, question content and best answer. The text fields are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
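    The escaping rules above map directly onto Python's standard csv module, which handles doubled quotes by default. The sketch below parses two made-up rows in the documented four-column layout and restores the literal backslash-n sequences to real newlines; the sample rows and the helper name read_rows are assumptions for illustration.

```python
import csv
import io

# Two made-up rows in the documented layout:
# class index, question title, question content, best answer.
raw = (
    '5,"What is a ""byte""?","Asked in class.\\nStill unsure.","Eight bits."\n'
    '2,"Why is the sky blue?","","Rayleigh scattering."\n'
)


def read_rows(handle):
    # csv.reader undoes the doubled-quote escaping; the literal
    # backslash-n sequences are converted to newlines afterwards.
    for label, title, content, answer in csv.reader(handle):
        fields = [f.replace("\\n", "\n") for f in (title, content, answer)]
        yield (int(label), *fields)


rows = list(read_rows(io.StringIO(raw)))
print(rows[0][1])  # What is a "byte"?
```

    For the real files, the same function could be given a file handle opened with newline="" as the csv documentation recommends.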

  3. BBC Full Text Preprocessed

    • kaggle.com
    zip
    Updated Feb 23, 2023
    Cite
    Dheemanth Bhat (2023). BBC Full Text Preprocessed [Dataset]. https://www.kaggle.com/datasets/dheemanthbhat/bbc-full-text-preprocessed
    Explore at:
    Available download formats: zip (3006056 bytes)
    Dataset updated
    Feb 23, 2023
    Authors
    Dheemanth Bhat
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Original Dataset

    The original dataset consists of 2,225 documents (as text files) from the BBC news website, corresponding to stories in five topical areas from 2004-2005. Files are segregated into 5 folders:

    1. business
    2. entertainment
    3. politics
    4. sport
    5. tech

    This Dataset

    As part of data wrangling, the original dataset is pre-processed in three stages:

    1. Stage 1: Extract Metadata from files that are segregated in 5 folders into a single csv.
    2. Stage 2: Clean and compress text content (remove extra spaces and newlines) in files into a single csv.
    3. Stage 3: Process English language (stop-word removal, lemmatization and NER) using spaCy.

    Note: Each stage persists and improves on the data from the previous stage in a new csv file.
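    The whitespace cleanup described in Stage 2 can be sketched with a single regular expression. This is only an illustration of the idea, not the dataset author's code; the function name and sample text are assumptions, and Stage 3 (spaCy processing) is not covered.

```python
import re


def compress_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


sample = "Ad sales boost   profit\n\nQuarterly profits rose.\n"  # made-up text
print(compress_whitespace(sample))  # Ad sales boost profit Quarterly profits rose.
```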

  4. BBC Full Text Document Classification

    • kaggle.com
    zip
    Updated Apr 4, 2024
    Cite
    Al Fath Terry (2024). BBC Full Text Document Classification [Dataset]. https://www.kaggle.com/datasets/alfathterry/bbc-full-text-document-classification
    Explore at:
    Available download formats: zip (1929885 bytes)
    Dataset updated
    Apr 4, 2024
    Authors
    Al Fath Terry
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This is a cleaned csv version of the original dataset (link_to_the_original_Data). You can use this data to practice your NLP skills.

  5. Alaska2 Train-Valid 4 Class .csv

    • kaggle.com
    zip
    Updated Jun 4, 2020
    Cite
    Andrada (2020). Alaska2 Train-Valid 4 Class .csv [Dataset]. https://www.kaggle.com/datasets/andradaolteanu/alaska2-trainvalid-4-class-csv/data
    Explore at:
    Available download formats: zip (1180671 bytes)
    Dataset updated
    Jun 4, 2020
    Authors
    Andrada
    Description

    Context

    This dataset is based on the 300,000 images for training in the Alaska2 Competition.

    Content

    train: 225,000 observations labeled from 0 to 3.
    valid: 75,000 observations labeled from 0 to 3.

    Image: https://i.imgur.com/x6dsHc1.png

    Acknowledgements

    Paths are based on the image folders in Alaska2 Competition on Kaggle.

  6. 🎭Movie Reviews Sentences for Sentiment Analysis

    • kaggle.com
    zip
    Updated Jul 25, 2022
    Cite
    Yassir Acharki (2022). 🎭Movie Reviews Sentences for Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/yacharki/movie-review-sentiment-analysis/discussion
    Explore at:
    Available download formats: zip (1812584 bytes)
    Dataset updated
    Jul 25, 2022
    Authors
    Yassir Acharki
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    This is a dataset of movie reviews to be used for the NLP task of sentiment analysis. It is in the form of sentences, where every sentence is given a sentiment score from 0 to 4 (1 = Very Bad, 2 = Bad, 3 = Neutral, 4 = Good, 5 = Very Good).

  7. Germeval18 - Text Classification Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). Germeval18 - Text Classification Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/text-classification-dataset
    Explore at:
    Available download formats: zip (538082 bytes)
    Dataset updated
    Dec 5, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Text Classification Dataset

    Text Classification Dataset with Binary and Multi-class Labels

    By Philipp Schmid (From Huggingface) [source]

    About this dataset

    The dataset is provided in two separate files: train.csv and test.csv. The train.csv file contains a substantial amount of labeled data with columns for the text data itself, as well as their corresponding binary and multi-class labels. This enables users to develop and train machine learning models effectively using this dataset.

    Similarly, test.csv includes additional examples for evaluating pre-trained models or assessing model performance after training on train.csv. It follows a similar structure as train.csv with columns representing text data, binary labels, and multi-class labels.

    Its rich content, extensive labeling scheme for binary and multi-class classification tasks, and ease of use as tabular CSV files make this dataset an excellent choice for anyone looking to advance their NLP capabilities through diverse text classification challenges.

    How to use the dataset

    How to Use this Dataset for Text Classification

    This guide will provide you with useful information on how to effectively utilize this dataset for your text classification projects.

    Understanding the Columns

    The dataset consists of several columns, each serving a specific purpose:

    • text: This column contains the actual text data that needs to be classified. It is the primary feature for your modeling task.

    • binary: This column represents the binary classification label associated with each text entry. The label indicates whether the text belongs to one class or another. For example, it could be used to classify emails as either spam or not spam.

    • multi: This column represents the multi-class classification label associated with each text entry. The label indicates which class or category the text belongs to out of multiple possible classes. For instance, it can be used to categorize news articles into topics like sports, politics, entertainment, etc.

    Dataset Files

    The dataset is provided in two files: train.csv and test.csv.

    • train.csv: This file contains a subset of labeled data specifically intended for training your models. It includes columns for both text data and their corresponding binary and multi-class labels.

    • test.csv: In order to evaluate your trained models' performance on unseen data, this file provides additional examples similar in structure and format as train.csv. It includes columns for both texts and their respective binary and multi-class labels as well.

    Getting Started

    To make use of this dataset effectively, here are some steps you can follow:

    • Download both train.csv and test.csv files containing labeled examples.
    • Load these datasets into your preferred machine learning environment (such as Python with libraries like Pandas or Scikit-learn).
    • Explore the dataset by examining its structure, summary statistics, and visualizations.
    • Preprocess the text data as needed, which may include techniques like tokenization, removing stop words, stemming/lemmatizing, and encoding text into numerical representations (such as bag-of-words or TF-IDF vectors).
    • Consider splitting the train.csv data further into training and validation sets for model development and evaluation.
    • Select appropriate machine learning algorithms for your text classification task (e.g., Naive Bayes, Logistic Regression, Support Vector Machines) and train them.
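    Two of the steps above, splitting the training data and encoding text as bag-of-words counts, can be sketched with the standard library alone. The rows, label values, and function names below are made up for illustration; a real project would more likely load train.csv with Pandas and vectorize with Scikit-learn as the steps suggest.

```python
import random
from collections import Counter


def train_validation_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def bag_of_words(text):
    """Lowercase, split on whitespace, and count token occurrences."""
    return Counter(text.lower().split())


# Made-up (text, binary, multi) rows standing in for train.csv.
rows = [(f"example text number {i}", "A" if i % 2 else "B", "X") for i in range(10)]
train, validation = train_validation_split(rows)
print(len(train), len(validation))  # 8 2
print(bag_of_words("Spam spam eggs")["spam"])  # 2
```

    Fixing the seed keeps the split reproducible across runs, which makes model comparisons easier.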

    Research Ideas

    • Sentiment Analysis: The dataset can be used to classify text data into positive or negative sentiment, based on the binary classification label. This can be helpful in analyzing customer reviews, social media sentiment, and feedback analysis.
    • Topic Categorization: The multi-class classification label can be used to categorize text into different topics or themes. This can be useful in organizing large amounts of text data, such as news articles or research papers.
    • Spam Detection: The binary classification label can be used to identify whether a text message or email is spam or not. This can help users filter out unwanted messages and improve their overall communication experience.

    Overall, this dataset provides an opportunity to create models for various applications of text classification such as sentiment analysis, topic categorization, and spam detection.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. [Data Source](https://huggingface.co/datase...

  8. Youtube-video-dataset

    • kaggle.com
    zip
    Updated Jan 24, 2020
    + more versions
    Cite
    Rahul Anand (2020). Youtube-video-dataset [Dataset]. https://www.kaggle.com/rahulanand0070/youtubevideodataset
    Explore at:
    Available download formats: zip (5033478 bytes)
    Dataset updated
    Jan 24, 2020
    Authors
    Rahul Anand
    Area covered
    YouTube
    Description

    Context

    YouTube (the world-famous video sharing website) maintains a list of the top trending videos on the platform. According to Variety magazine, “To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments and likes). Note that they’re not the most-viewed videos overall for the calendar year”. Top performers on the YouTube trending list are music videos (such as the famously viral “Gangnam Style”), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well-known for.

    This dataset contains four fields: Title, Videourl, Category, and Description. Task: predict the Category from the Description or Title of a video.

    Content

    This dataset contains one file, named Youtube Video Dataset.csv.
    The file has 4 columns: Title, Videourl, Category, Description.
    Title: title or name of the video
    Videourl: unique video ID or URL
    Category: category of the video
    Description: description of the video

    Acknowledgements

    • This dataset was collected using the YouTube API and web scraping

    Inspiration

    Possible uses for this dataset could include:
    • Predict the Category from the Description or Title of a video
    • Data visualization
    For further inspiration, see the kernels on this dataset!

  9. Apparel image dataset 2

    • kaggle.com
    zip
    Updated Jan 29, 2020
    Cite
    hwiyong joe (2020). Apparel image dataset 2 [Dataset]. https://www.kaggle.com/datasets/airplane2230/apparel-image-dataset-2
    Explore at:
    Available download formats: zip (260860979 bytes)
    Dataset updated
    Jan 29, 2020
    Authors
    hwiyong joe
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    For the originally provided dataset, see the following page.
    Original: https://www.kaggle.com/trolukovich/apparel-images-dataset

    I added a csv file containing colors and labels. See data.

    ex) black_dress --> [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

    Also, the image column of the csv file contains the full path where the image exists.
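    The black_dress example above is consistent with a multi-hot vector of six colors followed by five garment types, each in alphabetical order. That ordering is an assumption inferred from the single example and the category names; the sketch below shows how such a vector could be built under it.

```python
# Assumed ordering: six colors followed by five garment types, each alphabetical.
# This reproduces the black_dress example but is not confirmed by the source.
COLORS = ["black", "blue", "brown", "green", "red", "white"]
GARMENTS = ["dress", "pants", "shirt", "shoes", "shorts"]


def encode_category(category):
    """Turn a name such as 'black_dress' into an 11-element multi-hot list."""
    color, garment = category.split("_", 1)
    vector = [0] * (len(COLORS) + len(GARMENTS))
    vector[COLORS.index(color)] = 1
    vector[len(COLORS) + GARMENTS.index(garment)] = 1
    return vector


print(encode_category("black_dress"))  # [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
```

    Encoding color and garment separately lets a model treat the task as multi-label classification with two active labels per image.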

    Content

    The dataset consists of 11,385 images in the following categories:

    • black_dress: 450
    • black_pants: 871
    • black_shirt: 715
    • black_shoes: 766
    • black_shorts: 328
    • blue_dress: 502
    • blue_pants: 798
    • blue_shirt: 741
    • blue_shoes: 523
    • blue_shorts: 299
    • brown_pants: 311
    • brown_shoes: 464
    • brown_shorts: 40
    • green_pants: 227
    • green_shirt: 230
    • green_shoes: 455
    • green_shorts: 135
    • red_dress: 800
    • red_pants: 308
    • red_shoes: 610
    • white_dress: 818
    • white_pants: 274
    • white_shoes: 600
    • white_shorts: 120
  10. AG News Classification Dataset

    • kaggle.com
    zip
    Updated Apr 20, 2020
    + more versions
    Cite
    Aman Anand (2020). AG News Classification Dataset [Dataset]. https://www.kaggle.com/amananandrai/ag-news-classification-dataset
    Explore at:
    Available download formats: zip (11949309 bytes)
    Dataset updated
    Apr 20, 2020
    Authors
    Aman Anand
    Description

    AG's News Topic Classification Dataset

    ORIGIN

    AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

    The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

    DESCRIPTION

    The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.

    The file classes.txt contains a list of classes corresponding to each label.

    The files train.csv and test.csv contain the training and testing samples as comma-separated values. There are 3 columns in them, corresponding to class index (1 to 4), title and description. The title and description are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".

  11. Sentiment Analysis Dataset

    • kaggle.com
    zip
    Updated May 3, 2025
    Cite
    abdelmalek eladjelet (2025). Sentiment Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/abdelmalekeladjelet/sentiment-analysis-dataset
    Explore at:
    Available download formats: zip (9105036 bytes)
    Dataset updated
    May 3, 2025
    Authors
    abdelmalek eladjelet
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🧠 Multi-Class Sentiment Analysis Dataset (240K+ English Comments)

    📌 Description

    This dataset is a large-scale collection of 241,000+ English-language comments sourced from various online platforms. Each comment is annotated with a sentiment label:

    • 0 — Negative
    • 1 — Neutral
    • 2 — Positive

    The data has been gathered from multiple sources:
    • Hugging Face: https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
    • Kaggle: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset
    • Kaggle: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
    • Kaggle: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

    The goal is to enable training and evaluation of multi-class sentiment analysis models for real-world text data. The dataset is already preprocessed — lowercase, cleaned from punctuation, URLs, numbers, and stopwords — and is ready for NLP pipelines.

    📊 Columns

    Column | Description
    Comment | User-generated text content
    Sentiment | Sentiment label (0=Negative, 1=Neutral, 2=Positive)

    🚀 Use Cases

    • 🧠 Train sentiment classifiers using LSTM, BiLSTM, CNN, BERT, or RoBERTa
    • 🔍 Evaluate preprocessing and tokenization strategies
    • 📈 Benchmark NLP models on multi-class classification tasks
    • 🎓 Educational projects and research in opinion mining or text classification
    • 🧪 Fine-tune transformer models on a large and diverse sentiment dataset

    💬 Example

    Comment: "apple pay is so convenient secure and easy to use"
    Sentiment: 2 (Positive)
    
  12. Sinhala News Article Classification Dataset

    • kaggle.com
    zip
    Updated Mar 14, 2021
    Cite
    Yathindra K (2021). Sinhala News Article Classification Dataset [Dataset]. https://www.kaggle.com/yathindrak/sinhala-news-article-dataset
    Explore at:
    Available download formats: zip (716434 bytes)
    Dataset updated
    Mar 14, 2021
    Authors
    Yathindra K
    Description

    Context

    The dataset has been built using publicly available news data from the Hiru News website, which is a reputable news source in Sri Lanka.

    Please cite the AdaptText research paper.

    Content

    Format: CSV - Single File

    Inspiration

    The lack of proper Sinhala multiclass datasets inspired me to contribute a new dataset to the research community.

  13. Cirrhosis Outcomes Dataset

    • kaggle.com
    zip
    Updated Feb 9, 2024
    Cite
    Harshit Sharma (2024). Cirrhosis Outcomes Dataset [Dataset]. https://www.kaggle.com/datasets/harshitstark/prediction-of-cirrhosis-outcomes
    Explore at:
    Available download formats: zip (516311 bytes)
    Dataset updated
    Feb 9, 2024
    Authors
    Harshit Sharma
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is tailored for multi-class prediction of cirrhosis outcomes, containing meticulously curated training and testing sets. The training set comprises a diverse array of patient data with associated cirrhosis outcomes, while the test set is prepared for model evaluation. Participants are challenged to predict outcomes for unseen data and submit their predictions in CSV format following the specified submission guidelines. Dive into this comprehensive dataset to advance predictive modeling in cirrhosis research.

    Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F18544731%2F7b208395f699c097b57ae81cee1299be%2FCirrhosis%201.png?generation=1706785225894190&alt=media

  14. Arabic News Texts Corpus

    • kaggle.com
    zip
    Updated Nov 21, 2019
    Cite
    Toofan (2019). Arabic News Texts Corpus [Dataset]. https://www.kaggle.com/datasets/muhammedfathi/arabic-news-texts-corpus/suggestions?status=pending&yourSuggestions=true
    Explore at:
    Available download formats: zip (3764777 bytes)
    Dataset updated
    Nov 21, 2019
    Authors
    Toofan
    Description

    Context

    This is Arabic news data with 9 categories, in csv format.

    Original data link: https://www.kaggle.com/antcorpus/antcorpus

  15. Shopee CodeLeague 2020 Product Detection (Resized)

    • kaggle.com
    zip
    Updated Jun 20, 2020
    Cite
    YKFlash (2020). Shopee CodeLeague 2020 Product Detection (Resized) [Dataset]. https://www.kaggle.com/tanyongkeong/shopee-code-league-2020-product-detection
    Explore at:
    Available download formats: zip (3700955073 bytes)
    Dataset updated
    Jun 20, 2020
    Authors
    YKFlash
    Description

    Context

    This dataset was created specifically for the Shopee Code League 2020 Product Detection competition. The competition lasted 2 weeks and required all teams and participants to come up with an image classification model. The purpose of this dataset is to resize the original images to 299x299 so that they fit within the Kaggle Kernel limits. The number of images is the same as the number of rows in train.csv and test.csv.

    Please refer to: https://www.kaggle.com/c/shopee-product-detection-open/overview

    Content

    This dataset consists of 1 folder of images and 2 csv files: train.csv and test.csv.

    Acknowledgements

    We would like to thank Shopee for hosting a series of great competitions and giving chances for us to work with real world problems.

  16. Product_sentiment_classification

    • kaggle.com
    zip
    Updated Sep 4, 2020
    Cite
    Meghana Kankanala (2020). Product_sentiment_classification [Dataset]. https://www.kaggle.com/meghanakankanala/product-sentiment-classification
    Explore at:
    Available download formats: zip (406824 bytes)
    Dataset updated
    Sep 4, 2020
    Authors
    Meghana Kankanala
    Description

    Context

    Dataset Description:

    Train.csv - 6364 rows x 4 columns (includes sentiment column as target)
    Test.csv - 2728 rows x 3 columns
    Sample Submission.csv

    Content

    Attribute Description:

    Text_ID - Unique identifier
    Product_Description - Description of the product review by a user
    Product_Type - Different types of product (9 unique products)
    Class - Represents various sentiments
        0 - Cannot Say
        1 - Negative
        2 - Positive
        3 - No Sentiment

  17. Questions Chapter Classification

    • kaggle.com
    Updated Nov 30, 2020
    Cite
    Ultron (2020). Questions Chapter Classification [Dataset]. https://www.kaggle.com/mrutyunjaybiswal/questions-chapter-classification/tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 30, 2020
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Ultron
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    In India, every year lakhs of students sit for competitive examinations like JEE Advanced, JEE Mains, NEET, etc. These exams are said to be the gateway to admission into India's premier institutes such as the IITs, NITs, AIIMS, etc. Because the competition is tough, with lakhs of students appearing for these examinations, the Ed Tech industry in India has grown enormously, supporting the dreams of lakhs of aspirants by providing online as well as offline coaching, mentoring, etc. This particular dataset consists of questions/doubts raised by students preparing for such examinations.

    Content

    The dataset contains 3 CSV files. All of them have the same columns, since this is not a competition. The dataset is split randomly across these 3 CSV files. Each CSV file has four columns:

    • q_id: Questions id, unique for every question
    • eng: The full question or description of the questions
    • class: The class/grade in the Indian education system that the question belongs to
    • chapter: The target class

    So, it's basically an NLP problem where we have the question description and need to find out which chapter the question belongs to. Note: More updates might be added in future versions.

  18. Flower Type Prediction Machine Hack

    • kaggle.com
    zip
    Updated Aug 21, 2020
    Cite
    V.Prasanna Kumar (2020). Flower Type Prediction Machine Hack [Dataset]. https://www.kaggle.com/datasets/vpkprasanna/flower-type-prediction-machine-hack
    Explore at:
    Available download formats: zip (402827 bytes)
    Dataset updated
    Aug 21, 2020
    Authors
    V.Prasanna Kumar
    Description

    Content

    Welcome to another exciting weekend hackathon to flex your machine learning classification skills by classifying flowers into 8 different classes. You will use 6 different attributes to assign each flower to the right class (0-7). Computer vision has reached state-of-the-art performance on this kind of recognition, but collecting image data requires a lot of human labor to annotate the images with labels or bounding boxes for detection and segmentation tasks. Hence, some generic attributes that can be collected easily from various areas, localities, and regions were captured for a variety of flower species.

    In this hackathon, we are challenging the MachineHack community to use classical machine learning classification techniques to come up with a model that generalizes well to unseen data, given explanatory attributes about the flower species instead of a picture.

    In this competition, you will be learning advanced classification techniques, handling higher cardinality categorical variables, and much more.

    Dataset Description:

    Train.csv - 12666 rows x 7 columns (includes Class as target column)
    Test.csv - 29555 rows x 6 columns
    Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission.
    

    Attributes Description:

    Area_Code - Generic Area code, species were collected from
    Locality_Code - Locality code, species were collected from
    Region_Code - Region code, species were collected from
    Height - Height collected from lab data
    Diameter - Diameter collected from lab data
    Species - Species of the flower
    Class - Target Column (0-7) classes
    


  19. Scene Classification: Images and Audio

    • kaggle.com
    zip
    Updated Feb 1, 2020
    Cite
    Jordan J. Bird (2020). Scene Classification: Images and Audio [Dataset]. https://www.kaggle.com/datasets/birdy654/scene-classification-images-and-audio
    Explore at:
    Available download formats: zip (1730810662 bytes)
    Dataset updated
    Feb 1, 2020
    Authors
    Jordan J. Bird
    Description

    Do images and audio complement one another in scene classification?

    This dataset is made up of images from 8 different environments. 37 video sources have been processed: one image is extracted for every second of video (frames at 0.5s, 1.5s, 2.5s, and so on), and to accompany that image, the MFCC audio statistics are also extracted from the relevant second of video.
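
    The frame sampling described above can be sketched as follows. This is an illustration only: the function name is hypothetical, and dropping an incomplete trailing second is an assumption, since the description does not say how the end of a video is handled.

```python
def frame_timestamps(duration_seconds):
    """Return the midpoint (in seconds) of every complete one-second segment.

    Assumption: a trailing partial second yields no frame.
    """
    return [second + 0.5 for second in range(int(duration_seconds))]


# A hypothetical 3.7-second clip has three complete seconds.
timestamps = frame_timestamps(3.7)
```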

    In this dataset, you will notice some common errors from single classifiers. For example, in the video of London, the image classifier confuses the environment with "FOREST" when a lady walks past with flowing hair. Likewise, the audio classifier confuses the environment with "RIVER" when we walk past a large fountain in Las Vegas, due to the sounds of flowing water. Both of these errors can be fixed by a multi-modal approach, where fusion allows one modality to correct the other. In our study, both of these cases were classified as "CITY", since multimodality can provide a solution for single-modal errors caused by anomalous data.
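
    To illustrate how fusion can correct a single-modal error, here is a minimal late-fusion sketch that averages the class probabilities of two classifiers. It is not necessarily the fusion method used in the cited study; the function name, the weighting parameter, and all probability values are made up for illustration.

```python
def late_fusion(image_probs, audio_probs, image_weight=0.5):
    """Combine two per-class probability dicts by weighted averaging.

    Returns the winning label and the fused probabilities.
    """
    fused = {
        label: image_weight * image_probs[label]
        + (1.0 - image_weight) * audio_probs[label]
        for label in image_probs
    }
    return max(fused, key=fused.get), fused


# Hypothetical outputs: the image model alone would pick "FOREST",
# while the audio model is confident about "CITY".
image_probs = {"CITY": 0.40, "FOREST": 0.50, "RIVER": 0.10}
audio_probs = {"CITY": 0.80, "FOREST": 0.10, "RIVER": 0.10}

label, fused = late_fusion(image_probs, audio_probs)
```

    With equal weights, the confident audio prediction outweighs the image model's narrow preference for "FOREST", so the fused label is "CITY".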

    Please cite this study if you use the dataset:

    Look and Listen: A Multi-Modal Late Fusion Approach to Scene Classification for Autonomous Machines. Jordan J. Bird, Diego R. Faria, Cristiano Premebida, Aniko Ekart, and George Vogiatzis.

    Context

    In this challenge, we can learn environments ("Where am I?") from either images, audio, or take a multimodal approach to fuse the data.

    Multi-modal fusion often requires far fewer computing resources than temporal models, but sometimes at the cost of classification ability. Can a method of fusion overcome this? Let's find out!

    Content

    Class data are given as strings in dataset.csv

    Each row of the dataset contains a path to the image, as well as the MFCC data extracted from the second of video that accompanies the frame.

    MFCC Extraction

    (copied and pasted from the paper) We extract the Mel-Frequency Cepstral Coefficients (MFCC) of the audio clips through a set of sliding windows 0.25s in length (i.e., a frame size of 4K sampling points) and an additional set of overlapping windows, thus producing 8 sliding windows, 8 frames/sec. From each audio frame, we extract 13 MFCC attributes, producing 104 attributes per 1-second clip.

    These are numbered in sequence from MFCC_1
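
    The attribute count follows from 8 windows per second times 13 coefficients per window. The sketch below reproduces that arithmetic and generates the column names; the names after MFCC_1 are an assumption based on the "numbered in sequence" description and should be checked against the header of dataset.csv.

```python
WINDOWS_PER_SECOND = 8   # sliding windows per 1-second clip, from the description
MFCC_PER_WINDOW = 13     # MFCC attributes extracted per window


def mfcc_column_names(windows=WINDOWS_PER_SECOND, coeffs=MFCC_PER_WINDOW):
    """Build the assumed column names MFCC_1 .. MFCC_(windows * coeffs)."""
    return [f"MFCC_{index}" for index in range(1, windows * coeffs + 1)]


names = mfcc_column_names()
```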

    Two Classes?

    The original study deals with Class 2 (the actual environment, 8 classes), but we have also included Class 1. Class 1 is a much easier binary classification problem: "Outdoors" versus "Indoors".

  20. Aurora sightings

    • kaggle.com
    zip
    Updated Feb 4, 2025
    Cite
    labyrinthinesecurity (2025). Aurora sightings [Dataset]. https://www.kaggle.com/datasets/labyrinthinesecurity/aurora-1913
    Explore at:
    zip (94341 bytes)
    Dataset updated
    Feb 4, 2025
    Authors
    labyrinthinesecurity
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Here are the 3 files required to take part in the Polar Expedition 1913 challenge (https://labyrinthinesecurity.github.io/aurora_1913/index.html):

    1. historical_records.csv (15000 past observations)
    2. challenge.csv, predictions to make for the month of February 1913
    3. stations.json, a list of all 32 train stations along with their respective distances