100+ datasets found
  1. Kaggle Datasets Data

    • kaggle.com
    Updated Oct 5, 2018
    Trinath Reddy (2018). Kaggle Datasets Data [Dataset]. https://www.kaggle.com/datasets/trinath003/kaggle-datasets-data
    Dataset provided by
    Trinath Reddy

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically



    New datasets are uploaded to Kaggle every day. To make something different from the other datasets, I worked on it until I hit on the crazy idea that led me to create this one.

    I created a dataset about Kaggle datasets (for now, the most-voted datasets). Sounds interesting, right?

    The dataset consists of all the attributes shown on a Kaggle dataset page. I am excited to share the data. Screenshot: https://image.ibb.co/j9Ybwz/Screenshot_from_2018_10_05_19_47_35.png


    The dataset consists of 1960 rows and 15 columns. All the attributes shown on Kaggle are in the dataset.

    Column details:

    • Votes - int64
    • Image - object
    • Link - object
    • Title - object
    • Sub-title - object
    • Uploader - object
    • Updated - object
    • Version - int64
    • Tags - object
    • FileType - object
    • FileSize - object
    • License - object
    • Kernels - object
    • Discussions - float64
    • Views - object


    It was hard to create this dataset. The main motive is to share knowledge and to create tutorials from what we learned.

  2. Kaggle's All Completed Competition Dataset


    • kaggle.com
    Updated Nov 11, 2022
    (2022). --Kaggle-s-All-Completed-Competition----Dataset-- [Dataset]. https://www.kaggle.com/datasets/soumendraprasad/kaggles-all-completed-competition-dataset

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically


    If you found this dataset useful, please upvote and share your feedback.

    This dataset contains the stats of all completed competitions organized on Kaggle. It contains 15 columns.

    1. Comp_name: Name of competition
    2. comp_ Reward: Type of reward
    3. comp_link: Link to the competition
    4. teams: Number of participating teams
    5. Entries: Number of entries
    6. Competitors: Number of competitors
    7. start_date: Starting date
    8. start_month: Starting month
    9. start_year: Starting year
    10. Final_date: Ending date
    11. Final_month: Ending month
    12. Final_year: Ending year
    13. code_link: Link to one notebook for each competition
    14. Desc: Description of competition

    This dataset has been scraped from link

  3. Books Dataset

    • kaggle.com
    Updated Feb 17, 2021
    Old Monk (2021). Books Dataset [Dataset]. https://www.kaggle.com/datasets/saurabhbagchi/books-dataset
    Dataset provided by
    Old Monk

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically



    Books read by users and ratings provided by them on Amazon


    Online data for books from Amazon along with user ratings and users who bought them


    Primarily for building recommender systems. This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables for users, books and ratings. Explicit ratings are expressed on a scale from 1-10 (higher values denoting higher appreciation), and an implicit rating is expressed by 0. Source: http://www2.informatik.uni-freiburg.de/~cziegler/BX/
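
    Because a rating of 0 marks an implicit interaction rather than a low score, it usually makes sense to separate the two kinds before averaging. The following is a minimal sketch of that split; the tuple layout, the toy values and the function names are assumptions for illustration and are not taken from the dataset files.

```python
from collections import defaultdict

# Toy (user, book, rating) tuples; the values are made up for illustration.
RATINGS = [
    (1, "A", 0),
    (1, "B", 8),
    (2, "A", 10),
    (2, "C", 0),
    (3, "B", 6),
]


def split_ratings(rows):
    """Split rows into explicit (1-10) and implicit (0) ratings."""
    explicit = [r for r in rows if 1 <= r[2] <= 10]
    implicit = [r for r in rows if r[2] == 0]
    return explicit, implicit


def mean_explicit_rating(rows):
    """Average explicit rating per book, ignoring implicit zeros."""
    totals = defaultdict(lambda: [0, 0])  # book -> [sum of ratings, count]
    for _user, book, rating in rows:
        if 1 <= rating <= 10:
            totals[book][0] += rating
            totals[book][1] += 1
    return {book: s / n for book, (s, n) in totals.items()}


explicit, implicit = split_ratings(RATINGS)
means = mean_explicit_rating(RATINGS)
```

    Treating the zeros as ordinary scores would pull every average down, which is why the sketch leaves them out of the mean.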


    Can we select and recommend the top 10 books for each user based on past purchase behavior?

  4. Kaggle Data Science Survey 2017-2021

    • kaggle.com
    Updated Nov 26, 2021
    Andrada (2021). Kaggle Data Science Survey 2017-2021 [Dataset]. https://www.kaggle.com/datasets/andradaolteanu/kaggle-data-science-survey-20172021


    I have created this dataset for an easier way to analyse the progression of answers from the respondents that are participating each year in the very famous Data Science Kaggle Survey.

    The sources of the present data are:

    • 2017: https://www.kaggle.com/kaggle/kaggle-survey-2017
    • 2018: https://www.kaggle.com/kaggle/kaggle-survey-2018
    • 2019: https://www.kaggle.com/c/kaggle-survey-2019/data
    • 2020: https://www.kaggle.com/c/kaggle-survey-2020/data
    • 2021: https://www.kaggle.com/c/kaggle-survey-2021/data


    This dataset was created by manually aggregating each of the 5 tables mentioned above. The full methodology was as follows:

    • The 2021 table was taken as the reference, as it is the latest and most up to date with regard to the questions and the overall evolution of the Data Science industry.
    • Each earlier year was analysed in full, one by one in descending order, to find all questions (and answers) that matched those in 2021.
    • As we go back in time, the questions become less and less complete, so I would highly suggest analysing percentages within each year rather than absolute numbers.
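
    The advice about percentages can be sketched as follows: count answers per year, then divide by that year's total so that years with different response volumes become comparable. The (year, answer) pairs and the function name are hypothetical and do not reflect the actual layout of the CSV file.

```python
from collections import Counter, defaultdict

# Hypothetical (year, answer) pairs for a single survey question.
RESPONSES = [
    (2017, "Python"), (2017, "R"),
    (2021, "Python"), (2021, "Python"), (2021, "R"), (2021, "SQL"),
]


def answer_shares_by_year(rows):
    """Return {year: {answer: percentage of that year's responses}}."""
    counts = defaultdict(Counter)
    for year, answer in rows:
        counts[year][answer] += 1

    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {a: 100.0 * n / total for a, n in counter.items()}
    return shares


shares = answer_shares_by_year(RESPONSES)
```

    In this toy data "Python" doubles in absolute count between the two years while its share stays at 50%, which is the kind of distortion the percentages avoid.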

    The aggregation was done manually, as the question order, naming and answer types differ from one year to another. Hence the most accurate way (although not the most efficient) was to read, order and pick the questions with regard to the base table (the 2021 survey).


    This dataset contains the following:

    • kaggle_survey_2017_2021.csv: the tabular dataset containing the aggregated data from 2017 to 2021.
    • style.css: a file that serves as custom styling for my notebook on this competition.
    • images folder: all images I have used for my notebook on this competition.

    Note: Notebook can be found here.


    Thank you so much to the Kaggle Team for hosting these surveys and sharing with us all the data, so we can take the pulse of the community each year.


    The Kaggle Survey is rich in information as is, but what can you find by adding another layer of information, the year? Evolutions in time could be fascinating.

  5. Heart Attack Analysis & Prediction Dataset

    • kaggle.com
    Updated Mar 22, 2021
    Rashik Rahman (2021). Heart Attack Analysis & Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset
    Dataset provided by
    Rashik Rahman

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically


    Hone your analytical and ML skills by participating in the tasks of my other datasets, listed below.

    Data Science Job Posting on Glassdoor

    Groceries dataset for Market Basket Analysis(MBA)

    Dataset for Facial recognition using ML approach

    Covid_w/wo_Pneumonia Chest Xray

    Disney Movies 1937-2016 Gross Income

    Bollywood Movie data from 2000 to 2019

    17.7K English song data from 2008-2017

    About this dataset

    • Age : Age of the patient

    • Sex : Sex of the patient

    • exang: exercise induced angina (1 = yes; 0 = no)

    • ca: number of major vessels (0-3)

    • cp : chest pain type

      • Value 1: typical angina
      • Value 2: atypical angina
      • Value 3: non-anginal pain
      • Value 4: asymptomatic
    • trtbps : resting blood pressure (in mm Hg)

    • chol : cholesterol in mg/dl fetched via BMI sensor

    • fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

    • rest_ecg : resting electrocardiographic results

      • Value 0: normal
      • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
      • Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
    • thalach : maximum heart rate achieved

    • target : 0= less chance of heart attack 1= more chance of heart attack


  6. Resume Dataset

    • kaggle.com
    Updated Feb 23, 2021
    Gaurav Dutta (2021). Resume Dataset [Dataset]. https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset
    Dataset provided by
    Gaurav Dutta

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically


    Companies often receive thousands of resumes for each job posting and employ dedicated screening officers to screen qualified candidates.

    Hiring the right talent is a challenge for all businesses. This challenge is magnified by the high volume of applicants if the business is labour-intensive, growing, and facing high attrition rates.

    IT departments are short of growing markets. In a typical service organization, professionals with a variety of technical skills and business domain expertise are hired and assigned to projects to resolve customer issues. This task of selecting the best talent among many others is known as Resume Screening.

    Typically, large companies do not have enough time to open each CV, so they use machine learning algorithms for the Resume Screening task.

  7. Unsupervised Learning on Country Data

    • kaggle.com
    Updated Jun 17, 2020
    Rohan kokkula (2020). Unsupervised Learning on Country Data [Dataset]. https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data
    Dataset provided by
    Rohan kokkula

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically


    Clustering the Countries by using Unsupervised Learning for HELP International


    To categorise the countries using socio-economic and health factors that determine the overall development of the country.

    About organization:

    HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.

    Problem Statement:

    HELP International has been able to raise around $10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively, which means choosing the countries that are in the direst need of aid. Hence, your job as a data scientist is to categorise the countries using socio-economic and health factors that determine the overall development of a country. Then you need to suggest the countries on which the CEO should focus the most.

  8. Large Language Models: the tweets


    • kaggle.com
    Updated Dec 7, 2022
    (2022). Large-Language-Models--the-tweets [Dataset]. https://www.kaggle.com/datasets/konradb/chatgpt-the-tweets

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically


    This Kaggle dataset that contains tweets about Large Language Models is called the "Large Language Model Tweets Dataset". This dataset includes a collection of tweets that mention or discuss various aspects of large language models, such as their development, use cases, performance, ethical considerations, and impact on society.

    The dataset contains over 10,000 tweets, from various sources, including researchers, practitioners, journalists, and the general public. The tweets are in English and cover a wide range of topics related to large language models, such as natural language processing, machine learning, deep learning, artificial intelligence, and more.

    Each tweet in the dataset includes information such as the tweet ID, timestamp, user ID, user name, tweet text, and other metadata.

    This dataset can be useful for researchers and practitioners who are interested in studying large language models from a social media perspective. It can also be used for sentiment analysis, topic modeling, and other text analytics tasks related to large language models.

    Note from KB: the description above was generated with ChatGPT itself.

    Note from KB2: Please leave an upvote if you download :-)

  9. Credit Card Approval Prediction

    • kaggle.com
    Updated Mar 24, 2020
    Seanny (2020). Credit Card Approval Prediction [Dataset]. https://www.kaggle.com/datasets/rikdifos/credit-card-approval-prediction
    zip (5578875 bytes)

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically


    A Credit Card Dataset for Machine Learning!

    Don't ask me where this data comes from, the answer is I don't know!


    Credit score cards are a common risk control method in the financial industry. They use personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings. The bank can then decide whether to issue a credit card to the applicant. Credit scores can objectively quantify the magnitude of risk.

    Generally speaking, credit score cards are based on historical data. When large economic fluctuations occur, past models may lose their original predictive power. The logistic model is a common method for credit scoring because it is suitable for binary classification tasks and can calculate a coefficient for each feature. To make the score card easier to understand and operate, the logistic regression coefficients are multiplied by a certain value (such as 100) and rounded.

    At present, with the development of machine learning algorithms, more predictive methods such as Boosting, Random Forest, and Support Vector Machines have been introduced into credit card scoring. However, these methods often lack transparency, and it may be difficult to provide customers and regulators with a reason for rejection or acceptance.


    Build a machine learning model to predict whether an applicant is a 'good' or 'bad' client. Unlike in other tasks, the definition of 'good' or 'bad' is not given. You should use some technique, such as vintage analysis, to construct your label. Also, unbalanced data is a big problem in this task.

    Content & Explanation

    There are two tables that can be merged by ID:

    Feature name | Explanation | Remarks
    ID | Client number |
    FLAG_OWN_CAR | Is there a car |
    FLAG_OWN_REALTY | Is there a property |
    CNT_CHILDREN | Number of children |
    AMT_INCOME_TOTAL | Annual income |
    NAME_INCOME_TYPE | Income category |
    NAME_EDUCATION_TYPE | Education level |
    NAME_FAMILY_STATUS | Marital status |
    NAME_HOUSING_TYPE | Way of living |
    DAYS_BIRTH | Birthday | Count backwards from current day (0), -1 means yesterday
    DAYS_EMPLOYED | Start date of employment | Count backwards from current day (0). If positive, it means the person is currently unemployed.
    FLAG_MOBIL | Is there a mobile phone |
    FLAG_WORK_PHONE | Is there a work phone |
    FLAG_PHONE | Is there a phone |
    FLAG_EMAIL | Is there an email |
    CNT_FAM_MEMBERS | Family size |

    Feature name | Explanation | Remarks
    ID | Client number |
    MONTHS_BALANCE | Record month | The month of the extracted data is the starting point, backwards, 0 is the current month, -1 is the previous month, and so on
    STATUS | Status | 0: 1-29 days past due; 1: 30-59 days past due; 2: 60-89 days overdue; 3: 90-119 days overdue; 4: 120-149 days overdue; 5: Overdue or bad debts, write-offs for more than 150 days; C: paid off that month; X: No loan for the month
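
    Since the task leaves the definition of 'good' and 'bad' open, the label has to be derived from the STATUS history and then joined to the application records by ID. The sketch below uses a deliberately simple rule that is only an assumption: a client is 'bad' if any month shows STATUS 2 or worse (60+ days overdue). It is not vintage analysis and not the dataset author's definition; the threshold, function names and toy rows are all hypothetical.

```python
# STATUS codes that carry overdue days; "C" and "X" do not.
OVERDUE_CODES = {"0", "1", "2", "3", "4", "5"}

# Hypothetical threshold: STATUS "2" or worse (60+ days overdue) marks a client as bad.
BAD_THRESHOLD = 2


def label_clients(credit_rows, bad_threshold=BAD_THRESHOLD):
    """Map client ID -> 1 (bad) or 0 (good) from (ID, MONTHS_BALANCE, STATUS) rows."""
    labels = {}
    for client_id, _month, status in credit_rows:
        is_bad = status in OVERDUE_CODES and int(status) >= bad_threshold
        # A single bad month is enough to flag the client.
        labels[client_id] = max(labels.get(client_id, 0), int(is_bad))
    return labels


def merge_by_id(application_rows, labels):
    """Inner-join application records (dicts with an "ID" key) with the labels."""
    return [
        {**row, "label": labels[row["ID"]]}
        for row in application_rows
        if row["ID"] in labels
    ]


# Toy rows, made up for illustration.
credit = [
    (1, 0, "C"), (1, -1, "0"),
    (2, 0, "X"), (2, -1, "3"),
    (3, 0, "1"),
]
applications = [
    {"ID": 1, "CNT_CHILDREN": 0},
    {"ID": 2, "CNT_CHILDREN": 2},
    {"ID": 4, "CNT_CHILDREN": 1},
]

labels = label_clients(credit)
merged = merge_by_id(applications, labels)
```

    Applicants without any credit history (ID 4 here) drop out of the inner join, and changing the threshold shifts the class balance, so both choices deserve scrutiny on the real data.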

    Related data: Credit Card Fraud Detection
    Related competition: Home Credit Default Risk

  10. Real estate price prediction

    • kaggle.com
    • airtryai.uk
    Updated Dec 8, 2018
    Algor_Bruce (2018). Real estate price prediction [Dataset]. https://www.kaggle.com/datasets/quantbruce/real-estate-price-prediction
    zip (7143 bytes)


    This dataset was created by Algor_Bruce

    Released under Other (specified in description)


  11. Graphs Dataset

    • kaggle.com
    Updated Sep 9, 2020
    SunEdition (2020). Graphs Dataset [Dataset]. https://www.kaggle.com/datasets/sunedition/graphs-dataset

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically


    Way to Use this Dataset

    Please refer to this notebook.

    About this Dataset

    This dataset contains 15875 samples of images of graphs divided into 8 classes.

    • 0 - just image
    • 1 - bar chart
    • 2 - diagram
    • 3 - flow chart
    • 4 - graph
    • 5 - growth chart
    • 6 - pie chart
    • 7 - table


    Splash banner

    Banner and icon by NCOA

  12. Mobile Price Classification

    • kaggle.com
    Updated Jan 28, 2018
    Abhishek Sharma (2018). Mobile Price Classification [Dataset]. https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification
    zip (72340 bytes)
    Abhishek Sharma


    Bob has started his own mobile company. He wants to give a tough fight to big companies like Apple, Samsung, etc.

    He does not know how to estimate the price of the mobiles his company creates. In this competitive mobile phone market you cannot simply assume things. To solve this problem he collects sales data of mobile phones from various companies.

    Bob wants to find some relation between the features of a mobile phone (e.g. RAM, internal memory) and its selling price. But he is not so good at machine learning, so he needs your help to solve this problem.

    In this problem you do not have to predict the actual price but a price range indicating how high the price is.

  13. Network Intrusion Detection

    • kaggle.com
    Updated Oct 9, 2018
    Sampada Bhosale (2018). Network Intrusion Detection [Dataset]. https://www.kaggle.com/datasets/sampadab17/network-intrusion-detection
    zip (838086 bytes)
    Sampada Bhosale

    Background

    The dataset provided for auditing consists of a wide variety of intrusions simulated in a military network environment. It created an environment to acquire raw TCP/IP dump data for a network by simulating a typical US Air Force LAN. The LAN was focused like a real environment and blasted with multiple attacks. A connection is a sequence of TCP packets starting and ending at some time duration between which data flows to and from a source IP address to a target IP address under some well-defined protocol. Also, each connection is labelled as either normal or as an attack with exactly one specific attack type. Each connection record consists of about 100 bytes. For each TCP/IP connection, 41 quantitative and qualitative features are obtained from normal and attack data (3 qualitative and 38 quantitative features).

    The class variable has two categories:

    • Normal
    • Anomalous

  14. Face Recognition Dataset

    • kaggle.com
    Updated Nov 6, 2020
    Vasuki Patel (2020). Face Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/vasukipatel/face-recognition-dataset
    zip (761024670 bytes)
    Vasuki Patel

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically



    This dataset was created by Vasuki Patel

    Released under CC0: Public Domain


  15. School dataset csv-file

    • kaggle.com
    Updated Jun 7, 2023
    Abhishek Bagwan☑️ (2023). School dataset csv-file [Dataset]. https://www.kaggle.com/datasets/abhishekbagwan/school-dataset
    zip (40141 bytes)
    Abhishek Bagwan☑️

    A school dataset typically contains information about educational institutions, such as schools, colleges, or universities. These datasets often include various details about the schools, their locations, academic programs, and student demographics. Here is a general description of the information you may find in a school dataset:

    1. School Name: The name of the educational institution.
    2. Location: The geographical location of the school, including the address, city, state, and zip code.
    3. Contact Information: Contact details for the school, such as phone number, email address, and website.
    4. School Type: The type of educational institution, such as elementary school, high school, college, or university.
    5. Grade Levels: The range of grades or levels offered by the school (e.g., K-12, 9-12).
    6. Enrollment: The total number of students enrolled in the school.
    7. Student Demographics: Information about the student population, including gender distribution, ethnicity, or race.
    8. Faculty Information: The number of teachers or professors employed by the school.
    9. Academic Programs: Details about the curriculum, majors, or academic offerings available at the school.
    10. Facilities: Information on facilities provided by the school, such as libraries, laboratories, sports facilities, etc.
    11. Accreditation: The accreditation status of the school, indicating whether it meets certain educational standards.
    12. Performance Metrics: Data related to academic performance, standardized test scores, graduation rates, etc.
    13. Financial Information: Details about the school's budget, funding sources, and expenses.
    14. Extracurricular Activities: Information on clubs, sports teams, or other extracurricular programs offered by the school.

    It's important to note that the specific details and fields included in a school dataset may vary depending on the source and purpose of the dataset. Different organizations or educational authorities may collect and provide different sets of information. If you have a particular school dataset in mind or specific requirements, please provide additional information, and I'll do my best to assist you further.

  16. Students' Academic Performance Dataset

    • kaggle.com
    Updated Nov 26, 2016
    Ibrahim Aljarah (2016). Students' Academic Performance Dataset [Dataset]. https://www.kaggle.com/datasets/aljarah/xAPI-Edu-Data
    zip (5675 bytes)
    Ibrahim Aljarah

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically


    Students' Academic Performance Dataset (xAPI-Edu-Data)

    Data Set Characteristics: Multivariate

    Number of Instances: 480

    Area: E-learning, Education, Predictive models, Educational Data Mining

    Attribute Characteristics: Integer/Categorical

    Number of Attributes: 16

    Date: 2016-11-8

    Associated Tasks: Classification

    Missing Values? No

    File formats: xAPI-Edu-Data.csv


    Elaf Abu Amrieh, Thair Hamtini, and Ibrahim Aljarah, The University of Jordan, Amman, Jordan, http://www.Ibrahimaljarah.com www.ju.edu.jo

    Dataset Information:

    This is an educational data set collected from a learning management system (LMS) called Kalboard 360. Kalboard 360 is a multi-agent LMS designed to facilitate learning through the use of leading-edge technology. Such a system provides users with synchronous access to educational resources from any device with an Internet connection.

    The data is collected using a learner activity tracker tool called the experience API (xAPI). The xAPI is a component of the training and learning architecture (TLA) that makes it possible to monitor learning progress and learner actions such as reading an article or watching a training video. The experience API helps learning activity providers determine the learner, activity and objects that describe a learning experience. The dataset consists of 480 student records and 16 features. The features are classified into three major categories: (1) Demographic features such as gender and nationality. (2) Academic background features such as educational stage, grade level and section. (3) Behavioral features such as raising a hand in class, opening resources, parents answering surveys, and school satisfaction.

    The dataset consists of 305 males and 175 females. The students come from different origins: 179 from Kuwait, 172 from Jordan, 28 from Palestine, 22 from Iraq, 17 from Lebanon, 12 from Tunis, 11 from Saudi Arabia, 9 from Egypt, 7 from Syria, 6 each from USA, Iran and Libya, 4 from Morocco and one from Venezuela.

    The dataset was collected over two educational semesters: 245 student records were collected during the first semester and 235 during the second semester.

    The data set also includes a school attendance feature: students are classified into two categories based on their absence days, with 191 students exceeding 7 absence days and 289 students under 7.

    This dataset also includes a new category of features: parent participation in the educational process. The parent participation feature has two sub-features: Parent Answering Survey and Parent School Satisfaction. 270 of the parents answered the survey and 210 did not; 292 of the parents are satisfied with the school and 188 are not.

    (See the related papers for more details).


    1 Gender - student's gender (nominal: 'Male' or 'Female’)

    2 Nationality- student's nationality (nominal:’ Kuwait’,’ Lebanon’,’ Egypt’,’ SaudiArabia’,’ USA’,’ Jordan’,’ Venezuela’,’ Iran’,’ Tunis’,’ Morocco’,’ Syria’,’ Palestine’,’ Iraq’,’ Lybia’)

    3 Place of birth- student's Place of birth (nominal:’ Kuwait’,’ Lebanon’,’ Egypt’,’ SaudiArabia’,’ USA’,’ Jordan’,’ Venezuela’,’ Iran’,’ Tunis’,’ Morocco’,’ Syria’,’ Palestine’,’ Iraq’,’ Lybia’)

    4 Educational Stages- educational level student belongs (nominal: ‘lowerlevel’,’MiddleSchool’,’HighSchool’)

    5 Grade Levels- grade student belongs (nominal: ‘G-01’, ‘G-02’, ‘G-03’, ‘G-04’, ‘G-05’, ‘G-06’, ‘G-07’, ‘G-08’, ‘G-09’, ‘G-10’, ‘G-11’, ‘G-12 ‘)

    6 Section ID- classroom student belongs (nominal:’A’,’B’,’C’)

    7 Topic- course topic (nominal:’ English’,’ Spanish’, ‘French’,’ Arabic’,’ IT’,’ Math’,’ Chemistry’, ‘Biology’, ‘Science’,’ History’,’ Quran’,’ Geology’)

    8 Semester- school year semester (nominal:’ First’,’ Second’)

    9 Parent responsible for student (nominal:’mom’,’father’)

    10 Raised hand- how many times the student raises his/her hand in the classroom (numeric:0-100)

    11 Visited resources- how many times the student visits course content (numeric:0-100)

    12 Viewing announcements- how many times the student checks new announcements (numeric:0-100)

    13 Discussion groups- how many times the student participates in discussion groups (numeric:0-100)

    14 Parent Answering Survey- parent answered the surveys which are provided from school or not (nominal:’Yes’,’No’)

    15 Parent School Satisfaction- the Degree of parent satisfaction from school(nominal:’Yes’,’No’)

    16 Student Absence Days-the number of absence days for each student (nominal: above-7, under-7)

    The students are classified into three numerical intervals based on their total grade/mark:

    Low-Level: interval includes values from 0 to 69,

    Middle-Level: interval includes values from 70 to 89,

    High-Level: interval includes values from 90-100.
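
    The three intervals above translate directly into a small binning function. This is a sketch only: the function name is hypothetical, and sending fractional marks between 69 and 70 (or 89 and 90) to the lower level is an assumption, since the intervals are stated for whole numbers.

```python
def performance_level(total_mark):
    """Map a total grade/mark in [0, 100] to its performance level."""
    if not 0 <= total_mark <= 100:
        raise ValueError(f"mark out of range: {total_mark}")
    if total_mark < 70:   # 0 to 69
        return "Low-Level"
    if total_mark < 90:   # 70 to 89
        return "Middle-Level"
    return "High-Level"   # 90 to 100


levels = [performance_level(m) for m in (0, 69, 70, 89, 90, 100)]
```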

    Relevant Papers:

    • Amrieh, E. A., Hamtini, T., & Aljarah, I. (2016). Mining Educational Data to Predict Student’s academic Performance using Ensemble Methods. International Journal of Database Theory and Application, 9(8), 119-136.

    • Amrieh, E. A., Hamtini, T., & Aljarah, I. (2015, November). Preprocessing and analyzing educational data set using X-API for improving student's performance. In Applied Electrical Engineering and Computing Technologies (AEECT), 2015 IEEE Jordan Conference on (pp. 1-5). IEEE.

    Citation Request:

    Please include the two citations listed under Relevant Papers above if you plan to use this dataset.

  17. Harry Potter all books(preprocessed)

    • kaggle.com
    Updated Oct 27, 2022
    Mateusz Kudła (2022). Harry Potter all books(preprocessed) [Dataset]. https://www.kaggle.com/datasets/moxxis/harry-potter-lstm
    Dataset provided by
    Mateusz Kudła

    This dataset contains 2 files:

    • all Harry Potter books in txt file format.
    • all Harry Potter books in txt file format, but I left most of the special characters like [, "]. (each sentence ends with '|' for easier splitting)

    I did a little preprocessing on them:

    • removed all unnecessary special characters and left in the text only [. ! ?] characters
    • removed all newline characters (\n)
    • removed all carriage return (\r) characters
    • removed all unnecessary text like page number or book title on each page
    • added white spaces before all special characters to treat them as separate tokens
    • fixed all faulty words where:
      • special character [. ! ?] was at the end of the word
      • special character [. ! ?] was at the beginning of the word
      • special character [. ! ?] was in the middle of the word
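
    Most of the cleaning steps listed above can be approximated with a few regular expressions. This is a rough sketch, not the author's actual script: the function name is hypothetical, newlines are replaced with spaces rather than deleted outright (an assumption, to avoid gluing words together), the set of characters kept is limited to ASCII letters, digits, whitespace and [. ! ?], and removal of page numbers and book titles is not covered.

```python
import re


def preprocess(text):
    """Keep only . ! ? as punctuation and make each one a separate token."""
    text = text.replace("\r", "")                    # drop carriage returns
    text = text.replace("\n", " ")                   # newlines become spaces (assumption)
    text = re.sub(r"[^A-Za-z0-9\s.!?]", "", text)    # strip other special characters
    text = re.sub(r"\s*([.!?])", r" \1", text)       # exactly one space before . ! ?
    text = re.sub(r"([.!?])(?=\S)", r"\1 ", text)    # split punctuation glued to the next word
    return re.sub(r"\s+", " ", text).strip()         # collapse repeated whitespace


cleaned = preprocess('Harry ran.\r\n"Stop!" she said?')
# cleaned == 'Harry ran . Stop ! she said ?'
```

    Because punctuation ends up surrounded by spaces, a plain `cleaned.split()` yields word and punctuation tokens directly.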

  18. Data on Bike Buyers by using MS EXCEL

    • kaggle.com
    Updated Mar 25, 2022
    Umasri (2022). Data on Bike Buyers by using MS EXCEL [Dataset]. https://www.kaggle.com/datasets/unica02/data-on-bike-buyers-by-using-ms-excel

    The dataset includes customer id,Martial Status,Gender,Income,Children,Education,Occupation,Home Owner,Cars,Commute Distance,Region,Age,Purchased Bike. Blog

  19. DAIGT V2 Train Dataset


    • kaggle.com
    Updated Nov 15, 2023
    (2023). DAIGT-V2-Train-Dataset [Dataset]. https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset

    Please use version 2 (there were some issues with v1 that I fixed)!

    New release of DAIGT train dataset! Improvement: - new models: Cohere Command, Google Palm, GPT4 (from Radek!) - new prompts, including source texts from the original essays! - mapping of essay text to original prompt from persuade corpus - filtering by the famous "RDizzl3_seven"

    persuade_corpus                     25996
    chat_gpt_moth                        2421
    llama2_chat                          2421
    mistral7binstruct_v2                 2421
    mistral7binstruct_v1                 2421
    original_moth                        2421
    train_essays                         1378
    llama_70b_v1                         1172
    falcon_180b_v1                       1055
    darragh_claude_v7                    1000
    darragh_claude_v6                    1000
    radek_500                             500
    NousResearch/Llama-2-7b-chat-hf       400
    mistralai/Mistral-7B-Instruct-v0.1    400
    cohere-command                        350
    palm-text-bison1                      349
    radekgpt4                             200
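
A per-source tally like the table above, with the optional "RDizzl3_seven" filter, could be produced along these lines. This is only a sketch under assumptions: the column names `text`, `source` and `RDizzl3_seven`, the `"True"`/`"False"` encoding of the flag, and the sample rows are all hypothetical stand-ins, since the listing does not describe the file layout.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the real CSV; the column names and
# the "True"/"False" flag encoding are assumptions, not taken from the listing.
SAMPLE_CSV = """text,source,RDizzl3_seven
essay one,persuade_corpus,True
essay two,persuade_corpus,False
essay three,llama2_chat,True
essay four,radek_500,False
"""

def count_sources(csv_file, only_rdizzl3_seven=False):
    """Tally rows per source, optionally keeping only rows flagged RDizzl3_seven."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if only_rdizzl3_seven and row["RDizzl3_seven"] != "True":
            continue
        counts[row["source"]] += 1
    return counts

all_counts = count_sources(io.StringIO(SAMPLE_CSV))
filtered_counts = count_sources(io.StringIO(SAMPLE_CSV), only_rdizzl3_seven=True)

# most_common() orders sources from largest to smallest count.
for source, n in all_counts.most_common():
    print(f"{source:<20}{n:>6}")
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle.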

    Sources (please upvote the original datasets!):
    - Text generated with ChatGPT by MOTH (https://www.kaggle.com/datasets/alejopaullier/daigt-external-dataset)
    - Persuade corpus contributed by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/persaude-corpus-2/)
    - Text generated with Llama-70b and Falcon180b by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b)
    - Text generated with ChatGPT and GPT4 by Radek (https://www.kaggle.com/datasets/radek1/llm-generated-essays)
    - 2000 Claude essays generated by @darraghdog (https://www.kaggle.com/datasets/darraghdog/hello-claude-1000-essays-from-anthropic)
    - LLM-generated essay using PaLM from Google Gen-AI by @kingki19 (https://www.kaggle.com/datasets/kingki19/llm-generated-essay-using-palm-from-google-gen-ai)
    - Official train essays
    - Essays I generated with various LLMs

    License: MIT for the data I generated. Check source datasets for the other sources mentioned above.

  20. 100-Sports-Image-Classification


    • kaggle.com
    Updated Mar 14, 2021
    (2021). 100-Sports-Image-Classification [Dataset]. https://www.kaggle.com/datasets/gpiosenka/sports-classification

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically



    Please upvote if you find this dataset useful. Thank you.

    This version is an update of the earlier version. I ran a dataset quality evaluation program on the previous version, which found a considerable number of duplicate and near-duplicate images. Duplicate images can lead to falsely high validation and test set accuracy, so I have eliminated them in this version of the dataset.

    Images were gathered from internet searches and scanned with a duplicate image detector program I wrote. Any duplicate images were removed to prevent bleed-through of images between the train, test and valid datasets. All images were then resized to 224 x 224 x 3 and converted to jpg format. A csv file is included that, for each image file, contains the relative path to the image file, the image file's class label, and the dataset (train, test or valid) that the image file resides in.

    This is a clean dataset. If you build a good model you should achieve at least 95% accuracy on the test set. If you build a very good model, for example using transfer learning, you should be able to achieve 98%+ test set accuracy.
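
The kind of leakage described above (the same image appearing in more than one split) can be illustrated with a simple content-hash check. This is not the author's detector, which the listing does not describe: it is a swapped-in, simpler technique that only catches byte-identical files and would miss near-duplicates. The function name and the in-memory sample data are hypothetical.

```python
import hashlib
from collections import defaultdict

def find_cross_split_duplicates(images):
    """images: iterable of (split, path, raw_bytes) tuples.

    Returns {sha256_digest: [(split, path), ...]} for byte-identical
    content that appears in more than one split.
    """
    by_digest = defaultdict(list)
    for split, path, data in images:
        digest = hashlib.sha256(data).hexdigest()
        by_digest[digest].append((split, path))
    return {
        digest: entries
        for digest, entries in by_digest.items()
        if len({split for split, _ in entries}) > 1
    }

# Hypothetical in-memory stand-ins for image files.
sample = [
    ("train", "train/golf/001.jpg", b"image-bytes-A"),
    ("test", "test/golf/007.jpg", b"image-bytes-A"),  # same bytes as the train image
    ("valid", "valid/rugby/003.jpg", b"image-bytes-B"),
]
leaks = find_cross_split_duplicates(sample)
for digest, entries in leaks.items():
    print(digest[:12], entries)
```

In practice the byte strings would come from reading each file in the train, test and valid directories.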


    Collection of sports images covering 100 different sports. Images are in 224 x 224 x 3 jpg format. Data is separated into train, test and valid directories. Additionally, a csv file is included for those who wish to use it to create their own train, test and validation datasets.


    I wanted to build a high-quality, clean dataset that was easy to use and had no bad images or duplication between the train, test and validation datasets. It provides a good dataset to test your models on. It is designed for straightforward application of Keras preprocessing functions such as ImageDataGenerator.flow_from_directory or, if you use the csv file, ImageDataGenerator.flow_from_dataframe. The dataset was carefully created so that the region of interest (ROI), in this case the sport, occupies approximately 50% of the pixels in the image. As a consequence, even models of moderate complexity should achieve training and validation accuracies in the high 90s.
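
For readers who prefer to build their own splits from the csv file, the grouping step might look like the sketch below. The column names `filepaths`, `labels` and `data set`, and the sample rows, are assumptions for illustration; the listing only says the file holds a relative path, a class label and the split name for each image.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of the csv file; the column names are assumptions.
SAMPLE = """filepaths,labels,data set
train/archery/001.jpg,archery,train
train/rugby/014.jpg,rugby,train
valid/archery/002.jpg,archery,valid
test/rugby/003.jpg,rugby,test
"""

def group_by_split(csv_file):
    """Return {split_name: [(relative_path, label), ...]} from the csv rows."""
    splits = defaultdict(list)
    for row in csv.DictReader(csv_file):
        splits[row["data set"]].append((row["filepaths"], row["labels"]))
    return dict(splits)

splits = group_by_split(io.StringIO(SAMPLE))
for name in ("train", "valid", "test"):
    print(name, len(splits.get(name, [])))
```

Each list of (path, label) pairs can then be re-partitioned or passed to whatever loader the training code uses.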

Kaggle Datasets Data

Data Manipulation & Visualisation on Kaggle Datasets

22 scholarly articles cite this dataset (View in Google Scholar)
