100+ datasets found

Kaggle Datasets Data
kaggle.com
Updated Oct 5, 2018
Cite
Trinath Reddy (2018). Kaggle Datasets Data [Dataset]. https://www.kaggle.com/datasets/trinath003/kaggle-datasets-data
Explore at:
Dataset updated
Oct 5, 2018
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Trinath Reddy
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Context

New datasets are uploaded to Kaggle every day. I wanted to make something different from the others, and eventually had a crazy idea that led me to create this dataset.

I created a dataset about Kaggle datasets themselves (for now, the most-voted ones). Sounds interesting, right?

The dataset contains all the attributes shown on a Kaggle dataset page. I am excited to share the data. Screenshot: https://image.ibb.co/j9Ybwz/Screenshot_from_2018_10_05_19_47_35.png

Content

Dataset consists of 1960 rows and 15 columns. All the attributes which are on kaggle are in the dataset.

Column details:

Votes - int64
Image - object
Link - object
Title - object
Sub-title - object
Uploader - object
Updated - object
Version - int64
Tags - object
FileType - object
FileSize - object
License - object
Kernels - object
Discussions - float64
Views - object

Acknowledgements

It was hard to create this dataset. The main aim is to share knowledge and to create tutorials from what we learned.
--Kaggle-s-All-Completed-Competition----Dataset--
kaggle.com
Updated Nov 11, 2022
Cite
(2022). --Kaggle-s-All-Completed-Competition----Dataset-- [Dataset]. https://www.kaggle.com/datasets/soumendraprasad/kaggles-all-completed-competition-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Nov 11, 2022
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
If you found this dataset useful, please upvote and share your feedback.

This dataset contains the stats of all completed competitions organized on Kaggle. It contains 15 columns.

1. Comp_name - Name of the competition
2. comp_ Reward - Type of reward
3. comp_link - Link to the competition
4. teams - Number of participating teams
5. Entries - Number of entries
6. Competitors - Number of competitors
7. start_date - Starting date
8. start_month - Starting month
9. start_year - Starting year
10. Final_date - Ending date
11. Final_month - Ending month
12. Final_year - Ending year
13. code_link - Link to one notebook for each competition
14. Desc - Description of the competition

This dataset has been scraped from link
Books Dataset
kaggle.com
Updated Feb 17, 2021
Cite
Old Monk (2021). Books Dataset [Dataset]. https://www.kaggle.com/datasets/saurabhbagchi/books-dataset
Explore at:
Dataset updated
Feb 17, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Old Monk
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Context

Books read by users and ratings provided by them on Amazon

Content

Online data for books from Amazon along with user ratings and users who bought them

Acknowledgements

Primarily for building recommender systems. This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denoting higher appreciation), and an implicit rating is expressed by 0. Source: http://www2.informatik.uni-freiburg.de/~cziegler/BX/
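
Because a rating of 0 marks an implicit interaction rather than a low score, averaging the raw rating column would mix the two kinds of signal. The sketch below shows one way to separate them before computing statistics. The rows and the `split_ratings` helper are hypothetical illustrations, not part of the dataset or of any library.

```python
from statistics import mean

# Hypothetical (user_id, isbn, rating) rows; real rows would come from the ratings table.
ratings = [
    (1, "A", 0),
    (1, "B", 8),
    (2, "A", 10),
    (2, "C", 0),
    (3, "B", 6),
]

def split_ratings(rows):
    """Separate explicit ratings (1-10) from implicit interactions (0)."""
    explicit = [r for r in rows if 1 <= r[2] <= 10]
    implicit = [r for r in rows if r[2] == 0]
    return explicit, implicit

explicit, implicit = split_ratings(ratings)
# Average over explicit ratings only, so the implicit zeros do not drag it down.
mean_explicit = mean(r[2] for r in explicit)
```

Keeping the implicit rows in a separate list means they can still be used later, for example as "has interacted with" signals.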

Inspiration

Can we select and recommend the top 10 books for each user based on past purchase behavior?
Kaggle Data Science Survey 2017-2021
kaggle.com
Updated Nov 26, 2021
Cite
Andrada (2021). Kaggle Data Science Survey 2017-2021 [Dataset]. https://www.kaggle.com/datasets/andradaolteanu/kaggle-data-science-survey-20172021
Explore at:
Dataset updated
Nov 26, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Andrada
Description
Context

I have created this dataset for an easier way to analyse the progression of answers from the respondents that are participating each year in the very famous Data Science Kaggle Survey.

The sources of the present data are:

* 2017: https://www.kaggle.com/kaggle/kaggle-survey-2017
* 2018: https://www.kaggle.com/kaggle/kaggle-survey-2018
* 2019: https://www.kaggle.com/c/kaggle-survey-2019/data
* 2020: https://www.kaggle.com/c/kaggle-survey-2020/data
* 2021: https://www.kaggle.com/c/kaggle-survey-2021/data

Methodology

This dataset was created by manually aggregating each of the 5 tables mentioned above. The full methodology was as follows:

The 2021 table was taken as the reference, as it is the latest and most up to date with regard to the questions and the overall evolution of the data science industry.

Each earlier year, in descending order, was analysed in full, one by one, to find all questions (and answers) that were the same as those in 2021.

As we go back in time, the questions become less and less complete, so I would highly suggest analysing percentages per year rather than absolute numbers.

The aggregation was done manually, as the question order, naming and types of answers differ from one year to another. Hence, the most accurate way (although not the most efficient) was to read, order and pick the questions with regard to the base table (the 2021 survey).
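
The suggestion to compare percentages per year rather than absolute numbers can be sketched as follows. The `(year, answer)` pairs and the `share_by_year` helper are made up for illustration and do not reflect the actual columns of kaggle_survey_2017_2021.csv.

```python
from collections import Counter, defaultdict

# Hypothetical (year, answer) pairs for a single survey question.
responses = [
    (2020, "Python"), (2020, "Python"), (2020, "R"), (2020, "SQL"),
    (2021, "Python"), (2021, "R"),
]

def share_by_year(rows):
    """Return {year: {answer: percentage of that year's respondents}}."""
    counts = defaultdict(Counter)
    for year, answer in rows:
        counts[year][answer] += 1
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {a: 100.0 * n / total for a, n in counter.items()}
    return shares

shares = share_by_year(responses)
```

In this toy data the absolute count for "Python" drops from 2 to 1, yet its share stays at 50% in both years, which is why normalising within each year gives a fairer comparison when the number of respondents changes.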

Content

This dataset contains the following:

kaggle_survey_2017_2021.csv: the tabular dataset containing the aggregated data from 2017 to 2021.

style.css: a file that serves as custom styling for my notebook on this competition.

images folder: all images I have used for my notebook on this competition.

Note: Notebook can be found here.

Acknowledgements

Thank you so much to the Kaggle Team for hosting these surveys and sharing with us all the data, so we can take the pulse of the community each year.

Inspiration

The Kaggle Survey is rich in information as is, but what can you find by adding another layer of information - the year? Evolution over time could be fascinating.
Heart Attack Analysis & Prediction Dataset
kaggle.com
Updated Mar 22, 2021
Cite
Rashik Rahman (2021). Heart Attack Analysis & Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset
Explore at:
Dataset updated
Mar 22, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rashik Rahman
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Hone your analytical and ML skills by participating in the tasks of my other datasets, listed below.

Data Science Job Posting on Glassdoor

Groceries dataset for Market Basket Analysis(MBA)

Dataset for Facial recognition using ML approach

Covid_w/wo_Pneumonia Chest Xray

Disney Movies 1937-2016 Gross Income

Bollywood Movie data from 2000 to 2019

17.7K English song data from 2008-2017

About this dataset

Age : Age of the patient

Sex : Sex of the patient

exang: exercise induced angina (1 = yes; 0 = no)

ca: number of major vessels (0-3)

cp : chest pain type

Value 1: typical angina

Value 2: atypical angina

Value 3: non-anginal pain

Value 4: asymptomatic

trtbps : resting blood pressure (in mm Hg)

chol : cholesterol in mg/dl fetched via BMI sensor

fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

rest_ecg : resting electrocardiographic results

Value 0: normal

Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)

Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

thalach : maximum heart rate achieved

target : 0= less chance of heart attack 1= more chance of heart attack

Resume Dataset
kaggle.com
Updated Feb 23, 2021
Cite
Gaurav Dutta (2021). Resume Dataset [Dataset]. https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset
Explore at:
Dataset updated
Feb 23, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Gaurav Dutta
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Companies often receive thousands of resumes for each job posting and employ dedicated screening officers to screen qualified candidates.

Hiring the right talent is a challenge for all businesses. This challenge is magnified by the high volume of applicants if the business is labour-intensive, growing, and facing high attrition rates.

IT departments are short of growing markets. In a typical service organization, professionals with a variety of technical skills and business domain expertise are hired and assigned to projects to resolve customer issues. This task of selecting the best talent among many others is known as Resume Screening.

Typically, large companies do not have enough time to open each CV, so they use machine learning algorithms for the Resume Screening task.
Unsupervised Learning on Country Data
kaggle.com
Updated Jun 17, 2020
Cite
Rohan kokkula (2020). Unsupervised Learning on Country Data [Dataset]. https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data
Explore at:
Dataset updated
Jun 17, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rohan kokkula
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Clustering the Countries by using Unsupervised Learning for HELP International

Objective:

To categorise the countries using socio-economic and health factors that determine the overall development of the country.

About organization:

HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.

Problem Statement:

HELP International has been able to raise around $10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively, which means choosing the countries that are in the direst need of aid. Hence, your job as a data scientist is to categorise the countries using socio-economic and health factors that determine the overall development of each country. Then you need to suggest the countries on which the CEO should focus the most.
Large-Language-Models--the-tweets
kaggle.com
Updated Dec 7, 2022
Cite
(2022). Large-Language-Models--the-tweets [Dataset]. https://www.kaggle.com/datasets/konradb/chatgpt-the-tweets
Explore at:
Dataset updated
Dec 7, 2022
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This Kaggle dataset that contains tweets about Large Language Models is called the "Large Language Model Tweets Dataset". This dataset includes a collection of tweets that mention or discuss various aspects of large language models, such as their development, use cases, performance, ethical considerations, and impact on society.

The dataset contains over 10,000 tweets, from various sources, including researchers, practitioners, journalists, and the general public. The tweets are in English and cover a wide range of topics related to large language models, such as natural language processing, machine learning, deep learning, artificial intelligence, and more.

Each tweet in the dataset includes information such as the tweet ID, timestamp, user ID, user name, tweet text, and other metadata.

This dataset can be useful for researchers and practitioners who are interested in studying large language models from a social media perspective. It can also be used for sentiment analysis, topic modeling, and other text analytics tasks related to large language models.

Note from KB: the description above was generated with ChatGPT itself.

Note from KB2: Please leave an upvote if you download :-)

Credit Card Approval Prediction

kaggle.com

zip

Updated Mar 24, 2020


Cite

Seanny (2020). Credit Card Approval Prediction [Dataset]. https://www.kaggle.com/datasets/rikdifos/credit-card-approval-prediction

Explore at:

Available download formats: zip (5578875 bytes)

Dataset updated

Mar 24, 2020

Authors

Seanny

License

CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically

Description

A Credit Card Dataset for Machine Learning!

Don't ask me where this data comes from; the answer is that I don't know!

Context

Credit score cards are a common risk control method in the financial industry. They use personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings. The bank can then decide whether to issue a credit card to the applicant. Credit scores can objectively quantify the magnitude of risk.

Generally speaking, credit score cards are based on historical data. After large economic fluctuations, past models may lose their original predictive power. The logistic model is a common method for credit scoring, because logistic regression is suitable for binary classification tasks and can calculate a coefficient for each feature. To facilitate understanding and operation, the score card multiplies each logistic regression coefficient by a certain value (such as 100) and rounds it.

At present, with the development of machine learning algorithms, more predictive methods such as Boosting, Random Forest, and Support Vector Machines have been introduced into credit card scoring. However, these methods often lack transparency, so it may be difficult to provide customers and regulators with a reason for rejection or acceptance.
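
The scaling step mentioned above (multiply each logistic regression coefficient by a value such as 100, then round) can be illustrated with a small sketch. The feature names and coefficient values here are invented for the example and do not come from a model fitted on this dataset.

```python
# Hypothetical logistic regression coefficients; names and values are illustrative only.
coefficients = {"income": -0.4213, "children": 0.1578, "owns_car": -0.0349}

def to_score_points(coefs, factor=100):
    """Scale each coefficient by `factor` and round to the nearest integer."""
    return {name: round(value * factor) for name, value in coefs.items()}

points = to_score_points(coefficients)
```

Integer points of this kind are easier to read and add up by hand than raw coefficients, which is the ease of "understanding and operation" the description refers to.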

Task

Build a machine learning model to predict whether an applicant is a 'good' or 'bad' client. Unlike in other tasks, the definition of 'good' or 'bad' is not given. You should use some technique, such as vintage analysis, to construct your label. Imbalanced data is also a big problem in this task.

Content & Explanation

There are two tables that can be merged by ID:

application_record.csv
Feature name	Explanation	Remarks
`ID`	Client number
`CODE_GENDER`	Gender
`FLAG_OWN_CAR`	Is there a car
`FLAG_OWN_REALTY`	Is there a property
`CNT_CHILDREN`	Number of children
`AMT_INCOME_TOTAL`	Annual income
`NAME_INCOME_TYPE`	Income category
`NAME_EDUCATION_TYPE`	Education level
`NAME_FAMILY_STATUS`	Marital status
`NAME_HOUSING_TYPE`	Way of living
`DAYS_BIRTH`	Birthday	Count backwards from current day (0), -1 means yesterday
`DAYS_EMPLOYED`	Start date of employment	Count backwards from current day(0). If positive, it means the person currently unemployed.
`FLAG_MOBIL`	Is there a mobile phone
`FLAG_WORK_PHONE`	Is there a work phone
`FLAG_PHONE`	Is there a phone
`FLAG_EMAIL`	Is there an email
`OCCUPATION_TYPE`	Occupation
`CNT_FAM_MEMBERS`	Family size

credit_record.csv
Feature name	Explanation	Remarks
`ID`	Client number
`MONTHS_BALANCE`	Record month	The month of the extracted data is the starting point, backwards, 0 is the current month, -1 is the previous month, and so on
`STATUS`	Status	0: 1-29 days past due; 1: 30-59 days past due; 2: 60-89 days overdue; 3: 90-119 days overdue; 4: 120-149 days overdue; 5: overdue or bad debts, write-offs for more than 150 days; C: paid off that month; X: no loan for the month
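
Since no definition of 'good' or 'bad' is given, any label is a modelling choice. The sketch below uses a simple threshold rule rather than a full vintage analysis: a client is flagged as bad if any monthly record shows a `STATUS` of 2 or higher (60 or more days overdue). That threshold, the sample rows, and the helper names are all assumptions made for illustration. It also shows the merge by `ID` as an inner join.

```python
# Hypothetical rows mimicking credit_record.csv: (ID, MONTHS_BALANCE, STATUS).
credit_rows = [
    (1, 0, "C"), (1, -1, "0"), (1, -2, "X"),
    (2, 0, "2"), (2, -1, "1"),
    (3, 0, "X"),
]
# Hypothetical rows mimicking application_record.csv: ID -> a few features.
applications = {
    1: {"CODE_GENDER": "F"},
    2: {"CODE_GENDER": "M"},
    4: {"CODE_GENDER": "F"},
}

# Assumption: 60+ days overdue in any month marks the client as 'bad'.
BAD_STATUSES = {"2", "3", "4", "5"}

def label_clients(rows):
    """Return {ID: True if any monthly status is in BAD_STATUSES}."""
    labels = {}
    for client_id, _month, status in rows:
        labels[client_id] = labels.get(client_id, False) or status in BAD_STATUSES
    return labels

def merge_by_id(apps, labels):
    """Inner join: keep only clients present in both tables."""
    return {cid: {**feats, "bad": labels[cid]}
            for cid, feats in apps.items() if cid in labels}

labels = label_clients(credit_rows)
merged = merge_by_id(applications, labels)
```

Client 3 has a credit history but no application, and client 4 has an application but no credit history, so both drop out of the merged result. With a rule like this the 'bad' class is usually small, which is where the imbalance problem mentioned in the task comes from.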

Related data: Credit Card Fraud Detection
Related competition: Home Credit Default Risk

Real estate price prediction
kaggle.com
zip
Updated Dec 8, 2018
Cite
Algor_Bruce (2018). Real estate price prediction [Dataset]. https://www.kaggle.com/datasets/quantbruce/real-estate-price-prediction
Explore at:
Available download formats: zip (7143 bytes)
Dataset updated
Dec 8, 2018
Authors
Algor_Bruce
Description
Dataset

This dataset was created by Algor_Bruce

Released under Other (specified in description)

Contents
Graphs Dataset
kaggle.com
Updated Sep 9, 2020
Cite
SunEdition (2020). Graphs Dataset [Dataset]. https://www.kaggle.com/datasets/sunedition/graphs-dataset
Explore at:
Dataset updated
Sep 9, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
SunEdition
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Way to Use this Dataset

Please refer to this notebook.

About this Dataset

This dataset contains 15875 samples of images of graphs divided into 8 classes.

0 - just image
1 - bar chart
2 - diagram
3 - flow chart
4 - graph
5 - growth chart
6 - pie chart
7 - table

Acknowledgements

Splash banner

Banner and icon by NCOA
Mobile Price Classification
kaggle.com
zip
Updated Jan 28, 2018
Cite
Abhishek Sharma (2018). Mobile Price Classification [Dataset]. https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification
Explore at:
Available download formats: zip (72340 bytes)
Dataset updated
Jan 28, 2018
Authors
Abhishek Sharma
Description
Context

Bob has started his own mobile company. He wants to give a tough fight to big companies like Apple, Samsung, etc.

He does not know how to estimate the price of the mobiles his company creates. In this competitive mobile phone market you cannot simply assume things. To solve this problem he collects sales data of mobile phones from various companies.

Bob wants to find some relation between the features of a mobile phone (e.g. RAM, internal memory) and its selling price. But he is not so good at machine learning, so he needs your help to solve this problem.

In this problem you do not have to predict the actual price but a price range indicating how high the price is.
Network Intrusion Detection
kaggle.com
zip
Updated Oct 9, 2018
Cite
Sampada Bhosale (2018). Network Intrusion Detection [Dataset]. https://www.kaggle.com/datasets/sampadab17/network-intrusion-detection
Explore at:
Available download formats: zip (838086 bytes)
Dataset updated
Oct 9, 2018
Authors
Sampada Bhosale
Description
Background

The dataset to be audited consists of a wide variety of intrusions simulated in a military network environment. An environment was created to acquire raw TCP/IP dump data for a network by simulating a typical US Air Force LAN. The LAN was operated like a real environment and blasted with multiple attacks. A connection is a sequence of TCP packets starting and ending at some time duration, between which data flows to and from a source IP address to a target IP address under some well-defined protocol. Each connection is labelled either as normal or as an attack with exactly one specific attack type. Each connection record consists of about 100 bytes. For each TCP/IP connection, 41 quantitative and qualitative features are obtained from normal and attack data (3 qualitative and 38 quantitative features). The class variable has two categories:

• Normal
• Anomalous
Face Recognition Dataset
kaggle.com
zip
Updated Nov 6, 2020
Cite
Vasuki Patel (2020). Face Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/vasukipatel/face-recognition-dataset
Explore at:
Available download formats: zip (761024670 bytes)
Dataset updated
Nov 6, 2020
Authors
Vasuki Patel
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Dataset

This dataset was created by Vasuki Patel

Released under CC0: Public Domain

Contents
School dataset csv-file
kaggle.com
zip
Updated Jun 7, 2023
Cite
Abhishek Bagwan☑️ (2023). School dataset csv-file [Dataset]. https://www.kaggle.com/datasets/abhishekbagwan/school-dataset
Explore at:
Available download formats: zip (40141 bytes)
Dataset updated
Jun 7, 2023
Authors
Abhishek Bagwan☑️
Description
A school dataset typically contains information about educational institutions, such as schools, colleges, or universities. These datasets often include various details about the schools, their locations, academic programs, and student demographics. Here is a general description of the information you may find in a school dataset:

School Name: The name of the educational institution.

Location: The geographical location of the school, including the address, city, state, and zip code.

Contact Information: Contact details for the school, such as phone number, email address, and website.

School Type: The type of educational institution, such as elementary school, high school, college, or university.

Grade Levels: The range of grades or levels offered by the school (e.g., K-12, 9-12).

Enrollment: The total number of students enrolled in the school.

Student Demographics: Information about the student population, including gender distribution, ethnicity, or race.

Faculty Information: The number of teachers or professors employed by the school.

Academic Programs: Details about the curriculum, majors, or academic offerings available at the school.

Facilities: Information on facilities provided by the school, such as libraries, laboratories, sports facilities, etc.

Accreditation: The accreditation status of the school, indicating whether it meets certain educational standards.

Performance Metrics: Data related to academic performance, standardized test scores, graduation rates, etc.

Financial Information: Details about the school's budget, funding sources, and expenses.

Extracurricular Activities: Information on clubs, sports teams, or other extracurricular programs offered by the school.

It's important to note that the specific details and fields included in a school dataset may vary depending on the source and purpose of the dataset. Different organizations or educational authorities may collect and provide different sets of information.
Students' Academic Performance Dataset
kaggle.com
zip
Updated Nov 26, 2016
Cite
Ibrahim Aljarah (2016). Students' Academic Performance Dataset [Dataset]. https://www.kaggle.com/datasets/aljarah/xAPI-Edu-Data
Explore at:
Available download formats: zip (5675 bytes)
Dataset updated
Nov 26, 2016
Authors
Ibrahim Aljarah
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
Students' Academic Performance Dataset (xAPI-Edu-Data)

Data Set Characteristics: Multivariate

Number of Instances: 480

Area: E-learning, Education, Predictive models, Educational Data Mining

Attribute Characteristics: Integer/Categorical

Number of Attributes: 16

Date: 2016-11-8

Associated Tasks: Classification

Missing Values? No

File formats: xAPI-Edu-Data.csv

Source:

Elaf Abu Amrieh, Thair Hamtini, and Ibrahim Aljarah, The University of Jordan, Amman, Jordan, http://www.Ibrahimaljarah.com www.ju.edu.jo

Dataset Information:

This is an educational data set collected from a learning management system (LMS) called Kalboard 360. Kalboard 360 is a multi-agent LMS designed to facilitate learning through the use of leading-edge technology. The system provides users with synchronous access to educational resources from any device with an Internet connection.

The data is collected using a learner activity tracker tool called the experience API (xAPI). The xAPI is a component of the training and learning architecture (TLA) that makes it possible to monitor learning progress and learner actions such as reading an article or watching a training video. The experience API helps learning activity providers to determine the learner, activity and objects that describe a learning experience. The dataset consists of 480 student records and 16 features. The features are classified into three major categories: (1) demographic features such as gender and nationality; (2) academic background features such as educational stage, grade level and section; (3) behavioral features such as raising a hand in class, opening resources, parents answering surveys, and school satisfaction.

The dataset consists of 305 males and 175 females. The students come from different origins: 179 students are from Kuwait, 172 from Jordan, 28 from Palestine, 22 from Iraq, 17 from Lebanon, 12 from Tunis, 11 from Saudi Arabia, 9 from Egypt, 7 from Syria, 6 from USA, Iran and Libya, 4 from Morocco and one from Venezuela.

The dataset was collected over two educational semesters: 245 student records were collected during the first semester and 235 during the second semester.

The data set also includes a school attendance feature: the students are classified into two categories based on their absence days, with 191 students exceeding 7 absence days and 289 students under 7.

The dataset also includes a new category of features: parent participation in the educational process. The parent participation feature has two sub-features: Parent Answering Survey and Parent School Satisfaction. 270 of the parents answered the survey and 210 did not; 292 of the parents are satisfied with the school and 188 are not.

(See the related papers for more details).

Attributes

1 Gender - student's gender (nominal: 'Male' or 'Female')

2 Nationality - student's nationality (nominal: 'Kuwait', 'Lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan', 'Venezuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Palestine', 'Iraq', 'Lybia')

3 Place of birth - student's place of birth (nominal: 'Kuwait', 'Lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan', 'Venezuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Palestine', 'Iraq', 'Lybia')

4 Educational Stages - educational level the student belongs to (nominal: 'lowerlevel', 'MiddleSchool', 'HighSchool')

5 Grade Levels - grade the student belongs to (nominal: 'G-01', 'G-02', 'G-03', 'G-04', 'G-05', 'G-06', 'G-07', 'G-08', 'G-09', 'G-10', 'G-11', 'G-12')

6 Section ID - classroom the student belongs to (nominal: 'A', 'B', 'C')

7 Topic - course topic (nominal: 'English', 'Spanish', 'French', 'Arabic', 'IT', 'Math', 'Chemistry', 'Biology', 'Science', 'History', 'Quran', 'Geology')

8 Semester - school year semester (nominal: 'First', 'Second')

9 Parent responsible for student (nominal: 'mom', 'father')

10 Raised hand - how many times the student raises his/her hand in the classroom (numeric: 0-100)

11 Visited resources - how many times the student visits course content (numeric: 0-100)

12 Viewing announcements - how many times the student checks the new announcements (numeric: 0-100)

13 Discussion groups - how many times the student participates in discussion groups (numeric: 0-100)

14 Parent Answering Survey - whether the parent answered the surveys provided by the school (nominal: 'Yes', 'No')

15 Parent School Satisfaction - the degree of parent satisfaction with the school (nominal: 'Yes', 'No')

16 Student Absence Days - the number of absence days for each student (nominal: above-7, under-7)

The students are classified into three numerical intervals based on their total grade/mark:

Low-Level: interval includes values from 0 to 69,

Middle-Level: interval includes values from 70 to 89,

High-Level: interval includes values from 90-100.
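
The three intervals above can be written as a small mapping function. This is only a sketch: it assumes whole-number marks between 0 and 100, and the function name `grade_level` is hypothetical.

```python
def grade_level(total_mark):
    """Map a total mark (0-100) to the class interval described above."""
    if not 0 <= total_mark <= 100:
        raise ValueError("mark must be between 0 and 100")
    if total_mark <= 69:
        return "Low-Level"
    if total_mark <= 89:
        return "Middle-Level"
    return "High-Level"

# Boundary values fall into the interval that lists them as an endpoint.
levels = [grade_level(m) for m in (0, 69, 70, 89, 90, 100)]
```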

Relevant Papers:

Amrieh, E. A., Hamtini, T., & Aljarah, I. (2016). Mining Educational Data to Predict Student’s academic Performance using Ensemble Methods. International Journal of Database Theory and Application, 9(8), 119-136.

Amrieh, E. A., Hamtini, T., & Aljarah, I. (2015, November). Preprocessing and analyzing educational data set using X-API for improving student's performance. In Applied Electrical Engineering and Computing Technologies (AEECT), 2015 IEEE Jordan Conference on (pp. 1-5). IEEE.

Citation Request:

Please include the two citations listed under Relevant Papers above if you plan to use this dataset.
Harry Potter all books(preprocessed)
kaggle.com
Updated Oct 27, 2022
Cite
Mateusz Kudła (2022). Harry Potter all books(preprocessed) [Dataset]. https://www.kaggle.com/datasets/moxxis/harry-potter-lstm
Explore at:
Dataset updated
Oct 27, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Mateusz Kudła
Description
This dataset contains 2 files:
- all Harry Potter books in txt file format.
- all Harry Potter books in txt file format, but I left in most of the special characters like [, "] (each sentence ends with '|' for easier splitting).

I did a little preprocessing on them:
- removed all unnecessary special characters, leaving only the [. ! ?] characters in the text
- removed all newline (\n) characters
- removed all carriage return (\r) characters
- removed all unnecessary text like the page number or book title on each page
- added white space before all special characters to treat them as separate tokens
- fixed all faulty words where:
  * a special character [. ! ?] was at the end of the word
  * a special character [. ! ?] was at the beginning of the word
  * a special character [. ! ?] was in the middle of the word
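
A few of the preprocessing steps listed above can be approximated with regular expressions. The sketch below is not the author's actual script: the `preprocess` function and the sample sentence are hypothetical, and it only covers removing line breaks, dropping special characters other than `. ! ?`, and separating those marks into their own tokens.

```python
import re

def preprocess(text):
    """Rough approximation of the cleaning steps; not the original script."""
    # Replace carriage returns and newlines with spaces.
    text = text.replace("\r", " ").replace("\n", " ")
    # Drop every character that is not a word character
    # (letter, digit, underscore), whitespace, or one of . ! ?
    text = re.sub(r"[^\w\s.!?]", "", text)
    # Put a single space before each remaining mark so it becomes its own token.
    text = re.sub(r"\s*([.!?])", r" \1", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

sample = 'Harry said, "Look!"\r\nRon laughed.'
tokens = preprocess(sample).split()
```

Note that this simple character filter also removes apostrophes, so a word like "don't" would become "dont"; whether that is desirable depends on the downstream model.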
Data on Bike Buyers by using MS EXCEL
kaggle.com
Updated Mar 25, 2022
Cite
Umasri (2022). Data on Bike Buyers by using MS EXCEL [Dataset]. https://www.kaggle.com/datasets/unica02/data-on-bike-buyers-by-using-ms-excel
Explore at:
Dataset updated
Mar 25, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Umasri
Description
The dataset includes customer id, Marital Status, Gender, Income, Children, Education, Occupation, Home Owner, Cars, Commute Distance, Region, Age, Purchased Bike. Blog
DAIGT-V2-Train-Dataset
kaggle.com
Updated Nov 15, 2023
Cite
(2023). DAIGT-V2-Train-Dataset [Dataset]. https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset
Explore at:
Dataset updated
Nov 15, 2023
Description
Please use version 2 (there were some issues with v1 that I fixed)!

New release of DAIGT train dataset! Improvement: - new models: Cohere Command, Google Palm, GPT4 (from Radek!) - new prompts, including source texts from the original essays! - mapping of essay text to original prompt from persuade corpus - filtering by the famous "RDizzl3_seven"

persuade_corpus                       25996
chat_gpt_moth                          2421
llama2_chat                            2421
mistral7binstruct_v2                   2421
mistral7binstruct_v1                   2421
original_moth                          2421
train_essays                           1378
llama_70b_v1                           1172
falcon_180b_v1                         1055
darragh_claude_v7                      1000
darragh_claude_v6                      1000
radek_500                               500
NousResearch/Llama-2-7b-chat-hf         400
mistralai/Mistral-7B-Instruct-v0.1      400
cohere-command                          350
palm-text-bison1                        349
radekgpt4                               200

Sources (please upvote the original datasets!):
- Text generated with ChatGPT by MOTH (https://www.kaggle.com/datasets/alejopaullier/daigt-external-dataset)
- Persuade corpus contributed by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/persaude-corpus-2/)
- Text generated with Llama-70b and Falcon180b by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b)
- Text generated with ChatGPT and GPT4 by Radek (https://www.kaggle.com/datasets/radek1/llm-generated-essays)
- 2000 Claude essays generated by @darraghdog (https://www.kaggle.com/datasets/darraghdog/hello-claude-1000-essays-from-anthropic)
- LLM-generated essay using PaLM from Google Gen-AI by @kingki19 (https://www.kaggle.com/datasets/kingki19/llm-generated-essay-using-palm-from-google-gen-ai)
- Official train essays
- Essays I generated with various LLMs

License: MIT for the data I generated. Check source datasets for the other sources mentioned above.
100-Sports-Image-Classification
kaggle.com
Updated Mar 14, 2021
Cite
(2021). 100-Sports-Image-Classification [Dataset]. https://www.kaggle.com/datasets/gpiosenka/sports-classification
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 14, 2021
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Context

Please upvote if you find this dataset of use. Thank you.

This version is an update of the earlier version. I ran a dataset quality evaluation program on the previous version, which found a considerable number of duplicate and near-duplicate images. Duplicate images can lead to falsely high validation and test set accuracy, so I have eliminated these images in this version of the dataset.

Images were gathered from internet searches and scanned with a duplicate image detector program I wrote. Any duplicate images were removed to prevent bleed-through of images between the train, test and valid datasets. All images were then resized to 224 x 224 x 3 and converted to jpg format. A csv file is included that, for each image file, contains the relative path to the image file, the image file's class label, and the dataset (train, test or valid) that the image file resides in.

This is a clean dataset. If you build a good model, you should achieve at least 95% accuracy on the test set. If you build a very good model, for example using transfer learning, you should be able to achieve 98%+ test set accuracy. If you find this dataset useful, please upvote. Thanks.

Content

Collection of sports images covering 100 different sports. Images are in 224 x 224 x 3 jpg format. Data is separated into train, test and valid directories. Additionally, a csv file is included for those who wish to use it to create their own train, test and validation datasets.

Inspiration

I wanted to build a high-quality, clean dataset that was easy to use and had no bad images or duplication between the train, test and validation datasets. It provides a good dataset to test your models on. It is designed for straightforward application of Keras preprocessing functions like ImageDataGenerator.flow_from_directory or, if you use the csv file, ImageDataGenerator.flow_from_dataframe. This dataset was carefully created so that the region of interest (ROI), in this case the sport, occupies approximately 50% of the pixels in the image. As a consequence, even models of moderate complexity should achieve training and validation accuracies in the high 90s.
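For readers who want to build their own splits from the included csv file, the grouping step might look something like the sketch below. It uses only the Python standard library rather than the Keras functions named above, and the column names `filepath`, `label` and `dataset`, along with the sample rows, are hypothetical placeholders; check the real file's header before adapting it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real csv's column names and paths may differ.
SAMPLE_CSV = """filepath,label,dataset
train/archery/001.jpg,archery,train
train/rugby/001.jpg,rugby,train
valid/archery/001.jpg,archery,valid
test/rugby/001.jpg,rugby,test
"""

def group_by_split(csv_text: str) -> dict:
    """Group (relative path, class label) pairs by the split each row belongs to."""
    splits = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        splits[row["dataset"]].append((row["filepath"], row["label"]))
    return dict(splits)

splits = group_by_split(SAMPLE_CSV)
for name, rows in splits.items():
    print(name, len(rows))
```

To read the actual file, the same function could be fed the file's contents in place of `SAMPLE_CSV`; the resulting lists can then be reshuffled into custom train, test and validation sets.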

Cite

Trinath Reddy (2018). Kaggle Datasets Data [Dataset]. https://www.kaggle.com/datasets/trinath003/kaggle-datasets-data

Kaggle Datasets Data

Data Manipulation&Visualisation on Kaggle Datasets

Explore at:

22 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Oct 5, 2018

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Trinath Reddy

License

CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically


