CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
New datasets are uploaded to Kaggle every day. To make this one different from the others, I kept working on it until I got a crazy idea, which led me to create this dataset.
I created a dataset about Kaggle datasets (for now, the most-voted datasets). Sounds interesting, right?
The dataset contains all the attributes shown on a Kaggle dataset page. I am excited to share the data. Screenshot: https://image.ibb.co/j9Ybwz/Screenshot_from_2018_10_05_19_47_35.png
The dataset consists of 1960 rows and 15 columns. All the attributes shown on Kaggle are in the dataset.
Column details:
Votes - int64
Image - object
Link - object
Title - object
Sub-title - object
Uploader - object
Updated - object
Version - int64
Tags - object
FileType - object
FileSize - object
License - object
Kernels - object
Discussions - float64
Views - object
It was hard to create this dataset. The main motive is to share knowledge and create tutorials from what we learned.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the stats of all completed competitions organized on Kaggle. It contains 15 columns.
1.Comp_name- Name of competition
2.comp_ Reward- Type of Reward
3.comp_link- Link of competition
4.teams- Number of participating teams
5.Entries- Number of Entries
6.Competitors- number of competitors
7.start_date- starting date
8.start_month- starting month
9.start_year- starting year
10.Final_date- ending date
11.Final_month- Ending month
12.Final_year- ending year
13.code_link- Link of one notebook on each comp
14.Desc- Description of competition
This dataset has been scraped from link
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Books read by users and ratings provided by them on Amazon
Online data for books from Amazon along with user ratings and users who bought them
Primarily for building recommender systems. This dataset was compiled by Cai-Nicolas Ziegler in 2004, and it comprises three tables for users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denoting higher appreciation) and an implicit rating is expressed by 0. http://www2.informatik.uni-freiburg.de/~cziegler/BX/
Can we select and recommend the top 10 books for each user based on past purchase behavior?
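As a starting point for the question above, here is a minimal sketch in plain Python. It assumes ratings arrive as `(user_id, isbn, rating)` tuples; the sample rows, the function names and the popularity-based ranking are all illustrative assumptions, not part of the dataset or its documentation. It separates explicit ratings (1 to 10) from implicit ones (0) and recommends unseen books by mean explicit rating.

```python
from collections import defaultdict

# Hypothetical sample rows: (user_id, isbn, rating).
# Per the description, 0 is an implicit rating and 1-10 is explicit.
ratings = [
    ("u1", "b1", 0), ("u1", "b2", 8), ("u2", "b2", 10),
    ("u2", "b3", 5), ("u3", "b1", 7), ("u3", "b3", 0),
]


def split_ratings(rows):
    """Separate explicit (1-10) ratings from implicit (0) ones."""
    explicit = [r for r in rows if 1 <= r[2] <= 10]
    implicit = [r for r in rows if r[2] == 0]
    return explicit, implicit


def top_n_for_user(rows, user_id, n=10):
    """Popularity baseline: rank books the user has not touched by mean explicit rating."""
    explicit, _ = split_ratings(rows)
    seen = {isbn for uid, isbn, _ in rows if uid == user_id}
    totals = defaultdict(lambda: [0, 0])  # isbn -> [rating sum, rating count]
    for _, isbn, rating in explicit:
        totals[isbn][0] += rating
        totals[isbn][1] += 1
    means = {isbn: s / c for isbn, (s, c) in totals.items() if isbn not in seen}
    # Highest mean first; ties broken by ISBN for a stable order.
    return sorted(means, key=lambda isbn: (-means[isbn], isbn))[:n]


explicit, implicit = split_ratings(ratings)
recommendations = top_n_for_user(ratings, "u1")
```

A real recommender would likely use collaborative filtering instead; the baseline is only meant to show how the two rating types can be handled separately.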
I have created this dataset for an easier way to analyse the progression of answers from the respondents that are participating each year in the very famous Data Science Kaggle Survey.
The sources of the present data are:
* 2017: https://www.kaggle.com/kaggle/kaggle-survey-2017
* 2018: https://www.kaggle.com/kaggle/kaggle-survey-2018
* 2019: https://www.kaggle.com/c/kaggle-survey-2019/data
* 2020: https://www.kaggle.com/c/kaggle-survey-2020/data
* 2021: https://www.kaggle.com/c/kaggle-survey-2021/data
This dataset was created by manually aggregating each of the 5 tables mentioned above. The full methodology was as follows:
The aggregation was done manually, as the question order, naming and types of answers differ from one year to another. Hence, the most accurate way (although not the most efficient) was to read, order and pick the questions with regard to the base table (the 2021 Survey).
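The author did this alignment by hand. The sketch below only illustrates the underlying idea of mapping each year's columns onto a base table and stacking the results with a `year` field. Every column name and mapping in it is hypothetical; the real survey headers differ.

```python
# Hypothetical base columns, standing in for the 2021 survey layout.
BASE_COLUMNS = ["age", "country", "job_title"]

# Hypothetical per-year mapping: that year's column name -> base column name.
COLUMN_MAPS = {
    2017: {"Age": "age", "Country": "country"},  # illustrative: one base question unmapped
    2021: {"Q1": "age", "Q3": "country", "Q5": "job_title"},
}


def align_year(rows, year):
    """Rename one year's columns to the base names; unmapped base columns stay None."""
    mapping = COLUMN_MAPS[year]
    aligned = []
    for row in rows:
        out = {"year": year, **{col: None for col in BASE_COLUMNS}}
        for source_name, value in row.items():
            if source_name in mapping:  # columns without a mapping are dropped
                out[mapping[source_name]] = value
        aligned.append(out)
    return aligned


def aggregate(tables):
    """Stack the aligned tables in chronological order."""
    combined = []
    for year in sorted(tables):
        combined.extend(align_year(tables[year], year))
    return combined


tables = {
    2017: [{"Age": "25", "Country": "India", "Other": "x"}],
    2021: [{"Q1": "30-34", "Q3": "Brazil", "Q5": "Data Scientist"}],
}
combined = aggregate(tables)
```

Keeping the mappings in one dictionary makes each year's decisions easy to review, which matters when answer types also change between years.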
This dataset contains the following:
* kaggle_survey_2017_2021.csv: the tabular dataset containing the aggregated data from 2017 to 2021.
* style.css: a file that serves as custom styling for my notebook on this competition.
* images folder: all images I have used for my notebook on this competition.

Note: Notebook can be found here.
Thank you so much to the Kaggle Team for hosting these surveys and sharing with us all the data, so we can take the pulse of the community each year.
The Kaggle Survey is rich in information as is, but what can you find by adding another layer of information, the year? Evolutions in time could be fascinating.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data Science Job Posting on Glassdoor
Groceries dataset for Market Basket Analysis(MBA)
Dataset for Facial recognition using ML approach
Covid_w/wo_Pneumonia Chest Xray
Disney Movies 1937-2016 Gross Income
Bollywood Movie data from 2000 to 2019
17.7K English song data from 2008-2017
Age : Age of the patient
Sex : Sex of the patient
exang: exercise induced angina (1 = yes; 0 = no)
ca: number of major vessels (0-3)
cp : chest pain type
trtbps : resting blood pressure (in mm Hg)
chol : cholesterol in mg/dl fetched via BMI sensor
fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
rest_ecg : resting electrocardiographic results
thalach : maximum heart rate achieved
target : 0= less chance of heart attack 1= more chance of heart attack
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Companies often receive thousands of resumes for each job posting and employ dedicated screening officers to screen qualified candidates.
Hiring the right talent is a challenge for all businesses. This challenge is magnified by the high volume of applicants if the business is labour-intensive, growing, and facing high attrition rates.
IT departments are short of growing markets. In a typical service organization, professionals with a variety of technical skills and business domain expertise are hired and assigned to projects to resolve customer issues. This task of selecting the best talent among many others is known as Resume Screening.
Typically, large companies do not have enough time to open each CV, so they use machine learning algorithms for the Resume Screening task.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
To categorise the countries using socio-economic and health factors that determine the overall development of the country.
HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.
HELP International has been able to raise around $10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively, which means choosing the countries that are in the direst need of aid. Hence, your job as a data scientist is to categorise the countries using some socio-economic and health factors that determine the overall development of the country. Then you need to suggest the countries which the CEO needs to focus on the most.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This Kaggle dataset, called the "Large Language Model Tweets Dataset", contains tweets about large language models. It includes a collection of tweets that mention or discuss various aspects of large language models, such as their development, use cases, performance, ethical considerations, and impact on society.
The dataset contains over 10,000 tweets, from various sources, including researchers, practitioners, journalists, and the general public. The tweets are in English and cover a wide range of topics related to large language models, such as natural language processing, machine learning, deep learning, artificial intelligence, and more.
Each tweet in the dataset includes information such as the tweet ID, timestamp, user ID, user name, tweet text, and other metadata.
This dataset can be useful for researchers and practitioners who are interested in studying large language models from a social media perspective. It can also be used for sentiment analysis, topic modeling, and other text analytics tasks related to large language models.
Note from KB: the description above was generated with ChatGPT itself.
Note from KB2: Please leave an upvote if you download :-)
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Don't ask me where this data comes from, the answer is I don't know!
Credit score cards are a common risk control method in the financial industry. They use personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings. The bank can then decide whether to issue a credit card to the applicant. Credit scores can objectively quantify the magnitude of risk.
Generally speaking, credit score cards are based on historical data. When large economic fluctuations occur, past models may lose their original predictive power. The logistic model is a common method for credit scoring, because logistic regression is suitable for binary classification tasks and can calculate the coefficient of each feature. To make the score card easier to understand and operate, the logistic regression coefficients are multiplied by a certain value (such as 100) and rounded.
At present, with the development of machine learning algorithms, more predictive methods such as Boosting, Random Forest, and Support Vector Machines have been introduced into credit card scoring. However, these methods often lack transparency, so it may be difficult to give customers and regulators a reason for rejection or acceptance.
Build a machine learning model to predict whether an applicant is a 'good' or 'bad' client. Unlike in other tasks, the definition of 'good' or 'bad' is not given. You should use some technique, such as vintage analysis, to construct your label. Also, imbalanced data is a big problem in this task.
There are two tables that can be merged by ID:
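The scaling step mentioned above can be sketched in a few lines. The coefficient values below are invented for illustration, and the sketch assumes the features have already been encoded as numbers; neither comes from the dataset itself.

```python
# Hypothetical logistic regression coefficients (made up for illustration).
coefficients = {"FLAG_OWN_CAR": 0.184, "CNT_CHILDREN": -0.052, "FLAG_OWN_REALTY": 0.3}
SCALE = 100  # the "certain value" from the description


def to_score_points(coefs, scale=SCALE):
    """Multiply each coefficient by the scale and round to whole points."""
    return {name: round(value * scale) for name, value in coefs.items()}


def score(applicant, points):
    """Sum points weighted by the applicant's (numerically encoded) feature values."""
    return sum(points[name] * applicant.get(name, 0) for name in points)


points = to_score_points(coefficients)
example_score = score({"FLAG_OWN_CAR": 1, "CNT_CHILDREN": 2, "FLAG_OWN_REALTY": 0}, points)
```

Whole-number points are easier to explain to customers and regulators than raw coefficients, which is the transparency benefit the description attributes to score cards.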
application_record.csv | ||
---|---|---|
Feature name | Explanation | Remarks |
ID | Client number | |
CODE_GENDER | Gender | |
FLAG_OWN_CAR | Is there a car | |
FLAG_OWN_REALTY | Is there a property | |
CNT_CHILDREN | Number of children | |
AMT_INCOME_TOTAL | Annual income | |
NAME_INCOME_TYPE | Income category | |
NAME_EDUCATION_TYPE | Education level | |
NAME_FAMILY_STATUS | Marital status | |
NAME_HOUSING_TYPE | Way of living | |
DAYS_BIRTH | Birthday | Count backwards from current day (0), -1 means yesterday |
DAYS_EMPLOYED | Start date of employment | Count backwards from current day (0). If positive, it means the person is currently unemployed. |
FLAG_MOBIL | Is there a mobile phone | |
FLAG_WORK_PHONE | Is there a work phone | |
FLAG_PHONE | Is there a phone | |
FLAG_EMAIL | Is there an email | |
OCCUPATION_TYPE | Occupation | |
CNT_FAM_MEMBERS | Family size | |
credit_record.csv | ||
---|---|---|
Feature name | Explanation | Remarks |
ID | Client number | |
MONTHS_BALANCE | Record month | The month of the extracted data is the starting point, backwards, 0 is the current month, -1 is the previous month, and so on |
STATUS | Status | 0: 1-29 days past due; 1: 30-59 days past due; 2: 60-89 days overdue; 3: 90-119 days overdue; 4: 120-149 days overdue; 5: overdue or bad debts, write-offs for more than 150 days; C: paid off that month; X: no loan for the month |
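A minimal pandas sketch of how the two tables might be joined on `ID` follows. The rows are tiny made-up stand-ins for the CSV files, and the labelling rule (a client is "bad" if any month shows STATUS 2 to 5) is an assumption chosen for brevity. It is not vintage analysis and not a definition given by the dataset.

```python
import pandas as pd

# Tiny in-memory stand-ins for application_record.csv and credit_record.csv.
applications = pd.DataFrame({
    "ID": [1, 2, 3],
    "DAYS_BIRTH": [-12000, -15000, -20000],
    "DAYS_EMPLOYED": [-3000, 200, -100],
})
credit = pd.DataFrame({
    "ID": [1, 1, 2, 2, 4],
    "MONTHS_BALANCE": [0, -1, 0, -1, 0],
    "STATUS": ["0", "C", "2", "X", "5"],
})

# ASSUMPTION: "bad" means at least one month that is 60+ days overdue (STATUS 2-5).
BAD_STATUSES = {"2", "3", "4", "5"}
labels = (
    credit.assign(target=credit["STATUS"].isin(BAD_STATUSES).astype(int))
    .groupby("ID", as_index=False)["target"]
    .max()  # 1 if any monthly record was bad, else 0
)

# An inner join keeps only clients that appear in both tables.
merged = applications.merge(labels, on="ID", how="inner")

# DAYS_BIRTH counts backwards from day 0, so negate it to get an age in years.
merged["age_years"] = (-merged["DAYS_BIRTH"] / 365.25).round(1)
# A positive DAYS_EMPLOYED marks a currently unemployed person.
merged["unemployed"] = merged["DAYS_EMPLOYED"] > 0
```

With the real files, the two frames would come from `pd.read_csv`, and the share of positive labels is worth checking early given the imbalance the description warns about.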
Related data : Credit Card Fraud Detection Related competition: Home Credit Default Risk
This dataset was created by Algor_Bruce
Released under Other (specified in description)
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Please refer to this notebook.
This dataset contains 15875 samples of images of graphs divided into 8 classes.
0 - just image
1 - bar chart
2 - diagram
3 - flow chart
4 - graph
5 - growth chart
6 - pie chart
7 - table
Banner and icon by NCOA
Bob has started his own mobile company. He wants to give a tough fight to big companies like Apple, Samsung, etc.
He does not know how to estimate the price of the mobiles his company creates. In this competitive mobile phone market you cannot simply assume things. To solve this problem he collects sales data of mobile phones from various companies.
Bob wants to find some relation between the features of a mobile phone (e.g. RAM, internal memory, etc.) and its selling price. But he is not so good at machine learning, so he needs your help to solve this problem.
In this problem you do not have to predict the actual price but a price range indicating how high the price is.
Background

The dataset to be audited was provided which consists of a wide variety of intrusions simulated in a military network environment. It created an environment to acquire raw TCP/IP dump data for a network by simulating a typical US Air Force LAN. The LAN was focused like a real environment and blasted with multiple attacks. A connection is a sequence of TCP packets starting and ending at some time duration between which data flows to and from a source IP address to a target IP address under some well-defined protocol. Also, each connection is labelled as either normal or as an attack with exactly one specific attack type. Each connection record consists of about 100 bytes. For each TCP/IP connection, 41 quantitative and qualitative features are obtained from normal and attack data (3 qualitative and 38 quantitative features). The class variable has two categories:
• Normal
• Anomalous
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset was created by Vasuki Patel
Released under CC0: Public Domain
A school dataset typically contains information about educational institutions, such as schools, colleges, or universities. These datasets often include various details about the schools, their locations, academic programs, and student demographics. Here is a general description of the information you may find in a school dataset:
It's important to note that the specific details and fields included in a school dataset may vary depending on the source and purpose of the dataset. Different organizations or educational authorities may collect and provide different sets of information.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Data Set Characteristics: Multivariate
Number of Instances: 480
Area: E-learning, Education, Predictive models, Educational Data Mining
Attribute Characteristics: Integer/Categorical
Number of Attributes: 16
Date: 2016-11-8
Associated Tasks: Classification
Missing Values? No
File formats: xAPI-Edu-Data.csv
Elaf Abu Amrieh, Thair Hamtini, and Ibrahim Aljarah, The University of Jordan, Amman, Jordan, http://www.Ibrahimaljarah.com www.ju.edu.jo
This is an educational dataset collected from a learning management system (LMS) called Kalboard 360. Kalboard 360 is a multi-agent LMS designed to facilitate learning through the use of leading-edge technology. The system provides users with synchronous access to educational resources from any device with an Internet connection.
The data is collected using a learner activity tracker tool called the Experience API (xAPI). The xAPI is a component of the training and learning architecture (TLA) that makes it possible to monitor learning progress and learners' actions, like reading an article or watching a training video. The Experience API helps learning activity providers to determine the learner, activity and objects that describe a learning experience. The dataset consists of 480 student records and 16 features. The features are classified into three major categories: (1) Demographic features such as gender and nationality. (2) Academic background features such as educational stage, grade level and section. (3) Behavioral features such as raising hands in class, opening resources, parents answering surveys, and school satisfaction.
The dataset consists of 305 males and 175 females. The students come from different origins: 179 students are from Kuwait, 172 from Jordan, 28 from Palestine, 22 from Iraq, 17 from Lebanon, 12 from Tunis, 11 from Saudi Arabia, 9 from Egypt, 7 from Syria, 6 each from USA, Iran and Libya, 4 from Morocco and one from Venezuela.
The dataset was collected over two educational semesters: 245 student records were collected during the first semester and 235 during the second semester.
The dataset also includes a school attendance feature: students are classified into two categories based on their absence days, with 191 students exceeding 7 absence days and 289 students under 7.
The dataset also includes a new category of features: parent participation in the educational process. The parent participation feature has two sub-features: Parent Answering Survey and Parent School Satisfaction. 270 of the parents answered the survey and 210 did not; 292 of the parents are satisfied with the school and 188 are not.
(See the related papers for more details).
1 Gender - student's gender (nominal: 'Male' or 'Female')
2 Nationality - student's nationality (nominal: 'Kuwait', 'Lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan', 'Venezuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Palestine', 'Iraq', 'Lybia')
3 Place of birth - student's place of birth (nominal: 'Kuwait', 'Lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan', 'Venezuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Palestine', 'Iraq', 'Lybia')
4 Educational Stages - educational level the student belongs to (nominal: 'lowerlevel', 'MiddleSchool', 'HighSchool')
5 Grade Levels - grade the student belongs to (nominal: 'G-01', 'G-02', 'G-03', 'G-04', 'G-05', 'G-06', 'G-07', 'G-08', 'G-09', 'G-10', 'G-11', 'G-12')
6 Section ID - classroom the student belongs to (nominal: 'A', 'B', 'C')
7 Topic - course topic (nominal: 'English', 'Spanish', 'French', 'Arabic', 'IT', 'Math', 'Chemistry', 'Biology', 'Science', 'History', 'Quran', 'Geology')
8 Semester - school year semester (nominal: 'First', 'Second')
9 Parent responsible for student (nominal: 'mom', 'father')
10 Raised hand - how many times the student raises his/her hand in the classroom (numeric: 0-100)
11 Visited resources - how many times the student visits course content (numeric: 0-100)
12 Viewing announcements - how many times the student checks new announcements (numeric: 0-100)
13 Discussion groups - how many times the student participates in discussion groups (numeric: 0-100)
14 Parent Answering Survey - whether the parent answered the surveys provided by the school (nominal: 'Yes', 'No')
15 Parent School Satisfaction - the degree of parent satisfaction with the school (nominal: 'Yes', 'No')
16 Student Absence Days - the number of absence days for each student (nominal: above-7, under-7)
Low-Level: interval includes values from 0 to 69,
Middle-Level: interval includes values from 70 to 89,
High-Level: interval includes values from 90-100.
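The three intervals above read as a mapping from a numeric mark to a class label. A small sketch of that mapping follows, assuming the intervals apply to a student's total mark on a 0 to 100 scale; the function name is hypothetical.

```python
def performance_level(total_mark):
    """Map a mark in [0, 100] to the level whose interval contains it."""
    if not 0 <= total_mark <= 100:
        raise ValueError("mark must be between 0 and 100")
    if total_mark <= 69:
        return "Low-Level"
    if total_mark <= 89:
        return "Middle-Level"
    return "High-Level"


levels = [performance_level(mark) for mark in (0, 69, 70, 89, 90, 100)]
```

The intervals are stated for whole marks; with this sketch a fractional mark such as 69.5 falls into the next level up, which is a choice the source does not specify.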
Please include these citations if you plan to use this dataset:
Amrieh, E. A., Hamtini, T., & Aljarah, I. (2016). Mining Educational Data to Predict Student’s academic Performance using Ensemble Methods. International Journal of Database Theory and Application, 9(8), 119-136.
Amrieh, E. A., Hamtini, T., & Aljarah, I. (2015, November). Preprocessing and analyzing educational data set using X-API for improving student's performance. In Applied Electrical Engineering and Computing Technologies (AEECT), 2015 IEEE Jordan Conference on (pp. 1-5). IEEE.
This dataset contains 2 files:
- all Harry Potter books in txt file format.
- all Harry Potter books in txt file format, but with most of the special characters, like [, "], left in (each sentence ends with '|' for easier splitting).

I did a little preprocessing on them and:
- removed all unnecessary special characters, leaving only the [. ! ?] characters in the text
- removed all newline (\n) characters
- removed all carriage return (\r) characters
- removed all unnecessary text like the page number or book title on each page
- added white spaces before all special characters to treat them as separate tokens
- fixed all faulty words where:
  * a special character [. ! ?] was at the end of the word
  * a special character [. ! ?] was at the beginning of the word
  * a special character [. ! ?] was in the middle of the word
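The author's actual script is not shown, so the following is only a rough sketch of what the cleaning steps listed above could look like with Python's `re` module. The character whitelist (letters, digits, apostrophes) is an assumption, and removing page numbers or book titles is left out because it depends on the layout of the source files.

```python
import re


def preprocess(text):
    """Approximate the listed cleaning steps on a raw text string."""
    # Drop newline and carriage-return characters.
    text = text.replace("\r", " ").replace("\n", " ")
    # ASSUMPTION: keep letters, digits, apostrophes, spaces and the marks . ! ?
    text = re.sub(r"[^A-Za-z0-9' .!?]", " ", text)
    # Surround each sentence mark with spaces so it becomes its own token,
    # whether it sat at the start, middle or end of a word.
    text = re.sub(r"([.!?])", r" \1 ", text)
    # Collapse runs of whitespace left behind by the steps above.
    return re.sub(r"\s+", " ", text).strip()


cleaned = preprocess('Mr. Potter!\r\n"Who?" he asked.')
tokens = cleaned.split()
```

After this pass a plain `split()` yields word and punctuation tokens, which is the point of inserting the extra spaces.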
The dataset includes Customer ID, Marital Status, Gender, Income, Children, Education, Occupation, Home Owner, Cars, Commute Distance, Region, Age, Purchased Bike. Blog
Please use version 2 (there were some issues with v1 that I fixed)!
New release of DAIGT train dataset! Improvement:
- new models: Cohere Command, Google Palm, GPT4 (from Radek!)
- new prompts, including source texts from the original essays!
- mapping of essay text to original prompt from persuade corpus
- filtering by the famous "RDizzl3_seven"
persuade_corpus 25996
chat_gpt_moth 2421
llama2_chat 2421
mistral7binstruct_v2 2421
mistral7binstruct_v1 2421
original_moth 2421
train_essays 1378
llama_70b_v1 1172
falcon_180b_v1 1055
darragh_claude_v7 1000
darragh_claude_v6 1000
radek_500 500
NousResearch/Llama-2-7b-chat-hf 400
mistralai/Mistral-7B-Instruct-v0.1 400
cohere-command 350
palm-text-bison1 349
radekgpt4 200
Sources (please upvote the original datasets!):
- Text generated with ChatGPT by MOTH (https://www.kaggle.com/datasets/alejopaullier/daigt-external-dataset)
- Persuade corpus contributed by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/persaude-corpus-2/)
- Text generated with Llama-70b and Falcon180b by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b)
- Text generated with ChatGPT and GPT4 by Radek (https://www.kaggle.com/datasets/radek1/llm-generated-essays)
- 2000 Claude essays generated by @darraghdog (https://www.kaggle.com/datasets/darraghdog/hello-claude-1000-essays-from-anthropic)
- LLM-generated essay using PaLM from Google Gen-AI by @kingki19 (https://www.kaggle.com/datasets/kingki19/llm-generated-essay-using-palm-from-google-gen-ai)
- Official train essays
- Essays I generated with various LLMs
License: MIT for the data I generated. Check source datasets for the other sources mentioned above.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Please upvote if you find this dataset of use. Thank you.

This version is an update of the earlier version. I ran a dataset quality evaluation program on the previous version, which found a considerable number of duplicate and near-duplicate images. Duplicate images can lead to falsely high validation and test set accuracy, so I have eliminated these images in this version of the dataset.

Images were gathered from internet searches. The images were scanned with a duplicate image detector program I wrote. Any duplicate images were removed to prevent bleed-through of images between the train, test and valid datasets. All images were then resized to 224 x 224 x 3 and converted to jpg format. A csv file is included that, for each image file, contains the relative path to the image file, the image file's class label and the dataset (train, test or valid) that the image file resides in.

This is a clean dataset. If you build a good model you should achieve at least 95% accuracy on the test set. If you build a very good model, for example using transfer learning, you should be able to achieve 98%+ test set accuracy.
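The author's duplicate detector is not described, so the sketch below is not that program. It shows one simple way to flag byte-identical files across the train, test and valid splits by hashing their contents. It would not catch the near-duplicates mentioned above, which need perceptual comparison. The file paths and byte strings are invented.

```python
import hashlib


def find_exact_duplicates(images):
    """images: dict of relative path -> raw file bytes. Returns the paths to drop."""
    first_seen = {}  # content digest -> first path seen with that content
    duplicates = []
    for path in sorted(images):  # sorted for a deterministic result
        digest = hashlib.sha256(images[path]).hexdigest()
        if digest in first_seen:
            duplicates.append(path)
        else:
            first_seen[digest] = path
    return duplicates


# Hypothetical stand-ins for image files read from disk.
sample = {
    "train/archery/001.jpg": b"\xff\xd8 image A",
    "test/archery/001.jpg": b"\xff\xd8 image A",  # same bytes as the train image
    "valid/rugby/001.jpg": b"\xff\xd8 image B",
}
to_remove = find_exact_duplicates(sample)
```

Which copy to keep is a design choice. Here the alphabetically first path wins, but keeping the test or valid copy and dropping the train copy would serve the same leakage goal.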
Collection of sports images covering 100 different sports. Images are in 224 x 224 x 3 jpg format. Data is separated into train, test and valid directories. Additionally, a csv file is included for those who wish to use it to create their own train, test and validation datasets.
I wanted to build a high-quality, clean dataset that was easy to use and had no bad images or duplication between the train, test and validation datasets. It provides a good dataset to test your models on. It is designed for straightforward application of Keras preprocessing functions like ImageDataGenerator.flow_from_directory or, if you use the csv file, ImageDataGenerator.flow_from_dataframe. This dataset was carefully created so that the region of interest (ROI), in this case the sport, occupies approximately 50% of the pixels in the image. As a consequence, even models of moderate complexity should achieve training and validation accuracies in the high 90s.