Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.
Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.
Image: Kaggle Leaderboard Performance (https://imgur.com/2Egeb8R.png)
This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.
Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.
In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here: https://www.kaggle.com/datasets/kaggle/meta-kaggle-code
We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A company has a fleet of devices transmitting daily sensor readings. They would like to create a predictive maintenance solution to proactively identify when maintenance should be performed. This approach promises cost savings over routine or time-based preventive maintenance, because tasks are performed only when warranted.
The task is to build a machine learning model that predicts the probability of a device failure. When building this model, be sure to minimize both false positives and false negatives. The column to predict is called failure, a binary value with 0 for non-failure and 1 for failure.
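Minimizing false positives and false negatives together is commonly tracked with precision, recall, and F1. The following is a minimal sketch, assuming predictions have already been thresholded to the same 0/1 coding as the failure column; the function name and the sample labels are illustrative and not part of the dataset.

```python
def failure_metrics(y_true, y_pred):
    """Count confusion-matrix cells for the binary `failure` label
    (1 = failure, 0 = non-failure) and derive precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # failures caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed failures
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Made-up labels, for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
metrics = failure_metrics(y_true, y_pred)
print(metrics)
```

Because device failures are usually rare, plain accuracy can look high even when most failures are missed, which is why the sketch reports the false-positive and false-negative counts separately.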
The dataset contains information on around 10,000 online courses from popular online learning platforms such as Coursera, Udacity, Simplilearn, and FutureLearn. The data was scraped and compiled, with the dataset being updated until the year 2023. This dataset provides valuable information for analyzing and understanding the online learning landscape as of that year.
The dataset is typically available in a structured format, such as a CSV (Comma-Separated Values) file or a spreadsheet, with each row representing a course and each column representing a specific attribute or feature of the course.
Potential Applications:
1- Course Recommendations: Analyzing the dataset can provide insights for recommending courses to individuals based on their interests, skill level, and career goals.
2- Market Analysis: Researchers or analysts can use the dataset to study the market share and popularity of different online learning platforms and subject areas.
3- Skill Demand Analysis: The dataset can help identify the most in-demand skills and subject areas among online learners.
4- Educational Research: Researchers can leverage the dataset to investigate trends and patterns in online learning, instructional design, and course delivery.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
NYC Open Data is an opportunity to engage New Yorkers in the information that is produced and used by City government. We believe that every New Yorker can benefit from Open Data, and Open Data can benefit from every New Yorker. Source: https://opendata.cityofnewyork.us/overview/
Thanks to NYC Open Data, which makes public data generated by city agencies available for public use, and Citi Bike, we've incorporated over 150 GB of data in 5 open datasets into Google BigQuery Public Datasets, including:
Over 8 million 311 service requests from 2012-2016
More than 1 million motor vehicle collisions 2012-present
Citi Bike stations and 30 million Citi Bike trips 2013-present
Over 1 billion Yellow and Green Taxi rides from 2009-present
Over 500,000 sidewalk trees surveyed decennially in 1995, 2005, and 2015
This dataset is deprecated and not being updated.
Fork this kernel to get started with this dataset.
https://opendata.cityofnewyork.us/
This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://data.cityofnewyork.us/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.
The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.
Banner photo by @bicadmedia from Unsplash.
On which New York City streets are you most likely to find a loud party?
Can you find the Virginia Pines in New York City?
Where was the only collision caused by an animal that injured a cyclist?
What’s the Citi Bike record for the Longest Distance in the Shortest Time (on a route with at least 100 rides)?
https://cloud.google.com/blog/big-data/2017/01/images/148467900588042/nyc-dataset-6.png
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus. Most people infected with COVID-19 virus will experience mild to moderate respiratory illness and recover without requiring special treatment. Older people, and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to develop serious illness. During the entire course of the pandemic, one of the main problems that healthcare providers have faced is the shortage of medical resources and a proper plan to efficiently distribute them. In these tough times, being able to predict what kind of resource an individual might require at the time of being tested positive or even before that will be of immense help to the authorities as they would be able to procure and arrange for the resources necessary to save the life of that patient.
The main goal of this project is to build a machine learning model that, given a COVID-19 patient's current symptoms, status, and medical history, predicts whether the patient is at high risk or not.
The dataset was provided by the Mexican government. It contains a large amount of anonymized patient-related information, including pre-existing conditions. The raw dataset consists of 21 unique features and 1,048,576 unique patients. In the Boolean features, 1 means "yes" and 2 means "no". Values of 97 and 99 denote missing data.
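The coding described above (1 = yes, 2 = no, 97 and 99 = missing) can be decoded before modeling. This is a minimal sketch, assuming every Boolean feature uses exactly those codes; the function name and the sample record keys are hypothetical, and raising on any other code is a design choice rather than something the dataset description prescribes.

```python
MISSING_CODES = {97, 99}  # codes the description lists as missing data


def decode_boolean(value):
    """Translate the dataset's Boolean coding into Python values:
    1 -> True ("yes"), 2 -> False ("no"), 97/99 -> None (missing)."""
    if value == 1:
        return True
    if value == 2:
        return False
    if value in MISSING_CODES:
        return None
    # Anything else is not covered by the description, so fail loudly.
    raise ValueError(f"unexpected Boolean code: {value!r}")


# Hypothetical record; the keys are placeholders, not real column names.
record = {"feature_a": 1, "feature_b": 2, "feature_c": 97}
decoded = {name: decode_boolean(code) for name, code in record.items()}
print(decoded)
```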
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The generated fake reviews dataset contains 20k fake reviews and 20k real product reviews. OR = Original reviews (presumably human-created and authentic); CG = Computer-generated fake reviews.
Salminen, J., Kandpal, C., Kamel, A. M., Jung, S., & Jansen, B. J. (2022). Creating and detecting fake reviews of online products. Journal of Retailing and Consumer Services, 64, 102771. https://doi.org/10.1016/j.jretconser.2021.102771
Photo by Brett Jordan on Unsplash
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of products with the attributes
Category
Kari, Venkatram (2023), “Product Dataset”, Mendeley Data, V1, doi: 10.17632/v8yt3r8th2.1
MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for open-domain video captioning. It consists of 10,000 video clips from 20 categories, and each clip is annotated with 20 English sentences by Amazon Mechanical Turk workers. There are about 29,000 unique words across all captions. The standard split uses 6,513 clips for training, 497 clips for validation, and 2,990 clips for testing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset appears to contain a variety of features related to text analysis, sentiment analysis, and psychological indicators, likely derived from posts or text data. Some features include readability indices such as Automated Readability Index (ARI), Coleman Liau Index, and Flesch-Kincaid Grade Level, as well as sentiment analysis scores like sentiment compound, negative, neutral, and positive scores. Additionally, there are features related to psychological aspects such as economic stress, isolation, substance use, and domestic stress. The dataset seems to cover a wide range of linguistic, psychological, and behavioural attributes, potentially suitable for analyzing mental health-related topics in online communities or text data.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This dataset, "A Statistical Research on the Effects of Mental Health on Students' CGPA", was collected through a Google Forms survey of university students in order to examine their current academic situation and mental health.
All the data comes from Malaysia and was collected at IIUM (International Islamic University Malaysia).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is a synthetic version inspired by the original Credit Risk dataset on Kaggle and enriched with additional variables based on Financial Risk for Loan Approval data. SMOTENC was used to simulate new data points to enlarge the instances. The dataset is structured for both categorical and continuous features.
The dataset contains 45,000 records and 14 variables, each described below:
| Column | Description | Type |
|---|---|---|
| person_age | Age of the person | Float |
| person_gender | Gender of the person | Categorical |
| person_education | Highest education level | Categorical |
| person_income | Annual income | Float |
| person_emp_exp | Years of employment experience | Integer |
| person_home_ownership | Home ownership status (e.g., rent, own, mortgage) | Categorical |
| loan_amnt | Loan amount requested | Float |
| loan_intent | Purpose of the loan | Categorical |
| loan_int_rate | Loan interest rate | Float |
| loan_percent_income | Loan amount as a percentage of annual income | Float |
| cb_person_cred_hist_length | Length of credit history in years | Float |
| credit_score | Credit score of the person | Integer |
| previous_loan_defaults_on_file | Indicator of previous loan defaults | Categorical |
| loan_status (target variable) | Loan approval status: 1 = approved; 0 = rejected | Integer |
The dataset can be used for multiple purposes:
Predicting the loan_status variable (approved/not approved) for potential applicants.
Predicting the credit_score variable based on individual and loan-related attributes.
Mind the data issues inherited from the original data, such as records with an age above 100 years.
This dataset provides a rich basis for understanding financial risk factors and simulating predictive modeling processes for loan approval and credit scoring.
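The note about ages above 100 suggests a sanity check before modeling. Below is a minimal sketch, assuming rows are available as dictionaries keyed by the column names in the table; the 100-year threshold comes from the note, while the function name and the sample rows are made up for illustration.

```python
MAX_PLAUSIBLE_AGE = 100  # threshold taken from the note on ages above 100


def split_by_plausible_age(records, max_age=MAX_PLAUSIBLE_AGE):
    """Separate rows whose person_age exceeds max_age from the rest,
    so the suspicious rows can be inspected rather than silently dropped."""
    kept, flagged = [], []
    for row in records:
        if row["person_age"] > max_age:
            flagged.append(row)
        else:
            kept.append(row)
    return kept, flagged


# Made-up rows, for illustration only.
sample_rows = [
    {"person_age": 23.0, "credit_score": 640, "loan_status": 1},
    {"person_age": 144.0, "credit_score": 580, "loan_status": 0},
    {"person_age": 100.0, "credit_score": 700, "loan_status": 1},
]
kept_rows, flagged_rows = split_by_plausible_age(sample_rows)
print(len(kept_rows), "kept;", len(flagged_rows), "flagged")
```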
Famous paintings and their artists. This dataset is published to give students interesting data for practicing SQL.
Photo by Steve Johnson on Unsplash
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides comprehensive metadata on various Kaggle datasets, offering detailed information about the dataset owners, creators, usage statistics, licensing, and more. It can help researchers, data scientists, and Kaggle enthusiasts quickly analyze the key attributes of different datasets on Kaggle. 📚
datasetUrl 🌐: The URL of the Kaggle dataset page. This directs you to the specific dataset's page on Kaggle.
ownerAvatarUrl 🖼️: The URL of the dataset owner's profile avatar on Kaggle.
ownerName 👤: The name of the dataset owner. This can be the individual or organization that created and maintains the dataset.
ownerUrl 🌍: A link to the Kaggle profile page of the dataset owner.
ownerUserId 💼: The unique user ID of the dataset owner on Kaggle.
ownerTier 🎖️: The ownership tier, such as "Tier 1" or "Tier 2," indicating the owner's status or level on Kaggle.
creatorName 👩💻: The name of the dataset creator, which could be different from the owner.
creatorUrl 🌍: A link to the Kaggle profile page of the dataset creator.
creatorUserId 💼: The unique user ID of the dataset creator.
scriptCount 📜: The number of scripts (kernels) associated with this dataset.
scriptsUrl 🔗: A link to the scripts (kernels) page for the dataset, where you can explore related code.
forumUrl 💬: The URL to the discussion forum for this dataset, where users can ask questions and share insights.
viewCount 👀: The number of views the dataset page has received on Kaggle.
downloadCount ⬇️: The number of times the dataset has been downloaded by users.
dateCreated 📅: The date when the dataset was first created and uploaded to Kaggle.
dateUpdated 🔄: The date when the dataset was last updated or modified.
voteButton 👍: The metadata for the dataset's vote button, showing how users interact with the dataset's quality ratings.
categories 🏷️: The categories or tags associated with the dataset, helping users filter datasets based on topics of interest (e.g., "Healthcare," "Finance").
licenseName 🛡️: The name of the license under which the dataset is shared (e.g., "CC0," "MIT License").
licenseShortName 🔑: A short form or abbreviation of the dataset's license name (e.g., "CC0" for Creative Commons Zero).
datasetSize 📦: The size of the dataset in terms of storage, typically measured in MB or GB.
commonFileTypes 📂: A list of common file types included in the dataset (e.g., .csv, .json, .xlsx).
downloadUrl ⬇️: A direct link to download the dataset files.
newKernelNotebookUrl 📝: A link to a new kernel or notebook related to this dataset, for those who wish to explore it programmatically.
newKernelScriptUrl 💻: A link to a new script for running computations or processing data related to the dataset.
usabilityRating 🌟: A rating or score representing how usable the dataset is, based on user feedback.
firestorePath 🔍: A reference to the path in Firestore where this dataset’s metadata is stored.
datasetSlug 🏷️: A URL-friendly version of the dataset name, typically used for URLs.
rank 📈: The dataset's rank based on certain metrics (e.g., downloads, votes, views).
datasource 🌐: The source or origin of the dataset (e.g., government data, private organizations).
medalUrl 🏅: A URL pointing to the dataset's medal or badge, indicating the dataset's quality or relevance.
hasHashLink 🔗: Indicates whether the dataset has a hash link for verifying data integrity.
ownerOrganizationId 🏢: The unique organization ID of the dataset's owner if the owner is an organization rather than an individual.
totalVotes 🗳️: The total number of votes the dataset has received from users, reflecting its popularity or quality.
category_names 📑: A comma-separated string of category names that represent the dataset’s classification.
This dataset is a valuable resource for those who want to analyze Kaggle's ecosystem, discover high-quality datasets, and explore metadata in a structured way. 🌍📊
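Since category_names is described as a comma-separated string, it usually has to be split before filtering or counting by category. A minimal sketch follows, assuming individual category names never contain commas themselves; the function name and the sample value are illustrative only.

```python
def parse_category_names(raw):
    """Split a comma-separated category_names value into a clean list.
    Empty or missing values yield an empty list."""
    if not raw:
        return []
    return [name.strip() for name in raw.split(",") if name.strip()]


# Illustrative value built from the example categories mentioned above.
categories = parse_category_names("Healthcare, Finance")
print(categories)
```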
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a classification-oriented e-commerce text dataset covering 4 categories ("Electronics", "Household", "Books", and "Clothing & Accessories"), which together cover almost 80% of any e-commerce website.
The dataset is in ".csv" format with two columns - the first column is the class name and the second one is the datapoint of that class. The data point is the product and description from the e-commerce website.
The dataset has the following features :
Data Set Characteristics: Multivariate
Number of Instances: 50425
Number of classes: 4
Area: Computer science
Attribute Characteristics: Real
Number of Attributes: 1
Associated Tasks: Classification
Missing Values? No
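The two-column layout described above (class name first, product text second) can be read with Python's standard csv module. This is a minimal sketch, assuming the file has no header row and uses standard CSV quoting; the helper name and the in-memory sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def read_labeled_rows(handle):
    """Yield (label, text) pairs from a two-column CSV.
    Rows that do not have exactly two fields are skipped."""
    for row in csv.reader(handle):
        if len(row) != 2:
            continue
        yield row[0], row[1]


# In-memory stand-in for the real file; the contents are invented.
sample_file = io.StringIO(
    'Books,"A paperback novel, 300 pages"\n'
    "Electronics,USB-C charging cable\n"
    "Books,Hardcover world atlas\n"
)
class_counts = Counter(label for label, _ in read_labeled_rows(sample_file))
print(class_counts)
```

With the real file, the same function can be applied to a handle opened with `open(path, newline="", encoding="utf-8")`, where the path and encoding are assumptions to verify against the actual download.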
Gautam. (2019). E commerce text dataset (version - 2) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3355823
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Image: medals (https://github.com/dean-kg/RoadToExpertRanking_Kaggle/blob/main/kg_medal.png?raw=true)
Dataset Medals are awarded to popular public datasets published to the site, as measured by number of upvotes. Not all upvotes count towards medals: votes by novices are excluded from medal calculation.
Metadata of 42,955 datasets on Kaggle from 2015-12 to 2021-11
Tidied up version of dataset provided by @kukuroo3
Source: https://www.kaggle.com/kukuroo3/dataset-of-kaggle-dataset-include-medalvotecount
Data Set Information:
This is an event log of an incident management process extracted from data gathered from the audit system of an instance of the ServiceNow™ platform used by an IT company. The event log is enriched with data loaded from a relational database underlying a corresponding process-aware information system. Information was anonymized for privacy.
Number of instances: 141,712 events (24,918 incidents)
Number of attributes: 36 attributes (1 case identifier, 1 state identifier, 32 descriptive attributes, 2 dependent variables)
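Because each incident spans several events, analyses of this log typically start by grouping events on the case identifier. The sketch below assumes events are dictionaries; the key names `case_id` and `state` are hypothetical stand-ins for the case identifier and state identifier, and the sample events are invented. The average at the end uses only the two counts stated above.

```python
from collections import defaultdict


def group_events_by_case(events, case_key="case_id"):
    """Collect events into per-incident lists, preserving input order."""
    cases = defaultdict(list)
    for event in events:
        cases[event[case_key]].append(event)
    return dict(cases)


# Invented events; "case_id" and "state" are placeholder names.
sample_events = [
    {"case_id": "INC-1", "state": "New"},
    {"case_id": "INC-2", "state": "New"},
    {"case_id": "INC-1", "state": "Resolved"},
]
traces = group_events_by_case(sample_events)

# Average events per incident, from the counts in the description.
avg_events_per_incident = 141_712 / 24_918
print(len(traces), round(avg_events_per_incident, 2))
```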
Attribute Information:
Relevant Papers:
Amaral, C. A. L., Fantinato, M., Reijers, H. A., Peres, S. M., Enhancing Completion Time Prediction Through Attribute Selection. Proceedings of the 15th International Conference on Advanced Information Technologies for Management (AITM 2018) and 13th International Conference on Information Systems Management (ISM 2018), Revised Selected Papers – Lecture Notes in Business Information Processing, v. 346, pp. 3-23, 2019.
Amaral, C. A. L., Fantinato, M., Peres, S. M., Attribute Selection with Filter and Wrapper: An Application on Incident Management Process. Proceedings of the 14th Federated Conference on Computer Science and Information Systems (FedCSIS 2018), pp. 679-682, 2018.
Maita, A. R. C., Martins, L. C., Paz, C. R. L., Rafferty, L., Hung, P., Peres, S. M., Fantinato, M. A systematic mapping study of process mining. Enterprise Information Systems, v. 12, n. 5, pp. 505-549, 2018.
Citation Request:
Please cite this paper if you use this dataset: Amaral, C. A. L., Fantinato, M., Reijers, H. A., Peres, S. M., Enhancing Completion Time Prediction Through Attribute Selection. Proceedings of the 15th International Conference on Advanced Information Technologies for Management (AITM 2018) and 13th International Conference on Information Systems Management (ISM 2018), Revised Selected Papers – Lecture Notes in Business Information Processing, v. 346, pp. 3-23, 2019.
In astronomy, stellar classification is the classification of stars based on their spectral characteristics. The classification scheme of galaxies, quasars, and stars is one of the most fundamental in astronomy. The early cataloguing of stars and their distribution in the sky led to the understanding that they make up our own galaxy and, once Andromeda was recognized as a galaxy separate from our own, numerous galaxies began to be surveyed as more powerful telescopes were built. This dataset aims to classify stars, galaxies, and quasars based on their spectral characteristics.
The data consists of 100,000 observations of space taken by the SDSS (Sloan Digital Sky Survey). Every observation is described by 17 feature columns and 1 class column, which identifies it as a star, galaxy, or quasar.
1. obj_ID = Object Identifier, the unique value that identifies the object in the image catalog used by the CAS
2. alpha = Right Ascension angle (at J2000 epoch)
3. delta = Declination angle (at J2000 epoch)
4. u = Ultraviolet filter in the photometric system
5. g = Green filter in the photometric system
6. r = Red filter in the photometric system
7. i = Near Infrared filter in the photometric system
8. z = Infrared filter in the photometric system
9. run_ID = Run Number used to identify the specific scan
10. rerun_ID = Rerun Number to specify how the image was processed
11. cam_col = Camera column to identify the scanline within the run
12. field_ID = Field number to identify each field
13. spec_obj_ID = Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class)
14. class = Object class (galaxy, star, or quasar object)
15. redshift = Redshift value based on the increase in wavelength
16. plate = Plate ID, identifies each plate in SDSS
17. MJD = Modified Julian Date, used to indicate when a given piece of SDSS data was taken
18. fiber_ID = Fiber ID that identifies the fiber that pointed the light at the focal plane in each observation
fedesoriano. (January 2022). Stellar Classification Dataset - SDSS17. Retrieved [Date Retrieved] from https://www.kaggle.com/fedesoriano/stellar-classification-dataset-sdss17.
The data released by the SDSS is in the public domain. It is taken from the current data release, DR17. More information about the license: http://www.sdss.org/science/image-gallery/
SDSS Publications: - Abdurro’uf et al., The Seventeenth data release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar and APOGEE-2 DATA (Abdurro’uf et al. submitted to ApJS) [arXiv:2112.02026]
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This is a protein data set retrieved from Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB).
The PDB archive is a repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to each other in the molecule. They then deposit this information, which is then annotated and publicly released into the archive by the wwPDB.
The constantly-growing PDB is a reflection of the research that is happening in laboratories across the world. This can make it both exciting and challenging to use the database in research and education. Structures are available for many of the proteins and nucleic acids involved in the central processes of life, so you can go to the PDB archive to find structures for ribosomes, oncogenes, drug targets, and even whole viruses. However, it can be a challenge to find the information that you need, since the PDB archives so many different structures. You will often find multiple structures for a given molecule, or partial structures, or structures that have been modified or inactivated from their native form.
There are two data files. Both are arranged on "structureId" of the protein:
pdb_data_no_dups.csv contains protein metadata, including details on protein classification, extraction methods, etc.
data_seq.csv contains >400,000 protein structure sequences.
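Since both files are keyed on "structureId", their rows can be combined with a simple lookup join. This is a minimal sketch, assuming rows have already been loaded as dictionaries (for example with csv.DictReader) and that structureId is unique in the metadata file, as the name pdb_data_no_dups.csv suggests; every field name other than structureId, and all sample values, are hypothetical.

```python
def join_on_structure_id(meta_rows, seq_rows):
    """Inner-join sequence rows to metadata rows on structureId.
    One structure may have several sequence rows, so the output can be
    longer than the metadata input."""
    meta_by_id = {row["structureId"]: row for row in meta_rows}
    joined = []
    for seq in seq_rows:
        meta = meta_by_id.get(seq["structureId"])
        if meta is not None:
            joined.append({**meta, **seq})
    return joined


# Invented rows; only "structureId" is a field named in the description.
meta_rows = [
    {"structureId": "AAAA", "classification": "EXAMPLE CLASS"},
    {"structureId": "BBBB", "classification": "OTHER CLASS"},
]
seq_rows = [
    {"structureId": "AAAA", "sequence": "MKT"},
    {"structureId": "AAAA", "sequence": "GSH"},
    {"structureId": "CCCC", "sequence": "AAA"},
]
joined_rows = join_on_structure_id(meta_rows, seq_rows)
print(len(joined_rows))
```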
Original data set downloaded from http://www.rcsb.org/pdb/
The Protein Data Bank has helped the life science community study different diseases and come up with new drugs and solutions that support human survival.
The "Framingham" heart disease dataset includes over 4,240 records, 16 columns, and 15 attributes. The goal of the dataset is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Financial Risk Assessment Dataset provides detailed information on individual financial profiles. It includes demographic, financial, and behavioral data to assess financial risk. The dataset features various columns such as income, credit score, and risk rating, with intentional imbalances and missing values to simulate real-world scenarios.