This dataset was created by NIYIBIGIRA Geredi
This dataset was created by Deepika Ravinutala
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created by Abdelrahman Attiea
Released under Apache 2.0
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Abdirizak MX
Released under CC0: Public Domain
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset is a synthetic yet realistic representation of personal auto insurance data, crafted using real-world statistics. While actual insurance data is sensitive and unavailable for public use, this dataset bridges the gap by offering a safe and practical alternative for building robust data science projects.
Why This Dataset?
- Realistic Foundation: Synthetic data generated from real-world statistical patterns ensures practical relevance.
- Safe for Use: No personal or sensitive information; completely anonymized and compliant with data privacy standards.
- Flexible Applications: Ideal for testing models, developing prototypes, and showcasing portfolio projects.
How You Can Use It:
- Build machine learning models for predicting customer conversion and retention.
- Design risk assessment tools or premium optimization algorithms.
- Create dashboards to visualize trends in customer segmentation and policy data.
- Explore innovative solutions for the insurance industry using a realistic data foundation.
This dataset empowers you to work on real-world insurance scenarios without compromising on data sensitivity.
https://creativecommons.org/publicdomain/zero/1.0/
As a data scientist, you have most likely heard of Towards Data Science (TDS) at some point. It is an amazing publication covering many AI-related topics, providing hands-on project expertise, interesting framework and technology discussions, and the theory behind hundreds of algorithms.
I scraped the archive of TDS from 2018 until 2021 to collect the titles, taglines, URLs and dates of (almost) every article in those years. You can apply various techniques to this data, for instance topic modeling.
If needed, I can also continue labeling this dataset. Just drop me a note about what you'd be interested in, and I'll add labels to this dataset.
Of course, special thanks to Towards Data Science and its editors for providing such great content in their publication. Reading such articles is always a great start to the day for me 😁
Think about ways to make sense of this data. What kind of articles have been published the most? What are the topics of the respective years or months?
Tip: You might also want to think about how you can enrich this data. There are many ways to do so!
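As a starting point for the questions above, here is a minimal sketch that counts the most common title words per year using only the Python standard library. The column names (`title`, `date`), the ISO date format, and the sample rows are assumptions made for illustration; the real file may be laid out differently. Simple word counting is also a much cruder technique than topic modeling, and is only meant as a first look.

```python
import csv
import io
import re
from collections import Counter, defaultdict

# Hypothetical sample rows; the real file's column names and date format may differ.
SAMPLE_CSV = """title,date
A Gentle Introduction to Topic Modeling,2018-03-01
Topic Modeling with Python,2018-07-15
Deep Learning for Time Series,2019-02-10
Time Series Forecasting Basics,2019-11-30
"""

# A tiny, hand-picked stopword list (assumption; extend as needed).
STOPWORDS = {"a", "an", "the", "to", "of", "for", "with", "in", "on", "and"}


def top_title_words(rows, n=3):
    """Return {year: [(word, count), ...]} with the n most common title words per year."""
    counts = defaultdict(Counter)
    for row in rows:
        year = row["date"][:4]  # assumes ISO dates such as 2018-03-01
        words = re.findall(r"[a-z]+", row["title"].lower())
        counts[year].update(w for w in words if w not in STOPWORDS)
    return {year: counter.most_common(n) for year, counter in counts.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
result = top_title_words(rows)
for year, words in sorted(result.items()):
    print(year, words)
```

To run this on the actual data, replace `io.StringIO(SAMPLE_CSV)` with an open file handle and adjust the column names to match the file.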
https://creativecommons.org/publicdomain/zero/1.0/
Context: This synthetic healthcare dataset has been created to serve as a valuable resource for data science, machine learning, and data analysis enthusiasts. It is designed to mimic real-world healthcare data, enabling users to practice, develop, and showcase their data manipulation and analysis skills in the context of the healthcare industry.
Inspiration: The inspiration behind this dataset is rooted in the need for practical and diverse healthcare data for educational and research purposes. Healthcare data is often sensitive and subject to privacy regulations, making it challenging to access for learning and experimentation. To address this gap, I have leveraged Python's Faker library to generate a dataset that mirrors the structure and attributes commonly found in healthcare records. By providing this synthetic data, I hope to foster innovation, learning, and knowledge sharing in the healthcare analytics domain.
Dataset Information: Each column provides specific information about the patient, their admission, and the healthcare services provided, making this dataset suitable for various data analysis and modeling tasks in the healthcare domain. Here's a brief explanation of each column in the dataset:
- Name: This column represents the name of the patient associated with the healthcare record.
- Age: The age of the patient at the time of admission, expressed in years.
- Gender: Indicates the gender of the patient, either "Male" or "Female."
- Blood Type: The patient's blood type, which can be one of the common blood types (e.g., "A+", "O-", etc.).
- Medical Condition: This column specifies the primary medical condition or diagnosis associated with the patient, such as "Diabetes," "Hypertension," "Asthma," and more.
- Date of Admission: The date on which the patient was admitted to the healthcare facility.
- Doctor: The name of the doctor responsible for the patient's care during their admission.
- Hospital: Identifies the healthcare facility or hospital where the patient was admitted.
- Insurance Provider: This column indicates the patient's insurance provider, which can be one of several options, including "Aetna," "Blue Cross," "Cigna," "UnitedHealthcare," and "Medicare."
- Billing Amount: The amount of money billed for the patient's healthcare services during their admission. This is expressed as a floating-point number.
- Room Number: The room number where the patient was accommodated during their admission.
- Admission Type: Specifies the type of admission, which can be "Emergency," "Elective," or "Urgent," reflecting the circumstances of the admission.
- Discharge Date: The date on which the patient was discharged from the healthcare facility, based on the admission date and a random number of days within a realistic range.
- Medication: Identifies a medication prescribed or administered to the patient during their admission. Examples include "Aspirin," "Ibuprofen," "Penicillin," "Paracetamol," and "Lipitor."
- Test Results: Describes the results of a medical test conducted during the patient's admission. Possible values include "Normal," "Abnormal," or "Inconclusive," indicating the outcome of the test.
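To make the Discharge Date rule concrete, here is a minimal sketch of how a record with a few of the columns above could be generated. It is not the author's generator: the dataset was built with Faker, whereas this sketch uses only Python's standard `random` and `datetime` modules. The 1 to 30 day length of stay, the admission date window, and the billing range are assumptions, since the description only says "a realistic range".

```python
import random
from datetime import date, timedelta

ADMISSION_TYPES = ["Emergency", "Elective", "Urgent"]
TEST_RESULTS = ["Normal", "Abnormal", "Inconclusive"]


def make_record(rng, earliest=date(2020, 1, 1), span_days=365, max_stay_days=30):
    """Build one synthetic admission record with a subset of the dataset's columns."""
    # Pick an admission date inside an assumed one-year window.
    admitted = earliest + timedelta(days=rng.randrange(span_days))
    # Discharge = admission date plus a random length of stay.
    # The 1..max_stay_days range is an assumption, not taken from the source.
    discharged = admitted + timedelta(days=rng.randint(1, max_stay_days))
    return {
        "Date of Admission": admitted,
        "Discharge Date": discharged,
        "Admission Type": rng.choice(ADMISSION_TYPES),
        "Billing Amount": round(rng.uniform(1000.0, 50000.0), 2),  # assumed range
        "Test Results": rng.choice(TEST_RESULTS),
    }


rng = random.Random(42)  # seeded so the sketch is reproducible
records = [make_record(rng) for _ in range(5)]
for record in records:
    print(record)
```

Passing a `random.Random` instance rather than using the module-level functions keeps the generator reproducible and easy to test.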
Usage Scenarios: This dataset can be utilized for a wide range of purposes, including:
- Developing and testing healthcare predictive models.
- Practicing data cleaning, transformation, and analysis techniques.
- Creating data visualizations to gain insights into healthcare trends.
- Learning and teaching data science and machine learning concepts in a healthcare context.
- Treating it as a multi-class classification problem and predicting Test Results, which contains three categories (Normal, Abnormal, and Inconclusive).
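For the multi-class classification scenario, it can help to know what accuracy a trivial classifier reaches before training anything. The sketch below computes a majority-class baseline; the label list is a hypothetical stand-in for the Test Results column, not real values from the dataset.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting that label)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical labels standing in for the "Test Results" column.
labels = ["Normal", "Abnormal", "Normal", "Inconclusive", "Normal", "Abnormal"]
label, accuracy = majority_baseline(labels)
print(label, accuracy)
```

A trained model is only informative if it clearly beats this number on held-out data.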
Acknowledgments: Image credit: Image by BC Y from Pixabay
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)
License information was derived automatically
The dataset includes 405,100 publications, 139,005 phase diagrams, 409,771 crystalline nanostructures, 1,075,676 physical property sets, and 189,682 material phases. It integrates decades of scientific research on inorganic materials, enabling computational materials design, machine learning applications, and materials informatics studies across industry and academia.
Built on data extracted from about half a million peer-reviewed scientific publications, it offers standardized data, detailed chemical structures, crystal structures, and extensive metadata on various materials.
The dataset helps researchers and engineers advance scientific discovery, predict materials behavior, and accelerate materials innovation through data-driven research.
It allows researchers and engineers to explore computational chemistry and develop machine learning models for predicting materials behavior. By combining raw data, experimental records, and computational analyses, MPDS helps scientists and materials experts design new compounds, identify similar materials, and optimize materials properties for engineering applications.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created by Asraf28
Released under Apache 2.0
I've been creating videos on YouTube since November of 2017 (https://www.youtube.com/c/KenJee1) with the mission of making data science accessible to more people. One of the best ways to do this is to tell stories and work on projects. This is my attempt at my first community project. I am making my YouTube data available for everyone to help better understand the growth of my YouTube community and think about ways that it could be improved! I would love for everyone in the community to feel like they had some hand in contributing to the channel.
Announcement Video: https://youtu.be/YPph59-rTxA
I will be sharing my favorite projects in a few of my videos (with permission of course), and would also like to give away a few small prizes to the top featured notebooks. I hope you have fun with the analysis; I'm interested in seeing what you find in the data!
For those looking for a place to start, some things I'm thinking about are:
- What are the themes of the comment data?
- What types of video titles and thumbnails drive the most traffic?
- Who is my core audience and what are they interested in?
- What types of videos have led to the most growth?
- What type of content are people engaging with the most or watching the longest?
Some advanced projects could be:
- Creating a chat bot to respond to common comments with videos where I have addressed a topic
- Pulling sentiment from thumbnails and titles and comparing that with performance
Data I would like to add over time:
- Video descriptions
- Video subtitles
- Actual video data
There are four files in this repo. The relevant data included in most of them is from Nov 2017 - Jan 2022. I gathered some of this data via the YouTube API and the rest from my specific analytics.
1) Aggregated Metrics By Video - This has all the topline metrics from my channel from its start (around 2015 to Jan 22 2022). I didn't post my first video until around
2) Aggregated Metrics By Video with Country and Subscriber Status - This has the same data as aggregated metrics by video, but it includes dimensions for which country people are viewing from and if the viewers are subscribed to the channel or not.
3) Video Performance Over Time - This has the daily data from each of my videos.
4) All Comments - This is all of my comment data gathered from the YouTube API. I have anonymized the users so don't worry about your name showing up!
This obviously wouldn't be possible without all of the wonderful people who watch and interact with my videos! I'm incredibly grateful for you all and I'm so happy I can share this project with you!
I collected this data from the YouTube API and through my own google analytics. Any use of it must therefore comply with the YouTube API's terms of service: https://developers.google.com/youtube/terms/api-services-terms-of-service
This dataset was created by Pawan Kumar
This dataset has been scraped from Glassdoor.com and contains comprehensive information about job offers. It is designed for those looking to analyze job market trends, salary estimates, company ratings, and other relevant job-related data.
This dataset includes a collection of 1000 job listings, providing a wide range of details for each job offer. It is freely accessible and can be used for various analytical purposes, including salary analysis, job trend research, and company evaluation.
Feel free to use this dataset for your analysis and projects. Whether you are studying salary trends, job market patterns, or company ratings, this dataset provides a rich source of information to support your work.
This dataset was created by Keval joshi
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.
Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.
Figure: Kaggle Leaderboard Performance (https://imgur.com/2Egeb8R.png)
This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.
Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.
In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here: https://www.kaggle.com/datasets/kaggle/meta-kaggle-code
We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
This dataset was originally collected for a data science and machine learning project that aimed at investigating the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.
The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.
This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.
The following is the Google Colab link to the project, done on Jupyter Notebook -
https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN
The following is the GitHub Repository of the project -
https://github.com/daerkns/social-media-and-mental-health
Libraries used for the project -
pandas
NumPy
Matplotlib
Seaborn
scikit-learn
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
We utilized a dataset of Machine Design materials, which includes information on their mechanical properties. The dataset was obtained from the Autodesk Material Library and comprises 15 columns, also referred to as features/attributes. This is a real-world dataset, and it does not contain any random values. However, due to missing values, we only utilized seven of these columns for our ML model. You can access the related GitHub repository here: https://github.com/purushottamnawale/material-selection-using-machine-learning
To develop an ML model, we employed several Python libraries, including NumPy, pandas, scikit-learn, and graphviz, in addition to other technologies such as Weka, MS Excel, VS Code, Kaggle, Jupyter Notebook, and GitHub. We employed Weka software to swiftly visualize the data and comprehend the relationships between the features, without requiring any programming expertise.
My problem statement is material selection for an EV chassis. If you have any specific ideas, feel free to implement them and share the code on Kaggle.
A detailed research paper is available at https://iopscience.iop.org/article/10.1088/1742-6596/2601/1/012014
This dataset was created by Shekhar Parcha
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset was created by Hussein Al Chami
Released under MIT
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset provides a comprehensive collection of computed properties for a wide range of materials, sourced from the Materials Project database. Each entry represents a unique material, identified by a material_id, and includes detailed information about its chemical composition and physical properties. These properties are calculated using density functional theory (DFT), a widely used computational method in materials science for predicting material behavior. The dataset is ideal for researchers, data scientists, and machine learning practitioners interested in materials discovery, property prediction, and exploratory analysis.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
This dataset provides a comprehensive collection of time series data sourced from the World Bank Open Data Platform, covering a wide range of global indicators from 1960 to the most recently published year. It includes economic, social, environmental, and demographic metrics, making it an ideal resource for researchers, data scientists, and policymakers interested in global development trends, economic forecasting, or socio-economic analysis.
A tutorial on how to combine the dataset topics into one large dataset can be found here
My motivation for this project was to curate a high-quality collection of datasets for World Bank indicators organized by topic and structured as time series, making them more accessible for data science projects. Since the World Bank’s Kaggle datasets (https://www.kaggle.com/organizations/theworldbank) have not been updated since 2019, I saw an opportunity to provide more current data for the data analysis community.
This collection brings together more than 800 World Bank indicators organized into 18 topic‑specific CSV files. Each file is structured as a country‑year panel: every row represents a unique combination of year (1960‑present) and ISO‑3 country code, while the columns hold the topic’s indicators.
The collection includes datasets with a variety of indicators, such as:
- Economic Metrics: GDP growth (%), GDP per capita, consumer price inflation, merchandise trade, gross capital formation, and more.
- Social Metrics: School enrollment (primary, secondary, tertiary), infant mortality rate, maternal mortality rate, poverty headcount, and more.
- Environmental Metrics: Forest area, renewable energy consumption, food production indices, and more.
- Demographic Metrics: Urban population, life expectancy, net migration, and more.
This dataset is ideal for a variety of applications, including:
- Economic forecasting and trend analysis (e.g., GDP growth, inflation).
- Socio-economic studies (e.g., education, health, poverty).
- Environmental impact analysis (e.g., renewable energy adoption).
- Demographic research (e.g., population trends, migration).
Topic datasets can be merged with each other using year and country code. This tutorial with notebook code can help you get started quickly.
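The merge on year and country code can be sketched with the standard library alone. The column names (`year`, `country_code`, `gdp_growth`, `life_expectancy`), the placeholder country codes, and all values below are made up for illustration and are not taken from the actual files; check the real headers before adapting this.

```python
import csv
import io

# Hypothetical extracts of two topic files; column names and values are placeholders.
ECONOMY_CSV = """year,country_code,gdp_growth
2020,AAA,1.0
2021,AAA,2.0
"""
HEALTH_CSV = """year,country_code,life_expectancy
2020,AAA,70.0
2021,AAA,71.0
2021,BBB,65.0
"""


def read_panel(text, keys=("year", "country_code")):
    """Index the rows of a country-year panel by their (year, country code) key."""
    reader = csv.DictReader(io.StringIO(text))
    return {tuple(row[k] for k in keys): row for row in reader}


def outer_merge(left, right):
    """Outer-join two panels; an indicator missing on one side is simply absent."""
    merged = {}
    for key in sorted(set(left) | set(right)):
        merged[key] = {**left.get(key, {}), **right.get(key, {})}
    return merged


merged = outer_merge(read_panel(ECONOMY_CSV), read_panel(HEALTH_CSV))
for key, row in merged.items():
    print(key, row)
```

An outer join is used here so that country-year pairs present in only one topic file are kept rather than silently dropped. With pandas installed, `DataFrame.merge` with `on=[...]` and `how="outer"` does the same job more concisely.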
The data is collected via a custom software application that discovers and groups high-quality indicators with rules-based logic & artificial intelligence, generates metadata, and performs ETL for the data from the World Bank API. The result is a clean, up‑to‑date collection of World Bank indicators in time-series format that is ready for analysis—no manual downloads or data wrangling required.
The original World Bank data has been aggregated and transformed for ease of use. Missing values have been preserved as provided by the World Bank, and no significant transformations have been applied beyond formatting and aggregation into a single file.
The World Bank: World Development Indicators
This dataset is publicly available and sourced from the World Bank Open Data Platform and is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. When using this data, please attribute the World Bank as follows: "Data sourced from the World Bank, licensed under CC BY 4.0." For more details on the World Bank’s terms of use, visit: https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets.
This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Feel free to use this data in Kaggle notebooks, academic research, or policy analysis. If you create a derived dataset or analysis, I encourage you to share it with the Kaggle community.
This dataset was created by NIYIBIGIRA Geredi