https://creativecommons.org/publicdomain/zero/1.0/
Federated learning builds machine learning models from data sets that are distributed across multiple devices while preventing data leakage (Q. Yang et al. 2019).
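As a rough illustration of the idea, the sketch below shows federated averaging, one common way to combine locally trained models without moving the raw data. It is an assumption chosen for illustration, not necessarily the method used with these data sets; the function name `federated_average` and the toy weight vectors are hypothetical.

```python
from typing import Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> list:
    """Average client parameter vectors, weighted by local sample counts.

    Only model parameters are shared; the raw records stay on each device.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need exactly one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share the same parameter shape")
        for i, value in enumerate(weights):
            averaged[i] += value * size / total
    return averaged


# Two hypothetical clients: the second holds three times as much data.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In a full training loop this averaging step would run once per round, after each client has trained on its own partition.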
source:
smoking https://www.kaggle.com/datasets/kukuroo3/body-signal-of-smoking license = CC0: Public Domain
heart https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset license = CC0: Public Domain
water https://www.kaggle.com/datasets/adityakadiwal/water-potability license = CC0: Public Domain
customer https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis license = CC0: Public Domain
insurance https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data license = CC0: Public Domain
credit https://www.kaggle.com/datasets/ajay1735/hmeq-data license = CC0: Public Domain
income https://www.kaggle.com/datasets/mastmustu/income license = CC0: Public Domain
machine https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification license = CC0: Public Domain
skin https://www.kaggle.com/datasets/saurabhshahane/lumpy-skin-disease-dataset license = Attribution 4.0 International (CC BY 4.0)
score https://www.kaggle.com/datasets/parisrohan/credit-score-classification?select=train.csv license = CC0: Public Domain
https://creativecommons.org/publicdomain/zero/1.0/
Exercise: Machine Learning Competitions
When you click Run All, the notebook raises an error: "Files doesn't exist". Adding this dataset fixes that. It is the same data as DanB's. Please UPVOTE!
Enjoy!
This dataset was created by Omar Aboelwafa
https://www.usa.gov/government-works/
The 2020-2021 School Learning Modalities dataset provides weekly estimates of school learning modality (including in-person, remote, or hybrid learning) for U.S. K-12 public and independent charter school districts for the 2020-2021 school year, from August 2020 – June 2021.
These data were modeled using multiple sources of input data (see below) to infer the most likely learning modality of a school district for a given week. These data should be considered district-level estimates and may not always reflect true learning modality, particularly for districts in which data are unavailable. If a district reports multiple modality types within the same week, the modality offered for the majority of those days is reflected in the weekly estimate. All school district metadata are sourced from the National Center for Education Statistics (NCES) for 2020-2021.
School learning modality types are defined as follows:
In-Person: All schools within the district offer face-to-face instruction 5 days per week to all students at all available grade levels.
Remote: Schools within the district do not offer face-to-face instruction; all learning is conducted online/remotely to all students at all available grade levels.
Hybrid: Schools within the district offer a combination of in-person and remote learning; face-to-face instruction is offered less than 5 days per week, or only to a subset of students.
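The weekly majority rule described above can be sketched as follows. This is a minimal illustration, not the publisher's code: the function name `weekly_modality` is hypothetical, "majority" is read here as "most frequent", and because the description does not say how ties are resolved, the sketch returns `None` for them.

```python
from collections import Counter


def weekly_modality(daily_modalities):
    """Return the modality reported on the most days of one week.

    Returns None for an empty week or a tie (tie handling is an assumption).
    """
    counts = Counter(daily_modalities).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # the source does not specify a tie-break rule
    return counts[0][0]


# Hypothetical week: three in-person days, one remote, one hybrid.
week = ["In-Person", "In-Person", "Remote", "In-Person", "Hybrid"]
estimate = weekly_modality(week)
```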
Data Information
School learning modality data provided here are model estimates using combined input data and are not guaranteed to be 100% accurate. This learning modality dataset was generated by combining data from four different sources: Burbio [1], MCH Strategic Data [2], the AEI/Return to Learn Tracker [3], and state dashboards [4-20]. These data were combined using a Hidden Markov model which infers the sequence of learning modalities (In-Person, Hybrid, or Remote) for each district that is most likely to produce the modalities reported by these sources. This model was trained using data from the 2020-2021 school year. Metadata describing the location, number of schools and number of students in each district comes from NCES [21]. You can read more about the model in the CDC MMWR: COVID-19–Related School Closures and Learning Modality Changes — United States, August 1–September 17, 2021. The metrics listed for each school learning modality reflect totals by district and the number of enrolled students per district for which data are available. School districts represented here exclude private schools and include the following NCES subtypes:
Public school district that is NOT a component of a supervisory union
Public school district that is a component of a supervisory union
Independent charter district
“BI” in the state column refers to school districts funded by the Bureau of Indian Education.
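To make the Hidden Markov model step in this section more concrete, here is a small Viterbi decoder that recovers the most likely sequence of hidden modalities from a sequence of reported ones. It is only a sketch under strong assumptions: it uses a single stream of reports rather than the four combined sources, and every probability below is invented for illustration rather than taken from the published model.

```python
import math

STATES = ("In-Person", "Hybrid", "Remote")


def viterbi(observations, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence (log-space Viterbi)."""
    if not observations:
        return []
    # Best log-probability of any path ending in each state, per time step.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in STATES}]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best final state back to the start.
    path = [max(STATES, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Invented parameters: modalities tend to persist, reports are mostly right.
start = {s: 1 / 3 for s in STATES}
trans = {a: {b: 0.9 if a == b else 0.05 for b in STATES} for a in STATES}
emit = {a: {b: 0.8 if a == b else 0.1 for b in STATES} for a in STATES}

reported = ["Remote", "Remote", "In-Person", "Remote", "Remote"]
decoded = viterbi(reported, start, trans, emit)
```

With these made-up numbers the single "In-Person" report in week three is treated as noise, so the decoded sequence stays "Remote" throughout.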
Technical Notes
Data from September 1, 2020 to June 25, 2021 correspond to the 2020-2021 school year. During this timeframe, all four sources of data were available. Inferred modalities with a probability below 0.75 were deemed inconclusive and were omitted. Data for the month of July may show “In Person” status although most school districts are effectively closed during this time for summer break. Users may wish to exclude July data from use for this reason where applicable.
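The two filtering rules in these notes can be sketched as below. The field names `week` and `probability` and the sample records are hypothetical, since the column layout is not given here; the published data already omit low-probability estimates, so the threshold check only illustrates the rule.

```python
from datetime import date

MIN_PROBABILITY = 0.75  # estimates below this were deemed inconclusive


def usable_estimates(records, exclude_july=True):
    """Keep conclusive estimates, optionally dropping July (summer break)."""
    kept = []
    for record in records:
        if record["probability"] < MIN_PROBABILITY:
            continue
        if exclude_july and record["week"].month == 7:
            continue
        kept.append(record)
    return kept


# Hypothetical records for one district.
sample = [
    {"week": date(2021, 5, 2), "modality": "Hybrid", "probability": 0.91},
    {"week": date(2021, 5, 9), "modality": "Hybrid", "probability": 0.60},
    {"week": date(2021, 7, 4), "modality": "In Person", "probability": 0.88},
]
filtered = usable_estimates(sample)
```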
Sources
K-12 School Opening Tracker. Burbio 2021; https
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 1,000 job postings for Machine Learning-related roles across the United States, scraped between late 2024 and early 2025. The data was collected directly from company career pages and job boards, focusing on full job descriptions and associated company information.
| Column | Description |
|---|---|
| job_posted_date | The date the job was posted (format: YYYY-MM-DD). |
| company_address_locality | The city or locality of the job or company. |
| company_address_region | The U.S. state or region where the job is located. |
| company_name | The name of the company posting the job. |
| company_website | The official website of the company. |
| company_description | A short description or mission statement of the company. |
| job_description_text | The full job description text as listed in the original posting. |
| seniority_level | The required seniority level (e.g., Internship, Entry level, Mid-Senior). |
| job_title | The full job title listed in the posting. |
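A minimal sketch of reading these columns with the Python standard library might look like this. The inline sample rows are invented and cover only a subset of the columns; in practice you would pass an open file handle for the downloaded CSV, whose file name is not stated here.

```python
import csv
import io
from collections import Counter
from datetime import date

# Invented sample rows using a subset of the documented columns.
SAMPLE_CSV = """job_posted_date,company_name,seniority_level,job_title
2024-12-03,Example Corp,Entry level,Machine Learning Engineer
2025-01-15,Sample Labs,Mid-Senior,Senior ML Scientist
2025-01-20,Example Corp,Entry level,ML Engineer I
"""


def load_postings(handle):
    """Read postings and parse job_posted_date (YYYY-MM-DD) into a date."""
    rows = []
    for row in csv.DictReader(handle):
        row["job_posted_date"] = date.fromisoformat(row["job_posted_date"])
        rows.append(row)
    return rows


postings = load_postings(io.StringIO(SAMPLE_CSV))
by_seniority = Counter(row["seniority_level"] for row in postings)
posted_in_2025 = [row["job_title"] for row in postings
                  if row["job_posted_date"].year == 2025]
```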
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files; for example, folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files; for example, 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
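The folder layout can be expressed as integer arithmetic on the KernelVersions id. The helper below is a sketch with a hypothetical name; it returns the two folder numbers rather than a path string, because whether folder names are zero-padded on disk is not stated here.

```python
def folder_indices(version_id: int) -> tuple:
    """Map a KernelVersions id to (top-level folder, subfolder) numbers."""
    if version_id < 0:
        raise ValueError("version ids are non-negative")
    top_level = version_id // 1_000_000      # 1 million ids per top-level folder
    sub_folder = (version_id // 1_000) % 1_000  # 1 thousand ids per subfolder
    return top_level, sub_folder


# Version 123,456,789 falls in folder 123, subfolder 456.
location = folder_indices(123_456_789)
```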
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
https://creativecommons.org/publicdomain/zero/1.0/
This project aims to develop a personalized course recommendation engine integrated with a Django web application, leveraging machine learning techniques. Utilizing a dataset from Udemy containing course information, the system analyzes user preferences and behaviors to provide tailored recommendations. The recommendation engine employs machine learning algorithms to predict courses that align with the user's interests based on input provided. This project demonstrates the significance of recommendation engines in enhancing user experience, increasing engagement, and driving revenue growth in the competitive digital landscape.
Dataset:
The dataset contains information on 3678 courses available on Udemy, spanning various subjects and levels of difficulty. Here's a description of the columns:
* course_id: Unique identifier for each course.
* course_title: Title of the course.
* url: URL of the course.
* is_paid: Boolean indicating whether the course is paid or not.
* price: Price of the course.
* num_subscribers: Number of subscribers enrolled in the course.
* num_reviews: Number of reviews for the course.
* num_lectures: Number of lectures in the course.
* level: Difficulty level of the course (e.g., Beginner, Intermediate, Advanced).
* content_duration: Duration of the course content.
* published_timestamp: Timestamp indicating when the course was published.
* subject: Subject category of the course.
This dataset provides comprehensive information about Udemy courses, including their popularity (measured by the number of subscribers and reviews), pricing, content duration, and level of difficulty. It covers a wide range of subjects, making it suitable for building a recommendation engine to suggest courses based on user preferences and interests.
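The project description does not say which algorithm the engine uses, so the following is only a stand-in: a tiny content-based ranker that scores courses by word overlap (Jaccard similarity) between a user's query and `course_title`, breaking ties by `num_subscribers`. The function names and the three sample courses are hypothetical.

```python
def tokenize(text):
    """Lower-case a string and split it into a set of words."""
    return set(text.lower().split())


def recommend_courses(query, courses, top_n=3):
    """Rank courses by Jaccard overlap between the query and course_title."""
    query_tokens = tokenize(query)
    scored = []
    for course in courses:
        title_tokens = tokenize(course["course_title"])
        union = query_tokens | title_tokens
        score = len(query_tokens & title_tokens) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, course["num_subscribers"], course["course_title"]))
    # Highest overlap first; more subscribers wins a tie.
    scored.sort(key=lambda item: (-item[0], -item[1]))
    return [title for _, _, title in scored[:top_n]]


# Invented rows using two of the documented columns.
catalog = [
    {"course_title": "Python for Beginners", "num_subscribers": 5000},
    {"course_title": "Advanced Python Programming", "num_subscribers": 3000},
    {"course_title": "Guitar Basics", "num_subscribers": 8000},
]
suggestions = recommend_courses("python programming", catalog)
```

A production engine would more likely use TF-IDF or learned embeddings over titles and subjects; the overlap score here just keeps the example dependency-free.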
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a notebook-style learning resource. Download the whole package and you will find everything you need to learn pandas from basics to advanced, which is exactly what you will need in machine learning and data science. 😄
It gives you an overview of the pandas data analysis tools most often required for data manipulation and for extracting important data.
Use this notebook as your pandas notes. Whenever you forget the code or syntax, open it, scroll through, and you will find the solution. 🥳
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Yousef Saber
Released under MIT
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is designed to support research and development in the field of Augmented Reality (AR)-based English vocabulary learning. It simulates data collected from an AR-based educational platform that combines gamification, interactive features, and real-time feedback to enhance student engagement and learning outcomes.
The dataset includes a range of features such as demographic details (e.g., age, grade level), activity types, AR features used, engagement scores, and performance metrics. The target variable (Post_Test_Category) categorizes students' post-test performance into three levels: Low, Medium, and High.
Key Highlights:
- Focus on AR-driven interactive learning experiences.
- Includes gamified activities and real-world tasks for vocabulary enhancement.
- Tracks pre-test and post-test performance to evaluate learning outcomes.
- Incorporates both objective metrics (accuracy, completion rates) and subjective feedback (engagement scores).
- Suitable for machine learning tasks like classification, clustering, and predictive modeling.
This dataset was created by Devendra Parihar
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for 19 PVC specimens with planted defects. The dataset is available for academic and research use only. Papers using this dataset are kindly requested to cite the paper at https://www.preprints.org/manuscript/202301.0483/v1
DESCRIPTION
Create a model that predicts whether or not a loan will default, using the historical data.
Problem Statement:
For companies like Lending Club, correctly predicting whether or not a loan will default is very important. In this project, using the historical data from 2007 to 2015, you have to build a deep learning model to predict the chance of default for future loans. As you will see later, this dataset is highly imbalanced and includes a lot of features, which makes this problem more challenging.
Domain: Finance
Analysis to be done: Perform data preprocessing and build a deep learning prediction model.
Content:
Dataset columns and definition:
credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment: The monthly installments owed by the borrower if the loan is funded.
log.annual.inc: The natural log of the self-reported annual income of the borrower.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico: The FICO credit score of the borrower.
days.with.cr.line: The number of days the borrower has had a credit line.
revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
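Several of these columns are stored in transformed units, so a short sketch of converting them back may help. The helper name `describe_loan` and the sample row are hypothetical; the conversions follow directly from the definitions above (a proportion for `int.rate`, a natural log for `log.annual.inc`, a monthly amount for `installment`).

```python
import math


def describe_loan(row):
    """Convert stored loan fields into more readable units."""
    return {
        # int.rate is a proportion: 0.11 means 11%.
        "interest_rate_percent": row["int.rate"] * 100,
        # log.annual.inc is a natural log, so exp() recovers the income.
        "annual_income": math.exp(row["log.annual.inc"]),
        # installment is monthly, so twelve of them are owed per year.
        "yearly_installments": row["installment"] * 12,
    }


# Invented example row.
example = describe_loan({
    "int.rate": 0.11,
    "log.annual.inc": math.log(50_000),
    "installment": 300.0,
})
```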
Steps to perform:
Perform exploratory data analysis, then apply feature engineering. Follow up with a deep learning model to predict whether or not the loan will default, using the historical data.
Tasks:
Transform categorical values into numerical values (discrete)
Exploratory data analysis of different factors of the dataset.
Additional Feature Engineering
You will check the correlation between features and will drop those features which have a strong correlation
This will help reduce the number of features and will leave you with the most relevant features
After applying EDA and feature engineering, you are now ready to build the predictive models
In this part, you will create a deep learning model using Keras with a TensorFlow backend
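The first preprocessing tasks above (encoding categorical values and dropping strongly correlated features) could be sketched as follows, in plain Python so the logic is visible. The helper names, the 0.9 threshold, the keep-the-first-feature rule, and the sample values are all assumptions for illustration; the exercise does not prescribe them.

```python
from math import sqrt


def encode_categories(values):
    """Map each distinct label to an integer code (sorted for stability)."""
    codes = {label: i for i, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if it is not strongly correlated with one kept earlier."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Invented values; "fico_copy" is a deliberately redundant feature.
encoded, mapping = encode_categories(["credit_card", "all_other", "credit_card"])
features = {
    "fico": [700, 720, 680, 750],
    "dti": [10, 25, 12, 20],
    "fico_copy": [701, 721, 681, 751],
}
selected = drop_correlated(features)
```

Integer codes impose an artificial ordering on `purpose`; for a neural network, one-hot encoding is often the safer choice.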
This dataset was created by Josue Faneittes
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description:
This dataset comprises a comprehensive collection of 200 000 question-and-answer pairs across 2353 academic subjects. It is designed to serve as a resource for educational research, natural language processing, and machine learning applications.
Data Source:
Subjects Covered: 2353 academic subjects spanning various disciplines.
Total Entries: 200 000 Q&A pairs.
Model Name: Mixtral-8x7B
Dataset Features:
Questions: Text fields containing academic questions.
Answers: Text fields containing the corresponding answers, generated by Mixtral-8x7B.
Usage Examples:
Educational Research: Analyzing trends in academic queries and responses.
NLP Applications: Training and benchmarking NLP models for question-answering systems.
Machine Learning: Supervised learning tasks, such as text classification and answer generation.
Data Quality and Limitations:
Accuracy: Information about the accuracy of the answers, based on validation checks or user feedback.
Bias and Fairness: Any known biases in the dataset, steps taken to mitigate them, and areas where the dataset may not be representative.
Limitations: Areas where the dataset may not be sufficient or appropriate for use (e.g., specific subjects not adequately covered).
The dataset includes images of Kecimen and Besni raisin varieties grown in Turkey, with a total of 900 raisin grains, including 450 pieces from each variety. These images were captured using CVS and underwent various stages of pre-processing. A total of 7 morphological features were extracted from these images and classified using three different artificial intelligence techniques.
Data Fields:
Çinar, İlkay, Koklu, Murat, and Tasdemir, Sakir. (2023). Raisin. UCI Machine Learning Repository. https://doi.org/10.24432/C5660T.
This dataset was created by Nishant Kumar
This dataset was created by Kunaal Naik
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Personalized Recommendation Systems Dataset (150,000 Entries)
This dataset is a fictional representation of user interactions within an e-commerce or streaming platform, created specifically for educational and training purposes. It simulates realistic user behavior and interactions to aid in developing and testing machine learning models for personalized recommendation systems. With 150,000 entries, it offers a rich variety of features suitable for building and evaluating algorithms in recommendation systems, user behavior analysis, and predictive modeling.
Dataset Features:
1. User_ID: A unique identifier for each user (e.g., User_1, User_2, etc.), representing individual profiles on the platform.
2. Item_ID: A unique identifier for each item, such as a product, movie, or song.
3. Category: The type of item interacted with (e.g., Electronics, Books, Music, Movies, etc.), providing insights into user preferences.
4. Rating: User-assigned ratings on a scale of 1.0 to 5.0, reflecting the level of satisfaction with the item.
5. Timestamp: The exact date and time of the interaction, useful for time-based analysis.
6. Price: The price of the item at the time of interaction, recorded in USD.
7. Platform: The platform or device used to interact with the system (e.g., Web, Mobile App, Smart TV, Tablet), capturing multi-device behavior.
8. Location: The geographic region of the user, categorized into areas such as North America, Europe, Asia, etc., for regional behavioral analysis.
Applications:
This dataset is versatile and can be used for:
- Collaborative Filtering Models: Harness user-item interaction data to recommend items based on similar users or items.
- Content-Based Recommendation Systems: Leverage item attributes to generate personalized recommendations.
- User Behavior Analysis: Uncover insights into user preferences, habits, and trends to inform marketing strategies.
- Predictive Modeling: Train machine learning models to predict user preferences or future interactions.
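As one illustration of the collaborative filtering use case, here is a minimal item-based recommender built on cosine similarity between item rating vectors. It is a sketch, not a reference implementation: the function name `recommend` and the toy interactions are hypothetical, and only the User_ID, Item_ID, and Rating fields are used.

```python
from collections import defaultdict
from math import sqrt


def recommend(ratings, user_id, top_n=3):
    """Suggest unseen items similar to those the user already rated.

    ratings: iterable of (user_id, item_id, rating) tuples.
    """
    by_item = defaultdict(dict)  # item -> {user: rating}
    for user, item, rating in ratings:
        by_item[item][user] = rating

    def cosine(a, b):
        common = a.keys() & b.keys()
        if not common:
            return 0.0
        dot = sum(a[u] * b[u] for u in common)
        norm_a = sqrt(sum(v * v for v in a.values()))
        norm_b = sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b)

    seen = {item: users[user_id]
            for item, users in by_item.items() if user_id in users}
    scores = {}
    for candidate, users in by_item.items():
        if candidate in seen:
            continue
        # Weight each similarity by how much the user liked the seen item.
        score = sum(cosine(users, by_item[item]) * rating
                    for item, rating in seen.items())
        if score > 0:
            scores[candidate] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Invented interactions in the dataset's ID style.
interactions = [
    ("User_1", "Item_A", 5.0), ("User_1", "Item_B", 4.0),
    ("User_2", "Item_A", 5.0), ("User_2", "Item_B", 5.0), ("User_2", "Item_C", 4.0),
    ("User_3", "Item_C", 1.0), ("User_3", "Item_D", 5.0),
]
picks = recommend(interactions, "User_1")
```

Item_D is not suggested to User_1 because no user who rated it also rated Item_A or Item_B, so its similarity to everything User_1 has seen is zero.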
Important Note: This dataset is fictional and does not represent real-world data. It has been generated solely for educational and training purposes, making it ideal for students, researchers, and data scientists who want to practice building machine learning models without using sensitive or proprietary data.
Why Use This Dataset?
1. Diverse and Realistic Features: Simulates key aspects of user interaction in modern platforms.
2. Scalable Size: Provides sufficient data for training advanced machine learning models, ensuring robust validation.
3. Rich Metadata: Enables detailed analysis and multiple use cases, from recommendation systems to business analytics.
This dataset is a great resource for exploring personalized recommendations or enhancing machine learning skills in a practical and safe manner.
This dataset was created by Alexis Cook