100+ datasets found
  1. Datasets for federated learning

    • kaggle.com
    Updated Dec 29, 2022
    Cite
    wonghoitin (2022). Datasets for federated learning [Dataset]. https://www.kaggle.com/datasets/wonghoitin/datasets-for-federated-learning
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 29, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    wonghoitin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Federated learning builds machine learning models from data sets that are distributed across multiple devices while preventing data leakage (Q. Yang et al., 2019).

    source:

    1. smoking https://www.kaggle.com/datasets/kukuroo3/body-signal-of-smoking license = CC0: Public Domain

    2. heart https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset license = CC0: Public Domain

    3. water https://www.kaggle.com/datasets/adityakadiwal/water-potability license = CC0: Public Domain

    4. customer https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis license = CC0: Public Domain

    5. insurance https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data license = CC0: Public Domain

    6. credit https://www.kaggle.com/datasets/ajay1735/hmeq-data license = CC0: Public Domain

    7. income https://www.kaggle.com/datasets/mastmustu/income license = CC0: Public Domain

    8. machine https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification license = CC0: Public Domain

    9. skin https://www.kaggle.com/datasets/saurabhshahane/lumpy-skin-disease-dataset license = Attribution 4.0 International (CC BY 4.0)

    10. score https://www.kaggle.com/datasets/parisrohan/credit-score-classification?select=train.csv license = CC0: Public Domain
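
    The definition of federated learning in the description can be illustrated with a minimal sketch of one round of weighted parameter averaging, in the style of the FedAvg scheme. This is an illustration chosen here, not something the dataset page prescribes. The two clients, their toy data, and the one-parameter model are all hypothetical; the point is that only fitted parameters and sample counts leave each client, never the raw rows.

```python
def local_fit(xs, ys):
    """Least-squares slope for y ~ w * x, computed on one client's private data."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den


def federated_average(client_weights, client_sizes):
    """Average client parameters, weighting each client by its sample count."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total


# Hypothetical clients: each tuple holds (features, targets) that stay on the device.
clients = [
    ([1.0, 2.0], [2.0, 4.0]),
    ([1.0, 2.0, 3.0, 4.0], [3.0, 6.0, 9.0, 12.0]),
]

# Only the fitted slope and the number of samples are sent to the server.
weights = [local_fit(xs, ys) for xs, ys in clients]
sizes = [len(xs) for xs, _ in clients]
global_w = federated_average(weights, sizes)
print(round(global_w, 4))
```

    A real system would repeat this round many times and send the averaged parameters back to the clients for further local training.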

  2. home data for ml course

    • kaggle.com
    zip
    Updated Aug 27, 2019
    Cite
    Julián Pérez Pesce (2019). home data for ml course [Dataset]. https://www.kaggle.com/datasets/estrotococo/home-data-for-ml-course
    Explore at:
    zip (199207 bytes)
    Dataset updated
    Aug 27, 2019
    Authors
    Julián Pérez Pesce
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Exercise: Machine Learning Competitions

    When you click Run All, the notebook gives you the error "Files doesn't exist". This dataset fixes that. It is the same data as DanB's. Please upvote!

    Enjoy!

  3. Kaggle's learning path map

    • kaggle.com
    Updated May 1, 2022
    Cite
    Omar Aboelwafa (2022). Kaggle's learning path map [Dataset]. https://www.kaggle.com/datasets/omaraboelwafa/kaggles-learning-path-map/discussion
    Dataset updated
    May 1, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Omar Aboelwafa
    Description

    Dataset

    This dataset was created by Omar Aboelwafa

    Contents

  4. School Learning Modalities, 2020-2021

    • kaggle.com
    • healthdata.gov
    • +3 more
    Updated Oct 26, 2023
    + more versions
    Cite
    Mohamad Javad Babaei (2023). School Learning Modalities, 2020-2021 [Dataset]. https://www.kaggle.com/datasets/mhmtaha/school-learning-modalities-2020-2021
    Dataset updated
    Oct 26, 2023
    Dataset provided by
    Kaggle
    Authors
    Mohamad Javad Babaei
    License

    https://www.usa.gov/government-works/

    Description

    The 2020-2021 School Learning Modalities dataset provides weekly estimates of school learning modality (in-person, remote, or hybrid learning) for U.S. K-12 public and independent charter school districts for the 2020-2021 school year, from August 2020 to June 2021.

    These data were modeled using multiple sources of input data (see below) to infer the most likely learning modality of a school district for a given week. These data should be considered district-level estimates and may not always reflect true learning modality, particularly for districts in which data are unavailable. If a district reports multiple modality types within the same week, the modality offered for the majority of those days is reflected in the weekly estimate. All school district metadata are sourced from the National Center for Educational Statistics (NCES) for 2020-2021.
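
    The majority rule for weeks with several reported modalities can be sketched as follows. The function name and the sample week are hypothetical, and the page does not say how ties are resolved, so this sketch simply keeps the modality that appears first among the tied ones.

```python
from collections import Counter


def weekly_modality(daily_modalities):
    """Return the modality reported on the most days of one week.

    Assumption: ties go to the modality seen first; the source does not
    describe tie handling.
    """
    counts = Counter(daily_modalities)
    modality, _ = counts.most_common(1)[0]
    return modality


# Hypothetical week: three in-person days, one hybrid day, one remote day.
week = ["In-Person", "Hybrid", "In-Person", "In-Person", "Remote"]
print(weekly_modality(week))
```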

    School learning modality types are defined as follows:

    • In-Person: All schools within the district offer face-to-face instruction 5 days per week to all students at all available grade levels.
    • Remote: Schools within the district do not offer face-to-face instruction; all learning is conducted online/remotely to all students at all available grade levels.
    • Hybrid: Schools within the district offer a combination of in-person and remote learning; face-to-face instruction is offered less than 5 days per week, or only to a subset of students.

    Data Information

    School learning modality data provided here are model estimates using combined input data and are not guaranteed to be 100% accurate. This learning modality dataset was generated by combining data from four different sources: Burbio [1], MCH Strategic Data [2], the AEI/Return to Learn Tracker [3], and state dashboards [4-20]. These data were combined using a Hidden Markov model which infers the sequence of learning modalities (In-Person, Hybrid, or Remote) for each district that is most likely to produce the modalities reported by these sources. This model was trained using data from the 2020-2021 school year. Metadata describing the location, number of schools and number of students in each district comes from NCES [21]. You can read more about the model in the CDC MMWR: COVID-19–Related School Closures and Learning Modality Changes — United States, August 1–September 17, 2021. The metrics listed for each school learning modality reflect totals by district and the number of enrolled students per district for which data are available. School districts represented here exclude private schools and include the following NCES subtypes:

    • Public school district that is NOT a component of a supervisory union
    • Public school district that is a component of a supervisory union
    • Independent charter district

    “BI” in the state column refers to school districts funded by the Bureau of Indian Education.
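
    How a Hidden Markov model can recover the most likely sequence of modalities from noisy reports may be easier to see in a small sketch. The code below uses Viterbi decoding, a standard way to find the most likely state sequence in such a model. It is not the CDC model: it uses a single reporting source instead of four, and every probability in it is made up for illustration.

```python
import math

STATES = ["In-Person", "Hybrid", "Remote"]


def viterbi(observations, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observed labels."""
    # best[t][s] is the log-probability of the best path ending in state s at week t.
    best = [{
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in STATES
    }]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev, score = max(
                ((p, best[-1][p] + math.log(trans_p[p][s])) for p in STATES),
                key=lambda item: item[1],
            )
            scores[s] = score + math.log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters: a district keeps its modality from week to week
# with probability 0.8, and the source reports the true modality 80% of the time.
start_p = {s: 1 / 3 for s in STATES}
trans_p = {a: {b: 0.8 if a == b else 0.1 for b in STATES} for a in STATES}
emit_p = {a: {b: 0.8 if a == b else 0.1 for b in STATES} for a in STATES}

reported = ["Remote", "Remote", "Hybrid", "Remote", "Remote"]
decoded = viterbi(reported, start_p, trans_p, emit_p)
print(decoded)
```

    With these made-up numbers the isolated "Hybrid" report is treated as noise, because one misreport is more probable than two modality switches in consecutive weeks.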

    Technical Notes

    Data from September 1, 2020 to June 25, 2021 correspond to the 2020-2021 school year. During this timeframe, all four sources of data were available. Inferred modalities with a probability below 0.75 were deemed inconclusive and were omitted. Data for the month of July may show “In Person” status although most school districts are effectively closed during this time for summer break. Users may wish to exclude July data from use for this reason where applicable.
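
    The two filtering rules in these notes (dropping inferences below 0.75 and optionally excluding July) can be sketched as below. The published data already omit the inconclusive inferences, so this only illustrates the rule; the row layout and the function name are hypothetical.

```python
from datetime import date

THRESHOLD = 0.75  # inferences below this probability were deemed inconclusive


def keep_estimate(week_start, probability, drop_july=True):
    """Return True if a weekly estimate passes both rules from the technical notes."""
    if probability < THRESHOLD:
        return False
    if drop_july and week_start.month == 7:
        return False
    return True


# Hypothetical rows: (week start, inferred modality, model probability).
rows = [
    (date(2021, 5, 3), "Hybrid", 0.91),
    (date(2021, 5, 10), "Hybrid", 0.62),
    (date(2021, 7, 5), "In-Person", 0.88),
]
kept = [row for row in rows if keep_estimate(row[0], row[2])]
print(kept)
```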

    Sources

    K-12 School Opening Tracker. Burbio 2021; https

  5. Machine Learning Job Postings in the US

    • kaggle.com
    • opendatabay.com
    Updated Apr 20, 2025
    Cite
    Ivan Kumeyko (2025). Machine Learning Job Postings in the US [Dataset]. https://www.kaggle.com/datasets/ivankmk/thousand-ml-jobs-in-usa
    Dataset updated
    Apr 20, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ivan Kumeyko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    This dataset contains 1,000 job postings for Machine Learning-related roles across the United States, scraped between late 2024 and early 2025. The data was collected directly from company career pages and job boards, focusing on full job descriptions and associated company information.

    Column Descriptions

    Column: Description
    job_posted_date: The date the job was posted (format: YYYY-MM-DD).
    company_address_locality: The city or locality of the job or company.
    company_address_region: The U.S. state or region where the job is located.
    company_name: The name of the company posting the job.
    company_website: The official website of the company.
    company_description: A short description or mission statement of the company.
    job_description_text: The full job description text as listed in the original posting.
    seniority_level: The required seniority level (e.g., Internship, Entry level, Mid-Senior).
    job_title: The full job title listed in the posting.
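
    As a rough sketch of how these columns might be read, the snippet below parses the YYYY-MM-DD posting date and counts postings per seniority level using only the standard library. The two sample rows are made up, only a few of the listed columns are included, and the real file name and delimiter are not given on this page.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Made-up rows that reuse some of the column names listed above.
SAMPLE = """job_posted_date,company_name,seniority_level,job_title
2024-12-01,Example Corp,Entry level,ML Engineer
2025-01-15,Sample Inc,Mid-Senior,Senior ML Engineer
"""


def load_postings(handle):
    """Read postings and convert job_posted_date to a date object."""
    rows = []
    for row in csv.DictReader(handle):
        row["job_posted_date"] = datetime.strptime(
            row["job_posted_date"], "%Y-%m-%d"
        ).date()
        rows.append(row)
    return rows


postings = load_postings(io.StringIO(SAMPLE))
by_level = Counter(p["seniority_level"] for p in postings)
print(by_level)
```
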
  6. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Jun 5, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    zip (143722388562 bytes)
    Dataset updated
    Jun 5, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files; for example, folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files; for example, 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have far fewer than 1 thousand files because private and interactive sessions are excluded.
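
    The folder layout described above amounts to simple integer arithmetic on the version id. The helper below is a sketch with a hypothetical name; it assumes folder names are not zero-padded, which the description does not state.

```python
def version_folder(kernel_version_id):
    """Return the 'top/sub' folder that should hold a given KernelVersions id.

    Assumption: folder names carry no zero padding (e.g. '123/7', not '123/007').
    """
    top = kernel_version_id // 1_000_000          # millions -> top-level folder
    sub = (kernel_version_id // 1_000) % 1_000    # thousands -> subfolder
    return f"{top}/{sub}"


print(version_folder(123_456_789))
```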

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  7. Udemy Course Recommender System: Unlocking Persona

    • kaggle.com
    Updated Apr 19, 2024
    Cite
    Nayana Ck (2024). Udemy Course Recommender System: Unlocking Persona [Dataset]. https://www.kaggle.com/datasets/nayanack/udemy-courses
    Dataset updated
    Apr 19, 2024
    Dataset provided by
    Kaggle
    Authors
    Nayana Ck
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Objective :

    This project aims to develop a personalized course recommendation engine integrated with a Django web application, leveraging machine learning techniques. Utilizing a dataset from Udemy containing course information, the system analyzes user preferences and behaviors to provide tailored recommendations. The recommendation engine employs machine learning algorithms to predict courses that align with the user's interests based on input provided. This project demonstrates the significance of recommendation engines in enhancing user experience, increasing engagement, and driving revenue growth in the competitive digital landscape.

    Dataset:

    The dataset contains information on 3678 courses available on Udemy, spanning various subjects and levels of difficulty. Here's a description of the columns:

    • course_id: Unique identifier for each course.
    • course_title: Title of the course.
    • url: URL of the course.
    • is_paid: Boolean indicating whether the course is paid or not.
    • price: Price of the course.
    • num_subscribers: Number of subscribers enrolled in the course.
    • num_reviews: Number of reviews for the course.
    • num_lectures: Number of lectures in the course.
    • level: Difficulty level of the course (e.g., Beginner, Intermediate, Advanced).
    • content_duration: Duration of the course content.
    • published_timestamp: Timestamp indicating when the course was published.
    • subject: Subject category of the course.

    This dataset provides comprehensive information about Udemy courses, including their popularity (measured by the number of subscribers and reviews), pricing, content duration, and level of difficulty. It covers a wide range of subjects, making it suitable for building a recommendation engine to suggest courses based on user preferences and interests.

  8. Learn Pandas

    • kaggle.com
    Updated Oct 5, 2023
    Cite
    Vaidik Patel (2023). Learn Pandas [Dataset]. https://www.kaggle.com/datasets/js1js2js3js4js5/learn-pandas/code
    Dataset updated
    Oct 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vaidik Patel
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset offers notebook-style learning. Download the whole package and you will find everything you need to learn pandas from the basics to advanced topics, which is exactly what you will need in machine learning and data science. 😄

    It gives you an overview of the data analysis tools in pandas that are most often required for data manipulation and for extracting important data.

    Use this notebook as notes for pandas. Whenever you forget the code or syntax, open it and scroll through it to find the solution. 🥳

  9. 2021 Kaggle Machine Learning & Data Science Survey

    • kaggle.com
    Updated Aug 27, 2024
    Cite
    Yousef Saber (2024). 2021 Kaggle Machine Learning & Data Science Survey [Dataset]. https://www.kaggle.com/datasets/yousefsaber27/2021-kaggle-machine-learning-and-data-science-survey
    Dataset updated
    Aug 27, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yousef Saber
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Yousef Saber

    Released under MIT

    Contents

  10. AR-Based English Vocabulary Learning Dataset

    • kaggle.com
    Updated Jan 8, 2025
    Cite
    Ziya (2025). AR-Based English Vocabulary Learning Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/ar-based-english-vocabulary-learning-dataset
    Dataset updated
    Jan 8, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed to support research and development in the field of Augmented Reality (AR)-based English vocabulary learning. It simulates data collected from an AR-based educational platform that combines gamification, interactive features, and real-time feedback to enhance student engagement and learning outcomes.

    The dataset includes a range of features such as demographic details (e.g., age, grade level), activity types, AR features used, engagement scores, and performance metrics. The target variable (Post_Test_Category) categorizes students' post-test performance into three levels: Low, Medium, and High.

    Key Highlights:

    • Focus on AR-driven interactive learning experiences.
    • Includes gamified activities and real-world tasks for vocabulary enhancement.
    • Tracks pre-test and post-test performance to evaluate learning outcomes.
    • Incorporates both objective metrics (accuracy, completion rates) and subjective feedback (engagement scores).
    • Suitable for machine learning tasks like classification, clustering, and predictive modeling.

  11. Machine Learning Datasets

    • kaggle.com
    Updated Oct 6, 2022
    Cite
    Devendra Parihar (2022). Machine Learning Datasets [Dataset]. https://www.kaggle.com/datasets/dev523/machine-learning-datasets
    Dataset updated
    Oct 6, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Devendra Parihar
    Description

    Dataset

    This dataset was created by Devendra Parihar

    Contents

  12. PVC-Infrared dataset for deep learning

    • kaggle.com
    Updated Dec 25, 2023
    Cite
    Ziang Wei (2023). PVC-Infrared dataset for deep learning [Dataset]. https://www.kaggle.com/datasets/ziangwei/irtpvc
    Dataset updated
    Dec 25, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ziang Wei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for 19 PVC specimens with planted defects. The dataset is available for academic and research use only. Papers using this dataset are kindly requested to refer to the paper at https://www.preprints.org/manuscript/202301.0483/v1

  13. Lending Club Loan Data Analysis - Deep Learning

    • kaggle.com
    Updated Aug 9, 2023
    Cite
    Deependra Verma (2023). Lending Club Loan Data Analysis - Deep Learning [Dataset]. https://www.kaggle.com/datasets/deependraverma13/lending-club-loan-data-analysis-deep-learning
    Dataset updated
    Aug 9, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Deependra Verma
    Description

    DESCRIPTION

    Create a model that predicts whether or not a loan will default using the historical data.

    Problem Statement:

    For companies like Lending Club, correctly predicting whether or not a loan will default is very important. In this project, using the historical data from 2007 to 2015, you have to build a deep learning model to predict the chance of default for future loans. As you will see later, this dataset is highly imbalanced and includes many features, which makes the problem more challenging.

    Domain: Finance

    Analysis to be done: Perform data preprocessing and build a deep learning prediction model.

    Content:

    Dataset columns and definition:

    credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

    purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").

    int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.

    installment: The monthly installments owed by the borrower if the loan is funded.

    log.annual.inc: The natural log of the self-reported annual income of the borrower.

    dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).

    fico: The FICO credit score of the borrower.

    days.with.cr.line: The number of days the borrower has had a credit line.

    revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).

    revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).

    inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.

    delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

    pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
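
    Two of the column definitions above involve a transformation: log.annual.inc stores a natural log and int.rate stores a proportion. The sketch below converts both back to more readable units. The function name and the sample row are hypothetical and the values are not taken from the dataset.

```python
import math


def readable_units(row):
    """Convert the log-income and proportional-rate columns to plain units."""
    return {
        # log.annual.inc is the natural log of income, so exp() undoes it.
        "annual_income": math.exp(row["log.annual.inc"]),
        # int.rate is a proportion, so 0.11 corresponds to 11 percent.
        "interest_rate_percent": row["int.rate"] * 100,
    }


# Hypothetical borrower with an 85,000 annual income and an 11% interest rate.
sample_row = {"log.annual.inc": math.log(85_000), "int.rate": 0.11}
converted = readable_units(sample_row)
print(converted)
```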

    Steps to perform:

    Perform exploratory data analysis and feature engineering, then follow up with a deep learning model to predict whether or not a loan will default using the historical data.

    Tasks:

    1. Feature Transformation

    Transform categorical values into numerical values (discrete)

    2. Exploratory data analysis of different factors of the dataset.

    3. Additional Feature Engineering

    You will check the correlation between features and will drop those features which have a strong correlation

    This will help reduce the number of features and will leave you with the most relevant features

    4. Modeling

    After applying EDA and feature engineering, you are now ready to build the predictive models

    In this part, you will create a deep learning model using Keras with TensorFlow backend
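
    The feature transformation and correlation-based pruning tasks can be sketched in plain Python as below. The 0.9 threshold, the helper names, and the toy columns (including the deliberately redundant fico_scaled) are assumptions for illustration; the task description does not specify a threshold or an encoding scheme.

```python
import math


def encode_categories(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if it is not strongly correlated with one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Toy data: fico_scaled is a hypothetical column that duplicates fico.
columns = {
    "fico": [700, 720, 680, 750],
    "fico_scaled": [7.0, 7.2, 6.8, 7.5],
    "purpose": encode_categories(
        ["credit_card", "educational", "all_other", "credit_card"]
    ),
}
kept = drop_correlated(columns)
print(list(kept))
```

    Integer codes impose an artificial order on nominal categories such as purpose, so one-hot encoding may be a better fit before the modeling step.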

  14. Housing Prices Competition for Kaggle Learn Users

    • kaggle.com
    zip
    Updated Aug 29, 2020
    + more versions
    Cite
    Josue Faneittes (2020). Housing Prices Competition for Kaggle Learn Users [Dataset]. https://www.kaggle.com/josuefaneittes/housing-prices-competition-for-kaggle-learn-users
    Explore at:
    zip (13835 bytes)
    Dataset updated
    Aug 29, 2020
    Authors
    Josue Faneittes
    Description

    Dataset

    This dataset was created by Josue Faneittes

    Contents

  15. Q&A With Mixtral-8x7B

    • kaggle.com
    Updated Dec 31, 2023
    Cite
    AnthonyTherrien (2023). Q&A With Mixtral-8x7B [Dataset]. http://doi.org/10.34740/kaggle/dsv/7310806
    Dataset updated
    Dec 31, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    AnthonyTherrien
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Description:

    This dataset comprises a comprehensive collection of 200 000 question-and-answer pairs across 2353 academic subjects. It is designed to serve as a resource for educational research, natural language processing, and machine learning applications.

    Data Source:

    Subjects Covered: 2353 academic subjects spanning various disciplines.

    Total Entries: 200 000 Q&A pairs.

    Model Name: Mixtral-8x7B

    Dataset Features:

    Questions: Text fields containing academic questions.

    Answers: Text fields containing the corresponding answers, generated by Mixtral-8x7B.

    Usage Examples:

    Educational Research: Analyzing trends in academic queries and responses.

    NLP Applications: Training and benchmarking NLP models for question-answering systems.

    Machine Learning: Supervised learning tasks, such as text classification and answer generation.

    Data Quality and Limitations:

    Accuracy: Information about the accuracy of the answers, based on validation checks or user feedback.

    Bias and Fairness: Any known biases in the dataset, steps taken to mitigate them, and areas where the dataset may not be representative.

    Limitations: Areas where the dataset may not be sufficient or appropriate for use (e.g., specific subjects not adequately covered).

  16. Raisin Variety Classification Dataset

    • kaggle.com
    Updated Oct 20, 2023
    Cite
    Huseyin Cenik (2023). Raisin Variety Classification Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/6752008
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Huseyin Cenik
    Description

    Data Description

    The dataset includes images of Kecimen and Besni raisin varieties grown in Turkey, with a total of 900 raisin grains, including 450 pieces from each variety. These images were captured using CVS and underwent various stages of pre-processing. A total of 7 morphological features were extracted from these images and classified using three different artificial intelligence techniques.

    Data Fields:

    • Area: The number of pixels within the boundaries of the raisin.
    • Perimeter: The measurement of the boundary by calculating the distance around the raisin's edges.
    • MajorAxisLength: The length of the main axis, which is the longest line that can be drawn on the raisin.
    • MinorAxisLength: The length of the minor axis, which is the shortest line that can be drawn on the raisin.
    • Eccentricity: A measure of the eccentricity of the ellipse that has the same moments as the raisin.
    • ConvexArea: The number of pixels in the smallest convex shell encompassing the region formed by the raisin.
    • Extent: The ratio of the region formed by the raisin to the total pixels in the bounding box.
    • Class: The variety of raisin, either Kecimen or Besni.
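
    Two of the fields above follow from simple geometry, which the sketch below works through. The eccentricity formula is the standard one for an ellipse with the given axis lengths. The function names and sample values are hypothetical, and the dataset's own feature extraction pipeline is not documented here.

```python
import math


def eccentricity(major_axis_length, minor_axis_length):
    """Eccentricity of an ellipse: sqrt(1 - (minor / major) ** 2)."""
    ratio = minor_axis_length / major_axis_length
    return math.sqrt(1.0 - ratio ** 2)


def extent(area, bbox_width, bbox_height):
    """Share of the bounding box's pixels that belong to the raisin region."""
    return area / (bbox_width * bbox_height)


# Hypothetical measurements in pixels.
print(eccentricity(5.0, 3.0))
print(extent(50, 10, 10))
```

    A value near 0 indicates an almost circular grain, while values approaching 1 indicate an elongated one.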

    Çinar, İlkay, Koklu, Murat, and Tasdemir, Sakir. (2023). Raisin. UCI Machine Learning Repository. https://doi.org/10.24432/C5660T.

  17. College Dataset (unsupervised learning)

    • kaggle.com
    zip
    Updated May 2, 2020
    Cite
    Nishant Kumar (2020). College Dataset (unsupervised learning) [Dataset]. https://www.kaggle.com/datasets/nishantpatyal/college-dataset-unsupervised-learning
    Explore at:
    zip (32388 bytes)
    Dataset updated
    May 2, 2020
    Authors
    Nishant Kumar
    Description

    Dataset

    This dataset was created by Nishant Kumar

    Contents

  18. Python for Machine Learning - Crash Course

    • kaggle.com
    Updated Jun 24, 2023
    Cite
    Kunaal Naik (2023). Python for Machine Learning - Crash Course [Dataset]. https://www.kaggle.com/datasets/funxexcel/python-for-machine-learning-crash-course/data
    Dataset updated
    Jun 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kunaal Naik
    Description

    Dataset

    This dataset was created by Kunaal Naik

    Contents

  19. Data from: Personalized Recommendation Systems Dataset

    • kaggle.com
    Updated Dec 23, 2024
    Cite
    Alfaris Bachmid (2024). Personalized Recommendation Systems Dataset [Dataset]. https://www.kaggle.com/datasets/alfarisbachmid/personalized-recommendation-systems-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 23, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alfaris Bachmid
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Personalized Recommendation Systems Dataset (150,000 Entries)

    This dataset is a fictional representation of user interactions within an e-commerce or streaming platform, created specifically for educational and training purposes. It simulates realistic user behavior and interactions to aid in developing and testing machine learning models for personalized recommendation systems. With 150,000 entries, it offers a rich variety of features suitable for building and evaluating algorithms in recommendation systems, user behavior analysis, and predictive modeling.

    Dataset Features:
    1. User_ID: A unique identifier for each user (e.g., User_1, User_2, etc.), representing individual profiles on the platform.
    2. Item_ID: A unique identifier for each item, such as a product, movie, or song.
    3. Category: The type of item interacted with (e.g., Electronics, Books, Music, Movies, etc.), providing insights into user preferences.
    4. Rating: User-assigned ratings on a scale of 1.0 to 5.0, reflecting the level of satisfaction with the item.
    5. Timestamp: The exact date and time of the interaction, useful for time-based analysis.
    6. Price: The price of the item at the time of interaction, recorded in USD.
    7. Platform: The platform or device used to interact with the system (e.g., Web, Mobile App, Smart TV, Tablet), capturing multi-device behavior.
    8. Location: The geographic region of the user, categorized into areas such as North America, Europe, Asia, etc., for regional behavioral analysis.

    Applications: This dataset is versatile and can be used for:
    - Collaborative Filtering Models: Harness user-item interaction data to recommend items based on similar users or items.
    - Content-Based Recommendation Systems: Leverage item attributes to generate personalized recommendations.
    - User Behavior Analysis: Uncover insights into user preferences, habits, and trends to inform marketing strategies.
    - Predictive Modeling: Train machine learning models to predict user preferences or future interactions.

    Important Note: This dataset is fictional and does not represent real-world data. It has been generated solely for educational and training purposes, making it ideal for students, researchers, and data scientists who want to practice building machine learning models without using sensitive or proprietary data.

    Why Use This Dataset?
    1. Diverse and Realistic Features: Simulates key aspects of user interaction in modern platforms.
    2. Scalable Size: Provides sufficient data for training advanced machine learning models, ensuring robust validation.
    3. Rich Metadata: Enables detailed analysis and multiple use cases, from recommendation systems to business analytics.

    This dataset is a great resource for exploring personalized recommendations or enhancing machine learning skills in a practical and safe manner.
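
As a rough illustration of the collaborative-filtering use case listed above, the sketch below builds per-item rating vectors from (User_ID, Item_ID, Rating) triples and ranks items by cosine similarity. The sample rows and the function names are made up for illustration and are not taken from the dataset; unrated items are treated as zeros, which is a simplifying assumption.

```python
import math
from collections import defaultdict

# Illustrative rows only (not from the dataset): (User_ID, Item_ID, Rating)
interactions = [
    ("User_1", "Item_A", 5.0),
    ("User_1", "Item_B", 4.0),
    ("User_2", "Item_A", 4.0),
    ("User_2", "Item_B", 5.0),
    ("User_2", "Item_C", 1.0),
    ("User_3", "Item_A", 1.0),
    ("User_3", "Item_C", 5.0),
]


def item_vectors(rows):
    """Map each item to a sparse vector {user: rating}."""
    vectors = defaultdict(dict)
    for user, item, rating in rows:
        vectors[item][user] = rating
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; missing ratings count as 0."""
    dot = sum(a[u] * b[u] for u in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(item, vectors):
    """Return the other item whose rating vector is closest to `item`."""
    scores = {
        other: cosine(vectors[item], vec)
        for other, vec in vectors.items()
        if other != item
    }
    return max(scores, key=scores.get)


vectors = item_vectors(interactions)
print(most_similar("Item_A", vectors))
```

With the full 150,000-row file, the same idea would typically be applied to a sparse user-item matrix rather than Python dictionaries.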

  20. Geospatial Learn Course Data

    • kaggle.com
    Updated Sep 20, 2019
    Cite
    Alexis Cook (2019). Geospatial Learn Course Data [Dataset]. https://www.kaggle.com/alexisbcook/geospatial-learn-course-data/notebooks
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 20, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alexis Cook
    Description

    Dataset

    This dataset was created by Alexis Cook

    Contents

Cite
wonghoitin (2022). Datasets for federated learning [Dataset]. https://www.kaggle.com/datasets/wonghoitin/datasets-for-federated-learning

Datasets for federated learning

Divided Kaggle datasets for horizontal/vertical federated learning experiments

Explore at:
160 scholarly articles cite this dataset (View in Google Scholar)
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Dec 29, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
wonghoitin
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Federated learning builds machine learning models from data sets that are distributed across multiple devices while preventing data leakage (Q. Yang et al., 2019).

source:

  1. smoking https://www.kaggle.com/datasets/kukuroo3/body-signal-of-smoking license = CC0: Public Domain

  2. heart https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset license = CC0: Public Domain

  3. water https://www.kaggle.com/datasets/adityakadiwal/water-potability license = CC0: Public Domain

  4. customer https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis license = CC0: Public Domain

  5. insurance https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data license = CC0: Public Domain

  6. credit https://www.kaggle.com/datasets/ajay1735/hmeq-data license = CC0: Public Domain

  7. income https://www.kaggle.com/datasets/mastmustu/income license = CC0: Public Domain

  8. machine https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification license = CC0: Public Domain

  9. skin https://www.kaggle.com/datasets/saurabhshahane/lumpy-skin-disease-dataset license = Attribution 4.0 International (CC BY 4.0)

  10. score https://www.kaggle.com/datasets/parisrohan/credit-score-classification?select=train.csv license = CC0: Public Domain
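
The page describes these datasets as divided for horizontal and vertical federated learning but does not say how the division was done. The sketch below only illustrates the general distinction: a horizontal split gives each client different rows with the same columns, while a vertical split gives each client the same rows with different columns. The column names, the round-robin assignment, and the shared `id` key are assumptions made for illustration.

```python
def horizontal_split(rows, n_clients):
    """Give each client a disjoint subset of rows, keeping all columns."""
    # Round-robin assignment; any disjoint partition of the rows would do.
    return [rows[i::n_clients] for i in range(n_clients)]


def vertical_split(rows, column_groups, id_key="id"):
    """Give each client every row, but only its own columns plus a shared id."""
    return [
        [{key: row[key] for key in (id_key, *cols)} for row in rows]
        for cols in column_groups
    ]


# Hypothetical records; the field names are not taken from the listed datasets.
rows = [
    {"id": 1, "age": 40, "bmi": 22.1, "label": 0},
    {"id": 2, "age": 55, "bmi": 27.3, "label": 1},
    {"id": 3, "age": 31, "bmi": 24.8, "label": 0},
    {"id": 4, "age": 62, "bmi": 30.2, "label": 1},
]

by_rows = horizontal_split(rows, n_clients=2)
by_cols = vertical_split(rows, column_groups=[("age",), ("bmi", "label")])

print([len(part) for part in by_rows])
print(by_cols[0][0], by_cols[1][0])
```

In the vertical case the shared `id` column is what lets the parties align their records before training.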
