19 datasets found
  1. machine learning models on the WDBC dataset

    • scidb.cn
    Updated Apr 15, 2025
    Cite
    Mahdi Aghaziarati (2025). machine learning models on the WDBC dataset [Dataset]. http://doi.org/10.57760/sciencedb.23537
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Mahdi Aghaziarati
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors (mean, standard error (SE), and worst, i.e. the mean of the three largest values) for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension.

    All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling.

    The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers (Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)) were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability. All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model.

    Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically-relevant dataset.
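
    The preprocessing described above (z-score standardization followed by a stratified 80/20 split) can be illustrated with a minimal sketch. This is not the author's breast_cancer_classification_models.py, which uses scikit-learn; it is a standard-library approximation. The function names and the toy values are hypothetical, and the sketch standardizes the whole table before splitting, which is an assumption because the description does not say where the scaler was fitted.

```python
import random
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero on constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with the test fraction taken per class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# Toy data (made-up values, not taken from WDBC): 10 rows, 2 features.
X = [[float(i), 10.0 * i] for i in range(1, 11)]
y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = malignant, 0 = benign, as in the description

X_std = zscore_columns(X)
train_idx, test_idx = stratified_split(y)
```

    With five rows per class and a 20% test fraction, one row of each class lands in the test set, so both sets keep the original class balance.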

  2. Morocco: Rainfall Indicators at Subnational Level

    • data.humdata.org
    • kaggle.com
    csv
    Updated Sep 14, 2025
    Cite
    WFP - World Food Programme (2025). Morocco: Rainfall Indicators at Subnational Level [Dataset]. https://data.humdata.org/dataset/mar-rainfall-subnational
    Explore at:
    Available download formats: csv (1460994), csv (13837942)
    Dataset updated
    Sep 14, 2025
    Dataset provided by
    World Food Programme (http://da.wfp.org/)
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Morocco
    Description

    This dataset contains dekadal (10-day) rainfall indicators, computed from Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) version 2, which combines satellite imagery with in situ station data, and from the CHIRPS-GEFS short-term rainfall forecasts, aggregated by subnational administrative units.

    Included indicators are (for each dekad):

    • 10 day rainfall mm
    • rainfall 1-month rolling aggregation mm
    • rainfall 3-month rolling aggregation mm
    • rainfall long term average mm
    • rainfall 1-month rolling aggregation long term average mm
    • rainfall 3-month rolling aggregation long term average mm
    • rainfall anomaly %
    • rainfall 1-month anomaly %
    • rainfall 3-month anomaly %
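
    A minimal sketch of how indicators of this kind relate to one another is given here. It is not WFP's processing code. The function names and rainfall values are hypothetical, a 1-month window is assumed to span 3 dekads, and the anomaly is assumed to be observed rainfall expressed as a percentage of the long term average.

```python
from datetime import date


def dekad_of(d):
    """Return 1, 2 or 3 for days 1-10, 11-20 and 21-end of the month."""
    return min((d.day - 1) // 10 + 1, 3)


def rolling_sum(values, window):
    """Trailing sum over `window` dekads; None until enough history exists."""
    return [
        sum(values[i - window + 1 : i + 1]) if i >= window - 1 else None
        for i in range(len(values))
    ]


def anomaly_pct(observed, long_term_average):
    """Assumed definition: observed rainfall as a percentage of the long term average."""
    return 100.0 * observed / long_term_average


# Made-up 10-day rainfall totals in mm, one value per dekad.
rain_mm = [12.0, 0.0, 30.0, 18.0]
one_month = rolling_sum(rain_mm, 3)  # assumption: 1 month = 3 dekads
```

    Under the same assumption, a 3-month aggregation would use a window of 9 dekads.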

    The administrative units used for aggregation are based on WFP data and contain a Pcode reference attributed to each unit. The number of input pixels used to create the aggregates is provided in the n_pixels column. Finally, the type column indicates whether the value is based on a forecast, a preliminary product, or a final product.

    Forecasts are issued on the 6th, 16th, and 26th of each month for the upcoming 10-day period (dekad), then updated with improved versions on the 1st, 11th, and 21st. Preliminary observations replace the previous dekad’s forecast on the 3rd, 13th, and 23rd, and are later replaced by final observations—published mid-month (13th or 23rd)—covering all three dekads of the prior month. Please find a summary below:

    Publication Day: Forecast type, Covers (Dekad)

    • 1st: Updated forecast, 1–10 of the same month
    • 6th: Initial forecast, 11–20 of the same month
    • 11th: Updated forecast, 1–10 of the same month
    • 16th: Initial forecast, 21–end of the same month
    • 21st: Updated forecast, 11–20 of the same month
    • 26th: Initial forecast, 1–10 of the following month

    For more on CHIRPS-GEFS forecasts, see: https://www.chc.ucsb.edu/data/chirps-gefs

    For further details, please see the methodology section.

  3. 🐧 Palmer Penguins Dataset Extended

    • kaggle.com
    Updated Oct 22, 2023
    Cite
    Samy Baladram (2023). 🐧 Palmer Penguins Dataset Extended [Dataset]. http://doi.org/10.34740/kaggle/ds/3891364
    Dataset updated
    Oct 22, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Samy Baladram
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    The original Palmer's Penguins dataset is an invaluable resource in the world of data science, often used for statistical analysis, data visualization, and introductory machine learning tasks. Collected in the Palmer Archipelago near Antarctica, the dataset provides information on three species of penguins, including Adélie, Gentoo, and Chinstrap, and covers essential biological metrics such as bill dimensions and body mass.

    Our extended dataset aims to build upon this foundational work by incorporating new, realistic features. We have included additional variables like diet, year of observation, life stage, and health metrics. These extra features allow for a more nuanced understanding of penguin biology and ecology, making it ideal for more complex analyses, including but not limited to educational, ecological, and advanced machine learning applications.

    Columns

    The dataset consists of the following columns:

    • Species: Species of the penguin (Adelie, Chinstrap, Gentoo)
    • Island: Island where the penguin was found (Biscoe, Dream, Torgensen)
    • Sex: Gender of the penguin (Male, Female)
    • Diet: Primary diet of the penguin (Fish, Krill, Squid)
    • Year: Year the data was collected (2021-2025)
    • Life Stage: The life stage of the penguin (Chick, Juvenile, Adult)
    • Body Mass (g): Body mass in grams
    • Bill Length (mm): Bill length in millimeters
    • Bill Depth (mm): Bill depth in millimeters
    • Flipper Length (mm): Flipper length in millimeters
    • Health Metrics: Health status of the penguin (Healthy, Overweight, Underweight)

    What Sets This Dataset Apart?

    Temporal Insight

    The inclusion of yearly data from 2021 to 2025 allows for longitudinal studies, providing a temporal dimension that can help track the impact of climate change, dietary shifts, or other ecological factors on penguin populations over time.

    Comprehensive Health Indicators

    We introduce the 'Health Metrics' column, which takes into account the body mass, life stage, and species to categorize each penguin's health status. This provides a multi-faceted view of individual well-being and can be crucial for conservation studies.

    Multi-Dimensional Diet and Life Stages

    Our data structure enables the mapping of the diet to specific life stages, offering a granular understanding of penguin ecology. This added detail can be crucial for studying nutritional needs at different life stages.

    Accounting for Sexual Dimorphism

    Recognizing the importance of gender-based variations in penguin biology, our dataset incorporates attributes that allow for the study of sexual dimorphism, such as differing body sizes and potential diet variations between males and females.

    Ideal Usage Scenarios

    This enriched dataset is particularly suitable for:

    • Advanced ecological models that require multiple layers of data.
    • Educational case studies focusing on biology, ecology, or data science.
    • Data-driven conservation efforts aimed at penguin species.
    • Machine learning algorithms that benefit from diverse and multi-dimensional data.

    Acknowledgment

    We wish to express our deepest respect and acknowledgment to the original research team behind the Palmer's Penguins dataset. This Extended Palmer's Penguins dataset is designed to build upon the solid foundation laid by the original work. It is created to serve as a complementary resource that adds additional dimensions for research and educational purposes. In no way is this artificial dataset intended to discredit or disrespect the invaluable contributions made through the original dataset.

    All illustrations in this dataset are AI-generated.

  4. How Common is Your Birthday?

    • kaggle.com
    Updated Nov 23, 2022
    Cite
    The Devastator (2022). How Common is Your Birthday? [Dataset]. https://www.kaggle.com/datasets/thedevastator/us-births-how-common-is-your-birthday
    Dataset updated
    Nov 23, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Description

    US Births - How Common is Your Birthday?

    How popular is your birthday?

    By Andy Kriebel [source]

    About this dataset

    The file contains data on births in the United States from 1994 to 2014. The data includes the following columns:

    • year: The year of the observation. (Integer)
    • month: The month of the observation. (Integer)
    • date_of_month: The date of the observation. (Integer)
    • day_of_week: The day of the week of the observation. (Integer)
    • births: The number of births on the given day. (Integer)

    How to use the dataset

    The US Births dataset on Kaggle contains data on births in the United States from 1994 to 2014. The data is broken down by year, month, date of month, day of week, and births.

    This dataset can be used to answer questions about when people are born, how common certain birthdays are, and any trends over time. For example, you could use this dataset to find out which day of the week or which month has the most births.
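
    As a minimal sketch of the day-of-week question, the snippet below averages births per weekday using only the standard library. The column names are the ones documented for this dataset; the sample rows are made up for illustration, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the documented column layout (not real birth counts).
SAMPLE = """year,month,date_of_month,day_of_week,births
1994,1,1,6,8000
1994,1,2,7,7800
1994,1,3,1,10000
1994,1,8,6,8200
"""


def mean_births_by_weekday(csv_file):
    """Average the `births` column for each `day_of_week` value."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        dow = int(row["day_of_week"])
        totals[dow] += int(row["births"])
        counts[dow] += 1
    return {dow: totals[dow] / counts[dow] for dow in totals}


averages = mean_births_by_weekday(io.StringIO(SAMPLE))
busiest = max(averages, key=averages.get)  # weekday code with the highest mean
```

    To run it on the real file, pass `open("US_births_1994-2014.csv", newline="")` instead of the in-memory sample.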

    Research Ideas

    • Determining on which day of the year and at what time of day most people are born, to help with staffing levels in maternity wards
    • Identifying trends in baby names over time
    • Predicting the number of births on a given day

    Acknowledgements

    This data set is a combined effort of the U.S. National Center for Health Statistics and the U.S. Social Security Administration, provided by FiveThirtyEight. It contains data on births in the United States from 1994 to 2014, with the following columns: year, month, date_of_month, day_of_week, births

    Thank you to FiveThirtyEight for providing this dataset!

    Data Source

    License

    License: Dataset copyright by authors

    You are free to:

    • Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
    • Adapt: remix, transform, and build upon the material for any purpose, even commercially.

    You must:

    • Give appropriate credit: provide a link to the license, and indicate if changes were made.
    • ShareAlike: you must distribute your contributions under the same license as the original.
    • Keep intact: all notices that refer to this license, including copyright notices.

    Columns

    File: US_births_1994-2014.csv

    | Column name   | Description                                  |
    |:--------------|:---------------------------------------------|
    | year          | Year of the data. (Integer)                  |
    | month         | Month of the data. (Integer)                 |
    | date_of_month | Day of the month of the data. (Integer)      |
    | day_of_week   | Day of the week of the data. (Integer)       |
    | births        | Number of births on the given day. (Integer) |

    Acknowledgements

    If you use this dataset in your research, please credit Andy Kriebel.

  5. Weather Long-term Time Series Forecasting

    • kaggle.com
    Updated Nov 3, 2024
    Cite
    Alistair King (2024). Weather Long-term Time Series Forecasting [Dataset]. https://www.kaggle.com/datasets/alistairking/weather-long-term-time-series-forecasting
    Dataset updated
    Nov 3, 2024
    Dataset provided by
    Kaggle
    Authors
    Alistair King
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Weather Long-term Time Series Forecasting (2020)

    Dataset Description

    Weather is recorded every 10 minutes throughout the entire year of 2020, comprising 20 meteorological indicators measured at a Max Planck Institute weather station. The dataset provides comprehensive atmospheric measurements including air temperature, humidity, wind patterns, radiation, and precipitation. With over 52,560 data points per variable (365 days × 24 hours × 6 measurements per hour), this high-frequency sampling offers detailed insights into weather patterns and atmospheric conditions. The measurements include both basic weather parameters and derived quantities such as vapor pressure deficit and potential temperature, making it suitable for both meteorological research and practical applications. You can find some initial analysis using this dataset here: "Weather Long-term Time Series Forecasting Analysis".
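
    The sample count quoted above can be checked with a short calculation. The sketch assumes a complete record with no gaps; the function name is hypothetical. Because 2020 was a leap year, a full year of 10-minute samples comes to 52,704 rows, which is consistent with "over 52,560" (the figure for a 365-day year).

```python
from datetime import datetime, timedelta


def samples_in_year(year, step_minutes=10):
    """Number of evenly spaced samples in a calendar year, assuming no gaps."""
    start = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    return (end - start) // timedelta(minutes=step_minutes)


n_2020 = samples_in_year(2020)    # leap year: 366 days * 24 * 6
n_common = samples_in_year(2021)  # common year: 365 days * 24 * 6
```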

    File Structure

    The dataset is provided in a CSV format with the following columns:

    | Column Name | Description |
    |:------------|:------------|
    | date | Date and time of the observation. |
    | p | Atmospheric pressure in millibars (mbar). |
    | T | Air temperature in degrees Celsius (°C). |
    | Tpot | Potential temperature in Kelvin (K), representing the temperature an air parcel would have if moved to a standard pressure level. |
    | Tdew | Dew point temperature in degrees Celsius (°C), indicating the temperature at which air becomes saturated with moisture. |
    | rh | Relative humidity as a percentage (%), showing the amount of moisture in the air relative to the maximum it can hold at that temperature. |
    | VPmax | Maximum vapor pressure in millibars (mbar), representing the maximum pressure exerted by water vapor at the given temperature. |
    | VPact | Actual vapor pressure in millibars (mbar), indicating the current water vapor pressure in the air. |
    | VPdef | Vapor pressure deficit in millibars (mbar), measuring the difference between maximum and actual vapor pressure, used to gauge drying potential. |
    | sh | Specific humidity in grams per kilogram (g/kg), showing the mass of water vapor per kilogram of air. |
    | H2OC | Concentration of water vapor in millimoles per mole (mmol/mol) of dry air. |
    | rho | Air density in grams per cubic meter (g/m³), reflecting the mass of air per unit volume. |
    | wv | Wind speed in meters per second (m/s), measuring the horizontal motion of air. |
    | max. wv | Maximum wind speed in meters per second (m/s), indicating the highest recorded wind speed over the period. |
    | wd | Wind direction in degrees (°), representing the direction from which the wind is blowing. |
    | rain | Total rainfall in millimeters (mm), showing the amount of precipitation over the observation period. |
    | raining | Duration of rainfall in seconds (s), recording the time for which rain occurred during the observation period. |
    | SWDR | Short-wave downward radiation in watts per square meter (W/m²), measuring incoming solar radiation. |
    | PAR | Photosynthetically active radiation in micromoles per square meter per second (µmol/m²/s), indicating the amount of light available for photosynthesis. |
    | max. PAR | Maximum photosynthetically active radiation recorded in the observation period in µmol/m²/s. |
    | Tlog | Temperature logged in degrees Celsius (°C), potentially from a secondary sensor or logger. |
    | OT | Likely refers to an "operational timestamp" or an offset in time, but may need clarification depending on the dataset's context. |
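
    The derived humidity columns are linked by simple relations, sketched below. VPdef as the difference between VPmax and VPact follows the column description; relative humidity as the ratio of actual to maximum vapor pressure is the standard definition and is assumed to match how rh was computed here. The function names and the sample row are hypothetical.

```python
def vapor_pressure_deficit(vp_max, vp_act):
    """VPdef in mbar: maximum minus actual vapor pressure."""
    return vp_max - vp_act


def relative_humidity(vp_max, vp_act):
    """Relative humidity in percent (standard definition, assumed for this dataset)."""
    return 100.0 * vp_act / vp_max


# Made-up row using the dataset's column names.
row = {"VPmax": 10.0, "VPact": 8.0}
vp_def = vapor_pressure_deficit(row["VPmax"], row["VPact"])
rh = relative_humidity(row["VPmax"], row["VPact"])
```

    Comparing these computed values against the stored VPdef and rh columns is a cheap consistency check after loading the CSV.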

    Potential Use Cases

    This high-resolution meteorological dataset enables applications across multiple domains. For weather forecasting, the frequent measurements support the development of prediction models, while climate researchers can study microclimate variations and seasonal patterns. In agriculture, temperature and vapor pressure deficit data aid crop modeling and irrigation planning. The wind and radiation measurements benefit renewable energy planning, while the comprehensive atmospheric data supports environmental monitoring. The dataset's detailed nature makes it particularly suitable for machine learning applications and educational purposes in meteorology and data science.

    Credits

    • This data was provided by the Max Planck Institute, and acc...
  6. Air Quality Index Data

    • kaggle.com
    Updated Sep 25, 2022
    Cite
    Shrijayan (2022). Air Quality Index Data [Dataset]. https://www.kaggle.com/cpluzshrijayan/air-quality-prediction-harbor/discussion
    Dataset updated
    Sep 25, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shrijayan
    Description

    This dataset is entirely imaginary and is NOT real data; it deals only with values that we created.

    Content

    This dataset deals with pollution in the harbors of Chennai, Kolkata, and Visakhapatnam. Pollution there has been recorded, but it is a pain to create and collect all the data and arrange it in a format that interests data scientists. Hence I gathered four major pollutants and placed them neatly in a CSV file.

    Content

    There is a total of 29 fields. The four pollutants (NO2, O3, SO2, and O3) each have 5 specific columns. Observations totaled. This kernel provides a good introduction to this dataset!

    For observations on specific columns visit the Column Metadata on the Data tab.

    Inspiration

    I did a related project and decided to open-source our dataset so that data scientists don't need to re-scrape historical pollution data from scratch.

  7. Sloan Digital Sky Survey - DR18

    • kaggle.com
    Updated Jul 29, 2023
    Cite
    Farid R (2023). Sloan Digital Sky Survey - DR18 [Dataset]. https://www.kaggle.com/datasets/diraf0/sloan-digital-sky-survey-dr18/code
    Dataset updated
    Jul 29, 2023
    Dataset provided by
    Kaggle
    Authors
    Farid R
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset consists of 100,000 observations from the Data Release (DR) 18 of the Sloan Digital Sky Survey (SDSS). Each observation is described by 42 features and 1 class column classifying the observation as either:

    • a STAR
    • a GALAXY
    • a QSO (Quasi-Stellar Object) or a Quasar.

    You can read more about the features below:

    • Objid, Specobjid - Object Identifiers
    • ra - J2000 Right Ascension
    • dec - J2000 Declination
    • redshift - Final Redshift of the celestial object
    • u, g, r, i, and z - better of DeV/Exp magnitude fit for u, g, r, i, and z. u, g, r, i, and z correspond to the five photometric bands namely ultraviolet band, green band, red band, infrared band, and near infrared band respectively.
    • run - Run number
    • rerun - Rerun number
    • camcol - Camera column
    • field - Field number

    The run number refers to a specific period in which the SDSS observes a part of the sky. SDSS is divided into several runs, each lasting for a certain amount of time, which are then combined to cover an extensive portion of the sky. The rerun number refers to the reprocessing of the data obtained.

    In each run, multiple charge-coupled device (CCD) cameras are arranged into a column, which is responsible for imaging a specific portion of the sky. camcol refers to the number of the camera column that imaged a specific observation. A field is a specific portion of the sky that is imaged during a single exposure of the telescope. The entire sky is divided into a number of fields, and the field number column refers to the field from which an observation was obtained.

    • plate - Plate number
    • fiberID - Optical Fiber ID

    A number of physical glass plates are mounted on the telescope, each containing a number of optical fibers corresponding to a specific position in the sky. When light hits these optical fibers, it is sent to spectrographs for analysis. plate number and fiberID refer to the number of the plate and the ID of the optical fiber responsible for gathering light from the celestial object respectively.

    • mjd - Modified Julian Date

    Modified Julian Date represents the number of days that have passed since midnight Nov. 17, 1858. It is used in SDSS to keep track of the time of each observation.
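
    A minimal sketch of the conversion from the mjd column to a calendar date follows from the definition above. The function and constant names are hypothetical, and the result is a naive datetime with no time zone or leap-second handling.

```python
from datetime import datetime, timedelta

# MJD 0 corresponds to midnight on Nov. 17, 1858.
MJD_EPOCH = datetime(1858, 11, 17)


def mjd_to_datetime(mjd):
    """Convert a Modified Julian Date (days, may be fractional) to a naive datetime."""
    return MJD_EPOCH + timedelta(days=mjd)


start_of_2000 = mjd_to_datetime(51544)   # 2000-01-01 00:00
noon_example = mjd_to_datetime(51544.5)  # 2000-01-01 12:00
```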

    • petroRad_u, petroRad_g, petroRad_r, petroRad_i, and petroRad_z - Petrosian Radii for the five photometric bands u (ultraviolet), g (green), r (red), i (infrared), and z (near-infrared) respectively.

    The petrosian radius is a measure of the size of a galaxy, and it is calculated using the petrosian flux profile. The petrosian flux profile measures how the brightness of an object varies with distance from its center. The petrosian radius is defined as the distance from the galaxy's center where the ratio of the local surface brightness to the average surface brightness reaches a certain predefined value. The local surface brightness refers to the brightness of a specific small region or pixel on the surface of an extended object. It is a measure of how much light is detected from that particular region. The average surface brightness, on the other hand, represents the mean or average brightness measured over the entire surface of the extended object. It is the total amount of light received from the object divided by its total area.

    These parameters help in characterizing the properties of celestial objects, especially when studying their morphologies, sizes, and how they evolve over time.

    • petroFlux_u, petroFlux_g, petroFlux_r, petroFlux_i, and petroFlux_z - Petrosian Fluxes for the five photometric bands u (ultraviolet), g (green), r (red), i (infrared), and z (near-infrared) respectively. These features describe the total amount of light emitted from the celestial objects.

    These parameters help in studying the photometric properties of the celestial objects, particularly in analyzing the brightness, colors, and spectral energy distribution of the objects. By using petrosian fluxes in different bands, astronomers can obtain a comprehensive view of an object's light emission across the electromagnetic spectrum.

    • petroR50_u, petroR50_g, petroR50_r, petroR50_i, and petroR50_z - Petrosian half-light radii for the five photometric bands u (ultraviolet), g (green), r (red), i (infrared), and z (near-infrared) respectively. PetroR50 is a measure of the radius at which half of the total light (or flux) emitted from a celestial object is enclosed with the petrosian aperture. The petrosian aperture is defined based on the petrosian radius, which is a measure of the size of the celestial object. The petrosian aperture allows a...
  8. Breast Cancer Dataset

    • kaggle.com
    Updated May 9, 2023
    Cite
    Utkarsh Singh (2023). Breast Cancer Dataset [Dataset]. https://www.kaggle.com/datasets/utkarshx27/breast-cancer-dataset-used-royston-and-altman/versions/1
    Dataset updated
    May 9, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Utkarsh Singh
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The data set contains patient records from a 1984-1989 trial conducted by the German Breast Cancer Study Group (GBSG) of 720 patients with node-positive breast cancer; it retains the 686 patients with complete data for the prognostic variables. These data sets are used in the paper by Royston and Altman (2013). The Rotterdam data is used to create a fitted model, and the GBSG data is used to validate the model. The paper gives references for the data source.

    Dataset Format

    A data set with 686 observations and 11 variables.

    | Columns | Description |
    |:--------|:------------|
    | pid | patient identifier |
    | age | age, years |
    | meno | menopausal status (0 = premenopausal, 1 = postmenopausal) |
    | size | tumor size, mm |
    | grade | tumor grade |
    | nodes | number of positive lymph nodes |
    | pgr | progesterone receptors (fmol/l) |
    | er | estrogen receptors (fmol/l) |
    | hormon | hormonal therapy, 0 = no, 1 = yes |
    | rfstime | recurrence free survival time; days to first of recurrence, death or last follow-up |
    | status | 0 = alive without recurrence, 1 = recurrence or death |

    References

    Patrick Royston and Douglas Altman, External validation of a Cox prognostic model: principles and methods. BMC Medical Research Methodology 2013, 13:33

  9. Parking Dynamic Dataset

    • kaggle.com
    Updated Jul 14, 2025
    Cite
    Ankit (2025). Parking Dynamic Dataset [Dataset]. https://www.kaggle.com/datasets/ankit07chy/parking-dynamic-dataset
    Dataset updated
    Jul 14, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ankit
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset captures detailed information about urban parking activity, traffic conditions, and vehicle types over time. With over 18,400 entries spread across 11 columns, it offers a sizable and rich set of observations—ideal for anyone looking to explore parking trends, analyze traffic flow, or build models to predict parking availability.

    What’s Inside:

    Timestamps: Each entry is time-stamped (from October 4 to December 19, 2016), making this a time-series dataset. That means you can track how parking behavior changes over days, weeks, or months—like identifying peak hours or weekend patterns.

    Unique Events: Every row comes with a unique ID, so each record represents a single moment or observation. That ensures clean data without duplicates.

    Parking Locations: The SystemCodeNumber column identifies where each record came from—there are 14 different locations or systems in total. Codes like BHMBONCSTB1, Broad Street, and Others-OCCSP119A show that data comes from multiple spots, which helps in comparing how parking demand varies by area.

    Capacity vs. Occupancy: Two of the most important columns show how many parking spaces were available (Capacity) and how many were filled (Occupancy) at any given time. Together, they tell us how full a lot was and help track usage levels. Some locations had space for thousands of cars, while others were much smaller.
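
    How full a lot was can be expressed as the ratio of the two columns. The sketch below uses the column names and location codes mentioned in this description, but the capacity and occupancy numbers are made up and the function name is hypothetical.

```python
def occupancy_rate(occupancy, capacity):
    """Fraction of spaces filled; raises if capacity is not positive."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return occupancy / capacity


# Made-up counts for two of the location codes named in the description.
records = [
    {"SystemCodeNumber": "BHMBONCSTB1", "Capacity": 500, "Occupancy": 125},
    {"SystemCodeNumber": "Broad Street", "Capacity": 800, "Occupancy": 600},
]

rates = {
    r["SystemCodeNumber"]: occupancy_rate(r["Occupancy"], r["Capacity"])
    for r in records
}
fullest = max(rates, key=rates.get)  # location with the highest fill ratio
```

    Using a ratio rather than the raw Occupancy count makes lots of very different sizes comparable.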

    Geolocation: Latitude and longitude are included, meaning you can map every observation. This is especially helpful if you're working with GIS tools or want to visualize parking availability across a city.

    Vehicle Types: Most vehicles in the data are cars (81%), followed by bikes (20%) and a small number of other types (about 13%, or 3,578 entries). This breakdown can help in designing parking facilities or allocating space differently based on need.

    Traffic Conditions: The TrafficCondition column categorizes how busy the surrounding roads were: low (42%), average (35%), and high (23%). These conditions can be correlated with parking occupancy—like whether traffic is worse when lots are full.

    Queue Length: This column tracks how many vehicles were waiting for a spot (from 0 to 15), giving insight into where and when demand exceeded supply.

    Special Days: There’s also a flag (IsSpecialDay) indicating whether a day was out of the ordinary—perhaps due to an event, holiday, or other factor affecting usual patterns.

  10. Non-alcohol fatty liver disease (NAFLD)

    • kaggle.com
    Updated May 9, 2023
    Cite
    Utkarsh Singh (2023). Non-alcohol fatty liver disease (NAFLD) [Dataset]. https://www.kaggle.com/datasets/utkarshx27/non-alcohol-fatty-liver-disease/suggestions
    Dataset updated
    May 9, 2023
    Dataset provided by
    Kaggle
    Authors
    Utkarsh Singh
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data sets containing the data from a population study of non-alcoholic fatty liver disease (NAFLD). Subjects with the condition and a set of matched control subjects were followed forward for metabolic conditions, cardiac endpoints, and death.

    nafld1 is a data frame with 17549 observations on the following 10 variables.

    Columns (name: description):
    • id: subject identifier
    • age: age at entry to the study
    • male: 0=female, 1=male
    • weight: weight in kg
    • height: height in cm
    • bmi: body mass index
    • case.id: the id of the NAFLD case to whom this subject is matched
    • futime: time to death or last follow-up
    • status: 0=alive at last follow-up, 1=dead

    nafld2 is a data frame with 400123 observations and 4 variables containing laboratory data.

    Columns (name: description):
    • id: subject identifier
    • days: days since index date
    • test: the type of value recorded
    • value: the numeric value

    nafld3 is a data frame with 34340 observations and 3 variables containing outcomes.

    Columns (name: description):
    • id: subject identifier
    • days: days since index date
    • event: the endpoint that occurred

    Details

    The primary reference for the NAFLD study is Allen (2018). The incidence of non-alcoholic fatty liver disease (NAFLD) has been rising rapidly in the last decade, and it is now one of the main drivers of hepatology practice (Tapper 2018). It is essentially the presence of excess fat in the liver, and parallels the ongoing obesity epidemic. Approximately 20-25% of NAFLD patients will develop the inflammatory state of non-alcoholic steatohepatitis (NASH), leading to fibrosis and eventual end-stage liver disease. NAFLD can be accurately diagnosed by MRI methods, but NASH diagnosis currently requires a biopsy.

    The current study constructed a population cohort of all adult NAFLD subjects from 1997 to 2014 along with 4 potential controls for each case. To protect patient confidentiality, all time intervals are in days since the index date; none of the dates from the original data were retained. Subject age is the integer age at the index date, and the subject identifier is an arbitrary integer. As a final protection, we include only a 90% random sample of the data. As a consequence, analysis results will not exactly match the original paper.

    There are 3 data sets: nafld1 contains baseline data and has one observation per subject, nafld2 has one observation for each (time-dependent) continuous measurement, and nafld3 has one observation for each yes/no outcome that occurred.
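
    The way the three data sets relate through the shared id column can be sketched in a few lines of Python. This is only an illustration: the sample rows, the event names, and the helper `events_per_subject` are invented here and are not part of the dataset or its documentation.

```python
from collections import defaultdict

# Tiny made-up rows shaped like nafld1 (baseline) and nafld3 (outcomes).
# Values and event names are hypothetical, chosen only for illustration.
nafld1 = [
    {"id": 1, "age": 57, "male": 0, "futime": 6261, "status": 0},
    {"id": 2, "age": 67, "male": 1, "futime": 624, "status": 1},
]
nafld3 = [
    {"id": 1, "days": 3051, "event": "htn"},
    {"id": 1, "days": -459, "event": "diabetes"},
    {"id": 2, "days": 120, "event": "nafld"},
]


def events_per_subject(baseline, outcomes):
    """Attach each subject's outcome events (sorted by day) to its baseline row."""
    by_id = defaultdict(list)
    for row in outcomes:
        by_id[row["id"]].append((row["days"], row["event"]))
    return [
        {**subject, "events": sorted(by_id[subject["id"]])}
        for subject in baseline
    ]


merged = events_per_subject(nafld1, nafld3)
```

    Because nafld1 has one row per subject while nafld2 and nafld3 have many, grouping the long tables by id before attaching them keeps the result at one row per subject.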

  11. Pakistan Climate Change Observations

    • kaggle.com
    Updated Jan 20, 2023
    Cite
    The Devastator (2023). Pakistan Climate Change Observations [Dataset]. https://www.kaggle.com/datasets/thedevastator/pakistan-climate-change-observations/suggestions?status=pending&yourSuggestions=true
    Dataset updated
    Jan 20, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Pakistan
    Description

    2009-2013 Daily Weather Data for Islamabad

    By Ahsan Aman [source]

    About this dataset

    This dataset contains preprocessed climate change data for Islamabad, Pakistan from 2009 to 2013. Evaluate the shifts in local conditions to understand how climate change is impacting the region. Analyze changing patterns in maximum and minimum temperature readings, as well as atmospheric pressure, cloud cover, wind speed and rain levels. Assess how all of these factors together contribute to a dynamic weather pattern, and discover emerging trends for the years ahead. Get a detailed breakdown of daily weather measurements that can inform forecasting models and drive public awareness of climate change issues in this region of the world.

    How to use the dataset

    This dataset contains daily climate change data for Islamabad, Pakistan for the years 2009 to 2013. This data can be used to analyze the different climate factors and to assess their impact on local weather conditions. This dataset is ideal for understanding how climate change affects daily weather patterns in an area over a long period of time.

    • First, explore the columns of this dataset and understand what each means:
    • daymonth_category: The month and day of the observation (String)
    • weather: The weather conditions during the observation (String)
    • max_temp: The maximum temperature during the observation (Float)
    • min_temp: The minimum temperature during the observation (Float)
    • wind: The wind speed during the observation (Float)
    • rain: The amount of rain during the observation (Float)
    • cloud: The amount of cloud cover during the observation (Float)
    • pressure: The atmospheric pressure during the observation (Float)
    • year: The year of the observation (Integer)
    • weathervalue: The numerical value assigned to the weather conditions during the observation (Integer)
    • avg_temp: The average temperature during the observation (Float)

    • After familiarizing yourself with these columns, use descriptive methods such as filtering or grouping by criteria such as weather type or date range to explore specific trends. Depending on your research question, different filters or groupings can reveal different factors related to climate change. For example, you could look at maximum temperatures in July and August to observe yearly fluctuations caused by heat waves, or view rainfall trends for each month across all five years in the dataset.

    • Another important feature of this dataset is the weathervalue column, which assigns a numerical value to each type of weather event in the study period. These values can be used as labels (e.g. from 0 to 9) for further machine learning work based on this dataset. You could also use them to plot the mean and standard deviation of particular weather event types and see how they relate to higher or lower temperatures or to other factors such as rainfall.
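
    The grouping step described above might look like the following sketch. It uses only the Python standard library and a handful of made-up rows; the column names come from the list above, while the values and the helper `summarize` are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up rows using the column names listed above (values are hypothetical).
rows = [
    {"year": 2009, "weathervalue": 0, "max_temp": 30.0, "rain": 0.0},
    {"year": 2009, "weathervalue": 0, "max_temp": 34.0, "rain": 0.0},
    {"year": 2010, "weathervalue": 3, "max_temp": 25.0, "rain": 4.0},
    {"year": 2010, "weathervalue": 3, "max_temp": 27.0, "rain": 6.0},
]


def summarize(rows, key, value):
    """Return {group: (mean, sample standard deviation)} of `value` per `key`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r[value])
    return {
        k: (mean(v), stdev(v) if len(v) > 1 else 0.0)
        for k, v in groups.items()
    }


by_weather = summarize(rows, "weathervalue", "max_temp")  # temperature per weather type
by_year = summarize(rows, "year", "rain")                 # rainfall per year
```

    The same helper works for any pair of columns, so swapping in min_temp or pressure only changes the arguments.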

    Research Ideas

    • Analyzing the correlation between climate change and daily weather trends in Islamabad, Pakistan over time.
    • Understanding how different temperature ranges affect Islamabad, Pakistan's population and tourism levels during different months of the year.
    • Creating a predictive model to forecast future climate change data and weather patterns in Islamabad, Pakistan using machine learning classification algorithms like Decision Trees or Random Forests

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: dataset_preprocessed_Islamabad.csv

    | Column name       | Description                                     |
    |:------------------|:------------------------------------------------|
    | daymonth_category | A categorization of the day and month. (String) |
    | weather           | The weather condition observed. (String)...

  12. Historical Weather Data for 2020

    • kaggle.com
    Updated Jun 27, 2024
    Cite
    Ahmed Gaitani (2024). Historical Weather Data for 2020 [Dataset]. https://www.kaggle.com/datasets/ahmedgaitani/historical-weather-data-for-2020
    Dataset updated
    Jun 27, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ahmed Gaitani
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains daily historical weather data recorded at multiple weather stations from January 1, 2020, to December 30, 2020. The data includes temperature, precipitation, humidity, wind speed, and weather conditions, providing a comprehensive view of the weather patterns over the year. This dataset is ideal for climate analysis, weather prediction, and educational purposes.

    Columns

    • Date: The date of the observation.
    • Station: The weather station identifier.
    • Temperature: The recorded temperature (in Celsius).
    • Precipitation: The recorded precipitation (in mm).
    • Humidity: The recorded humidity (in %).
    • WindSpeed: The recorded wind speed (in km/h).
    • WeatherCondition: The recorded weather condition (e.g., sunny, rainy, snowy).
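
    As a rough sketch of how these columns could be read and aggregated, the snippet below parses a small CSV with the header listed above and computes the mean temperature per station and month. The sample rows, the station identifiers, and the ISO date format (YYYY-MM-DD) are assumptions; the actual file may differ.

```python
import csv
import io
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical sample with the documented column names; values are invented.
sample_csv = """Date,Station,Temperature,Precipitation,Humidity,WindSpeed,WeatherCondition
2020-01-01,S1,2.0,0.0,80,10,sunny
2020-01-02,S1,4.0,1.5,85,12,rainy
2020-02-01,S1,6.0,0.0,70,8,sunny
2020-01-01,S2,-1.0,3.0,90,20,snowy
"""


def monthly_mean_temperature(csv_text):
    """Return {(station, month): mean temperature in Celsius}."""
    temps = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        month = date.fromisoformat(row["Date"]).month  # assumes ISO dates
        temps[(row["Station"], month)].append(float(row["Temperature"]))
    return {key: mean(vals) for key, vals in temps.items()}


monthly = monthly_mean_temperature(sample_csv)
```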

    Source

    Data generated synthetically for educational purposes.

    Potential Uses

    • Climate change analysis
    • Weather pattern prediction
    • Agricultural planning

  13. Insurance Premium Prediction

    • kaggle.com
    Updated Jun 16, 2019
    Cite
    nursnaaz (2019). Insurance Premium Prediction [Dataset]. https://www.kaggle.com/noordeen/insurance-premium-prediction/notebooks
    Dataset updated
    Jun 16, 2019
    Dataset provided by
    Kaggle
    Authors
    nursnaaz
    Description

    Context

    The insurance.csv dataset contains 1338 observations (rows) and 7 features (columns). The dataset contains 4 numerical features (age, bmi, children and expenses) and 3 nominal features (sex, smoker and region) that were converted into factors with numerical value designated for each level.

    Acknowledgements

    The insurance.csv file was obtained from the Machine Learning course website (Spring 2017) of Professor Eric Suess at http://www.sci.csueastbay.edu/~esuess/stat6620/#week-6.

    Inspiration

    The purpose of this exercise is to explore how the features relate to one another and to fit a multiple linear regression of existing medical expenses on individual features such as age, physical and family condition, and location. The fitted model can then be used to predict future medical expenses, which helps medical insurers decide what premium to charge.
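
    A multiple linear regression of this kind can be sketched with ordinary least squares. The example below is a minimal, dependency-free illustration that solves the normal equations directly; it is not the course's code. The rows are invented, only three of the features (age, bmi, and smoker encoded as 0/1) are used, and in practice a statistics library would normally replace the hand-written solver.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(X, y):
    """Least-squares weights [intercept, w1, ...] from (X^T X) w = X^T y."""
    Xb = [[1.0] + list(row) for row in X]  # prepend an intercept column
    k = len(Xb[0])
    xtx = [[sum(r[i] * r[j] for r in Xb) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(Xb, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up rows: (age, bmi, smoker as 0/1); expenses follow a known linear rule
# so the recovered weights can be checked by eye.
X = [(20, 22.0, 0), (30, 25.0, 0), (40, 30.0, 1), (50, 28.0, 1), (60, 35.0, 0)]
y = [1000 + 50 * a + 100 * b + 8000 * s for a, b, s in X]
weights = fit_linear(X, y)  # approximately [1000, 50, 100, 8000]
```

    Nominal features with more than two levels, such as region, would need one indicator column per level (minus one) before fitting.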

  14. Exoplanet Hunting in Deep Space

    • kaggle.com
    zip
    Updated Apr 12, 2017
    Cite
    WΔ (2017). Exoplanet Hunting in Deep Space [Dataset]. https://www.kaggle.com/datasets/keplersmachines/kepler-labelled-time-series-data/discussion/90245
    Explore at:
    Available download formats: zip (58642135 bytes)
    Dataset updated
    Apr 12, 2017
    Authors
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Search for New Earths

    GitHub

    The data describe the change in flux (light intensity) of several thousand stars. Each star has a binary label of 2 or 1. A label of 2 indicates that the star is confirmed to have at least one exoplanet in orbit; some observations are in fact multi-planet systems.

    As you can imagine, planets themselves do not emit light, but the stars that they orbit do. If said star is watched over several months or years, there may be a regular 'dimming' of the flux (the light intensity). This is evidence that there may be an orbiting body around the star; such a star could be considered to be a 'candidate' system. Further study of our candidate system, for example by a satellite that captures light at a different wavelength, could solidify the belief that the candidate can in fact be 'confirmed'.

    [Figure: Flux Diagram (https://cdn.pbrd.co/images/5g0jyccQF.png)]

    In the above diagram, a star is orbited by a blue planet. At t = 1, the starlight intensity drops because it is partially obscured by the planet, given our position. The starlight rises back to its original value at t = 2. The graph in each box shows the measured flux (light intensity) at each time interval.

    Description

    Trainset:

    • 5087 rows or observations.
    • 3198 columns or features.
    • Column 1 is the label vector. Columns 2 - 3198 are the flux values over time.
    • 37 confirmed exoplanet-stars and 5050 non-exoplanet-stars.

    Testset:

    • 570 rows or observations.
    • 3198 columns or features.
    • Column 1 is the label vector. Columns 2 - 3198 are the flux values over time.
    • 5 confirmed exoplanet-stars and 565 non-exoplanet-stars.
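
    The counts above imply a heavy class imbalance, which is worth quantifying before training anything. The sketch below works only from the numbers stated in this description; the function names are invented, and the weighting rule shown (total / (number of classes × class count)) is one common heuristic, the same one scikit-learn applies for `class_weight="balanced"`.

```python
# Class counts as stated in the dataset description.
TRAIN = {"exoplanet": 37, "non_exoplanet": 5050}
TEST = {"exoplanet": 5, "non_exoplanet": 565}


def to_binary(label):
    """Map the dataset's labels (2 = exoplanet, 1 = no exoplanet) to 1/0."""
    return 1 if label == 2 else 0


def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    return {name: total / (len(counts) * n) for name, n in counts.items()}


train_weights = balanced_class_weights(TRAIN)
positive_rate = TRAIN["exoplanet"] / sum(TRAIN.values())  # under 1% of stars
```

    With fewer than 1% positive examples, plain accuracy is uninformative: a model that always predicts "no exoplanet" would already score above 99% on the training set.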

    Acknowledgements

    The data presented here are cleaned and are derived from observations made by the NASA Kepler space telescope. The Mission is ongoing - for instance data from Campaign 12 was released on 8th March 2017. Over 99% of this dataset originates from Campaign 3. To boost the number of exoplanet-stars in the dataset, confirmed exoplanets from other campaigns were also included.

    To be clear, all observations from Campaign 3 are included. And in addition to this, confirmed exoplanet-stars from other campaigns are also included.

    The datasets were prepared late-summer 2016.

    Campaign 3 was used because 'it was felt' that this Campaign is unlikely to contain any undiscovered (i.e. wrongly labelled) exoplanets.

    NASA open-sources the original Kepler Mission data, which is hosted at the Mikulski Archive. After the data are beamed down to Earth, NASA applies de-noising algorithms to remove artefacts generated by the telescope. The data, in the .fits format, is stored online. With the help of a seasoned astrophysicist, anyone with an internet connection can embark on a search to find and retrieve the datafiles from the Archive.

    The cover image is copyright © 2011 by Dan Lessmann

  15. Wild Brown Rat Predator Response

    • kaggle.com
    Updated Feb 20, 2023
    Cite
    The Devastator (2023). Wild Brown Rat Predator Response [Dataset]. https://www.kaggle.com/datasets/thedevastator/wild-brown-rat-predator-response
    Dataset updated
    Feb 20, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Examining Antipredator Responses to Cat Fur and Possum Fur Odour Cues

    By [source]

    About this dataset

    The dataset contains results of an experiment examining the complex and fascinating antipredator responses of wild brown rats (Rattus norvegicus) to both cat fur and possum fur odour cues in a semi-natural environment. This experiment sought to explore how such animals think, feel, and react when presented with potential danger signals in their environment. The data reflects the behavior of wild brown rats that were housed in open air enclosures containing two feeding stations, each presented with different treatment scenarios: one pair of feeding stations received cat fur, one pair received possum fur, and another pair was provided no added odour cue as control. The results documented can offer deeper insight into the relative sensitivity of these animals when exposed to predator-related cues - not only their behavioral reactions but also their emotional state and physiological processes that are involved - informing us on their response strategies under different contexts. Furthermore, this information can be used as a practical tool for those interested in animal welfare conservation or management. So why wait? Dig deep into this dataset and uncover what wild brown rats are truly made of!

    How to use the dataset

    This dataset contains valuable information about the responses of wild brown rats to predator odour cues in a semi-natural, open air environment. This dataset can be used in order to gain further insight into the behaviour of these animals when presented with different predator odours and offers an opportunity to study their sensitivity and response times.

    In order to use this dataset, one should look at the column headings for more information on what each observation contains. Each row will represent an observation from one rat from a particular cohort, on a given night, treated with either cat fur or possum fur odour cue treatments, along with control (i.e., no treatment) observations. The columns contain important data points such as sex, cohort number and cage number (helpful for inter-run comparisons). There are also observations regarding how much time individual rodents spent in proximity of two food stations (helpful for measuring responsiveness), as well as how long they spent in each station’s food hopper before consuming their food (helpful for measuring interest level). Finally, it will be possible to explore how each rat responded when presented with “opposite treatments” - i.e., whether they had preferences or biases towards certain predators over others due to previous diet experiences or other factors.

    Overall this dataset provides an insight into the behaviour of wild brown rats when responding to different predator stimuli - offering a great deal of potential research avenues which could be explored by analysing this data effectively!

    Research Ideas

    • Examine correlations between individual animal behaviors and their response to varying levels of predator odors.
    • Test differences in responses of wild brown rats within different environmental contexts, such as open air enclosures, laboratory cages, and natural habitats.
    • Compare rats’ responses to predator-related olfactory cues with other factors such as sex, age, or genetic make-up to identify variations in behaviors within different population subgroups

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: 2019-Oct10._Choice_McQU_DATA_1_hour_bins.csv

    | Column name | Description                                  |
    |:------------|:---------------------------------------------|
    | Cohort      | The group of animals being tested. (Numeric) |
    | **Night*...

  16. Heart Disease Prediction Dataset

    • kaggle.com
    Updated May 26, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Utkarsh Singh (2023). Heart Disease Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/utkarshx27/heart-disease-diagnosis-dataset/suggestions
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Utkarsh Singh
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Attributes types : Real: 1,4,5,8,10,12 | Ordered:11 | Binary: 2,6,9 | Nominal:7,3,13

    Variable to be predicted: Absence (1) or presence (2) of heart disease

    Cost matrix (rows represent the true values, columns the predicted values):

                 absence   presence
      absence       0         1
      presence      5         0

    Total 270 observations and no missing values.
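
    The cost matrix says that missing a case of heart disease is five times as costly as a false alarm. A minimal sketch of how that could be used to score predictions and to choose the cheaper class follows; the function names and the sample labels are invented for illustration.

```python
# (true class, predicted class) -> cost, taken from the cost matrix above.
COST = {
    ("absence", "absence"): 0,
    ("absence", "presence"): 1,   # false alarm
    ("presence", "absence"): 5,   # missed disease
    ("presence", "presence"): 0,
}


def total_cost(y_true, y_pred):
    """Sum the misclassification cost over paired true/predicted labels."""
    return sum(COST[(t, p)] for t, p in zip(y_true, y_pred))


def cheapest_prediction(p_presence):
    """Pick the class with the lower expected cost, given P(presence)."""
    cost_if_absence = 5 * p_presence          # pay 5 when disease is present
    cost_if_presence = 1 * (1 - p_presence)   # pay 1 when disease is absent
    return "presence" if cost_if_presence < cost_if_absence else "absence"


y_true = ["absence", "presence", "presence", "absence"]
y_pred = ["presence", "absence", "presence", "absence"]
cost = total_cost(y_true, y_pred)  # one false alarm (1) + one miss (5)
```

    Under these costs, predicting presence becomes the cheaper choice once the estimated probability of disease exceeds 1/6, well below the usual 0.5 threshold.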

    Attribute Information:

     -- 1. age    
     -- 2. sex    
     -- 3. chest pain type (4 values)    
     -- 4. resting blood pressure 
     -- 5. serum cholesterol in mg/dl
     -- 6. fasting blood sugar > 120 mg/dl    
     -- 7. resting electrocardiographic results (values 0,1,2) 
     -- 8. maximum heart rate achieved 
     -- 9. exercise induced angina  
     -- 10. oldpeak = ST depression induced by exercise relative to rest  
     -- 11. the slope of the peak exercise ST segment   
     -- 12. number of major vessels (0-3) colored by fluoroscopy
     -- 13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
     -- 14. Target(Absence (1) or presence (2) of heart disease) 
    

    For more Info About Dataset: Link

  17. Fires from Space: Australia

    • kaggle.com
    Updated Jan 12, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Carlos Paradis (2020). Fires from Space: Australia [Dataset]. https://www.kaggle.com/datasets/carlosparadis/fires-from-space-australia-and-new-zeland/suggestions
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 12, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carlos Paradis
    Area covered
    Australia
    Description

    Context

    News about the Australian bushfires has been spreading fast; however, the same can't be said about datasets. This NASA FIRMS MODIS and VIIRS Fire/Hotspot data provides an initial dataset for fires in Australia. My main motivation for providing this dataset was the following article:

    See the Ideas Section at the bottom for more inspiration.

    Content

    The current dataset provides 4 tables from 2 NASA Satellite Instruments:

    • MODIS C6 Tables

      • fire_archive_M6_96619.csv
      • fire_nrt_M6_96619.csv
    • VIIRS 375m Tables

      • fire_archive_V1_96617.csv
      • fire_nrt_V1_96617.csv

    The provided URLs contain more details on the two instrument differences. Each instrument contains two types of tables: Near Real Time (_nrt_) and older standard/science quality data (_archive). According to the README provided with the dataset, NRT data are replaced with standard quality data when they are available (usually with a 2-3 month lag). This time-lag can be observed on the tables by verifying the acq_date column (See available Kaggle Notebook). For more details between NRT and Archive see this link.
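
    Since archive data supersede NRT data once available, one way to build a single timeline is to keep all archive rows and add only the NRT rows dated after the last archive date. The sketch below illustrates that idea on invented rows. Only the acq_date column is named in this description; the latitude and longitude fields, the `source` tag, and the helper `combine` are assumptions.

```python
from datetime import date

# Made-up detections; only acq_date is taken from the dataset description.
archive = [
    {"acq_date": date(2019, 9, 29), "latitude": -33.1, "longitude": 150.2},
    {"acq_date": date(2019, 9, 30), "latitude": -33.4, "longitude": 150.9},
]
nrt = [
    {"acq_date": date(2019, 9, 30), "latitude": -33.4, "longitude": 150.9},
    {"acq_date": date(2019, 10, 1), "latitude": -35.0, "longitude": 149.1},
]


def combine(archive_rows, nrt_rows):
    """Prefer archive rows; keep NRT rows only after the last archive date."""
    cutoff = max(r["acq_date"] for r in archive_rows)
    recent = [r for r in nrt_rows if r["acq_date"] > cutoff]
    tagged = [{**r, "source": "archive"} for r in archive_rows]
    tagged += [{**r, "source": "nrt"} for r in recent]
    return sorted(tagged, key=lambda r: r["acq_date"])


combined = combine(archive, nrt)
```

    Tagging each row with its source keeps the provisional NRT detections distinguishable from science-quality ones in later analysis.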

    More information on the instruments used for this dataset can be found here and here. All columns have been annotated with the description following Kaggle format. For data provenance, see the Metadata section.

    Acknowledgements

    We acknowledge the use of data and imagery from LANCE FIRMS operated by NASA's Earth Science Data and Information System (ESDIS) with funding provided by NASA Headquarters.

    NRT VIIRS 375 m Active Fire product VNP14IMGT. Available on-line [https://earthdata.nasa.gov/firms]. doi: 10.5067/FIRMS/VIIRS/VNP14IMGT.NRT.001.

    MODIS Collection 6 NRT Hotspot / Active Fire Detections MCD14DL. Available on-line [https://earthdata.nasa.gov/firms]. doi: 10.5067/FIRMS/MODIS/MCD14DL.NRT.006

    Inspiration

    Current news articles are a wonderful inspiration for ways to analyze this data and/or combine it with other datasets. Some ideas:

    There is also other information that can be combined with this dataset, such as local air quality, and local alerts to increase accuracy.

    Found an error? Let me know!

    Feel free to post on Discussion anything strange you may find, and I will be happy to follow-up.

  18. Poland Air Quality Dataset (2017-2023) + weather

    • kaggle.com
    zip
    Updated Sep 3, 2024
    Cite
    Igor (2024). Poland Air Quality Dataset (2017-2023) + weather [Dataset]. https://www.kaggle.com/datasets/wisekinder/poland-air-quality-monitoring-dataset-2017-2023
    Explore at:
    Available download formats: zip (1050041969 bytes)
    Dataset updated
    Sep 3, 2024
    Authors
    Igor
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Area covered
    Polska
    Description

    The Air Quality Dataset provides a comprehensive overview of atmospheric pollution levels across various locations in Poland from 2017 to 2023. It features extensive measurements of numerous air pollutants captured through an extensive network of air quality monitoring stations throughout the country. The dataset includes both hourly (1g) and daily (24g) averages of the recorded pollutants, offering detailed temporal resolution to study short-term peaks and long-term trends in air quality.

    Pollutants Measured:

    1. Gaseous Pollutants: Carbon Monoxide (CO), Nitrogen Dioxide (NO2), Nitric Oxide (NO), Nitrogen Oxides (NOx), Sulfur Dioxide (SO2), Ozone (O3), and Benzene (C6H6).
    2. Particulate Matter: PM10, PM2.5; and specific elements and compounds bound to PM10 such as Lead (Pb), Arsenic (As), Cadmium (Cd), Nickel (Ni), among others.
    3. Polycyclic Aromatic Hydrocarbons (PAHs) associated with PM10: Benzo[a]anthracene (BaA), Benzo[b]fluoranthene (BbF), Benzo[j]fluoranthene (BjF), Benzo[k]fluoranthene (BkF), Benzo[a]pyrene (BaP), Indeno[1,2,3-cd]pyrene (IP), Dibenzo[a,h]anthracene (DBahA).
    4. Additional Chemicals: Including various volatile organic compounds (VOCs) like ethylene, toluene, xylene variants, aldehydes, and hydrocarbons.
    

    Features of the Dataset:

    Locations: Data from numerous air quality monitoring stations distributed across various urban, suburban, and rural areas in Poland.
    Time Resolution: Measurements are provided in both hourly and daily intervals, catering to different analytical needs.
    Coverage Period: This dataset encompasses data from 2017 to 2023, enabling analysis over multiple years to discern trends and assess the effectiveness of air quality management policies.
    Deployment of Deposition Sampling: Concentrations of certain pollutants in wet and dry deposition forms, noted with 'cdepoz' (cumulative deposition), providing insights into the deposition rates of airborne pollutants.
    

    Potential Applications:

    Environmental Research: Study the impact of various pollutants on air quality, health, and the environment.
    Policy Making: Assist policymakers in evaluating the effectiveness of past regulations and planning future actions to improve air quality.
    Public Health: Correlate pollutant exposure levels with health outcomes, helping public health professionals to mitigate risks associated with poor air quality.
    

    Data Format:

    The dataset is structured in a tabular format with each row representing an observation time (either hourly or daily) and columns representing different pollutants and their concentrations at various monitoring stations.
    

    This dataset is an essential resource for researchers, policymakers, environmental agencies, and health professionals who need a detailed and robust dataset to understand and combat air pollution in Poland.

    Source of data: Chief Inspectorate of Environmental Protection (GIOS)

    The historic weather dataset for Cracow and Warsaw

    The historic weather dataset for Cracow and Warsaw with suburbs, covering daily observations from 2019 to August 2024, would encompass a range of atmospheric and meteorological data points collected over the defined time period and locations. Here’s a description of what such a dataset might include and signify:

    Key Characteristics:

    Locations: The cities of Cracow and Warsaw, along with their suburbs. The dataset would likely specify the exact areas or measurement stations.
    Time Frame: Daily records from January 1, 2019, to August 2024, providing a comprehensive view of weather variations through different seasons and years.
    Data Granularity: Daily data would allow trends such as temperature fluctuations, precipitation patterns, and weather anomalies to be studied in considerable detail.
    

    Likely Data Fields:

    Each record in the dataset might contain:

    DATE_VALID_STD: Representing each day within the date range specified (from 2019-01-01 to 2024-08-20 for Cracow and Warsaw suburbs).
    Temperature Fields (Min, Max, Avg): Temperature readings at specified intervals, likely in Celsius, providing insight into daily and seasonal temperature patterns and extremes.
    Humidity Fields (Min, Max, Avg): Relative and specific humidity readings to assess moisture levels in the air, which have implications for weather conditions, comfort levels, and health.
    Precipitation: Data on rainfall, snowfall, and total snow depth, essential for understanding water cycle dynamics, agricultural planning, and urban water management in these areas.
    Wind Measurements: May include minimum, average, and maximum speeds and perhaps prevailing directions, useful in sectors like aviation, construction, and event planning.
    Pressure and Tendency: Barometric pressure readings at different measurement standards to help predict weather changes.
    Radiation and Cloud Cover: D...
    
  19. Predict Mortality/Death Rate.

    • kaggle.com
    zip
    Updated Aug 8, 2017
    Cite
    Rajanand Ilangovan (2017). Predict Mortality/Death Rate. [Dataset]. https://www.kaggle.com/rajanand/mortality
    Explore at:
    Available download formats: zip (59991550 bytes)
    Dataset updated
    Aug 8, 2017
    Authors
    Rajanand Ilangovan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description
---

Context:
-----------

**Annual Health Survey: Mortality Schedule**

This unit-level dataset contains details of deaths that occurred among usual residents of the sample households during the reference period. It includes information on the sex of the deceased, date of death, age at death, registration of death, and source of medical attention received before death. For infant deaths, data on symptoms preceding death is also provided. The Mortality Schedule also includes information on various determinants of maternal mortality, viz. cases of death associated with pregnancy, factors leading or contributing to death, symptoms preceding death, time between onset of complications and death, etc.

There are a total of 770k observations and 121 variables in this dataset.

**[Survey:](http://www.who.int/bulletin/volumes/94/4/BLT-15-158493-table-T1.html)**

- Baseline survey: 2010-11 (4.14 million households in the sample)
- 1st update: 2011-12 (4.28 million households in the sample)
- 2nd update: 2012-13 (4.32 million households in the sample)

The survey was conducted in the following nine states.

A. Empowered Action Group [(EAG)](http://pib.nic.in/newsite/mbErel.aspx?relid=85350) States

1. Uttarakhand (05)
2. Rajasthan (08)
3. Uttar Pradesh (09)
4. Bihar (10)
5. Jharkhand (20)
6. Odisha (21)
7. Chhattisgarh (22)
8. Madhya Pradesh (23)

B. Assam (18)

These nine states, which account for about 48 percent of the total population, 59 percent of births, 70 percent of infant deaths, 75 percent of under-5 deaths and 62 percent of maternal deaths in the country, are the high-focus states in view of their relatively higher fertility and mortality.

Content:
-----------

The files contain the columns below.

**Variable Names:**

1. id
2. m_id
3. client_m_id
4. hl_id
5. house_no
6. house_hold_no
7. state
8. district
9. rural
10. stratum_code
11. psu_id
12. m_serial_no
13. deceased_sex
14. date_of_death
15. month_of_death
16. year_of_death
17. age_of_death_below_one_month
18. age_of_death_below_eleven_month
19. age_of_death_above_one_year
20. treatment_source
21. place_of_death
22. is_death_reg
23. is_death_certificate_received
24. serial_num_of_infant_mother
25. order_of_birth
26. death_symptoms
27. is_death_associated_with_pregnan
28. death_period
29. months_of_pregnancy
30. factors_contributing_death
31. factors_contributing_death_2
32. symptoms_of_death
33. time_between_onset_of_complicati
34. nearest_medical_facility
35. m_expall_status
36. field38
37. hh_id
38. client_hh_id
39. currently_dead_or_out_migrated
40. hh_serial_no
41. sex
42. usual_residance
43. relation_to_head
44. member_identity
45. father_serial_no
46. mother_serial_no
47. date_of_birth
48. month_of_birth
49. year_of_birth
50. age
51. religion
52. social_group_code
53. marital_status
54. date_of_marriage
55. month_of_marriage
56. year_of_marriage
57. currently_attending_school
58. reason_for_not_attending_school
59. highest_qualification
60. occupation_status
61. disability_status
62. injury_treatment_type
63. illness_type
64. symptoms_pertaining_illness
65. sought_medical_care
66. diagnosed_for
67. diagnosis_source
68. regular_treatment
69. regular_treatment_source
70. chew
71. smoke
72. alcohol
73. status
74. hh_expall_status
75. client_hl_id
76. serial_no
77. building_no
78. house_status
79. house_structure
80. owner_status
81. drinking_water_source
82. is_water_filter
83. water_filteration
84. toilet_used
85. is_toilet_shared
86. household_have_electricity
87. lighting_source
88. cooking_fuel
89. no_of_dwelling_rooms
90. kitchen_availability
91. is_radio
92. is_television
93. is_computer
94. is_telephone
95. is_washing_machine
96. is_refrigerator
97. is_sewing_machine
98. is_bicycle
99. is_scooter
100. is_car
101. is_tractor
102. is_water_pump
103. cart
104. land_possessed
105. hl_expall_status
106. fid
107. isdeadmigrated
108. residancial_status
109. iscoveredbyhealthscheme
110. healthscheme_1
111. healthscheme_2
112. housestatus
113. householdstatus
114. isheadchanged
115. fidh
116. fidx
117. as
118. wt
119. x
120. schedule_id
121. year

**File content:**

Mortality_data_dictionary.xlsx: This [**data dictionary**](https://www.kaggle.com/rajanand/mortality/downloads/Mortality_data_dictionary.xlsx) Excel workbook has detailed information about every column and the codes used in the data.

Acknowledgements
----------------

The [Department of Health and Family Welfare](https://nrhm-mis.nic.in/hmisreports/AHSReports.aspx), Govt. of India, has published this [dataset](https://data.gov.in/catalog/annual-health-survey-mortality-schedule) on the Open Govt Data Platform India portal under the [Govt. Open Data License - India](https://data.gov.in/government-open-data-license-india).

---
    "https://link.rajanand.org/sql-challenges" target="_blank"> https://link.rajanand.org/banner-02" alt="SQL Data Challenges" style="width: 700px; height: 120px">

Mahdi Aghaziarati (2025). machine learning models on the WDBC dataset [Dataset]. http://doi.org/10.57760/sciencedb.23537

machine learning models on the WDBC dataset

282 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Apr 15, 2025
Dataset provided by
Science Data Bank
Authors
Mahdi Aghaziarati
License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors (mean, standard error (SE), and worst, i.e. the mean of the three largest values) for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension.

All feature values were normalized using z-score standardization so that models sensitive to input ranges see features on a uniform scale. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling.

The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers, namely Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN), were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability. All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model.
Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and the scripts run in any Python 3.x environment that supports the required scikit-learn version. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically relevant dataset.
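
The workflow described above can be approximated with a short scikit-learn sketch. This is not the author's breast_cancer_classification_models.py; it is an illustration under several assumptions: it loads the copy of WDBC bundled with scikit-learn (`load_breast_cancer`) rather than the Kaggle CSV, uses `GaussianNB` as the Naïve Bayes variant, keeps default hyperparameters, and fits the scaler on the training split only. The function name `evaluate_baselines` and the `random_state` value are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Perceptron
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def evaluate_baselines(random_state=42):
    """Train four baseline classifiers on WDBC and return test-set metrics."""
    data = load_breast_cancer()
    X = data.data  # 569 observations, 30 real-valued features
    # scikit-learn's bundled copy encodes malignant as 0 and benign as 1.
    # Flip the labels so that malignant = 1 and benign = 0, as described.
    y = 1 - data.target

    # Stratified 80/20 split preserves the class balance in both sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )

    # Assumption: default hyperparameters; GaussianNB stands in for "Naive Bayes".
    models = {
        "Decision Tree": DecisionTreeClassifier(random_state=random_state),
        "Naive Bayes": GaussianNB(),
        "Perceptron": Perceptron(random_state=random_state),
        "KNN": KNeighborsClassifier(),
    }

    results = {}
    for name, clf in models.items():
        # Z-score standardization, fitted on the training split only.
        pipe = make_pipeline(StandardScaler(), clf)
        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)

        # ROC-AUC needs continuous scores; Perceptron has no predict_proba,
        # so fall back to its signed distance from the decision boundary.
        if hasattr(clf, "predict_proba"):
            scores = pipe.predict_proba(X_test)[:, 1]
        else:
            scores = pipe.decision_function(X_test)

        results[name] = {
            "Accuracy": accuracy_score(y_test, y_pred),
            "Precision": precision_score(y_test, y_pred),
            "Recall": recall_score(y_test, y_pred),
            "F1-score": f1_score(y_test, y_pred),
            "ROC-AUC": roc_auc_score(y_test, scores),
        }
    return results


if __name__ == "__main__":
    for model_name, metrics in evaluate_baselines().items():
        print(model_name, {k: round(float(v), 3) for k, v in metrics.items()})
```

Wrapping the scaler and classifier in one pipeline keeps test-set statistics out of the standardization step. Whether the original script standardized before or after splitting is not stated in the description, so the numbers from this sketch need not match classification_results.csv.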
