Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension.

All feature values were normalized using z-score standardization to ensure a uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling.

The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability. All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features).

The data files are in standard CSV and PNG formats and can be opened with any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically relevant dataset.
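For orientation, here is a minimal sketch approximating the preprocessing and training steps described above with scikit-learn. It is not the accompanying breast_cancer_classification_models.py script itself; the input file name and the raw column names ("id", "diagnosis" with "M"/"B" labels) are assumptions.

```python
# Minimal sketch of the described pipeline; file and raw column names are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

df = pd.read_csv("data.csv")                       # assumed file name
df = df.dropna(axis=1, how="all")                  # drop any fully empty trailing column, if present
X = df.drop(columns=["id", "diagnosis"])           # discard identifier, keep the 30 features
y = df["diagnosis"].map({"M": 1, "B": 0})          # label encoding: malignant = 1, benign = 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)   # stratified 80/20 split

scaler = StandardScaler()                          # z-score standardization
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Perceptron": Perceptron(),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name,
          accuracy_score(y_test, y_pred),
          precision_score(y_test, y_pred),
          recall_score(y_test, y_pred),
          f1_score(y_test, y_pred))
```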
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains dekadal rainfall indicators, computed from the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) version 2 product, which blends satellite imagery with in-situ station data, and from the CHIRPS-GEFS short-term rainfall forecasts, aggregated by subnational administrative units.
Included indicators are (for each dekad):
The administrative units used for aggregation are based on WFP data, and each unit carries a Pcode reference. The number of input pixels used to create each aggregate is provided in the n_pixels column. Finally, the type column indicates whether a value is based on a forecast, a preliminary product, or a final product.
Forecasts are issued on the 6th, 16th, and 26th of each month for the upcoming 10-day period (dekad), then updated with improved versions on the 1st, 11th, and 21st. Preliminary observations replace the previous dekad’s forecast on the 3rd, 13th, and 23rd, and are later replaced by final observations—published mid-month (13th or 23rd)—covering all three dekads of the prior month. Please find a summary below:
Publication Day: Forecast type, Covers (Dekad)
For more on CHIRPS-GEFS forecasts, see: https://www.chc.ucsb.edu/data/chirps-gefs
For further details, please see the methodology section.
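As a hedged starting point, the sketch below separates final observations from forecasts using the type and n_pixels columns described above; the file name, the exact "final" label string, and the "pcode" column name are assumptions.

```python
# Hedged sketch: split final products from forecasts/preliminary values.
# Only "type" and "n_pixels" come from the description; file name, the "final"
# label string, and the "pcode" column name are assumptions.
import pandas as pd

df = pd.read_csv("chirps_dekadal_aggregates.csv")
final = df[df["type"] == "final"]                           # keep final-product rows only
pixels_per_unit = final.groupby("pcode")["n_pixels"].max()  # input pixels per admin unit
print(df["type"].value_counts())                            # forecast vs preliminary vs final counts
print(pixels_per_unit.head())
```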
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The original Palmer's Penguins dataset is an invaluable resource in the world of data science, often used for statistical analysis, data visualization, and introductory machine learning tasks. Collected in the Palmer Archipelago near Antarctica, the dataset provides information on three species of penguins, including Adélie, Gentoo, and Chinstrap, and covers essential biological metrics such as bill dimensions and body mass.
Our extended dataset aims to build upon this foundational work by incorporating new, realistic features. We have included additional variables like diet, year of observation, life stage, and health metrics. These extra features allow for a more nuanced understanding of penguin biology and ecology, making it ideal for more complex analyses, including but not limited to educational, ecological, and advanced machine learning applications.
The dataset consists of the following columns:
The inclusion of yearly data from 2021 to 2025 allows for longitudinal studies, providing a temporal dimension that can help track the impact of climate change, dietary shifts, or other ecological factors on penguin populations over time.
We introduce the 'Health Metrics' column, which takes into account the body mass, life stage, and species to categorize each penguin's health status. This provides a multi-faceted view of individual well-being and can be crucial for conservation studies.
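As a purely hypothetical illustration (the rule actually used to build the Health Metrics column is not documented here), a derived health category could be computed from body mass, life stage, and species along these lines; all thresholds and column names below are made up.

```python
# Purely hypothetical illustration; the dataset's actual health-metric rule is not documented here.
import pandas as pd

def health_category(row, thresholds):
    """Categorize health from body mass relative to a species/life-stage cutoff (illustrative only)."""
    cutoff = thresholds.get((row["species"], row["life_stage"]), 4000)  # made-up default cutoff
    return "healthy" if row["body_mass_g"] >= cutoff else "underweight"

penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "life_stage": ["adult", "adult"],
    "body_mass_g": [3700, 5100],
})
thresholds = {("Adelie", "adult"): 3500, ("Gentoo", "adult"): 4800}  # made-up cutoffs
penguins["health_metric"] = penguins.apply(health_category, axis=1, thresholds=thresholds)
print(penguins)
```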
Our data structure enables the mapping of the diet to specific life stages, offering a granular understanding of penguin ecology. This added detail can be crucial for studying nutritional needs at different life stages.
Recognizing the importance of gender-based variations in penguin biology, our dataset incorporates attributes that allow for the study of sexual dimorphism, such as differing body sizes and potential diet variations between males and females.
This enriched dataset is particularly suitable for: - Advanced ecological models that require multiple layers of data. - Educational case studies focusing on biology, ecology, or data science. - Data-driven conservation efforts aimed at penguin species. - Machine learning algorithms that benefit from diverse and multi-dimensional data.
We wish to express our deepest respect and acknowledgment to the original research team behind the Palmer's Penguins dataset. This Extended Palmer's Penguins dataset is designed to build upon the solid foundation laid by the original work. It is created to serve as a complementary resource that adds additional dimensions for research and educational purposes. In no way is this artificial dataset intended to discredit or disrespect the invaluable contributions made through the original dataset.
All illustrations in this dataset are AI-generated.
By Andy Kriebel [source]
The file contains data on births in the United States from 1994 to 2014. The data includes the following columns:
- year: The year of the observation. (Integer)
- month: The month of the observation. (Integer)
- date_of_month: The day of the month of the observation. (Integer)
- day_of_week: The day of the week of the observation. (Integer)
- births: The number of births on the given day. (Integer)
The US Births dataset on Kaggle contains data on births in the United States from 1994 to 2014. The data is broken down by year, month, date of month, day of week, and births.
This dataset can be used to answer questions about when people are born, how common certain birthdays are, and any trends over time. For example, you could use this dataset to find out which day of the week or which month of the year has the most births.
- Determining which days of the year and days of the week see the most births, to help with staffing levels in maternity wards
- Identifying seasonal and long-term trends in the number of births over time
- Predicting the number of births on a given day
This data set is a combined effort of the U.S. National Center for Health Statistics and the U.S. Social Security Administration, provided by FiveThirtyEight. It contains data on births in the United States from 1994 to 2014, with the following columns: year, month, date_of_month, day_of_week, births
Thank you to FiveThirtyEight for providing this dataset!
License
License: Dataset copyright by authors
- You are free to:
  - Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt: remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit: provide a link to the license, and indicate if changes were made.
  - ShareAlike: distribute your contributions under the same license as the original.
  - Keep intact all notices that refer to this license, including copyright notices.
File: US_births_1994-2014.csv

| Column name | Description |
|:--------------|:----------------------------------------------|
| year | Year of the data. (Integer) |
| month | Month of the data. (Integer) |
| date_of_month | Day of the month of the data. (Integer) |
| day_of_week | Day of the week of the data. (Integer) |
| births | Number of births on the given day. (Integer) |
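As a quick example of the questions mentioned above (which day of the week or month has the most births), a short pandas sketch using the column names from the table; only the file name and columns come from this description.

```python
# Sketch: which day of the week and which month have the most births.
import pandas as pd

births = pd.read_csv("US_births_1994-2014.csv")
by_weekday = births.groupby("day_of_week")["births"].sum().sort_values(ascending=False)
by_month = births.groupby("month")["births"].mean()
print(by_weekday)            # total births per day of week, highest first
print(by_month.idxmax())     # month with the highest average daily births
```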
If you use this dataset in your research, please credit Andy Kriebel.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Weather is recorded every 10 minutes throughout the entire year of 2020, comprising 20 meteorological indicators measured at a Max Planck Institute weather station. The dataset provides comprehensive atmospheric measurements including air temperature, humidity, wind patterns, radiation, and precipitation. With over 52,560 data points per variable (365 days × 24 hours × 6 measurements per hour), this high-frequency sampling offers detailed insights into weather patterns and atmospheric conditions. The measurements include both basic weather parameters and derived quantities such as vapor pressure deficit and potential temperature, making it suitable for both meteorological research and practical applications. You can find some initial analysis using this dataset here: "Weather Long-term Time Series Forecasting Analysis".
The dataset is provided in a CSV format with the following columns:
Column Name | Description |
---|---|
date | Date and time of the observation. |
p | Atmospheric pressure in millibars (mbar). |
T | Air temperature in degrees Celsius (°C). |
Tpot | Potential temperature in Kelvin (K), representing the temperature an air parcel would have if moved to a standard pressure level. |
Tdew | Dew point temperature in degrees Celsius (°C), indicating the temperature at which air becomes saturated with moisture. |
rh | Relative humidity as a percentage (%), showing the amount of moisture in the air relative to the maximum it can hold at that temperature. |
VPmax | Maximum vapor pressure in millibars (mbar), representing the maximum pressure exerted by water vapor at the given temperature. |
VPact | Actual vapor pressure in millibars (mbar), indicating the current water vapor pressure in the air. |
VPdef | Vapor pressure deficit in millibars (mbar), measuring the difference between maximum and actual vapor pressure, used to gauge drying potential. |
sh | Specific humidity in grams per kilogram (g/kg), showing the mass of water vapor per kilogram of air. |
H2OC | Concentration of water vapor in millimoles per mole (mmol/mol) of dry air. |
rho | Air density in grams per cubic meter (g/m³), reflecting the mass of air per unit volume. |
wv | Wind speed in meters per second (m/s), measuring the horizontal motion of air. |
max. wv | Maximum wind speed in meters per second (m/s), indicating the highest recorded wind speed over the period. |
wd | Wind direction in degrees (°), representing the direction from which the wind is blowing. |
rain | Total rainfall in millimeters (mm), showing the amount of precipitation over the observation period. |
raining | Duration of rainfall in seconds (s), recording the time for which rain occurred during the observation period. |
SWDR | Short-wave downward radiation in watts per square meter (W/m²), measuring incoming solar radiation. |
PAR | Photosynthetically active radiation in micromoles per square meter per second (µmol/m²/s), indicating the amount of light available for photosynthesis. |
max. PAR | Maximum photosynthetically active radiation recorded in the observation period in µmol/m²/s. |
Tlog | Temperature logged in degrees Celsius (°C), potentially from a secondary sensor or logger. |
OT | Likely refers to an "operational timestamp" or an offset in time, but may need clarification depending on the dataset's context. |
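As a quick illustration of the derived quantities in the table, the sketch below reconstructs VPdef from VPmax and rh, assuming the standard relation VPact = VPmax * rh / 100. The column names come from the table above; the CSV file name is hypothetical.

```python
# Hedged consistency check: reconstruct VPdef from VPmax and rh,
# assuming the standard relation VPact = VPmax * rh / 100.
import pandas as pd

wx = pd.read_csv("weather_2020.csv")              # hypothetical file name
vpact_est = wx["VPmax"] * wx["rh"] / 100.0        # estimated actual vapor pressure (mbar)
vpdef_est = wx["VPmax"] - vpact_est               # vapor pressure deficit (mbar)
print((vpdef_est - wx["VPdef"]).abs().max())      # should be near zero if the relation holds
```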
This high-resolution meteorological dataset enables applications across multiple domains. For weather forecasting, the frequent measurements support development of prediction models, while climate researchers can study microclimate variations and seasonal patterns. In agriculture, temperature and vapor pressure deficit data aids crop modeling and irrigation planning. The wind and radiation measurements benefit renewable energy planning, while the comprehensive atmospheric data supports environmental monitoring. The dataset's detailed nature makes it particularly suitable for machine learning applications and educational purposes in meteorology and data science.
This dataset is entirely synthetic and NOT real data; it deals only with values that we created ourselves.
This dataset deals with pollution recorded in the harbours of Chennai, Kolkata, and Visakhapatnam. Creating and collecting all the data and arranging it in a format that interests data scientists is a pain, so I gathered four major pollutants and placed them neatly in a CSV file.
There are 29 fields in total; each of the four pollutants (NO2, O3, SO2, and O3) has 5 dedicated columns. This kernel provides a good introduction to this dataset!
For observations on specific columns visit the Column Metadata on the Data tab.
I did a related project and decided to open-source our dataset so that data scientists don't need to re-scrape historical pollution data from scratch.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset consists of 100,000 observations from the Data Release (DR) 18 of the Sloan Digital Sky Survey (SDSS). Each observation is described by 42 features and 1 class column classifying the observation as either:
You can read more about the features below:
The run number refers to a specific period in which the SDSS observes a part of the sky. SDSS is divided into several runs, each lasting for a certain amount of time, which are then combined to cover an extensive portion of the sky. The rerun number refers to the reprocessing of the data obtained.
In each run, multiple charge-coupled device (CCD) cameras are arranged into columns, each responsible for imaging a specific portion of the sky; camcol refers to the camera column that imaged a specific observation. A field is a specific portion of the sky that is imaged during a single exposure of the telescope. The surveyed sky is divided into a number of fields, and the field number column refers to the field, or portion of the sky, from which an observation was obtained.
A number of physical glass plates are mounted on the telescope, each containing a number of optical fibers corresponding to a specific position in the sky. When light hits these optical fibers, it is sent to spectrographs for analysis. plate number and fiberID refer to the number of the plate and the ID of the optical fiber responsible for gathering light from the celestial object respectively.
Modified Julian Date represents the number of days that have passed since midnight Nov. 17, 1858. It is used in SDSS to keep track of the time of each observation.
The petrosian radius is a measure of the size of a galaxy, and it is calculated using the petrosian flux profile. The petrosian flux profile measures how the brightness of an object varies with distance from its center. The petrosian radius is defined as the distance from the galaxy's center where the ratio of the local surface brightness to the average surface brightness reaches a certain predefined value. The local surface brightness refers to the brightness of a specific small region or pixel on the surface of an extended object. It is a measure of how much light is detected from that particular region. The average surface brightness, on the other hand, represents the mean or average brightness measured over the entire surface of the extended object. It is the total amount of light received from the object divided by its total area.
These parameters help in characterizing the properties of celestial objects, especially when studying their morphologies, sizes, and how they evolve over time.
These parameters help in studying the photometric properties of the celestial objects, particularly in analyzing the brightness, colors, and spectral energy distribution of the objects. By using petrosian fluxes in different bands, astronomers can obtain a comprehensive view of an object's light emission across the electromagnetic spectrum.
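A hedged numerical sketch of the Petrosian-radius definition above: find the radius at which the local surface brightness divided by the mean surface brightness inside that radius falls to a fixed ratio (0.2 is the value conventionally used by SDSS). The toy profile and implementation details are illustrative only, not SDSS pipeline code.

```python
# Illustrative sketch of the Petrosian-radius definition (toy data, not SDSS pipeline code).
import numpy as np

def petrosian_radius(r, sb, ratio=0.2):
    """r: increasing radii; sb: local surface brightness at each radius."""
    dr = np.diff(r, prepend=0.0)
    flux_inside = np.cumsum(sb * 2.0 * np.pi * r * dr)   # flux enclosed within r (annular sum)
    mean_sb = flux_inside / (np.pi * r**2)                # average surface brightness inside r
    below = np.where(sb / mean_sb <= ratio)[0]            # first radius where the ratio drops to `ratio`
    return r[below[0]] if below.size else None

r = np.linspace(0.1, 20.0, 200)
sb = np.exp(-r / 3.0)                                     # toy exponential brightness profile
print(petrosian_radius(r, sb))
```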
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
The data set contains patient records from a 1984-1989 trial conducted by the German Breast Cancer Study Group (GBSG) of 720 patients with node-positive breast cancer; it retains the 686 patients with complete data for the prognostic variables. These data are used in the paper by Royston and Altman (2013): the Rotterdam data are used to create a fitted model, and the GBSG data are used for validation of the model. The paper gives references for the data source.
A data set with 686 observations and 11 variables.
Columns | Description |
---|---|
pid | patient identifier |
age | age, years |
meno | menopausal status (0= premenopausal, 1= postmenopausal) |
size | tumor size, mm |
grade | tumor grade |
nodes | number of positive lymph nodes |
pgr | progesterone receptors (fmol/l) |
er | estrogen receptors (fmol/l) |
hormon | hormonal therapy, 0= no, 1= yes |
rfstime | recurrence free survival time; days to first of recurrence, death or last follow-up |
status | 0= alive without recurrence, 1= recurrence or death |
Patrick Royston and Douglas Altman, External validation of a Cox prognostic model: principles and methods. BMC Medical Research Methodology 2013, 13:33
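As a minimal illustration of how these columns might be used (not the paper's actual analysis, which fits on the Rotterdam data and validates on GBSG), the sketch below fits a Cox proportional hazards model directly to the GBSG variables; it assumes a gbsg.csv export of the table above and the lifelines package.

```python
# Minimal sketch: Cox model on the GBSG columns (assumes lifelines and a gbsg.csv export).
import pandas as pd
from lifelines import CoxPHFitter

gbsg = pd.read_csv("gbsg.csv")
covariates = ["age", "meno", "size", "grade", "nodes", "pgr", "er", "hormon"]
cph = CoxPHFitter()
cph.fit(gbsg[covariates + ["rfstime", "status"]],
        duration_col="rfstime", event_col="status")   # time to recurrence/death, event indicator
cph.print_summary()
```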
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset captures detailed information about urban parking activity, traffic conditions, and vehicle types over time. With over 18,400 entries spread across 11 columns, it offers a sizable and rich set of observations—ideal for anyone looking to explore parking trends, analyze traffic flow, or build models to predict parking availability.
What’s Inside:
Timestamps: Each entry is time-stamped (from October 4 to December 19, 2016), making this a time-series dataset. That means you can track how parking behavior changes over days, weeks, or months—like identifying peak hours or weekend patterns.
Unique Events: Every row comes with a unique ID, so each record represents a single moment or observation. That ensures clean data without duplicates.
Parking Locations: The SystemCodeNumber column identifies where each record came from—there are 14 different locations or systems in total. Codes like BHMBONCSTB1, Broad Street, and Others-OCCSP119A show that data comes from multiple spots, which helps in comparing how parking demand varies by area.
Capacity vs. Occupancy: Two of the most important columns show how many parking spaces were available (Capacity) and how many were filled (Occupancy) at any given time. Together, they tell us how full a lot was and help track usage levels. Some locations had space for thousands of cars, while others were much smaller.
Geolocation: Latitude and longitude are included, meaning you can map every observation. This is especially helpful if you're working with GIS tools or want to visualize parking availability across a city.
Vehicle Types: Most vehicles in the data are cars (81%), followed by bikes (20%) and a small number of other types (about 13%, or 3,578 entries). This breakdown can help in designing parking facilities or allocating space differently based on need.
Traffic Conditions: The TrafficCondition column categorizes how busy the surrounding roads were: low (42%), average (35%), and high (23%). These conditions can be correlated with parking occupancy—like whether traffic is worse when lots are full.
Queue Length: This column tracks how many vehicles were waiting for a spot (from 0 to 15), giving insight into where and when demand exceeded supply.
Special Days: There’s also a flag (IsSpecialDay) indicating whether a day was out of the ordinary—perhaps due to an event, holiday, or other factor affecting usual patterns.
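A brief sketch of how the columns described above could be combined into an occupancy rate and related to traffic conditions. The file name and the exact QueueLength spelling are assumptions; SystemCodeNumber, Capacity, Occupancy, and TrafficCondition come from the description.

```python
# Sketch: occupancy rate per location and by traffic condition (file name is hypothetical).
import pandas as pd

parking = pd.read_csv("urban_parking.csv")
parking["OccupancyRate"] = parking["Occupancy"] / parking["Capacity"]
by_lot = parking.groupby("SystemCodeNumber")["OccupancyRate"].mean().sort_values(ascending=False)
by_traffic = parking.groupby("TrafficCondition")[["OccupancyRate", "QueueLength"]].mean()
print(by_lot.head())      # busiest locations on average
print(by_traffic)         # does occupancy rise with road traffic?
```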
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Data sets containing the data from a population study of non-alcoholic fatty liver disease (NAFLD). Subjects with the condition and a set of matched control subjects were followed forward for metabolic conditions, cardiac endpoints, and death.
Columns | Description |
---|---|
id | subject identifier |
age | age at entry to the study |
male | 0=female, 1=male |
weight | weight in kg |
height | height in cm |
bmi | body mass index |
case.id | the id of the NAFLD case to whom this subject is matched |
futime | time to death or last follow-up |
status | 0= alive at last follow-up, 1=dead |
Columns | Description |
---|---|
id | subject identifier |
days | days since index date |
test | the type of value recorded |
value | the numeric value |
Columns | Description |
---|---|
id | subject identifier |
days | days since index date |
event | the endpoint that occurred |
The primary reference for the NAFLD study is Allen (2018). The incidence of non-alcoholic fatty liver disease (NAFLD) has been rising rapidly in the last decade and it is now one of the main drivers of hepatology practice (Tapper 2018). It is essentially the presence of excess fat in the liver, and parallels the ongoing obesity epidemic. Approximately 20-25% of NAFLD patients will develop the inflammatory state of non-alcoholic steatohepatitis (NASH), leading to fibrosis and eventual end-stage liver disease. NAFLD can be accurately diagnosed by MRI methods, but NASH diagnosis currently requires a biopsy.
The current study constructed a population cohort of all adult NAFLD subjects from 1997 to 2014 along with 4 potential controls for each case. To protect patient confidentiality all time intervals are in days since the index date; none of the dates from the original data were retained. Subject age is their integer age at the index date, and the subject identifier is an arbitrary integer. As a final protection, we include only a 90% random sample of the data. As a consequence analyses results will not exactly match the original paper.
There are 3 data sets: nafld1 contains baseline data and has one observation per subject, nafld2 has one observation for each (time-dependent) continuous measurement, and nafld3 has one observation for each yes/no outcome that occurred.
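As a starting point, the sketch below estimates a Kaplan–Meier survival curve from the futime and status columns of nafld1; the CSV export name and the use of the lifelines package are assumptions.

```python
# Sketch: overall survival curve from nafld1 (assumes a nafld1.csv export and lifelines).
import pandas as pd
from lifelines import KaplanMeierFitter

nafld1 = pd.read_csv("nafld1.csv")
km = KaplanMeierFitter()
km.fit(durations=nafld1["futime"], event_observed=nafld1["status"], label="all subjects")
print(km.median_survival_time_)   # median follow-up time to death, if reached
```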
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
By Ahsan Aman [source]
This dataset contains preprocessed climate change data for Islamabad, Pakistan from 2009 to 2013. Evaluate the shifts in local conditions to understand how climate change is impacting the region. Analyze changing patterns in maximum and minimum temperature readings, as well as atmospheric pressure, cloud cover, wind speed and rain levels. Assess how all of these factors together contribute to a dynamic weather pattern, and discover emerging trends for the years ahead. Get a detailed breakdown of daily weather measurements that can inform forecasting models and drive public awareness of climate change issues in this region of the world.
This dataset contains daily climate change data for Islamabad, Pakistan for the years 2009 to 2013. This data can be used to analyze the different climate factors and to assess their impact on local weather conditions. This dataset is ideal for understanding how climate change affects daily weather patterns in an area over a long period of time.
- First, explore the columns of this dataset and understand what each means:
- daymonth_category: The month and day of the observation (String)
- weather: The weather conditions during the observation (String)
- max_temp: The maximum temperature during the observation (Float)
- min_temp: The minimum temperature during the observation (Float)
- wind: The wind speed during the observation (Float)
- rain: The amount of rain during the observation (Float)
- cloud: The amount of cloud cover during the observation (Float)
- pressure: The atmospheric pressure during the observation (Float)
- year: The year of the observation (Integer)
- weathervalue: An integer representing the numerical value assigned to the weather conditions during the observation (Integer)
- avg_temp: The average temperature during the observation (Float)
After familiarizing yourself with these columns, use descriptive analysis methods such as filtering or grouping by criteria like weather type or date range to explore specific trends within this data set. Depending on your purpose or research question, different kinds of filtering and grouping can provide useful insights into factors related to climate change. For example, you may wish to look at maximum-temperature trends in July and August to observe yearly fluctuations due to heat waves, or view rainfall trends for each month across all five years in the dataset (a small sketch of this kind of grouping follows the list of use cases below).
Another important feature contained within this data set is the weathervalue column, which assigns a numerical value to each specific weather event occurring throughout the study period. These values can be used as labels (e.g., from 0 to 9) for further machine-learning projects based on this data set. Based on these values, you could also create comparison graphs of the mean and standard deviation of particular weather event types and see how they relate to higher or lower temperatures, or to other factors such as rainfall rates.
- Analyzing the correlation between climate change and daily weather trends in Islamabad, Pakistan over time.
- Understanding how different temperature ranges affect Islamabad, Pakistan's population and tourism levels during different months of the year.
- Creating a predictive model to forecast future climate change data and weather patterns in Islamabad, Pakistan using machine learning classification algorithms like Decision Trees or Random Forests.
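A minimal sketch of the July–August grouping suggested above, using the listed column names (daymonth_category, max_temp, year). Since the exact format of daymonth_category is not documented here, the month extraction below is an assumption.

```python
# Sketch of the July-August maximum-temperature grouping suggested above.
# The month extraction assumes daymonth_category contains a month name, which is an assumption.
import pandas as pd

clim = pd.read_csv("dataset_preprocessed_Islamabad.csv")
clim["month_name"] = clim["daymonth_category"].astype(str).str.extract(r"([A-Za-z]+)")[0]
summer = clim[clim["month_name"].isin(["July", "August", "Jul", "Aug"])]
print(summer.groupby("year")["max_temp"].max())     # yearly July-August maxima
```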
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: dataset_preprocessed_Islamabad.csv | Column name | Description | |:----------------------|:-------------------------------------------------------| | daymonth_category | A categorization of the day and month. (String) | | weather | The weather condition observed. (String)...
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains daily historical weather data recorded at multiple weather stations from January 1, 2020, to December 30, 2020. The data includes temperature, precipitation, humidity, wind speed, and weather conditions, providing a comprehensive view of the weather patterns over the year. This dataset is ideal for climate analysis, weather prediction, and educational purposes.
- Date: The date of the observation.
- Station: The weather station identifier.
- Temperature: The recorded temperature (in Celsius).
- Precipitation: The recorded precipitation (in mm).
- Humidity: The recorded humidity (in %).
- WindSpeed: The recorded wind speed (in km/h).
- WeatherCondition: The recorded weather condition (e.g., sunny, rainy, snowy).

Data generated synthetically for educational purposes.
The insurance.csv dataset contains 1338 observations (rows) and 7 features (columns). The dataset contains 4 numerical features (age, bmi, children and expenses) and 3 nominal features (sex, smoker and region) that were converted into factors with numerical value designated for each level.
Insurance.csv file is obtained from the Machine Learning course website (Spring 2017) from Professor Eric Suess at http://www.sci.csueastbay.edu/~esuess/stat6620/#week-6.
The purpose of this exercise is to examine the relationships among the different features and to fit a multiple linear regression of existing medical expenses on several characteristics of an individual, such as age, physical/family condition, and location. The resulting model can be used to predict the future medical expenses of individuals and thus help a medical insurer decide what premium to charge.
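A minimal sketch of the regression described above, using scikit-learn and the listed columns (age, bmi, children, expenses, sex, smoker, region); one-hot encoding of the nominal features is a common choice here, not necessarily the course's exact factor coding.

```python
# Sketch: multiple linear regression of expenses on the listed features.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

ins = pd.read_csv("insurance.csv")
X = pd.get_dummies(ins[["age", "bmi", "children", "sex", "smoker", "region"]], drop_first=True)
y = ins["expenses"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
print(reg.score(X_test, y_test))        # R^2 on held-out data
```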
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
The data describe the change in flux (light intensity) of several thousand stars. Each star has a binary label of 2 or 1: a label of 2 indicates that the star is confirmed to have at least one exoplanet in orbit (some observations are in fact multi-planet systems), while a label of 1 indicates no confirmed exoplanet.
As you can imagine, planets themselves do not emit light, but the stars that they orbit do. If said star is watched over several months or years, there may be a regular 'dimming' of the flux (the light intensity). This is evidence that there may be an orbiting body around the star; such a star could be considered to be a 'candidate' system. Further study of our candidate system, for example by a satellite that captures light at a different wavelength, could solidify the belief that the candidate can in fact be 'confirmed'.
[Flux diagram: https://cdn.pbrd.co/images/5g0jyccQF.png]
In the above diagram, a star is orbited by a blue planet. At t = 1, the starlight intensity drops because it is partially obscured by the planet, given our position. The starlight rises back to its original value at t = 2. The graph in each box shows the measured flux (light intensity) at each time interval.
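A hedged sketch of a first look at the data: remap the 2/1 labels to 1/0 and plot one confirmed exoplanet-star's flux series to look for the periodic dimming described above. The file name and the LABEL/FLUX.* column layout are assumptions, not guaranteed by this description.

```python
# Hedged sketch; the file name and LABEL/FLUX.* column layout are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("exoTrain.csv")                    # assumed train-set file name
train["LABEL"] = train["LABEL"].map({2: 1, 1: 0})      # 1 = confirmed exoplanet host, 0 = none confirmed
flux = train[train["LABEL"] == 1].iloc[0, 1:].astype(float)   # flux series of one confirmed host
plt.plot(flux.values)
plt.xlabel("time step")
plt.ylabel("flux")
plt.title("Flux of one confirmed exoplanet-star")
plt.show()
```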
Trainset:
Testset:
The data presented here are cleaned and are derived from observations made by the NASA Kepler space telescope. The Mission is ongoing - for instance data from Campaign 12 was released on 8th March 2017. Over 99% of this dataset originates from Campaign 3. To boost the number of exoplanet-stars in the dataset, confirmed exoplanets from other campaigns were also included.
To be clear, all observations from Campaign 3 are included. And in addition to this, confirmed exoplanet-stars from other campaigns are also included.
The datasets were prepared late-summer 2016.
Campaign 3 was used because 'it was felt' that this Campaign is unlikely to contain any undiscovered (i.e. wrongly labelled) exoplanets.
NASA open-sources the original Kepler Mission data and it is hosted at the Mikulski Archive. After being beamed down to Earth, NASA applies de-noising algorithms to remove artefacts generated by the telescope. The data - in the .fits format - is stored online. And with the help of a seasoned astrophysicist, anyone with an internet connection can embark on a search to find and retrieve the datafiles from the Archive.
The cover image is copyright © 2011 by Dan Lessmann
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
By [source]
The dataset contains results of an experiment examining the complex and fascinating antipredator responses of wild brown rats (Rattus norvegicus) to both cat fur and possum fur odour cues in a semi-natural environment. This experiment sought to explore how such animals think, feel, and react when presented with potential danger signals in their environment. The data reflects the behavior of wild brown rats that were housed in open air enclosures containing two feeding stations, each presented with different treatment scenarios: one pair of feeding stations received cat fur, one pair received possum fur, and another pair was provided no added odour cue as control. The results documented can offer deeper insight into the relative sensitivity of these animals when exposed to predator-related cues - not only their behavioral reactions but also their emotional state and physiological processes that are involved - informing us on their response strategies under different contexts. Furthermore, this information can be used as a practical tool for those interested in animal welfare conservation or management. So why wait? Dig deep into this dataset and uncover what wild brown rats are truly made of!
This dataset contains valuable information about the responses of wild brown rats to predator odour cues in a semi-natural, open air environment. This dataset can be used in order to gain further insight into the behaviour of these animals when presented with different predator odours and offers an opportunity to study their sensitivity and response times.
In order to use this dataset, one should look at the column headings for more information on what each observation contains. Each row will represent an observation from one rat from a particular cohort, on a given night, treated with either cat fur or possum fur odour cue treatments, along with control (i.e., no treatment) observations. The columns contain important data points such as sex, cohort number and cage number (helpful for inter-run comparisons). There are also observations regarding how much time individual rodents spent in proximity of two food stations (helpful for measuring responsiveness), as well as how long they spent in each station’s food hopper before consuming their food (helpful for measuring interest level). Finally, it will be possible to explore how each rat responded when presented with “opposite treatments” - i.e., whether they had preferences or biases towards certain predators over others due to previous diet experiences or other factors.
Overall this dataset provides an insight into the behaviour of wild brown rats when responding to different predator stimuli - offering a great deal of potential research avenues which could be explored by analysing this data effectively!
- Examine correlations between individual animal behaviors and their response to varying levels of predator odors.
- Test differences in responses of wild brown rats within different environmental contexts, such as open air enclosures, laboratory cages, and natural habitats.
- Compare rats’ responses to predator-related olfactory cues with other factors such as sex, age, or genetic make-up to identify variations in behaviors within different population subgroups.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: 2019-Oct10._Choice_McQU_DATA_1_hour_bins.csv | Column name | Description | |:---------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------| | Cohort | The group of animals being tested. (Numeric) | | **Night*...
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Attribute types: Real: 1, 4, 5, 8, 10, 12 | Ordered: 11 | Binary: 2, 6, 9 | Nominal: 3, 7, 13
Variable to be predicted: Absence (1) or presence (2) of heart disease
Cost Matrix

 | absence (predicted) | presence (predicted) |
---|---|---|
absence (true) | 0 | 1 |
presence (true) | 5 | 0 |

The rows represent the true values and the columns the predicted values. A worked example of applying this cost matrix is given after the attribute list below.
There are 270 observations in total, with no missing values.
1. age
2. sex
3. chest pain type (4 values)
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0, 1, 2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. target: absence (1) or presence (2) of heart disease
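A worked sketch of applying the cost matrix above when evaluating a classifier: misclassifying a true "presence" case as "absence" costs 5, the reverse costs 1, and the average cost is the sum of cost-weighted confusion-matrix cells divided by the number of cases. The labels 1/2 follow the target definition above; everything else is illustrative.

```python
# Sketch: expected misclassification cost under the cost matrix above
# (missing a true "presence" costs 5, a false alarm costs 1).
import numpy as np
from sklearn.metrics import confusion_matrix

def expected_cost(y_true, y_pred):
    # labels: 1 = absence, 2 = presence (as in the target definition above)
    cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
    cost = np.array([[0, 1],    # true absence:  correct, false alarm
                     [5, 0]])   # true presence: missed case, correct
    return (cm * cost).sum() / cm.sum()

y_true = [1, 2, 2, 1, 2]        # toy example
y_pred = [1, 1, 2, 1, 2]        # one missed "presence" case
print(expected_cost(y_true, y_pred))   # (1 * 5) / 5 = 1.0
```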
News about the Australian bushfires has been spreading fast; however, the same can't be said about the datasets. This NASA FIRMS MODIS and VIIRS Fire/Hotspot data provides an initial dataset for fires in Australia. My main motivation to provide this dataset was the following article:
See the Ideas Section at the bottom for more inspiration.
The current dataset provides 4 tables from 2 NASA Satellite Instruments:
MODIS C6 Tables
VIIRS 375m Tables
The provided URLs contain more details on the differences between the two instruments. Each instrument contains two types of tables: Near Real Time (_nrt) and older standard/science-quality data (_archive). According to the README provided with the dataset, NRT data are replaced with standard-quality data when they become available (usually with a 2-3 month lag). This time lag can be observed in the tables by checking the acq_date column (see the available Kaggle Notebook). For more details on the difference between NRT and Archive, see this link.
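Since NRT rows are eventually superseded by standard-quality rows, one might combine an archive table with its NRT counterpart while keeping only NRT acquisitions newer than the archive's last date. The sketch below does this; the file names are hypothetical, and only the acq_date column name comes from the description.

```python
# Hedged sketch: combine archive (standard quality) and NRT tables without overlap.
import pandas as pd

archive = pd.read_csv("fire_archive_M6.csv", parse_dates=["acq_date"])  # hypothetical file name
nrt = pd.read_csv("fire_nrt_M6.csv", parse_dates=["acq_date"])          # hypothetical file name

cutoff = archive["acq_date"].max()                 # archive rows take precedence up to this date
combined = pd.concat([archive, nrt[nrt["acq_date"] > cutoff]], ignore_index=True)
print(combined["acq_date"].agg(["min", "max"]))    # covered date range after merging
```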
More information on the instruments used for this dataset can be found here and here. All columns have been annotated with the description following Kaggle format. For data provenance, see the Metadata section.
We acknowledge the use of data and imagery from LANCE FIRMS operated by NASA's Earth Science Data and Information System (ESDIS) with funding provided by NASA Headquarters.
NRT VIIRS 375 m Active Fire product VNP14IMGT. Available on-line [https://earthdata.nasa.gov/firms]. doi: 10.5067/FIRMS/VIIRS/VNP14IMGT.NRT.001.
MODIS Collection 6 NRT Hotspot / Active Fire Detections MCD14DL. Available on-line [https://earthdata.nasa.gov/firms]. doi: 10.5067/FIRMS/MODIS/MCD14DL.NRT.006
Current news articles are a wonderful inspiration for ways to analyze this data and or combine to other datasets. Some ideas:
There is also other information that can be combined with this dataset, such as local air quality, and local alerts to increase accuracy.
Feel free to post on Discussion anything strange you may find, and I will be happy to follow-up.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Air Quality Dataset provides a comprehensive overview of atmospheric pollution levels across various locations in Poland from 2017 to 2023. It features extensive measurements of numerous air pollutants captured through an extensive network of air quality monitoring stations throughout the country. The dataset includes both hourly (1g) and daily (24g) averages of the recorded pollutants, offering detailed temporal resolution to study short-term peaks and long-term trends in air quality.
Pollutants Measured:
1. Gaseous Pollutants: Carbon Monoxide (CO), Nitrogen Dioxide (NO2), Nitric Oxide (NO), Nitrogen Oxides (NOx), Sulfur Dioxide (SO2), Ozone (O3), and Benzene (C6H6).
2. Particulate Matter: PM10, PM2.5; and specific elements and compounds bound to PM10 such as Lead (Pb), Arsenic (As), Cadmium (Cd), Nickel (Ni), among others.
3. Polycyclic Aromatic Hydrocarbons (PAHs) associated with PM10: Benzo[a]anthracene (BaA), Benzo[b]fluoranthene (BbF), Benzo[j]fluoranthene (BjF), Benzo[k]fluoranthene (BkF), Benzo[a]pyrene (BaP), Indeno[1,2,3-cd]pyrene (IP), Dibenzo[a,h]anthracene (DBahA).
4. Additional Chemicals: Including various volatile organic compounds (VOCs) like ethylene, toluene, xylene variants, aldehydes, and hydrocarbons.
Features of the Dataset:
Locations: Data from numerous air quality monitoring stations distributed across various urban, suburban, and rural areas in Poland.
Time Resolution: Measurements are provided in both hourly and daily intervals, catering to different analytical needs.
Coverage Period: This dataset encompasses data from 2017 to 2023, enabling analysis over multiple years to discern trends and assess the effectiveness of air quality management policies.
Deployment of Deposition Sampling: Concentrations of certain pollutants in wet and dry deposition forms, noted with 'cdepoz' (cumulative deposition), providing insights into the deposition rates of airborne pollutants.
Potential Applications:
Environmental Research: Study the impact of various pollutants on air quality, health, and the environment.
Policy Making: Assist policymakers in evaluating the effectiveness of past regulations and planning future actions to improve air quality.
Public Health: Correlate pollutant exposure levels with health outcomes, helping public health professionals to mitigate risks associated with poor air quality.
Data Format:
The dataset is structured in a tabular format with each row representing an observation time (either hourly or daily) and columns representing different pollutants and their concentrations at various monitoring stations.
This dataset is an essential resource for researchers, policymakers, environmental agencies, and health professionals who need a detailed and robust dataset to understand and combat air pollution in Poland.
Source of data: Chief Inspectorate of Environmental Protection (GIOS)
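For working with the tabular format described above, here is a hedged sketch that aggregates an hourly (1g) series to daily means for comparison with the 24g averages; the file name and the datetime column name are assumptions beyond what the description states.

```python
# Hedged sketch: align an hourly (1g) pollutant table to daily means.
import pandas as pd

hourly = pd.read_csv("pm10_1g_2017_2023.csv", parse_dates=["datetime"], index_col="datetime")
daily_mean = hourly.resample("D").mean(numeric_only=True)   # hourly records aggregated to daily values
print(daily_mean.head())
```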
The historic weather dataset for Cracow and Warsaw with suburbs, covering daily observations from 2019 to August 2024, would encompass a range of atmospheric and meteorological data points collected over the defined time period and locations. Here’s a description of what such a dataset might include and signify: Key Characteristics:
Locations: The cities of Cracow and Warsaw, along with their suburbs. The dataset would likely specify the exact areas or measurement stations.
Time Frame: Daily records from January 1, 2019, to August 20, 2024, providing a comprehensive view of weather variations through different seasons and years.
Data Granularity: Daily data would allow trends such as temperature fluctuations, precipitation patterns, and weather anomalies to be studied in considerable detail.
Likely Data Fields:
Each record in the dataset might contain:
DATE_VALID_STD: Representing each day within the date range specified (from 2019-01-01 to 2024-08-20 for Cracow and Warsaw suburbs).
Temperature Fields (Min, Max, Avg): Temperature readings at specified intervals, likely in Celsius, providing insight into daily and seasonal temperature patterns and extremes.
Humidity Fields (Min, Max, Avg): Relative and specific humidity readings to assess moisture levels in the air, which have implications for weather conditions, comfort levels, and health.
Precipitation: Data on rainfall, snowfall, and total snow depth, essential for understanding water cycle dynamics, agricultural planning, and urban water management in these areas.
Wind Measurements: May include minimum, average, and maximum speeds and perhaps prevailing directions, useful in sectors like aviation, construction, and event planning.
Pressure and Tendency: Barometric pressure readings at different measurement standards to help predict weather changes.
Radiation and Cloud Cover: D...
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically