53 datasets found
  1. Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-Hui Zhou; Ehsan Saghapour
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, use of the most popular imputation methods mainly requires scripting skills, and they are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often considered as a separate exercise from exploratory data analysis, but should be considered as part of the data exploration process. We have created a new graphical tool, ImputEHR, that is based on Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.

  2. Weather DataSet

    • kaggle.com
    Updated Jun 2, 2023
    Cite
    Vikram Kathare (2023). Weather DataSet [Dataset]. https://www.kaggle.com/datasets/vikramkathare/weather-dataset/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vikram Kathare
    Description

    This dataset accompanies a hands-on data analysis project in Python: a set of questions is given and then solved with the help of Python. You can think of it as a project in Data Analysis with Python or, equally, Data Science with Python.

    The commands used in this project are listed below (a short usage sketch follows the list):

    • head() - Shows the first N rows of the data (by default, N=5).
    • shape - Shows the total number of rows and columns of the dataframe.
    • index - This attribute provides the index of the dataframe.
    • columns - Shows the name of each column.
    • dtypes - Shows the data type of each column.
    • unique() - Shows all the unique values in a column. It can be applied on a single column only, not on the whole dataframe.
    • nunique() - Shows the total number of unique values in each column. It can be applied on a single column as well as on the whole dataframe.
    • count() - Shows the total number of non-null values in each column. It can be applied on a single column as well as on the whole dataframe.
    • value_counts() - Shows all the unique values in a column with their counts. It can be applied on a single column only.
    • info() - Provides basic information about the dataframe.
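
    As a quick illustration, here is a minimal pandas sketch exercising most of these commands. The CSV file name and the column name 'Weather' are assumptions based on the questions below; adjust them to the actual dataset.

    import pandas as pd

    # Load the weather data (file name is an assumption)
    df = pd.read_csv("Weather Data.csv")

    print(df.shape)                       # number of rows and columns
    print(df.dtypes)                      # data type of each column
    print(df.head())                      # first 5 rows
    print(df["Weather"].unique())         # unique values of one column
    print(df["Weather"].value_counts())   # unique values with their counts
    print(df.nunique())                   # unique-value counts for every column
    df.info()                             # summary: columns, dtypes, non-null counts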

    Challenges for this DataSet:

    Q. 1) Find all the unique 'Wind Speed' values in the data.
    Q. 2) Find the number of times when the 'Weather is exactly Clear'.
    Q. 3) Find the number of times when the 'Wind Speed was exactly 4 km/h'.
    Q. 4) Find out all the Null Values in the data.
    Q. 5) Rename the column name 'Weather' of the dataframe to 'Weather Condition'.
    Q. 6) What is the mean 'Visibility'?
    Q. 7) What is the Standard Deviation of 'Pressure' in this data?
    Q. 8) What is the Variance of 'Relative Humidity' in this data?
    Q. 9) Find all instances when 'Snow' was recorded.
    Q. 10) Find all instances when 'Wind Speed is above 24' and 'Visibility is 25'.
    Q. 11) What is the Mean value of each column against each 'Weather Condition'?
    Q. 12) What is the Minimum & Maximum value of each column against each 'Weather Condition'?
    Q. 13) Show all the Records where Weather Condition is Fog.
    Q. 14) Find all instances when 'Weather is Clear' or 'Visibility is above 40'.
    Q. 15) Find all instances when: A. 'Weather is Clear' and 'Relative Humidity is greater than 50', or B. 'Visibility is above 40'.

  3. Global Freelancers (Raw) Dataset

    • kaggle.com
    Updated Jul 5, 2025
    Cite
    Urvish Ahir (2025). Global Freelancers (Raw) Dataset [Dataset]. https://www.kaggle.com/datasets/urvishahir/global-freelancers-raw-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Urvish Ahir
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    This dataset contains 1,000 fictional freelancer profiles from around the world, designed to reflect realistic variability and messiness often encountered in real-world data collection.

    • Each entry includes demographic, professional, and platform-related information such as:
    • Name, gender, age, and country
    • Primary skill and years of experience
    • Hourly rate (with mixed formatting), client rating, and satisfaction score
    • Language spoken (based on country)
    • Inconsistent and unclean values across several fields (e.g., gender, is_active, satisfaction)

    Key Features :

    • Gender-based names using Faker’s male/female name generators
    • Realistic age and experience distribution (with missing and noisy values)
    • Country-language pairs mapped using actual linguistic data
    • Messy formatting: mixed data types, missing values, inconsistent casing
    • Generated entirely in Python using the Faker library; no real data was used

    Use Cases :

    • Practicing data cleaning and preprocessing
    • Performing EDA (Exploratory Data Analysis)
    • Developing data pipelines: raw → clean → model-ready
    • Teaching feature engineering and handling real-world dirty data
    • Exercises in data validation, outlier detection, and format standardization

    File: global_freelancers_raw.csv (a short loading and cleaning sketch follows the column table below)

    | Column Name      | Description                               |
    | --------------------- | ------------------------------------------------------------------------ |
    | `freelancer_ID`    | Unique ID starting with `FL` (e.g., FL250001)              |
    | `name`        | Full name of freelancer (based on gender)                |
    | `gender`       | Gender (messy values and case inconsistency)               |
    | `age`         | Age of the freelancer (20–60, with occasional nulls/outliers)      |
    | `country`       | Country name (with random formatting/casing)               |
    | `language`      | Language spoken (mapped from country)                  |
    | `primary_skill`    | Key freelance domain (e.g., Web Dev, AI, Cybersecurity)         |
    | `years_of_experience` | Work experience in years (some missing values or odd values included)  |
    | `hourly_rate (USD)`  | Hourly rate with currency symbols or missing data            |
    | `rating`       | Rating between 1.0–5.0 (some zeros and nulls included)          |
    | `is_active`      | Active status (inconsistently represented as strings, numbers, booleans) |
    | `client_satisfaction` | Satisfaction percentage (e.g., "85%" or 85, may include NaNs)      |
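
    As an illustration of the kind of cleaning this file invites, here is a minimal pandas sketch. Column names are taken from the table above; the exact parsing rules (currency stripping, percentage handling) are assumptions.

    import numpy as np
    import pandas as pd

    df = pd.read_csv("global_freelancers_raw.csv")

    # Normalize inconsistent casing in text fields
    df["gender"] = df["gender"].astype(str).str.strip().str.lower()
    df["country"] = df["country"].astype(str).str.strip().str.title()

    # Strip currency symbols from the hourly rate and convert to float
    df["hourly_rate (USD)"] = (
        df["hourly_rate (USD)"]
        .astype(str)
        .str.replace(r"[^0-9.]", "", regex=True)
        .replace("", np.nan)
        .astype(float)
    )

    # Turn "85%" / 85 style satisfaction values into a numeric percentage
    df["client_satisfaction"] = (
        df["client_satisfaction"].astype(str).str.rstrip("%").replace("nan", np.nan).astype(float)
    )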
    
  4. Adult dataset preprocessed

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 1, 2024
    Cite
    Schuster, Verena (2024). Adult dataset preprocessed [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12533513
    Explore at:
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Schuster, Verena
    Pustozerova, Anastasia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the UCI repository.

    The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.

    The preprocessing steps include:

    One-hot-encoding of categorical values

    Imputation of missing values using knn-imputer with k=1

    Standard scaling of ordinal attributes

    Note: we assume the scenario in which the test set is available before training (every attribute besides the target, "income"), therefore we combine the train and test sets before preprocessing.
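
    A minimal sketch of the preprocessing steps described above. The notebook "adult_preprocessing.ipynb" in this record is the authoritative reference; the raw file names and the "?" missing-value marker used here are assumptions.

    import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer
    from sklearn.preprocessing import StandardScaler

    # Combine the raw train and test splits before preprocessing, as noted above
    df = pd.concat([pd.read_csv("adult_train_raw.csv"), pd.read_csv("adult_test_raw.csv")])
    y = df.pop("income")                    # keep the target aside

    df = df.replace("?", np.nan)            # assumed missing-value marker in the raw data
    X = pd.get_dummies(df)                  # one-hot-encoding of categorical values
    X = KNNImputer(n_neighbors=1).fit_transform(X)   # imputation with a knn-imputer (k=1)
    X = StandardScaler().fit_transform(X)            # standard scaling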

  5. Snitch Clothing Sales

    • kaggle.com
    Updated Jul 23, 2025
    Cite
    NayakGanesh007 (2025). Snitch Clothing Sales [Dataset]. https://www.kaggle.com/datasets/nayakganesh007/snitch-clothing-sales
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 23, 2025
    Dataset provided by
    Kaggle
    Authors
    NayakGanesh007
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🧥 Snitch Fashion Sales (Uncleaned) Dataset

    📌 Context

    This is a synthetic dataset representing sales transactions from Snitch, a fictional Indian clothing brand. The dataset simulates real-world retail sales data with uncleaned records, designed for learners and professionals to practice data cleaning, exploratory data analysis (EDA), and dashboard building using tools like Python, Power BI, or Excel.

    📊 What You’ll Find

    The dataset includes over 2,500 records of fashion product sales across various Indian cities. It contains common data issues such as:

    Missing values

    Incorrect date formats

    Duplicates

    Typos in categories and city names

    Unrealistic discounts and profit values

    🧾 Columns Explained

    | Column Name      | Description                                                |
    | ---------------- | ---------------------------------------------------------- |
    | Order_ID         | Unique ID for each sale (some duplicates)                   |
    | Customer_Name    | Name of the customer (inconsistent formatting)              |
    | Product_Category | Clothing category (e.g., T-Shirts, Jeans; includes typos)   |
    | Product_Name     | Specific product sold                                       |
    | Units_Sold       | Quantity sold (some negative or null)                       |
    | Unit_Price       | Price per unit (some missing or zero)                       |
    | Discount_%       | Discount applied (some >100% or missing)                    |
    | Sales_Amount     | Total revenue after discount (some miscalculations)         |
    | Order_Date       | Order date (multiple formats or missing)                    |
    | City             | Indian city (includes typos like "Hyd", "bengaluru")        |
    | Segment          | Market segment (B2C, B2B, or missing)                       |
    | Profit           | Profit made on the sale (some unrealistic/negative)         |

    💡 How to Use This Dataset

    Clean and standardize messy data

    Convert dates and correct formats

    Perform EDA to find:

    Top-selling categories

    Impact of discounts on sales and profits

    Monthly/quarterly trends

    Segment-based performance

    Create dashboards in Power BI or Excel Pivot Table

    Document findings in a PDF/Markdown report

    🎯 Ideal For

    Aspiring data analysts and data scientists

    Excel / Power BI dashboard learners

    Portfolio project creators

    Kaggle competitions or practice

    📌 License

    This is a synthetic dataset created for educational use only. No real customer or business data is included.

  6. Geographical distribution and climate data of Cycas taiwaniana

    • scidb.cn
    Updated Jan 3, 2025
    Cite
    CHUNPING XIE (2025). Geographical distribution and climate data of Cycas taiwaniana [Dataset]. http://doi.org/10.57760/sciencedb.19432
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 3, 2025
    Dataset provided by
    Science Data Bank
    Authors
    CHUNPING XIE
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Description: Geographical Distribution and Climate Data of Cycas taiwaniana (Taiwanese Cycad)

    This dataset contains the geographical distribution and climate data for Cycas taiwaniana, focusing on its presence across regions in Fujian, Guangdong, and Hainan provinces of China. The dataset includes geographical coordinates (longitude and latitude), monthly climate data (minimum and maximum temperature, and precipitation) across different months, as well as bioclimatic variables based on the WorldClim dataset.

    Temporal and Spatial Information
    The data covers long-term climate information, with monthly data for each location recorded over a 12-month period (January to December). The dataset includes spatial data in terms of longitude and latitude, corresponding to various locations where Cycas taiwaniana populations are present. The spatial resolution is specific to each point location, and the temporal resolution reflects the monthly climate data for each year.

    Data Structure and Units
    The dataset consists of 36 records, each representing a unique location with corresponding climate and geographical data. The table includes the following columns:
    1. No.: Unique identifier for each data record
    2. Longitude: Geographic longitude in decimal degrees
    3. Latitude: Geographic latitude in decimal degrees
    4. tmin1 to tmin12: Minimum temperature (°C) for each month (January to December)
    5. tmax1 to tmax12: Maximum temperature (°C) for each month (January to December)
    6. prec1 to prec12: Precipitation (mm) for each month (January to December)
    7. bio1 to bio19: Bioclimatic variables (e.g., annual mean temperature, temperature seasonality, precipitation, etc.) derived from WorldClim data (unit varies depending on the variable)

    The units for each measurement are as follows:
    • Temperature: Degrees Celsius (°C)
    • Precipitation: Millimeters (mm)
    • Bioclimatic variables: Varies depending on the specific variable (e.g., °C, mm)

    Data Gaps and Missing Values
    The dataset contains some missing values, particularly in the precipitation columns for certain months and locations. These missing values may result from gaps in climate station data or limitations in data collection for specific regions. Missing values are indicated as "NA" (Not Available) in the dataset. In cases where data gaps exist, estimations were not made, and the absence of the data is acknowledged in the record.

    File Format and Software Compatibility
    The dataset is provided in CSV format for ease of use and compatibility with various data analysis tools. It can be opened and processed using software such as Microsoft Excel, R, or Python (with Pandas). Users can download the dataset and work with it in software such as R (https://cran.r-project.org/) or Python (https://www.python.org/). The dataset is compatible with any software that supports CSV files.

    This dataset provides valuable information for research related to the geographical distribution and climate preferences of Cycas taiwaniana and can be used to inform conservation strategies, ecological studies, and climate change modeling.
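
    Since the record is a plain CSV with "NA" marking missing values, a minimal Python loading sketch (the file name used here is an assumption):

    import pandas as pd

    # Load the distribution/climate table; "NA" entries become NaN
    df = pd.read_csv("cycas_taiwaniana_climate.csv", na_values="NA")

    # Monthly precipitation columns are prec1..prec12, as described above
    prec_cols = [f"prec{m}" for m in range(1, 13)]
    print(df[prec_cols].isna().sum())   # count of missing precipitation values per month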

  7. Li-ion Battery Aging Dataset

    • kaggle.com
    Updated May 12, 2024
    Cite
    GIRITHARAN MANI (2024). Li-ion Battery Aging Dataset [Dataset]. https://www.kaggle.com/datasets/mystifoe77/nasa-battery-data-cleaned/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 12, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    GIRITHARAN MANI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Overview

    This dataset provides a comprehensive view of the aging process of lithium-ion batteries, facilitating the estimation of their Remaining Useful Life (RUL). Originally sourced from NASA's open repository, the dataset has undergone meticulous preprocessing to enhance its analytical utility. The data is presented in a user-friendly CSV format after extracting relevant features from the original .mat files.

    Key Features of the Dataset

    1. Battery Performance Metrics:

      • Capacity: Measured over time to assess degradation.
      • Internal Resistance (Re): Represents the electrical resistance of the battery.
      • Charge Transfer Resistance (Rct): Indicates charge movement efficiency.
    2. Environmental Conditions:

      • Ambient Temperature: External temperature affecting battery performance.
    3. Identification Attributes:

      • Battery ID: Unique identifier for each battery tested.
      • Test ID: Links specific test conditions to outcomes.
      • UID & Filename: Traceable dataset references.
    4. Processed Data:

      • Missing values have been addressed.
      • Columns irrelevant to RUL estimation have been removed.
      • Skewness in the data has been corrected for statistical accuracy.
    5. Labels:

      • Degradation States: Categorized into intervals for easier interpretation.
      • Ranges include operational and failure states.

    Potential Applications

    1. Battery Health Monitoring:

      • Predict battery failure timelines.
      • Enhance battery maintenance strategies.
    2. Data Science and Machine Learning:

      • Model development for RUL prediction.
      • Feature engineering for predictive analysis.
    3. Research and Development:

      • Improve battery design.
      • Study the impact of environmental and operational conditions on battery life.

    Technical Details

    • File Format: CSV
    • Size: ~625.02 kB
    • Columns: 9
    • Data Points: Multiple observations across various tests.

    Tags

    • Keywords: Lithium-ion batteries, RUL, Battery Aging, Machine Learning, Data Analysis, Predictive Maintenance.

    License

    • Apache 2.0: Permits academic and commercial use.

    Usage Instructions

    1. Import the dataset into your data analysis tools (e.g., Python, R, MATLAB); a short Python sketch follows this list.
    2. Explore features to understand correlations and dependencies.
    3. Use machine learning models for RUL prediction.
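
    A minimal Python sketch for step 1 (the CSV file name is an assumption; adjust it to the downloaded file):

    import pandas as pd

    # Load the preprocessed battery aging data
    df = pd.read_csv("nasa_battery_data_cleaned.csv")

    # Inspect the columns described above (capacity, Re, Rct, ambient temperature, IDs, labels)
    df.info()
    print(df.describe())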

    Provenance

    The dataset was retrieved from NASA's publicly available data repositories. It has been preprocessed to align with research and industrial standards for usability in analytical tasks.

    Call to Action

    Leverage this dataset to enhance your understanding of lithium-ion battery degradation and build models that could revolutionize energy storage solutions.

  8. titanic

    • tensorflow.org
    Updated Feb 12, 2023
    Cite
    (2023). titanic [Dataset]. https://www.tensorflow.org/datasets/catalog/titanic
    Explore at:
    Dataset updated
    Feb 12, 2023
    Description

    Dataset describing the survival status of individual passengers on the Titanic. Missing values in the original dataset are represented using ?. Float and int missing values are replaced with -1, string missing values are replaced with 'Unknown'.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('titanic', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  9. Data from: A comprehensive dataset for the accelerated development and...

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Jan 24, 2020
    Cite
    Larson, David (2020). A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2826938
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Coimbra, Carlos
    Larson, David
    Carreira Pedro, Hugo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal with this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample codes of baseline models for benchmarking of more elaborated models.

    Data usage The usage of the datasets and sample codes presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494

    Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.

    Sample code As part of the data release, we are also including the sample code written in Python 3. The preprocessed data used in the scripts are also provided. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for Machine Learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip.

    Units

    All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.

    Missing data

    The string "NAN" indicates missing data.

    File formats

    All time series data files are in CSV (comma separated values) format. Images are given in tar.bz2 files.

    Files

    • Folsom_irradiance.csv (Primary): One-minute GHI, DNI, and DHI data.
    • Folsom_weather.csv (Primary): One-minute weather data.
    • Folsom_sky_images_{YEAR}.tar.bz2 (Primary): Tar archives with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016, compressed with bz2.
    • Folsom_NAM_lat{LAT}_lon{LON}.csv (Primary): NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node’s coordinates listed in Table I in the paper.
    • Folsom_sky_image_features.csv (Secondary): Features derived from the sky images.
    • Folsom_satellite.csv (Secondary): 10 pixel by 10 pixel GOES-15 images centered in the target location.
    • Irradiance_features_{horizon}.csv (Secondary): Irradiance features for the different forecasting horizons ({horizon} = {intra-hour, intra-day, day-ahead}).
    • Sky_image_features_intra-hour.csv (Secondary): Sky image features for the intra-hour forecasting issuing times.
    • Sat_image_features_intra-day.csv (Secondary): Satellite image features for the intra-day forecasting issuing times.
    • NAM_nearest_node_day-ahead.csv (Secondary): NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the nearest node to the target location prepared for day-ahead forecasting.
    • Target_{horizon}.csv (Secondary): Target data for the different forecasting horizons.
    • Forecast_{horizon}.py (Code): Python script used to create the forecasts for the different horizons.
    • Postprocess.py (Code): Python script used to compute the error metric for all the forecasts.
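
    A minimal sketch for loading one of the primary time series in Python (the sample code included with the record is the authoritative reference; the assumption here is that the first CSV column holds the UTC time stamp):

    import pandas as pd

    # Load the 1-min irradiance data; "NAN" marks missing values
    irr = pd.read_csv("Folsom_irradiance.csv", na_values="NAN",
                      parse_dates=[0], index_col=0)
    print(irr.head())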

  10. Shopping Mall Customer Data Segmentation Analysis

    • kaggle.com
    Updated Aug 4, 2024
    Cite
    DataZng (2024). Shopping Mall Customer Data Segmentation Analysis [Dataset]. https://www.kaggle.com/datasets/datazng/shopping-mall-customer-data-segmentation-analysis/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 4, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DataZng
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Demographic Analysis of Shopping Behavior: Insights and Recommendations

    Dataset Information: The Shopping Mall Customer Segmentation Dataset comprises 15,079 unique entries, featuring Customer ID, age, gender, annual income, and spending score. This dataset assists in understanding customer behavior for strategic marketing planning.

    Cleaned Data Details: The data has been cleaned and standardized, leaving 15,079 unique entries with attributes including Customer ID, age, gender, annual income, and spending score. It can be used by marketing analysts to develop better mall-specific marketing strategies.

    Challenges Faced: 1. Data Cleaning: Overcoming inconsistencies and missing values required meticulous attention. 2. Statistical Analysis: Interpreting demographic data accurately demanded collaborative effort. 3. Visualization: Crafting informative visuals to convey insights effectively posed design challenges.

    Research Topics: 1. Consumer Behavior Analysis: Exploring psychological factors driving purchasing decisions. 2. Market Segmentation Strategies: Investigating effective targeting based on demographic characteristics.

    Suggestions for Project Expansion: 1. Incorporate External Data: Integrate social media analytics or geographic data to enrich customer insights. 2. Advanced Analytics Techniques: Explore advanced statistical methods and machine learning algorithms for deeper analysis. 3. Real-Time Monitoring: Develop tools for agile decision-making through continuous customer behavior tracking. This summary outlines the demographic analysis of shopping behavior, highlighting key insights, dataset characteristics, team contributions, challenges, research topics, and suggestions for project expansion. Leveraging these insights can enhance marketing strategies and drive business growth in the retail sector.

    References OpenAI. (2022). ChatGPT [Computer software]. Retrieved from https://openai.com/chatgpt. Mustafa, Z. (2022). Shopping Mall Customer Segmentation Data [Data set]. Kaggle. Retrieved from https://www.kaggle.com/datasets/zubairmustafa/shopping-mall-customer-segmentation-data Donkeys. (n.d.). Kaggle Python API [Jupyter Notebook]. Kaggle. Retrieved from https://www.kaggle.com/code/donkeys/kaggle-python-api/notebook Pandas-Datareader. (n.d.). Retrieved from https://pypi.org/project/pandas-datareader/

  11. machine learning models on the WDBC dataset

    • scidb.cn
    Updated Apr 15, 2025
    Cite
    Mahdi Aghaziarati (2025). machine learning models on the WDBC dataset [Dataset]. http://doi.org/10.57760/sciencedb.23537
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Mahdi Aghaziarati
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension.

    All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling.

    The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability.

    All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically-relevant dataset.
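
    The record ships its own script (breast_cancer_classification_models.py); the following is only a minimal sketch of the pipeline described above, with file and column names assumed.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import classification_report

    # Load the WDBC data; drop the identifier and label-encode the diagnosis
    df = pd.read_csv("wdbc.csv").drop(columns=["id"])
    y = df.pop("diagnosis").map({"M": 1, "B": 0})   # malignant=1, benign=0

    # Stratified 80/20 split, then z-score standardization
    X_train, X_test, y_train, y_test = train_test_split(
        df, y, test_size=0.2, stratify=y, random_state=42)
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    # One of the four lightweight classifiers evaluated in the study
    model = KNeighborsClassifier().fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))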

  12. ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture...

    • researchdata.tuwien.ac.at
    • b2find.eudat.eu
    zip
    Updated Jun 6, 2025
    + more versions
    Cite
    Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo (2025). ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/3fcxr-cde10
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 6, 2025
    Dataset provided by
    TU Wien
    Authors
    Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

    This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

    Dataset paper (public preprint)

    A description of this dataset, including the methodology and validation results, is available at:

    Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.

    Abstract

    ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
    However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
    Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such gap-filling method is to rely only on the original observational record, without need for ancillary variable or model-based information. Due to the intrinsic challenge, there was until present no global, long-term univariate gap-filled product available. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gapfilling performance.

    Summary

    • Gap-filled global estimates of volumetric surface soil moisture from 1991-2023 at 0.25° sampling
    • Fields of application (partial): climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, and meteorology
    • Method: Modified version of DCT-PLS (Garcia, 2010) interpolation/smoothing algorithm, linear interpolation over periods of frozen soils. Uncertainty estimates are provided for all data points.
    • More information: See Preimesberger et al. (2025) and the ESA CCI SM Algorithm Theoretical Baseline Document [Chapter 7.2.9] (Dorigo et al., 2023), https://doi.org/10.5281/zenodo.8320869

    Programmatic Download

    You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following command will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.

    #!/bin/bash

    # Set download directory
    DOWNLOAD_DIR=~/Downloads

    base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

    # Loop through years 1991 to 2023 and download & extract data
    for year in {1991..2023}; do
        echo "Downloading $year.zip..."
        wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
        unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
        rm "$DOWNLOAD_DIR/$year.zip"
    done

    Data details

    The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:

    ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc

    Data Variables

    Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

    • sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).
    • sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of the original satellite observations and of the predictions used to fill observation data gaps.
    • sm_anomaly: Soil moisture anomalies (reference period 1991-2020) derived from the gap-filled values (`sm`)
    • sm_smoothed: Contains DCT-PLS predictions used to fill data gaps in the original soil moisture field. These values are also provided for cases where an observation was initially available (compare `gapmask`). In this case, they provide a smoothed version of the original data.
    • gapmask: (0 | 1) Indicates grid cells where a satellite observation is available (1), and where the interpolated (smoothed) values are used instead (0) in the 'sm' field.
    • frozenmask: (0 | 1) Indicates grid cells where ERA5 soil temperature is <0 °C. In this case, a linear interpolation over time is applied.

    Additional information for each variable is given in the netCDF attributes.

    Version Changelog

    Changes in v9.1r1 (previous version was v09.1):

    • This version uses a novel uncertainty estimation scheme as described in Preimesberger et al. (2025).

    Software to open netCDF files

    These data can be read by any software that supports Climate and Forecast (CF) conformant metadata standards for netCDF files.
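
    For example, a minimal sketch using the Python xarray library (one such CF-aware tool; the tool choice and the example date are illustrative, not a recommendation from the record):

    import xarray as xr

    # Open one daily file (name follows the convention given above) and read the gap-filled field
    ds = xr.open_dataset("ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200101000000-fv09.1r1.nc")
    sm = ds["sm"]   # daily average volumetric surface soil moisture (m3/m3)
    print(sm.sel(lat=48.2, lon=16.4, method="nearest").values)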

    References

    • Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.
    • Dorigo, W., Preimesberger, W., Stradiotti, P., Kidd, R., van der Schalie, R., van der Vliet, M., Rodriguez-Fernandez, N., Madelon, R., & Baghdadi, N. (2023). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 08.1 (version 1.1). Zenodo. https://doi.org/10.5281/zenodo.8320869
    • Garcia, D., 2010. Robust smoothing of gridded data in one and higher dimensions with missing values. Computational Statistics & Data Analysis, 54(4), pp.1167-1178. Available at: https://doi.org/10.1016/j.csda.2009.09.020
    • Rodell, M., Houser, P. R., Jambor, U., Gottschalck, J., Mitchell, K., Meng, C.-J., Arsenault, K., Cosgrove, B., Radakovich, J., Bosilovich, M., Entin, J. K., Walker, J. P., Lohmann, D., and Toll, D.: The Global Land Data Assimilation System, Bulletin of the American Meteorological Society, 85, 381 – 394, https://doi.org/10.1175/BAMS-85-3-381, 2004.

    Related Records

    The following records are all part of the Soil Moisture Climate Data Records from satellites community

    1. ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77

  13. Temperature and Humidity Time Series of Cold Storage Room Monitoring

    • zenodo.org
    bin, csv, png, zip
    Updated Jun 30, 2025
    Cite
    Elia Henrichs; Florian Stoll; Christian Krupitzer (2025). Temperature and Humidity Time Series of Cold Storage Room Monitoring [Dataset]. http://doi.org/10.5281/zenodo.15130001
    Explore at:
    Available download formats: png, bin, zip, csv
    Dataset updated
    Jun 30, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Elia Henrichs; Florian Stoll; Christian Krupitzer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets contain the raw data and preprocessed data (following the steps in the Jupyter Notebook) of 9 DHT22 sensors in a cold storage room. Details on how the data was gathered can be found in the publication "Self-Adaptive Integration of Distributed Sensor Systems for Monitoring Cold Storage Environments" by Elia Henrichs, Florian Stoll, and Christian Krupitzer.

    This dataset consists of the following files:

    • Raw.zip - The raw data CSV files of the nine Arduino-based data loggers, containing the semicolon-separated columns date (formatted as dd.mm.yyyy), time (formatted as HH:MM:SS), temperature, and humidity. These files can contain multiple headers.
    • Preprocessed.zip - The preprocessed data CSV files of the nine Arduino-based data loggers, containing the semicolon-separated columns date (formatted as dd.mm.yyyy), time (formatted as HH:MM:SS), temperature, and humidity. Multiple headers were removed, and the length of the datasets was aligned to equal length by filling missing values with NaN (a loading sketch follows this list).
    • DataPreprocessing.ipynb - Jupyter Notebook containing the code to preprocess the data and create the overview file, which summarizes key characteristics of the dataset.
    • DataPreliminaryAnalysis.ipynb - Jupyter Notebook containing the code to perform the preliminary data analysis (general statistics, peaks, and matrix profiles).
    • experiment_actions.csv - CSV file logging performed actions (door openings and sensor movements).
    • overview.csv - CSV file summarizing key characteristics of the dataset and preliminary data analysis.
    • temphum_logger.ino - Source code to run the Arduino-based data logger with a sampling rate of 5 sec.
    • Arduino_setup_sketch_v1.png - Circuit diagram of the Arduino-based data logger.
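
    A minimal pandas sketch for loading one of the preprocessed logger files (the file name inside the archive is an assumption; the column names and formats are those listed above):

    import pandas as pd

    # Columns are semicolon-separated: date (dd.mm.yyyy), time (HH:MM:SS), temperature, humidity
    df = pd.read_csv("sensor1.csv", sep=";")
    df["timestamp"] = pd.to_datetime(df["date"].astype(str) + " " + df["time"].astype(str),
                                     format="%d.%m.%Y %H:%M:%S", errors="coerce")
    print(df.set_index("timestamp")[["temperature", "humidity"]].describe())
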
  14. Science Education Research Topic Modeling Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, html +2
    Updated Oct 9, 2024
    Cite
    Tor Ole B. Odden; Alessandro Marin; John L. Rudolph (2024). Science Education Research Topic Modeling Dataset [Dataset]. http://doi.org/10.5281/zenodo.4094974
    Explore at:
    Available download formats: bin, txt, html, text/x-python
    Dataset updated
    Oct 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tor Ole B. Odden; Alessandro Marin; John L. Rudolph
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.

    The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:

    • We removed duplicated text from each article: prior to 1969, articles in the journal were published in a magazine format in which the end of one article and the beginning of the next would share the same page, so we developed an automated detection of article beginnings and endings that was able to remove any duplicate text.
    • We removed the reference sections of the articles, as well as headings (in all caps) such as “ABSTRACT”.
    • We reunited any partial words that were separated due to line breaks, text recognition issues, or British vs. American spellings (for example converting “per cent” to “percent”)
    • We removed all numbers, symbols, special characters, and punctuation, and lowercased all words.
    • We removed all stop words, which are words without any semantic meaning on their own—“the”, “in,” “if”, “and”, “but”, etc.—and all single-letter words.
    • We lemmatized all words, with the added step of including a part-of-speech tagger so our algorithm would only aggregate and lemmatize words from the same part of speech (e.g., nouns vs. verbs).
    • We detected and created bi-grams, sets of words that frequently co-occur and carry additional meaning together. These words were combined with an underscore: for example, “problem_solving” and “high_school”.

    After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.
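
    A minimal topic-modeling sketch over this file (the bundled notebook is the authoritative analysis; the use of gensim and the topic count here are assumptions):

    import pickle
    from gensim import corpora, models

    # Load the tokenized articles: a list of token lists, as described above
    with open("scied_words_bigrams_V5.pkl", "rb") as f:
        docs = pickle.load(f)

    # Build a bag-of-words corpus and fit an LDA model
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, random_state=0)

    for topic_id, words in lda.show_topics(num_topics=5, num_words=8, formatted=False):
        print(topic_id, [w for w, _ in words])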

    In addition to this file, we have also included the following files:

    1. SciEd_paper_names_weights.pkl: A file containing limited metadata (title, author, year published, and DOI) for each of the papers, in the same order as they appear within the main datafile. This file also includes the weights assigned by an LDA model used to analyze the data
    2. Science Education LDA Notebook.ipynb: A notebook file that replicates our LDA analysis, with a written explanation of all of the steps and suggestions on how to explore the results.
    3. Supporting files for the notebook. These include the requirements, the README, a helper script with functions for plotting that were too long to include in the notebook, and two HTML graphs that are embedded into the notebook.

    This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.

  15. Python code for the estimation of missing prices in real-estate market with...

    • data.mendeley.com
    Updated Dec 12, 2017
    + more versions
    Cite
    Iván García-Magariño (2017). Python code for the estimation of missing prices in real-estate market with a dataset of house prices from Teruel city [Dataset]. http://doi.org/10.17632/mxpgf54czz.2
    Explore at:
    Dataset updated
    Dec 12, 2017
    Authors
    Iván García-Magariño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Teruel
    Description

    This research data file contains the necessary software and the dataset for estimating the missing prices of house units. This approach combines several machine learning techniques (linear regression, support vector regression, the k-nearest neighbors and a multi-layer perceptron neural network) with several dimensionality reduction techniques (non-negative factorization, recursive feature elimination and feature selection with a variance threshold). It includes the input dataset formed with the available house prices in two neighborhoods of Teruel city (Spain) on November 13, 2017 from the Idealista website. These two neighborhoods are the center of the city and “Ensanche”.

    This dataset supports the research of the authors in the improvement of the setup of agent-based simulations about real-estate market. The work about this dataset has been submitted for consideration for publication to a scientific journal.

    The open source python code is composed of all the files with the “.py” extension. The main program can be executed from the “main.py” file. The “boxplotErrors.eps” is a chart generated from the execution of the code, and compares the results of the different combinations of machine learning techniques and dimensionality reduction methods.

    The dataset is in the “data” folder. The input raw data of the house prices are in the “dataRaw.csv” file. These were shuffled into the “dataShuffled.csv” file. We used cross-validation to obtain the estimations of house prices. The outputted estimations alongside the real values are stored in different files of the “data” folder, in which each filename is composed by the machine learning technique abbreviation and the dimensionality reduction method abbreviation.

  16. NA-CORDEX Cloud-Optimized Dataset

    • data.ucar.edu
    zarr
    Updated Sep 10, 2023
    Cite
    Banihirwe, Anderson; Bonnlander, Brian; McGinnis, Seth; Nienhouse, Eric; de La Beaujardiere, Jeff (2023). NA-CORDEX Cloud-Optimized Dataset [Dataset]. http://doi.org/10.26024/9xkm-fp81
    Explore at:
    Available download formats: zarr
    Dataset updated
    Sep 10, 2023
    Dataset provided by
    University Corporation for Atmospheric Research
    Authors
    Banihirwe, Anderson; Bonnlander, Brian; McGinnis, Seth; Nienhouse, Eric; de La Beaujardiere, Jeff
    Time period covered
    Jan 1, 1951 - Dec 31, 2099
    Area covered
    Description

    The NA-CORDEX dataset contains output from high-resolution regional climate models run over North America using boundary conditions from global simulations in the CMIP5 archive. The subset of the NA-CORDEX data on AWS (data volume ~15 TB) includes daily data from 1950-2100 for impacts-relevant variables on a 0.25 degree or 0.50 degree common lat-lon grid. This data is freely available on AWS S3 thanks to the AWS Open Data Sponsorship Program and the Amazon Sustainability Data Initiative, which provide free storage and egress. The data on AWS is stored in Zarr format. This format supports the same data model as netCDF and is well suited to object storage and distributed computing in the cloud using the Pangeo libraries in Python. An Intake-ESM Catalog listing all available data can be found at: [https://ncar-na-cordex.s3-us-west-2.amazonaws.com/catalogs/aws-na-cordex.json] The full dataset (data volume ~35 TB) can be accessed for download or via web services on the NCAR Climate Data Gateway. [https://www.earthsystemgrid.org/search/cordexsearch.html]
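
    A minimal access sketch, assuming the intake-esm, xarray, and zarr Python packages from the Pangeo stack mentioned above (the search field used here is an assumption; consult the catalog itself for the actual column names):

    import intake

    # Open the Intake-ESM catalog listed above
    cat = intake.open_esm_datastore(
        "https://ncar-na-cordex.s3-us-west-2.amazonaws.com/catalogs/aws-na-cordex.json")

    # Narrow the catalog and load the matching Zarr stores lazily as xarray datasets
    # (anonymous S3 access may need to be configured in some environments)
    subset = cat.search(variable="tmax")
    dsets = subset.to_dataset_dict()
    print(list(dsets))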

  17. Additional file 9 of The automatic detection of diabetic kidney disease from...

    • springernature.figshare.com
    zip
    Updated Aug 16, 2024
    Cite
    Shaomin Shi; Ling Gao; Juan Zhang; Baifang Zhang; Jing Xiao; Wan Xu; Yuan Tian; Lihua Ni; Xiaoyan Wu (2024). Additional file 9 of The automatic detection of diabetic kidney disease from retinal vascular parameters combined with clinical variables using artificial intelligence in type-2 diabetes patients [Dataset]. http://doi.org/10.6084/m9.figshare.26634081.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 16, 2024
    Dataset provided by
    figshare
    Authors
    Shaomin Shi; Ling Gao; Juan Zhang; Baifang Zhang; Jing Xiao; Wan Xu; Yuan Tian; Lihua Ni; Xiaoyan Wu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 9. Supplementary Python source codes.

    • Code1: Python code for the model using RF classifier with SMOTE correction for data set imbalance.
    • Code2: Python code for the model using SVM classifier with SMOTE correction for data set imbalance.
    • Code3: Python code for the model using BDT classifier with SMOTE correction for data set imbalance.
    • Code4: Python code for the model using Ada classifier with SMOTE correction for data set imbalance.
    • Code5: Python code for the model using RF classifier with Random oversampling correction for data set imbalance.
    • Code6: Python code for the model using SVM classifier with Random oversampling correction for data set imbalance.
    • Code7: Python code for the model using BDT classifier with Random oversampling correction for data set imbalance.
    • Code8: Python code for the model using Ada classifier with Random oversampling correction for data set imbalance.
    • Code9: Python code for the model using RF classifier with no correction for data set imbalance.
    • Code10: Python code for the model using SVM classifier with no correction for data set imbalance.
    • Code11: Python code for the model using BDT classifier with no correction for data set imbalance.
    • Code12: Python code for the model using Ada classifier with no correction for data set imbalance.
    • Code13: Python code for the ROC curves of models with SMOTE correction for data set imbalance.
    • Code14: Python code for the ROC curves of models with Random oversampling correction for data set imbalance.
    • Code15: Python code for the ROC curves of models with no correction for data set imbalance.
    • Code16: Python code for the model using RF classifier with SMOTE correction for data set imbalance, and imputing the missing data by the method of backfilling missing values.
    • Code17: Python code for the model using RF classifier with SMOTE correction for data set imbalance, and imputing the missing data by means.
    • Code18: Python code for tuning of the model using RF classifier with SMOTE correction for data set imbalance.
    • Code19: Python code for calculating the standard deviations.

  18. LimnoSat-US: A Remote Sensing Dataset for U.S. Lakes from 1984-2020

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Oct 29, 2020
    Cite
    Xiao Yang (2020). LimnoSat-US: A Remote Sensing Dataset for U.S. Lakes from 1984-2020 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4139694
    Explore at:
    Dataset updated
    Oct 29, 2020
    Dataset provided by
    Xiao Yang
    John Gardner
    Matthew R.V. Ross
    Simon Topp
    Tamlin Pavelsky
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    LimnoSat-US is an analysis-ready remote sensing database that includes reflectance values spanning 36 years for 56,792 lakes across > 328,000 Landsat scenes. The database comes pre-processed with cross-sensor standardization and the effects of clouds, cloud shadows, snow, ice, and macrophytes removed. In total, it contains over 22 million individual lake observations with an average of 393 +/- 233 (mean +/- standard deviation) observations per lake over the 36 year period. The data and code contained within this repository are as follows:

    HydroLakes_DP.shp: A shapefile containing the deepest points for all U.S. lakes within HydroLakes. For more information on the deepest point see https://doi.org/10.5281/zenodo.4136754 and Shen et al. (2015).

    LakeExport.py: Python code to extract reflectance values for U.S. lakes from Google Earth Engine.

    GEE_pull_functions.py: Functions called within LakeExport.py

    01_LakeExtractor.Rmd: An R Markdown file that takes the raw data from LakeExport.py and processes it for the final database.

    SceneMetadata.csv: A file containing additional information such as scene cloud cover and sun angle for all Landsat scenes within the database. Can be joined to the final database using LandsatID.

    srCorrected_us_hydrolakes_dp_20200628: The final LimnoSat-US database containing all cloud-free observations of U.S. lakes from 1984-2020. Missing values for bands not shared between sensors (Aerosol and TIR2) are denoted by -99. dWL is the dominant wavelength calculated following Wang et al. (2015). pCount_dswe1 represents the number of high-confidence water pixels within 120 meters of the deepest point. pCount_dswe3 represents the number of vegetated water pixels within 120 meters and can be used as a flag for potential reflectance noise. All reflectance values represent the median value of high-confidence water pixels within 120 meters. The final database is provided in both .csv and .feather formats and can be linked to SceneMetadata.csv using LandsatID (see the loading sketch below). All reflectance values are derived from USGS T1-SR Landsat scenes.
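    As a small illustration of how the distributed files fit together, the sketch below loads the final database and joins it to the scene metadata on LandsatID. The file extension and any column names beyond LandsatID and the band placeholders are assumptions to verify against the actual release.

```python
# Hedged sketch of working with LimnoSat-US: load the final database
# (.feather or .csv) and attach per-scene metadata via LandsatID.
import pandas as pd

obs = pd.read_feather("srCorrected_us_hydrolakes_dp_20200628.feather")
meta = pd.read_csv("SceneMetadata.csv")

# -99 flags bands not shared between sensors (Aerosol, TIR2); treat as missing.
obs = obs.replace(-99, pd.NA)

# Join lake observations to scene-level info such as cloud cover and sun angle.
full = obs.merge(meta, on="LandsatID", how="left")
print(full.head())
```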

  19. Fitness Trackers Products Ecommerce ⌚

    • kaggle.com
    Updated Mar 23, 2023
    Cite
    randomarnab (2023). Fitness Trackers Products Ecommerce ⌚ [Dataset]. https://www.kaggle.com/arnabchaki/fitness-trackers-products-ecommerce/discussion
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 23, 2023
    Dataset provided by
    Kaggle
    Authors
    randomarnab
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This is a fitness tracker product dataset covering products from various brands, with their specifications, ratings, and reviews for the Indian market. The data was collected from the e-commerce websites Flipkart and Amazon using web scraping.

    Inspiration

    This dataset could be used to find answers to some interesting questions, such as:

    • Is there a significant demand for fitness trackers in the Indian market?
    • Information on the top 5 brands for fitness bands and smart watches
    • Is there a correlation between prices and product specifications, ratings, etc.?
    • Different types of fitness trackers and their price segments for different users

    This dataset contains 451 samples with 16 attributes. There are some missing values in this dataset.
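    A quick way to start on the questions above is a short pandas pass over the file. The file name and column labels used here ("fitness_trackers.csv", "Brand Name", "Selling Price", "Rating (Out of 5)") are assumptions, so check df.columns against the actual headers first.

```python
# Exploratory sketch for the fitness tracker dataset (451 rows, 16 columns).
# File and column names are hypothetical; adjust them to the real headers.
import pandas as pd

df = pd.read_csv("fitness_trackers.csv")
print(df.shape)          # expect roughly (451, 16)
print(df.isna().sum())   # locate the missing values noted above

# Question 2: top 5 brands by number of listed products.
print(df["Brand Name"].value_counts().head(5))

# Question 3: correlation between price and rating.
price = pd.to_numeric(df["Selling Price"], errors="coerce")
rating = pd.to_numeric(df["Rating (Out of 5)"], errors="coerce")
print(price.corr(rating))
```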

  20. SignalFlowEEG Example Data

    • figshare.com
    bin
    Updated Mar 15, 2024
    Cite
    Ernest Pedapati (2024). SignalFlowEEG Example Data [Dataset]. http://doi.org/10.6084/m9.figshare.25414042.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Mar 15, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ernest Pedapati
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The SignalFlowEEG Example Data dataset contains sample EEG recordings that demonstrate the capabilities and usage of the SignalFlowEEG Python package. This package provides a comprehensive set of tools for processing, analyzing, and visualizing electroencephalography (EEG) data, with a focus on neuroscience research applications.

    The example dataset includes EEG recordings from various paradigms:

    • Resting-state EEG: a 5-minute recording where the subject relaxed with eyes closed.
    • Auditory chirp stimulation: EEG recorded while the subject listened to chirp sounds with varying frequencies.
    • Visual evoked potentials: EEG recorded as the subject viewed checkerboard pattern stimuli to elicit visual responses.

    These recordings were collected at the Cincinnati Children's Hospital Medical Center and are made available for educational and testing purposes. SignalFlowEEG builds upon MNE-Python, a popular open-source library for EEG analysis, and offers additional functionality tailored for clinical research workflows. This example dataset allows users to explore SignalFlowEEG's features and gain hands-on experience analyzing EEG data with this powerful Python package.

    The dataset consists of .set files, a format used by the EEGLAB toolbox. Each file contains raw EEG data, channel info, and event markers for a specific experimental paradigm. Files can be loaded using mne.io.read_raw_eeglab() from MNE-Python, a SignalFlowEEG dependency (see the loading sketch below). The dataset has no missing data or special abbreviations. Channel names and event markers follow standard EEGLAB conventions.
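    A minimal loading sketch follows, assuming a hypothetical file name for the resting-state recording; mne.io.read_raw_eeglab() is the MNE-Python reader named in the description, and the annotation-to-events step is standard MNE usage rather than anything specific to SignalFlowEEG.

```python
# Load one of the example .set recordings with MNE-Python (a SignalFlowEEG
# dependency). The file name below is hypothetical.
import mne

raw = mne.io.read_raw_eeglab("resting_state_eyes_closed.set", preload=True)
print(raw.info)  # channel names, sampling rate, etc.

# Event markers are stored as annotations; convert them to an events array.
events, event_id = mne.events_from_annotations(raw)
print(event_id)
```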
