Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electronic health records (EHRs) have been widely adopted in recent years, but they often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Complete datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, the most popular imputation methods generally require scripting skills and are implemented across various packages and syntaxes. Thus, implementing a full suite of methods is generally out of reach for all but experienced data scientists. Moreover, imputation is often treated as an exercise separate from exploratory data analysis, when it should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, which is built on Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
In this project we work through a real data analysis exercise with Python: a set of questions is posed about the dataset, and each question is then answered using Python.
The commands used in this project:
Challenges for this dataset:
Q. 1) Find all the unique 'Wind Speed' values in the data.
Q. 2) Find the number of times when the 'Weather' is exactly 'Clear'.
Q. 3) Find the number of times when the 'Wind Speed' was exactly 4 km/h.
Q. 4) Find out all the null values in the data.
Q. 5) Rename the column 'Weather' of the dataframe to 'Weather Condition'.
Q. 6) What is the mean 'Visibility'?
Q. 7) What is the standard deviation of 'Pressure' in this data?
Q. 8) What is the variance of 'Relative Humidity' in this data?
Q. 9) Find all instances when 'Snow' was recorded.
Q. 10) Find all instances when 'Wind Speed' is above 24 and 'Visibility' is 25.
Q. 11) What is the mean value of each column against each 'Weather Condition'?
Q. 12) What is the minimum & maximum value of each column against each 'Weather Condition'?
Q. 13) Show all the records where the 'Weather Condition' is Fog.
Q. 14) Find all instances when 'Weather' is 'Clear' or 'Visibility' is above 40.
Q. 15) Find all instances when: A. 'Weather' is 'Clear' and 'Relative Humidity' is greater than 50, or B. 'Visibility' is above 40.
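For reference, a minimal pandas sketch of a few of these questions is shown below; the file name "weather.csv" and the assumption that the column names match the quoted names in the questions are both assumptions, not part of the dataset description.

import pandas as pd

df = pd.read_csv("weather.csv")

# Q1: unique 'Wind Speed' values
print(df["Wind Speed"].unique())

# Q2: number of records where Weather is exactly 'Clear'
print((df["Weather"] == "Clear").sum())

# Q4: null values per column
print(df.isnull().sum())

# Q5: rename 'Weather' to 'Weather Condition'
df = df.rename(columns={"Weather": "Weather Condition"})

# Q11: mean of each numeric column per 'Weather Condition'
print(df.groupby("Weather Condition").mean(numeric_only=True))

# Q14: Weather is Clear OR Visibility above 40
print(df[(df["Weather Condition"] == "Clear") | (df["Visibility"] > 40)])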
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 1,000 fictional freelancer profiles from around the world, designed to reflect realistic variability and messiness often encountered in real-world data collection.
- Each entry includes demographic, professional, and platform-related information such as:
- Name, gender, age, and country
- Primary skill and years of experience
- Hourly rate (with mixed formatting), client rating, and satisfaction score
- Language spoken (based on country)
- Inconsistent and unclean values across several fields (e.g., gender, is_active, satisfaction)
- Gender-based names using Faker’s male/female name generators
- Realistic age and experience distribution (with missing and noisy values)
- Country-language pairs mapped using actual linguistic data
- Messy formatting: mixed data types, missing values, inconsistent casing
- Generated entirely in Python using the Faker library; no real data is used
- Practicing data cleaning and preprocessing
- Performing EDA (Exploratory Data Analysis)
- Developing data pipelines: raw → clean → model-ready
- Teaching feature engineering and handling real-world dirty data
- Exercises in data validation, outlier detection, and format standardization
global_freelancers_raw.csv
| Column Name | Description |
| --------------------- | ------------------------------------------------------------------------ |
| `freelancer_ID` | Unique ID starting with `FL` (e.g., FL250001) |
| `name` | Full name of freelancer (based on gender) |
| `gender` | Gender (messy values and case inconsistency) |
| `age` | Age of the freelancer (20–60, with occasional nulls/outliers) |
| `country` | Country name (with random formatting/casing) |
| `language` | Language spoken (mapped from country) |
| `primary_skill` | Key freelance domain (e.g., Web Dev, AI, Cybersecurity) |
| `years_of_experience` | Work experience in years (some missing values or odd values included) |
| `hourly_rate (USD)` | Hourly rate with currency symbols or missing data |
| `rating` | Rating between 1.0–5.0 (some zeros and nulls included) |
| `is_active` | Active status (inconsistently represented as strings, numbers, booleans) |
| `client_satisfaction` | Satisfaction percentage (e.g., "85%" or 85, may include NaNs) |
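As a starting point for the cleaning and preprocessing exercises listed above, here is a minimal pandas sketch based on the column table; the specific cleaning rules (gender mapping, currency stripping, percentage parsing, truth values for is_active) are illustrative assumptions, not part of the dataset.

import numpy as np
import pandas as pd

df = pd.read_csv("global_freelancers_raw.csv")

# Normalize messy gender values ('M', 'male', 'FEMALE', ...)
df["gender"] = (
    df["gender"].astype(str).str.strip().str.lower()
      .map({"m": "male", "male": "male", "f": "female", "female": "female"})
)

# Strip currency symbols from the hourly rate and convert to float
df["hourly_rate (USD)"] = (
    df["hourly_rate (USD)"].astype(str)
      .str.replace(r"[^0-9.]", "", regex=True)
      .replace("", np.nan)
      .astype(float)
)

# Convert client_satisfaction ("85%" or 85) to a numeric percentage
df["client_satisfaction"] = (
    df["client_satisfaction"].astype(str).str.rstrip("%")
      .replace("nan", np.nan)
      .astype(float)
)

# Standardize country casing and map is_active to a boolean
df["country"] = df["country"].astype(str).str.strip().str.title()
df["is_active"] = (
    df["is_active"].astype(str).str.strip().str.lower().isin(["1", "true", "yes", "y"])
)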
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the USI repository.
The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.
The preprocessing steps include:
One-hot-encoding of categorical values
Imputation of missing values using knn-imputer with k=1
Standard scaling of ordinal attributes
Note: we assume the scenario in which the test set (every attribute besides the target, "income") is available before training; therefore, we combine the train and test sets before preprocessing.
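For orientation, a rough pandas/scikit-learn sketch of these steps is given below; the notebook "adult_preprocessing.ipynb" remains the authoritative reference, and the raw input file names used here are assumptions.

import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical raw input files with "?" marking missing values
train = pd.read_csv("adult_train_raw.csv", na_values="?")
test = pd.read_csv("adult_test_raw.csv", na_values="?")

# Combine train and test before preprocessing (test attributes assumed known)
n_train = len(train)
full = pd.concat([train, test], ignore_index=True)
income = full.pop("income")

# Standard scaling of the ordinal/numeric attributes
numeric_cols = full.select_dtypes(include="number").columns
full[numeric_cols] = StandardScaler().fit_transform(full[numeric_cols])

# One-hot-encoding of categorical attributes
full = pd.get_dummies(full, columns=full.select_dtypes(include="object").columns.tolist())

# Imputation of missing values using a KNN imputer with k=1
full = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(full), columns=full.columns)

# Split back into the preprocessed train/test files
full["income"] = income.values
full.iloc[:n_train].to_csv("adult_train.csv", index=False)
full.iloc[n_train:].to_csv("adult_test.csv", index=False)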
https://creativecommons.org/publicdomain/zero/1.0/
🧥 Snitch Fashion Sales (Uncleaned) Dataset
📌 Context
This is a synthetic dataset representing sales transactions from Snitch, a fictional Indian clothing brand. The dataset simulates real-world retail sales data with uncleaned records, designed for learners and professionals to practice data cleaning, exploratory data analysis (EDA), and dashboard building using tools like Python, Power BI, or Excel.
📊 What You'll Find
The dataset includes over 2,500 records of fashion product sales across various Indian cities. It contains common data issues such as:
Missing values
Incorrect date formats
Duplicates
Typos in categories and city names
Unrealistic discounts and profit values
🧾 Columns Explained

| Column | Description |
| ------------------ | -------------------------------------------------------- |
| `Order_ID` | Unique ID for each sale (some duplicates) |
| `Customer_Name` | Name of the customer (inconsistent formatting) |
| `Product_Category` | Clothing category (e.g., T-Shirts, Jeans; includes typos) |
| `Product_Name` | Specific product sold |
| `Units_Sold` | Quantity sold (some negative or null) |
| `Unit_Price` | Price per unit (some missing or zero) |
| `Discount_%` | Discount applied (some >100% or missing) |
| `Sales_Amount` | Total revenue after discount (some miscalculations) |
| `Order_Date` | Order date (multiple formats or missing) |
| `City` | Indian city (includes typos like "Hyd", "bengaluru") |
| `Segment` | Market segment (B2C, B2B, or missing) |
| `Profit` | Profit made on the sale (some unrealistic/negative) |
💡 How to Use This Dataset (see the sketch after this list)
Clean and standardize messy data
Convert dates and correct formats
Perform EDA to find:
Top-selling categories
Impact of discounts on sales and profits
Monthly/quarterly trends
Segment-based performance
Create dashboards in Power BI or Excel Pivot Table
Document findings in a PDF/Markdown report
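A minimal pandas sketch of the first few steps (de-duplication, date parsing, basic standardization, and simple EDA) is shown below; the file name and the exact cleaning rules are assumptions.

import pandas as pd

df = pd.read_csv("snitch_sales_raw.csv")

# Drop duplicate orders and parse mixed-format dates
df = df.drop_duplicates(subset="Order_ID")
df["Order_Date"] = pd.to_datetime(df["Order_Date"], errors="coerce", format="mixed")  # format="mixed" needs pandas >= 2.0

# Standardize city names (illustrative mapping for one known typo)
df["City"] = df["City"].str.strip().str.title().replace({"Hyd": "Hyderabad"})

# Remove impossible values
df = df[(df["Units_Sold"] > 0) & (df["Unit_Price"] > 0)]
df = df[df["Discount_%"].between(0, 100)]

# EDA: top-selling categories and monthly revenue trend
print(df.groupby("Product_Category")["Sales_Amount"].sum().sort_values(ascending=False).head())
print(df.groupby(df["Order_Date"].dt.to_period("M"))["Sales_Amount"].sum())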
🎯 Ideal For
Aspiring data analysts and data scientists
Excel / Power BI dashboard learners
Portfolio project creators
Kaggle competitions or practice
📌 License
This is a synthetic dataset created for educational use only. No real customer or business data is included.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Description: Geographical Distribution and Climate Data of Cycas taiwaniana (Taiwanese Cycad)

This dataset contains the geographical distribution and climate data for Cycas taiwaniana, focusing on its presence across regions in Fujian, Guangdong, and Hainan provinces of China. The dataset includes geographical coordinates (longitude and latitude), monthly climate data (minimum and maximum temperature, and precipitation) across different months, as well as bioclimatic variables based on the WorldClim dataset.

**Temporal and Spatial Information** The data covers long-term climate information, with monthly data for each location recorded over a 12-month period (January to December). The dataset includes spatial data in terms of longitude and latitude, corresponding to various locations where Cycas taiwaniana populations are present. The spatial resolution is specific to each point location, and the temporal resolution reflects the monthly climate data for each year.

**Data Structure and Units** The dataset consists of 36 records, each representing a unique location with corresponding climate and geographical data. The table includes the following columns:
1. No.: Unique identifier for each data record
2. Longitude: Geographic longitude in decimal degrees
3. Latitude: Geographic latitude in decimal degrees
4. tmin1 to tmin12: Minimum temperature (°C) for each month (January to December)
5. tmax1 to tmax12: Maximum temperature (°C) for each month (January to December)
6. prec1 to prec12: Precipitation (mm) for each month (January to December)
7. bio1 to bio19: Bioclimatic variables (e.g., annual mean temperature, temperature seasonality, precipitation, etc.) derived from WorldClim data (unit varies depending on the variable)

The units for each measurement are as follows:
- Temperature: Degrees Celsius (°C)
- Precipitation: Millimeters (mm)
- Bioclimatic variables: Varies depending on the specific variable (e.g., °C, mm)

**Data Gaps and Missing Values** The dataset contains some missing values, particularly in the "precipitation" columns for certain months and locations. These missing values may result from gaps in climate station data or limitations in data collection for specific regions. Missing values are indicated as "NA" (Not Available) in the dataset. In cases where data gaps exist, estimations were not made, and the absence of the data is acknowledged in the record.

**File Format and Software Compatibility** The dataset is provided in CSV format for ease of use and compatibility with various data analysis tools. It can be opened and processed using software such as Microsoft Excel, R (https://cran.r-project.org/), or Python (https://www.python.org/) with Pandas. The dataset is compatible with any software that supports CSV files.

This dataset provides valuable information for research related to the geographical distribution and climate preferences of Cycas taiwaniana and can be used to inform conservation strategies, ecological studies, and climate change modeling.
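As an illustration of working with the CSV in Python (with Pandas), the following sketch loads the table and derives a few per-record summaries; the file name is an assumption.

import pandas as pd

df = pd.read_csv("cycas_taiwaniana_climate.csv", na_values="NA")

# Monthly temperature and precipitation columns as described above
tmin_cols = [f"tmin{m}" for m in range(1, 13)]
tmax_cols = [f"tmax{m}" for m in range(1, 13)]
prec_cols = [f"prec{m}" for m in range(1, 13)]

# Simple derived quantities per record
df["annual_prec"] = df[prec_cols].sum(axis=1, min_count=12)  # NaN if any month is missing
df["mean_tmin"] = df[tmin_cols].mean(axis=1)
df["mean_tmax"] = df[tmax_cols].mean(axis=1)

print(df[["Longitude", "Latitude", "annual_prec", "mean_tmin", "mean_tmax"]].head())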
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset provides a comprehensive view of the aging process of lithium-ion batteries, facilitating the estimation of their Remaining Useful Life (RUL). Originally sourced from NASA's open repository, the dataset has undergone meticulous preprocessing to enhance its analytical utility. The data is presented in a user-friendly CSV format after extracting relevant features from the original .mat files.
Battery Performance Metrics:
Environmental Conditions:
Identification Attributes:
Processed Data:
Labels:
Battery Health Monitoring:
Data Science and Machine Learning:
Research and Development:
The dataset was retrieved from NASA's publicly available data repositories. It has been preprocessed to align with research and industrial standards for usability in analytical tasks.
Leverage this dataset to enhance your understanding of lithium-ion battery degradation and build models that could revolutionize energy storage solutions.
Dataset describing the survival status of individual passengers on the Titanic. Missing values in the original dataset are represented using ?. Float and int missing values are replaced with -1, string missing values are replaced with 'Unknown'.
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('titanic', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
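Optionally, the sentinel values described above (-1 and 'Unknown') can be converted back to NaN after loading; the following is a small sketch using tfds.as_dataframe, and the blanket replacement of -1 assumes it only marks missing entries.

import numpy as np
import tensorflow_datasets as tfds

ds = tfds.load('titanic', split='train')
df = tfds.as_dataframe(ds)

# String features come back as bytes; decode before comparing with 'Unknown'
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.decode('utf-8')

# Restore the missing-value sentinels to NaN
df = df.replace({-1: np.nan, 'Unknown': np.nan})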
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal with this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample codes of baseline models for benchmarking of more elaborated models.
Data usage The usage of the datasets and sample codes presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494
Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.
Sample code As part of the data release, we are also including the sample code written in Python 3. The preprocessed data used in the scripts are also provided. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for Machine Learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip.
Units All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.
Missing data The string "NAN" indicates missing data.
File formats All time series data files are in CSV (comma-separated values) format. Images are provided in tar.bz2 archives.
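For example, a time-series CSV can be read with pandas, treating "NAN" as missing; the name of the UTC time-stamp column used below is an assumption.

import pandas as pd

irr = pd.read_csv(
    "Folsom_irradiance.csv",
    na_values="NAN",            # "NAN" marks missing data
    parse_dates=["timeStamp"],  # hypothetical name of the UTC time-stamp column
)
irr = irr.set_index("timeStamp")
print(irr.head())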
Files
| File | Type | Description |
| --- | --- | --- |
| Folsom_irradiance.csv | Primary | One-minute GHI, DNI, and DHI data. |
| Folsom_weather.csv | Primary | One-minute weather data. |
| Folsom_sky_images_{YEAR}.tar.bz2 | Primary | Tar archives with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016, compressed with bz2. |
| Folsom_NAM_lat{LAT}_lon{LON}.csv | Primary | NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node's coordinates listed in Table I in the paper. |
| Folsom_sky_image_features.csv | Secondary | Features derived from the sky images. |
| Folsom_satellite.csv | Secondary | 10 pixel by 10 pixel GOES-15 images centered on the target location. |
| Irradiance_features_{horizon}.csv | Secondary | Irradiance features for the different forecasting horizons ({horizon} = {intra-hour, intra-day, day-ahead}). |
| Sky_image_features_intra-hour.csv | Secondary | Sky image features for the intra-hour forecasting issuing times. |
| Sat_image_features_intra-day.csv | Secondary | Satellite image features for the intra-day forecasting issuing times. |
| NAM_nearest_node_day-ahead.csv | Secondary | NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the nearest node to the target location, prepared for day-ahead forecasting. |
| Target_{horizon}.csv | Secondary | Target data for the different forecasting horizons. |
| Forecast_{horizon}.py | Code | Python script used to create the forecasts for the different horizons. |
| Postprocess.py | Code | Python script used to compute the error metrics for all the forecasts. |
https://creativecommons.org/publicdomain/zero/1.0/
Demographic Analysis of Shopping Behavior: Insights and Recommendations
Dataset Information: The Shopping Mall Customer Segmentation Dataset comprises 15,079 unique entries, featuring Customer ID, age, gender, annual income, and spending score. This dataset assists in understanding customer behavior for strategic marketing planning.
Cleaned Data Details: Data cleaned and standardized, 15,079 unique entries with attributes including - Customer ID, age, gender, annual income, and spending score. Can be used by marketing analysts to produce a better strategy for mall specific marketing.
Challenges Faced:
1. Data Cleaning: Overcoming inconsistencies and missing values required meticulous attention.
2. Statistical Analysis: Interpreting demographic data accurately demanded collaborative effort.
3. Visualization: Crafting informative visuals to convey insights effectively posed design challenges.

Research Topics:
1. Consumer Behavior Analysis: Exploring psychological factors driving purchasing decisions.
2. Market Segmentation Strategies: Investigating effective targeting based on demographic characteristics.

Suggestions for Project Expansion:
1. Incorporate External Data: Integrate social media analytics or geographic data to enrich customer insights.
2. Advanced Analytics Techniques: Explore advanced statistical methods and machine learning algorithms for deeper analysis.
3. Real-Time Monitoring: Develop tools for agile decision-making through continuous customer behavior tracking.

This summary outlines the demographic analysis of shopping behavior, highlighting key insights, dataset characteristics, team contributions, challenges, research topics, and suggestions for project expansion. Leveraging these insights can enhance marketing strategies and drive business growth in the retail sector.
References:
- OpenAI. (2022). ChatGPT [Computer software]. https://openai.com/chatgpt
- Mustafa, Z. (2022). Shopping Mall Customer Segmentation Data [Data set]. Kaggle. https://www.kaggle.com/datasets/zubairmustafa/shopping-mall-customer-segmentation-data
- Donkeys. (n.d.). Kaggle Python API [Jupyter Notebook]. Kaggle. https://www.kaggle.com/code/donkeys/kaggle-python-api/notebook
- pandas-datareader. (n.d.). https://pypi.org/project/pandas-datareader/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension.

All feature values were normalized using z-score standardization to ensure a uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling.

The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability.

All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically relevant dataset.
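For orientation, a condensed scikit-learn sketch of the described workflow is shown below; the full pipeline lives in breast_cancer_classification_models.py, and the input file and column names used here are assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("wdbc.csv").drop(columns=["id"])   # drop the identifier column
y = df.pop("diagnosis").map({"M": 1, "B": 0})       # 1 = malignant, 0 = benign
X = StandardScaler().fit_transform(df)              # z-score standardization

# 80/20 split with stratified sampling to preserve class balance
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Perceptron": Perceptron(random_state=42),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(name, accuracy_score(y_te, pred), f1_score(y_te, pred))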
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.
A description of this dataset, including the methodology and validation results, is available at:
Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.
ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement for a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such a gap-filling method is that it relies only on the original observational record, without the need for ancillary variables or model-based information. Owing to this intrinsic challenge, no global, long-term, univariate gap-filled product has been available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.
You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following command will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.
#!/bin/bash
# Set download directory
DOWNLOAD_DIR=~/Downloads
base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"
# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
echo "Downloading $year.zip..."
wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
    unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
rm "$DOWNLOAD_DIR/$year.zip"
done
The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:
ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:
Additional information for each variable is given in the netCDF attributes.
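For example, a single daily file can be inspected with xarray (one of many netCDF/CF-capable tools); the soil-moisture variable name used below is a placeholder, as the actual names are listed in the netCDF attributes.

import xarray as xr

fname = "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200101000000-fv09.1r1.nc"
ds = xr.open_dataset(fname)

print(ds.data_vars)   # available data variables and their attributes
print(ds["sm"])       # hypothetical soil-moisture variable name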
Changes in v09.1r1 (previous version was v09.1):
These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:
The following records are all part of the Soil Moisture Climate Data Records from satellites community
ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets contain the raw data and preprocessed data (following the steps in the Jupyter Notebook) of 9 DHT22 sensors in a cold storage room. Details on how the data was gathered can be found in the publication "Self-Adaptive Integration of Distributed Sensor Systems for Monitoring Cold Storage Environments" by Elia Henrichs, Florian Stoll, and Christian Krupitzer.
This dataset consists of the following files:
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.
The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:
After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.
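As a brief illustration (the included Jupyter Notebook is the authoritative example), the token lists can be loaded from the pickle file and passed to a gensim LDA model; the number of topics and passes below are arbitrary choices.

import pickle
from gensim import corpora
from gensim.models import LdaModel

with open("scied_words_bigrams_V5.pkl", "rb") as f:
    documents = pickle.load(f)   # list of token lists, one per article

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=5, random_state=0)
for topic_id, words in lda.print_topics(num_topics=5):
    print(topic_id, words)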
In addition to this file, we have also included the following files:
This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This research data file contains the necessary software and the dataset for estimating the missing prices of house units. This approach combines several machine learning techniques (linear regression, support vector regression, k-nearest neighbors, and a multi-layer perceptron neural network) with several dimensionality reduction techniques (non-negative factorization, recursive feature elimination, and feature selection with a variance threshold). It includes the input dataset formed from the available house prices in two neighborhoods of Teruel city (Spain) on November 13, 2017, from the Idealista website. These two neighborhoods are the center of the city and "Ensanche".
This dataset supports the research of the authors in the improvement of the setup of agent-based simulations about real-estate market. The work about this dataset has been submitted for consideration for publication to a scientific journal.
The open-source Python code comprises all the files with the ".py" extension. The main program can be executed from the "main.py" file. The "boxplotErrors.eps" file is a chart generated from the execution of the code; it compares the results of the different combinations of machine learning techniques and dimensionality reduction methods.
The dataset is in the “data” folder. The input raw data of the house prices are in the “dataRaw.csv” file. These were shuffled into the “dataShuffled.csv” file. We used cross-validation to obtain the estimations of house prices. The outputted estimations alongside the real values are stored in different files of the “data” folder, in which each filename is composed by the machine learning technique abbreviation and the dimensionality reduction method abbreviation.
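As an illustration of one of the described combinations (recursive feature elimination paired with linear regression, evaluated by cross-validation), a short scikit-learn sketch follows; the input file and target column names are assumptions, and the actual pipeline is implemented in the ".py" files.

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

data = pd.read_csv("data/dataShuffled.csv")
y = data.pop("price")            # hypothetical target column
X = pd.get_dummies(data)         # encode any categorical attributes

# Recursive feature elimination followed by a linear regression estimator
model = make_pipeline(RFE(LinearRegression(), n_features_to_select=10), LinearRegression())

# Cross-validated estimations of house prices
estimates = cross_val_predict(model, X, y, cv=5)
print(pd.DataFrame({"real": y, "estimated": estimates}).head())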
The NA-CORDEX dataset contains output from high-resolution regional climate models run over North America using boundary conditions from global simulations in the CMIP5 archive. The subset of the NA-CORDEX data on AWS (data volume ~15 TB) includes daily data from 1950-2100 for impacts-relevant variables on a 0.25 degree or 0.50 degree common lat-lon grid. This data is freely available on AWS S3 thanks to the AWS Open Data Sponsorship Program and the Amazon Sustainability Data Initiative, which provide free storage and egress. The data on AWS is stored in Zarr format. This format supports the same data model as netCDF and is well suited to object storage and distributed computing in the cloud using the Pangeo libraries in Python. An Intake-ESM Catalog listing all available data can be found at: [https://ncar-na-cordex.s3-us-west-2.amazonaws.com/catalogs/aws-na-cordex.json] The full dataset (data volume ~35 TB) can be accessed for download or via web services on the NCAR Climate Data Gateway. [https://www.earthsystemgrid.org/search/cordexsearch.html]
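For example, the Intake-ESM catalog can be opened with the intake package from the Pangeo stack and queried before loading data into xarray; the search terms below are illustrative and depend on the catalog's column names.

import intake

cat_url = "https://ncar-na-cordex.s3-us-west-2.amazonaws.com/catalogs/aws-na-cordex.json"
cat = intake.open_esm_datastore(cat_url)

subset = cat.search(variable="tmax")   # illustrative query; valid keys depend on the catalog
dsets = subset.to_dataset_dict()       # loads matching Zarr stores as xarray Datasets (needs zarr and s3fs)
print(list(dsets))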
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 9. Supplementary Python source codes.
Code1: Python code for the model using RF classifier with SMOTE correction for data set imbalance.
Code2: Python code for the model using SVM classifier with SMOTE correction for data set imbalance.
Code3: Python code for the model using BDT classifier with SMOTE correction for data set imbalance.
Code4: Python code for the model using Ada classifier with SMOTE correction for data set imbalance.
Code5: Python code for the model using RF classifier with Random oversampling correction for data set imbalance.
Code6: Python code for the model using SVM classifier with Random oversampling correction for data set imbalance.
Code7: Python code for the model using BDT classifier with Random oversampling correction for data set imbalance.
Code8: Python code for the model using Ada classifier with Random oversampling correction for data set imbalance.
Code9: Python code for the model using RF classifier with no correction for data set imbalance.
Code10: Python code for the model using SVM classifier with no correction for data set imbalance.
Code11: Python code for the model using BDT classifier with no correction for data set imbalance.
Code12: Python code for the model using Ada classifier with no correction for data set imbalance.
Code13: Python code for the ROC curves of models with SMOTE correction for data set imbalance.
Code14: Python code for the ROC curves of models with Random oversampling correction for data set imbalance.
Code15: Python code for the ROC curves of models with no correction for data set imbalance.
Code16: Python code for the model using RF classifier with SMOTE correction for data set imbalance, and imputing the missing data by the method of backfilling missing values.
Code17: Python code for the model using RF classifier with SMOTE correction for data set imbalance, and imputing the missing data by means.
Code18: Python code for tuning of the model using RF classifier with SMOTE correction for data set imbalance.
Code19: Python code for calculating the standard deviations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LimnoSat-US is an analysis-ready remote sensing database that includes reflectance values spanning 36 years for 56,792 lakes across > 328,000 Landsat scenes. The database comes pre-processed with cross-sensor standardization and the effects of clouds, cloud shadows, snow, ice, and macrophytes removed. In total, it contains over 22 million individual lake observations with an average of 393 +/- 233 (mean +/- standard deviation) observations per lake over the 36 year period. The data and code contained within this repository are as follows:
HydroLakes_DP.shp: A shapefile containing the deepest points for all U.S. lakes within HydroLakes. For more information on the deepest point see https://doi.org/10.5281/zenodo.4136754 and Shen et al (2015).
LakeExport.py: Python code to extract reflectance values for U.S. lakes from Google Earth Engine.
GEE_pull_functions.py: Functions called within LakeExport.py
01_LakeExtractor.Rmd: An R Markdown file that takes the raw data from LakeExport.py and processes it for the final database.
SceneMetadata.csv: A file containing additional information such as scene cloud cover and sun angle for all Landsat scenes within the database. Can be joined to the final database using LandsatID.
srCorrected_us_hydrolakes_dp_20200628: The final LimnoSat-US database containing all cloud-free observations of U.S. lakes from 1984-2020. Missing values for bands not shared between sensors (Aerosol and TIR2) are denoted by -99. dWL is the dominant wavelength calculated following Wang et al. (2015). pCount_dswe1 represents the number of high-confidence water pixels within 120 meters of the deepest point. pCount_dswe3 represents the number of vegetated water pixels within 120 meters and can be used as a flag for potential reflectance noise. All reflectance values represent the median value of high-confidence water pixels within 120 meters. The final database is provided in both .csv and .feather formats. It can be linked to SceneMetadata.csv using LandsatID. All reflectance values are derived from USGS T1-SR Landsat scenes.
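As a small illustration of the join described above, the .csv variant of the database can be linked to SceneMetadata.csv on LandsatID with pandas; treating -99 as missing on load is a convenience choice for the bands not shared between sensors.

import pandas as pd

obs = pd.read_csv("srCorrected_us_hydrolakes_dp_20200628.csv", na_values=[-99])
meta = pd.read_csv("SceneMetadata.csv")

# Attach scene cloud cover, sun angle, etc. to each lake observation
merged = obs.merge(meta, on="LandsatID", how="left")
print(merged.head())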
http://opendatacommons.org/licenses/dbcl/1.0/
This is a fitness tracker product dataset consisting of different products from various brands with their specifications, ratings, and reviews for the Indian market. The data has been collected from the e-commerce websites Flipkart and Amazon using web scraping techniques.
Inspiration
This dataset could be used to find answers to some interesting questions, such as:
1. Is there a significant demand for fitness trackers in the Indian market?
2. What are the top 5 brands for fitness bands and smart watches?
3. Is there a correlation between prices and product specifications, ratings, etc.?
4. What are the different types of fitness trackers and their price segments for different users?
This dataset contains 451 samples with 16 attributes. There are some missing values in this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SignalFlowEEG Example Data dataset contains sample EEG recordings that demonstrate the capabilities and usage of the SignalFlowEEG Python package. This package provides a comprehensive set of tools for processing, analyzing, and visualizing electroencephalography (EEG) data, with a focus on neuroscience research applications.

The example dataset includes EEG recordings from various paradigms:
- Resting-state EEG: A 5-minute recording where the subject relaxed with eyes closed.
- Auditory chirp stimulation: EEG recorded while the subject listened to chirp sounds with varying frequencies.
- Visual evoked potentials: EEG recorded as the subject viewed checkerboard pattern stimuli to elicit visual responses.

These recordings were collected at the Cincinnati Children's Hospital Medical Center and are made available for educational and testing purposes. SignalFlowEEG builds upon MNE-Python, a popular open-source library for EEG analysis, and offers additional functionality tailored for clinical research workflows. This example dataset allows users to explore SignalFlowEEG's features and gain hands-on experience analyzing EEG data with this powerful Python package.

The dataset consists of .set files, a format used by the EEGLAB toolbox. Each file contains raw EEG data, channel info, and event markers for a specific experimental paradigm. Files can be loaded using mne.io.read_raw_eeglab() from MNE-Python, a SignalFlowEEG dependency. The dataset has no missing data or special abbreviations. Channel names and event markers follow standard EEGLAB conventions.
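For example, a recording can be loaded directly with MNE-Python; the file name below is an assumption.

import mne

raw = mne.io.read_raw_eeglab("resting_state_eyes_closed.set", preload=True)
print(raw.info)                                       # channel info and sampling rate
events, event_id = mne.events_from_annotations(raw)   # event markers stored in the .set file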
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically