Data set for article: Effect of data preprocessing and machine learning...
opal.latrobe.edu.au
researchdata.edu.au
hdf
Updated Mar 7, 2024
Cite
Wil Gardner (2024). Data set for article: Effect of data preprocessing and machine learning hyperparameters on mass spectrometry imaging models [Dataset]. http://doi.org/10.26181/22671022.v1
Available download formats: hdf
Unique identifier
https://doi.org/10.26181/22671022.v1
Dataset updated
Mar 7, 2024
Dataset provided by
La Trobe
Authors
Wil Gardner
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This data set is uploaded as supporting information for the publication entitled: Effect of data preprocessing and machine learning hyperparameters on mass spectrometry imaging models

Files are as follows:
polymer_microarray_data.mat - MATLAB workspace file containing peak-picked ToF-SIMS data (hyperspectral array) for the polymer microarray sample.
nylon_data.mat - MATLAB workspace file containing m/z binned ToF-SIMS data (hyperspectral array) for the semi-synthetic nylon data set, generated from 7 nylon samples.

Additional details about the datasets can be found in the published article.

If you use this data set in your work, please cite our work as follows:
Cite as: Gardner et al. J. Vac. Sci. Technol. A 41, 000000 (2023); doi: 10.1116/6.0002788
Preprocessing steps.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Jun 28, 2024
Cite
Kim, Min-Hee; Ahn, Hyeong Jun; Ishikawa, Kyle (2024). Preprocessing steps. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001483628
Dataset updated
Jun 28, 2024
Authors
Kim, Min-Hee; Ahn, Hyeong Jun; Ishikawa, Kyle
Description
In this study, we employed various machine learning models to predict metabolic phenotypes, focusing on thyroid function, using a dataset from the National Health and Nutrition Examination Survey (NHANES) from 2007 to 2012. Our analysis utilized laboratory parameters relevant to thyroid function or metabolic dysregulation in addition to demographic features, aiming to uncover potential associations between thyroid function and metabolic phenotypes by various machine learning methods. Multinomial Logistic Regression performed best to identify the relationship between thyroid function and metabolic phenotypes, achieving an area under receiver operating characteristic curve (AUROC) of 0.818, followed closely by Neural Network (AUROC: 0.814). Following the above, the performance of Random Forest, Boosted Trees, and K Nearest Neighbors was inferior to the first two methods (AUROC 0.811, 0.811, and 0.786, respectively). In Random Forest, homeostatic model assessment for insulin resistance, serum uric acid, serum albumin, gamma glutamyl transferase, and triiodothyronine/thyroxine ratio were positioned in the upper ranks of variable importance. These results highlight the potential of machine learning in understanding complex relationships in health data. However, it’s important to note that model performance may vary depending on data characteristics and specific requirements. Furthermore, we emphasize the significance of accounting for sampling weights in complex survey data analysis and the potential benefits of incorporating additional variables to enhance model accuracy and insights. Future research can explore advanced methodologies combining machine learning, sample weights, and expanded variable sets to further advance survey data analysis.
Ecommerce Dataset for Data Analysis
kaggle.com
zip
Updated Sep 19, 2024
Cite
Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
Available download formats: zip (2028853 bytes)
Dataset updated
Sep 19, 2024
Authors
Shrishti Manja
Description
This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.

Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.

This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.

Employee Performance & Salary (Synthetic Dataset)

kaggle.com

zip

Updated Oct 10, 2025

Cite

Mamun Hasan (2025). Employee Performance & Salary (Synthetic Dataset) [Dataset]. https://www.kaggle.com/datasets/mamunhasan2cs/employee-performance-and-salary-synthetic-dataset

Available download formats: zip (13002 bytes)

Dataset updated

Oct 10, 2025

Authors

Mamun Hasan

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

🧑‍💼 Employee Performance and Salary Dataset

This synthetic dataset simulates employee information in a medium-sized organization, designed specifically for data preprocessing and exploratory data analysis (EDA) tasks in Data Mining and Machine Learning labs.

It includes over 1,000 employee records with realistic variations in age, gender, department, experience, performance score, and salary — along with missing values, duplicates, and outliers to mimic real-world data quality issues.

📊 Columns Description

Column Name	Description
Employee_ID	Unique employee identifier (E0001, E0002, …)
Age	Employee age (22–60 years)
Gender	Gender of the employee (Male/Female)
Department	Department where the employee works (HR, Finance, IT, Marketing, Sales, Operations)
Experience_Years	Total years of work experience (contains missing values)
Performance_Score	Employee performance score (0–100, contains missing values)
Salary	Annual salary in USD (contains outliers)

🧠 Example Lab Tasks

Identify and impute missing values using mean or median.
Detect and remove duplicate employee records.
Detect outliers in Salary using IQR or Z-score.
Normalize Salary and Performance_Score using Min-Max scaling.
Encode categorical columns (Gender, Department) for model training.
Ideal for Regression
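
The lab tasks listed above can be sketched in a few lines of pandas. This is only an illustration: the column names come from the table above, but the six rows are invented stand-ins, and the 1.5 × IQR fence and one-hot encoding are common default choices rather than anything the dataset author prescribes.

```python
import numpy as np
import pandas as pd

# Invented stand-in rows; only the column names come from the dataset description.
df = pd.DataFrame({
    "Employee_ID": ["E0001", "E0002", "E0002", "E0003", "E0004", "E0005"],
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "Department": ["HR", "IT", "IT", "Sales", "IT", "Finance"],
    "Experience_Years": [2.0, np.nan, np.nan, 10.0, 6.0, 4.0],
    "Performance_Score": [70.0, 85.0, 85.0, np.nan, 90.0, 60.0],
    "Salary": [40000.0, 52000.0, 52000.0, 61000.0, 900000.0, 48000.0],
})

# 1. Remove exact duplicate records.
df = df.drop_duplicates().reset_index(drop=True)

# 2. Impute missing numeric values with the column median.
for col in ["Experience_Years", "Performance_Score"]:
    df[col] = df[col].fillna(df[col].median())

# 3. Drop Salary outliers outside the 1.5 * IQR fences (assumed threshold).
q1, q3 = df["Salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["Salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].copy()

# 4. Min-Max scale Salary and Performance_Score to [0, 1].
for col in ["Salary", "Performance_Score"]:
    lo, hi = df[col].min(), df[col].max()
    df[col] = (df[col] - lo) / (hi - lo)

# 5. One-hot encode the categorical columns.
df = pd.get_dummies(df, columns=["Gender", "Department"])
print(df)
```

Imputing before removing outliers, as done here, lets an extreme value influence the median only slightly; the order of steps is a judgment call and may be reversed.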

🎯 Possible Regression Targets (Dependent Variables)

Salary → Predict salary based on experience, performance, department, and age.
Performance_Score → Predict employee performance based on age, experience, and department.

🧩 Example Regression Problem

Predict the employee's salary based on their experience, performance score, and department.

🧠 Sample Features:

X = ['Age', 'Experience_Years', 'Performance_Score', 'Department', 'Gender']
y = ['Salary']

You can apply:

Linear Regression
Ridge/Lasso Regression
Random Forest Regressor
XGBoost Regressor
SVR (Support Vector Regression)
and evaluate with metrics like:

R², MAE, MSE, RMSE, and residual plots.
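
As a rough sketch of the evaluation step, the metrics named above can be computed directly with NumPy. The data below is randomly generated for illustration (it is not drawn from the dataset), the model is a plain least-squares linear regression, and `regression_metrics` is a hypothetical helper name.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, MSE and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "R2": 1.0 - np.sum(residuals ** 2) / ss_tot,
        "MAE": np.mean(np.abs(residuals)),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
    }

# Synthetic stand-in data: salary as a noisy linear function of two features.
rng = np.random.default_rng(42)
experience = rng.uniform(0, 30, 200)
performance = rng.uniform(0, 100, 200)
salary = 30000 + 1500 * experience + 200 * performance + rng.normal(0, 2000, 200)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones_like(experience), experience, performance])
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)

metrics = regression_metrics(salary, X @ coef)
print(metrics)
```

For a residual plot, the values `salary - X @ coef` can be plotted against the predictions with any plotting library.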

DATS 6401 - Final Project - Yon ho Cheong.zip
figshare.com
zip
Updated Dec 15, 2018
Cite
Yon ho Cheong (2018). DATS 6401 - Final Project - Yon ho Cheong.zip [Dataset]. http://doi.org/10.6084/m9.figshare.7471007.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.7471007.v1
Dataset updated
Dec 15, 2018
Dataset provided by
Figshare (http://figshare.com/)
Authors
Yon ho Cheong
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract
The H1B is an employment-based visa category for temporary foreign workers in the United States. Every year, the US immigration department receives over 200,000 petitions and selects 85,000 applications through a random process, and the U.S. employer must submit a petition for an H1B visa to the US immigration department. This is the most common visa status applied to international students once they complete college or higher education and begin working in a full-time position. The project provides essential information on job titles, preferred regions of settlement, and trends among foreign applicants and employers for H1B visa applications. Locations, employers, job titles, and salary ranges make up most of the H1B petition data, so different visualization tools are used to analyze and interpret H1B visa trends and to provide a recommendation to the applicant. This report is the basis of the project for the Visualization of Complex Data class at the George Washington University. Some examples in this project analyze the relevant variables (Case Status, Employer Name, SOC name, Job Title, Prevailing Wage, Worksite, and Latitude and Longitude information) from Kaggle and the Office of Foreign Labor Certification (OFLC) in order to see the H1B visa changes in the past several decades.

Keywords: H1B visa, Data Analysis, Visualization of Complex Data, HTML, JavaScript, CSS, Tableau, D3.js

Dataset
The dataset contains 10 columns and covers a total of 3 million records spanning 2011-2016. The relevant columns in the dataset include case status, employer name, SOC name, job title, full time position, prevailing wage, year, worksite, and latitude and longitude information.
Link to dataset: https://www.kaggle.com/nsharan/h-1b-visa
Link to dataset (FY2017): https://www.foreignlaborcert.doleta.gov/performancedata.cfm

Running the code
Open Index.html

Data Processing
Do some data preprocessing to transform the raw data into an understandable format.
Find and combine any other external datasets to enrich the analysis, such as the FY2017 dataset.
To make appropriate visualizations, variables should be developed and compiled into visualization programs.
Draw a geo map and scatter plot to compare the fastest growth in fixed value and in percentages.
Extract some aspects and analyze the changes in employers’ preferences as well as forecasts for future trends.

Visualizations
Combo chart: shows the overall volume of receipts and the approval rate.
Scatter plot: shows the beneficiary country of birth.
Geo map: shows H1B petitions filed across all states.
Line chart: shows the top 10 states for H1B petitions filed.
Pie chart: shows a comparison of education level and occupations for petitions, FY2011 vs FY2017.
Tree map: shows the overall top employers who submit the greatest number of applications.
Side-by-side bar chart: shows an overall comparison of Data Scientist and Data Analyst.
Highlight table: shows the mean wage of a Data Scientist and Data Analyst with case status certified.
Bubble chart: shows the top 10 companies for Data Scientist and Data Analyst.

Related Research
The H-1B Visa Debate, Explained - Harvard Business Review: https://hbr.org/2017/05/the-h-1b-visa-debate-explained
Foreign Labor Certification Data Center: https://www.foreignlaborcert.doleta.gov
Key facts about the U.S. H-1B visa program: http://www.pewresearch.org/fact-tank/2017/04/27/key-facts-about-the-u-s-h-1b-visa-program/
H1B visa News and Updates from The Economic Times: https://economictimes.indiatimes.com/topic/H1B-visa/news
H-1B visa - Wikipedia: https://en.wikipedia.org/wiki/H-1B_visa

Key Findings
From the analysis, the government is cutting down the number of approvals for H1B in 2017.
In the past decade, due to the nature of demand for high-skilled workers, visa holders have clustered in STEM fields and come mostly from countries in Asia such as China and India.
Technical jobs, such as Computer Systems Analyst and Software Developer, make up the majority of the top 10 jobs among foreign workers.
Employers located in metro areas strive to find foreign workers who can fill the technical positions in their organizations.
States like California, New York, Washington, New Jersey, Massachusetts, Illinois, and Texas are the prime locations for foreign workers and provide many job opportunities.
Top companies that submit the most H1B visa applications, such as Infosys, Tata, and IBM India, are companies based in India associated with software and IT services.
The Data Scientist position has experienced exponential growth in terms of H1B visa applications, and jobs are clustered in the West region, which has the highest number.

Visualization utilizing programs
HTML, JavaScript, CSS, D3.js, Google API, Python, R, and Tableau
Demo dataset for: SPACEc, a streamlined, interactive Python workflow for...
data.niaid.nih.gov
datadryad.org
zip
Updated Jul 8, 2024
Cite
Yuqi Tan; Tim Kempchen (2024). Demo dataset for: SPACEc, a streamlined, interactive Python workflow for multiplexed image processing and analysis [Dataset]. http://doi.org/10.5061/dryad.brv15dvj1
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.brv15dvj1
Dataset updated
Jul 8, 2024
Dataset provided by
Stanford University School of Medicine
Authors
Yuqi Tan; Tim Kempchen
License
https://spdx.org/licenses/CC0-1.0.html
Description
Multiplexed imaging technologies provide insights into complex tissue architectures. However, challenges arise due to software fragmentation with cumbersome data handoffs, inefficiencies in processing large images (8 to 40 gigabytes per image), and limited spatial analysis capabilities. To efficiently analyze multiplexed imaging data, we developed SPACEc, a scalable end-to-end Python solution that handles image extraction, cell segmentation, and data preprocessing and incorporates machine-learning-enabled, multi-scaled, spatial analysis, operated through a user-friendly and interactive interface. The demonstration dataset was derived from a previous analysis and contains TMA cores from a human tonsil and tonsillitis sample that were acquired with the Akoya PhenocyclerFusion platform. The dataset can be used to test the workflow and establish it on a user’s system or to familiarize oneself with the pipeline.

Methods

Tissue samples: Tonsil cores were extracted from a larger multi-tumor tissue microarray (TMA), which included a total of 66 unique tissues (51 malignant and semi-malignant tissues, as well as 15 non-malignant tissues). Representative tissue regions were annotated on corresponding hematoxylin and eosin (H&E)-stained sections by a board-certified surgical pathologist (S.Z.). Annotations were used to generate the 66 cores, each with a diameter of 1 mm. FFPE tissue blocks were retrieved from the tissue archives of the Institute of Pathology, University Medical Center Mainz, Germany, and the Department of Dermatology, University Medical Center Mainz, Germany. The multi-tumor-TMA block was sectioned at 3 µm thickness onto SuperFrost Plus microscopy slides before being processed for CODEX multiplex imaging as previously described.

CODEX multiplexed imaging and processing: To run the CODEX machine, the slide was taken from the storage buffer and placed in PBS for 10 minutes to equilibrate. After drying the PBS with a tissue, a flow cell was sealed onto the tissue slide. The assembled slide and flow cell were then placed in a PhenoCycler Buffer made from 10X PhenoCycler Buffer & Additive for at least 10 minutes before starting the experiment. A 96-well reporter plate was prepared with each reporter corresponding to the correct barcoded antibody for each cycle, with up to 3 reporters per cycle per well. The fluorescence reporters were mixed with 1X PhenoCycler Buffer, Additive, nuclear-staining reagent, and assay reagent according to the manufacturer's instructions. With the reporter plate and assembled slide and flow cell placed into the CODEX machine, the automated multiplexed imaging experiment was initiated. Each imaging cycle included steps for reporter binding, imaging of three fluorescent channels, and reporter stripping to prepare for the next cycle and set of markers. This was repeated until all markers were imaged. After the experiment, a .qptiff image file containing individual antibody channels and the DAPI channel was obtained.

Image stitching, drift compensation, deconvolution, and cycle concatenation are performed within the Akoya PhenoCycler software. The raw imaging data output (tiff, 377.442 nm per pixel for 20x CODEX) is first examined with QuPath software (https://qupath.github.io/) for inspection of staining quality. Any markers that produce unexpected patterns or low signal-to-noise ratios should be excluded from the ensuing analysis. The qptiff files must be converted into tiff files for input into SPACEc.

Data preprocessing includes image stitching, drift compensation, deconvolution, and cycle concatenation performed using the Akoya Phenocycler software. The raw imaging data (qptiff, 377.442 nm/pixel for 20x CODEX) files from the Akoya PhenoCycler technology were first examined with QuPath software (https://qupath.github.io/) to inspect staining qualities. Markers with untenable patterns or low signal-to-noise ratios were excluded from further analysis. A custom CODEX analysis pipeline was used to process all acquired CODEX data (scripts available upon request). The qptiff files were converted into tiff files for tissue detection (watershed algorithm) and cell segmentation.

Student Academic Performance (Synthetic Dataset)

kaggle.com

zip

Updated Oct 10, 2025

Cite

Mamun Hasan (2025). Student Academic Performance (Synthetic Dataset) [Dataset]. https://www.kaggle.com/datasets/mamunhasan2cs/student-academic-performance-synthetic-dataset

Available download formats: zip (9287 bytes)

Dataset updated

Oct 10, 2025

Authors

Mamun Hasan

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.

The dataset is suitable for laboratory exercises, assignments, and demonstration of key preprocessing techniques such as:

Handling missing values
Removing duplicates
Detecting and treating outliers
Data normalization and transformation
Encoding categorical variables
Exploratory data analysis (EDA)
Regression Analysis

📊 Columns Description

Column Name	Description
Student_ID	Unique identifier for each student (e.g., S0001, S0002, …)
Age	Age of the student (between 18 and 25 years)
Gender	Gender of the student (Male/Female)
Study_Hours	Average number of study hours per day (contains missing values and outliers)
Attendance(%)	Percentage of class attendance (contains missing values)
Test_Score	Final exam score (0–100 scale)
Grade	Letter grade derived from test scores (`F`, `C`, `B`, `A`, `A+`)

🧠 Example Lab Tasks Using This Dataset:

Identify and impute missing values using mean/median.
Detect and remove duplicate records.
Use IQR or Z-score methods to handle outliers.
Normalize Study_Hours and Test_Score using Min-Max scaling.
Encode categorical variables (Gender, Grade) for model input.
Prepare a clean dataset ready for classification/regression analysis.
Can be used for Limited Regression
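
As one illustration of the Z-score task above, the sketch below flags extreme Study_Hours values and replaces them, along with missing entries, by the median of the remaining values. The numbers are invented, the |z| > 3 cutoff is a conventional assumption, and replacing rather than dropping outliers is just one possible treatment.

```python
import numpy as np

# Invented Study_Hours values: 20 plausible entries, one outlier, one missing value.
hours = np.array([2.0, 2.5, 3.0, 3.5, 4.0] * 4 + [24.0, np.nan])

# Z-scores computed while ignoring missing values.
mean, std = np.nanmean(hours), np.nanstd(hours)
z = (hours - mean) / std
outliers = np.abs(z) > 3.0  # NaN entries compare as False here

# Replace outliers and missing values with the median of the inliers.
median = np.nanmedian(hours[~outliers])
cleaned = np.where(outliers | np.isnan(hours), median, hours)
print(cleaned)
```

On very small samples the Z-score method can fail to flag anything, because a single extreme value inflates the standard deviation; the IQR method is less sensitive to this.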

🎯 Possible Regression Targets

Test_Score → Predict test score based on study hours, attendance, age, and gender.

🧩 Example Regression Problem

Predict the student’s test score using their study hours, attendance percentage, and age.

🧠 Sample Features:

X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)']
y = ['Test_Score']

You can use:

Linear Regression (for simplicity)
Polynomial Regression (to explore nonlinear patterns)
Decision Tree Regressor or Random Forest Regressor

And analyze feature influence using correlation or SHAP/LIME explainability.

Malaria disease and grading system dataset from public hospitals reflecting...
data.niaid.nih.gov
datadryad.org
zip
Updated Nov 10, 2023
Cite
Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie (2023). Malaria disease and grading system dataset from public hospitals reflecting complicated and uncomplicated conditions [Dataset]. http://doi.org/10.5061/dryad.4xgxd25gn
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.4xgxd25gn
Dataset updated
Nov 10, 2023
Dataset provided by
Nasarawa State University
Authors
Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie
License
https://spdx.org/licenses/CC0-1.0.html
Description
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Various machine learning algorithms, such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), have been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation between attributes with the ability to predict new situations. The model (MNB) has 97% accuracy. It is concluded that this model outperforms the GNB classifier, which has 100% accuracy, and the RF, which also has 100% accuracy.

Methods
Prior to collection of data, the researcher was guided by ethical training certification on data collection and the right to confidentiality and privacy, under the Institutional Review Board (IRB). Data was collected from the manual archives of the hospitals, purposively selected using a stratified sampling technique, transformed to electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked for the laboratory confirmation result from diagnosis. The data was divided into two tables: the first table, called data1, contains data for use in phase 1 of the classification, while the second table, data2, contains data for use in phase 2 of the classification.

Data Source Collection
The malaria incidence data set was obtained from public hospitals from 2017 to 2021. These are the data used for modeling and analysis. The geographical location and socio-economic factors available for patients inhabiting those areas were also kept in mind. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.

Data Preprocessing: Data preprocessing shall be done to remove noise and outliers.
Transformation: The data shall be transformed from analog to electronic record.

Data Partitioning
The collected data will be divided into two portions: one portion shall be extracted as a training set, while the other portion will be used for testing. The training portion taken from one table stored in the database shall be called training set 1, while the training portion taken from another table stored in the database shall be called training set 2. The dataset was split into two parts: 70% as the training sample and 30% for testing. Then, using MNB classification algorithms implemented in Python, the models were trained on the training sample. The resulting models were tested on the remaining 30% of the data, and the results were compared with the other machine learning models using the standard metrics.

Classification and prediction: Based on the nature of the variables in the dataset, this study will use Naïve Bayes (Multinomial) classification techniques in two phases: Classification phase 1 and Classification phase 2. The operation of the framework is illustrated as follows:
i. Data collection and preprocessing shall be done.
ii. Preprocessed data shall be stored in training set 1 and training set 2. These datasets shall be used during classification.
iii. The test data set shall be stored in the database as the test data set.
iv. Part of the test data set must be classified using classifier 1 and the remaining part must be classified with classifier 2, as follows:
Classifier phase 1: It classifies into positive or negative classes. If the patient has malaria, the patient is classified as positive (P), while a patient is classified as negative (N) if the patient does not have malaria.
Classifier phase 2: It classifies only data that has been classified as positive by classifier 1, further classifying it into the complicated and uncomplicated class labels. The classifier will also capture data on environmental factors, genetics, gender and age, cultural and socio-economic variables. The system will be designed such that the core parameters, as determining factors, should supply their values.
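
The two-phase scheme described above can be sketched as a cascade of two multinomial Naive Bayes models. This is an illustrative reconstruction, not the authors' code: `TinyMultinomialNB` and `grade_patients` are hypothetical names, the three count features and all training rows are invented, and a library implementation would normally replace the hand-written classifier.

```python
import numpy as np

class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace smoothing (illustration only)."""

    def fit(self, X, y, alpha=1.0):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # Smoothed per-class feature counts, normalised to probabilities.
        counts = np.array([X[y == c].sum(axis=0) for c in self.classes_]) + alpha
        self.log_prob_ = np.log(counts / counts.sum(axis=1, keepdims=True))
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        return self

    def predict(self, X):
        scores = np.asarray(X, dtype=float) @ self.log_prob_.T + self.log_prior_
        return self.classes_[scores.argmax(axis=1)]

def grade_patients(X, phase1, phase2):
    """Phase 1 labels P/N; phase 2 grades only the rows labelled P."""
    X = np.asarray(X)
    result = np.array(["N"] * len(X), dtype=object)
    positive = phase1.predict(X) == "P"
    if positive.any():
        result[positive] = phase2.predict(X[positive])
    return result

# Invented count features: [mild symptoms, severe signs, no-symptom marker].
phase1 = TinyMultinomialNB().fit(
    [[3, 0, 0], [2, 1, 0], [0, 0, 3], [0, 0, 2], [3, 2, 0], [0, 0, 4]],
    ["P", "P", "N", "N", "P", "N"],
)
phase2 = TinyMultinomialNB().fit(
    [[3, 0, 0], [2, 0, 0], [3, 2, 0], [1, 3, 0]],
    ["uncomplicated", "uncomplicated", "complicated", "complicated"],
)

grades = grade_patients([[0, 0, 3], [4, 0, 0], [1, 4, 0]], phase1, phase2)
print(list(grades))
```

Because phase 2 only ever sees rows that phase 1 marked positive, any phase 1 false negative is never graded; the accuracy of the cascade is therefore bounded by that of the first classifier.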
Metabolomics Data Preprocessing PQN PCA
kaggle.com
zip
Updated Nov 29, 2025
Cite
Dr. Nagendra (2025). Metabolomics Data Preprocessing PQN PCA [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/metabolomics-data-preprocessing-pqn-pca
Available download formats: zip (22763 bytes)
Dataset updated
Nov 29, 2025
Authors
Dr. Nagendra
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset provides a step-by-step pipeline for preprocessing metabolomics data.

The pipeline implements Probabilistic Quotient Normalization (PQN) to correct dilution effects in metabolomics measurements.

Includes guidance on handling raw metabolomics datasets obtained from LC-MS or NMR experiments.

Demonstrates Principal Component Analysis (PCA) for dimensionality reduction and exploratory data analysis.

Includes data visualization techniques to interpret PCA results effectively.

Suitable for metabolomics researchers and data scientists working on omics data.

Enables better reproducibility of preprocessing workflows for metabolomics studies.

Can be used to normalize data, detect outliers, and identify major patterns in metabolomics datasets.

Provides a Python-based notebook that is easy to adapt to new datasets.

Includes example datasets and code snippets for immediate application.

Helps users understand the impact of normalization on downstream statistical analyses.

Supports integration with other metabolomics pipelines or machine learning workflows.
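
For orientation, PQN followed by PCA can be sketched with NumPy alone. This is not the notebook shipped with the dataset: the function names are hypothetical, the reference spectrum is assumed to be the feature-wise median across samples, and the toy matrix is constructed so that each row is a diluted copy of one profile.

```python
import numpy as np

def pqn_normalize(X):
    """Probabilistic Quotient Normalization of a samples x features matrix."""
    X = np.asarray(X, dtype=float)
    reference = np.median(X, axis=0)          # reference spectrum (assumed: median)
    keep = reference > 0                      # avoid division by zero
    quotients = X[:, keep] / reference[keep]
    dilution = np.median(quotients, axis=1)   # most probable dilution per sample
    return X / dilution[:, None], dilution

def pca_scores(X, n_components=2):
    """PCA via SVD of the mean-centred matrix; returns scores and variance ratios."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    return U[:, :n_components] * S[:n_components], explained[:n_components]

# Toy data: one metabolite profile measured at three dilutions.
base = np.array([10.0, 20.0, 5.0, 40.0])
X = np.array([1.0, 2.0, 0.5])[:, None] * base

X_norm, factors = pqn_normalize(X)
print(factors)   # recovered dilution factors
print(X_norm)    # rows coincide once dilution is removed

# On the raw matrix, dilution alone accounts for all the variance.
scores, explained = pca_scores(X)
print(explained)
```

In practice PQN is often preceded by a total-intensity normalization and PCA by scaling (for example Pareto or unit variance); both are omitted here for brevity.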
Data from: SalmonScan: A Novel Image Dataset for Machine Learning and Deep...
data.mendeley.com
Updated Apr 2, 2024
Cite
Md Shoaib Ahmed (2024). SalmonScan: A Novel Image Dataset for Machine Learning and Deep Learning Analysis in Fish Disease Detection in Aquaculture [Dataset]. http://doi.org/10.17632/x3fz2nfm4w.3
Unique identifier
https://doi.org/10.17632/x3fz2nfm4w.3
Dataset updated
Apr 2, 2024
Authors
Md Shoaib Ahmed
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The SalmonScan dataset is a collection of images of salmon fish, including healthy fish and infected fish. The dataset consists of two classes of images:

Fresh salmon 🐟
Infected Salmon 🐠

This dataset is ideal for various computer vision tasks in machine learning and deep learning applications. Whether you are a researcher, developer, or student, the SalmonScan dataset offers a rich and diverse data source to support your projects and experiments.

So, dive in and explore the fascinating world of salmon health and disease!

The SalmonScan dataset (raw) consists of 24 fresh fish and 91 infected fish. [Due to server cleaning in the past, some raw datasets have been deleted]

The SalmonScan dataset (augmented) consists of approximately 1,208 images of salmon fish, classified into two classes:

Fresh salmon (healthy fish with no visible signs of disease), 456 images

Infected Salmon containing disease, 752 images

Each class contains a representative and diverse collection of images, capturing a range of different perspectives, scales, and lighting conditions. The images have been carefully curated to ensure that they are of high quality and suitable for use in a variety of computer vision tasks.

Data Preprocessing

The input images were preprocessed to enhance their quality and suitability for further analysis. The following steps were taken:

Resizing 📏: All the images were resized to a uniform size of 600 pixels in width and 250 pixels in height to ensure compatibility with the learning algorithm.

Image Augmentation 📸: To overcome the small number of images, various image augmentation techniques were applied to the input images. These included:

Horizontal Flip ↩️: The images were horizontally flipped to create additional samples.
Vertical Flip ⬆️: The images were vertically flipped to create additional samples.
Rotation 🔄: The images were rotated to create additional samples.
Cropping 🪓: A portion of the image was randomly cropped to create additional samples.
Gaussian Noise 🌌: Gaussian noise was added to the images to create additional samples.
Shearing 🌆: The images were sheared to create additional samples.
Contrast Adjustment (Gamma) ⚖️: The gamma correction was applied to the images to adjust their contrast.
Contrast Adjustment (Sigmoid) ⚖️: The sigmoid function was applied to the images to adjust their contrast.
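
Several of these augmentations can be approximated with plain NumPy, as in the sketch below. It is not the pipeline used to build the dataset: the rotation angle, noise level, gamma value, and sigmoid gain and cutoff are all assumed values, a random array stands in for a resized 600 × 250 image, and cropping and shearing are left out because they need an image library.

```python
import numpy as np

def augment(image, rng):
    """Return augmented copies of an H x W x C float image with values in [0, 1]."""
    noisy = np.clip(image + rng.normal(0.0, 0.05, image.shape), 0.0, 1.0)
    return {
        "hflip": image[:, ::-1],                       # mirror left-right
        "vflip": image[::-1, :],                       # mirror top-bottom
        "rot180": np.rot90(image, k=2, axes=(0, 1)),   # assumed angle: 180 degrees
        "gauss_noise": noisy,                          # assumed sigma: 0.05
        "gamma": image ** 0.8,                         # assumed gamma: 0.8
        "sigmoid": 1.0 / (1.0 + np.exp(-10.0 * (image - 0.5))),  # gain 10, cutoff 0.5
    }

rng = np.random.default_rng(0)
image = rng.random((250, 600, 3))  # stand-in for one resized image (height 250, width 600)
samples = augment(image, rng)

for name, arr in samples.items():
    print(name, arr.shape, float(arr.min()), float(arr.max()))
```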

Usage

To use the salmon scan dataset in your ML and DL projects, follow these steps:

Clone or download the salmon scan dataset repository from GitHub.

Use standard libraries such as numpy or pandas to convert the images into arrays, which can be input into a machine learning or deep learning model.

Split the dataset into training, validation, and test sets as per your requirement.

Preprocess the data as needed, such as resizing and normalizing the images.

Train your ML/DL model using the preprocessed training data.

Evaluate the model on the test set and make predictions on new, unseen data.
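
The splitting step above could look like the sketch below. It is only an illustration under stated assumptions: the file names are hypothetical, the 70/15/15 fractions are an arbitrary choice, and only the class sizes (456 fresh, 752 infected) come from the description. Splitting each class separately keeps the class ratio roughly the same in every subset.

```python
import random

def stratified_split(paths_by_class, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split each class separately so class ratios are kept in every subset."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, paths in paths_by_class.items():
        shuffled = list(paths)
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * fractions[0])
        n_val = int(len(shuffled) * fractions[1])
        splits["train"] += [(p, label) for p in shuffled[:n_train]]
        splits["val"] += [(p, label) for p in shuffled[n_train:n_train + n_val]]
        # Whatever remains after train and val goes to the test set.
        splits["test"] += [(p, label) for p in shuffled[n_train + n_val:]]
    return splits

# Hypothetical file names; only the class sizes come from the listing.
paths_by_class = {
    "fresh": [f"fresh_{i}.png" for i in range(456)],
    "infected": [f"infected_{i}.png" for i in range(752)],
}
splits = stratified_split(paths_by_class)
```

Because the augmented images are derived from a small set of raw images, a split by source image rather than by augmented file may be needed to avoid leakage between subsets; the sketch does not handle that.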
Additional file 1: of Proposal of supervised data analysis strategy of...
springernature.figshare.com
txt
Updated May 30, 2023
Cite
Elena Landoni; Rosalba Miceli; Maurizio Callari; Paola Tiberio; Valentina Appierto; Valentina Angeloni; Luigi Mariani; Maria Daidone (2023). Additional file 1: of Proposal of supervised data analysis strategy of plasma miRNAs from hybridisation array data with an application to assess hemolysis-related deregulation [Dataset]. http://doi.org/10.6084/m9.figshare.c.3595874_D5.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.c.3595874_D5.v1
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Elena Landoni; Rosalba Miceli; Maurizio Callari; Paola Tiberio; Valentina Appierto; Valentina Angeloni; Luigi Mariani; Maria Daidone
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
R code for implementing the described analyses (sample processing, data pre-processing, class comparison and class prediction). Caliper matching was implemented using the nonrandom package; the t-test and the AD test were implemented using the stats package and the adk package, respectively. Note that the updated package for implementing the AD test is kSamples. For the bootstrap selection and the egg-shaped plot, we modified the doBS and the importance igraph functions, respectively, both included in the bootfs package. For the SVM model we used the e1071 package. (R 12 kb)
Data Pre-processing Sample Dataset
kaggle.com
zip
Updated Feb 2, 2018
Cite
Allan Kiplang'at Chepkoy (2018). Data Pre-processing Sample Dataset [Dataset]. https://www.kaggle.com/chepkoyallan/datapreprocessing
Available download formats: zip (286 bytes)
Dataset updated
Feb 2, 2018
Authors
Allan Kiplang'at Chepkoy
Description
Dataset

This dataset was created by Allan Kiplang'at Chepkoy

Contents
Educational Attainment in North Carolina Public Schools: Use of statistical...
data.mendeley.com
Updated Nov 14, 2018
Cite
Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
Unique identifier
https://doi.org/10.17632/6cm9wyd5g5.1
Dataset updated
Nov 14, 2018
Authors
Scott Herford
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before any work is done on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a metric of 'distance'. At high dimensions, Euclidean distance loses pretty much all meaning. Therefore, reducing dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

From the perspective of creating new features: clustering analysis creates labels based on the patterns of the data, so it brings uncertainties into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we cluster on is well suited to clustering, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

We did not fix the clustering outputs with a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing.

Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and also to revise the models from time to time as things change.
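
The run-to-run stability check described above (clustering without a fixed random_state and comparing the results) can be sketched as follows. This is not the project's code: it uses a plain one-dimensional k-means written for the example, where a seed plays the role of random_state, and it compares two runs with the Rand index. The data values and function names are hypothetical.

```python
import random

def kmeans_1d(values, k, seed, iterations=20):
    """Plain Lloyd's algorithm on 1-D data; the seed fixes the initial centers."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest center for every value.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def agreement(labels_a, labels_b):
    """Rand index: fraction of point pairs grouped consistently by both runs."""
    n = len(labels_a)
    same = 0
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            together_a = labels_a[i] == labels_a[j]
            together_b = labels_b[i] == labels_b[j]
            same += together_a == together_b
            total += 1
    return same / total

# Hypothetical data, for illustration only.
values = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
run_a = kmeans_1d(values, k=2, seed=1)
run_b = kmeans_1d(values, k=2, seed=2)
stability = agreement(run_a, run_b)
```

The Rand index ignores how clusters are numbered, so two runs that find the same groups under different label numbers still score 1.0. Scores that drop well below 1.0 across seeds point to the kind of instability the description reports.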
US Deep Learning Market Analysis, Size, and Forecast 2025-2029
technavio.com
pdf
Updated Jul 8, 2025
Cite
Technavio (2025). US Deep Learning Market Analysis, Size, and Forecast 2025-2029 [Dataset]. https://www.technavio.com/report/us-deep-learning-market-industry-analysis
Available download formats: pdf
Dataset updated
Jul 8, 2025
Dataset provided by
TechNavio
Authors
Technavio
License
https://www.technavio.com/content/privacy-notice
Time period covered
2025 - 2029
Description
US Deep Learning Market Size 2025-2029

The deep learning market size in the US is forecast to increase by USD 5.02 billion at a CAGR of 30.1% between 2024 and 2029.
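
As a back-of-the-envelope reading of the forecast above: if a market grows at a constant rate r for n years, the absolute increase equals base × ((1 + r)^n − 1). The sketch below inverts that relation to get the base-year size the two quoted figures would imply. The listing does not state a base-year size, and treating 2024 to 2029 as five compounding years is an assumption, so the result is only an implied value, not a reported one.

```python
def implied_base(increase, cagr, years):
    """Base-year size implied by an absolute increase at a constant growth rate."""
    growth_factor = (1 + cagr) ** years
    return increase / (growth_factor - 1)

# Figures from the listing: USD 5.02 billion increase, 30.1% CAGR.
# Assumption: five compounding years (2024 to 2029).
base_2024 = implied_base(5.02, 0.301, 5)
size_2029 = base_2024 + 5.02
```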

The deep learning market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) in various industries for advanced solutioning. This trend is fueled by the availability of vast amounts of data, which is a key requirement for deep learning algorithms to function effectively. Industry-specific solutions are gaining traction, as businesses seek to leverage deep learning for specific use cases such as image and speech recognition, fraud detection, and predictive maintenance. Alongside, intuitive data visualization tools are simplifying complex neural network outputs, helping stakeholders understand and validate insights. However, challenges remain, including the need for powerful computing resources, data privacy concerns, and the high cost of implementing and maintaining deep learning systems. Despite these hurdles, the market's potential for innovation and disruption is immense, making it an exciting space for businesses to explore further. Semi-supervised learning, data labeling, and data cleaning facilitate efficient training of deep learning models. Cloud analytics is another significant trend, as companies seek to leverage cloud computing for cost savings and scalability.

What will be the Size of the market During the Forecast Period?

Deep learning, a subset of machine learning, continues to shape industries by enabling advanced applications such as image and speech recognition, text generation, and pattern recognition. Reinforcement learning, a type of deep learning, gains traction, with deep reinforcement learning leading the charge. Anomaly detection, a crucial application of unsupervised learning, safeguards systems against security vulnerabilities. Ethical implications and fairness considerations are increasingly important in deep learning, with emphasis on explainable AI and model interpretability. Graph neural networks and attention mechanisms enhance data preprocessing for sequential data modeling and object detection. Time series forecasting and dataset creation further expand deep learning's reach, while privacy preservation and bias mitigation ensure responsible use.

In summary, deep learning's market dynamics reflect a constant pursuit of innovation, efficiency, and ethical considerations. The Deep Learning Market in the US is flourishing as organizations embrace intelligent systems powered by supervised learning and emerging self-supervised learning techniques. These methods refine predictive capabilities and reduce reliance on labeled data, boosting scalability. BFSI firms utilize AI image recognition for various applications, including personalizing customer communication, maintaining a competitive edge, and automating repetitive tasks to boost productivity. Sophisticated feature extraction algorithms now enable models to isolate patterns with high precision, particularly in applications such as image classification for healthcare, security, and retail.

How is this market segmented and which is the largest segment?

The market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

Application: Image recognition; Voice recognition; Video surveillance and diagnostics; Data mining

Type: Software; Services; Hardware

End-user: Security; Automotive; Healthcare; Retail and commerce; Others

Geography: North America (US)

By Application Insights

The Image recognition segment is estimated to witness significant growth during the forecast period. In the realm of artificial intelligence (AI) and machine learning, image recognition, a subset of computer vision, is gaining significant traction. This technology utilizes neural networks, deep learning models, and various machine learning algorithms to decipher visual data from images and videos. Image recognition is instrumental in numerous applications, including visual search, product recommendations, and inventory management. Consumers can take photographs of products to discover similar items, enhancing the online shopping experience. In the automotive sector, image recognition is indispensable for advanced driver assistance systems (ADAS) and autonomous vehicles, enabling the identification of pedestrians, other vehicles, road signs, and lane markings.

Furthermore, image recognition plays a pivotal role in augmented reality (AR) and virtual reality (VR) applications, where it tracks physical objects and overlays digital content onto real-world scenarios. The model training process involves the backpropagation algorithm, which calculates the loss fu
Sample Geodata and Software for Demonstrating Geospatial Preprocessing for...
envidat.ch
gimi9.com
+1 more
ipynb, not available +3
Cite
Leo Gallus Bont; Marielle Fraefel; Ionuț Iosifescu Enescu, Sample Geodata and Software for Demonstrating Geospatial Preprocessing for Forest Accessibility and Wood Harvesting at FOSS4G2019 [Dataset]. http://doi.org/10.16904/envidat.75
Available download formats: tiff, ipynb, png, not available, zip
Unique identifier
https://doi.org/10.16904/envidat.75
Dataset provided by
Swiss Federal Institute for Forest, Snow and Landscape Research WSL
Authors
Leo Gallus Bont; Marielle Fraefel; Ionuț Iosifescu Enescu
License
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Time period covered
May 18, 2019 - May 22, 2019
Area covered
Germany, Upper Bavaria (Kochel Forest Range / Bayerische Staatsforsten AöR Revier Kochel)
Dataset funded by
WSL
Description
This dataset contains open vector data for railways, forests and power lines, as well as an open digital elevation model (DEM), for a small area around a sample forest range in Europe (Germany, Upper Bavaria, Kochel Forest Range, some 70 km south of München, at the edge of the Bavarian Alps). The purpose of this dataset is to provide a documented sample dataset in order to demonstrate geospatial preprocessing at FOSS4G2019 based on open data and software. This sample has been produced from several existing open data sources (detailed below), thereby documenting the sources for obtaining some of the data needed for computations related to forest accessibility and wood harvesting. For example, the data can be used with the open methodology and QGIS plugin Seilaplan for optimising the geometric layout of cable roads, or with additional open software for computing the forest accessibility for wood harvesting.

The vector data (railways, forests and power lines) was extracted from OpenStreetMap (data copyrighted OpenStreetMap contributors and available from https://www.openstreetmap.org). The railways and forests were downloaded and extracted on 18.05.2019 using the open-source QGIS (https://www.qgis.org) with the QuickOSM plugin, while the power lines were downloaded a couple of days later on 23.05.2019.

Additional notes for vector data: Please note that OpenStreetMap data extracts such as forests, roads and railways (except power lines) can also be downloaded in a GIS-friendly format (Shapefile) from http://download.geofabrik.de/ or using the QGIS built-in download function for OpenStreetMap data. The most efficient way to retrieve specific OSM tags (such as power=line) is to use the QuickOSM plugin for QGIS (using the Overpass API - https://wiki.openstreetmap.org/wiki/Overpass_API) or to use overpass turbo directly (https://overpass-turbo.eu/).

Finally, the digitised perimeter of the sample forest range is also made available for reproducibility purposes, although any perimeter or area can be digitised freely using the QGIS editing toolbar.

The DEM was originally adapted and modified, also with QGIS (https://www.qgis.org), based on the elevation data available from two different sources, by reprojecting and downsampling the datasets to 25 m and then selecting, for each individual raster cell, the elevation value that was closer to the average. These two elevation sources are:

- Copernicus Land Monitoring Service - EU-DEM v.1.1 (TILE ID E40N20, downloaded from https://land.copernicus.eu/imagery-in-situ/eu-dem/eu-dem-v1.1; this original DEM was produced by the Copernicus Land Monitoring Service “with funding by the European Union” based on SRTM and ASTER GDEM)

- Digitales Geländemodell 50 m Gitterweite (https://opendata.bayern.de/detailansicht/datensatz/digitales-gelaendemodell-50-m-gitterweite/), produced by the Bayerische Vermessungsverwaltung (www.geodaten.bayern.de) and downloaded from http://www.geodaten.bayern.de/opendata/DGM50/dgm50_epsg4258.tif

This methodology was chosen as a way of performing a basic quality check, by comparing the EU-DEM v.1.1 derived from globally available DEM data (such as SRTM) with more authoritative data for the randomly selected region, since using authoritative data is preferred (if open and available). For other sample regions, where authoritative open data is not available, such comparisons can no longer be performed.

Additional notes on the DEM: a very good DEM open data source for Germany is the open data set collected and resampled by Sonny (sonnyy7@gmail.com) and made available on the Austrian Open Data Portal http://data.opendataportal.at/dataset/dtm-germany.

In order to simplify end-to-end reproducibility of the paper planned for FOSS4G2019, we use and distribute an adapted (reprojected and resampled to 25 meters) sample of the above-mentioned dataset for the selected forest range. This sample dataset is accompanied by software in Python, as a Jupyter Notebook that generates harmonized output rasters with the same extent from the input data. The extent is given by the polygon vector dataset (Perimeter). These output rasters, such as obstacles, aspect, slope and forest cover, can serve as input data for later computations related to forest accessibility and wood harvesting questions. The obstacles output is obtained by transforming line vector datasets (railway lines, high voltage power lines) to raster. Aspect and slope are both derived from the sample digital elevation model.
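
To illustrate how a slope raster can be derived from a DEM, here is a minimal sketch using central differences on a regular grid. It is not the notebook distributed with this dataset, which may rely on GIS tooling and a different slope algorithm; the grid values and the function name are hypothetical, and only the 25 m cell size comes from the description.

```python
import math

def slope_degrees(dem, cell_size):
    """Slope per interior cell from central differences; border cells are None."""
    rows, cols = len(dem), len(dem[0])
    slope = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Elevation change per metre along the column (x) and row (y) axes.
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cell_size)
            dz_dy = (dem[r + 1][c] - dem[r - 1][c]) / (2 * cell_size)
            slope[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    return slope

# Hypothetical 3x3 elevation grid in metres; 25 m cells as in the description.
dem = [
    [600.0, 625.0, 650.0],
    [600.0, 625.0, 650.0],
    [600.0, 625.0, 650.0],
]
slope = slope_degrees(dem, cell_size=25.0)
```

Border cells are left undefined here because a central difference needs a neighbour on both sides; raster tools usually handle edges with their own conventions.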
Data_Sheet_1_On the Automation of Flood Event Separation From Continuous...
frontiersin.figshare.com
zip
Updated Jun 1, 2023
Cite
Henning Oppel; Benjamin Mewes (2023). Data_Sheet_1_On the Automation of Flood Event Separation From Continuous Time Series.zip [Dataset]. http://doi.org/10.3389/frwa.2020.00018.s001
Available download formats: zip
Unique identifier
https://doi.org/10.3389/frwa.2020.00018.s001
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Henning Oppel; Benjamin Mewes
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Can machine learning effectively lower the effort necessary to extract important information from raw data for hydrological research questions? Using the example of a typical water-management task, the extraction of direct runoff flood events from continuous hydrographs, we demonstrate how machine learning can be used to automate the application of expert knowledge to big data sets and extract the relevant information. In particular, we tested seven different algorithms to detect event beginning and end solely from a given excerpt of the continuous hydrograph. First, the number of required data points within the excerpts as well as the amount of training data was determined. In a local application, we were able to show that all applied machine learning algorithms were capable of reproducing manually defined event boundaries. Automatically delineated events showed a relative error of 20% in duration and 5% in event volume. Moreover, we could show that hydrograph separation patterns could easily be learned by the algorithms and are regionally and trans-regionally transferable without significant performance loss. Hence, the training data sets can be very small and trained algorithms can be applied to new catchments lacking training data. The results showed the great potential of machine learning to extract relevant information efficiently and, hence, lower the effort for data preprocessing for water management studies. Moreover, the transferability of trained algorithms to other catchments is a clear advantage over common methods.
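
One way to compare an automatically delineated event with a manually defined one is sketched below. The description does not define its error measures, so the formulas here (absolute difference divided by the manual value) are an assumption, as are the hydrograph values and event boundaries. For simplicity the sketch sums total flow inside the window and does not separate baseflow from direct runoff.

```python
def relative_errors(flow, manual, automatic):
    """Relative duration and volume error of an automatic event window.

    flow: discharge values at a fixed time step.
    manual, automatic: (start_index, end_index) pairs, end exclusive.
    """
    def duration(window):
        return window[1] - window[0]

    def volume(window):
        # Total flow inside the window; baseflow is not removed here.
        return sum(flow[window[0]:window[1]])

    duration_error = abs(duration(automatic) - duration(manual)) / duration(manual)
    volume_error = abs(volume(automatic) - volume(manual)) / volume(manual)
    return duration_error, volume_error

# Hypothetical hydrograph and event boundaries, for illustration only.
flow = [1, 1, 2, 6, 9, 5, 3, 2, 1, 1]
duration_error, volume_error = relative_errors(flow, manual=(2, 8), automatic=(2, 7))
```

A boundary that is off by one time step at the tail of an event changes the duration more than the volume, because flow is low there. That is one plausible reason a duration error can exceed a volume error.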
FAIR Dataset for Disease Prediction in Healthcare Applications
test.researchdata.tuwien.ac.at
bin, csv, json, png
Updated Apr 14, 2025
Cite
Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
Available download formats: csv, json, bin, png
Unique identifier
https://doi.org/10.70124/5n77a-dnf02
Dataset updated
Apr 14, 2025
Dataset provided by
TU Wien
Authors
Sufyan Yousaf
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description

Context and Methodology

Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

Technical Details

Structure of the Dataset:
The dataset consists of several files organized into folders by data type:

Training Data: Contains the training dataset used to train the machine learning model.

Validation Data: Used for hyperparameter tuning and model selection.

Test Data: Reserved for final model evaluation.

Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
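
Reading one of the split files described above could look like the following sketch. The file names come from the description; everything else is an assumption for illustration, including the helper name, the idea that the file sits directly under the given folder, and the presence of a header row naming the columns.

```python
import csv
from pathlib import Path

# File names as given in the dataset description.
SPLIT_FILES = {
    "train": "train_data.csv",
    "validation": "validation_data.csv",
    "test": "test_data.csv",
}

def load_split(root, split):
    """Read one split into a list of dicts keyed by column name.

    Assumes the CSV has a header row and sits directly under `root`.
    """
    path = Path(root) / SPLIT_FILES[split]
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))
```

All values come back as strings, so numeric feature columns still need converting before they are passed to a model.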

Software Requirements:
To open and work with this dataset, you can use an environment such as VS Code or Jupyter, together with tools like:

Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

Further Details

Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
Data from: A Deep Learning and XGBoost-based Method for Predicting...
narcis.nl
data.mendeley.com
Updated Aug 3, 2021
Cite
wang, P (via Mendeley Data) (2021). A Deep Learning and XGBoost-based Method for Predicting Protein-protein Interaction Sites [Dataset]. http://doi.org/10.17632/9tft3vz5tm.2
Unique identifier
https://doi.org/10.17632/9tft3vz5tm.2
Dataset updated
Aug 3, 2021
Dataset provided by
Data Archiving and Networked Services (DANS)
Authors
wang, P (via Mendeley Data)
Description
local_feature_training_set.csv: Preprocessed data from the feature extractor, containing 65869 rows and 344 columns. Each row is one sample; the first 343 columns are features and the last column is the label.

local_feature_testing_set.csv: Preprocessed data from the feature extractor, containing 11791 rows and 344 columns. Each row is one sample; the first 343 columns are features and the last column is the label.

global&local_feature_training_set.csv: Preprocessed data from the feature extractor, containing 65869 rows and 1028 columns. Each row is one sample; the first 1027 columns are features and the last column is the label.

global&local_feature_testing_set.csv: Preprocessed data from the feature extractor, containing 11791 rows and 1028 columns. Each row is one sample; the first 1027 columns are features and the last column is the label.
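
Given the layout described above (features first, label in the last column), separating the two could be sketched as below. The example assumes comma-separated files without a header row, which the description does not confirm, and it uses a tiny made-up stand-in with 3 feature columns instead of 343.

```python
import csv
import io

def split_features_label(rows):
    """Treat the last column as the label and everything before it as features."""
    features = [[float(x) for x in row[:-1]] for row in rows]
    labels = [row[-1] for row in rows]
    return features, labels

# Tiny stand-in for one of the CSV files; the values are made up.
sample = io.StringIO("0.1,0.2,0.3,1\n0.4,0.5,0.6,0\n")
features, labels = split_features_label(list(csv.reader(sample)))
```

For the real files, replacing the `io.StringIO` object with an opened file handle gives the same result, provided the assumptions about delimiter and header hold.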
The original data for Strategies of improved data preprocessing enhance the...
figshare.com
zip
Updated Oct 25, 2023
Cite
Qu, Li (2023). The original data for Strategies of improved data preprocessing enhance the power of noninvasive prenatal screening in detecting microdeletion syndromes [Dataset]. http://doi.org/10.6084/m9.figshare.24433609.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.24433609.v1
Dataset updated
Oct 25, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Qu, Li
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Noninvasive prenatal screening (NIPS) has become widely adopted for assessment of common trisomies, but technical variability hinders reliable identification of rarer microdeletion syndromes. Here, we implement advanced mappability correction and principal component analysis prior to NIPS, which minimizes noise and batch effects in the data. Consequently, our approach, termed IDP, achieves substantially higher sensitivity and specificity for five examined microdeletions versus standard preprocessing controls. Using 2,865 clinical samples and 100 simulated datasets, we comprehensively validate IDP's capabilities. We also propose two new quality metrics, abnormal rate and disperse index, to monitor fluctuations throughout processing. Together, this study provides a robust framework to expand NIPS utility for a broader range of prenatal genetic conditions.
Synthetic Integrated Services Data
data.wprdc.org
csv, html, pdf, zip
Updated Jun 25, 2024
Cite
Allegheny County (2024). Synthetic Integrated Services Data [Dataset]. https://data.wprdc.org/dataset/synthetic-integrated-services-data
Available download formats: html, csv (1375554033), pdf, zip (39231637)
Dataset updated
Jun 25, 2024
Dataset authored and provided by
Allegheny County
Description
Motivation

This dataset was created to pilot techniques for creating synthetic data from datasets containing sensitive and protected information in the local government context. Synthetic data generation replaces actual data with representative data generated from statistical models; this preserves the key data properties that allow insights to be drawn from the data while protecting the privacy of the people included in the data. We invite you to read the Understanding Synthetic Data white paper for a concise introduction to synthetic data.

This effort was a collaboration of the Urban Institute, Allegheny County’s Department of Human Services (DHS) and CountyStat, and the University of Pittsburgh’s Western Pennsylvania Regional Data Center.

Collection

The source data for this project consisted of 1) month-by-month records of services included in Allegheny County's data warehouse and 2) demographic data about the individuals who received the services. As the County’s data warehouse combines this service and client data, this data is referred to as “Integrated Services data”. Read more about the data warehouse and the kinds of services it includes here.

Preprocessing

Synthetic data are typically generated from probability distributions or models identified as being representative of the confidential data. For this dataset, a model of the Integrated Services data was used to generate multiple versions of the synthetic dataset. These different candidate datasets were evaluated to select for publication the dataset version that best balances utility and privacy. For high-level information about this evaluation, see the Synthetic Data User Guide.
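
The general idea of generating records from fitted probability distributions can be illustrated with a deliberately simple sketch. This is not the model used for this dataset, which is described in the project's technical documentation: the sketch only fits per-column frequency tables and samples each column independently, and the column names and records are hypothetical.

```python
import random
from collections import Counter

def fit_marginals(records):
    """Count how often each value occurs in each column."""
    columns = records[0].keys()
    return {col: Counter(rec[col] for rec in records) for col in columns}

def sample_synthetic(marginals, n, seed=0):
    """Draw each column independently from its observed frequencies."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for col, counts in marginals.items():
            values = list(counts.keys())
            weights = list(counts.values())
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        rows.append(row)
    return rows

# Hypothetical confidential records, for illustration only.
records = [
    {"age_group": "18-34", "service": "housing"},
    {"age_group": "35-54", "service": "housing"},
    {"age_group": "18-34", "service": "food"},
]
synthetic = sample_synthetic(fit_marginals(records), n=5)
```

Sampling columns independently discards relationships between them, and it does not by itself guarantee privacy. Models that capture joint structure, followed by an evaluation of utility and disclosure risk of the kind mentioned above, address those gaps.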

For more information about the creation of the synthetic version of this data, see the technical brief for this project, which discusses the technical decision making and modeling process in more detail.

Recommended Uses

This disaggregated synthetic data allows for many analyses that are not possible with aggregate data (summary statistics). Broadly, this synthetic version of this data could be analyzed to better understand the usage of human services by people in Allegheny County, including the interplay in the usage of multiple services and demographic information about clients.

Known Limitations/Biases

Some amount of deviation from the original data is inherent to the synthetic data generation process. Specific examples of limitations (including undercounts and overcounts for the usage of different services) are given in the Synthetic Data User Guide and the technical report describing this dataset's creation.

Feedback

Please reach out to this dataset's data steward (listed below) to let us know how you are using this data and if you found it to be helpful. Please also provide any feedback on how to make this dataset more applicable to your work, any suggestions of future synthetic datasets, or any additional information that would make this more useful. Also, please copy wprdc@pitt.edu on any such feedback (as the WPRDC always loves to hear about how people use the data that they publish and how the data could be improved).

Further Documentation and Resources

1) A high-level overview of synthetic data generation as a method for protecting privacy can be found in the Understanding Synthetic Data white paper.
2) The Synthetic Data User Guide provides high-level information to help users understand the motivation, evaluation process, and limitations of the synthetic version of Allegheny County DHS's Human Services data published here.
3) Generating a Fully Synthetic Human Services Dataset: A Technical Report on Synthesis and Evaluation Methodologies describes the full technical methodology used for generating the synthetic data, evaluating the various options, and selecting the final candidate for publication.
4) The WPRDC also hosts the Allegheny County Human Services Community Profiles dataset, which provides annual updates on human-services usage, aggregated by neighborhood/municipality. That data can be explored using the County's Human Services Community Profile web site.
