Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
To achieve high quality omics results, systematic variability in mass spectrometry (MS) data must be adequately addressed. Effective data normalization is essential for minimizing this variability. The abundance of approaches and the data-dependent nature of normalization have led some researchers to develop open-source academic software for choosing the best approach. While these tools are certainly beneficial to the community, none of them meet all of the needs of all users, particularly users who want to test new strategies that are not available in these products. Herein, we present a simple and straightforward workflow that facilitates the identification of optimal normalization strategies using straightforward evaluation metrics, employing both supervised and unsupervised machine learning. The workflow offers a “DIY” aspect, where the performance of any normalization strategy can be evaluated for any type of MS data. As a demonstration of its utility, we apply this workflow on two distinct datasets, an ESI-MS dataset of extracted lipids from latent fingerprints and a cancer spheroid dataset of metabolites ionized by MALDI-MSI, for which we identified the best-performing normalization strategies.
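The "DIY" evaluation loop described above can be illustrated with a minimal sketch, assuming a samples-by-features intensity matrix X and sample class labels y; the normalization strategies and scoring choices shown here are generic examples, not the authors' exact workflow.

```python
# Minimal sketch of a DIY normalization-evaluation loop (illustrative, not the published workflow).
# X: samples-by-features intensity matrix; y: sample class labels.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score

def tic_normalize(X):
    return X / X.sum(axis=1, keepdims=True)            # total ion current normalization

def median_normalize(X):
    return X / np.median(X, axis=1, keepdims=True)     # per-sample median normalization

strategies = {"none": lambda X: X,
              "TIC": tic_normalize,
              "median": median_normalize,
              "log2": lambda X: np.log2(X + 1.0)}

def evaluate(X, y):
    for name, norm in strategies.items():
        Xn = norm(X.astype(float))
        scores = PCA(n_components=2).fit_transform(Xn)
        unsup = silhouette_score(scores, y)             # unsupervised metric: class separation in PCA space
        sup = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                              Xn, y, cv=5).mean()       # supervised metric: cross-validated accuracy
        print(f"{name:>6}: silhouette={unsup:.3f}, CV accuracy={sup:.3f}")
```

Any candidate normalization can be dropped into the strategies dictionary and scored with the same metrics, which is the essence of the DIY aspect.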
https://www.marketreportanalytics.com/privacy-policy
The global Normalizing Service market is experiencing robust growth, driven by factors such as improved data quality, enhanced data analysis capabilities, rising adoption of cloud-based solutions, and stringent data governance regulations. The market is segmented by application (e.g., healthcare, finance, manufacturing) and by type of Normalizing Service (e.g., data cleansing, data transformation, data integration). While precise market sizing data is unavailable, based on industry trends and comparable markets with similar growth trajectories, a reasonable estimate for the 2025 market size is in the range of $500-750 million USD, with a compound annual growth rate (CAGR) of approximately 15-20% projected from 2025 to 2033. This growth is expected to be fueled by the continued expansion of big data analytics and the rising need for data standardization across diverse industries. However, challenges such as data security concerns, integration complexities, and high initial investment costs can act as restraints on market expansion. Regional analysis suggests a strong presence across North America and Europe, driven by early adoption and robust technological infrastructure, while Asia-Pacific is poised for significant growth in the coming years due to increasing digitalization and expanding data centers. The market is highly competitive, with a mix of established players and emerging technology companies vying for market share. Successful players will need to differentiate their offerings through specialized solutions, strategic partnerships, and a focus on addressing specific industry needs. Future growth will depend on advancements in AI and machine learning technologies, further integration with cloud platforms, and the development of user-friendly, scalable solutions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This research study aims to understand the application of Artificial Neural Networks (ANNs) to forecast the Self-Compacting Recycled Coarse Aggregate Concrete (SCRCAC) compressive strength. From different literature, 602 available data sets from SCRCAC mix designs are collected, and the data are rearranged, reconstructed, trained and tested for the ANN model development. The models were established using seven input variables: the mass of cementitious content, water, natural coarse aggregate content, natural fine aggregate content, recycled coarse aggregate content, chemical admixture and mineral admixture used in the SCRCAC mix designs. Two normalization techniques are used for data normalization to visualize the data distribution. For each normalization technique, three transfer functions are used for modelling. In total, six different types of models were run in MATLAB and used to estimate the 28th day SCRCAC compressive strength. Normalization technique 2 performs better than 1 and TANSING is the best transfer function. The best k-fold cross-validation fold is k = 7. The coefficient of determination for predicted and actual compressive strength is 0.78 for training and 0.86 for testing. The impact of the number of neurons and layers on the model was performed. Inputs from standards are used to forecast the 28th day compressive strength. Apart from ANN, Machine Learning (ML) techniques like random forest, extra trees, extreme boosting and light gradient boosting techniques are adopted to predict the 28th day compressive strength of SCRCAC. Compared to ML, ANN prediction shows better results in terms of sensitive analysis. The study also extended to determine 28th day compressive strength from experimental work and compared it with 28th day compressive strength from ANN best model. Standard and ANN mix designs have similar fresh and hardened properties. The average compressive strength from ANN model and experimental results are 39.067 and 38.36 MPa, respectively with correlation coefficient is 1. It appears that ANN can validly predict the compressive strength of concrete.
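The original models were built in MATLAB; the following is only an illustrative scikit-learn analogue of the setup described above (seven normalized mix-design inputs, a tanh transfer function comparable to MATLAB's tansig, and 7-fold cross-validation), with layer sizes assumed.

```python
# Illustrative analogue of the ANN setup; not the study's MATLAB code.
# X holds the seven mix-design inputs; y is the 28-day compressive strength in MPa.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

model = make_pipeline(
    MinMaxScaler(),                                        # normalize inputs to [0, 1]
    MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",  # tanh ~ MATLAB's tansig transfer function
                 max_iter=5000, random_state=0))

def evaluate(X, y, k=7):                                   # k = 7 was the best-performing fold count in the study
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    r2 = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print(f"mean R^2 over {k} folds: {r2.mean():.3f}")
```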
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data accompanies the following publication: Title: Data and systems for medication-related text classification and concept normalization from Twitter: Insights from the Social Media Mining for Health (SMM4H) 2017 shared task Journal: Journal of the American Medical Informatics Association (JAMIA) The evaluation data (in addition to the training data) was used for the SMM4H-2017 shared tasks, co-located with AMIA-2017 (Washington DC).
This dataset comprises an array of Mel Frequency Cepstral Coefficients (MFCCs) that have undergone feature scaling, representing a variety of human actions. Feature scaling, or data normalization, is a preprocessing technique used to standardize the range of features in the dataset. For MFCCs, this process helps ensure all coefficients contribute equally to the learning process, preventing features with larger scales from overshadowing those with smaller scales.
In this dataset, the audio signals correspond to diverse human actions such as walking, running, jumping, and dancing. The MFCCs are calculated via a series of signal processing stages, which capture key characteristics of the audio signal in a manner that closely aligns with human auditory perception. The coefficients are then standardized or scaled using methods such as MinMax Scaling or Standardization, thereby normalizing their range. Each normalized MFCC vector corresponds to a segment of the audio signal.
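A minimal sketch of how such scaled MFCC vectors could be produced is shown below; it assumes librosa for feature extraction and scikit-learn for scaling, and is not the dataset's exact preprocessing code.

```python
# Assumed pipeline: compute MFCCs for an audio segment, then scale the coefficients.
import librosa
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def scaled_mfccs(path, n_mfcc=13, scaler="minmax"):
    y, sr = librosa.load(path, sr=None)                          # load the audio segment
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T     # frames x coefficients
    s = MinMaxScaler() if scaler == "minmax" else StandardScaler()
    return s.fit_transform(mfcc)                                 # each coefficient scaled to a common range
```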
The dataset is meticulously designed for tasks including human action recognition, classification, segmentation, and detection based on auditory cues. It serves as an essential resource for training and evaluating machine learning models focused on interpreting human actions from audio signals. This dataset proves particularly beneficial for researchers and practitioners in fields such as signal processing, computer vision, and machine learning, who aim to craft algorithms for human action analysis leveraging audio signals.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset comprises 302 JPEG images captured with an endoscopic camera, focusing on detecting porosities on the inner walls of machined holes in cast aluminum parts. Each image has a resolution of 400x400 pixels in RGB color space, providing detailed views of potential defects. The dataset is intended for developing and evaluating algorithms for automated defect detection in industrial manufacturing, specifically targeting porosity defects in aluminum casting processes. It does not include annotations or labels. Researchers can use these images to train and test machine learning models for defect detection, explore the characteristics and distributions of porosity defects in machined holes, and develop algorithms for automated quality control in manufacturing settings. Preprocessing such as normalization and resizing may be necessary before applying the images to machine learning tasks.
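A short sketch of the kind of preprocessing mentioned above (resizing and pixel normalization) is given here; the target size and file handling are assumptions, not part of the dataset.

```python
# Assumed preprocessing for the 400x400 RGB endoscopic images before model training.
import numpy as np
from PIL import Image

def load_image(path, size=(224, 224)):
    img = Image.open(path).convert("RGB").resize(size)   # resize to the model's expected input size
    arr = np.asarray(img, dtype=np.float32) / 255.0      # scale pixel values to [0, 1]
    return arr                                           # shape: (height, width, 3)
```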
The provided code files are utilized to construct a convolutional neural network (CNN)-based state of health (SOH) estimator using data from Samsung 30T cylindrical 21700 cells. These files encompass essential functions: 1) Preprocessing of original data, including normalization and data splitting, 2) Training the CNN-based SOH estimator, and 3) Evaluating performance and generating result plots for the CNN-based SOH estimator. The comprehensive functionality of these files, as well as detailed discussion of results, are extensively covered in the IEEE Xplore publication titled "A Convolutional Neural Network for Estimation of Lithium-Ion Battery State-of-Health during Constant Current Operation," and supplemented by the accompanying user guide "CNN based SOH estimation code - Users Guide.pdf". The battery aging data used is also open source: “Fifteen minute fast charge aging dataset - Samsung 30T cells”, Borealis Data, 2023. https://doi.org/10.5683/SP3/UYPYDJ
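The published code and user guide referenced above contain the actual implementation; the snippet below is only a generic sketch of the first step they describe (per-feature normalization and data splitting), with variable names assumed.

```python
# Generic sketch of preprocessing for SOH estimation: min-max normalization and a train/test split.
# Not the published code; see the referenced files and user guide for the real implementation.
import numpy as np
from sklearn.model_selection import train_test_split

def preprocess(X, y, test_size=0.2, seed=0):
    X = X.astype(np.float32)
    X_min, X_max = X.min(axis=0), X.max(axis=0)
    X_norm = (X - X_min) / (X_max - X_min + 1e-12)       # scale each feature to [0, 1]
    return train_test_split(X_norm, y, test_size=test_size, random_state=seed)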
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our study focuses on the south-eastern region of Australia. In recent years, this region has been experiencing an increasing frequency of wildfires; however, the 2019-2020 bushfire season, widely known as the ‘Black Summer’, was unprecedented in intensity and devastation. This study combined various biophysical factors, including the MODIS MCD64A1 fire product, a digital elevation model (DEM), slope, aspect, ECOSTRESS data (i.e., evapotranspiration – ET, evaporative stress index – ESI, land surface temperature – LST, water use efficiency – WUE), NDVI generated from Sentinel-2 data, and rainfall data, for wildfire prediction. To this aim, we designed models that incorporate pre-fire vegetation conditions obtained from ECOSTRESS data to predict the probability of future wildfire occurrence. The predictive power of the models and biophysical factors was assessed to understand pre-fire vegetation conditions and wildfire susceptibility.
We used nine variables from four sources as explanatory variables (Table 1). Fire occurrences between September 2019 and March 2020 were obtained from the MODIS MCD64A1 product as a shapefile and mapped. A dataset was created to record the presence and absence of fires, classified as 0 and 1, respectively. Rainfall data were obtained from the Bureau of Meteorology, Australia, for all seven months, and were then compiled and interpolated using the Inverse Distance Weighting (IDW) method, with the IDW tool from the ArcGIS Spatial Analyst extension. DEM derivatives, slope and aspect, were created using the Slope and Aspect tools in ArcGIS Pro. Sentinel-2 L2A (16-bit) data were downloaded from the Sentinel Hub EO Browser at a resolution of 10 m, and NDVI was mapped using bands 4 and 8. All variable raster images were clipped to the study area. ECOSTRESS data products, including evapotranspiration (ET), evaporative stress index (ESI), land surface temperature (LST), and water use efficiency (WUE), acquired from NASA LPDAAC AppEARS, were used to model wildfire dynamics (Fisher et al., 2020; Zhu et al., 2022). A mosaic dataset in raster format was created for each variable over the seven months between September 2019 and March 2020.
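The NDVI calculation from Sentinel-2 bands 4 (red) and 8 (near-infrared) follows the standard formula; the sketch below assumes the two bands have already been read into NumPy arrays and omits the raster I/O.

```python
# Standard NDVI computation from Sentinel-2 band 4 (red) and band 8 (NIR) arrays.
import numpy as np

def ndvi(red, nir):
    red = red.astype(np.float32)
    nir = nir.astype(np.float32)
    return (nir - red) / (nir + red + 1e-12)   # NDVI in [-1, 1]; small epsilon avoids division by zero
```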
Table 1. Explanatory variables used in this research and their data sources
Category | Explanatory variable | Source
ECOSTRESS | Evapotranspiration (ET) | 70 m resolution ECOSTRESS data from LPDAAC AppEARS https://lpdaacsvc.cr.usgs.gov/appeears/
ECOSTRESS | Evaporative stress index (ESI) | 70 m resolution ECOSTRESS data from LPDAAC AppEARS https://lpdaacsvc.cr.usgs.gov/appeears/
ECOSTRESS | Land surface temperature (LST) | 70 m resolution ECOSTRESS data from LPDAAC AppEARS https://lpdaacsvc.cr.usgs.gov/appeears/
ECOSTRESS | Water use efficiency (WUE) | 70 m resolution ECOSTRESS data from LPDAAC AppEARS https://lpdaacsvc.cr.usgs.gov/appeears/
Vegetation Index | Normalized Difference Vegetation Index (NDVI) | Sentinel-2 data (10 m resolution, bands 4 and 8) https://scihub.copernicus.eu/dhus/#/home
Climate | Rainfall | Bureau of Meteorology, Australia http://www.bom.gov.au/climate/data/
Topography | Elevation | 9 arc-second DEM (~250 m resolution) from Geoscience Australia (Hutchinson et al., 2008)
Topography | Slope | Derived from DEM
Topography | Aspect | Derived from DEM
Two categories of models were developed in this study: general models and monthly models. The general models were specifically constructed to estimate wildfire susceptibility and quantify the significance of input biophysical factors over the entire wildfire period, spanning from September 2019 to March 2020. These models utilized the mean values of explanatory variables throughout this period as independent input variables, with the samples collected from MODIS ground fire points during 2019-2020 serving as the dependent variable. The study integrated a range of explanatory variables, including ECOSTRESS data, vegetation indices, climatic parameters, and topographical factors, to quantitatively assess their respective impacts on the prediction of wildfire.
The monthly models were designed to capture pre-fire vegetation conditions and predict wildfire spread one week ahead. We set up a three-week data collection window prior to a wildfire event in the 4th week and predicted the probability of wildfire occurrence in the following week (the 5th week). The mean values of the selected data over the three weeks were computed to minimize or eliminate gaps. For example, the model predicting wildfire occurrence probability in the first week of September (September 1-7) was built using the mean values of the explanatory variables over the three-week window from August 1 to August 21. This design yields a model that can predict wildfire spread and assess the impact of pre-fire plant stress on subsequent wildfire occurrence. The Australian bushfires started to spread in the first week of September 2019 and faded in early April 2020. The fires ceased at the end of October 2019 in south-eastern Australia and reignited in late November 2019. To understand the impact of the change in the country's climate conditions after the first fire and to effectively assess the fire-influencing factors, we built three monthly models to predict (1) the first week of September (the week when the first wildfire started), (2) the last week of November, and (3) the first week of December (the weeks when the second fire started).
Machine learning is based on algorithms that have the capacity to learn from data and make effective predictions. This learning process involves modeling the hidden relationships between a set of input variables (explanatory variables) and the occurrences of the phenomenon (the dependent variable) (Tonini et al., 2020). We acquired 2,037 wildfire occurrence points; of these, 70% (1,426 points) were allocated for training, while the remaining 30% (611 points) were reserved for validation. Here, we evaluated linear regression (LR), geographically weighted regression (GWR), and random forest (RF) algorithms to create models that fit relationships between wildfire events and the explanatory variables. The fitted relationships from these models were then used for susceptibility mapping and assessment of variable influence. LR, in particular, demands the independence of explanatory variables. To mitigate the impact of correlation between these variables, we employed LASSO (L1) regularization, which penalizes the coefficients of correlated variables, prompting the LR model to favor a subset of independent variables and enhancing model robustness (Qian et al., 2012). Prior to applying LR and GWR, we normalized the explanatory variables to a common scale (between 0 and 1) based on their observed maximum and minimum values (Zhu et al., 2022). This normalization ensures equal contributions from all variables and facilitates straightforward comparison and interpretation of variable importance.
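The min-max normalization and LASSO-regularized linear regression steps described above can be sketched as follows; the regularization strength and variable names are assumptions, not values from the study.

```python
# Sketch of min-max scaling of the explanatory variables followed by LASSO (L1) regression.
# Illustrative only; alpha and the fitting details are not taken from the study.
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

model = make_pipeline(
    MinMaxScaler(),        # scale each explanatory variable to [0, 1] using its observed min and max
    Lasso(alpha=0.01))     # L1 penalty shrinks coefficients of correlated variables toward zero

# model.fit(X_train, y_train) on the 70% training split; model.predict(X) then gives
# the per-pixel susceptibility scores used for mapping.
```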
In addition, to evaluate the accuracy of the wildfire susceptibility modeling, pixels were categorized as either fire or non-fire based on a probability threshold of 0.5: pixels with probabilities greater than 0.5 were identified as fire pixels, while those at or below the threshold were treated as non-fire. A confusion matrix was used to evaluate the performance of the classification, assessing the accuracy, sensitivity, and specificity of the model’s outcomes (Parikh et al., 2008).
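The thresholding and confusion-matrix evaluation can be written compactly; the sketch below assumes arrays of predicted probabilities and observed fire labels.

```python
# Threshold susceptibility probabilities at 0.5 and derive accuracy, sensitivity, and specificity.
from sklearn.metrics import confusion_matrix

def evaluate(probabilities, labels, threshold=0.5):
    predicted = (probabilities > threshold).astype(int)          # 1 = fire, 0 = non-fire
    tn, fp, fn, tp = confusion_matrix(labels, predicted).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)                                 # true positive rate
    specificity = tn / (tn + fp)                                 # true negative rate
    return accuracy, sensitivity, specificity
```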
This dataset covers the south-eastern region of Australia during 2019-2020. It includes the input explanatory variables for the general and monthly models, wildfire susceptibility for each city, and fire locations.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Co-normalization of RNA profiles obtained using different experimental platforms and protocols opens an avenue for comprehensive comparison of relevant features such as differentially expressed genes associated with disease. Currently, most bioinformatic tools enable normalization in a flexible format that depends on the individual datasets under analysis; thus, the output data of such normalizations are poorly compatible with each other. Recently, we proposed a new approach to gene expression data normalization, termed Shambhala, which returns harmonized data in a uniform shape, where every expression profile is transformed into a pre-defined universal format. We previously showed that following shambhalization of human RNA profiles, overall tissue-specific clustering features are strongly retained while platform-specific clustering is dramatically reduced. Methods: Here, we tested Shambhala performance in retention of fold-change gene expression features and other functional characteristics of gene clusters, such as pathway activation levels and predicted cancer drug activity scores. Results: Using 6,793 cancer and 11,135 normal tissue gene expression profiles from the literature and experimental datasets, we applied twelve performance criteria to different versions of Shambhala and other methods of transcriptomic harmonization with flexible output data formats. These criteria dealt with biological type classifiers, hierarchical clustering, correlation/regression properties, stability of drug efficiency scores, and data quality for use with machine learning classifiers. Discussion: The Shambhala-2 harmonizer demonstrated the best results, with correlation and linear regression coefficients close to 1 for the comparison of training vs validation datasets, and more than two times lower instability in the calculation of drug efficiency scores compared to other methods.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset contains data collected during fatigue detection experiments in running using IMUs. Subjects underwent a fatiguing protocol consisting of three distinct consecutive runs on an athletic track:
1. The first run was a 4000 m run (10 laps) at a constant speed, set to 100% of the subject's average speed during their best performance in the previous year on a 5 to 10 km race;
2. The second run followed a fatiguing protocol: the speed started at the level of the first run and increased progressively by 0.2 km/h every 100 m. Perceived fatigue was assessed every 100 m by asking the runner for a Borg Rating of Perceived Exertion (RPE) score (range 6-20) [20]. The fatiguing protocol was terminated once the RPE exceeded 16 (between hard and very hard) or, if this requirement was not met, after 1200 m;
3. The third run was a 1200 m run (3 laps) at a constant speed equal to that of the first 4000 m run.
pXXX_XXX_0-2K: contains the Segment and Joint data exported from MVN for the first half of the first run
pXXX_XXX_2-4K: contains the Segment and Joint data exported from MVN for the second half of the first run
pXXX_XXX_postfatigue1200m: contains the Segment and Joint data exported from MVN for the third run
pXXX_strides: contains the segmented strides from each subject
TableFeats: contains values used for the machine learning pipeline, after normalization over each single subject
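The per-subject normalization mentioned for TableFeats could be performed as in the illustrative sketch below (z-scoring each feature within a subject); the actual preprocessing of the dataset may differ, and the column name "subject" is an assumption.

```python
# Illustrative per-subject normalization of stride features; not necessarily the dataset's exact method.
import pandas as pd

def normalize_per_subject(df, subject_col="subject"):
    feats = df.drop(columns=[subject_col])
    z = feats.groupby(df[subject_col]).transform(lambda x: (x - x.mean()) / x.std())  # within-subject z-score
    return pd.concat([df[subject_col], z], axis=1)
```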
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Breast cancer (BC) is the most common malignancy worldwide, and neoadjuvant therapy (NAT) plays an important role in the treatment of patients with early BC. However, only a subset of BC patients can achieve pathological complete response (pCR) and benefit from NAT. It is therefore necessary to predict the response to NAT. Although many models to predict the response to NAT based on gene expression determined by the microarray platform have been proposed, their applications in clinical practice are limited due to the data normalization methods used during model building and the disadvantages of the microarray platform compared with the RNA-seq platform. In this study, we first reconfirmed the correlation between immune profiles and pCR in an RNA-seq dataset. Then, we employed multiple machine learning algorithms and a model stacking strategy to build an immunological gene-based model (Ipredictor model) and an immunological gene and receptor status-based model (ICpredictor model) in the RNA-seq dataset. The areas under the receiver operating characteristic curves for the Ipredictor and ICpredictor models were 0.745 and 0.769 in an independent external test set based on the RNA-seq platform, and 0.716 and 0.752 in another independent external test set based on the microarray platform. Furthermore, we found that the predictive score of the Ipredictor model was correlated with immune microenvironment and genomic aberration markers. These results demonstrate that the models can accurately predict the response to NAT for BC patients and will contribute to individualized therapy.
This dataset contains data collected from female participants assessing perceived security while cycling in a virtual reality (VR) laboratory at the University of Tehran, Iran, as described in the study "Modeling Women Cyclists' Perceived Security: A Comparison of Machine Learning Techniques." Participants experienced multiple VR scenarios simulating various urban environments and traffic conditions, with features including crowdedness (1-4), incivility (1-3), lighting (1-3), obstruction of visibility (1-4), and surveillance (0-1). After each scenario, participants rated their perceived security (1=low, 2=medium, 3=high) via a questionnaire. The final dataset includes 208 samples after outlier removal, with five independent variables and the target variable "perceived security." Data was preprocessed with normalization and standardization for machine learning analysis. See the accompanying README.txt and the paper for variable definitions and file structure. This dataset supports the development of the DNN-stack2 model (84.07% accuracy) and is intended for research on cyclist security, urban planning, and machine learning applications.
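The DNN-stack2 model itself is described in the accompanying paper; the sketch below is only an illustrative stacked classifier on the five ordinal features, with standardization applied, and all architecture choices assumed.

```python
# Illustrative stand-in for a stacked model on the five perceived-security features; not DNN-stack2 itself.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(
    StandardScaler(),                                    # standardize the ordinal inputs
    StackingClassifier(
        estimators=[("mlp", MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)),
                    ("rf", RandomForestClassifier(n_estimators=200, random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000)))

# model.fit(X_train, y_train); model.score(X_test, y_test) returns accuracy on the held-out split.
```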
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: The Department of Rehabilitation Medicine is key to improving patients’ quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models. Methods: Data were collected from 38 Chinese institutions, including 4,244 patients visiting outpatient rehabilitation clinics. Data processing was conducted using Python software. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables. The steps included handling missing values, data normalization, and encoding conversion. The data were divided into 80% training and 20% test sets using the Scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics including accuracy and receiver operating characteristic (ROC) curves. The imbalanced-learn library’s SMOTE technique was used to address sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors. Results: XGBoost achieved the highest overall accuracy of 80.21%, with high precision and recall in Category 1. Random forest showed a similar overall accuracy. Logistic regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions. The tiered diagnosis and treatment tool effectively helped doctors assess patients’ conditions and recommend suitable medical institutions based on rehabilitation grading. Conclusion: This study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.
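The pipeline described in the Methods (pandas preprocessing, an 80/20 split, SMOTE oversampling, and XGBoost) can be outlined as below; the target column name and hyperparameters are assumptions for illustration, not the study's settings.

```python
# Sketch of the described pipeline; "rehab_tier" and the hyperparameters are assumed, not from the study.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

def train(df, target="rehab_tier"):
    X = pd.get_dummies(df.drop(columns=[target]))                 # encoding conversion for categorical variables
    y = LabelEncoder().fit_transform(df[target])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
    X_tr, y_tr = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # oversample minority classes (training set only)
    model = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="mlogloss")
    model.fit(X_tr, y_tr)
    print("test accuracy:", model.score(X_te, y_te))
    return model
```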
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Brain tumors pose significant global health concerns due to their high mortality rates and limited treatment options. These tumors, arising from abnormal cell growth within the brain, exhibit various sizes and shapes, making their manual detection from magnetic resonance imaging (MRI) scans a subjective and challenging task for healthcare professionals and necessitating automated solutions. This study investigates the potential of deep learning, specifically the DenseNet architecture, to automate brain tumor classification, aiming to enhance accuracy and generalizability for clinical applications. We utilized the Figshare brain tumor dataset, comprising 3,064 T1-weighted contrast-enhanced MRI images from 233 patients with three prevalent tumor types: meningioma, glioma, and pituitary tumor. Four pre-trained deep learning models—ResNet, EfficientNet, MobileNet, and DenseNet—were evaluated using transfer learning from ImageNet. DenseNet achieved the highest test set accuracy of 96%, outperforming ResNet (91%), EfficientNet (91%), and MobileNet (93%). We therefore focused on improving the performance of DenseNet as the base model. To enhance the generalizability of the base DenseNet model, we implemented a fine-tuning approach with regularization techniques, including data augmentation, dropout, batch normalization, and global average pooling, coupled with hyperparameter optimization. This enhanced DenseNet model achieved an accuracy of 97.1%. Our findings demonstrate the effectiveness of DenseNet with transfer learning and fine-tuning for brain tumor classification, highlighting its potential to improve diagnostic accuracy and reliability in clinical settings.
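A hedged Keras sketch of the kind of fine-tuned DenseNet head described above (augmentation, global average pooling, batch normalization, dropout) is given below; the input size, augmentation settings, and learning rate are assumptions, not the study's configuration.

```python
# Illustrative fine-tuned DenseNet classifier for three tumor classes; settings are assumed.
import tensorflow as tf

base = tf.keras.applications.DenseNet121(include_top=False, weights="imagenet",
                                         input_shape=(224, 224, 3))
base.trainable = True                                    # fine-tune the pre-trained backbone

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.RandomFlip("horizontal"),            # data augmentation
    tf.keras.layers.RandomRotation(0.1),
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(3, activation="softmax"),      # meningioma, glioma, pituitary tumor
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```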
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accuracies, precisions, recalls, F1 scores, and AUCs of the models trained from scratch, fine-tuned with PANNs, and fine-tuned with pre-training via contrastive learning, as evaluated by two professional nephrology nurses.