Facebook
TwitterAdditional file 2: Supplemental Figure 1. Flowcharts of the analytic samples for FOCUS 1.0 and FOCUS 2.0 survey waves. Supplemental Figure 2A. Box and whisker plots comparing FOCUS safety climate scores by size variables for FOCUSv.1.0 departments. Supplemental Figure 2B. Box and whisker plots comparing FOCUS safety climate scores by size variables for FOCUSv.2.0 departments.
Facebook
Twitterhttps://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.15454/AGU4QEhttps://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.15454/AGU4QE
WIDEa is R-based software aiming to provide users with a range of functionalities to explore, manage, clean and analyse "big" environmental and (in/ex situ) experimental data. These functionalities are the following, 1. Loading/reading different data types: basic (called normal), temporal, infrared spectra of mid/near region (called IR) with frequency (wavenumber) used as unit (in cm-1); 2. Interactive data visualization from a multitude of graph representations: 2D/3D scatter-plot, box-plot, hist-plot, bar-plot, correlation matrix; 3. Manipulation of variables: concatenation of qualitative variables, transformation of quantitative variables by generic functions in R; 4. Application of mathematical/statistical methods; 5. Creation/management of data (named flag data) considered as atypical; 6. Study of normal distribution model results for different strategies: calibration (checking assumptions on residuals), validation (comparison between measured and fitted values). The model form can be more or less complex: mixed effects, main/interaction effects, weighted residuals.
Facebook
TwitterThis data release contains time series and plots summarizing mean monthly temperature (TAVE) and total monthly precipitation (PPT), and runoff (RO) from the U.S. Geological Survey Monthly Water Balance Model at 115 National Wildlife Refuges within the U.S. Fish and Wildlife Service Mountain-Prairie Region (CO, KS, MT, NE, ND, SD, UT, and WY). These three variables are derived from two sets of statistically-downscaled general circulation models from 1951 through 2099. Three variables (TAVE, PPT, and RO for refuge areas) were summarized for comparison across four 19-year periods: historic (1951-1969), baseline (1981-1999), 2050 (2041-2059), and 2080 (2071-2089). For each refuge, mean monthly plots, seasonal box plots, and annual envelope plots were produced for each of the four periods.
Facebook
TwitterBank has multiple banking products that it sells to customer such as saving account, credit cards, investments etc. It wants to which customer will purchase its credit cards. For the same it has various kind of information regarding the demographic details of the customer, their banking behavior etc. Once it can predict the chances that customer will purchase a product, it wants to use the same to make pre-payment to the authors.
In this part I will demonstrate how to build a model, to predict which clients will subscribing to a term deposit, with inception of machine learning. In the first part we will deal with the description and visualization of the analysed data, and in the second we will go to data classification models.
-Desire target -Data Understanding -Preprocessing Data -Machine learning Model -Prediction -Comparing Results
Predict if a client will subscribe (yes/no) to a term deposit — this is defined as a classification problem.
The dataset (Assignment-2_data.csv) used in this assignment contains bank customers’ data. File name: Assignment-2_Data File format: . csv Numbers of Row: 45212 Numbers of Attributes: 17 non- empty conditional attributes attributes and one decision attribute.
https://user-images.githubusercontent.com/91852182/143783430-eafd25b0-6d40-40b8-ac5b-1c4f67ca9e02.png">
https://user-images.githubusercontent.com/91852182/143783451-3e49b817-29a6-4108-b597-ce35897dda4a.png">
Data pre-processing is a main step in Machine Learning as the useful information which can be derived it from data set directly affects the model quality so it is extremely important to do at least necessary preprocess for our data before feeding it into our model.
In this assignment, we are going to utilize python to develop a predictive machine learning model. First, we will import some important and necessary libraries.
Below we are can see that there are various numerical and categorical columns. The most important column here is y, which is the output variable (desired target): this will tell us if the client subscribed to a term deposit(binary: ‘yes’,’no’).
https://user-images.githubusercontent.com/91852182/143783456-78c22016-149b-4218-a4a5-765ca348f069.png">
We must to check missing values in our dataset if we do have any and do, we have any duplicated values or not.
https://user-images.githubusercontent.com/91852182/143783471-a8656640-ec57-4f38-8905-35ef6f3e7f30.png">
We can see that in 'age' 9 missing values and 'balance' as well 3 values missed. In this case based that our dataset it has around 45k row I will remove them from dataset. on Pic 1 and 2 you will see before and after.
https://user-images.githubusercontent.com/91852182/143783474-b3898011-98e3-43c8-bd06-2cfcde714694.png">
From the above analysis we can see that only 5289 people out of 45200 have subscribed which is roughly 12%. We can see that our dataset highly unbalanced. we need to take it as a note.
https://user-images.githubusercontent.com/91852182/143783534-a05020a8-611d-4da1-98cf-4fec811cb5d8.png">
Our list of categorical variables.
https://user-images.githubusercontent.com/91852182/143783542-d40006cd-4086-4707-a683-f654a8cb2205.png">
Our list of numerical variables.
https://user-images.githubusercontent.com/91852182/143783551-6b220f99-2c4d-47d0-90ab-18ede42a4ae5.png">
In above boxplot we can see that some point in very young age and as well impossible age. So,
https://user-images.githubusercontent.com/91852182/143783564-ad0e2a27-5df5-4e04-b5d7-6d218cabd405.png">
https://user-images.githubusercontent.com/91852182/143783589-5abf0a0b-8bab-4192-98c8-d2e04f32a5c5.png">
Now, we don’t have issues on this feature so we can use it
https://user-images.githubusercontent.com/91852182/143783599-5205eddb-a0f5-446d-9f45-cc1adbfcce67.png">
https://user-images.githubusercontent.com/91852182/143783601-e520d59c-3b21-4627-a9bb-cac06f415a1e.png">
https://user-images.githubusercontent.com/91852182/143783634-03e5a584-a6fb-4bcb-8dc5-1f3cc50f9507.png">
https://user-images.githubusercontent.com/91852182/143783640-f6e71323-abbe-49c1-9935-35ffb2d10569.png">
This attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes...
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This database studies the performance inconsistency on the biomass HHV ultimate analysis. The research null hypothesis is the consistency in the rank of a biomass HHV model. Fifteen biomass models are trained and tested in four datasets. In each dataset, the rank invariability of these 15 models indicates the performance consistency.
The database includes the datasets and source codes to analyze the performance consistency of the biomass HHV. These datasets are stored in tabular on an excel workbook. The source codes are the biomass HHV machine learning model through the MATLAB Objected Orient Program (OOP). These machine learning models consist of eight regressions, four supervised learnings, and three neural networks.
An excel workbook, "BiomassDataSetUltimate.xlsx," collects the research datasets in six worksheets. The first worksheet, "Ultimate," contains 908 HHV data from 20 pieces of literature. The names of the worksheet column indicate the elements of the ultimate analysis on a % dry basis. The HHV column refers to the higher heating value in MJ/kg. The following worksheet, "Full Residuals," backups the model testing's residuals based on the 20-fold cross-validations. The article (Kijkarncharoensin & Innet, 2021) verifies the performance consistency through these residuals. The other worksheets present the literature datasets implemented to train and test the model performance in many pieces of literature.
A file named "SourceCodeUltimate.rar" collects the MATLAB machine learning models implemented in the article. The list of the folders in this file is the class structure of the machine learning models. These classes extend the features of the original MATLAB's Statistics and Machine Learning Toolbox to support, e.g., the k-fold cross-validation. The MATLAB script, name "runStudyUltimate.m," is the article's main program to analyze the performance consistency of the biomass HHV model through the ultimate analysis. The script instantly loads the datasets from the excel workbook and automatically fits the biomass model through the OOP classes.
The first section of the MATLAB script generates the most accurate model by optimizing the model's higher parameters. It takes a few hours for the first run to train the machine learning model via the trial and error process. The trained models can be saved in MATLAB .mat file and loaded back to the MATLAB workspace. The remaining script, separated by the script section break, performs the residual analysis to inspect the performance consistency. Furthermore, the figure of the biomass data in the 3D scatter plot, and the box plots of the prediction residuals are exhibited. Finally, the interpretations of these results are examined in the author's article.
Reference : Kijkarncharoensin, A., & Innet, S. (2022). Performance inconsistency of the Biomass Higher Heating Value (HHV) Models derived from Ultimate Analysis [Manuscript in preparation]. University of the Thai Chamber of Commerce.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary Figure 1A: Box and Whisker Plots of log Aldosterone to Renin Ratio, additionally adjusted for body mass index Supplementary Figure 1B. Box and Whisker Plots of log Renin, additionally adjusted for body mass index Supplementary Figure 1C. Box and Whisker Plots of log Aldosterone, additionally adjusted for body mass index Supplementary Figure 2. Box and Whisker Plots of log ACE activity, additionally adjusted for body mass index
Facebook
TwitterFigure S1. CRISPR-KO screening data quality assessment. (A) Average correlation between sgRNAs read-count replicates across cell lines. (B) Receiver operating characteristic (ROC) curve obtained from classifying fitness essential (FE) and non-essential genes based on the average logFC of their targeting sgRNAs. An example cell line OVCAR-8 is shown. (C) Area under the ROC (AUROC) curve obtained for cell lines from classifying FE and non-essential genes based on the average logFC of their targeting sgRNAs. (D) Recall for sets of a priori known essential genes from MSigDB and from literature when classifying FE and non-essential genes across cell lines (5% FDR). Each circle represents a cell line and coloured by tissue type. Box and whisker plots show median, inter-quartile ranges and 95% confidence intervals. (E) Genes ranked based on the average logFC of targeting sgRNAs for OVCAR-8 and enrichment of genes belonging to predefined sets of a priori known essential genes from MSigDB, at an FDR equal to 5% when classifying FE (second last column) and non-essential genes (last column). Blue numbers at the bottom indicate the classification true positive rate (recall). Figure S2. Assessment of copy number bias before and after CRISPRcleanR correction across cell lines. sgRNA logFC values before and after CRISPRcleanR for eight cell lines are shown classified based on copy number (amplified or deleted) and expression status. Copy number segments were identified using Genomics of Drug Sensitivity in Cancer (GDSC) and Cell Line Encyclopedia (CCLE) datasets. Box and whisker plots show median, inter-quartile ranges and 95% confidence intervals. Asterisks indicate significant associations between sgRNA LogFC values (Welchs t-test, p < 0,005) and their different effect sizes accounting for the standard deviation (Cohen’s D value), compared to the whole sgRNA library. Figure S3. CN-associated effect on sgRNA logFC values in highly biased cell lines. For 3 cell lines, recall curves of non-essential genes, fitness essential genes, copy number (CN) amplified and CN amplified non-expressed genes obtained when classifying genes based on the average logFC values of their targeting sgRNAs. Figure S4. Assessment of CN-associated bias across all cell lines. LogFC values of sgRNAs averaged within segments of equal copy number (CN). One plot per cell line, with CN values at which a significant differences (Welchs t-test, p < 0.05) with respect to the logFCs corresponding to CN = 2 are initially observed (bias starting point) and start to significantly increase continuously (bias critical point). CN-associated bias is shown for all sgRNA, when excluding FE genes and histones, and for non-expressed genes only. Box and whisker plots show median, inter-quartile ranges and 95% confidence intervals. Figure S5. CRISPRcleanR correction varying the minimal number of genes required and the effect of fitness essential genes. Recall reduction of (A) amplified or (B) amplified not-expressed genes versus that of fitness essential and other prior known essential genes, when comparing CRISPRcleanR correction varying the minimal number of genes to be targeted by sgRNA in a biased segment (default parameter is n = 3). Similar results were observed when performing the analysis including or excluding known essential genes. Figure S6. CRISPRcleanR performances across 342 cell lines from an independent dataset. Recall at 5% FDR of predefined sets of genes based on their uncorrected or corrected logFCs (coordinates on the two axis) averaged across targeting sgRNAs for 342 cell lines from the Project Achilles. Figure S7. CRISPRcleanR performances in relation to data quality. The impact of data quality on recall at 5% false discovery rate (FDR) assessed following CRISPRcleanR correction for predefined set of genes. Project Achilles data (n = 342 cell lines) was binned based on the quality of uncorrected essentiality profile. This is obtained by measuring the recall at 5% FDR for predefined essential genes (from the Molecular Signature Database) and grouping the cell lines in 10 equidistant bins (1 lowest quality and 10 highest quality) when sorting them based on this value. Recall increment for fitness essential genes was greatest for the lower quality data, indicating that CRISPRcleanR can improve true signal of gene depletion in low quality data. Figure S8. Minimal impact of CRISPRcleanR on loss/gain-of-fitness effects. (A) The percentage of genes where the significance of their fitness effect (gain- or loss-of-fitness) is altered after CRISPRcleanR for Project Score and Project Achilles data. The upper row shows correction effects for all screened genes and the lower row for the subset of genes with a significant effect in the uncorrected data. Each dot is a separate cell line. Blue dots indicate the percentage of genes where significance is lost or gained post correction. Green dots indicate the percentage of genes where the fitness effect is distorted and the effect is opposite in the uncorrected data. (B) The majority of the loss-of-fitness genes impacted by correction are putative false positive effects affecting genes which are either not-expressed (FPKM < 0.5), amplified, known non-essential, or exhibit a mild phenotype in the screening data. (C) Summary of overall impact of CRISPRcleanR on fitness effects following correction when considering data for all cell lines. The colors reflect the percentage of genes with a loss-of-fitness, no phenotype or gain-of-fitness effect which are retained in the corrected data. Figure S9. CRISPRcleanR retains cancer driver gene dependencies in Project Score and Achilles data. (A) Each circle represents a tested cancer driver gene dependency (mutation or amplification of a copy number segment) and the statistical significance using MaGeCK before (x-axis) and after (y-axis) CRISPRcleanR correction, across the two screens. Plots in the first row show depletion FDR values pre/post-correction, whereas those in the second row show depletion FDR values pre-correction and enrichment FDR values post-correction. (B) Details of the tested genetic dependencies and whether they are shared before and after CRISPRcleanR correction at two different thresholds of statistical significance (5 and 10% FDR, respectively for 1st and 2nd row of plots). The third row indicates the type of alteration involving the cancer driver genes under consideration and the total number of cell lines with an alteration. (ZIP 191 kb)
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ObjectivesThis study aimed to explore the correlation between the sentiment of nursing notes and the one-year mortality of sepsis patients.MethodsThe box plot was used to compare the differences in sentiment polarity/sentiment subjectivity between different groups. Multivariate logistic regression was used to explore the correlation between sentiment polarity/sentiment subjectivity and one-year mortality of elderly sepsis patients. Ridge regression, XGBoost regression, and random forest were used to explore the importance of sentiment polarity and subjectivity in the one-year mortality of elderly sepsis patients. Restricted cubic spline (RCS) was used to explore whether there was a linear relationship between sentiment polarity, sentiment subjectivity and the one-year mortality of elderly sepsis patients. Kaplan-Meier (KM) curve was used to explore the relationship between the sentiment polarity (or sentiment subjectivity) and the 1-year death of the patient.ResultsCompared with the control group, the one-year mortality group year had lower sentiment polarity and higher sentiment subjectivity. Sentiment polarity and sentiment subjectivity were independently related to the one-year mortality of elderly sepsis patients. There was a linear relationship between sentiment polarity and the one-year mortality of elderly sepsis patients. At the same time, there was a nonlinear relationship between sentiment subjectivity and the one-year mortality of elderly sepsis patients.KM.ConclusionsThe sentiment of nursing notes was correlated with the one-year mortality of elderly sepsis patients.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of result on optimal design of industrial refrigeration system.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The comparison results of different algorithms on CEC2017 functions with D=30.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of result on welded beam design problem.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of result on three-bar truss design problem.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of result on speed reducer design problem.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The comparison results of different algorithms on 23 benchmark functions with D=30.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of result on pressure vessel design problem.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The comparison results of different algorithms on CEC2019 functions.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The supplementary file includes the computational details of ClusterPSO and supplementary Figures and Tables. Length distribution of the results of CpGIS, CpGCluster, CPSORL, and ClusterPSO in the human genome (Figure A). Distribution of the results of CpG islands in the human genome (Figure B). XY charts comparing the true positive and false positive rates amongst the six methods for six contig sequences (Figure C). Box plot comparing the stability of five methods in six contig sequences (Figure D). Box plot of the O/E ratio for each interval length in the human genome (Figure E). Number of CpG islands located in gene regions identified with CPSORL and ClusterPSO (Table A). Performance measurement of ClusterPSO and CPSORL for all chromosomes in the human genome (Table B). Number of detection CpG islands overlapping on true CpG islands for CpGcluster, CPSORL and ClusterPSO for all chromosomes in the human genome (Table C). (DOC)
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BackgroundArtificial intelligence (AI) methods have established themselves in cardiovascular magnetic resonance (CMR) as automated quantification tools for ventricular volumes, function, and myocardial tissue characterization. Quality assurance approaches focus on measuring and controlling AI-expert differences but there is a need for tools that better communicate reliability and agreement. This study introduces the Verity plot, a novel statistical visualization that communicates the reliability of quantitative parameters (QP) with clear agreement criteria and descriptive statistics.MethodsTolerance ranges for the acceptability of the bias and variance of AI-expert differences were derived from intra- and interreader evaluations. AI-expert agreement was defined by bias confidence and variance tolerance intervals being within bias and variance tolerance ranges. A reliability plot was designed to communicate this statistical test for agreement. Verity plots merge reliability plots with density and a scatter plot to illustrate AI-expert differences. Their utility was compared against Correlation, Box and Bland-Altman plots.ResultsBias and variance tolerance ranges were established for volume, function, and myocardial tissue characterization QPs. Verity plots provided insights into statstistcal properties, outlier detection, and parametric test assumptions, outperforming Correlation, Box and Bland-Altman plots. Additionally, they offered a framework for determining the acceptability of AI-expert bias and variance.ConclusionVerity plots offer markers for bias, variance, trends and outliers, in addition to deciding AI quantification acceptability. The plots were successfully applied to various AI methods in CMR and decisively communicated AI-expert agreement.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Facebook
TwitterAdditional file 2: Supplemental Figure 1. Flowcharts of the analytic samples for FOCUS 1.0 and FOCUS 2.0 survey waves. Supplemental Figure 2A. Box and whisker plots comparing FOCUS safety climate scores by size variables for FOCUSv.1.0 departments. Supplemental Figure 2B. Box and whisker plots comparing FOCUS safety climate scores by size variables for FOCUSv.2.0 departments.