Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this table we present a more detailed view of performance for our method's poorest predicted network ( total regulatory interactions with up to regulators controlling each gene). The table reports the method's precision [%] at varying degrees of completeness (recall [%]).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Multiple Sclerosis (MS) progresses at an unpredictable rate, but predictions on the disease course in each patient would be extremely useful to tailor therapy to individual needs. We explore different machine learning (ML) approaches to predict whether a patient will shift from the initial Relapsing-Remitting (RR) to the Secondary Progressive (SP) form of the disease, using only “real world” data available in clinical routine. The clinical records of 1624 outpatients (207 in the SP phase) attending the MS service of Sant’Andrea hospital, Rome, Italy, were used. Predictions at 180, 360 or 720 days from the last visit were obtained considering either the data of the last available visit (Visit-Oriented setting), comparing four classical ML methods (Random Forest, Support Vector Machine, K-Nearest Neighbours and AdaBoost), or the whole clinical history of each patient (History-Oriented setting), using a Recurrent Neural Network model specifically designed for historical data. Missing values were handled by removing either all clinical records presenting at least one missing parameter (Feature-saving approach) or the 3 clinical parameters which contained missing values (Record-saving approach). The performance of the classifiers was rated using common indicators, such as Recall (or Sensitivity) and Precision (or Positive predictive value). In the Visit-Oriented setting, the Record-saving approach yielded Recall values from 70% to 100% but low Precision (5% to 10%), which, however, increased to 50% when considering only predictions for which the model returned a probability above a given “confidence threshold”. In the History-Oriented setting, both indicators increased as prediction time lengthened, reaching values of 67% (Recall) and 42% (Precision) at 720 days. We show how “real world” data can be effectively used to forecast the evolution of MS, leading to high Recall values, and propose innovative approaches to improve Precision towards clinically useful values.
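The effect of a “confidence threshold” on Precision and Recall can be illustrated with a minimal Python sketch. The labels and probabilities below are hypothetical toy values, not data from the study, and the function name `precision_recall_at` is an assumption introduced for illustration only.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when a case is predicted positive if prob >= threshold."""
    tp = fp = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted_positive = prob >= threshold
        if predicted_positive and truth == 1:
            tp += 1
        elif predicted_positive and truth == 0:
            fp += 1
        elif truth == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 1 = patient shifts to SP, 0 = patient stays RR.
y_true = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.95, 0.60, 0.55, 0.52, 0.90, 0.51, 0.58, 0.65, 0.30, 0.20]

low = precision_recall_at(y_true, y_prob, threshold=0.5)   # many positives kept
high = precision_recall_at(y_true, y_prob, threshold=0.9)  # only confident ones kept
print("threshold 0.5:", low)
print("threshold 0.9:", high)
```

In this toy case, raising the threshold removes the false positives and raises Precision, at the cost of missing one true case and lowering Recall. This is the general trade-off a confidence threshold introduces.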
https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34810/data242
JavaScript code to be implemented in Google Earth Engine© for a hybrid MSRM-based deep learning and multitemporal Sentinel-2-based machine learning algorithm. The algorithm performs large-scale automatic detection of burial mounds, one of the most common types of archaeological sites globally, using LiDAR and multispectral satellite data. Although previous attempts were able to detect a good proportion of the known mounds in a given area, they still produced high numbers of false positives and low precision values. Our proposed approach combines random forest for soil classification using multitemporal multispectral Sentinel-2 data and a deep learning model using YOLOv3 on LiDAR data pre-processed with a multi-scale relief model. The resulting algorithm improves on previous attempts, with a detection rate of 89.5%, an average precision of 66.75%, a recall value of 0.64 and a precision of 0.97, which allowed, with a small set of training data, the detection of 10,527 burial mounds over an area of nearly 30,000 km², the largest in which such an approach has ever been applied. The open code and platforms employed to develop the algorithm allow this method to be applied anywhere LiDAR data or high-resolution digital terrain models are available.
PHM2017 is a new dataset consisting of 7,192 English tweets across six diseases and conditions: Alzheimer’s Disease, heart attack (any severity), Parkinson’s disease, cancer (any type), Depression (any severity), and Stroke. The Twitter search API was used to retrieve the data using the colloquial disease names as search keywords, with the expectation of retrieving a high-recall, low precision dataset. After removing the re-tweets and replies, the tweets were manually annotated. The labels are:
self-mention. The tweet contains a health mention with a health self-report of the Twitter account owner, e.g., "However, I worked hard and ran for Tokyo Mayer Election Campaign in January through February, 2014, without publicizing the cancer."
other-mention. The tweet contains a health mention of a health report about someone other than the account owner, e.g., "Designer with Parkinson’s couldn’t work then engineer invents bracelet + changes her world"
awareness. The tweet contains the disease name, but does not mention a specific person, e.g., "A Month Before a Heart Attack, Your Body Will Warn You With These 8 Signals"
non-health. The tweet contains the disease name, but the tweet topic is not about health, e.g., "Now I can have cancer on my wall for all to see <3"
Satellite imagery has several applications, including land use and land cover classification, change detection, object detection, etc. Satellite based remote sensing sensors often encounter cloud coverage due to which clear imagery of earth is not collected. The clouded regions should be excluded, or cloud removal algorithms must be applied, before the imagery can be used for analysis. Most of these preprocessing steps require a cloud mask. In case of single-scene imagery, though tedious, it is relatively easy to manually create a cloud mask. However, for a larger number of images, an automated approach for identifying clouds is necessary. This model can be used to automatically generate a cloud mask from Sentinel-2 imagery.
Using the model
Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS.
Fine-tuning the model
This model can be fine-tuned using the Train Deep Learning Model tool. Follow the guide to fine-tune this model.
Input
Sentinel-2 L2A imagery in the form of a raster, mosaic dataset or image service.
Output
Classified raster containing three classes: Low density, Medium density and High density.
Applicable geographies
This model is expected to work well in Europe and the United States. This model works well for land based areas. Large water bodies such as ocean, seas and lakes should be avoided.
Model architecture
This model uses the UNet model architecture implemented in ArcGIS API for Python.
Accuracy metrics
This model has an overall accuracy of 94 percent with L2A imagery. The table below summarizes the precision, recall and F1-score of the model on the validation dataset. The comparatively low precision, recall and F1 score for Low density clouds might cause false detection of such clouds in certain urban areas. Also, for certain seasonal clouds some extremely bright pixels might be missed out.
Class | Precision | Recall | F1 score |
---|---|---|---|
High density | 0.960 | 0.975 | 0.968 |
Medium density | 0.905 | 0.897 | 0.901 |
Low density | 0.774 | 0.571 | 0.657 |
Sample results
Here are a few results from the model.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Assignment of nine new bounding boxes by K-means.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Objective
This project focuses on developing an object detection model using the YOLOv11 architecture. The primary goal is to accurately detect and classify objects within images across three distinct classes. The model was trained for 250 epochs to achieve high performance in terms of mean Average Precision (mAP), Precision, and Recall.
Dataset Information
- Number of Images: 300
- Number of Annotations: 582
- Classes: 3
- Average Image Size: 0.30 megapixels
- Image Size Range: 0.03 megapixels to 11.83 megapixels
- Median Image Ratio: 648x500 pixels
Preprocessing
- Auto-Orient: Applied to ensure correct image orientation.
- Resize: Images were stretched to a uniform size of 640x640 pixels to maintain consistency across the dataset.
Augmentations
- Outputs per Training Example: 3 augmented outputs were generated for each training example to enhance the diversity of the training data.
- Crop: Random cropping was applied with a minimum zoom of 0% and a maximum zoom of 8%.
- Rotation: Images were randomly rotated between -8° and +8° to improve the model's robustness to different orientations.
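The preprocessing and augmentation settings above can be made concrete with a small sketch that only samples the parameters; it does not touch image data. The function names (`stretch_scale`, `sample_augmentations`) and the uniform sampling of zoom and angle are assumptions for illustration, not the tooling actually used for this dataset.

```python
import random


def stretch_scale(width, height, target=640):
    """Per-axis scale factors for a stretch resize (aspect ratio is not preserved)."""
    return target / width, target / height


def sample_augmentations(n_outputs=3, max_zoom=0.08, max_rotation_deg=8.0, seed=None):
    """Draw one (zoom, rotation) pair per augmented output.

    Assumption: zoom and rotation are drawn uniformly from the stated ranges.
    """
    rng = random.Random(seed)
    return [
        {
            "zoom": rng.uniform(0.0, max_zoom),
            "rotation_deg": rng.uniform(-max_rotation_deg, max_rotation_deg),
        }
        for _ in range(n_outputs)
    ]


# The median image (648x500) is scaled differently along each axis.
print(stretch_scale(648, 500))
for params in sample_augmentations(seed=0):
    print(params)
```

Because the resize is a stretch, the two scale factors differ for non-square images, so object shapes are slightly distorted before augmentation is applied.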
Training and Performance
The model was trained for 250 epochs, and the following performance metrics were achieved:
- mAP (mean Average Precision): 90.4%
- Precision: 87.7%
- Recall: 83.4%
These metrics indicate that the model is highly effective in detecting and classifying objects within the images, with a strong balance between precision and recall.
Key Insights
- mAP: The high mAP score of 90.4% suggests that the model is accurate in predicting the correct bounding boxes and class labels for objects in the dataset.
- Precision: A precision of 87.7% indicates that few of the model's detections are false positives, meaning its positive detections are reliable.
- Recall: The recall of 83.4% shows that the model is capable of detecting most of the relevant objects in the images.
Visualization
The training process was monitored using various metrics, including mAP, Box Loss, Class Loss, and Object Loss. The visualizations show the progression of these metrics over the 250 epochs, demonstrating the model's learning and improvement over time.
Conclusion
The project successfully implemented and trained an object detection model using the YOLOv11 architecture. The achieved performance metrics highlight the model's effectiveness and reliability in detecting objects across different classes. This model can be further refined and applied to real-world applications for object detection tasks.
The data for this competition is from the RAICOM Mission Application Competition and Mo in China, originating from https://www.kaggle.com/datasets/uciml/mushroom-classification/
Copyright in the datasets belongs to the organizers of the "RAICOM Mission Application Competition".
The results of the official baseline are:
Accuracy: 0.7464409388226241
Precision: 0.7591353576942872
Recall: 0.6344086021505376
F1: 0.6911902530459232
Confusion matrix:
[[2405 468]
[ 850 1475]]
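The reported baseline figures can be recomputed from the confusion matrix with a short sketch. It assumes the common layout in which rows are actual classes and columns are predicted classes, i.e. `[[TN, FP], [FN, TP]]`; the source does not state the layout, but under this assumption the computed values agree with the numbers listed above.

```python
def binary_metrics(confusion):
    """Metrics from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]].

    Assumption: rows are actual classes, columns are predicted classes.
    """
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


baseline = binary_metrics([[2405, 468], [850, 1475]])
for name, value in baseline.items():
    print(f"{name}: {value:.4f}")
```

Read this way, the baseline misses 850 of the 2,325 positive cases, which is why Recall is the weakest of the four metrics.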
Mushrooms are a beloved delicacy, but beneath their glamorous appearance they may harbor deadly dangers. China is one of the countries with the largest variety of mushrooms in the world. At the same time, mushroom poisoning is one of the most serious food safety issues in China. According to relevant reports, in 2021 China investigated 327 mushroom poisoning incidents, involving 923 patients and 20 deaths, for a total mortality rate of 2.17%. Non-professionals cannot distinguish poisonous mushrooms from edible ones by appearance, shape, color, etc., and no simple standard separates the two. To determine whether mushrooms are edible, it is necessary to collect mushrooms with different characteristic attributes and analyze whether they are toxic. In this competition, 22 characteristic attributes of mushrooms are analyzed to obtain a mushroom usability model that can better predict whether mushrooms are edible.
In the context of this mushroom usability model competition, several performance metrics can be utilized to evaluate the predictive accuracy of the model. Among them, the F1 score stands out due to its ability to provide a balance between precision and recall, which are crucial for this classification problem where distinguishing between poisonous and edible mushrooms can have severe real-world implications.
F1 Score
The F1 score is the harmonic mean of precision and recall, and it is particularly useful in binary classification scenarios with imbalanced class distribution:
Precision (also known as positive predictive value) indicates the proportion of true positive observations among all observations classified as positive. It measures the accuracy of the positive predictions. \( \text{Precision} = \frac{TP}{TP + FP} \)
Recall (also known as sensitivity or true positive rate) measures the proportion of true positive observations out of all actual positives. It assesses the ability to capture all the true positive instances. \( \text{Recall} = \frac{TP}{TP + FN} \)
The F1 score is calculated as follows:
\[ \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
Why F1 Score?
Balance Between Precision and Recall: In the context where mushroom classification error can have critical health impacts, favoring either precision or recall solely might be dangerous. F1 score provides a more comprehensive evaluation by balancing these errors.
Handling Imbalanced Classes: Mushroom datasets often have an imbalance between the number of edible and poisonous instances. The F1 score is less influenced by the skewed class distributions compared to accuracy.
Critical Application: Misclassifying a poisonous mushroom as edible can lead to severe health risks. Hence, ensuring both high precision (minimizing false positives) and high recall (capturing all true positives) is crucial. The F1 score encapsulates the tradeoff between these aspects well.
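A small sketch shows why the harmonic mean rewards balance. The precision and recall pairs are hypothetical values chosen for illustration, and the arithmetic mean is included only as a point of comparison.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    return (precision + recall) / 2


# Hypothetical classifiers: one balanced, one that rarely flags the positive class.
balanced = (0.8, 0.8)
lopsided = (1.0, 0.1)

for name, (p, r) in {"balanced": balanced, "lopsided": lopsided}.items():
    print(name, "F1 =", round(f1_score(p, r), 3), "mean =", round(arithmetic_mean(p, r), 3))
```

The lopsided classifier keeps a moderate arithmetic mean but a low F1, because the harmonic mean is pulled towards the smaller of the two values. A model therefore cannot reach a high F1 by maximizing precision alone or recall alone.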
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Frames per second test comparison.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
We describe the design and results of an experiment in using text-mining and machine-learning techniques to generate annual measures of national political regime types. Valid and reliable measures of countries’ forms of national government are essential to cross-national and dynamic analysis of many phenomena of great interest to political scientists, including civil war, interstate war, democratization, and coups d’état. Unfortunately, traditional measures of regime type are very expensive to produce, and observations for ambiguous cases are often sharply contested. In this project, we train a series of support vector machine (SVM) classifiers to infer regime type from textual data sources. To train the classifiers, we use vectorized textual reports from Freedom House and the State Department as features for a training set of prelabeled regime type data. To validate the SVM classifiers, we evaluate their predictions out of sample; performance is very high across a variety of metrics (accuracy, precision, recall). The results of this project highlight the ability of these techniques to contribute to producing real-time data sources for use in political science that can also be routinely updated at much lower cost than human-coded data. To this end, we set up a text-processing pipeline that pulls updated textual data from selected sources, conducts feature extraction, and applies supervised machine learning methods to produce measures of regime type. This pipeline, written in Python, can be pulled from the GitHub repository associated with this project and easily extended as more data becomes available.
This dataset includes the scripts to reproduce the models presented in the paper. The cleaned data used for the analyses is also available. Abstract of the article: Precision farming technology, including GPS collars with biologging, has revolutionized remote livestock monitoring in extensive grazing systems. High-resolution accelerometry can be used to infer the behavior of an animal. Previous behavioral classification studies using accelerometer data have focused on a few key behaviors and were mostly conducted in controlled situations. Here, we conducted behavioral observations of 38 beef cows (Hereford, Limousin, Charolais, Simmental/NRF/Hereford mix) free-ranging in rugged, forested areas and fitted with a commercially available virtual fence collar (Nofence) containing a 10 Hz tri-axial accelerometer. We used random forest models to calibrate data from the accelerometers on both commonly documented (e.g., feeding, resting, walking) and rarer (e.g., suckling calf, head butting, allogrooming) behaviors. Our goal was to assess pre-processing decisions including different running mean intervals (smoothing window of 1, 5, or 20 seconds), collar orientation and feature selection (orientation-dependent versus orientation-independent features). We identified the 10 most common behaviors exhibited by the cows. Models based only on orientation-independent features did not perform better than models based on orientation-dependent features, despite variation in how collars were attached (direction and tightness). Using a 20-second running mean and orientation-dependent features resulted in the highest model performance (model accuracy: 0.998, precision: 0.991, and recall: 0.989). We also used this model to add 11 rarer behaviors (each < 0.1% of the data; e.g., head butting, throwing head, self-grooming).
These rarer behaviors were predicted with less accuracy because they were not observed at all for some individuals, but overall model performance remained high (accuracy, precision, recall >98%). Our study suggests that the accelerometers in the Nofence collars are suitable for identifying the most common behaviors of free-ranging cattle. The results of this study could be used in future research for understanding cattle habitat selection in rugged forest ranges, herd dynamics, or responses to stressors such as carnivores, as well as to improve cattle management and welfare.
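The running-mean smoothing mentioned in the abstract can be sketched as follows. This is a minimal illustration assuming a trailing window over complete windows only; the paper's scripts may use a centered window or different edge handling. With a 10 Hz accelerometer, windows of 1, 5 and 20 seconds correspond to 10, 50 and 200 samples.

```python
def running_mean(samples, window_seconds, sample_rate_hz=10):
    """Running mean over complete windows of `window_seconds` at `sample_rate_hz`.

    Returns len(samples) - window + 1 values, one per complete window.
    """
    window = int(round(window_seconds * sample_rate_hz))
    if window < 1 or window > len(samples):
        raise ValueError("window must cover between 1 and len(samples) samples")
    total = sum(samples[:window])
    means = [total / window]
    for i in range(window, len(samples)):
        # Slide the window: add the newest sample, drop the oldest one.
        total += samples[i] - samples[i - window]
        means.append(total / window)
    return means


# Hypothetical single-axis signal: 2 seconds of data at 10 Hz.
signal = [float(i) for i in range(20)]
smoothed = running_mean(signal, window_seconds=1)
print(len(smoothed), smoothed[0], smoothed[-1])
```

Longer windows suppress short spikes in the signal more strongly, which is consistent with the 20-second window performing best for sustained behaviors such as feeding or resting.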
MIT Licensehttps://opensource.org/licenses/MIT
Bacterial small RNAs (sRNAs) are pivotal in post-transcriptional regulation, affecting functions like virulence, metabolism, and gene expression by binding specific mRNA targets. Identifying these targets is crucial to understanding sRNA regulation across species. Despite advancements in high-throughput (HT) experimental methods, they remain technically challenging and are limited to detecting sRNA-target interactions under specific environmental conditions. Therefore, computational approaches, especially machine learning (ML), are essential for identifying strong candidates for biological validation.
In this study, we hypothesize that ML models trained on large-scale interaction data from specific conditions can accurately predict new interactions in unseen conditions within the same bacterial strain. To test this, we developed models from two families: (1) graph neural networks (GNNs), including GraphRNA and kGraphRNA, that learn transformed representations of interacting sRNA-mRNA pairs via graph relationships, and (2) decision forests, sInterRF (Random Forest) and sInterXGB (XGBoost), which use various interaction features for prediction. We also propose Summation Ensemble Models (SEM) that combine scores from multiple models. Across three seen-to-unseen condition evaluations, our models, particularly kGraphRNA, significantly improved the area under the ROC curve (AUC) and the area under the Precision-Recall curve (PR-AUC) compared to sRNARFTarget, CopraRNA, and RNAup. The SEM model combining GraphRNA and CopraRNA outperformed CopraRNA alone on a low-throughput (LT) interactions test set (HT-to-LT).
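The idea of combining scores by summation can be sketched in a few lines. This is an illustrative assumption, not the paper's SEM definition: each model's scores are min-max scaled to [0, 1] so that models on different scales contribute comparably, and higher scores are taken to mean a more likely interaction (a p-value would need converting first, for example to 1 - p). The function names and toy scores are hypothetical.

```python
def min_max_scale(scores):
    """Rescale scores to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def summation_ensemble(*model_scores):
    """Sum per-pair scores across models after min-max scaling each model."""
    scaled = [min_max_scale(list(scores)) for scores in model_scores]
    return [sum(values) for values in zip(*scaled)]


# Hypothetical scores from two models for the same three sRNA-mRNA pairs.
model_a = [0.2, 0.8, 0.5]
model_b = [10.0, 30.0, 20.0]
combined = summation_ensemble(model_a, model_b)
best_pair = max(range(len(combined)), key=combined.__getitem__)
print(combined, best_pair)
```

Scaling before summing prevents the model with the larger numeric range from dominating the combined ranking.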
This data source provides the HT and LT interaction datasets used for our study. In addition, we provide the prediction scores of our models: kGraphRNA, GraphRNA, sInterRF, and sInterXGB for any pair of sRNA and mRNA of Escherichia coli K12 MG1655 (NC_000913). We also provide the true labels and the CopraRNA p-value scores computed for all possible pairs. Note that prediction scores are not provided for sRNA-mRNA pairs that were used to train the models, i.e., all the labeled interactions (HT and LT) and negative interactions sampled randomly (see our paper for more details).
For convenience, each CSV file contains the scores of a single sRNA with the following information: accession IDs, locus tags, and names of the sRNA and the mRNA; CopraRNA p-value (if available); the prediction scores of the kGraphRNA, GraphRNA, sInterRF, and sInterXGB models; true label (if available), 1 for interaction and 0 for non-interaction; and whether the sRNA-mRNA pair was sampled for the train set as a random negative sample (true or false).
Dataset Introduction
TFH_Annotated_Dataset is an annotated patent dataset pertaining to thin film head technology in hard disk drives. To the best of our knowledge, this is the second publicly available labeled patent dataset in the technology management domain that annotates both entities and the semantic relations between entities; the first is [1].
The well-crafted information schema used for patent annotation contains 17 types of entities and 15 types of semantic relations as shown below.
Table 1 The specification of entity types
Type | Comment | Example |
---|---|---|
physical flow | substance that flows freely | The etchant solution has a suitable solvent additive such as glycerol or methyl cellulose |
information flow | information data | A camera using a film having a magnetic surface for recording magnetic data thereon |
energy flow | entity relevant to energy | Conductor is utilized for producing writing flux in magnetic yoke |
measurement | method of measuring something | The curing step takes place at the substrate temperature less than 200.degree |
value | numerical amount | The curing step takes place at the substrate temperature less than 200.degree |
location | place or position | The legs are thinner near the pole tip than in the back gap region |
state | particular condition at a specific time | The MR elements are biased to operate in a magnetically unsaturated mode |
effect | change caused by an innovation | Magnetic disk system permits accurate alignment of magnetic head with spaced tracks |
function | manufacturing technique or activity | A magnetic head having highly efficient write and read functions is thereby obtained |
shape | the external form or outline of something | Recess is filled with non-magnetic material such as glass |
component | a part or element of a machine | A pole face of yoke is adjacent edge of element remote from surface |
attribution | a quality or feature of something | A pole face of yoke is adjacent edge of element remote from surface |
consequence | The result caused by something or activity | This prevents the slider substrate from electrostatic damage |
system | a set of things working together as a whole | A digital recording system utilizing a magnetoresistive transducer in a magnetic recording head |
material | the matter from which a thing is made | Interlayer may comprise material such as Ta |
scientific concept | terminology used in scientific theory | Peak intensity ratio represents an amount hydrophilic radical |
other | does not belong to the above entity types | Pressure distribution across air bearing surface is substantially symmetrical side |
Table 2 The specification of relation types
Type | Comment | Example |
---|---|---|
spatial relation | specify how one entity is located in relation to others | Gap spacer material is then deposited on the film knife-edge |
part-of | the ownership between two entities | a magnetic head has a magnetoresistive element |
causative relation | one entity operates as a cause of the other entity | Pressure pad carried another arm of spring urges film into contact with head |
operation | specify the relation between an activity and its object | Heat treatment improves the (100) orientation |
made-of | one entity is the material for making the other entity | The thin film head includes a substrate of electrically insulative material |
instance-of | the relation between a class and its instance | At least one of the magnetic layer is a free layer |
attribution | one entity is an attribution of the other entity | The thin film has very high heat resistance of remaining stable at 700.degree |
generating | one entity generates another entity | Buffer layer resistor create impedance that noise introduced to head from disk of drive |
purpose | relation between reason/result | conductor is utilized for producing writing flux in magnetic yoke |
in-manner-of | do something in certain way | The linear array is angled at a skew angle |
alias | one entity is also known under another entity’s name | The bias structure includes an antiferromagnetic layer AFM |
formation | an entity acts as a role of the other entity | Windings are joined at end to form center tapped winding |
comparison | compare one entity to the other | First end is closer to recording media use than second end |
measurement | one entity acts as a way to measure the other entity | This provides a relative permeance of at least 1000 |
other | does not belong to the above types | Then, MR resistance estimate during polishing step is calculated from S value and K value |
There are 1,010 patent abstracts with 3,986 sentences in this corpus. We use a web-based annotation tool named Brat [2] for data labeling, and the annotated data is saved in '.ann' format. The benefit of the '.ann' format is that the annotated data can be displayed and manipulated once TFH_Annotated_Dataset.zip is unzipped under the corresponding repository of Brat.
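For readers who want to process the '.ann' files outside Brat, the following is a minimal parsing sketch based on the Brat standoff format, in which text-bound annotations start with `T` and binary relations start with `R`. The sample annotations and the label spellings are hypothetical and only mirror the entity and relation types listed above; the labels actually used in the dataset files may be spelled differently.

```python
def parse_ann(text):
    """Parse entity (T) and relation (R) lines from Brat standoff text."""
    entities, relations = {}, []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T") and len(fields) >= 3:
            label, span = fields[1].split(" ", 1)
            # Discontinuous spans are written as "start end;start end".
            offsets = [tuple(int(x) for x in part.split()) for part in span.split(";")]
            entities[ann_id] = {"type": label, "offsets": offsets, "text": fields[2]}
        elif ann_id.startswith("R") and len(fields) >= 2:
            label, arg1, arg2 = fields[1].split()
            relations.append({
                "id": ann_id,
                "type": label,
                "arg1": arg1.split(":", 1)[1],
                "arg2": arg2.split(":", 1)[1],
            })
    return entities, relations


# Hypothetical annotations for "a magnetic head has a magnetoresistive element".
sample = (
    "T1\tcomponent 2 15\tmagnetic head\n"
    "T2\tcomponent 22 46\tmagnetoresistive element\n"
    "R1\tpart-of Arg1:T2 Arg2:T1\n"
)
entities, relations = parse_ann(sample)
print(entities["T1"], relations[0])
```

Other Brat annotation kinds (events, attributes, notes) are skipped by this sketch, since the dataset description only mentions entities and relations.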
TFH_Annotated_Dataset contains 22,833 entity mentions and 17,412 semantic relation mentions. With TFH_Annotated_Dataset, we run two information extraction tasks: named entity recognition with BiLSTM-CRF [3] and semantic relation extraction with BiGRU-2ATTENTION [4]. To improve the semantic representation of patent language, the word embeddings are trained on the abstracts of 46,302 patents regarding magnetic heads in hard disk drives, which improves the performance of named entity recognition by 0.3% and semantic relation extraction by about 2% in weighted-average F1, compared to GloVe and the patent word embeddings provided by Risch et al. [5].
For named entity recognition, the weighted-average precision, recall, and F1-value of BiLSTM-CRF at the entity level on the test set are 78.5%, 78.0%, and 78.2%, respectively. Although such performance is acceptable, it is still lower than its performance on general-purpose datasets by more than 10% in F1-value. The main reason is the limited amount of labeled data.
The precision, recall, and F1-value for each type of entity are shown in Fig. 4. As to relation extraction, the weighted-average precision, recall, and F1-value of BiGRU-2ATTENTION on the test set are 89.7%, 87.9%, and 88.6% with no_edge relations, and 32.3%, 41.5%, and 36.3% without no_edge relations.
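The weighted-average figures reported above weight each class by its number of test instances (its support). A minimal sketch of that calculation, using hypothetical per-class scores and supports rather than values from the paper:

```python
def weighted_average(per_class_scores, supports):
    """Average of per-class scores weighted by class support."""
    total = sum(supports)
    if total == 0:
        raise ValueError("supports must sum to a positive number")
    return sum(score * n for score, n in zip(per_class_scores, supports)) / total


# Hypothetical example: a frequent class scored well, a rare class scored poorly.
f1_per_class = [0.9, 0.6]
support = [300, 100]

weighted = weighted_average(f1_per_class, support)
macro = sum(f1_per_class) / len(f1_per_class)
print(weighted, macro)
```

Because frequent classes dominate a weighted average, a very common label can lift the headline figure. That is consistent with the large gap between the relation extraction results with and without no_edge relations.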
Academic citing Chen, L., Xu, S*., Zhu, L. et al. A deep learning based method for extracting semantic information from patent documents. Scientometrics 125, 289–312 (2020). https://doi.org/10.1007/s11192-020-03634-y
Paper link https://link.springer.com/article/10.1007/s11192-020-03634-y
REFERENCE
[1] Pérez-Pérez, M., Pérez-Rodríguez, G., Vazquez, M., Fdez-Riverola, F., Oyarzabal, J., Valencia, A., Lourenço, A., & Krallinger, M. (2017). Evaluation of chemical and gene/protein entity recognition systems at BioCreative V.5: The CEMP and GPRO patents tracks. In Proceedings of the BioCreative V.5 challenge evaluation workshop, pp. 11–18.
[2] Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., & Tsujii, J. I. (2012). BRAT: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 102–107.
[3] Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
[4] Han, X., Gao, T., Yao, Y., Ye, D., Liu, Z., & Sun, M. (2019). OpenNRE: An Open and Extensible Toolkit for Neural Relation Extraction. arXiv preprint arXiv:1301.3781.
[5] Risch, J., & Krestel, R. (2019). Domain-specific word embeddings for patent classification. Data Technologies and Applications, 53(1), 108–122.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Network test comparison.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Comparative experiment of multiple networks on UCAS-AOD dataset (IOU 0.5).
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
Project Overview: Predicting Employee Turnover at Sailsfort Motors
Introduction
This project aims to analyze the factors contributing to employee turnover at Sailsfort Motors, an automobile company. By leveraging a combination of logistic regression and tree-based models, we will identify key predictors of employee turnover and develop strategies to enhance employee retention.
Objectives
Data Description
The dataset includes the following attributes:
- Satisfaction Level: Employee satisfaction level.
- Last Evaluation: Last performance evaluation score.
- Number of Projects: Number of projects the employee has worked on.
- Average Monthly Hours: Average monthly working hours.
- Time Spent at Company: Number of years the employee has been with the company.
- Work Accident: Whether the employee has had a work accident (1: Yes, 0: No).
- Left: Whether the employee has left the company (1: Yes, 0: No).
- Promotion in Last 5 Years: Whether the employee has been promoted in the last five years (1: Yes, 0: No).
- Department: Department the employee belongs to.
- Salary: Salary level (Low, Medium, High).
Methodology
- Data Preprocessing: Clean and preprocess the data to handle missing values, categorical variables, and data normalization.
- Exploratory Data Analysis (EDA): Perform EDA to understand the distribution of data and identify patterns and correlations.
- Feature Engineering: Create relevant features to enhance model performance.
Model Building:
- Logistic Regression: Build a logistic regression model to identify the probability of employee turnover.
- Tree-Based Models: Build tree-based models (e.g., Decision Tree, Random Forest) to capture non-linear relationships and interactions between features.
- Model Evaluation: Evaluate model performance using metrics such as accuracy, precision, recall, and F1-score.
- Insights and Recommendations: Analyze the results to identify key factors leading to employee turnover and provide recommendations to improve retention.
Expected Outcomes
- Predictive Models: Accurate models to predict employee turnover.
- Key Insights: Identification of the most significant factors contributing to employee turnover.
- Retention Strategies: Data-driven recommendations to improve employee satisfaction and retention.
By predicting employee turnover and understanding its driving factors, this project aims to provide valuable insights for Sailsfort Motors to enhance their HR strategies and foster a more stable and satisfied workforce.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Accurate and robust somatic mutation detection is essential for cancer treatment, diagnostics and research. Various analysis pipelines give different results and thus should be systematically evaluated. In this study, we benchmarked 5 commonly used somatic mutation calling pipelines (VarScan, VarDictJava, Mutect2, Strelka2 and FANSe) for their precision, recall and speed, using standard benchmarking datasets based on a series of real-world whole-exome sequencing datasets. All 5 pipelines showed very high precision in all cases, and high recall at mutation frequencies higher than 10%. However, for low-frequency mutations, the pipelines differed widely. FANSe showed the highest accuracy (especially sensitivity) in all cases, and VarScan and VarDictJava outperformed Mutect2 and Strelka2 for low-frequency mutations at all sequencing depths. Flaws in the filters were the major cause of the low sensitivity of the four pipelines other than FANSe. Concerning speed, the FANSe pipeline was 8.8 to 19 times faster than the other pipelines. Our benchmarking results demonstrate the performance of the somatic calling pipelines and provide a reference for choosing such pipelines in cancer applications.
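Precision and recall for a variant caller are typically obtained by comparing its calls with a truth set. The sketch below assumes each variant is identified by a (chromosome, position, reference allele, alternate allele) tuple and that matching is exact; the variants are hypothetical and the benchmark's actual matching rules may differ.

```python
def benchmark_calls(called, truth):
    """Precision and recall of called variants against a truth set (exact matching)."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical variants: (chromosome, position, ref, alt).
truth_set = [
    ("chr1", 1000, "A", "T"),
    ("chr1", 2000, "G", "C"),
    ("chr2", 1500, "C", "T"),
    ("chr3", 500, "T", "G"),
]
pipeline_calls = [
    ("chr1", 1000, "A", "T"),
    ("chr1", 2000, "G", "C"),
    ("chr5", 42, "A", "G"),  # not in the truth set: a false positive
]

precision, recall = benchmark_calls(pipeline_calls, truth_set)
print(precision, recall)
```

A pipeline with strict filters tends to look like this example in the opposite direction: few false positives, hence high precision, but true low-frequency variants removed by the filter, hence low recall.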
https://creativecommons.org/publicdomain/zero/1.0/
Did We Solve the Problem?
The objective of this analysis was to predict high streaming counts on Spotify and perform a detailed cluster analysis to understand user behavior. Here’s a summary of how we addressed each part of the objective:
Prediction of High Streaming Counts:
- Implemented Multiple Models: We utilized several machine learning models including Decision Tree, Random Forest, Gradient Boosting, Support Vector Machine (SVM), and k-Nearest Neighbors (k-NN).
- Comparison and Evaluation: These models were evaluated based on classification metrics like accuracy, precision, recall, and F1-score. The Gradient Boosting and Random Forest models were found to be the most effective in predicting high streaming counts.
Cluster Analysis:
- K-means Clustering: We applied K-means clustering to segment users into three clusters based on their listening behavior.
- Detailed Characterization: Each cluster was analyzed to understand the distinct characteristics, such as average playtime, skip rate, offline usage, and shuffle usage.
- Visualizations: Histograms and scatter plots were used to visualize the distributions and relationships within each cluster.
Results and Insights
- Effective Models: The Gradient Boosting and Random Forest models provided the highest accuracy and balanced performance for predicting high streaming counts.
- User Segmentation: The cluster analysis revealed three distinct user segments:
  - Cluster 1: Users with longer playtimes and lower skip rates.
  - Cluster 2: Users with moderate playtimes and skip rates.
  - Cluster 3: Users with shorter playtimes and higher skip rates.
These insights can be leveraged for targeted marketing, personalized recommendations, and improving user engagement on Spotify.
Conclusion
Yes, we solved the problem. We successfully predicted high streaming counts using effective machine learning models and provided a detailed cluster analysis to understand user behavior. The analysis offers valuable insights for enhancing Spotify’s recommendation system and user experience.
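The K-means segmentation described above can be sketched in a few lines of plain Python. This is only an illustration of the technique (Lloyd's algorithm), not the analysis's actual code: the six users, the two pre-scaled features (playtime fraction and skip rate) and the starting centroids are all invented for the example.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels


# Hypothetical users as (playtime fraction, skip rate), both already scaled to [0, 1].
users = [(0.9, 0.1), (0.8, 0.2), (0.5, 0.5), (0.6, 0.4), (0.1, 0.9), (0.2, 0.8)]
start = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]  # assumed initial centroids
centroids, labels = kmeans(users, start)
print(labels)
print(centroids)
```

Each resulting centroid can be read as the "average user" of its segment, which is how characterizations such as "longer playtimes and lower skip rates" arise. In practice the features should be put on a comparable scale first, since K-means relies on Euclidean distance.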
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artificial intelligence (AI) development in the health sector has recently become crucial. Early medical information, identification, diagnosis, classification, and analysis, along with viable remedies, are always beneficial developments. Precise and consistent image classification is critical for diagnosis and tactical decisions in healthcare. The core issue with image classification is the semantic gap. Conventional machine learning algorithms for classification rely mainly on low-level rather than high-level characteristics and employ handcrafted features to close the gap, which demands intensive feature extraction and classification approaches. Deep learning is a powerful tool that has advanced considerably in recent years, with deep convolutional neural networks (CNNs) succeeding in image classification. The main goal is to bridge the semantic gap and enhance the classification performance of multi-modal medical images using the deep learning-based model ResNet50. The data set included 28378 multi-modal medical images to train and validate the model. Overall accuracy, precision, recall, and F1-score were calculated as evaluation parameters. The proposed model classifies medical images more accurately than other state-of-the-art methods. The experiment attained an accuracy of 98.61%. The study directly benefits the health service.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Preeclampsia is a potentially life-threatening pregnancy complication. Among women whose pregnancies are complicated by preeclampsia, the Preeclampsia Integrated Estimate of RiSk (PIERS) models (i.e., the PIERS Machine Learning [PIERS-ML] model, and the logistic regression-based fullPIERS model) accurately identify individuals at greatest or least risk of adverse maternal outcomes within 48 h following admission. Both models were developed and validated to be used as part of initial assessment. In the United Kingdom, the National Institute for Health and Care Excellence (NICE) recommends repeated use of such static models for ongoing assessment beyond the first 48 h. This study evaluated the models’ performance during such consecutive prediction.
Methods and findings: This multicountry prospective study used data of 8,843 women (32% white, 30% black, and 26% Asian) with a median age of 31 years. These women, admitted to maternity units in the Americas, sub-Saharan Africa, South Asia, Europe, and Oceania, were diagnosed with preeclampsia at a median gestational age of 35.79 weeks between 2003 and 2016. The risk differentiation performance of the PIERS-ML and fullPIERS models was assessed for each day within a 2-week post-admission window. The PIERS adverse maternal outcome includes one or more of: death, end-organ complication (cardiorespiratory, renal, hepatic, etc.), or uteroplacental dysfunction (e.g., placental abruption). The main outcome measures were: trajectories of mean risk of each of the uncomplicated course and adverse outcome groups; daily area under the precision-recall curve (AUC-PRC); potential clinical impact (i.e., net benefit in decision curve analysis); dynamic shifts of multiple risk groups; and daily likelihood ratios. In the 2-week window, the number of daily outcome events decreased from over 200 to around 10.
For both PIERS-ML and fullPIERS models, we observed consistently higher mean risk in the adverse outcome (versus uncomplicated course) group. The AUC-PRC values (0.2–0.4) of the fullPIERS model remained low (i.e., close to the daily fraction of adverse outcomes, indicating low discriminative capacity). The PIERS-ML model’s AUC-PRC peaked on day 0 (0.65), and notably decreased thereafter. When categorizing women into multiple risk groups, the PIERS-ML model generally showed good rule-in capacity for the “very high” risk group, with positive likelihood ratio values ranging from 70.99 to infinity, and good rule-out capacity for the “very low” risk group where most negative likelihood ratio values were 0. However, performance declined notably for other risk groups beyond 48 h. Decision curve analysis revealed a diminishing advantage for treatment guided by both models over time. The main limitation of this study is that the baseline performance of the PIERS-ML model was assessed on its development data; however, its baseline performance has also undergone external evaluation.
Conclusions: In this study, we have evaluated the performance of the fullPIERS and PIERS-ML models for consecutive prediction. We observed deteriorating performance of both models over time. We recommend using the models for consecutive prediction with greater caution and interpreting predictions with increasing uncertainty as the pregnancy progresses. For clinical practice, models should be adapted to retain accuracy when deployed serially. The performance of future models can be compared with the results of this study to quantify their added value.
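To see how a positive likelihood ratio can be infinite and a negative likelihood ratio can be 0, the sketch below uses the standard definitions for a single binary cut-off: LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity. The counts are invented, the function name `likelihood_ratios` is hypothetical, and the study may have computed its risk-group likelihood ratios differently (for example per risk stratum).

```python
import math


def likelihood_ratios(tp, fp, fn, tn):
    """Return (LR+, LR-) from the counts of a 2x2 confusion table."""
    sensitivity = tp / (tp + fn)  # fraction of adverse outcomes flagged
    specificity = tn / (tn + fp)  # fraction of uncomplicated courses not flagged
    # LR+ is unbounded when there are no false positives (specificity == 1).
    lr_pos = math.inf if specificity == 1.0 else sensitivity / (1.0 - specificity)
    # LR- is unbounded when there are no true negatives (specificity == 0).
    lr_neg = math.inf if specificity == 0.0 else (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


# Hypothetical "very high" risk cut-off: no woman without an adverse outcome is flagged.
lr_pos_high, lr_neg_high = likelihood_ratios(tp=20, fp=0, fn=30, tn=950)

# Hypothetical "very low" risk cut-off: every woman with an adverse outcome is above it.
lr_pos_low, lr_neg_low = likelihood_ratios(tp=50, fp=600, fn=0, tn=350)

print(lr_pos_high, lr_neg_high)
print(lr_pos_low, lr_neg_low)
```

With few daily outcome events, as reported for the later days of the 2-week window, zero cells in the table become more likely, so extreme values such as infinity or 0 should be read with that small-sample caveat in mind.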
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this table we present a more detailed view of performance for our method's poorest predicted network ( total regulatory interactions with up to regulators controlling each gene). The table inline method precision [%] at varying degrees of completeness (recall [%]).