51 datasets found

Network 5 (Yeast-3): Methods precision for low-to-high recall values.
plos.figshare.com
xls
Updated May 31, 2023
Cite
Aviv Madar; Alex Greenfield; Eric Vanden-Eijnden; Richard Bonneau (2023). Network 5 (Yeast-3): Methods precision for low-to-high recall values. [Dataset]. http://doi.org/10.1371/journal.pone.0009803.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0009803.t002
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Aviv Madar; Alex Greenfield; Eric Vanden-Eijnden; Richard Bonneau
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this table we present a more detailed view of performance for our method's poorest predicted network (total regulatory interactions with up to regulators controlling each gene). The table reports method precision [%] at varying degrees of completeness (recall [%]).
Considering patient clinical history impacts performance of machine learning...
plos.figshare.com
pdf
Updated Jun 2, 2023
Cite
Ruggiero Seccia; Daniele Gammelli; Fabio Dominici; Silvia Romano; Anna Chiara Landi; Marco Salvetti; Andrea Tacchella; Andrea Zaccaria; Andrea Crisanti; Francesca Grassi; Laura Palagi (2023). Considering patient clinical history impacts performance of machine learning models in predicting course of multiple sclerosis [Dataset]. http://doi.org/10.1371/journal.pone.0230219
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0230219
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Ruggiero Seccia; Daniele Gammelli; Fabio Dominici; Silvia Romano; Anna Chiara Landi; Marco Salvetti; Andrea Tacchella; Andrea Zaccaria; Andrea Crisanti; Francesca Grassi; Laura Palagi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Multiple Sclerosis (MS) progresses at an unpredictable rate, but predictions on the disease course in each patient would be extremely useful to tailor therapy to the individual needs. We explore different machine learning (ML) approaches to predict whether a patient will shift from the initial Relapsing-Remitting (RR) to the Secondary Progressive (SP) form of the disease, using only “real world” data available in clinical routine. The clinical records of 1624 outpatients (207 in the SP phase) attending the MS service of Sant’Andrea hospital, Rome, Italy, were used. Predictions at 180, 360 or 720 days from the last visit were obtained considering either the data of the last available visit (Visit-Oriented setting), comparing four classical ML methods (Random Forest, Support Vector Machine, K-Nearest Neighbours and AdaBoost) or the whole clinical history of each patient (History-Oriented setting), using a Recurrent Neural Network model, specifically designed for historical data. Missing values were handled by removing either all clinical records presenting at least one missing parameter (Feature-saving approach) or the 3 clinical parameters which contained missing values (Record-saving approach). The performances of the classifiers were rated using common indicators, such as Recall (or Sensitivity) and Precision (or Positive predictive value). In the visit-oriented setting, the Record-saving approach yielded Recall values from 70% to 100%, but low Precision (5% to 10%), which however increased to 50% when considering only predictions for which the model returned a probability above a given “confidence threshold”. For the History-oriented setting, both indicators increased as prediction time lengthened, reaching values of 67% (Recall) and 42% (Precision) at 720 days. We show how “real world” data can be effectively used to forecast the evolution of MS, leading to high Recall values and propose innovative approaches to improve Precision towards clinically useful values.
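The effect of a "confidence threshold" on Precision and Recall can be illustrated with a minimal sketch. The labels and probabilities below are invented for illustration and are not taken from the study; the function simply treats a prediction as positive when its probability reaches the threshold, which is one plausible reading of the description above and not necessarily the authors' exact procedure.

```python
def precision_recall(y_true, y_prob, threshold=0.5):
    """Return (precision, recall) when predicting positive for prob >= threshold."""
    tp = fp = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted_positive = prob >= threshold
        if predicted_positive and truth:
            tp += 1
        elif predicted_positive and not truth:
            fp += 1
        elif truth:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Hypothetical data: 1 = patient shifted to SP, 0 = patient stayed RR.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.6, 0.55, 0.8, 0.4, 0.52, 0.3, 0.45, 0.7, 0.2]

low = precision_recall(y_true, y_prob, threshold=0.4)    # permissive threshold
high = precision_recall(y_true, y_prob, threshold=0.75)  # strict threshold
print("threshold 0.40:", low)
print("threshold 0.75:", high)
```

In this toy example, raising the threshold removes false positives (higher Precision) at the cost of missing one true case (lower Recall), which is the kind of trade-off the abstract describes.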
Hybrid MSRM-Based Deep Learning and Multitemporal Sentinel 2-Based Machine...
dataverse.csuc.cat
Updated Sep 22, 2022
Cite
Hèctor A. Orengo Romeu (2022). Hybrid MSRM-Based Deep Learning and Multitemporal Sentinel 2-Based Machine Learning Algorithm [Dataset]. http://doi.org/10.34810/data242
Explore at:
Available download formats: text/plain; charset=utf-8 (23224)
Unique identifier
https://doi.org/10.34810/data242
Dataset updated
Sep 22, 2022
Dataset provided by
CORA.Repositori de Dades de Recerca
Authors
Hèctor A. Orengo Romeu
License
https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34810/data242
Dataset funded by
Spanish Ministry of Science and Innovation
European Union’s Horizon 2020 research and innovation programme
Ayuda a Equipos de Investigación Científica of the Fundación BBVA
Nvidia Hardware Grant Programme
Description
JavaScript code to be implemented in Google Earth Engine© for the Hybrid MSRM-Based Deep Learning and Multitemporal Sentinel 2-Based Machine Learning Algorithm, an algorithm for large-scale automatic detection of burial mounds, one of the most common types of archaeological sites globally, using LiDAR and multispectral satellite data. Although previous attempts were able to detect a good proportion of the known mounds in a given area, they still presented high numbers of false positives and low precision values. Our proposed approach combines random forest for soil classification using multitemporal multispectral Sentinel-2 data and a deep learning model using YOLOv3 on LiDAR data previously pre-processed using a multi-scale relief model. The resulting algorithm significantly improves on previous attempts with a detection rate of 89.5%, an average precision of 66.75%, a recall value of 0.64 and a precision of 0.97, which allowed, with a small set of training data, the detection of 10,527 burial mounds over an area of nearly 30,000 km², the largest in which such an approach has ever been applied. The open code and platforms employed to develop the algorithm allow this method to be applied anywhere LiDAR data or high-resolution digital terrain models are available.
PHM2017 Dataset
paperswithcode.com
opendatalab.com
Cite
Payam Karisani; Eugene Agichtein, PHM2017 Dataset [Dataset]. https://paperswithcode.com/dataset/phm2017
Explore at:
Authors
Payam Karisani; Eugene Agichtein
Description
PHM2017 is a new dataset consisting of 7,192 English tweets across six diseases and conditions: Alzheimer’s Disease, heart attack (any severity), Parkinson’s disease, cancer (any type), Depression (any severity), and Stroke. The Twitter search API was used to retrieve the data using the colloquial disease names as search keywords, with the expectation of retrieving a high-recall, low precision dataset. After removing the re-tweets and replies, the tweets were manually annotated. The labels are:

self-mention. The tweet contains a health mention with a health self-report of the Twitter account owner, e.g., "However, I worked hard and ran for Tokyo Mayer Election Campaign in January through February, 2014, without publicizing the cancer."

other-mention. The tweet contains a health mention of a health report about someone other than the account owner, e.g., "Designer with Parkinson’s couldn’t work then engineer invents bracelet + changes her world"

awareness. The tweet contains the disease name, but does not mention a specific person, e.g., "A Month Before a Heart Attack, Your Body Will Warn You With These 8 Signals"

non-health. The tweet contains the disease name, but the tweet topic is not about health, e.g., "Now I can have cancer on my wall for all to see <3"
Cloud Mask Generation (Sentinel-2)
hub.arcgis.com
angola.africageoportal.com
+3 more
Updated Jul 25, 2022
Cite
Esri (2022). Cloud Mask Generation (Sentinel-2) [Dataset]. https://hub.arcgis.com/content/1e1ec9602f4743108708ccdf362e3c48
Explore at:
Dataset updated
Jul 25, 2022
Dataset authored and provided by
Esri (http://esri.com/)
Description
Satellite imagery has several applications, including land use and land cover classification, change detection, object detection, etc. Satellite based remote sensing sensors often encounter cloud coverage due to which clear imagery of earth is not collected. The clouded regions should be excluded, or cloud removal algorithms must be applied, before the imagery can be used for analysis. Most of these preprocessing steps require a cloud mask. In case of single-scene imagery, though tedious, it is relatively easy to manually create a cloud mask. However, for a larger number of images, an automated approach for identifying clouds is necessary. This model can be used to automatically generate a cloud mask from Sentinel-2 imagery.

Using the model

Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS.

Fine-tuning the model

This model can be fine-tuned using the Train Deep Learning Model tool. Follow the guide to fine-tune this model.

Input

Sentinel-2 L2A imagery in the form of a raster, mosaic dataset or image service.

Output

Classified raster containing three classes: Low density, Medium density and High density.

Applicable geographies

This model is expected to work well in Europe and the United States. This model works well for land based areas. Large water bodies such as ocean, seas and lakes should be avoided.

Model architecture

This model uses the UNet model architecture implemented in ArcGIS API for Python.

Accuracy metrics

This model has an overall accuracy of 94 percent with L2A imagery. The table below summarizes the precision, recall and F1-score of the model on the validation dataset. The comparatively low precision, recall and F1 score for Low density clouds might cause false detection of such clouds in certain urban areas. Also, for certain seasonal clouds some extremely bright pixels might be missed out.

Class	Precision	Recall	F1 score
High density	0.960	0.975	0.968
Medium density	0.905	0.897	0.901
Low density	0.774	0.571	0.657

Sample results

Here are a few results from the model.
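As a sanity check on the accuracy metrics above, the F1 column can be recomputed from the reported precision and recall. The sketch below assumes the standard harmonic-mean definition of F1; small differences in the third decimal are expected because the published precision and recall are themselves rounded.

```python
# Reported validation metrics per class: (precision, recall, reported F1).
reported = {
    "High density": (0.960, 0.975, 0.968),
    "Medium density": (0.905, 0.897, 0.901),
    "Low density": (0.774, 0.571, 0.657),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {
    name: f1_score(p, r) for name, (p, r, _) in reported.items()
}

for name, (p, r, f1_reported) in reported.items():
    diff = abs(recomputed[name] - f1_reported)
    print(f"{name}: recomputed {recomputed[name]:.3f}, reported {f1_reported:.3f}, diff {diff:.4f}")
```

All three recomputed values agree with the reported F1 scores to within rounding.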
Assignment of nine new bounding boxes by K-means.
plos.figshare.com
xls
Updated Jun 4, 2023
Cite
YunYan Wang; Huaxuan Wu; Luo Shuai; Chen Peng; Zhiwei Yang (2023). Assignment of nine new bounding boxes by K-means. [Dataset]. http://doi.org/10.1371/journal.pone.0265503.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0265503.t001
Dataset updated
Jun 4, 2023
Dataset provided by
PLOS ONE
Authors
YunYan Wang; Huaxuan Wu; Luo Shuai; Chen Peng; Zhiwei Yang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Assignment of nine new bounding boxes by K-means.
Fruit Detection Workflow Dataset
universe.roboflow.com
zip
Updated Apr 2, 2025
Cite
FruitDetectionYolo10s (2025). Fruit Detection Workflow Dataset [Dataset]. https://universe.roboflow.com/fruitdetectionyolo10s/fruit-detection-workflow/model/1
Explore at:
Available download formats: zip
Dataset updated
Apr 2, 2025
Dataset authored and provided by
FruitDetectionYolo10s
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Items Bounding Boxes
Description
Objective

This project focuses on developing an object detection model using the YOLOv11 architecture. The primary goal is to accurately detect and classify objects within images across three distinct classes. The model was trained for 250 epochs to achieve high performance in terms of mean Average Precision (mAP), Precision, and Recall.

Dataset Information

- Number of Images: 300
- Number of Annotations: 582
- Classes: 3
- Average Image Size: 0.30 megapixels
- Image Size Range: 0.03 megapixels to 11.83 megapixels
- Median Image Ratio: 648x500 pixels

Dataset Split

Train Set: 90% (270 images)

Validation Set: 8% (24 images)

Test Set: 2% (6 images)
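The split above can be reproduced in outline with a short sketch. The shuffling seed and the use of integer IDs in place of image files are assumptions made for illustration; the source does not say how images were assigned to each subset.

```python
import random


def split_dataset(items, train=0.90, valid=0.08, seed=0):
    """Shuffle items and split into train/validation/test (test takes the remainder)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_valid = round(len(items) * valid)
    train_set = items[:n_train]
    valid_set = items[n_train:n_train + n_valid]
    test_set = items[n_train + n_valid:]
    return train_set, valid_set, test_set


# Hypothetical image IDs standing in for the 300 images.
train_set, valid_set, test_set = split_dataset(range(300))
print(len(train_set), len(valid_set), len(test_set))
```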

Preprocessing

- Auto-Orient: Applied to ensure correct image orientation.
- Resize: Images were stretched to a uniform size of 640x640 pixels to maintain consistency across the dataset.

Augmentations

- Outputs per Training Example: 3 augmented outputs were generated for each training example to enhance the diversity of the training data.
- Crop: Random cropping was applied with a minimum zoom of 0% and a maximum zoom of 8%.
- Rotation: Images were randomly rotated between -8° and +8° to improve the model's robustness to different orientations.

Training and Performance

The model was trained for 250 epochs, and the following performance metrics were achieved:

- mAP (mean Average Precision): 90.4%
- Precision: 87.7%
- Recall: 83.4%

These metrics indicate that the model is highly effective in detecting and classifying objects within the images, with a strong balance between precision and recall.

Key Insights

- mAP: The high mAP score of 90.4% suggests that the model is accurate in predicting the correct bounding boxes and class labels for objects in the dataset.
- Precision: A precision of 87.7% indicates that most of the model's detections are correct, i.e., it produces relatively few false positives.
- Recall: The recall of 83.4% shows that the model is capable of detecting most of the relevant objects in the images.

Visualization

The training process was monitored using various metrics, including mAP, Box Loss, Class Loss, and Object Loss. The visualizations show the progression of these metrics over the 250 epochs, demonstrating the model's learning and improvement over time.

Conclusion

The project successfully implemented and trained an object detection model using the YOLOv11 architecture. The achieved performance metrics highlight the model's effectiveness and reliability in detecting objects across different classes. This model can be further refined and applied to real-world applications for object detection tasks.
Mushroom Classification Enhanced
kaggle.com
Updated Jul 18, 2024
Cite
Mo_der Steven (2024). Mushroom Classification Enhanced [Dataset]. https://www.kaggle.com/datasets/sakurapuare/mushroom-classification-enhanced/discussion
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 18, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Mo_der Steven
Description
NOTE

The data for this competition is from the RAICOM Mission Application Competition and Mo in China, originating from https://www.kaggle.com/datasets/uciml/mushroom-classification/

The copyright of datasets belongs to the organizers of "RAICOM Mission Application Competition"

Baseline

The result of Official Baseline is:

Accuracy: 0.7464409388226241
Precision: 0.7591353576942872
Recall: 0.6344086021505376
F1: 0.6911902530459232
Confusion matrix: [[2405 468] [ 850 1475]]
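The baseline figures can be recomputed from the confusion matrix. The sketch below assumes the matrix is laid out as [[TN, FP], [FN, TP]] (rows are actual classes, columns are predicted classes); the source does not state the layout, but this reading reproduces the reported accuracy, precision, recall and F1.

```python
# Assumed layout: [[TN, FP], [FN, TP]].
confusion = [[2405, 468], [850, 1475]]
(tn, fp), (fn, tp) = confusion

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.6f}")
print(f"Precision: {precision:.6f}")
print(f"Recall:    {recall:.6f}")
print(f"F1:        {f1:.6f}")
```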

Background

Mushrooms are a beloved delicacy among people, but beneath their glamorous appearance, they may harbor deadly dangers. China is one of the countries with the largest variety of mushrooms in the world. At the same time, mushroom poisoning is one of the most serious food safety issues in China. According to relevant reports, in 2021, China conducted research on 327 mushroom poisoning incidents, involving 923 patients and 20 deaths, with a total mortality rate of 2.17%. For non professionals, it is impossible to distinguish between poisonous mushrooms and edible mushrooms based on their appearance, shape, color, etc. There is no simple standard that can distinguish between poisonous mushrooms and edible mushrooms. To determine whether mushrooms are edible, it is necessary to collect mushrooms with different characteristic attributes and analyze whether they are toxic. In this competition, 22 characteristic attributes of mushrooms were analyzed to obtain a mushroom usability model, which can better predict whether mushrooms are edible.

Metrics

In the context of this mushroom usability model competition, several performance metrics can be utilized to evaluate the predictive accuracy of the model. Among them, the F1 score stands out due to its ability to provide a balance between precision and recall, which are crucial for this classification problem where distinguishing between poisonous and edible mushrooms can have severe real-world implications.

F1 Score The F1 score is the harmonic mean of precision and recall, and it is particularly useful in binary classification scenarios with imbalanced class distribution:

Precision (also known as positive predictive value) indicates the proportion of true positive observations among all observations classified as positive. It measures the accuracy of the positive predictions. \( \text{Precision} = \frac{TP}{TP + FP} \)

Recall (also known as sensitivity or true positive rate) measures the proportion of true positive observations out of all actual positives. It assesses the ability to capture all the true positive instances. \( \text{Recall} = \frac{TP}{TP + FN} \)

The F1 score is calculated as follows:

\[ \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

Why F1 Score?

Balance Between Precision and Recall: In a context where mushroom classification errors can have critical health impacts, favoring either precision or recall alone might be dangerous. The F1 score provides a more comprehensive evaluation by balancing these errors.

Handling Imbalanced Classes: Mushroom datasets often have an imbalance between the number of edible and poisonous instances. The F1 score is less influenced by the skewed class distributions compared to accuracy.

Critical Application: Misclassifying a poisonous mushroom as edible can lead to severe health risks. Hence, ensuring both high precision (minimizing false positives) and high recall (capturing all true positives) is crucial. The F1 score encapsulates the tradeoff between these aspects well.
Frames per second test comparison.
plos.figshare.com
xls
Updated Jun 5, 2023
Cite
YunYan Wang; Huaxuan Wu; Luo Shuai; Chen Peng; Zhiwei Yang (2023). Frames per second test comparison. [Dataset]. http://doi.org/10.1371/journal.pone.0265503.t008
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0265503.t008
Dataset updated
Jun 5, 2023
Dataset provided by
PLOS ONE
Authors
YunYan Wang; Huaxuan Wu; Luo Shuai; Chen Peng; Zhiwei Yang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Frames per second test comparison.
Data from: Mining texts to efficiently generate global data on political...
dataverse.harvard.edu
Updated Jul 8, 2015
Cite
Shahryar Minhas; Jay Ulfelder; Michael D. Ward (2015). Mining texts to efficiently generate global data on political regime types [Dataset]. http://doi.org/10.7910/DVN/8MC1LO
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/8MC1LO
Dataset updated
Jul 8, 2015
Dataset provided by
Harvard Dataverse
Authors
Shahryar Minhas; Jay Ulfelder; Michael D. Ward
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
We describe the design and results of an experiment in using text-mining and machine-learning techniques to generate annual measures of national political regime types. Valid and reliable measures of countries’ forms of national government are essential to cross-national and dynamic analysis of many phenomena of great interest to political scientists, including civil war, interstate war, democratization, and coups d’état. Unfortunately, traditional measures of regime type are very expensive to produce, and observations for ambiguous cases are often sharply contested. In this project, we train a series of support vector machine (SVM) classifiers to infer regime type from textual data sources. To train the classifiers, we used vectorized textual reports from Freedom House and the State Department as features for a training set of prelabeled regime type data. To validate our SVM classifiers, we compare their predictions in an out-of-sample context, and the performance results across a variety of metrics (accuracy, precision, recall) are very high. The results of this project highlight the ability of these techniques to contribute to producing real-time data sources for use in political science that can also be routinely updated at much lower cost than human-coded data. To this end, we set up a text-processing pipeline that pulls updated textual data from selected sources, conducts feature extraction, and applies supervised machine learning methods to produce measures of regime type. This pipeline, written in Python, can be pulled from the Github repository associated with this project and easily extended as more data becomes available.
Replication Data for: Classification of behaviors of free-ranging cattle...
search.dataone.org
dataverse.azure.uit.no
Updated Sep 25, 2024
Cite
Versluijs, Erik (2024). Replication Data for: Classification of behaviors of free-ranging cattle using accelerometry signatures collected by virtual fence collars [Dataset]. http://doi.org/10.18710/ND4CLL
Explore at:
Unique identifier
https://doi.org/10.18710/ND4CLL
Dataset updated
Sep 25, 2024
Dataset provided by
DataverseNO
Authors
Versluijs, Erik
Time period covered
Jun 22, 2021 - Jul 30, 2021
Description
This dataset includes the scripts to reproduce the models presented in the paper. The cleaned data used for the analyses is also available. Abstract of the article: Precision farming technology, including GPS collars with biologging, has revolutionized remote livestock monitoring in extensive grazing systems. High resolution accelerometry can be used to infer the behavior of an animal. Previous behavioral classification studies using accelerometer data have focused on a few key behaviors and were mostly conducted in controlled situations. Here, we conducted behavioral observations of 38 beef cows (Hereford, Limousine, Charolais, Simmental/NRF/Hereford mix) free-ranging in rugged, forested areas, and fitted with a commercially available virtual fence collar (Nofence) containing a 10Hz tri-axial accelerometer. We used random forest models to calibrate data from the accelerometers on both commonly documented (e.g., feeding, resting, walking) and rarer (e.g., suckling calf, head butting, allogrooming) behaviors. Our goal was to assess pre-processing decisions including different running mean intervals (smoothing window of 1, 5, or 20 seconds), collar orientation and feature selection (orientation-dependent versus orientation-independent features). We identified the 10 most common behaviors exhibited by the cows. Models based only on orientation-independent features did not perform better than models based on orientation-dependent features, despite variation in how collars were attached (direction and tightness). Using a 20 seconds running mean and orientation-dependent features resulted in the highest model performance (model accuracy: 0.998, precision: 0.991, and recall: 0.989). We also used this model to add 11 rarer behaviors (each < 0.1% of the data; e.g. head butting, throwing head, self-grooming). 
These rarer behaviors were predicted with less accuracy because they were not observed at all for some individuals, but overall model performance remained high (accuracy, precision, recall >98%). Our study suggests that the accelerometers in the Nofence collars are suitable to identify the most common behaviors of free-ranging cattle. The results of this study could be used in future research for understanding cattle habitat selection in rugged forest ranges, herd dynamics, or responses to stressors such as carnivores, as well as to improve cattle management and welfare.
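The running-mean pre-processing step mentioned in the abstract can be sketched as follows. This is a minimal illustration assuming a trailing window over a single accelerometer axis sampled at 10 Hz; the published scripts may use a different window alignment or handle the start of the series differently.

```python
def running_mean(samples, window_seconds, rate_hz=10):
    """Trailing running mean; early samples average over what is available so far."""
    width = max(1, int(window_seconds * rate_hz))
    smoothed = []
    total = 0.0
    for i, value in enumerate(samples):
        total += value
        if i >= width:
            total -= samples[i - width]  # drop the sample that left the window
        smoothed.append(total / min(i + 1, width))
    return smoothed


# Hypothetical single-axis readings; a 0.2 s window at 10 Hz spans 2 samples.
demo = running_mean([1, 2, 3, 4], window_seconds=0.2, rate_hz=10)
print(demo)

# Window widths (in samples) for the 1, 5 and 20 second settings at 10 Hz.
widths = {seconds: int(seconds * 10) for seconds in (1, 5, 20)}
print(widths)
```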
Datasets and predictions for sRNA-mRNA interactions in E. coli - kGraphRNA,...
zenodo.org
data.niaid.nih.gov
zip
Updated Nov 3, 2024
Cite
Shani Cohen; Lior Rokach; Isana Veksler-Lublinsky (2024). Datasets and predictions for sRNA-mRNA interactions in E. coli - kGraphRNA, GraphRNA, sInterRF, and sInterXGB models [Dataset]. http://doi.org/10.5281/zenodo.14030380
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.14030380
Dataset updated
Nov 3, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Shani Cohen; Lior Rokach; Isana Veksler-Lublinsky
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Bacterial small RNAs (sRNAs) are pivotal in post-transcriptional regulation, affecting functions like virulence, metabolism, and gene expression by binding specific mRNA targets. Identifying these targets is crucial to understanding sRNA regulation across species. Despite advancements in high-throughput (HT) experimental methods, they remain technically challenging and are limited to detecting sRNA-target interactions under specific environmental conditions. Therefore, computational approaches, especially machine learning (ML), are essential for identifying strong candidates for biological validation.

In this study, we hypothesize that ML models trained on large-scale interaction data from specific conditions can accurately predict new interactions in unseen conditions within the same bacterial strain. To test this, we developed models from two families: (1) graph neural networks (GNNs), including GraphRNA and kGraphRNA, that learn transformed representations of interacting sRNA-mRNA pairs via graph relationships, and (2) decision forests, sInterRF (Random Forest) and sInterXGB (XGBoost), which use various interaction features for prediction. We also proposed Summation Ensemble Models (SEM) that combine scores from multiple models. Across three seen-to-unseen conditions evaluations, our models —particularly kGraphRNA— significantly improved the area under the ROC curve (AUC) and Precision-Recall curve (PR-AUC) compared to sRNARFTarget, CopraRNA, and RNAup. The SEM model combining GraphRNA and CopraRNA outperformed CopraRNA alone on a low-throughput (LT) interactions test set (HT-to-LT).
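The idea of a summation ensemble can be sketched in a few lines. The description above only says that SEM combines scores from multiple models; the min-max normalization, the pair identifiers, and the score values below are all assumptions made for illustration, not the authors' procedure.

```python
def min_max(scores):
    """Rescale a {pair: score} mapping to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {pair: (value - lo) / span if span else 0.0 for pair, value in scores.items()}


def summation_ensemble(*model_scores):
    """Sum normalized scores over the pairs that every model has scored."""
    normalized = [min_max(scores) for scores in model_scores]
    shared = set.intersection(*(set(scores) for scores in normalized))
    return {pair: sum(scores[pair] for scores in normalized) for pair in shared}


# Hypothetical (sRNA, mRNA) pairs and scores where higher means more likely to interact.
# A p-value based score would first need converting so that higher is better.
model_a = {("sRNA1", "mRNA1"): 0.2, ("sRNA1", "mRNA2"): 0.8}
model_b = {("sRNA1", "mRNA1"): 10.0, ("sRNA1", "mRNA2"): 30.0}

combined = summation_ensemble(model_a, model_b)
ranked = sorted(combined, key=combined.get, reverse=True)
print(ranked)
```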

This data source provides the HT and LT interaction datasets used for our study. In addition, we provide the prediction scores of our models: kGraphRNA, GraphRNA, sInterRF, and sInterXGB for any pair of sRNA and mRNA of Escherichia coli K12 MG1655 (NC_000913). We also provide the true labels and the CopraRNA p-value scores computed for all possible pairs. Note that prediction scores are not provided for sRNA-mRNA pairs that were used to train the models, i.e., all the labeled interactions (HT and LT) and negative interactions sampled randomly (see our paper for more details).

For convenience, each CSV file contains the scores of a single sRNA with the following information: accession IDs, locus tags, and names of the sRNA and the mRNA; CopraRNA p-value (if available); the prediction scores of the kGraphRNA, GraphRNA, sInterRF, and sInterXGB models; true label (if available): 1 for interaction and 0 for non-interaction; whether the sRNA-mRNA pair was sampled for the train set as a random negative sample (true or false).

TFH_Annotated_Dataset Dataset

paperswithcode.com

Updated Sep 6, 2022

Cite

(2022). TFH_Annotated_Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/tfh-annotated-dataset

Explore at:

Dataset updated

Sep 6, 2022

Description

Dataset Introduction

TFH_Annotated_Dataset is an annotated patent dataset pertaining to thin film head technology in hard disk drives. To the best of our knowledge, this is the second publicly available labeled patent dataset in the technology management domain that annotates both entities and the semantic relations between entities; the first one is [1].

The well-crafted information schema used for patent annotation contains 17 types of entities and 15 types of semantic relations as shown below.

Table 1 The specification of entity types

Type	Comment	Example
physical flow	substance that flows freely	The etchant solution has a suitable solvent additive such as glycerol or methyl cellulose
information flow	information data	A camera using a film having a magnetic surface for recording magnetic data thereon
energy flow	entity relevant to energy	Conductor is utilized for producing writing flux in magnetic yoke
measurement	method of measuring something	The curing step takes place at the substrate temperature less than 200.degree
value	numerical amount	The curing step takes place at the substrate temperature less than 200.degree
location	place or position	The legs are thinner near the pole tip than in the back gap region
state	particular condition at a specific time	The MR elements are biased to operate in a magnetically unsaturated mode
effect	change caused by an innovation	Magnetic disk system permits accurate alignment of magnetic head with spaced tracks
function	manufacturing technique or activity	A magnetic head having highly efficient write and read functions is thereby obtained
shape	the external form or outline of something	Recess is filled with non-magnetic material such as glass
component	a part or element of a machine	A pole face of yoke is adjacent edge of element remote from surface
attribution	a quality or feature of something	A pole face of yoke is adjacent edge of element remote from surface
consequence	The result caused by something or activity	This prevents the slider substrate from electrostatic damage
system	a set of things working together as a whole	A digital recording system utilizing a magnetoresistive transducer in a magnetic recording head
material	the matter from which a thing is made	Interlayer may comprise material such as Ta
scientific concept	terminology used in scientific theory	Peak intensity ratio represents an amount hydrophilic radical
other	does not belong to the above entity types	Pressure distribution across air bearing surface is substantially symmetrical side

Table 2 The specification of relation types

Type	Comment	Example
spatial relation	specify how one entity is located in relation to others	Gap spacer material is then deposited on the film knife-edge
part-of	the ownership between two entities	a magnetic head has a magnetoresistive element
causative relation	one entity operates as a cause of the other entity	Pressure pad carried another arm of spring urges film into contact with head
operation	specify the relation between an activity and its object	Heat treatment improves the (100) orientation
made-of	one entity is the material for making the other entity	The thin film head includes a substrate of electrically insulative material
instance-of	the relation between a class and its instance	At least one of the magnetic layer is a free layer
attribution	one entity is an attribution of the other entity	The thin film has very high heat resistance of remaining stable at 700.degree
generating	one entity generates another entity	Buffer layer resistor create impedance that noise introduced to head from disk of drive
purpose	relation between reason/result	conductor is utilized for producing writing flux in magnetic yoke
in-manner-of	do something in certain way	The linear array is angled at a skew angle
alias	one entity is also known under another entity’s name	The bias structure includes an antiferromagnetic layer AFM
formation	an entity acts as a role of the other entity	Windings are joined at end to form center tapped winding
comparison	compare one entity to the other	First end is closer to recording media use than second end
measurement	one entity acts as a way to measure the other entity	This provides a relative permeance of at least 1000
other	does not belong to any of the above types	Then, MR resistance estimate during polishing step is calculated from S value and K value

The corpus contains 1,010 patent abstracts with 3,986 sentences. We use the web-based annotation tool Brat [2] for data labeling, and the annotated data is saved in '.ann' format. The benefit of the '.ann' format is that you can display and manipulate the annotated data once TFH_Annotated_Dataset.zip is unzipped under the corresponding repository of Brat.
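
As a rough illustration of what the '.ann' files hold, the sketch below reads Brat standoff annotations in Python. In this format, entity lines start with a "T" id and carry a type, character offsets, and the covered text; relation lines start with an "R" id and point at two entity ids. The sample annotation, the type labels as spelled here, and the names parse_ann, Entity, and Relation are assumptions made for illustration; they are not taken from the dataset files.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    id: str
    type: str
    start: int
    end: int
    text: str


@dataclass
class Relation:
    id: str
    type: str
    arg1: str
    arg2: str


def parse_ann(content):
    """Parse entity (T) and relation (R) lines of a Brat standoff file."""
    entities, relations = {}, []
    for line in content.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            etype, span = fields[1].split(" ", 1)
            # Discontinuous spans look like "0 5;10 15"; keep the overall extent.
            offsets = [int(x) for frag in span.split(";") for x in frag.split()]
            text = fields[2] if len(fields) > 2 else ""
            entities[ann_id] = Entity(ann_id, etype, offsets[0], offsets[-1], text)
        elif ann_id.startswith("R"):
            rtype, arg1, arg2 = fields[1].split()
            relations.append(
                Relation(ann_id, rtype, arg1.split(":", 1)[1], arg2.split(":", 1)[1])
            )
        # Other annotation kinds (events, attributes, notes) are ignored here.
    return entities, relations


# Hypothetical annotation of one example sentence from Table 1.
SENTENCE = "Interlayer may comprise material such as Ta"
SAMPLE_ANN = (
    "T1\tcomponent 0 10\tInterlayer\n"
    "T2\tmaterial 41 43\tTa\n"
    "R1\tmade-of Arg1:T1 Arg2:T2\n"
)

entities, relations = parse_ann(SAMPLE_ANN)
for rel in relations:
    print(entities[rel.arg1].text, rel.type, entities[rel.arg2].text)
```

The offsets index into the sentence text, so SENTENCE[41:43] recovers the mention "Ta".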

TFH_Annotated_Dataset contains 22,833 entity mentions and 17,412 semantic relation mentions. With TFH_Annotated_Dataset, we run two information extraction tasks: named entity recognition with BiLSTM-CRF [3] and semantic relation extraction with BiGRU-2ATTENTION [4]. To improve the semantic representation of patent language, the word embeddings are trained on the abstracts of 46,302 patents regarding magnetic heads in hard disk drives, which turns out to improve the performance of named entity recognition by 0.3% and semantic relation extraction by about 2% in weighted-average F1, compared to GloVe and the patent word embeddings provided by Risch et al. [5].

For named entity recognition, the entity-level weighted-average precision, recall, and F1-value of BiLSTM-CRF on the test set are 78.5%, 78.0%, and 78.2%, respectively. Although such performance is acceptable, it is still lower than the model's performance on general-purpose datasets by more than 10% in F1-value. The main reason is the limited amount of labeled data.

The precision, recall, and F1-value for each entity type are shown in Fig. 4. As for relation extraction, the weighted-average precision, recall, and F1-value of BiGRU-2ATTENTION on the test set are 89.7%, 87.9%, and 88.6% with no_edge relations, and 32.3%, 41.5%, and 36.3% without no_edge relations.
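
The weighted-average figures above can be read as per-type scores averaged with weights proportional to each type's support (its number of gold mentions). The Python sketch below shows that computation under this assumption; the counts are made up for illustration and the function names are hypothetical, not part of the dataset's code.

```python
def per_class_prf(tp, fp, fn):
    """Precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def weighted_average(counts):
    """Support-weighted average of per-class precision, recall, and F1.

    counts maps a label to (tp, fp, fn); support is tp + fn.
    """
    total_support = sum(tp + fn for tp, fp, fn in counts.values())
    avg_p = avg_r = avg_f = 0.0
    for tp, fp, fn in counts.values():
        p, r, f = per_class_prf(tp, fp, fn)
        weight = (tp + fn) / total_support
        avg_p += weight * p
        avg_r += weight * r
        avg_f += weight * f
    return avg_p, avg_r, avg_f


# Made-up counts for two entity types (not taken from the dataset).
TOY_COUNTS = {
    "component": (80, 20, 20),
    "material": (10, 10, 30),
}

avg_precision, avg_recall, avg_f1 = weighted_average(TOY_COUNTS)
harmonic = 2 * avg_precision * avg_recall / (avg_precision + avg_recall)
print(round(avg_precision, 3), round(avg_recall, 3), round(avg_f1, 3), round(harmonic, 3))
```

Because F1 is averaged per type, the weighted-average F1 generally differs from the harmonic mean of the averaged precision and recall. This would be consistent with the reported 88.6% F1, since the harmonic mean of 89.7% and 87.9% is about 88.8%.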

Academic citation: Chen, L., Xu, S.*, Zhu, L. et al. A deep learning based method for extracting semantic information from patent documents. Scientometrics 125, 289–312 (2020). https://doi.org/10.1007/s11192-020-03634-y

Paper link https://link.springer.com/article/10.1007/s11192-020-03634-y

REFERENCE [1] Pérez-Pérez, M., Pérez-Rodríguez, G., Vazquez, M., Fdez-Riverola, F., Oyarzabal, J., Valencia, A., Lourenço, A., & Krallinger, M. (2017). Evaluation of chemical and gene/protein entity recognition systems at BioCreative V.5: The CEMP and GPRO patents tracks. In Proceedings of the Bio-Creative V.5 challenge evaluation workshop, pp. 11–18.

[2] Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., & Tsujii, J. I. (2012). BRAT: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 102-107)

[3] Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991

[4] Han, X., Gao, T., Yao, Y., Ye, D., Liu, Z., & Sun, M. (2019). OpenNRE: An Open and Extensible Toolkit for Neural Relation Extraction. arXiv preprint arXiv:1301.3781

[5] Risch, J., & Krestel, R. (2019). Domain-specific word embeddings for patent classification. Data Technologies and Applications, 53(1), 108–122.

Network test comparison.
plos.figshare.com
xls
Updated Jun 15, 2023
YunYan Wang; Huaxuan Wu; Luo Shuai; Chen Peng; Zhiwei Yang (2023). Network test comparison. [Dataset]. http://doi.org/10.1371/journal.pone.0265503.t009
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0265503.t009
Dataset updated
Jun 15, 2023
Dataset provided by
PLOS ONE
Authors
YunYan Wang; Huaxuan Wu; Luo Shuai; Chen Peng; Zhiwei Yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Network test comparison.
Comparative experiment of multiple networks on UCAS-AOD dataset (IOU 0.5).
plos.figshare.com
xls
Updated Jun 10, 2023
YunYan Wang; Huaxuan Wu; Luo Shuai; Chen Peng; Zhiwei Yang (2023). Comparative experiment of multiple networks on UCAS-AOD dataset (IOU 0.5). [Dataset]. http://doi.org/10.1371/journal.pone.0265503.t005
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0265503.t005
Dataset updated
Jun 10, 2023
Dataset provided by
PLOS ONE
Authors
YunYan Wang; Huaxuan Wu; Luo Shuai; Chen Peng; Zhiwei Yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparative experiment of multiple networks on UCAS-AOD dataset (IOU 0.5).
Predicting Employee Turnover at Sailsfort Motors
kaggle.com
Updated May 21, 2024
Mani Devesh (2024). Predicting Employee Turnover at Sailsfort Motors [Dataset]. https://www.kaggle.com/datasets/manidevesh/hr-dataset-analysis
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
May 21, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Mani Devesh
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Project Overview: Predicting Employee Turnover at Sailsfort Motors

Introduction

This project aims to analyze the factors contributing to employee turnover at Sailsfort Motors, an automobile company. By leveraging a combination of logistic regression and tree-based models, we will identify key predictors of employee turnover and develop strategies to enhance employee retention.

Objectives

Predict Turnover: Build models to predict whether an employee will leave the company.

Identify Key Factors: Determine the most significant factors influencing employee turnover.

Retention Strategies: Provide actionable insights to improve employee retention.

Data Description

The dataset includes the following attributes:

-Satisfaction Level: Employee satisfaction level.
-Last Evaluation: Last performance evaluation score.
-Number of Projects: Number of projects the employee has worked on.
-Average Monthly Hours: Average monthly working hours.
-Time Spent at Company: Number of years the employee has been with the company.
-Work Accident: Whether the employee has had a work accident (1: Yes, 0: No).
-Left: Whether the employee has left the company (1: Yes, 0: No).
-Promotion in Last 5 Years: Whether the employee has been promoted in the last five years (1: Yes, 0: No).
-Department: Department the employee belongs to.
-Salary: Salary level (Low, Medium, High).

Methodology

-Data Preprocessing: Clean and preprocess the data to handle missing values, categorical variables, and data normalization.
-Exploratory Data Analysis (EDA): Perform EDA to understand the distribution of data and identify patterns and correlations.
-Feature Engineering: Create relevant features to enhance model performance.

Model Building:
-Logistic Regression: Build a logistic regression model to identify the probability of employee turnover.
-Tree-Based Models: Build tree-based models (e.g., Decision Tree, Random Forest) to capture non-linear relationships and interactions between features.
-Model Evaluation: Evaluate model performance using metrics such as accuracy, precision, recall, and F1-score.

-Insights and Recommendations: Analyze the results to identify key factors leading to employee turnover and provide recommendations to improve retention.
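
The data preprocessing step in the methodology above (categorical variables and normalization) could look roughly like the following Python sketch. It is only an illustration: the column names, the three sample rows, and the choice of ordinal encoding for salary, one-hot encoding for department, and min-max scaling for hours are all assumptions, not details taken from the dataset or its notebook.

```python
# Assumed ordering of the salary levels listed in the data description.
SALARY_ORDER = {"low": 0, "medium": 1, "high": 2}


def min_max(values):
    """Scale numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def preprocess(rows):
    """Return (features, labels) with encoded categorical columns."""
    departments = sorted({row["department"] for row in rows})
    hours = min_max([row["average_monthly_hours"] for row in rows])
    features, labels = [], []
    for row, scaled_hours in zip(rows, hours):
        feat = {
            "satisfaction_level": row["satisfaction_level"],
            "average_monthly_hours": scaled_hours,
            # Ordinal encoding keeps the Low < Medium < High ordering.
            "salary": SALARY_ORDER[row["salary"].lower()],
        }
        # One-hot encoding: one 0/1 column per department.
        for dept in departments:
            feat["department_" + dept] = 1 if row["department"] == dept else 0
        features.append(feat)
        labels.append(row["left"])
    return features, labels


# Hypothetical rows using a subset of the attributes described above.
ROWS = [
    {"satisfaction_level": 0.38, "average_monthly_hours": 157,
     "department": "sales", "salary": "Low", "left": 1},
    {"satisfaction_level": 0.80, "average_monthly_hours": 262,
     "department": "technical", "salary": "Medium", "left": 0},
    {"satisfaction_level": 0.11, "average_monthly_hours": 272,
     "department": "sales", "salary": "High", "left": 1},
]

features, labels = preprocess(ROWS)
print(features[0], labels)
```

Tree-based models do not need the scaling step, while logistic regression usually trains more smoothly with it, which is one reason to keep the two encodings explicit.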

Expected Outcomes

-Predictive Models: Accurate models to predict employee turnover.
-Key Insights: Identification of the most significant factors contributing to employee turnover.
-Retention Strategies: Data-driven recommendations to improve employee satisfaction and retention.

By predicting employee turnover and understanding its driving factors, this project aims to provide valuable insights for Sailsfort Motors to enhance their HR strategies and foster a more stable and satisfied workforce.
Table1_Towards an accurate and robust analysis pipeline for somatic mutation...
frontiersin.figshare.com
bin
Updated Jun 21, 2023
Jingjie Jin; Zixi Chen; Jinchao Liu; Hongli Du; Gong Zhang (2023). Table1_Towards an accurate and robust analysis pipeline for somatic mutation calling.XLSX [Dataset]. http://doi.org/10.3389/fgene.2022.979928.s001
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.3389/fgene.2022.979928.s001
Dataset updated
Jun 21, 2023
Dataset provided by
Frontiers
Authors
Jingjie Jin; Zixi Chen; Jinchao Liu; Hongli Du; Gong Zhang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Accurate and robust somatic mutation detection is essential for cancer treatment, diagnostics and research. Various analysis pipelines give different results and thus should be systematically evaluated. In this study, we benchmarked 5 commonly used somatic mutation calling pipelines (VarScan, VarDictJava, Mutect2, Strelka2 and FANSe) for their precision, recall and speed, using standard benchmarking datasets based on a series of real-world whole-exome sequencing datasets. All 5 pipelines showed very high precision in all cases, and a high recall rate at mutation rates higher than 10%. However, for low-frequency mutations, these pipelines showed large differences. FANSe showed the highest accuracy (especially sensitivity) in all cases, and VarScan and VarDictJava outperformed Mutect2 and Strelka2 on low-frequency mutations at all sequencing depths. Flaws in the filters were the major cause of the low sensitivity of the four pipelines other than FANSe. Concerning speed, the FANSe pipeline was 8.8∼19x faster than the other pipelines. Our benchmarking results demonstrate the performance of the somatic calling pipelines and provide a reference for a proper choice of such pipelines in cancer applications.
Million Song Data Analysis 2
kaggle.com
Updated Jun 29, 2024
Zirian Afandy (2024). Million Song Data Analysis 2 [Dataset]. https://www.kaggle.com/datasets/ziriantahirli/million-song-data-analysis-2
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Jun 29, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Zirian Afandy
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Did We Solve the Problem?

The objective of this analysis was to predict high streaming counts on Spotify and perform a detailed cluster analysis to understand user behavior. Here’s a summary of how we addressed each part of the objective:

Prediction of High Streaming Counts:

Implemented Multiple Models: We utilized several machine learning models including Decision Tree, Random Forest, Gradient Boosting, Support Vector Machine (SVM), and k-Nearest Neighbors (k-NN).
Comparison and Evaluation: These models were evaluated based on classification metrics like accuracy, precision, recall, and F1-score. The Gradient Boosting and Random Forest models were found to be the most effective in predicting high streaming counts.

Cluster Analysis:

K-means Clustering: We applied K-means clustering to segment users into three clusters based on their listening behavior.
Detailed Characterization: Each cluster was analyzed to understand the distinct characteristics, such as average playtime, skip rate, offline usage, and shuffle usage.
Visualizations: Histograms and scatter plots were used to visualize the distributions and relationships within each cluster.

Results and Insights

Effective Models: The Gradient Boosting and Random Forest models provided the highest accuracy and balanced performance for predicting high streaming counts.
User Segmentation: The cluster analysis revealed three distinct user segments:
Cluster 1: Users with longer playtimes and lower skip rates.
Cluster 2: Users with moderate playtimes and skip rates.
Cluster 3: Users with shorter playtimes and higher skip rates.

These insights can be leveraged for targeted marketing, personalized recommendations, and improving user engagement on Spotify.
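
The K-means segmentation into three clusters described above can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's code: the data points (average playtime in minutes, skip rate) are invented, and the evenly spaced initial centroids are a simplification chosen to keep the run deterministic, where library implementations typically use random or k-means++ initialization.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm with deterministic, evenly spaced initial centroids."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Invented users: (average playtime in minutes, skip rate).
USERS = [
    (55, 0.05), (60, 0.10), (58, 0.08),   # longer playtime, lower skip rate
    (30, 0.30), (32, 0.35), (28, 0.28),   # moderate playtime and skip rate
    (8, 0.70), (10, 0.75), (6, 0.80),     # shorter playtime, higher skip rate
]

centroids, labels = kmeans(USERS, k=3)
print(labels)
print(centroids)
```

Playtime dominates the distance here because it is on a much larger scale than skip rate, so in practice the features would normally be standardized before clustering.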

Conclusion

Yes, we solved the problem. We successfully predicted high streaming counts using effective machine learning models and provided a detailed cluster analysis to understand user behavior. The analysis offers valuable insights for enhancing Spotify’s recommendation system and user experience.
S1 File -
plos.figshare.com
application/x-rar
Updated Jun 29, 2023
Muhammad Haris Abid; Rehan Ashraf; Toqeer Mahmood; C. M. Nadeem Faisal (2023). S1 File - [Dataset]. http://doi.org/10.1371/journal.pone.0287786.s001
Explore at:
application/x-rarAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0287786.s001
Dataset updated
Jun 29, 2023
Dataset provided by
PLOS ONE
Authors
Muhammad Haris Abid; Rehan Ashraf; Toqeer Mahmood; C. M. Nadeem Faisal
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Artificial intelligence (AI) development across the health sector has recently become crucial. Early medical information, identification, diagnosis, classification, and analysis, along with viable remedies, are always beneficial developments. Precise and consistent image classification is critical for diagnosis and tactical decisions in healthcare. The core issue in image classification is the semantic gap. Conventional machine learning algorithms for classification rely mainly on low-level rather than high-level characteristics and employ handcrafted features to close the gap, but they require intensive feature extraction and classification approaches. Deep learning is a powerful tool that has advanced considerably in recent years, with deep convolutional neural networks (CNNs) succeeding in image classification. The main goal is to bridge the semantic gap and enhance the classification performance on multi-modal medical images using the deep learning-based model ResNet50. The data set included 28,378 multi-modal medical images to train and validate the model. Overall accuracy, precision, recall, and F1-score were calculated as evaluation parameters. The proposed model classifies medical images more accurately than other state-of-the-art methods. The experiment attained an accuracy of 98.61%. The study directly benefits the health service.
Characteristics of women at admission.
plos.figshare.com
xls
Updated Feb 4, 2025
Guiyou Yang; Tünde Montgomery-Csobán; Wessel Ganzevoort; Sanne J. Gordijn; Kimberley Kavanagh; Paul Murray; Laura A. Magee; Henk Groen; Peter von Dadelszen (2025). Characteristics of women at admission. [Dataset]. http://doi.org/10.1371/journal.pmed.1004509.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pmed.1004509.t001
Dataset updated
Feb 4, 2025
Dataset provided by
PLOS Medicine
Authors
Guiyou Yang; Tünde Montgomery-Csobán; Wessel Ganzevoort; Sanne J. Gordijn; Kimberley Kavanagh; Paul Murray; Laura A. Magee; Henk Groen; Peter von Dadelszen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background

Preeclampsia is a potentially life-threatening pregnancy complication. Among women whose pregnancies are complicated by preeclampsia, the Preeclampsia Integrated Estimate of RiSk (PIERS) models (i.e., the PIERS Machine Learning [PIERS-ML] model, and the logistic regression-based fullPIERS model) accurately identify individuals at greatest or least risk of adverse maternal outcomes within 48 h following admission. Both models were developed and validated to be used as part of initial assessment. In the United Kingdom, the National Institute for Health and Care Excellence (NICE) recommends repeated use of such static models for ongoing assessment beyond the first 48 h. This study evaluated the models’ performance during such consecutive prediction.

Methods and findings

This multicountry prospective study used data of 8,843 women (32% white, 30% black, and 26% Asian) with a median age of 31 years. These women, admitted to maternity units in the Americas, sub-Saharan Africa, South Asia, Europe, and Oceania, were diagnosed with preeclampsia at a median gestational age of 35.79 weeks between 2003 and 2016. The risk differentiation performance of the PIERS-ML and fullPIERS models was assessed for each day within a 2-week post-admission window. The PIERS adverse maternal outcome includes one or more of: death, end-organ complication (cardiorespiratory, renal, hepatic, etc.), or uteroplacental dysfunction (e.g., placental abruption). The main outcome measures were: trajectories of mean risk of each of the uncomplicated course and adverse outcome groups; daily area under the precision-recall curve (AUC-PRC); potential clinical impact (i.e., net benefit in decision curve analysis); dynamic shifts of multiple risk groups; and daily likelihood ratios. In the 2-week window, the number of daily outcome events decreased from over 200 to around 10.

For both PIERS-ML and fullPIERS models, we observed consistently higher mean risk in the adverse outcome (versus uncomplicated course) group. The AUC-PRC values (0.2–0.4) of the fullPIERS model remained low (i.e., close to the daily fraction of adverse outcomes, indicating low discriminative capacity). The PIERS-ML model’s AUC-PRC peaked on day 0 (0.65), and notably decreased thereafter. When categorizing women into multiple risk groups, the PIERS-ML model generally showed good rule-in capacity for the “very high” risk group, with positive likelihood ratio values ranging from 70.99 to infinity, and good rule-out capacity for the “very low” risk group where most negative likelihood ratio values were 0. However, performance declined notably for other risk groups beyond 48 h. Decision curve analysis revealed a diminishing advantage for treatment guided by both models over time. The main limitation of this study is that the baseline performance of the PIERS-ML model was assessed on its development data; however, its baseline performance has also undergone external evaluation.

Conclusions

In this study, we have evaluated the performance of the fullPIERS and PIERS-ML models for consecutive prediction. We observed deteriorating performance of both models over time. We recommend using the models for consecutive prediction with greater caution and interpreting predictions with increasing uncertainty as the pregnancy progresses. For clinical practice, models should be adapted to retain accuracy when deployed serially. The performance of future models can be compared with the results of this study to quantify their added value.
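
To make the likelihood ratio figures above easier to read, the Python sketch below computes the standard binary positive and negative likelihood ratios from a 2x2 table. It shows how a positive likelihood ratio becomes infinite when a risk group contains no false positives, and how a negative likelihood ratio becomes 0 when no adverse outcome is missed. The counts are invented, the function name is hypothetical, and the study's exact procedure for multiple risk groups may differ from this binary form.

```python
import math


def likelihood_ratios(tp, fp, fn, tn):
    """Return (LR+, LR-) from a 2x2 table.

    LR+ = sensitivity / (1 - specificity)
    LR- = (1 - sensitivity) / specificity
    Assumes at least one woman with and one without the outcome.
    """
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    specificity = tn / (fp + tn)
    if fp == 0:
        # No false positives: LR+ is unbounded (undefined if tp is also 0).
        lr_pos = math.inf if tp > 0 else math.nan
    else:
        lr_pos = sensitivity / false_positive_rate
    if tn == 0:
        lr_neg = math.inf if fn > 0 else math.nan
    else:
        lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


# Invented counts: "test positive" means being placed in the very high risk group.
rule_in = likelihood_ratios(tp=30, fp=0, fn=70, tn=900)

# Invented counts: "test negative" means being placed in the very low risk group.
rule_out = likelihood_ratios(tp=100, fp=600, fn=0, tn=300)

print(rule_in)
print(rule_out)
```

In the first table nobody in the high-risk group was free of the outcome, so LR+ is infinite; in the second no adverse outcome fell in the low-risk group, so LR- is 0.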
