This dataset was created by Deepak Bhat
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cancer is the unregulated growth of abnormal cells in the human body. Cervical cancer, also known as cancer of the cervix, develops on the surface of the cervix, causing an overabundance of cells to build up and eventually form a lump or tumour. Early detection is therefore essential for determining an effective treatment. Machine Learning (ML) techniques can help here by predicting cervical cancer before it becomes serious. Four common diagnostic tests, namely Hinselmann, Schiller, Cytology, and Biopsy, have been compared and predicted with four common ML models: Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (K-NN), and Extreme Gradient Boosting (XGB). Additionally, to improve the performance of the ML models, the Stratified k-fold cross-validation (SKCV) method has been applied. The experimental findings demonstrate that using an RF classifier for analyzing cervical cancer risk could be a good alternative for assisting clinical specialists in classifying this disease in advance.
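A minimal sketch of the evaluation scheme described above, with synthetic stand-ins for the risk-factor features and one diagnosis-test target (assuming scikit-learn; the real records are not loaded here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the risk-factor matrix and one test target
# (e.g., the Biopsy column); the positive class is kept rare on purpose.
X, y = make_classification(n_samples=1000, n_features=30,
                           weights=[0.9, 0.1], random_state=42)

# Stratified k-fold preserves the class ratio in every fold, which
# matters when the positive class is this imbalanced.
skcv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=skcv, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```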
This dataset was created by Raj Gandhi
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Gagan Bajwa
Released under Apache 2.0
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Nick Kuzmenkov
Released under CC0: Public Domain
It contains the following files:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Random Forest classification results for the whole dataset with stratified k-fold and oversampling.
https://creativecommons.org/publicdomain/zero/1.0/
Dataset of TFRecords files made from Plant Pathology 2021 original competition data. Changes:
* the `labels` column of the initial `train.csv` DataFrame was binarized to multi-label format columns: `complex`, `frog_eye_leaf_spot`, `healthy`, `powdery_mildew`, `rust`, and `scab`
* images were scaled to 600x600
* 77 duplicate images having different labels were removed (see the context in this notebook)
* samples were stratified and split into 5 folds (see corresponding folders `fold_0`:`fold_4`)
* images were heavily augmented with the `albumentations` library (for raw images see this dataset)
* each folder contains 5 copies of randomly augmented initial images (so that the model never meets the same images)
I suggest adding all 5 datasets to your notebook: 4 augmented datasets = 20 epochs of unique images (1, 2, 3, 4) + 1 raw dataset for validation here.
For a complete example, see my TPU Training Notebook.
train.csv
folds.csv
fold_0:fold_4 folders, each containing 64 .tfrec files, with the feature map shown below:
feature_map = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'name': tf.io.FixedLenFeature([], tf.string),
    'complex': tf.io.FixedLenFeature([], tf.int64),
    'frog_eye_leaf_spot': tf.io.FixedLenFeature([], tf.int64),
    'healthy': tf.io.FixedLenFeature([], tf.int64),
    'powdery_mildew': tf.io.FixedLenFeature([], tf.int64),
    'rust': tf.io.FixedLenFeature([], tf.int64),
    'scab': tf.io.FixedLenFeature([], tf.int64)}
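A minimal usage sketch for decoding one of these shards with the feature map above. Assumptions: the `image` field holds JPEG-encoded bytes, and the shard path below is a placeholder name:

```python
import tensorflow as tf

def parse_example(serialized):
    # Parse one serialized record with the feature_map defined above.
    example = tf.io.parse_single_example(serialized, feature_map)
    # Assumption: 'image' holds JPEG-encoded bytes of a 600x600 RGB image.
    image = tf.io.decode_jpeg(example['image'], channels=3)
    labels = tf.stack([tf.cast(example[k], tf.float32)
                       for k in ('complex', 'frog_eye_leaf_spot', 'healthy',
                                 'powdery_mildew', 'rust', 'scab')])
    return image, labels

# 'fold_0/sample.tfrec' is a hypothetical shard name.
ds = tf.data.TFRecordDataset('fold_0/sample.tfrec').map(parse_example)
```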
### Acknowledgements
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many digital libraries recommend literature to their users based on the similarity between a query document and their repository. However, they often fail to distinguish what relationship makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph Vectors, BERT, and XLNet, under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best-performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task, and they motivate the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries, such that one could find documents that are similar in one aspect but dissimilar in another.
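As a rough illustration of the pairwise setup (a sketch only: the checkpoint, relation count, and texts below are placeholders, not the authors' released model), a BERT-style encoder can classify an article pair by encoding both documents in a single sequence:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint and number of relation labels; the paper's
# fine-tuned weights are not reproduced here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5)

doc_a = "Plain text of the first Wikipedia article ..."
doc_b = "Plain text of the second Wikipedia article ..."

# BERT pair encoding: [CLS] doc_a [SEP] doc_b [SEP], truncated to 512 tokens.
inputs = tokenizer(doc_a, doc_b, truncation=True, max_length=512,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # index of the predicted relation
```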
Additional information can be found on GitHub.
The following data is supplemental to the experiments described in our research paper. The data consists of:
This package consists of the Dataset part.
Dataset
The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data were downloaded as an XML dump, and the corresponding articles were extracted as plain text with gensim.scripts.segment_wiki. The archive contains only articles that are available in the training or test data.
The actual dataset, as used in the stratified k-fold cross-validation with k=4, is provided in train_testdata_4folds.tar.gz.
├── 1
│   ├── test.csv
│   └── train.csv
├── 2
│   ├── test.csv
│   └── train.csv
├── 3
│   ├── test.csv
│   └── train.csv
└── 4
    ├── test.csv
    └── train.csv
4 directories, 8 files
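A minimal sketch for iterating over the four extracted fold directories (assuming pandas; column handling is left generic since the CSV schema is not shown here):

```python
import pandas as pd

# Directory layout as extracted from train_testdata_4folds.tar.gz.
for fold in range(1, 5):
    train_df = pd.read_csv(f"{fold}/train.csv")
    test_df = pd.read_csv(f"{fold}/test.csv")
    # Train on train_df and evaluate on test_df here.
    print(fold, len(train_df), len(test_df))
```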
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Supplementary Table 5. Unsupervised learning of cross-modal mappings in multi-omics data for survival stratification of gastric cancer. Significant pathways for up-regulated and down-regulated genes.
Abstract. Purpose: This study presents a survival-stratification model based on multi-omics integration using bidirectional deep neural networks (BiDNNs) in gastric cancer (GC). Methods: Based on the survival-related representation features yielded by BiDNNs through integrating transcriptomics and epigenomics data, K-means clustering analysis was performed to cluster tumor samples into different survival subgroups. The BiDNNs-based model was validated using 10-fold cross-validation and in two independent confirmation cohorts. Results: Using the BiDNNs-based survival stratification model, patients were grouped into two survival subgroups with a log-rank P value of 9.05E-05. The subgroup classification was robustly validated in 10-fold cross-validation (C-index = 0.65 ± 0.02) and in two confirmation cohorts (E-GEOD-26253, C-index = 0.609; E-GEOD-62254, C-index = 0.706). Conclusion: We propose and validate a robust and stable BiDNNs-based survival stratification model in GC.
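A minimal sketch of the clustering step described above, using synthetic stand-ins for the BiDNN representation features and survival outcomes (assuming scikit-learn and lifelines; not the study's data or code):

```python
import numpy as np
from lifelines.statistics import logrank_test
from sklearn.cluster import KMeans

# Synthetic stand-ins: Z mimics BiDNN-derived representation features,
# times/events mimic follow-up durations and death indicators.
rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 50))
times = rng.exponential(24.0, size=300)
events = rng.integers(0, 2, size=300)

# Cluster samples into two candidate survival subgroups.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

# Compare the subgroups' survival with a log-rank test.
result = logrank_test(times[labels == 0], times[labels == 1],
                      event_observed_A=events[labels == 0],
                      event_observed_B=events[labels == 1])
print(result.p_value)
```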
This dataset was created by AAKRITI ADHIKARI
A daily gridded North American snowfall data with focus on the quality of the interpolated product is archived in this dataset. Daily snowfall amounts from National Weather Service Cooperative Observer Program stations and Meteorological Service of Canada surface stations are interpolated to 1 degree by 1 degree grids and examined for data record length and quality. The interpolation is validated spatially and temporally through the use of stratified sampling and k-fold cross-validation analyses. Interpolation errors average around 0.5 cm and range from less than 0.01 to greater than 2.5 cm. For most locations, this is within the measurement sensitivity. Grid cells with large variations in elevation experience higher errors and should be used with caution.
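A minimal sketch of the validation idea described above: hold out stations with k-fold splits, interpolate daily snowfall from the remaining stations, and measure the error (synthetic station data; assuming SciPy and scikit-learn):

```python
import numpy as np
from scipy.interpolate import griddata
from sklearn.model_selection import KFold

# Synthetic stand-ins for station coordinates and one day of snowfall (cm).
rng = np.random.default_rng(0)
coords = rng.uniform([-125.0, 30.0], [-65.0, 55.0], size=(500, 2))
snow = rng.gamma(2.0, 1.0, size=500)

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(coords):
    pred = griddata(coords[train_idx], snow[train_idx],
                    coords[test_idx], method="linear")
    mask = ~np.isnan(pred)  # points outside the convex hull come back NaN
    errors.append(np.mean(np.abs(pred[mask] - snow[test_idx][mask])))
print(np.mean(errors))  # mean absolute interpolation error, cm
```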
This dataset was created by Yuzuki
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 2. Contains results for all models evaluated during the model training and stratified 10-fold cross validation stage.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AUC metric for the corresponding groups and classifiers based on the original 7 features, the selected optimal minimal subset of 10 features, and all 13 selected significant features.
This dataset was created by Олексій Чорний
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Namely, sensitivity is the proportion of patients sensitive to first-line treatment who are correctly predicted as such, and specificity is the proportion of patients non-sensitive to first-line treatment who are correctly predicted as such.
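A minimal illustration of the two rates, computed from a confusion matrix with 1 marking a patient sensitive to first-line treatment (toy labels, not study data; assuming scikit-learn):

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = sensitive to first-line treatment, 0 = non-sensitive.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # correct among truly sensitive patients
specificity = tn / (tn + fp)  # correct among truly non-sensitive patients
print(sensitivity, specificity)
```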