By Huggingface Hub [source]
This dataset provides a comprehensive corpus for natural language processing tasks, specifically for validating OpenAI's summarization reward models. It contains summaries of text drawn from the TL;DR, CNN, and Daily Mail datasets, together with the choices workers made between candidate summaries, batch information that distinguishes the annotation rounds in which summaries were collected, and the dataset split each record belongs to. This data can be used to train and evaluate summarization systems on real-world text and to benchmark model output directly against human preferences.
This dataset provides a corpus of human-generated summaries of text from the TL;DR, CNN, and Daily Mail datasets for training and evaluating natural language processing models. It is divided into training and validation splits.
To use this dataset for summarization tasks:
- Gather information about the text you would like to summarize by inspecting the info column of the two .csv files (train and validation).
- Choose the summary you want from the choice column of either file, depending on whether you prefer worker- or batch-level selection.
- Review the corresponding entries in the summaries column for alternative summaries with similar content but different wording or style.
- Check the split, worker, and batch information for each candidate before settling on the summary whose accuracy and clarity best fit your needs.
A minimal loading sketch follows this list.
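As a hedged starting point, the sketch below loads the two comparison files named in the tables further down and inspects the columns described above; it assumes both csv files have been downloaded to the working directory.

```python
# Minimal loading sketch for the comparison files described in this entry.
import pandas as pd

train = pd.read_csv("comparisons_train.csv")
valid = pd.read_csv("comparisons_validation.csv")

print(train.columns.tolist())           # expected: info, summaries, choice, batch, split, extra
print(train["batch"].value_counts())    # how the comparisons are spread across batches

example = train.iloc[0]
print(example["info"])                  # source text to be summarized
print(example["summaries"])             # candidate summaries written by workers
print(example["choice"])                # the summary the worker preferred
```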
- Training a natural language processing model to automatically generate summaries of text, using summary and choice data from this dataset.
- Evaluating OpenAI's reward model for natural language processing on the validation data in order to improve accuracy and performance.
- Analyzing the worker and batch information in order to assess trends among workers or batches that could be indicative of bias or other issues affecting summarization accuracy.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: comparisons_validation.csv

| Column name | Description |
|:------------|:------------|
| info | Text to be summarized. (String) |
| summaries | Summaries generated by workers. (String) |
| choice | The chosen summary. (String) |
| batch | Batch for which it was created. (Integer) |
| split | Split of the dataset between training and validation sets. (String) |
| extra | Additional information about the given source material available. (String) |
File: comparisons_train.csv

| Column name | Description |
|:------------|:------------|
| info | Text to be summarized. (String) |
| summaries | Summaries generated by workers. (String) |
| choice | The chosen summary. (String) |
| batch | Batch for which it was created. (Integer) |
| split | ... |
The U.S. Geological Survey (USGS) Earth Resources Observation and Science (EROS) Center in Sioux Falls, SD developed a cloud validation dataset from 48 unique Landsat 7 Collection 2 images. These images were selected at random from the Landsat 7 SLC-On archive from various locations around the world. While these validation images were subjectively designed by a single analyst, they provide useful information for quantifying the accuracy of clouds flagged by various cloud masking algorithms. Each mask is provided in GeoTIFF format, and includes all bands from the original Landsat 7 Level-1 Collection 2 data product (COG GeoTIFF), and its associated Level-1 metadata (MTL.txt file). The methodology used to create these masks is the same as in previous USGS Landsat cloud truth masks (http://doi.org/10.5066/F7251GDH). Pixels are marked as Cloud if the pixel contains opaque and clearly identifiable clouds. Pixels are marked as Thin Cloud if they contain clouds that are transparent or if their classification as cloud is uncertain. Pixels that contain clouds with less than 50% opacity, or which do not contain clouds at all, are marked as Clear. In some masks the borders around clouds have been dilated to encompass the edges around irregular clouds.
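As a hedged illustration of how such a truth mask might be used to score a cloud masking algorithm, the sketch below compares two GeoTIFFs pixel by pixel; the file names and the integer class codes (0 = Clear, 1 = Thin Cloud, 2 = Cloud) are assumptions, and rasterio is just one convenient reader.

```python
# Compare an algorithm's cloud flags against a manual truth mask (sketch).
import numpy as np
import rasterio

with rasterio.open("LE07_truth_mask.tif") as src:       # hypothetical truth-mask file
    truth = src.read(1)
with rasterio.open("LE07_algorithm_mask.tif") as src:   # hypothetical output of the algorithm under test
    predicted = src.read(1)

CLOUD_CODES = [1, 2]                     # assumed codes for Thin Cloud and Cloud
cloud_truth = np.isin(truth, CLOUD_CODES)
cloud_pred = np.isin(predicted, CLOUD_CODES)
agreement = (cloud_truth == cloud_pred).mean()
print(f"pixel-level cloud agreement: {agreement:.3f}")
```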
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Metrics and 95% confidence intervals evaluated on the validation data set for two different classification thresholds: 0.60 (selected by maximising the F1 score) and 0.71 (corresponding to 90% specificity).
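As a hedged sketch of how two such thresholds could be derived from validation scores (one maximising F1, one meeting a 90% specificity target), assuming binary labels and predicted probabilities:

```python
# Pick a decision threshold by maximising F1 or by hitting a target specificity.
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

def select_thresholds(y_true, y_score, target_specificity=0.90):
    prec, rec, thr = precision_recall_curve(y_true, y_score)
    f1 = 2 * prec[:-1] * rec[:-1] / np.clip(prec[:-1] + rec[:-1], 1e-12, None)
    t_max_f1 = thr[np.argmax(f1)]                  # threshold maximising the F1 score

    fpr, tpr, thr_roc = roc_curve(y_true, y_score)
    specificity = 1.0 - fpr
    ok = specificity >= target_specificity
    t_spec = thr_roc[ok][np.argmax(tpr[ok])]       # most sensitive threshold at >= 90% specificity
    return t_max_f1, t_spec
```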
In order to thoroughly assess the generalization and transferability of the NE-GraphSAGE model across different domain datasets, this paper adopts a cross-domain validation strategy. Specifically, two domains with significant inherent differences are selected to construct the training and testing sets, respectively. After training the NE-GraphSAGE model on the non-review paper citation relationship network in Domain A to capture the unique patterns and associations of that domain, the model is then applied to the non-review paper citation relationship network in Domain B. Domain B exhibits evident differences from the training set in terms of data characteristics and research questions, thus creating a challenging testing environment. This strategy of separating the training and testing sets by domain examines the adaptability and flexibility of the NE-GraphSAGE model when confronted with data from different domains.

For this study, the training set is chosen from the intelligent transportation systems domain, while the testing set is selected from the 3D vision domain. Literature from 2022 to 2024 in these two research domains, including both review and non-review papers, was retrieved from the Web of Science platform. The search results yield a review literature collection comprising 473 papers, with 218 in the intelligent transportation systems domain and 255 in the 3D vision domain. The non-review literature collection contains 8311 papers, of which 3276 are in the intelligent transportation systems domain and 5035 are in the 3D vision domain. Based on the citation relationships among the non-review literature, a citation network is constructed and nodes are labeled. Additionally, indicators such as the number of citations, usage frequency, publication year, and the number of research fields covered are embedded as node attribute features. After processing, the training set consists of 1595 nodes and 1784 edges, while the testing set includes 1179 nodes and 908 edges.
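The cross-domain protocol above (train on one citation network, evaluate on another) can be illustrated with a plain two-layer GraphSAGE in PyTorch Geometric; this is a generic stand-in for NE-GraphSAGE, and the toy graphs, four node features, and binary node labels below are assumptions rather than the study's data.

```python
# Generic GraphSAGE trained on domain A and evaluated on domain B (illustration only).
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import SAGEConv

class SAGENet(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

def toy_domain(num_nodes, num_edges, num_feats=4, num_classes=2, seed=0):
    g = torch.Generator().manual_seed(seed)
    x = torch.rand(num_nodes, num_feats, generator=g)                  # citations, usage, year, fields (toy values)
    edge_index = torch.randint(0, num_nodes, (2, num_edges), generator=g)
    y = torch.randint(0, num_classes, (num_nodes,), generator=g)
    return Data(x=x, edge_index=edge_index, y=y)

domain_a = toy_domain(1595, 1784, seed=1)   # stands in for the intelligent transportation network
domain_b = toy_domain(1179, 908, seed=2)    # stands in for the 3D vision network

model = SAGENet(in_dim=4, hidden_dim=32, num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(100):                                        # train only on domain A
    optimizer.zero_grad()
    loss = F.cross_entropy(model(domain_a.x, domain_a.edge_index), domain_a.y)
    loss.backward()
    optimizer.step()

with torch.no_grad():                                       # evaluate on the unseen domain B
    pred = model(domain_b.x, domain_b.edge_index).argmax(dim=1)
    print("cross-domain accuracy:", (pred == domain_b.y).float().mean().item())
```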
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Community science image libraries offer a massive, but largely untapped, source of observational data for phenological research. The iNaturalist platform offers a particularly rich archive, containing more than 49 million verifiable, georeferenced, open access images, encompassing seven continents and over 278,000 species. A critical limitation preventing scientists from taking full advantage of this rich data source is labor. Each image must be manually inspected and categorized by phenophase, which is both time-intensive and costly. Consequently, researchers may only be able to use a subset of the total number of images available in the database. While iNaturalist has the potential to yield enough data for high-resolution and spatially extensive studies, it requires more efficient tools for phenological data extraction. A promising solution is automation of the image annotation process using deep learning. Recent innovations in deep learning have made these open-source tools accessible to a general research audience. However, it is unknown whether deep learning tools can accurately and efficiently annotate phenophases in community science images. Here, we train a convolutional neural network (CNN) to annotate images of Alliaria petiolata into distinct phenophases from iNaturalist and compare the performance of the model with non-expert human annotators. We demonstrate that researchers can successfully employ deep learning techniques to extract phenological information from community science images. A CNN classified two-stage phenology (flowering and non-flowering) with 95.9% accuracy and classified four-stage phenology (vegetative, budding, flowering, and fruiting) with 86.4% accuracy. The overall accuracy of the CNN did not differ from humans (p = 0.383), although performance varied across phenophases. We found that a primary challenge of using deep learning for image annotation was not related to the model itself, but instead in the quality of the community science images. Up to 4% of A. petiolata images in iNaturalist were taken from an improper distance, were physically manipulated, or were digitally altered, which limited both human and machine annotators in accurately classifying phenology. Thus, we provide a list of photography guidelines that could be included in community science platforms to inform community scientists in the best practices for creating images that facilitate phenological analysis.
Methods
Creating a training and validation image set
We downloaded 40,761 research-grade observations of A. petiolata from iNaturalist, ranging from 1995 to 2020. Observations on the iNaturalist platform are considered "research-grade" if the observation is verifiable (includes an image), includes the date and location observed, is growing wild (i.e. not cultivated), and at least two-thirds of community users agree on the species identification. From this dataset, we used a subset of images for model training. The total number of observations in the iNaturalist dataset is heavily skewed towards more recent years. Less than 5% of the images we downloaded (n=1,790) were uploaded between 1995-2016, while over 50% of the images were uploaded in 2020. To mitigate temporal bias, we used all available images between the years 1995 and 2016 and we randomly selected images uploaded between 2017-2020. We restricted the number of randomly-selected images in 2020 by capping the number of 2020 images to approximately the number of 2019 observations in the training set. The annotated observation records are available in the supplement (supplementary data sheet 1). The majority of the unprocessed records (those which hold a CC-BY-NC license) are also available on GBIF.org (2021).
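A hedged sketch of this temporal re-balancing is shown below, assuming the observations sit in a pandas dataframe with an observed_year column (a hypothetical name); only the 2020 cap is implemented, since the exact subsampling fractions for 2017-2019 are not specified here.

```python
# Keep all 1995-2016 images and cap the 2020 images at roughly the 2019 count.
import pandas as pd

def rebalance_by_year(df, keep_all_before=2017, cap_year=2020, seed=42):
    early = df[df["observed_year"] < keep_all_before]             # all 1995-2016 images
    recent = df[df["observed_year"] >= keep_all_before]
    cap = int((recent["observed_year"] == cap_year - 1).sum())    # approx. number of 2019 images
    parts = []
    for year, group in recent.groupby("observed_year"):
        n = min(len(group), cap) if year == cap_year else len(group)
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat([early] + parts, ignore_index=True)
```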
One of us (R. Reeb) annotated the phenology of training and validation set images using two different classification schemes: two-stage (non-flowering, flowering) and four-stage (vegetative, budding, flowering, fruiting). For the two-stage scheme, we classified 12,277 images and designated images as ‘flowering’ if there was one or more open flowers on the plant. All other images were classified as non-flowering. For the four-stage scheme, we classified 12,758 images. We classified images as ‘vegetative’ if no reproductive parts were present, ‘budding’ if one or more unopened flower buds were present, ‘flowering’ if at least one opened flower was present, and ‘fruiting’ if at least one fully-formed fruit was present (with no remaining flower petals attached at the base). Phenology categories were discrete; if there was more than one type of reproductive organ on the plant, the image was labeled based on the latest phenophase (e.g. if both flowers and fruits were present, the image was classified as fruiting).
For both classification schemes, we only included images in the model training and validation dataset if the image contained one or more plants whose reproductive parts were clearly visible and we could exclude the possibility of a later phenophase. We removed 1.6% of images from the two-stage dataset that did not meet this requirement, leaving a total of 12,077 images, and 4.0% of the images from the four-stage dataset, leaving a total of 12,237 images. We then split the two-stage and four-stage datasets into a model training dataset (80% of each dataset) and a validation dataset (20% of each dataset).
Training a two-stage and four-stage CNN
We adapted techniques from studies applying machine learning to herbarium specimens for use with community science images (Lorieul et al. 2019; Pearson et al. 2020). We used transfer learning to speed up training of the model and reduce the size requirements for our labeled dataset. This approach uses a model that has been pre-trained on a large dataset and so is already competent at basic tasks such as detecting lines and shapes in images. We trained a neural network (ResNet-18) using the PyTorch machine learning library (Paszke et al. 2019) within Python. We chose the ResNet-18 neural network because it has fewer convolutional layers and thus is less computationally intensive than pre-trained neural networks with more layers, and in early testing we reached the desired accuracy with the two-stage model using ResNet-18. ResNet-18 was pre-trained on the ImageNet dataset, which has 1,281,167 training images (Deng et al. 2009). We used the default parameters for batch size (4), learning rate (0.001), optimizer (stochastic gradient descent), and loss function (cross-entropy loss). Because this led to satisfactory performance, we did not investigate hyperparameters further.
Because the ImageNet dataset has 1,000 classes while our data were labeled with either 2 or 4 classes, we replaced the final fully-connected layer of the ResNet-18 architecture with a fully-connected layer with an output size of 2 for the 2-class problem or 4 for the 4-class problem. We resized and cropped the images to fit ResNet's input size of 224x224 pixels and normalized the distribution of the RGB values in each image to a mean of zero and a standard deviation of one, to simplify model calculations. During training, the CNN makes predictions on the labeled data from the training set and calculates a loss that quantifies the model's inaccuracy. The gradient of the loss with respect to the model parameters is computed, and the parameters are then updated to minimize the loss value. After this training step, model performance is estimated by making predictions on the validation dataset. The model is not updated during this process, so the validation data remain 'unseen' by the model (Rawat and Wang 2017; Tetko et al. 1995). This cycle is repeated until the desired level of accuracy is reached. We trained our model for 25 of these cycles, or epochs. We stopped training at 25 epochs to prevent overfitting, where the model becomes trained too specifically to the training images and begins to lose accuracy on images in the validation dataset (Tetko et al. 1995).
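A minimal transfer-learning sketch matching the setup described above (pre-trained ResNet-18, replaced final layer, 224x224 inputs, SGD with learning rate 0.001, cross-entropy loss, batch size 4, 25 epochs) is given below; the ImageFolder directory layout, the ImageNet normalisation constants, and the momentum value are assumptions.

```python
# Fine-tune a pre-trained ResNet-18 on phenophase-labelled images (sketch).
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

num_classes = 4                                      # 2 for the two-stage scheme
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),                      # ResNet input size of 224x224 pixels
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],      # common ImageNet statistics (assumption)
                         [0.229, 0.224, 0.225]),
])

train_ds = datasets.ImageFolder("phenology/train", transform=preprocess)   # hypothetical folder layout
train_loader = torch.utils.data.DataLoader(train_ds, batch_size=4, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)   # replace the final fully-connected layer

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)    # momentum not stated in the text

for epoch in range(25):                              # 25 epochs, as described above
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```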
We evaluated model accuracy and created confusion matrices using the model's predictions on the labeled validation data. This allowed us to evaluate the model's overall accuracy and to see which specific categories are the most difficult for the model to distinguish. To make phenology predictions on the full 40,761-image dataset, we created a custom dataloader in PyTorch by subclassing its Dataset class, which allows images listed in a csv to be loaded and passed through the model while remaining associated with unique image IDs.
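A minimal version of such a csv-driven Dataset might look like the sketch below; the column names image_id and file_path are hypothetical placeholders.

```python
# Load images listed in a csv and keep track of their unique IDs (sketch).
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class CsvImageDataset(Dataset):
    def __init__(self, csv_path, transform=None):
        self.records = pd.read_csv(csv_path)
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        row = self.records.iloc[idx]
        image = Image.open(row["file_path"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, row["image_id"]

# loader = DataLoader(CsvImageDataset("observations.csv", transform=preprocess),  # e.g. the transform from the sketch above
#                     batch_size=4, shuffle=False)
```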
Hardware information
Model training was conducted using a personal laptop (Ryzen 5 3500U cpu and 8 GB of memory) and a desktop computer (Ryzen 5 3600 cpu, NVIDIA RTX 3070 GPU and 16 GB of memory).
Comparing CNN accuracy to human annotation accuracy
We compared the accuracy of the trained CNN to the accuracy of seven inexperienced human scorers annotating a random subsample of 250 images from the full, 40,761-image dataset. An expert annotator (R. Reeb, who has over a year's experience in annotating A. petiolata phenology) first classified the subsample images using the four-stage phenology classification scheme (vegetative, budding, flowering, fruiting). Nine images could not be classified for phenology and were removed. Next, seven non-expert annotators classified the 241 subsample images using an identical protocol. This group represented a variety of levels of familiarity with A. petiolata phenology, ranging from no research experience to extensive research experience (two or more years working with this species). However, no one in the group had substantial experience classifying community science images and all were naïve to the four-stage phenology scoring protocol. The trained CNN was also used to classify the subsample images. We compared human annotation accuracy in each phenophase to the accuracy of the CNN using Student's t-tests.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the article, we trained and evaluated models on the Image Privacy Dataset (IPD) and the PrivacyAlert dataset. The datasets are originally provided by other sources and have been re-organised and curated for this work.
Our curation organises the datasets in a common structure. We updated the annotations and labelled the splits of the data in the annotation file. This avoids having separate folders of images for each data split (training, validation, testing) and allows flexible handling of new splits, e.g. splits created with a stratified K-Fold cross-validation procedure. As for the original datasets (PicAlert and PrivacyAlert), we provide the links to the images in bash scripts that download the images. Another bash script re-organises the images into sub-folders with a maximum of 1000 images each.
Both datasets refer to images publicly available on Flickr. These images have a large variety of content, including sensitive content (semi-nude people, vehicle plates, documents, private events). Images were annotated with a binary label denoting whether the content was deemed public or private. As the images are publicly available, their label is mostly public, so these datasets have a high imbalance towards the public class. Note that IPD combines two other existing datasets, PicAlert and part of VISPR, to increase the number of private images, which is limited in PicAlert. Further details are given in our corresponding publication: https://doi.org/10.48550/arXiv.2503.12464
List of datasets and their original source:
Notes:
Some of the models run their pipeline end-to-end with the images as input, whereas other models require different or additional inputs. These inputs include the pre-computed visual entities (scene types and object types) represented in a graph format, e.g. for a Graph Neural Network. Re-using these pre-computed visual entities allows other researchers to build new models based on these features while avoiding re-computing them on their own or at each epoch during the training of a model (faster training).
For each image of each dataset, namely PrivacyAlert, PicAlert, and VISPR, we provide the predicted scene probabilities as a .csv file, the detected objects as a .json file in COCO data format, and the node features (visual entities already organised in graph format with their features) as a .json file. For consistency, all the files are already organised in batches following the structure of the images in the datasets folder. For each dataset, we also provide the pre-computed adjacency matrix for the graph data; a loading sketch follows the note below.
Note: IPD is based on PicAlert and VISPR and therefore IPD refers to the scene probabilities and object detections of the other two datasets. Both PicAlert and VISPR must be downloaded and prepared to use IPD for training and testing.
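A hedged loading sketch for these pre-computed inputs is shown below; the file names are placeholders, since the exact layout is documented in the repository linked underneath.

```python
# Load pre-computed scene probabilities, COCO-format detections and graph features (sketch).
import json
import pandas as pd

scene_probs = pd.read_csv("scenes/privacyalert_batch_000.csv")          # scene probabilities per image
with open("objects/privacyalert_batch_000.json") as f:
    detections = json.load(f)                                           # detected objects, COCO data format
with open("graphs/privacyalert_node_features_000.json") as f:
    node_features = json.load(f)                                        # visual entities in graph format
adjacency = pd.read_csv("graphs/privacyalert_adjacency.csv", header=None).to_numpy()
```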
Further details on downloading and organising data can be found in our GitHub repository: https://github.com/graphnex/privacy-from-visual-entities (see ARTIFACT-EVALUATION.md#pre-computed-visual-entitities-)
If you have any enquiries, questions, or comments, or you would like to file a bug report or a feature request, please use the issue tracker of our GitHub repository.
Background
Methods for extracting useful information from the datasets produced by microarray experiments are at present of much interest. Here we present new methods for finding gene sets that are well suited for distinguishing experiment classes, such as healthy versus diseased tissues. Our methods are based on evaluating genes in pairs and assessing how well a pair in combination distinguishes two experiment classes. We tested the ability of our pair-based methods to select gene sets that generalize the differences between experiment classes and compared their performance to two standard methods. To assess the ability to generalize class differences, we studied how well the gene sets we select are suited for learning a classifier.
Results
We show that the gene sets selected by our methods outperform the standard methods, in some cases by a large margin, in terms of cross-validation prediction accuracy of the learned classifier. We show that on two public datasets, accurate diagnoses can be made using only 15-30 genes. Our results have implications for how to select marker genes and how many gene measurements are needed for diagnostic purposes.
Conclusion
When looking for differential expression between experiment classes, it may not be sufficient to look at each gene in a separate universe. Evaluating combinations of genes reveals interesting information that will not be discovered otherwise. Our results show that class prediction can be improved by taking advantage of this extra information.
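As a hedged illustration of the pair-based idea, the sketch below scores every gene pair by the cross-validated accuracy of a simple classifier trained on just those two genes; the classifier is a generic stand-in rather than the authors' scoring rule, and exhaustive enumeration is only practical for a pre-filtered gene list.

```python
# Rank gene pairs by how well they separate two experiment classes (illustration).
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def rank_gene_pairs(X, y, top_k=15, cv=5):
    """X: samples x genes expression matrix, y: binary class labels."""
    scores = []
    for i, j in combinations(range(X.shape[1]), 2):        # O(p^2) pairs: pre-filter genes first
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              X[:, [i, j]], y, cv=cv).mean()
        scores.append(((i, j), acc))
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]
```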
By Huggingface Hub [source]
The SciTail dataset is your gateway to building powerful natural language inference (NLI) models for scientific text. Its premise-hypothesis pairs are derived from science exam questions and web sentences, giving you the opportunity to develop and train NLI algorithms capable of handling scientific statements. Provided in multiple formats, including training sets in both predictor and DGEM format as well as test sets in TSV and SNLI format, all containing the same fields in varied structures, this is an essential resource for anyone looking to explore scientific NLI. Train your algorithm today with SciTail!
This guide explains how to use the SciTail dataset for Natural Language Inference (NLI). NLI is a machine learning task that involves predicting the relationship between a premise and a hypothesis, such as entailment or a neutral relation. The SciTail dataset contains premise-hypothesis pairs, derived from science questions and web text, that can be used to train and evaluate NLI algorithms.
The SciTail dataset is provided in several formats: DGEM format for testing and training, predictor format for validation and training, and TSV format for testing and validation. Each format contains the same data fields in a different form, including the premise, the hypothesis, the label assigned by annotators, and so on.
To get started, download the files in whichever format you prefer from Kaggle. Each row represents a single premise-hypothesis pair together with an annotator-assigned label indicating whether the premise entails the hypothesis.
Once you have downloaded your preferred files, prepare them for training or evaluation by formatting them so they can be consumed by your algorithms. We suggest splitting your chosen file(s) into separate training and validation sets, ensuring the selected samples are representative of real-world language and include both pairs with a positive entailment relation and pairs where no entailment holds, or where the relation is uncertain given the evidence in the premise (i.e. a neutral relation). A minimal preparation sketch follows.
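The sketch below is one way to do this; the file name is a placeholder for whichever SciTail file you downloaded, and the column names follow the tables further down.

```python
# Split a SciTail file into stratified training and validation sets (sketch).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("predictor_format_train.csv")             # placeholder file name
train_df, val_df = train_test_split(df, test_size=0.2,
                                    stratify=df["label"], random_state=0)
print(train_df["label"].value_counts(normalize=True))       # check the label balance is preserved
```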
- Develop and fine-tune NLI algorithms with different levels of Sci-Fi language complexity.
- Use the annotator labels to develop an automated human-in-the-loop approach to NLI algorithms.
- Incorporate the hypothesis graph structure into existing models to improve accuracy and reduce error rates in identifying contextual comparisons between premises and hypotheses in Sci-Fi texts
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: dgem_format_test.csv

| Column name | Description |
|:---------------------------|:------------|
| premise | The premise of the statement (String). |
| hypothesis | The hypothesis of the statement (String). |
| label | The label of the statement – either entailment, neutral or contradiction (String). |
| hypothesis_graph_structure | A graph structure of the hypothesis (Graph) |
File: predictor_format_validation.csv | Column name | Description ...
License: custom dataset license, https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-5519
README
Repository for publication: A. Shamooni et al., Super-resolution reconstruction of scalar fields from the pyrolysis of pulverised biomass using deep learning, Proc. Combust. Inst. (2025)

torch_code
The main PyTorch source code used for training/testing is provided in the torch_code.tar.gz file.

torch_code_tradGAN
To compare with a traditional GAN, we use the code in torch_code_tradGAN with similar particle-laden datasets. The source code is in the torch_code_tradGAN.tar.gz file.

datasets
The training/validation/testing datasets are provided in lmdb format, ready to use in the code. The datasets in datasets.tar.gz contain:
- Training dataset: data_train_OF-mass_kinematics_mk0x_1x_2x_FHIT_particle_128_Re52-2D_20736_lmdb.lmdb
- Test dataset: data_valid_inSample_OF-mass_kinematics_mk0x_1x_2x_FHIT_particle_128_Re52-2D_3456_lmdb.lmdb
Note that the samples from 9 DNS cases are collected in order (each case contributes 2304 samples for training and 384 samples for testing), which can be recognized using the metadata file provided in each folder.
- Out-of-distribution test dataset (used in Fig 10 of the paper): data_valid_inSample_OF-mass_kinematics_mk3x_FHIT_particle_128_Re52-2D_nonUniform_1024_lmdb.lmdb (we have two separate OOD DNS cases and select 512 samples from each).

experiments
The main trained models are provided in the experiments.tar.gz file. Each experiment contains the training log file, the last training state (for restart) and the model weights used in the publication.
- Trained model using the main dataset (used in Figs 2-10 of the paper): h_oldOrder_mk_700-11-c_PFT_Inp4TrZk_outTrZ_RRDBNetCBAM-4Prt_DcondPrtWav_f128g64b16_BS16x4_LrG45D5_DS-mk012-20k_LStandLog
To compare with a traditional GAN, we use the code in torch_code_tradGAN with similar particle-laden datasets as above. The training consists of one pre-training step and two separate fine-tunings: one with the loss weights from the literature and one with tuned loss weights. The final results are in experiments/trad_GAN/experiments/
- Pre-trained traditional GAN model (used in Figs 8-9 of the paper): train_RRDB_SRx4_particle_PSNR
- Fine-tuned traditional GAN model with loss weights from the literature (used in Figs 8-9 of the paper): train_ESRGAN_SRx4_particle_Nista_oneBlock
- Fine-tuned traditional GAN model with optimized loss weights (used in Figs 8-9 of the paper): train_ESRGAN_SRx4_particle_oneBlock_betaA

inference_notebooks
The inference_notebooks folder contains example notebooks for inference: "torch_code_inference" (inference with the main trained model) and "torch_code_tradGAN_inference" (inference for the traditional GAN approach). Move the inference folders in each of these folders into the corresponding torch_code roots. Also create softlinks of datasets and experiments in the main torch_code roots. Note that in each notebook you must double check the required paths to make sure they are set correctly.

How to
Build the environment
To build the environment required for training and inference you need Anaconda.
Go to the torch_code folder and run:
conda env create -f environment.yml
Then create an ipython kernel for post-processing:
conda activate torch_22_2025_Shamooni_PCI
python -m ipykernel install --user --name ipyk_torch_22_2025_Shamooni_PCI --display-name "ipython kernel for post processing of PCI2025"

Perform training
It is suggested to create softlinks to the dataset folder directly in the torch_code folder:
cd torch_code
ln -s datasets
You can also simply move the datasets and inference folders into the torch_code folder beside the cfd_sr folder and other files. In general, we prefer to have a root structure as below:
Root files and directories: cfd_sr, datasets, experiments, inference, options, init.py, test.py, train.py, version.py
Then activate the conda environment:
conda activate torch_22_2025_Shamooni_PCI
An example script to run on a single node with 2 GPUs:
torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py -opt options/train/condSRGAN/use_h_mk_700-011_PFT.yml --launcher pytorch
Make sure that the paths to the datasets ("dataroot_gt") and the "meta_info_file" for both training and validation data in the option files are set correctly.
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
In molecular phylogenetics, partition models and mixture models provide different approaches to accommodating heterogeneity in genomic sequencing data. Both types of models generally give a superior fit to the data compared with models that assume the process of sequence evolution is homogeneous across sites and lineages. The Akaike Information Criterion (AIC), an estimator of Kullback-Leibler divergence, and the Bayesian Information Criterion (BIC) are popular tools to select models in phylogenetics. Recent work suggests AIC should not be used for comparing mixture and partition models. In this work, we clarify that this difficulty is not fully explained by AIC misestimating the Kullback-Leibler divergence. We also investigate the performance of the AIC and BIC by comparing amongst mixture models and amongst partition models. We find that under non-standard conditions (i.e. when some edges have a small expected number of changes), AIC underestimates the expected Kullback-Leibler divergence. Under such conditions, AIC preferred the complex mixture models and BIC preferred the simpler mixture models. The mixture models selected by AIC had a better performance in estimating the edge lengths, while the simpler models selected by BIC performed better in estimating the base frequencies and substitution rate parameters. In contrast, AIC and BIC both prefer simpler partition models over more complex partition models under non-standard conditions, despite the fact that the more complex partition model was the generating model. We also investigated how mispartitioning (i.e. grouping sites that have not evolved under the same process) affects both the performance of partition models compared to mixture models and the model selection process. We found that as the level of mispartitioning increases, the bias of AIC in estimating the expected Kullback-Leibler divergence remains the same, and the branch lengths and evolutionary parameters estimated by partition models become less accurate. We recommend that researchers be cautious when using AIC and BIC to select among partition and mixture models; other alternatives, such as cross-validation and bootstrapping, should be explored, but may suffer similar limitations.

Methods
This document records the pipeline used in the data analyses in "Performance of Akaike Information Criterion and Bayesian Information Criterion in selecting partition models and mixture models". The main processes included generating alignments, fitting four different partition and mixture models, and analysing the results. The data were generated with Seq-Gen 1.3.4 (Rambaut and Grassly 1997). The model fitting was performed in IQ-TREE2 (Minh et al. 2020) on a Linux system. The results were analysed using the R package phangorn in R (version 3.6.2) (Schliep 2011, R Core Team 2019). We wrote custom bash scripts to extract relevant parts of the results from IQ-TREE2, and these results were processed in R.

The zip files contain four folders: "bash-scripts", "data", "R-codes", and "results-IQTREE2". The "bash-scripts" folder contains all the bash scripts for simulating alignments and performing model fitting. The "data" folder contains two child folders: "sequence-data" and "Rdata". The child folder "sequence-data" contains the alignments created for the simulations; the other child folder, "Rdata", contains the files created by R to store the results extracted from IQ-TREE2 and the results calculated in R. The "R-codes" folder includes the R code for analysing the results from IQ-TREE2.
The folder "results-IQTREE2" stores all the results from the fitted models. The three simulations we performed were essentially the same. We used the same parameters of the evolutionary models, and the trees with the same topologies but different edge lengths to generate the sequences. The steps we used were: simulating alignments, model fitting and extracting results, and processing the extracted results. The first two steps were performed on a Linux system using bash scripts, and the last step was performed in R. Simulating Alignment To simulate heterogeneous data we created two multiple sequence alignments (MSAs) under simple homogeneous models with each model comprising a substitution model and an edge-weighted phylogenetic tree (the tree topology was fixed). Each MSA contained eight taxa and 1000 sites. This was performed using the bash script “step1_seqgen_data.sh” in Linux. These two MSAs were then concatenated together giving a MSA with 2000 sites. This was equivalent to generating the concatenated MSA under a two-block unlinked edge lengths partition model (P-UEL). This was performed using the bash script “step2_concat_data.sh”. This created the 0% group of MSAs. In order to simulate a situation where the initial choice of blocks does not properly account for the heterogeneity in the concatenated MSA (i.e., mispartitioning), we randomly selected a proportion of 0%, 5%, 10%, 15%, …, up to 50% of sites from each block and swapped them. That is, the sites drawn from the first block were placed in the second block, and the sites drawn from the second block were placed in the first block. This process was repeated 100 times for each proportion of mispartitioned sites giving a total of 1100 MSAs. This process involved two steps. The first step was to generate ten sets of different amounts of numbers without duplicates from each of the two intervals [1,1000] and [1001,2000]. The amounts of numbers were based on the proportions of incorrectly partitioning sites. For example, the first set has 50 numbers on each interval, and the second set has 100 numbers on each interval, etc. The first step was performed in R, and the R code was not provided but the random number text files were included. The second step was to select sites from the concatenated MSAs from the locations based on the numbers created in the first step. This created 5%, 10%, 15%, …, 50% groups of MSAs. The second step used the following bash scripts: “step3_1_mixmatch_pre_data.sh” and “step3_2_mixmatch_data.sh”. The MSAs used in the simulations were created and stored in the “data” folder. Model Fitting and Extracting Results The next steps were to fit four different partition and mixture models to the data in IQ-TREE2 and extract the results. The models used were P-LEL partition model, P-UEL partition model, M-UGP mixture model, and M-LGP mixture model. For the partition models, the partitioning schemes were the same: the first 1000 sites as a block and the second 1000 sites as another. For the groups of MSAs with different proportions of mispartitioned sites, this was equivalent to fitting the partition models with an incorrect partitioning scheme. The partitioning scheme was called “parscheme.nex”. The bash scripts for model fitting were stored in the “bash-scripts” folder. To run the bash scripts, users can follow the order which was shown in the names of these bash scripts. 
The inferred trees, estimated base frequencies, estimated rate matrices, estimated weight factors, AIC values, and BIC values were extracted from the IQ-TREE2 results. These extracted results were stored in the "results-IQTREE2" folder and used to evaluate the performance of AIC, BIC, and the models in R.

Processing Extracted Results in R
To evaluate the performance of AIC and BIC, and the performance of the fitted partition and mixture models, we calculated the following measures: the rEKL values, the bias of AIC in estimating the rEKL, the BIC values, and the branch scores (bs). We also compared the distribution of the estimated model parameters (i.e. base frequencies and rate matrices) to the generating model parameters. These processes were performed in R. The first step was to read in the inferred trees, estimated base frequencies, estimated rate matrices, estimated weight factors, AIC values, and BIC values that were extracted from the IQ-TREE2 results. These R scripts are stored in the "R-codes" folder, and their names start with "readpara_..." (e.g. "readpara_MLGP_standard"). After reading in all the parameters for each model, we estimated the measures mentioned above using the corresponding R scripts, also in the "R-codes" folder. The functions used in these scripts are stored in "R_functions_simulation". Note that the directories need to be changed if users want to run these R scripts on their own computers.
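For reference, the two criteria compared throughout this entry take their standard forms, with $\hat{L}$ the maximised likelihood, $k$ the number of free parameters, and $n$ the sample size (here, the number of alignment sites):

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}
```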
License: U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
The U.S. Geological Survey (USGS) Earth Resources Observation and Science (EROS) Center in Sioux Falls, SD has developed a cloud validation dataset from Collection 2 images throughout the history of Landsat. Two North American locations with high overlap between WRS-1 and WRS-2 were chosen. For each location, 20 images were selected at random from the Landsat archive, with at least one scene taken from each Landsat satellite between the years of 1972-2024. This provides a sampling of the 50-year history of Landsat data over the two chosen locations -- New Brunswick and Tucson, AZ. It is intended that more locations will be added to this dataset in the future. For each scene, a manual cloud validation mask was created. While these validation images were subjectively designed by a single analyst, they provide useful information for quantifying the accuracy of clouds flagged by various cloud masking algorithms. Each mask is provided in GeoTIFF format, and includes all bands from ...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective
A clinical prediction model for postoperative acute kidney injury (AKI) in patients with type A acute aortic dissection (TAAAD) and type B acute aortic dissection (TBAAD) was constructed using machine learning (ML).

Methods
Baseline data were collected from acute aortic dissection (AAD) patients admitted to the First Affiliated Hospital of Xinjiang Medical University between January 1, 2019 and December 31, 2021. (1) We identified baseline serum creatinine (SCR) estimation methods and used them as the basis for the diagnosis of AKI. (2) The total dataset was randomly divided into a training set (70%) and a test set (30%); features were modelled and validated with bootstrapping using multiple ML methods in the training set, and the model with the largest area under the curve (AUC) was selected for follow-up study. (3) The variables of the best ML model were screened using the model visualization tool Shapley Additive Explanations (SHAP) and recursive feature elimination (RFE). (4) Finally, the pre-screened prediction models were evaluated on the test set from three aspects: discrimination, calibration, and clinical benefit.

Results
The final incidence of AKI was 69.4% (120/173) in 173 patients with TAAAD and 28.6% (81/283) in 283 patients with TBAAD. For TAAAD-AKI, the random forest (RF) model showed the best prediction performance in the training set (AUC = 0.760, 95% CI: 0.630-0.881); for TBAAD-AKI, the Light Gradient Boosting Machine (LightGBM) model worked best (AUC = 0.734, 95% CI: 0.623-0.847). Screening of the characteristic variables revealed that the common predictors of postoperative AKI across the two final prediction models were baseline SCR, blood urea nitrogen (BUN) and uric acid (UA) at admission, and mechanical ventilation time (MVT). The specific predictors in the TAAAD-AKI model were white blood cell count (WBC), platelet count (PLT) and D-dimer at admission, and plasma … The specific predictors in the TBAAD-AKI model were N-terminal pro B-type natriuretic peptide (BNP), serum potassium, activated partial thromboplastin time (APTT) and systolic blood pressure (SBP) at admission, and combined renal arteriography during surgery. In terms of discrimination, the ROC value of the RF model for TAAAD was 0.81 and the ROC value of the LightGBM model for TBAAD was 0.74, both with good accuracy. In terms of calibration, the calibration curve of the TAAAD-AKI RF model fit the ideal curve best and had the smallest Brier score (0.16); similarly, the calibration curve of the TBAAD-AKI LightGBM model fit the ideal curve best and had the smallest Brier score (0.15). In terms of clinical benefit, the best ML models for both types of AAD showed good net benefit in decision curve analysis (DCA).

Conclusion
We successfully constructed and validated clinical prediction models for the occurrence of AKI after surgery in TAAAD and TBAAD patients using different ML algorithms. The main predictors for the two types of AAD-AKI differ somewhat, so the strategies for early prevention and control of AKI also differ; more external data are needed for validation.
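A hedged sketch of the core evaluation loop described above (70/30 split, a random-forest candidate, AUC for discrimination and the Brier score for calibration) is given below on toy data; it is not the authors' pipeline, and the feature matrix is synthetic.

```python
# 70/30 split, random forest, AUC and Brier score on held-out data (toy illustration).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(173, 12))                            # 173 patients, 12 candidate predictors (synthetic)
y = (X[:, 0] + rng.normal(scale=1.0, size=173) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
prob = rf.predict_proba(X_te)[:, 1]
print("AUC (discrimination):", roc_auc_score(y_te, prob))
print("Brier score (calibration):", brier_score_loss(y_te, prob))
```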
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The French National Mapping Agency (Institut National de l'Information Géographique et Forestière - IGN) is responsible for producing and maintaining the spatial data sets for all of France. At the same time, they must satisfy the needs of different stakeholders who are responsible for decisions at multiple levels from local to national. IGN produces many different maps including detailed road networks and land cover/land use maps over time. The information contained in these maps is crucial for many of the decisions made about urban planning, resource management and landscape restoration as well as other environmental issues in France. Recently, IGN has started the process of creating high-resolution land use/land cover (LULC) maps, aimed at developing smart and accurate services for monitoring LULC over time. To help update and validate the French LULC database, citizens and interested stakeholders can contribute using the Paysages mobile and web applications. This approach presents an opportunity to evaluate the integration of citizens into the IGN process of updating and validating LULC data.
Dataset 1: Change detection validation 2019
This dataset contains web-based validations of changes detected by time series (2016-2019) analysis of Sentinel-2 satellite imagery. Validation was conducted using two high-resolution orthophotos, from 2016 and 2019 respectively, as reference data. Two tools were used: the Paysages web application and LACO-Wiki. Both tools used the same validation design (blind validation) and the same options. For each detected change, contributors were asked to validate whether there was a change and, if so, to choose a LU or LC class from a pre-defined list of classes.
The dataset has the following characteristics:
Time period of the change detection: 2016-2019.
Time period of data collection: February 2019-December 2019
Total number of contributors: 105
Number of validated changes: 1048; each change was validated by 1 to 6 contributors.
Region of interest: Toulouse and surrounding areas
Associated files: 1- Change validation locations.png, 1-Change validation 2019 – Attributes.csv, 1-Change validation 2019.csv, 1-Change validation 2019.geoJSON
This dataset is licensed under a Creative Commons Attribution 4.0 International. It is attributed to the LandSense Citizen Observatory, IGN-France, and GeoVille.
Dataset 2: Land use classification 2019
The aim of this data collection campaign was to improve the LU classification of authoritative LULC data (OCS-GE 2016 ©IGN) for built-up area. Using the Paysages web platform, contributors are asked to choose a land use value among a list of pre-defined values for each location.
The dataset has the following characteristics:
Time period of data collection: August 2019
Types of contributors: Surveyors from the production department of IGN
Total number of contributors: 5
Total number of observations: 2711
Data specifications of the OCS-GE ©IGN
Region of interest: Toulouse and surrounding areas
Associated files: 2- LU classification points.png, 2-LU classification 2019 – Attributes.csv, 2-LU classification 2019.csv, 2-LU classification 2019.geoJSON
This dataset is licensed under a Creative Commons Attribution 4.0 International. It is attributed to the LandSense Citizen Observatory, IGN-France and the International Institute for Applied Systems Analysis.
Dataset 3: In-situ validation 2018
The aim of this data collection campaign was to collect in-situ (ground-based) information, using the Paysages mobile application, to update authoritative LULC data. Contributors visit pre-determined locations, take photographs of the point location and in the four cardinal directions away from the point, and answer a few questions related to the task. Two tasks were defined:
Classify the point by choosing a LU class between three classes: industrial (US2), commercial (US3) or residential (US5).
Validate changes detected by the LandSense Change Detection Service: for each new detected change, the contributor was requested to validate the change and choose a LU and LC class from a pre-defined list of classes.
The dataset has the following characteristics:
Time period of data collection: June 2018 – October 2018
Types of contributors: students from the School of Agricultural and Life Sciences and citizens
Total number of contributors: 26
Total number of observations: 281
Total number of photos: 421
Region of interest: Toulouse and surrounding areas
Associated files: 3- Insitu locations.png, 3- Insitu validation 2018 – Attributes.csv, 3- Insitu validation 2018.csv, 3- Insitu validation 2018.geoJSON
This dataset is licensed under a Creative Commons Attribution 4.0 International. It is attributed to the LandSense Citizen Observatory, IGN-France.
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no 689812.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In silico prediction of the in vivo efficacy of siRNA ionizable-lipid nanoparticles is desirable as it can save time and resources dedicated to wet-lab experimentation. This study aims to computationally predict the in vivo efficacy of siRNA nanoparticles. A data set containing 120 entries was prepared by combining molecular descriptors of the ionizable lipids with two nanoparticle formulation characteristics. Input descriptor combinations were selected by an evolutionary algorithm. Artificial neural networks, support vector machines and partial least squares regression were used for QSAR modeling. Depending on how the data set was split, two training sets and two external validation sets were prepared; training and validation sets contained 90 and 30 entries, respectively. The results showed successful prediction of the validation-set log(siRNA dose), with validation R2 values of 0.86-0.89 and 0.75-0.80 for validation sets one and two, respectively. Artificial neural networks gave the best validation R2 for both validation sets. For predictions with high bias, validation R2 improved from 0.47 to 0.96 when the training-set lipids were restricted to those lying within the applicability domain. In conclusion, the in vivo performance of siRNA nanoparticles was successfully predicted by combining cheminformatics with machine learning techniques.
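A minimal QSAR-style sketch along these lines is shown below, using synthetic descriptors, a 90/30 split and a small neural-network regressor; the architecture and descriptor count are assumptions, not the study's settings.

```python
# Predict log(siRNA dose) from descriptors and report validation R^2 (toy illustration).
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))                                   # 120 entries, 10 selected descriptors (synthetic)
y = X @ rng.normal(size=10) + rng.normal(scale=0.3, size=120)    # stand-in for log(siRNA dose)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=90, random_state=0)
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0))
model.fit(X_tr, y_tr)
print("validation R^2:", r2_score(y_val, model.predict(X_val)))
```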
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Results of applying an optimized machine learning approach to multi-task classification.
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
HelpSteer is an open-source dataset designed to support AI alignment through fair, team-oriented annotation. It provides 37,120 samples, each containing a prompt and a response along with five human-annotated attributes scored between 0 and 4, with higher scores indicating better quality. By combining machine learning and natural language processing methods with expert annotation, HelpSteer aims to provide a standardized set of values for measuring alignment between human and machine interactions. With responses rated for correctness, coherence, complexity, helpfulness and verbosity, HelpSteer helps organizations build reliable AI models that produce more accurate results and a better user experience.
How to Use HelpSteer: An Open-Source AI Alignment Dataset
HelpSteer is an open-source dataset designed to help researchers create models with AI Alignment. The dataset consists of 37,120 different samples each containing a prompt, a response and five human-annotated attributes used to measure these responses. This guide will give you a step-by-step introduction on how to leverage HelpSteer for your own projects.
Step 1 - Choosing the Data File
HelpSteer contains two data files – one for training and one for validation. To start exploring the dataset, first select the file you would like to use by downloading both train.csv and validation.csv from the Kaggle page linked above, or by getting them from the Google Drive repository attached here: [link]. All the samples in each file consist of 7 columns with information about a single response: prompt (given), response (submitted), helpfulness, correctness, coherence, complexity and verbosity, all with values between 0 and 4, where higher means better in the respective category.
Step 2 - Exploratory Data Analysis (EDA)
Once you have your file loaded into your workspace or favorite software environment (e.g. libraries like Pandas/NumPy, or even Microsoft Excel), explore it further by running some basic EDA that summarizes each feature's distribution and notes potential trends or points of interest. For example, which traits polarize the responses the most? Are there outliers that might signal something interesting? Plotting these results often provides insight into patterns across the dataset, which can be used later during the modeling phase, also known as feature engineering.
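A short EDA sketch for this step, assuming train.csv has the seven columns listed in Step 1:

```python
# Summarize the five 0-4 attribute scores in the HelpSteer training file (sketch).
import pandas as pd

df = pd.read_csv("train.csv")
attrs = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]
print(df[attrs].describe())                           # distribution of each attribute
print(df[attrs].corr())                               # e.g. does verbosity track helpfulness?
print(df["helpfulness"].value_counts().sort_index())  # how polarized are the helpfulness scores?
```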
Step 3 - Data Preprocessing
Your interpretation of the raw data during EDA should produce hypotheses about which features matter most for accurately estimating the attribute scores of unseen responses. It is therefore highly recommended to preprocess the data before any modelling effort, for example by cleaning up missing entries and handling outliers. If you are unsure about the allowed value ranges of specific attributes, refer back to the description on the Kaggle page; having the correct numerical domains ready makes the modelling workload lighter later on when building predictive models. Do not rush this stage, otherwise poor-quality data may lead to poor results when aiming for high accuracy at model deployment.
- Designating and measuring conversational AI engagement goals: Researchers can utilize the HelpSteer dataset to design evaluation metrics for AI engagement systems.
- Identifying conversational trends: By analyzing the annotations and data in HelpSteer, organizations can gain insights into what makes conversations more helpful, cohesive, complex or consistent across datasets or audiences.
- Training Virtual Assistants: Train artificial intelligence algorithms on this dataset to develop virtual assistants that respond effectively to customer queries with helpful answers
If you use this dataset in your research, please credit the original authors.
**License: [CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication](https://creativecommons.org/pu...
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This data set includes deforestation maps located on the border between western Brazil and northern Bolivia (corresponding to Sentinel-2 tile 20LKP). The source images for this dataset came from ESA's Sentinel-2A satellite. They were processed from top-of-atmosphere to surface reflectance using the Sen2Cor 2.8 software, and clouds were masked using the Fmask 4.0 algorithm. The K-Fold technique was used to select the best Random Forest (RF) model, varying different combinations of Sentinel-2A bands and vegetation indices. The RF models were trained using the time series of 481 samples included in this data set. The two selected models that presented the highest median F1 score for the Deforestation class were: 1) the combination of the blue, bnir, green, nnir, red, swir1, and swir2 bands (hereafter Bands); and 2) the combination of the Enhanced Vegetation Index, Normalized Difference Moisture Index, and Normalized Difference Vegetation Index (hereafter Indices). Each RF model produced a deforestation map. During training, we used RF models of 1000 trees and the full depth of the Sentinel-2A time series, comprising 36 observations ranging from August 2018 to July 2019. To assess the maps' accuracy, good practices were followed [1]. To determine the validation data set size (n), the user accuracy was conjectured using a bootstrapping technique. Two validation data sets (n=252) were collected independently to assess the maps' accuracy. For Deforestation, the Bands classification model had the highest F1 score (93.1%) compared with the Indices model (91.9%). The Forest and Other classes had better F1 scores using the Indices (85.8% and 82.2%, respectively) than using the Bands (85.3% and 78.7%, respectively). Our classifications have an overall accuracy of 88.9% for Bands and 84.9% for Indices, with the following user's (UA) and producer's (PA) accuracy for the Bands model:
- Deforestation: UA 97.4%, PA 89.2%
- Forest: UA 80.8%, PA 90.4%
- Other: UA 80.2%, PA 77.3%
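For reference, the three indices used in the Indices model can be computed from the surface reflectance bands with their standard formulas (a sketch; reflectance is assumed to be scaled to 0-1):

```python
# Standard vegetation/moisture indices from Sentinel-2 surface reflectance bands.
import numpy as np

def ndvi(nir, red):
    return (nir - red) / (nir + red)

def ndmi(nir, swir1):
    return (nir - swir1) / (nir + swir1)

def evi(nir, red, blue, G=2.5, C1=6.0, C2=7.5, L=1.0):
    return G * (nir - red) / (nir + C1 * red - C2 * blue + L)
```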
License: [Attribution 4.0 (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Colletotrichum kahawae is an emergent fungal pathogen causing severe epidemics of Coffee Berry Disease on Arabica coffee crops in Africa. The molecular mechanisms underlying the Coffea arabica–C. kahawae interaction are still poorly understood, as are the differences in pathogen aggressiveness, which makes the development of functional studies for this pathosystem a crucial step. Quantitative real-time PCR (qPCR) is one of the most promising approaches for gene expression analyses, but proper data normalization with suitable reference genes is an absolute requirement. In this study, a set of 8 candidate reference genes was selected based on two different approaches (literature and Illumina RNA-seq datasets) to determine the best normalization factor for qPCR expression analysis of C. kahawae samples. The expression stability of the candidate reference genes was evaluated for four isolates of C. kahawae with different aggressiveness patterns (Ang29, Ang67, Zim12 and Que2), at different stages of fungal development and at key time points of the plant–fungus interaction process. Stability was assessed using the pairwise method incorporated in geNorm and the model-based method used by the NormFinder software. For C. arabica–C. kahawae interaction samples, the best normalization factor combined the PP1, Act and ck34620 genes, while for C. kahawae samples the combination of PP1, Act and ck20430 proved the most appropriate choice. These results suggest that RNA-seq analyses can provide alternative sources of reference genes in addition to classical ones. The expression profiles of the bifunctional catalase-peroxidase (cat2) and trihydroxynaphthalene reductase (thr1) genes further validated the selected reference genes. This study provides, for the first time, the tools required to conduct accurate qPCR studies in C. kahawae with respect to its aggressiveness pattern, developmental stage and host interaction.
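The pairwise stability idea behind geNorm can be illustrated with a small re-implementation: for each candidate gene, compute the standard deviation of its log2 expression ratio against every other candidate, then average these into a stability value M (lower is more stable). The study itself used the geNorm and NormFinder software, not this code; the sketch below uses made-up expression values and is only meant to show the calculation.

```python
import numpy as np

def genorm_stability(expr, gene_names):
    """Pairwise geNorm-style stability M per gene (lower = more stable).

    expr: (n_samples, n_genes) array of relative expression quantities (linear scale).
    """
    log_expr = np.log2(expr)
    n_genes = expr.shape[1]
    stability = {}
    for j in range(n_genes):
        # SD across samples of the log2 ratio between gene j and every other gene.
        variations = [np.std(log_expr[:, j] - log_expr[:, k], ddof=1)
                      for k in range(n_genes) if k != j]
        stability[gene_names[j]] = float(np.mean(variations))
    return stability

# Toy example with simulated expression values for three of the candidate genes.
rng = np.random.default_rng(1)
expr = rng.lognormal(mean=0.0, sigma=0.2, size=(12, 3))
print(genorm_stability(expr, ["PP1", "Act", "ck34620"]))
```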
License: [Attribution 4.0 (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Covariate selection is a fundamental step when building sparse prediction models, both to avoid overfitting and to gain a better interpretation of the classifier without losing predictive accuracy. In practice, the LASSO regression of Tibshirani, which penalizes the likelihood of the model by the L1 norm of the regression coefficients, has become the gold standard for reaching these objectives. Recently, Lee and Oh developed a novel random-effect covariate selection method called the modified unbounded penalty (MUB) regression, whose penalization function can equal minus infinity at 0 in order to produce very sparse models. We compared the predictive accuracy and the number of covariates selected by these two methods in several high-dimensional datasets consisting of gene expression measured to predict response to chemotherapy in breast cancer patients. The comparisons were performed by building the Receiver Operating Characteristic (ROC) curves of the classifiers obtained with the selected genes and by comparing their area under the ROC curve (AUC), corrected for optimism using several variants of bootstrap internal validation and cross-validation. We found consistently, across all datasets, that the MUB penalization selected a remarkably smaller number of covariates than the LASSO while offering similar, and encouraging, predictive accuracy; the models selected by the MUB were in fact nested in those obtained with the LASSO. Similar findings were observed when comparing these results to those reported by other authors in the original publication, or when using the area under the Precision-Recall curve (AUCPR) as another measure of predictive performance. In conclusion, the MUB penalization appears to be one of the best options when sparsity is required in high dimensions, although further investigation in other datasets is required to validate these findings.
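The MUB penalty is not available in mainstream libraries, so the sketch below covers only the LASSO side of the comparison: an L1-penalized logistic regression on a toy high-dimensional matrix, with cross-validated AUC as the performance measure and a count of the covariates it keeps. The data, regularization strength, and fold count are placeholders, not the study's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy stand-in for a high-dimensional gene-expression matrix (n << p).
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 2000))
y = rng.integers(0, 2, size=100)

# L1-penalized (LASSO-style) logistic regression; smaller C gives sparser models.
lasso_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)

# Cross-validated area under the ROC curve as the measure of predictive accuracy.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(lasso_clf, X, y, cv=cv, scoring="roc_auc")
print(f"mean cross-validated AUC: {auc.mean():.3f}")

# Number of covariates (genes) retained by the L1 penalty after a full fit.
lasso_clf.fit(X, y)
print("selected covariates:", int(np.count_nonzero(lasso_clf.coef_)))
```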