97 datasets found
  1. OpenAI Summarization Corpus

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). OpenAI Summarization Corpus [Dataset]. https://www.kaggle.com/datasets/thedevastator/openai-summarization-corpus/code
    Explore at:
    zip (35399096 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    OpenAI Summarization Corpus

    Training and Validation Data from TL;DR, CNN, and Daily Mail

    By Huggingface Hub [source]

    About this dataset

    This dataset provides a comprehensive corpus for natural language processing tasks, specifically for text summarization and for validating OpenAI's reward models. It contains summaries of text from the TL;DR, CNN, and Daily Mail datasets, along with the choices made by workers when comparing summaries, batch information that differentiates the summaries created by different workers, and the dataset split attribute. Together these fields allow users to train natural language processing systems on real-world data to produce reliable, concise summaries of long-form text, and to benchmark summarization research directly against human-generated results.


    How to use the dataset

    This dataset provides a comprehensive corpus of human-generated summaries for text from the TL;DR, CNN, and Daily Mail datasets to help machine learning models understand and evaluate natural language processing. The dataset contains training and validation data to optimize machine learning tasks.

    To use this dataset for summarization tasks (see the sketch below):
    • Gather information about the text you would like to summarize by looking at the info column entries in the two .csv files (train and validation).
    • Choose the summary you want from the choice column of either .csv file, based on your preference for worker or batch type summarization.
    • Review the entries in the corresponding summaries column for alternative options with similar content but different word choices or styles that you may prefer over the original choice.
    • Look through the split, worker, and batch information for additional context on each choice before selecting the summary that best matches your needs for accuracy and clarity.
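
    A minimal sketch of these steps in Python, assuming the file and column names listed under Columns below; how the summaries and choice fields are encoded is not documented in this listing, so the parsing shown is an assumption to adapt:

```python
import ast
import pandas as pd

# File names taken from the Columns section of this listing.
train = pd.read_csv("comparisons_train.csv")
valid = pd.read_csv("comparisons_validation.csv")

# Inspect the fields described above: info, summaries, choice, batch, split.
print(train.columns.tolist())
print(train["split"].value_counts())

def chosen_summary(row):
    # Assumption: `summaries` is a serialized list of candidate summaries and
    # `choice` is the index of the worker-preferred one; adjust to the actual encoding.
    candidates = ast.literal_eval(row["summaries"])
    return candidates[int(row["choice"])]

train["preferred_summary"] = train.apply(chosen_summary, axis=1)
print(train[["info", "preferred_summary"]].head())
```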

    Research Ideas

    • Training a natural language processing model to automatically generate summaries of text, using summary and choice data from this dataset.
    • Evaluating OpenAI's reward model for natural language processing on the validation data in order to improve accuracy and performance.
    • Analyzing the worker and batch information, in order to assess different trends among workers or batches that could be indicative of bias or other issues affecting summarization accuracy

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: comparisons_validation.csv

    | Column name | Description |
    |:------------|:------------|
    | info | Text to be summarized. (String) |
    | summaries | Summaries generated by workers. (String) |
    | choice | The chosen summary. (String) |
    | batch | Batch for which it was created. (Integer) |
    | split | Split of the dataset between training and validation sets. (String) |
    | extra | Additional information about the given source material available. (String) |

    File: comparisons_train.csv

    | Column name | Description |
    |:------------|:------------|
    | info | Text to be summarized. (String) |
    | summaries | Summaries generated by workers. (String) |
    | choice | The chosen summary. (String) |
    | batch | Batch for which it was created. (Integer) |
    | split ...

  2. Landsat 7 Collection 2 cloud truth mask validation set

    • catalog.data.gov
    • data.usgs.gov
    Updated Oct 29, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Landsat 7 Collection 2 cloud truth mask validation set [Dataset]. https://catalog.data.gov/dataset/landsat-7-collection-2-cloud-truth-mask-validation-set
    Explore at:
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    The U.S. Geological Survey (USGS) Earth Resources Observation and Science (EROS) Center in Sioux Falls, SD developed a cloud validation dataset from 48 unique Landsat 7 Collection 2 images. These images were selected at random from the Landsat 7 SLC-On archive from various locations around the world. While these validation images were subjectively designed by a single analyst, they provide useful information for quantifying the accuracy of clouds flagged by various cloud masking algorithms. Each mask is provided in GeoTIFF format, and includes all bands from the original Landsat 7 Level-1 Collection 2 data product (COG GeoTIFF), and its associated Level-1 metadata (MTL.txt file). The methodology used to create these masks is the same as in previous USGS Landsat cloud truth masks (http://doi.org/10.5066/F7251GDH). Pixels are marked as Cloud if the pixel contains opaque and clearly identifiable clouds. Pixels are marked as Thin Cloud if they contain clouds that are transparent or if their classification as cloud is uncertain. Pixels that contain clouds with less than 50% opacity, or which do not contain clouds at all, are marked as Clear. In some masks the borders around clouds have been dilated to encompass the edges around irregular clouds.
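
    As a rough illustration of how such truth masks might be used, the sketch below compares one truth mask against a mask produced by a cloud detection algorithm. The file names and the numeric codes for Clear, Thin Cloud, and Cloud are assumptions, since the listing does not specify them; check the product documentation before relying on them.

```python
import numpy as np
import rasterio  # reads the GeoTIFF masks

TRUTH_PATH = "LE07_example_truth_mask.tif"      # hypothetical file name
PREDICTED_PATH = "my_algorithm_cloud_mask.tif"  # hypothetical algorithm output

# Assumed pixel coding for the three classes described above.
CLEAR, THIN_CLOUD, CLOUD = 0, 1, 2

with rasterio.open(TRUTH_PATH) as src:
    truth = src.read(1)
with rasterio.open(PREDICTED_PATH) as src:
    predicted = src.read(1)

# Score agreement on opaque cloud, ignoring the uncertain Thin Cloud pixels.
keep = truth != THIN_CLOUD
truth_cloud = truth[keep] == CLOUD
pred_cloud = predicted[keep] == CLOUD

accuracy = np.mean(truth_cloud == pred_cloud)
commission = np.mean(pred_cloud & ~truth_cloud)  # cloud flagged over clear truth
omission = np.mean(truth_cloud & ~pred_cloud)    # cloud missed by the algorithm
print(f"accuracy={accuracy:.3f}, commission={commission:.3f}, omission={omission:.3f}")
```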

  3. Results on validation data.

    • figshare.com
    xls
    Updated Jul 21, 2023
    Cite
    Carl Jidling; Daniel Gedon; Thomas B. Schön; Claudia Di Lorenzo Oliveira; Clareci Silva Cardoso; Ariela Mota Ferreira; Luana Giatti; Sandhi Maria Barreto; Ester C. Sabino; Antonio L. P. Ribeiro; Antônio H. Ribeiro (2023). Results on validation data. [Dataset]. http://doi.org/10.1371/journal.pntd.0011118.t003
    Explore at:
    xls
    Dataset updated
    Jul 21, 2023
    Dataset provided by
    PLOS Neglected Tropical Diseases
    Authors
    Carl Jidling; Daniel Gedon; Thomas B. Schön; Claudia Di Lorenzo Oliveira; Clareci Silva Cardoso; Ariela Mota Ferreira; Luana Giatti; Sandhi Maria Barreto; Ester C. Sabino; Antonio L. P. Ribeiro; Antônio H. Ribeiro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metrics and 95% confidence intervals evaluated on the validation data set for two different classification thresholds: 0.60 (selected by maximising the F1 score) and 0.71 (corresponding to 90% specificity).
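
    A small sketch of how metrics at the two reported thresholds could be reproduced from model scores; the arrays below are placeholders, not the study's data:

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

# Placeholder arrays standing in for validation labels and model scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.2, 0.55, 0.65, 0.8, 0.3, 0.72, 0.66, 0.9])

for threshold in (0.60, 0.71):  # thresholds reported in the table
    y_pred = (y_score >= threshold).astype(int)
    sensitivity = recall_score(y_true, y_pred)               # recall on positives
    specificity = recall_score(y_true, y_pred, pos_label=0)  # recall on negatives
    print(f"threshold={threshold:.2f} "
          f"F1={f1_score(y_true, y_pred):.2f} "
          f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```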

  4. Training set of NE-GraphSAGE model

    • scidb.cn
    Updated Feb 5, 2025
    Cite
    Tian Xuecan (2025). Training set of NE-GraphSAGE model [Dataset]. http://doi.org/10.57760/sciencedb.j00133.00487
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Tian Xuecan
    Description

    In order to thoroughly assess the generalization and transferability of the NE-GraphSAGE model across different domain datasets, this paper adopts a cross-domain validation strategy. Specifically, two domains with significant inherent differences are selected to construct the training and testing sets, respectively. After training the NE-GraphSAGE model on the non-review paper citation relationship network in Domain A to capture the unique patterns and associations of that domain, the model is then applied to the non-review paper citation relationship network in Domain B. Domain B exhibits evident differences from the training set in terms of data characteristics and research questions, thus creating a challenging testing environment. This strategy of separating the training and testing sets by domain can examine the adaptability and flexibility of the NE-GraphSAGE model when confronted with data from different domains.

    For this study, the training set is chosen from the intelligent transportation systems domain, while the testing set is selected from the 3D vision domain. Literature, including both review and non-review papers, from 2022 to 2024 in these two research domains is retrieved on the Web of Science platform. The search results yield a review literature collection comprising 473 papers, with 218 in the intelligent transportation systems domain and 255 in the 3D vision domain. The non-review literature collection contains 8311 papers, of which 3276 are in the intelligent transportation systems domain and 5035 are in the 3D vision domain. Based on the citation relationships among the non-review literature, a citation network is constructed and nodes are labeled. Additionally, indicators such as the number of citations, usage frequency, publication year, and the number of research fields covered are embedded as node attribute features. After processing, the training set consists of 1595 nodes and 1784 edges, while the testing set includes 1179 nodes and 908 edges.
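
    The listing does not include code, but the cross-domain protocol can be sketched with a generic two-layer GraphSAGE classifier (here via PyTorch Geometric's SAGEConv, not the authors' NE-GraphSAGE), trained on one domain's citation graph and evaluated zero-shot on the other. The graph contents are random placeholders sized to the node and edge counts quoted above:

```python
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import SAGEConv

class SAGE(torch.nn.Module):
    """Two-layer GraphSAGE node classifier (a generic stand-in, not NE-GraphSAGE)."""
    def __init__(self, in_dim, hidden, num_classes):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, num_classes)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

def random_citation_graph(num_nodes, num_edges, in_dim=4):
    # Placeholder graph; real node features would be citation counts, usage
    # frequency, publication year, and number of research fields, per the description.
    edge_index = torch.randint(0, num_nodes, (2, num_edges))
    x = torch.randn(num_nodes, in_dim)
    y = torch.randint(0, 2, (num_nodes,))
    return Data(x=x, edge_index=edge_index, y=y)

train_graph = random_citation_graph(1595, 1784)  # intelligent transportation domain
test_graph = random_citation_graph(1179, 908)    # 3D vision domain

model = SAGE(in_dim=4, hidden=32, num_classes=2)
opt = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(50):  # train only on the source-domain graph
    model.train()
    opt.zero_grad()
    out = model(train_graph.x, train_graph.edge_index)
    loss = F.cross_entropy(out, train_graph.y)
    loss.backward()
    opt.step()

model.eval()  # zero-shot transfer to the unseen target-domain graph
pred = model(test_graph.x, test_graph.edge_index).argmax(dim=1)
print("cross-domain accuracy:", (pred == test_graph.y).float().mean().item())
```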

  5. Data from: Using convolutional neural networks to efficiently extract...

    • data.niaid.nih.gov
    • dataone.org
    • +1more
    zip
    Updated Jan 4, 2022
    Cite
    Rachel Reeb; Naeem Aziz; Samuel Lapp; Justin Kitzes; J. Mason Heberling; Sara Kuebbing (2022). Using convolutional neural networks to efficiently extract immense phenological data from community science images [Dataset]. http://doi.org/10.5061/dryad.mkkwh7123
    Explore at:
    zip
    Dataset updated
    Jan 4, 2022
    Dataset provided by
    Carnegie Museum of Natural History
    University of Pittsburgh
    Authors
    Rachel Reeb; Naeem Aziz; Samuel Lapp; Justin Kitzes; J. Mason Heberling; Sara Kuebbing
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Community science image libraries offer a massive, but largely untapped, source of observational data for phenological research. The iNaturalist platform offers a particularly rich archive, containing more than 49 million verifiable, georeferenced, open access images, encompassing seven continents and over 278,000 species. A critical limitation preventing scientists from taking full advantage of this rich data source is labor. Each image must be manually inspected and categorized by phenophase, which is both time-intensive and costly. Consequently, researchers may only be able to use a subset of the total number of images available in the database. While iNaturalist has the potential to yield enough data for high-resolution and spatially extensive studies, it requires more efficient tools for phenological data extraction. A promising solution is automation of the image annotation process using deep learning. Recent innovations in deep learning have made these open-source tools accessible to a general research audience. However, it is unknown whether deep learning tools can accurately and efficiently annotate phenophases in community science images. Here, we train a convolutional neural network (CNN) to annotate images of Alliaria petiolata into distinct phenophases from iNaturalist and compare the performance of the model with non-expert human annotators. We demonstrate that researchers can successfully employ deep learning techniques to extract phenological information from community science images. A CNN classified two-stage phenology (flowering and non-flowering) with 95.9% accuracy and classified four-stage phenology (vegetative, budding, flowering, and fruiting) with 86.4% accuracy. The overall accuracy of the CNN did not differ from humans (p = 0.383), although performance varied across phenophases. We found that a primary challenge of using deep learning for image annotation was not related to the model itself, but instead in the quality of the community science images. Up to 4% of A. petiolata images in iNaturalist were taken from an improper distance, were physically manipulated, or were digitally altered, which limited both human and machine annotators in accurately classifying phenology. Thus, we provide a list of photography guidelines that could be included in community science platforms to inform community scientists in the best practices for creating images that facilitate phenological analysis.

    Methods

    Creating a training and validation image set

    We downloaded 40,761 research-grade observations of A. petiolata from iNaturalist, ranging from 1995 to 2020. Observations on the iNaturalist platform are considered “research-grade” if the observation is verifiable (includes image), includes the date and location observed, is growing wild (i.e. not cultivated), and at least two-thirds of community users agree on the species identification. From this dataset, we used a subset of images for model training. The total number of observations in the iNaturalist dataset is heavily skewed towards more recent years. Less than 5% of the images we downloaded (n=1,790) were uploaded between 1995-2016, while over 50% of the images were uploaded in 2020. To mitigate temporal bias, we used all available images between the years 1995 and 2016 and we randomly selected images uploaded between 2017-2020. We restricted the number of randomly-selected images in 2020 by capping the number of 2020 images to approximately the number of 2019 observations in the training set. The annotated observation records are available in the supplement (supplementary data sheet 1). The majority of the unprocessed records (those which hold a CC-BY-NC license) are also available on GBIF.org (2021).

    One of us (R. Reeb) annotated the phenology of training and validation set images using two different classification schemes: two-stage (non-flowering, flowering) and four-stage (vegetative, budding, flowering, fruiting). For the two-stage scheme, we classified 12,277 images and designated images as ‘flowering’ if there was one or more open flowers on the plant. All other images were classified as non-flowering. For the four-stage scheme, we classified 12,758 images. We classified images as ‘vegetative’ if no reproductive parts were present, ‘budding’ if one or more unopened flower buds were present, ‘flowering’ if at least one opened flower was present, and ‘fruiting’ if at least one fully-formed fruit was present (with no remaining flower petals attached at the base). Phenology categories were discrete; if there was more than one type of reproductive organ on the plant, the image was labeled based on the latest phenophase (e.g. if both flowers and fruits were present, the image was classified as fruiting).

    For both classification schemes, we only included images in the model training and validation dataset if the image contained one or more plants with clearly visible reproductive parts and we could exclude the possibility of a later phenophase. We removed 1.6% of images from the two-stage dataset that did not meet this requirement, leaving us with a total of 12,077 images, and 4.0% of the images from the four-stage dataset, leaving us with a total of 12,237 images. We then split the two-stage and four-stage datasets into a model training dataset (80% of each dataset) and a validation dataset (20% of each dataset).

    Training a two-stage and four-stage CNN

    We adapted techniques from studies applying machine learning to herbarium specimens for use with community science images (Lorieul et al. 2019; Pearson et al. 2020). We used transfer learning to speed up training of the model and reduce the size requirements for our labeled dataset. This approach uses a model that has been pre-trained using a large dataset and so is already competent at basic tasks such as detecting lines and shapes in images. We trained a neural network (ResNet-18) using the PyTorch machine learning library (Paszke et al. 2019) within Python. We chose the ResNet-18 neural network because it had fewer convolutional layers and thus was less computationally intensive than pre-trained neural networks with more layers. In early testing we reached the desired accuracy with the two-stage model using ResNet-18. ResNet-18 was pre-trained using the ImageNet dataset, which has 1,281,167 images for training (Deng et al. 2009). We utilized default parameters for batch size (4), learning rate (0.001), optimizer (stochastic gradient descent), and loss function (cross entropy loss). Because this led to satisfactory performance, we did not further investigate hyperparameters.

    Because the ImageNet dataset has 1,000 classes while our data was labeled with either 2 or 4 classes, we replaced the final fully-connected layer of the ResNet-18 architecture with fully-connected layers containing an output size of 2 for the 2-class problem and 4 for the 4-class problem. We resized and cropped the images to fit ResNet’s input size of 224x224 pixels and normalized the distribution of the RGB values in each image to a mean of zero and a standard deviation of one, to simplify model calculations. During training, the CNN makes predictions on the labeled data from the training set and calculates a loss parameter that quantifies the model’s inaccuracy. The slope of the loss in relation to model parameters is found and then the model parameters are updated to minimize the loss value. After this training step, model performance is estimated by making predictions on the validation dataset. The model is not updated during this process, so that the validation data remains ‘unseen’ by the model (Rawat and Wang 2017; Tetko et al. 1995). This cycle is repeated until the desired level of accuracy is reached. We trained our model for 25 of these cycles, or epochs. We stopped training at 25 epochs to prevent overfitting, where the model becomes trained too specifically for the training images and begins to lose accuracy on images in the validation dataset (Tetko et al. 1995).
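
    A minimal PyTorch sketch of the setup described above (pre-trained ResNet-18, replaced final layer, 224x224 inputs, SGD at learning rate 0.001, cross-entropy loss, batch size 4, 25 epochs). The folder layout and the ImageNet normalization constants are assumptions, not the authors' exact preprocessing:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

NUM_CLASSES = 4   # 2 for the two-stage scheme, 4 for the four-stage scheme
EPOCHS = 25       # training length reported above

# Resize/crop to ResNet's 224x224 input and normalize the RGB channels.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Assumes images sorted into per-phenophase subfolders; the authors' layout may differ.
train_ds = datasets.ImageFolder("phenology/train", transform=preprocess)
val_ds = datasets.ImageFolder("phenology/val", transform=preprocess)
train_dl = DataLoader(train_ds, batch_size=4, shuffle=True)
val_dl = DataLoader(val_ds, batch_size=4)

# Pre-trained ResNet-18 with the final fully connected layer replaced.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

for epoch in range(EPOCHS):
    model.train()
    for images, labels in train_dl:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    # Estimate performance on the held-out validation split after each epoch.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_dl:
            correct += (model(images).argmax(1) == labels).sum().item()
            total += labels.numel()
    print(f"epoch {epoch + 1}: validation accuracy {correct / total:.3f}")
```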

    We evaluated model accuracy and created confusion matrices using the model’s predictions on the labeled validation data. This allowed us to evaluate the model’s accuracy and which specific categories are the most difficult for the model to distinguish. For using the model to make phenology predictions on the full, 40,761 image dataset, we created a custom dataloader function in Pytorch using the Custom Dataset function, which would allow for loading images listed in a csv and passing them through the model associated with unique image IDs.
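
    A sketch of a CSV-driven dataset in the spirit of the custom dataloader described above; the column names image_id and path are hypothetical:

```python
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class CsvImageDataset(Dataset):
    """Loads images listed in a CSV so predictions can be tied back to image IDs.

    Assumes columns named 'image_id' and 'path'; the authors' actual column
    names are not given in this listing.
    """
    def __init__(self, csv_path, transform=None):
        self.frame = pd.read_csv(csv_path)
        self.transform = transform

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        image = Image.open(row["path"]).convert("RGB")
        if self.transform:
            image = self.transform(image)
        return image, row["image_id"]

# Example use with the model and preprocess from the previous sketch:
# loader = DataLoader(CsvImageDataset("inaturalist_images.csv", preprocess), batch_size=4)
# for images, image_ids in loader:
#     phenophase = model(images).argmax(1)
```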

    Hardware information

    Model training was conducted using a personal laptop (Ryzen 5 3500U cpu and 8 GB of memory) and a desktop computer (Ryzen 5 3600 cpu, NVIDIA RTX 3070 GPU and 16 GB of memory).

    Comparing CNN accuracy to human annotation accuracy

    We compared the accuracy of the trained CNN to the accuracy of seven inexperienced human scorers annotating a random subsample of 250 images from the full, 40,761 image dataset. An expert annotator (R. Reeb, who has over a year’s experience in annotating A. petiolata phenology) first classified the subsample images using the four-stage phenology classification scheme (vegetative, budding, flowering, fruiting). Nine images could not be classified for phenology and were removed. Next, seven non-expert annotators classified the 241 subsample images using an identical protocol. This group represented a variety of different levels of familiarity with A. petiolata phenology, ranging from no research experience to extensive research experience (two or more years working with this species). However, no one in the group had substantial experience classifying community science images and all were naïve to the four-stage phenology scoring protocol. The trained CNN was also used to classify the subsample images. We compared human annotation accuracy in each phenophase to the accuracy of the CNN using students

  6. Learning Privacy from Visual Entities - Curated data sets and pre-computed...

    • zenodo.org
    zip
    Updated May 7, 2025
    Cite
    Alessio Xompero; Alessio Xompero; Andrea Cavallaro; Andrea Cavallaro (2025). Learning Privacy from Visual Entities - Curated data sets and pre-computed visual entities [Dataset]. http://doi.org/10.5281/zenodo.15348506
    Explore at:
    zip
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alessio Xompero; Alessio Xompero; Andrea Cavallaro; Andrea Cavallaro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This repository contains the curated image privacy datasets and pre-computed visual entities used in the publication Learning Privacy from Visual Entities by A. Xompero and A. Cavallaro.
    [arxiv] [code]

    Curated image privacy data sets

    In the article, we trained and evaluated models on the Image Privacy Dataset (IPD) and the PrivacyAlert dataset. The datasets are originally provided by other sources and have been re-organised and curated for this work.

    Our curation organises the datasets in a common structure. We updated the annotations and labelled the splits of the data in the annotation file. This avoids having separated folders of images for each data split (training, validation, testing) and allows a flexible handling of new splits, e.g. created with a stratified K-Fold cross-validation procedure. As for the original datasets (PicAlert and PrivacyAlert), we provide the link to the images in bash scripts to download the images. Another bash script re-organises the images in sub-folders with maximum 1000 images in each folder.

    Both datasets refer to images publicly available on Flickr. These images have a large variety of content, including sensitive content, seminude people, vehicle plates, documents, private events. Images were annotated with a binary label denoting if the content was deemed to be public or private. As the images are publicly available, their label is mostly public. These datasets have therefore a high imbalance towards the public class. Note that IPD combines two other existing datasets, PicAlert and part of VISPR, to increase the number of private images already limited in PicAlert. Further details in our corresponding publication: https://doi.org/10.48550/arXiv.2503.12464

    List of datasets and their original source:

    Notes:

    • For PicAlert and PrivacyAlert, only URLs to the original locations on Flickr are available in the Zenodo record
    • The collectors and authors of the PrivacyAlert dataset selected the images from Flickr under the Public Domain license
    • Owners of the photos on Flickr could have removed the photos from the social media platform
    • Running the bash scripts to download the images can result in the "429 Too Many Requests" status code

    Pre-computed visual entities

    Some of the models run their pipeline end-to-end with the images as input, whereas other models require different or additional inputs. These inputs include the pre-computed visual entities (scene types and object types) represented in a graph format, e.g. for a Graph Neural Network. Re-using these pre-computed visual entities allows other researchers to build new models based on these features while avoiding recomputing them on their own or for each epoch during the training of a model (faster training).

    For each image of each dataset, namely PrivacyAlert, PicAlert, and VISPR, we provide the predicted scene probabilities as a .csv file, the detected objects as a .json file in COCO data format, and the node features (visual entities already organised in graph format with their features) as a .json file. For consistency, all the files are already organised in batches, following the structure of the images in the datasets folder. For each dataset, we also provide the pre-computed adjacency matrix for the graph data.

    Note: IPD is based on PicAlert and VISPR and therefore IPD refers to the scene probabilities and object detections of the other two datasets. Both PicAlert and VISPR must be downloaded and prepared to use IPD for training and testing.

    Further details on downloading and organising data can be found in our GitHub repository: https://github.com/graphnex/privacy-from-visual-entities (see ARTIFACT-EVALUATION.md#pre-computed-visual-entitities-)
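
    A sketch of loading the pre-computed entities for one image, assuming hypothetical file names; the actual paths, file naming, and adjacency-matrix storage format are documented in the GitHub repository linked above:

```python
import json
import numpy as np
import pandas as pd

IMAGE_ID = "12345678"                  # hypothetical image identifier
BATCH_DIR = "privacyalert/batch_000"   # files are organised in batches per dataset

# Scene probabilities (.csv), object detections in COCO format (.json),
# and graph node features (.json); the exact file names may differ.
scene_probs = pd.read_csv(f"{BATCH_DIR}/{IMAGE_ID}_scenes.csv")
with open(f"{BATCH_DIR}/{IMAGE_ID}_objects.json") as f:
    detections = json.load(f)        # COCO-style annotation dictionary
with open(f"{BATCH_DIR}/{IMAGE_ID}_nodes.json") as f:
    node_features = json.load(f)     # visual entities already in graph format

# One adjacency matrix is shared per dataset; .npy storage is an assumption here.
adjacency = np.load("privacyalert/adjacency.npy")

print(scene_probs.head())
print(len(detections.get("annotations", [])), "detected objects")
print("adjacency matrix shape:", adjacency.shape)
```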

    Enquiries, questions and comments

    If you have any enquiries, questions, or comments, or you would like to file a bug report or a feature request, use the issue tracker of our GitHub repository.

  7. Data from: New feature subset selection procedures for classification of...

    • data.virginia.gov
    • catalog.data.gov
    html
    Updated Sep 6, 2025
    + more versions
    Cite
    National Institutes of Health (2025). New feature subset selection procedures for classification of expression profiles [Dataset]. https://data.virginia.gov/dataset/new-feature-subset-selection-procedures-for-classification-of-expression-profiles
    Explore at:
    html
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background

    Methods for extracting useful information from the datasets produced by microarray experiments are at present of much interest. Here we present new methods for finding gene sets that are well suited for distinguishing experiment classes, such as healthy versus diseased tissues. Our methods are based on evaluating genes in pairs and evaluating how well a pair in combination distinguishes two experiment classes. We tested the ability of our pair-based methods to select gene sets that generalize the differences between experiment classes and compared the performance relative to two standard methods. To assess the ability to generalize class differences, we studied how well the gene sets we select are suited for learning a classifier.

    Results

    We show that the gene sets selected by our methods outperform the standard methods, in some cases by a large margin, in terms of cross-validation prediction accuracy of the learned classifier. We show that on two public datasets, accurate diagnoses can be made using only 15-30 genes. Our results have implications for how to select marker genes and how many gene measurements are needed for diagnostic purposes.

    Conclusion

    When looking for differential expression between experiment classes, it may not be sufficient to look at each gene in a separate universe. Evaluating combinations of genes reveals interesting information that will not be discovered otherwise. Our results show that class prediction can be improved by taking advantage of this extra information.
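
    The sketch below illustrates the general idea of pair-based gene evaluation with cross-validated classification on a reduced gene set; it is not the authors' exact procedure, and the expression matrix is a random placeholder:

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))    # placeholder expression matrix: 60 samples x 200 genes
y = rng.integers(0, 2, size=60)   # placeholder class labels (e.g. healthy vs diseased)

def pair_score(i, j):
    """Cross-validated accuracy of a simple classifier using only genes i and j."""
    return cross_val_score(LogisticRegression(), X[:, [i, j]], y, cv=5).mean()

# Score every pair among the first 30 genes (kept small for illustration).
pairs = sorted(combinations(range(30), 2), key=lambda p: pair_score(*p), reverse=True)

# Collect genes from the best pairs until roughly 15-30 genes are selected.
selected = []
for i, j in pairs:
    for gene in (i, j):
        if gene not in selected:
            selected.append(gene)
    if len(selected) >= 20:
        break

acc = cross_val_score(LogisticRegression(), X[:, selected], y, cv=5).mean()
print(f"{len(selected)} genes selected, cross-validated accuracy {acc:.2f}")
```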
    
  8. SciTail (Multiple-choice science exams)

    • kaggle.com
    zip
    Updated Nov 29, 2022
    Cite
    The Devastator (2022). SciTail (Multiple-choice science exams) [Dataset]. https://www.kaggle.com/datasets/thedevastator/futuristic-natural-language-inference-with-the-s
    Explore at:
    zip (7959679 bytes)
    Dataset updated
    Nov 29, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    SciTail (Multiple-choice science exams)

    27,026 Multiple-choice science exams and web sentences

    By Huggingface Hub [source]

    About this dataset

    The Scitail dataset is your gateway to unlocking powerful and advanced Sci-Fi Natural Language Inference (NLI) algorithms. With data sourced from popular books, movies, and TV shows in the genre, this dataset gives you the opportunity to develop and train NLI algorithms capable of understanding complex sci-fi conversations. Containing seven distinct formats including training sets for both predictor format and datagem format as well as testing sets in tsv format and SNLI format - all containing the same fields but in varied structures - this is an essential resource for any scientist looking to explore the realm of sci-fi NLI! Train your algorithm today with Scitail; unlock a future of supercharged Sci-Fi language processing!


    How to use the dataset

    This guide will explain how to use the Scitail dataset for Natural Language Inference (NLI). NLI is a machine learning task which involves making predictions about a statement’s labels, such as entailment, contradiction, or neutral. The Scitail dataset contains sci-fi samples sourced from various sources such as books, movies and TV shows that can be used to train and evaluate NLI algorithms.

    The Scitail dataset is split into seven different formats: Dataset Gem format for testing and training, Predictor format for validation and training, .TSV format for testing and validation. Each of these formats contain the same data fields in different forms; including premise, hypothesis, label (entailment/contradiction/neutral), label assigned by annotators etc.

    To get started using this dataset we recommend downloading the datasets in whichever format you prefer from Kaggle. All files are stored as csv’s with each row representing a single data point in the form of premise-hypothesis pairs with labels assigned by annotators which indicate whether two statements entail one another or not.

    Once you have downloaded your preferred files, prepare them for training or evaluation by formatting them correctly for your algorithms. We suggest splitting your chosen file(s) into separate training and validation sets, making sure the selected samples are representative of real-world language: include examples that demonstrate a positive entailment relation, examples where no entailment holds between the two statements, and neutral examples where the premise does not provide enough evidence to decide. A minimal loading and splitting sketch follows.
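
    This sketch assumes the dgem_format_test.csv file and the column names listed under Columns below:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# File and column names taken from the Columns section of this listing.
df = pd.read_csv("dgem_format_test.csv")
print(df["label"].value_counts())   # entailment / neutral / contradiction distribution

# Premise-hypothesis pairs with labels, split for training and validation.
train_df, valid_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=0
)
print(len(train_df), "training pairs,", len(valid_df), "validation pairs")
print(train_df[["premise", "hypothesis", "label"]].head())
```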

    Research Ideas

    • Develop and fine-tune NLI algorithms with different levels of Sci-Fi language complexity.
    • Use the annotator labels to develop an automated human-in-the-loop approach to NLI algorithms.
    • Incorporate the hypothesis graph structure into existing models to improve accuracy and reduce error rates in identifying contextual comparisons between premises and hypotheses in Sci-Fi texts

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: dgem_format_test.csv

    | Column name | Description |
    |:------------|:------------|
    | premise | The premise of the statement (String). |
    | hypothesis | The hypothesis of the statement (String). |
    | label | The label of the statement – either entailment, neutral or contradiction (String). |
    | hypothesis_graph_structure | A graph structure of the hypothesis (Graph) |

    File: predictor_format_validation.csv

    | Column name | Description ...

  9. Replication Data for: Super-resolution reconstruction of scalar fields from...

    • darus.uni-stuttgart.de
    Updated Nov 14, 2025
    Cite
    Ali Shamooni (2025). Replication Data for: Super-resolution reconstruction of scalar fields from the pyrolysis of pulverised biomass using deep learning [Dataset]. http://doi.org/10.18419/DARUS-5519
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    DaRUS
    Authors
    Ali Shamooni
    License

    https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-5519

    Dataset funded by
    China Scholarship Council (CSC)
    DFG
    Helmholtz Association of German Research Centers (HGF)
    Description

    README

    Repository for publication: A. Shamooni et al., Super-resolution reconstruction of scalar fields from the pyrolysis of pulverised biomass using deep learning, Proc. Combust. Inst. (2025). Containing:

    torch_code
    The main PyTorch source code used for training/testing is provided in the torch_code.tar.gz file.

    torch_code_tradGAN
    To compare with a traditional GAN, we use the code in torch_code_tradGAN with similar particle-laden datasets. The source code is in the torch_code_tradGAN.tar.gz file.

    datasets
    The training/validation/testing datasets are provided in lmdb format, ready to use in the code. The datasets in datasets.tar.gz contain:
    Training dataset: data_train_OF-mass_kinematics_mk0x_1x_2x_FHIT_particle_128_Re52-2D_20736_lmdb.lmdb
    Test dataset: data_valid_inSample_OF-mass_kinematics_mk0x_1x_2x_FHIT_particle_128_Re52-2D_3456_lmdb.lmdb
    Note that the samples from 9 DNS cases are collected in order (each case: 2304 samples for training and 384 samples for testing), which can be recognized using the provided metadata file in each folder.
    Out-of-distribution test dataset (used in Fig 10 of the paper): data_valid_inSample_OF-mass_kinematics_mk3x_FHIT_particle_128_Re52-2D_nonUniform_1024_lmdb.lmdb. We have two separate OOD DNS cases and from each we select 512 samples.

    experiments
    The main trained models are provided in the experiments.tar.gz file. Each experiment contains the log file of the training, the last training state (for restart), and the model weights used in the publication.
    Trained model using the main dataset (used in Figs 2-10 of the paper): h_oldOrder_mk_700-11-c_PFT_Inp4TrZk_outTrZ_RRDBNetCBAM-4Prt_DcondPrtWav_f128g64b16_BS16x4_LrG45D5_DS-mk012-20k_LStandLog
    To compare with a traditional GAN, we use the code in torch_code_tradGAN with similar particle-laden datasets as above. The training consists of one pre-training step and two separate fine-tunings: one with the loss weights from the literature and one with tuned loss weights. The final results are in experiments/trad_GAN/experiments/.
    Pre-trained traditional GAN model (used in Figs 8-9 of the paper): train_RRDB_SRx4_particle_PSNR
    Fine-tuned traditional GAN model with loss weights from the literature (used in Figs 8-9 of the paper): train_ESRGAN_SRx4_particle_Nista_oneBlock
    Fine-tuned traditional GAN model with optimized loss weights (used in Figs 8-9 of the paper): train_ESRGAN_SRx4_particle_oneBlock_betaA

    inference_notebooks
    The inference_notebooks folder contains example notebooks for inference. The folder contains "torch_code_inference" and "torch_code_tradGAN_inference". "torch_code_inference" is the inference for the main trained model; "torch_code_tradGAN_inference" is the inference for the traditional GAN approach. Move the inference folders in each of these folders into the corresponding torch_code roots. Also create softlinks of datasets and experiments in the main torch_code roots. Note that in each notebook you must double-check the required paths to make sure they are set correctly.

    How to build the environment
    To build the environment required for training and inference you need Anaconda. Go to the torch_code folder and run: conda env create -f environment.yml
    Then create an IPython kernel for post-processing: conda activate torch_22_2025_Shamooni_PCI
    python -m ipykernel install --user --name ipyk_torch_22_2025_Shamooni_PCI --display-name "ipython kernel for post processing of PCI2025"

    Perform training
    It is suggested to create softlinks to the dataset folder directly in the torch_code folder: cd torch_code; ln -s datasets
    You can also simply move the datasets and inference folders into the torch_code folder beside the cfd_sr folder and other files. In general, we prefer to have a root structure as below:
    root files and directories: cfd_sr, datasets, experiments, inference, options, init.py, test.py, train.py, version.py
    Then activate the conda environment: conda activate torch_22_2025_Shamooni_PCI
    An example script to run on a single node with 2 GPUs: torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py -opt options/train/condSRGAN/use_h_mk_700-011_PFT.yml --launcher pytorch
    Make sure that the paths to the datasets ("dataroot_gt" and "meta_info_file") for both training and validation data in the option files are set correctly.
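
    Since the datasets are shipped as LMDB databases, a quick way to sanity-check an extracted archive from Python is sketched below; the key/value serialization is defined by the provided torch_code and is not assumed here:

```python
import lmdb

# Path to one of the extracted training/validation LMDB folders named above.
LMDB_PATH = ("datasets/data_train_OF-mass_kinematics_mk0x_1x_2x_FHIT_"
             "particle_128_Re52-2D_20736_lmdb.lmdb")

env = lmdb.open(LMDB_PATH, readonly=True, lock=False)
with env.begin() as txn:
    print("entries:", txn.stat()["entries"])
    cursor = txn.cursor()
    # Print a few keys; how the samples are serialized (the value format)
    # is handled by the provided source code, not reproduced here.
    for i, (key, _value) in enumerate(cursor):
        print(key.decode(errors="replace"))
        if i >= 4:
            break
env.close()
```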

  10. Data from: Performance of akaike information criterion and bayesian...

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Feb 26, 2023
    Cite
    Qin Liu; Michael Charleston; Shane Richards; Barbara Holland (2023). Performance of akaike information criterion and bayesian information criterion in selecting partition models and mixture models [Dataset]. http://doi.org/10.5061/dryad.1jwstqjwj
    Explore at:
    zip
    Dataset updated
    Feb 26, 2023
    Dataset provided by
    University of Tasmania
    Authors
    Qin Liu; Michael Charleston; Shane Richards; Barbara Holland
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    In molecular phylogenetics, partition models and mixture models provide different approaches to accommodating heterogeneity in genomic sequencing data. Both types of models generally give a better fit to data than models that assume the process of sequence evolution is homogeneous across sites and lineages. The Akaike Information Criterion (AIC), an estimator of Kullback-Leibler divergence, and the Bayesian Information Criterion (BIC) are popular tools to select models in phylogenetics. Recent work suggests AIC should not be used for comparing mixture and partition models. In this work, we clarify that this difficulty is not fully explained by AIC misestimating the Kullback-Leibler divergence. We also investigate the performance of the AIC and BIC by comparing amongst mixture models and amongst partition models. We find that under non-standard conditions (i.e. when some edges have a small expected number of changes), AIC underestimates the expected Kullback-Leibler divergence. Under such conditions, AIC preferred the complex mixture models and BIC preferred the simpler mixture models. The mixture models selected by AIC had a better performance in estimating the edge length, while the simpler models selected by BIC performed better in estimating the base frequencies and substitution rate parameters. In contrast, AIC and BIC both prefer simpler partition models over more complex partition models under non-standard conditions, despite the fact that the more complex partition model was the generating model. We also investigated how mispartitioning (i.e. grouping sites that have not evolved under the same process) affects both the performance of partition models compared to mixture models and the model selection process. We found that as the level of mispartitioning increases, the bias of AIC in estimating the expected Kullback-Leibler divergence remains the same, and the branch lengths and evolutionary parameters estimated by partition models become less accurate. We recommend that researchers be cautious when using AIC and BIC to select among partition and mixture models; other alternatives, such as cross-validation and bootstrapping, should be explored, but may suffer similar limitations.

    Methods

    This document records the pipeline used in data analyses in "Performance of Akaike Information Criterion and Bayesian Information Criterion in selecting partition models and mixture models". The main processes included generating alignments, fitting four different partition and mixture models, and analysing results. The data were generated under Seq-Gen-1.3.4 (Rambaut and Grassly 1997). The model fitting was performed in IQ-TREE2 (Minh et al. 2020) on a Linux system. The results were analysed using the R package phangorn in R (version 3.6.2) (Schliep 2011, R Core Team 2019). We wrote custom bash scripts to extract relevant parts of the results from IQ-TREE2, and these results were processed in R. The zip files contain four folders: "bash-scripts", "data", "R-codes", and "results-IQTREE2". The bash-scripts folder contains all the bash scripts for simulating alignments and performing model fitting. The "data" folder contains two child folders: "sequence-data" and "Rdata". The child folder "sequence-data" contains the alignments created for the simulations. The other child folder, "Rdata", contains the files created by R to store the results extracted from "IQTREE2" and the results calculated in R. The "R-codes" folder includes the R codes for analysing the results from "IQTREE2". The folder "results-IQTREE2" stores all the results from the fitted models. The three simulations we performed were essentially the same. We used the same parameters of the evolutionary models, and the trees with the same topologies but different edge lengths to generate the sequences. The steps we used were: simulating alignments, model fitting and extracting results, and processing the extracted results. The first two steps were performed on a Linux system using bash scripts, and the last step was performed in R.

    Simulating Alignment

    To simulate heterogeneous data we created two multiple sequence alignments (MSAs) under simple homogeneous models, with each model comprising a substitution model and an edge-weighted phylogenetic tree (the tree topology was fixed). Each MSA contained eight taxa and 1000 sites. This was performed using the bash script "step1_seqgen_data.sh" in Linux. These two MSAs were then concatenated together, giving an MSA with 2000 sites. This was equivalent to generating the concatenated MSA under a two-block unlinked edge lengths partition model (P-UEL). This was performed using the bash script "step2_concat_data.sh". This created the 0% group of MSAs. In order to simulate a situation where the initial choice of blocks does not properly account for the heterogeneity in the concatenated MSA (i.e., mispartitioning), we randomly selected a proportion of 0%, 5%, 10%, 15%, …, up to 50% of sites from each block and swapped them. That is, the sites drawn from the first block were placed in the second block, and the sites drawn from the second block were placed in the first block. This process was repeated 100 times for each proportion of mispartitioned sites, giving a total of 1100 MSAs. This process involved two steps. The first step was to generate ten sets of different amounts of numbers without duplicates from each of the two intervals [1,1000] and [1001,2000]. The amounts of numbers were based on the proportions of incorrectly partitioned sites. For example, the first set has 50 numbers on each interval, and the second set has 100 numbers on each interval, etc. The first step was performed in R; the R code is not provided but the random number text files are included. The second step was to select sites from the concatenated MSAs at the locations given by the numbers created in the first step. This created the 5%, 10%, 15%, …, 50% groups of MSAs. The second step used the following bash scripts: "step3_1_mixmatch_pre_data.sh" and "step3_2_mixmatch_data.sh". The MSAs used in the simulations were created and stored in the "data" folder.

    Model Fitting and Extracting Results

    The next steps were to fit four different partition and mixture models to the data in IQ-TREE2 and extract the results. The models used were the P-LEL partition model, P-UEL partition model, M-UGP mixture model, and M-LGP mixture model. For the partition models, the partitioning schemes were the same: the first 1000 sites as one block and the second 1000 sites as another. For the groups of MSAs with different proportions of mispartitioned sites, this was equivalent to fitting the partition models with an incorrect partitioning scheme. The partitioning scheme was called "parscheme.nex". The bash scripts for model fitting were stored in the "bash-scripts" folder. To run the bash scripts, users can follow the order shown in the names of these bash scripts. The inferred trees, estimated base frequencies, estimated rate matrices, estimated weight factors, AIC values, and BIC values were extracted from the IQTREE2 results. These extracted results were stored in the "results-IQTREE2" folder and used to evaluate the performance of AIC, BIC, and the models in R.

    Processing Extracted Results in R

    To evaluate the performance of AIC and BIC, and the performance of the fitted partition models and mixture models, we calculated the following measures: the rEKL values, the bias of AIC in estimating the rEKL, BIC values, and the branch scores (bs). We also compared the distribution of the estimated model parameters (i.e. base frequencies and rate matrices) to the generating model parameters. These processes were performed in R. The first step was to read in the inferred trees, estimated base frequencies, estimated rate matrices, estimated weight factors, AIC values, and BIC values that were extracted from the IQTREE2 results. These R scripts were stored in the "R-codes" folder, and the names of these scripts start with "readpara_..." (e.g. "readpara_MLGP_standard"). After reading in all the parameters for each model, we estimated the measures mentioned above using the corresponding R scripts, also in the "R-codes" folder. The functions used in these R scripts were stored in "R_functions_simulation". It is worth noting that the directories need to be changed if users want to run these R scripts on their own computers.

  11. Landsat Collection 2 temporal cloud truth mask validation set

    • data.usgs.gov
    Cite
    Pat Scaramuzza, Landsat Collection 2 temporal cloud truth mask validation set [Dataset]. http://doi.org/10.5066/P138N3ZU
    Explore at:
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Pat Scaramuzza
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Time period covered
    Feb 1, 1974 - Jan 1, 2024
    Description

    The U.S. Geological Survey (USGS) Earth Resources Observation and Science (EROS) Center in Sioux Falls, SD has developed a cloud validation dataset from Collection 2 images throughout the history of Landsat. Two North American locations with high overlap between WRS-1 and WRS-2 were chosen. For each location, 20 images were selected at random from the Landsat archive, with at least one scene taken from each Landsat satellite between the years of 1972-2024. This provides a sampling of the 50-year history of Landsat data over the two chosen locations -- New Brunswick and Tucson, AZ. It is intended that more locations will be added to this dataset in the future. For each scene, a manual cloud validation mask was created. While these validation images were subjectively designed by a single analyst, they provide useful information for quantifying the accuracy of clouds flagged by various cloud masking algorithms. Each mask is provided in GeoTIFF format, and includes all bands from ...

  12. Data_Sheet_5_Prediction model of acute kidney injury after different types...

    • figshare.com
    txt
    Updated Jun 16, 2023
    + more versions
    Cite
    Li Xinsai; Wang Zhengye; Huang Xuan; Chu Xueqian; Peng Kai; Chen Sisi; Jiang Xuyan; Li Suhua (2023). Data_Sheet_5_Prediction model of acute kidney injury after different types of acute aortic dissection based on machine learning.CSV [Dataset]. http://doi.org/10.3389/fcvm.2022.984772.s005
    Explore at:
    txt
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    Frontiers
    Authors
    Li Xinsai; Wang Zhengye; Huang Xuan; Chu Xueqian; Peng Kai; Chen Sisi; Jiang Xuyan; Li Suhua
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: A clinical prediction model for postoperative acute kidney injury (AKI) in patients with Type A acute aortic dissection (TAAAD) and Type B acute aortic dissection (TBAAD) was constructed by using Machine Learning (ML).

    Methods: Baseline data was collected from acute aortic dissection (AAD) patients admitted to the First Affiliated Hospital of Xinjiang Medical University between January 1, 2019 and December 31, 2021. (1) We identified baseline Serum creatinine (SCR) estimation methods and used them as a basis for diagnosis of AKI. (2) The total dataset was randomly divided into a Training set (70%) and a Test set (30%); features were modeled and validated with bootstrapping using multiple ML methods in the training set, and the models corresponding to the largest Area Under Curve (AUC) were selected for follow-up studies. (3) The best ML model variables were screened through the model visualization tools Shapley Additive Explanations (SHAP) and Recursive feature reduction (REF). (4) Finally, the pre-screened prediction models were evaluated using test set data from three aspects: discrimination, calibration, and clinical benefit.

    Results: The final incidence of AKI was 69.4% (120/173) in 173 patients with TAAAD and 28.6% (81/283) in 283 patients with TBAAD. For TAAAD-AKI, the Random Forest (RF) model showed the best prediction performance in the training set (AUC = 0.760, 95% CI: 0.630–0.881); while for TBAAD-AKI, the Light Gradient Boosting Machine (LightGBM) model worked best (AUC = 0.734, 95% CI: 0.623–0.847). Screening of the characteristic variables revealed that the common predictors among the two final prediction models for postoperative AKI due to AAD were baseline SCR, Blood urea nitrogen (BUN) and Uric acid (UA) at admission, and Mechanical ventilation time (MVT). The specific predictors in the TAAAD-AKI model are: White blood cell (WBC), Platelet (PLT) and D dimer at admission, Plasma. The specific predictors in the TBAAD-AKI model were N-terminal pro B-type natriuretic peptide (BNP), Serum kalium, Activated partial thromboplastin time (APTT) and Systolic blood pressure (SBP) at admission, and Combined renal arteriography in surgery. In terms of discrimination, the ROC value of the RF model for TAAAD was 0.81 and the ROC value of the LightGBM model for TBAAD was 0.74, both with good accuracy. In terms of calibration, the calibration curve of TAAAD-AKI's RF model fits the ideal curve the best and has the smallest Brier score (0.16). Similarly, the calibration curve of TBAAD-AKI's LightGBM model fits the ideal curve the best and has the smallest Brier score (0.15). In terms of clinical benefit, the best ML models for both types of AAD show good net benefit in Decision Curve Analysis (DCA).

    Conclusion: We successfully constructed and validated clinical prediction models for the occurrence of AKI after surgery in TAAAD and TBAAD patients using different ML algorithms. The main predictors of the two types of AAD-AKI are somewhat different, and the strategies for early prevention and control of AKI are also different and need more external data for validation.
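
    A compact sketch of the general workflow described above (70/30 split, a random forest, and discrimination/calibration metrics); it omits the bootstrap feature validation, SHAP/REF screening, and decision curve analysis, and uses placeholder data rather than the study's clinical variables:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(173, 10))    # placeholder features (e.g. SCR, BUN, UA, MVT, ...)
y = rng.integers(0, 2, size=173)  # placeholder AKI outcome labels

# 70/30 split into training and test sets, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
prob = model.predict_proba(X_test)[:, 1]

# Discrimination (AUC) and calibration (Brier score) on the held-out set.
print("AUC:", round(roc_auc_score(y_test, prob), 3))
print("Brier score:", round(brier_score_loss(y_test, prob), 3))
```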

  13. Data from: Crowd and community sourcing to update authoritative LULC data in...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jul 22, 2024
    Cite
    Olteanu-Raimond, Ana-Maria; Van Damme, Marie-Dominique; Marcuzzi, Julie; Sturn, Tobias; Fraval, Ludovic; Gombert, Marie; Jolivet, Laurence; See, Linda; Royer, Timothé; Fauret, Simon (2024). Crowd and community sourcing to update authoritative LULC data in urban areas [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3691826
    Explore at:
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    National mapping agency
    International Institute for Applied Systems Analysis
    Institut national de l'information géographique et forestière
    Authors
    Olteanu-Raimond, Ana-Maria; Van Damme, Marie-Dominique; Marcuzzi, Julie; Sturn, Tobias; Fraval, Ludovic; Gombert, Marie; Jolivet, Laurence; See, Linda; Royer, Timothé; Fauret, Simon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The French National Mapping Agency (Institut National de l'Information Géographique et Forestière - IGN) is responsible for producing and maintaining the spatial data sets for all of France. At the same time, it must satisfy the needs of different stakeholders who are responsible for decisions at multiple levels, from local to national. IGN produces many different maps, including detailed road networks and land cover/land use maps over time. The information contained in these maps is crucial for many of the decisions made about urban planning, resource management and landscape restoration, as well as other environmental issues in France. Recently, IGN has started the process of creating high-resolution land use/land cover (LULC) maps, aimed at developing smart and accurate monitoring services of LULC over time. To help update and validate the French LULC database, citizens and interested stakeholders can contribute using the Paysages mobile and web applications. This approach presents an opportunity to evaluate the integration of citizens in the IGN process of updating and validating LULC data.

    Dataset 1: Change detection validation 2019

    This dataset contains web-based validations of changes detected by time series (2016 – 2019) analysis of Sentinel-2 satellite imagery. Validation was conducted using two high resolution orthophotos from respectively 2016 and 2019 as reference data. Two tools have been used: Paysages web application and LACO-Wiki. Both tools used the same validation design: blind validation and the same options. For each detected change, contributors are asked to validate if there is a change and if it is the case then to choose a LU or LC class from a pre-defined list of classes.

    The dataset has the following characteristics:

    Time period of the change detection: 2016-2019.

    Time period of data collection: February 2019-December 2019

    Total number of contributors: 105

    Number of validated changes: 1048; each change was validated by 1 to 6 contributors.

    Region of interest: Toulouse and surrounding areas

    Associated files: 1- Change validation locations.png, 1-Change validation 2019 – Attributes.csv, 1-Change validation 2019.csv, 1-Change validation 2019.geoJSON

    This dataset is licensed under a Creative Commons Attribution 4.0 International. It is attributed to the LandSense Citizen Observatory, IGN-France, and GeoVille.
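    As a quick-start illustration, the sketch below loads the Dataset 1 geoJSON named in the file list above with geopandas and summarizes how many contributors validated each detected change. The column names change_id and is_change are assumptions for illustration only; the real schema is documented in the accompanying attributes CSV.

```python
# Minimal sketch for exploring Dataset 1 (change detection validation).
# File name taken from the list above; column names are assumed, not the actual schema.
import geopandas as gpd

gdf = gpd.read_file("1-Change validation 2019.geoJSON")
print(gdf.crs, len(gdf))

# Number of independent validations per detected change (1 to 6 per the description)
per_change = gdf.groupby("change_id").size()
print(per_change.describe())

# Share of locations confirmed as a real change
print(gdf["is_change"].value_counts(normalize=True))
```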

    Dataset 2: Land use classification 2019

    The aim of this data collection campaign was to improve the LU classification of authoritative LULC data (OCS-GE 2016 ©IGN) for built-up area. Using the Paysages web platform, contributors are asked to choose a land use value among a list of pre-defined values for each location.

    The dataset has the following characteristics:

    Time period of data collection: August 2019

    Types of contributors: Surveyors from the production department of IGN

    Total number of contributors: 5

    Total number of observations: 2711

    Data specifications of the OCS-GE ©IGN

    Region of interest: Toulouse and surrounding areas

    Associated files: 2- LU classification points.png, 2-LU classification 2019 – Attributes.csv, 2-LU classification 2019.csv, 2-LU classification 2019.geoJSON

    This dataset is licensed under a Creative Commons Attribution 4.0 International. It is attributed to the LandSense Citizen Observatory, IGN-France and the International Institute for Applied Systems Analysis.

    Dataset 3: In-situ validation 2018

    The aim of this data collection campaign was to collect in-situ (ground-based) information, using the Paysages mobile application, to update authoritative LULC data. Contributors visit pre-determined locations, take photographs of the point location and in the four cardinal directions away from the point, and answer a few questions with respect to the task. Two tasks were defined:

    Classify the point by choosing a LU class between three classes: industrial (US2), commercial (US3) or residential (US5).

    Validate changes detected by the LandSense Change Detection Service: for each new detected change, the contributor was requested to validate the change and choose a LU and LC class from a pre-defined list of classes.

    The dataset has the following characteristics:

    Time period of data collection: June 2018 – October 2018

    Types of contributors: students from the School of Agricultural and Life Sciences and citizens

    Total number of contributors: 26

    Total number of observations: 281

    Total number of photos: 421

    Region of interest: Toulouse and surrounding areas

    Associated files: 3- Insitu locations.png, 3- Insitu validation 2018 – Attributes.csv, 3- Insitu validation 2018.csv, 3- Insitu validation 2018.geoJSON

    This dataset is licensed under a Creative Commons Attribution 4.0 International. It is attributed to the LandSense Citizen Observatory, IGN-France.

    This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no 689812.

  14. DataSheet2_In silico prediction of siRNA ionizable-lipid nanoparticles In...

    • frontiersin.figshare.com
    pdf
    Updated Jun 21, 2023
    + more versions
    Abdelkader A. Metwally; Amira A. Nayel; Rania M. Hathout (2023). DataSheet2_In silico prediction of siRNA ionizable-lipid nanoparticles In vivo efficacy: Machine learning modeling based on formulation and molecular descriptors.pdf [Dataset]. http://doi.org/10.3389/fmolb.2022.1042720.s002
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    Frontiers
    Authors
    Abdelkader A. Metwally; Amira A. Nayel; Rania M. Hathout
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In silico prediction of the in vivo efficacy of siRNA ionizable-lipid nanoparticles is desirable as it can save time and resources dedicated to wet-lab experimentation. This study aims to computationally predict siRNA nanoparticles' in vivo efficacy. A data set containing 120 entries was prepared by combining molecular descriptors of the ionizable lipids together with two nanoparticle formulation characteristics. Input descriptor combinations were selected by an evolutionary algorithm. Artificial neural networks, support vector machines and partial least squares regression were used for QSAR modeling. Depending on how the data set is split, two training sets and two external validation sets were prepared. Training and validation sets contained 90 and 30 entries respectively. The results showed successful prediction of the validation set log (siRNA dose), with Rval2 = 0.86–0.89 and 0.75–0.80 for validation sets one and two, respectively. Artificial neural networks resulted in the best Rval2 for both validation sets. For predictions that have high bias, improvement of Rval2 from 0.47 to 0.96 was achieved by selecting the training set lipids lying within the applicability domain. In conclusion, the in vivo performance of siRNA nanoparticles was successfully predicted by combining cheminformatics with machine learning techniques.
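    A minimal sketch of this kind of QSAR comparison is shown below, assuming a descriptor matrix X and a target y = log(siRNA dose). The synthetic data, descriptor count and hyperparameters are placeholders, not the study's actual settings.

```python
# Minimal sketch of an ANN / SVM / PLS QSAR comparison on synthetic data.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))            # 120 entries; descriptor count is assumed
y = X @ rng.normal(size=6) + rng.normal(scale=0.1, size=120)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=30, random_state=0)

models = {
    "ANN": make_pipeline(StandardScaler(), MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)),
    "SVM": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    "PLS": PLSRegression(n_components=3),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # PLS returns a 2-D prediction array, hence the ravel
    print(name, "R2_val =", round(r2_score(y_val, np.ravel(model.predict(X_val))), 3))
```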

  15. Results of applying optimized machine learning approach for multi-tasks...

    • plos.figshare.com
    xls
    Updated Jun 17, 2023
    Ahmadreza Keihani; Amin Mohammad Mohammadi; Hengameh Marzbani; Shahriar Nafissi; Mohsen Reza Haidari; Amir Homayoun Jafari (2023). Results of applying optimized machine learning approach for multi-tasks classification. [Dataset]. http://doi.org/10.1371/journal.pone.0270757.t003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Ahmadreza Keihani; Amin Mohammad Mohammadi; Hengameh Marzbani; Shahriar Nafissi; Mohsen Reza Haidari; Amir Homayoun Jafari
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results of applying optimized machine learning approach for multi-tasks classification.

  16. HelpSteer: AI Alignment Dataset

    • kaggle.com
    zip
    Updated Nov 22, 2023
    The Devastator (2023). HelpSteer: AI Alignment Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/helpsteer-ai-alignment-dataset
    Explore at:
    zip(16614333 bytes)Available download formats
    Dataset updated
    Nov 22, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    HelpSteer: AI Alignment Dataset

    Real-World Helpfulness Annotated for AI Alignment

    By Huggingface Hub [source]

    About this dataset

    HelpSteer is an Open-Source dataset designed to empower AI Alignment through the support of fair, team-oriented annotation. The dataset provides 37,120 samples each containing a prompt and response along with five human-annotated attributes ranging between 0 and 4; with higher results indicating better quality. Using cutting-edge methods in machine learning and natural language processing in combination with the annotation of data experts, HelpSteer strives to create a set of standardized values that can be used to measure alignment between human and machine interactions. With comprehensive datasets providing responses rated for correctness, coherence, complexity, helpfulness and verbosity, HelpSteer sets out to assist organizations in fostering reliable AI models which ensure more accurate results thereby leading towards improved user experience at all levels

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    How to Use HelpSteer: An Open-Source AI Alignment Dataset

    HelpSteer is an open-source dataset designed to help researchers create models with AI Alignment. The dataset consists of 37,120 different samples each containing a prompt, a response and five human-annotated attributes used to measure these responses. This guide will give you a step-by-step introduction on how to leverage HelpSteer for your own projects.

    Step 1 - Choosing the Data File

    HelpSteer contains two data files, one for training and one for validation. To start exploring the dataset, first select the file you would like to use by downloading both train.csv and validation.csv from the Kaggle page linked above, or by getting them from the Google Drive repository attached here: [link]. All samples in each file consist of 7 columns with information about a single response: prompt (given), response (submitted), helpfulness, correctness, coherence, complexity and verbosity; the five attributes all take values between 0 and 4, where higher means better in the respective category.

    Step 2 - Exploratory Data Analysis (EDA)

    Once you have your file loaded into your workspace or favorite software environment (e.g. libraries such as Pandas/NumPy, or even Microsoft Excel), it's time to explore it further by running some basic EDA commands that summarize each feature's distribution within the data set, and by noting potential trends or points of interest. For example: which traits polarize the responses the most? Are there any outliers that might signal something interesting? Plotting these results often provides insight into patterns across the dataset that can be used later in the modeling phase, also known as feature engineering.
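    A minimal sketch of Steps 1 and 2 is shown below; it assumes train.csv and validation.csv are in the working directory and uses the column names listed in Step 1.

```python
# Minimal sketch: load the HelpSteer CSVs and run basic EDA on the rating attributes.
import pandas as pd

train = pd.read_csv("train.csv")
val = pd.read_csv("validation.csv")

attrs = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]
print(train[attrs].describe())   # distribution of the 0-4 ratings
print(train[attrs].corr())       # e.g. how helpfulness relates to verbosity

# Flag potential outliers such as long but unhelpful responses
train["response_len"] = train["response"].str.len()
print(train.sort_values("response_len", ascending=False)[["response_len", "helpfulness"]].head())
```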

    Step 3 - Data Preprocessing

    Your interpretation of the raw data during EDA should produce some hypotheses about which features matter most for accurately estimating the attribute scores of unseen responses. Preprocessing, such as cleaning up missing entries or handling outliers, is therefore highly recommended before starting any modelling effort with this data set. Refer back to the Kaggle page description if you are unsure about the allowed value ranges of specific attributes; having the correct numerical ranges at hand makes the modelling workload lighter when building predictive models. It is important not to rush this stage, otherwise poor results may appear later when aiming for high accuracy at model deployment.

    Research Ideas

    • Designating and measuring conversational AI engagement goals: Researchers can utilize the HelpSteer dataset to design evaluation metrics for AI engagement systems.
    • Identifying conversational trends: By analyzing the annotations and data in HelpSteer, organizations can gain insights into what makes conversations more helpful, cohesive, complex or consistent across datasets or audiences.
    • Training Virtual Assistants: Train artificial intelligence algorithms on this dataset to develop virtual assistants that respond effectively to customer queries with helpful answers

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication - https://creativecommons.org/pu...

  17. Data from: WILLOW - Norther: data set for the full-scale validation of...

    • data.europa.eu
    • zenodo.org
    unknown
    Updated Nov 13, 2024
    + more versions
    Zenodo (2024). WILLOW - Norther: data set for the full-scale validation of model-based virtual sensing methods for an operational offshore wind turbine [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-11093262?locale=lt
    Explore at:
    unknown(13778)Available download formats
    Dataset updated
    Nov 13, 2024
    Dataset authored and provided by
    Zenodohttp://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. General description

    This data set contains as-built design information, as well as full-scale vibration response measurements from an operational offshore wind turbine. The turbine is part of the Norther wind farm, which is located in the Belgian North Sea and includes a total of 44 Vestas V164 (8.4 MW) wind turbines on monopile foundations (see Fig1_Norther_locaction.png). This data set is intended to verify and validate model-based virtual sensing algorithms, using data as well as modeling information from a real turbine.

    1.1 Summary of the shared structural information

    The included information entails a detailed description of the geometric properties of the monopile and transition piece, and of the distributed and lumped structural masses. All information shared in this record conforms to the as-designed documentation. An example of the lumped masses considered in the model input files is presented in "Fig2_Sensor_Network.png".

    1.2 Summary of the shared geotechnical information

    Monopiles are distinguished by the significant role of soil-structure interaction. Ground reaction is most typically included in the structural model as non-linear p-y curves. Different p-y curves are available for a certain number of soils in the standards applicable to offshore structures (API RP 2GEO, 2011, and ISO 19901-4:2016(E), 2016). The soil properties required to define p-y curves according to the API framework are given in the soil profile provided in a separate Excel file. Rather than symbols, the names of the soil properties are generally used as column headers (e.g., undrained shear strength), so it is straightforward to identify each soil parameter. The only soil parameter that might lead to confusion is "epsilon50 [-]", which represents the vertical strain at half the maximum principal stress difference in a static undrained triaxial compression test on an undisturbed soil sample. It is worth noting that estimates for the small shear strain stiffness, referred to as Gmax, are also included. Despite not being required as an input to define the API p-y curves, this parameter remains a key input for soil reaction frameworks other than the API (e.g., PISA).

    1.3 Summary of the shared measurement data

    Two sets of measurement data have been curated for validation purposes; the first interval was collected during parked conditions, whereas the second interval was collected during rated operational conditions. Both records have a length of 2 hours and are subdivided into 10-minute data sets. Furthermore, 1 Hz SCADA data has been made available for the selected intervals. All data sources are time synchronized and have been subjected to several internal quality checks. The sensor network on NRT-WTG is illustrated in Fig. 2, whereas a description of the sensor types is presented in Tab. 1. The acceleration sensors are installed in the horizontal plane and measure tangential (Y) and orthogonal (X) to the wall, where the positive Y direction points clockwise and the positive X direction points inwards. All strain sensors are installed vertically and are located on the inside of the wall.

    Table 1. Description of sensor types.

    | Data type | Sensor type | Fs (Hz) | Level mLAT (m) | Description |
    |---|---|---|---|---|
    | Acceleration (g) | Piezo-electric acc. sensor (ACC) | 30 | 15, 69, 97 | 3 bi-directional accelerometers at different levels. LAT 15 installed at 240 degree heading; LAT 69 and 97 at 60 degree. |
    | Strain (micro strain) | Resistive strain gauge (SG) | 30 | 14 | 6 SGs, equally spaced around the inner circumference of the can. Headings: 50, 110, 170, 230, 290, 350 degree. |
    | Strain (micro strain) | Fiber-Bragg Grating strain gauge (FBG) | 100 | -17, -19 | 2 FBGs per level, at 165 and 255 degree respectively. |

    The FBG strain time series have been synchronized with the SG time series using a cross-correlation based approach: the SG data has been used to generate reference strain time series at the headings of the FBG sensors, and the FBG data is subsequently synchronized with respect to this reference time series. No synchronization of the acceleration data was needed, since it is collected using the same data acquisition system as the SG data. The SG strain time series have been calibrated and temperature compensated, whereas this is not the case for the FBG strain time series; the latter have a yet to be determined calibration offset.

    In conjunction with the sensor channels presented in Tab. 1, 1 Hz SCADA data is provided. A summary of the provided SCADA parameters, all sampled at 1 Hz, is presented in Tab. 2.

    Table 2. Provided SCADA parameters.

    | Parameter | Unit | Description |
    |---|---|---|
    | Wind speed | m/s | Wind speed as recorded in the turbine SCADA |
    | Wind direction | ° | Wind direction relative to North (0°) as recorded in the turbine SCADA |
    | Yaw angle | ° | Yaw orientation of the nacelle relative to North (0°) as recorded in the turbine SCADA |
    | Pitch angle | ° | Rotor blade pitch as recorded in the turbine SCADA |
    | Rotor speed | rpm | Rotor speed in rotations per minute as recorded in the turbine SCADA |
    | Power | kW | Active power o |
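    The cross-correlation alignment mentioned above can be sketched as follows; the synthetic signals, the 30 Hz working rate and the shift-recovery step are illustrative assumptions, not the data set's actual processing chain.

```python
# Minimal sketch of cross-correlation lag estimation between an FBG strain series
# and a reference series derived from the SG sensors, on synthetic broadband data.
import numpy as np
from scipy import signal

fs = 30.0                                        # working rate, matching the SG/ACC data (assumed)
t = np.arange(0, 600, 1 / fs)                    # a 10-minute record
rng = np.random.default_rng(0)
reference = rng.standard_normal(t.size)          # stand-in for the SG-derived reference strain
true_lag = 37                                    # samples; unknown in practice
fbg = np.roll(reference, true_lag) + 0.05 * rng.standard_normal(t.size)

# Cross-correlate and take the lag with the highest correlation
corr = signal.correlate(fbg, reference, mode="full")
lags = signal.correlation_lags(fbg.size, reference.size, mode="full")
lag = lags[np.argmax(corr)]
print("estimated lag [samples]:", lag, "=", lag / fs, "s")

fbg_aligned = np.roll(fbg, -lag)                 # shift the FBG series back onto the reference
```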
  18. Data from: Deforestation maps using time series of Sentinel-2A images in...

    • service.tib.eu
    Updated Nov 29, 2024
    + more versions
    (2024). Deforestation maps using time series of Sentinel-2A images in Amazonia, between Brazil and Bolivia, in 2019 [Dataset]. https://service.tib.eu/ldmservice/dataset/png-doi-10-1594-pangaea-921387
    Explore at:
    Dataset updated
    Nov 29, 2024
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    Brazil, Bolivia
    Description

    This data set includes deforestation maps located on the border between the west of Brazil and the north of Bolivia (corresponding to Sentinel-2 tile 20LKP). The source images for this dataset came from ESA's Sentinel-2A satellite. They were processed from top-of-atmosphere to surface reflectance using the Sen2Cor 2.8 software, and clouds were masked using the Fmask 4.0 algorithm. The K-Fold technique was used to select the best Random Forest (RF) model among different combinations of Sentinel-2A bands and vegetation indices. The RF models were trained using the time series of the 481 samples included in this data set. The two selected models that presented the highest median F1 score for the Deforestation class were: 1) the combination of the blue, bnir, green, nnir, red, swir1, and swir2 bands (hereafter Bands); and 2) the combination of the Enhanced Vegetation Index, Normalized Difference Moisture Index, and Normalized Difference Vegetation Index (hereafter Indices). Each RF model produced a deforestation map. During training, we used RF models of 1000 trees and the full depth of the Sentinel-2A time series, comprising 36 observations ranging from August 2018 to July 2019. To assess the maps' accuracy, good practices were followed [1]. To determine the validation data set size (n), the user accuracy was conjectured using a bootstrapping technique. Two validation data sets (n = 252) were collected independently to assess the maps' accuracy. For Deforestation, the Bands classification model has the highest F1 score (93.1%) compared with the Indices model (91.9%). The Forest and Other classes had better F1 scores using the Indices (85.8% and 82.2%, respectively) than using the Bands (85.3% and 78.7%, respectively). Our classifications have an overall accuracy of 88.9% for Bands and 84.9% for Indices, with the following user's (UA) and producer's (PA) accuracies for the models.

    Accuracy of classification using Bands:
    Deforestation: UA 97.4%, PA 89.2%
    Forest: UA 80.8%, PA 90.4%
    Other: UA 80.2%, PA 77.3%
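    A minimal sketch of this model-selection step is shown below, assuming flattened per-sample time-series features. The synthetic arrays and the 5-fold setup are placeholders; only the 1000-tree RF and the F1 scoring on the Deforestation class mirror the description.

```python
# Minimal sketch: compare "Bands" vs "Indices" feature sets with K-fold CV,
# scoring F1 on the Deforestation class, as in the description above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
y = rng.choice(["Deforestation", "Forest", "Other"], size=481)
feature_sets = {
    "Bands":   rng.normal(size=(481, 36 * 7)),   # blue, bnir, green, nnir, red, swir1, swir2 over 36 dates
    "Indices": rng.normal(size=(481, 36 * 3)),   # EVI, NDMI, NDVI over 36 dates
}

# Per-class F1 restricted to the Deforestation label
defor_f1 = make_scorer(f1_score, labels=["Deforestation"], average="macro")
for name, X in feature_sets.items():
    rf = RandomForestClassifier(n_estimators=1000, random_state=0)
    scores = cross_val_score(rf, X, y, cv=5, scoring=defor_f1)
    print(name, "median F1 (Deforestation):", round(float(np.median(scores)), 3))
```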

  19. Comparative Validation of Conventional and RNA-Seq Data-Derived Reference...

    • plos.figshare.com
    tiff
    Updated May 31, 2023
    Ana Vieira; Ana Cabral; Joana Fino; Helena G. Azinheira; Andreia Loureiro; Pedro Talhinhas; Ana Sofia Pires; Vitor Varzea; Pilar Moncada; Helena Oliveira; Maria do Céu Silva; Octávio S. Paulo; Dora Batista (2023). Comparative Validation of Conventional and RNA-Seq Data-Derived Reference Genes for qPCR Expression Studies of Colletotrichum kahawae [Dataset]. http://doi.org/10.1371/journal.pone.0150651
    Explore at:
    tiffAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Ana Vieira; Ana Cabral; Joana Fino; Helena G. Azinheira; Andreia Loureiro; Pedro Talhinhas; Ana Sofia Pires; Vitor Varzea; Pilar Moncada; Helena Oliveira; Maria do Céu Silva; Octávio S. Paulo; Dora Batista
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Colletotrichum kahawae is an emergent fungal pathogen causing severe epidemics of Coffee Berry Disease on Arabica coffee crops in Africa. Currently, the molecular mechanisms underlying the Coffea arabica—C. kahawae interaction are still poorly understood, as well as the differences in pathogen aggressiveness, which makes the development of functional studies for this pathosystem a crucial step. Quantitative real time PCR (qPCR) has been one of the most promising approaches to perform gene expression analyses. However, proper data normalization with suitable reference genes is an absolute requirement. In this study, a set of 8 candidate reference genes were selected based on two different approaches (literature and Illumina RNA-seq datasets) to assess the best normalization factor for qPCR expression analysis of C. kahawae samples. The gene expression stability of candidate reference genes was evaluated for four isolates of C. kahawae bearing different aggressiveness patterns (Ang29, Ang67, Zim12 and Que2), at different stages of fungal development and key time points of the plant-fungus interaction process. Gene expression stability was assessed using the pairwise method incorporated in geNorm and the model-based method used by NormFinder software. For C. arabica—C. kahawae interaction samples, the best normalization factor included the combination of PP1, Act and ck34620 genes, while for C. kahawae samples the combination of PP1, Act and ck20430 revealed to be the most appropriate choice. These results suggest that RNA-seq analyses can provide alternative sources of reference genes in addition to classical reference genes. The analysis of expression profiles of bifunctional catalase-peroxidase (cat2) and trihydroxynaphthalene reductase (thr1) genes further enabled the validation of the selected reference genes. This study provides, for the first time, the tools required to conduct accurate qPCR studies in C. kahawae considering its aggressiveness pattern, developmental stage and host interaction.
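    For readers unfamiliar with the stability measures mentioned above, the sketch below computes a geNorm-style M value (the mean standard deviation of a gene's pairwise log2 expression ratios with all other candidates across samples) on synthetic expression data. It illustrates the idea only and is not the geNorm or NormFinder software; the gene names and expression values are placeholders.

```python
# Minimal sketch of a geNorm-style stability ranking on synthetic expression data.
import numpy as np

rng = np.random.default_rng(0)
genes = ["PP1", "Act", "ck34620", "ck20430"]
# relative expression (linear scale), samples x genes; noise levels are arbitrary
expr = 2.0 ** rng.normal(loc=0.0, scale=[0.2, 0.25, 0.3, 0.5], size=(24, 4))

# pairwise log2 ratios: samples x genes x genes, then SD over samples
log_ratios = np.log2(expr[:, :, None] / expr[:, None, :])
sd = log_ratios.std(axis=0, ddof=1)
m_values = np.array([sd[j, np.arange(4) != j].mean() for j in range(4)])

for g, m in sorted(zip(genes, m_values), key=lambda x: x[1]):
    print(f"{g}: M = {m:.3f}")   # lower M = more stable reference gene
```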

  20. Comparison of the modified unbounded penalty and the LASSO to select...

    • plos.figshare.com
    pptx
    Updated Jun 1, 2023
    Olivier Collignon; Jeongseop Han; Hyungmi An; Seungyoung Oh; Youngjo Lee (2023). Comparison of the modified unbounded penalty and the LASSO to select predictive genes of response to chemotherapy in breast cancer [Dataset]. http://doi.org/10.1371/journal.pone.0204897
    Explore at:
    pptxAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Olivier Collignon; Jeongseop Han; Hyungmi An; Seungyoung Oh; Youngjo Lee
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Covariate selection is a fundamental step when building sparse prediction models in order to avoid overfitting and to gain a better interpretation of the classifier without losing its predictive accuracy. In practice the LASSO regression of Tibshirani, which penalizes the likelihood of the model by the L1 norm of the regression coefficients, has become the gold-standard to reach these objectives. Recently Lee and Oh developed a novel random-effect covariate selection method called the modified unbounded penalty (MUB) regression, whose penalization function can equal minus infinity at 0 in order to produce very sparse models. We sought to compare the predictive accuracy and the number of covariates selected by these two methods in several high-dimensional datasets, consisting in genes expressions measured to predict response to chemotherapy in breast cancer patients. These comparisons were performed by building the Receiver Operating Characteristics (ROC) curves of the classifiers obtained with the selected genes and by comparing their area under the ROC curve (AUC) corrected for optimism using several variants of bootstrap internal validation and cross-validation. We found consistently in all datasets that the MUB penalization selected a remarkably smaller number of covariates than the LASSO while offering a similar—and encouraging—predictive accuracy. The models selected by the MUB were actually nested in the ones obtained with the LASSO. Similar findings were observed when comparing these results to those obtained in their first publication by other authors or when using the area under the Precision-Recall curve (AUCPR) as another measure of predictive performance. In conclusion, the MUB penalization seems therefore to be one of the best options when sparsity is required in high-dimension. Further investigation in other datasets is however required to validate these findings.
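    A minimal sketch of the LASSO arm of such a comparison is shown below: an L1-penalized logistic regression whose AUC is corrected for optimism with a simple bootstrap. The synthetic data, the penalty strength C and the number of resamples are assumptions for illustration, not the paper's settings.

```python
# Minimal sketch: L1-penalized logistic regression with bootstrap optimism correction of the AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 500))                   # 150 patients, 500 gene expressions (synthetic)
y = (X[:, :5].sum(axis=1) + rng.normal(size=150) > 0).astype(int)

def fit_lasso(X, y):
    return LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

apparent = roc_auc_score(y, fit_lasso(X, y).predict_proba(X)[:, 1])

# Optimism: mean over bootstrap resamples of (AUC on resample - AUC on original data)
optimism = []
for _ in range(50):
    idx = rng.integers(0, len(y), len(y))
    m = fit_lasso(X[idx], y[idx])
    optimism.append(
        roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
        - roc_auc_score(y, m.predict_proba(X)[:, 1])
    )

print("apparent AUC:", round(apparent, 3))
print("optimism-corrected AUC:", round(apparent - float(np.mean(optimism)), 3))
print("genes selected:", int((fit_lasso(X, y).coef_ != 0).sum()))
```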

OpenAI Summarization Corpus


Columns

File: comparisons_validation.csv

| Column name | Description |
|:--------------|:---------------------------------------------------------------------------|
| info | Text to be summarized. (String) |
| summaries | Summaries generated by workers. (String) |
| choice | The chosen summary. (String) |
| batch | Batch for which it was created. (Integer) |
| split | Split of the dataset between training and validation sets. (String) |
| extra | Additional information about the given source material available. (String) |

File: comparisons_train.csv

| Column name | Description |
|:--------------|:---------------------------------------------------------------------------|
| info | Text to be summarized. (String) |
| summaries | Summaries generated by workers. (String) |
| choice | The chosen summary. (String) |
| batch | Batch for which it was created. (Integer) |
| split ...
