19 datasets found
  1. pinterest_dataset

    • data.mendeley.com
    Updated Oct 27, 2017
    Cite
    pinterest_dataset [Dataset]. https://data.mendeley.com/datasets/fs4k2zc5j5/2
    Dataset updated
    Oct 27, 2017
    Authors
    Juan Carlos Gomez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset with 72,000 pins from 117 users on Pinterest. Each pin contains a short raw text and an image. The images are processed using a pretrained Convolutional Neural Network and transformed into vectors of 4096 features.

    This dataset was used in the paper "User Identification in Pinterest Through the Refinement of a Cascade Fusion of Text and Images" to identify specific users given their comments. The paper is published in the Research in Computing Science Journal, as part of the LKE 2017 conference. The dataset includes the splits used in the paper.

    There are nine files. text_test, text_train and text_val contain the raw text of each pin in the corresponding split of the data. imag_test, imag_train and imag_val contain the image features of each pin in the corresponding split of the data. train_user and val_test_users contain the index of the user of each pin (between 0 and 116). There is a one-to-one correspondence among the test, train and validation files for images, text and users. There are 400 pins per user in the train set, and 100 pins per user in each of the validation and test sets.
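
    A minimal loading sketch for the splits described above. The file names follow the description, but the assumed formats (one pin per line in the text files, 4096 whitespace-separated floats per line in the image files) should be checked against the archive:

    import numpy as np

    def load_split(text_path, image_path, user_path):
        # One pin per line of raw text (assumed format)
        with open(text_path, encoding="utf-8") as f:
            texts = [line.rstrip("\n") for line in f]
        images = np.loadtxt(image_path)            # (n_pins, 4096) CNN features (assumed format)
        users = np.loadtxt(user_path, dtype=int)   # user index per pin, 0..116
        assert len(texts) == images.shape[0] == users.shape[0]
        return texts, images, users

    # 400 pins per user in train, 100 per user in each of validation and test
    texts_tr, imgs_tr, users_tr = load_split("text_train", "imag_train", "train_user")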

    If you have questions regarding the data, write to: jc dot gomez at ugto dot mx

  2. Synthetic nursing handover training and development data set - text files

    • data.csiro.au
    • researchdata.edu.au
    Updated Mar 21, 2017
    Cite
    Maricel Angel; Hanna Suominen; Liyuan Zhou; Leif Hanlen (2017). Synthetic nursing handover training and development data set - text files [Dataset]. http://doi.org/10.4225/08/58d097ee92e95
    Dataset updated
    Mar 21, 2017
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Maricel Angel; Hanna Suominen; Liyuan Zhou; Leif Hanlen
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Dataset funded by
    NICTA (http://nicta.com.au/)
    Description

    This is one of two collection records. Please see the link below for the other collection of associated audio files.

    Both collections together comprise an open clinical dataset of three sets of 101 nursing handover records, very similar to real documents in Australian English. Each record consists of a patient profile, spoken free-form text document, written free-form text document, and written structured document.

    This collection contains 3 sets of text documents.

    Data Set 1 for Training and Development

    The data set, released in June 2014, includes the following documents:

    • Folder initialisation: initialisation details for speech recognition using Dragon Medical 11.0 (i.e., i) DOCX for the written, free-form text document that originates from the Dragon software release and ii) WMA for the spoken, free-form text document by the RN)
    • Folder 100profiles: 100 patient profiles (DOCX)
    • Folder 101writtenfreetextreports: 101 written, free-form text documents (TXT)
    • Folder 100x6speechrecognised: 100 speech-recognized, written, free-form text documents for six Dragon vocabularies (TXT)
    • Folder 101informationextraction: 101 written, structured documents for information extraction that include i) the reference standard text, ii) features used by our best system, iii) form categories with respect to the reference standard and iv) form categories with respect to our best information extraction system (TXT in CRF++ format).

    An Independent Data Set 2

    The aforementioned data set was supplemented in April 2015 with an independent set that was used as a test set in the CLEFeHealth 2015 Task 1a on clinical speech recognition and can be used as a validation set in the CLEFeHealth 2016 Task 1 on handover information extraction. Hence, when using this set, please avoid its repeated use in evaluation – we do not wish to overfit to these data sets.

    The set released in April 2015 consists of 100 patient profiles (DOCX), 100 written, and 100 speech-recognized, written, free-form text documents for the Dragon vocabulary of Nursing (TXT). The set released in November 2015 consists of the respective 100 written free-form text documents (TXT) and 100 written, structured documents for information extraction.

    An Independent Data Set 3

    For evaluation purposes, the aforementioned data sets were supplemented in April 2016 with an independent set of another 100 synthetic cases.

    Lineage: Data creation included the following steps: generation of patient profiles; creation of written, free form text documents; development of a structured handover form, using this form and the written, free-form text documents to create written, structured documents; creation of spoken, free-form text documents; using a speech recognition engine with different vocabularies to convert the spoken documents to written, free-form text; and using an information extraction system to fill out the handover form from the written, free-form text documents.

    See Suominen et al (2015) in the links below for a detailed description and examples.

  3. Data from: Product Datasets from the MWPD2020 Challenge at the ISWC2020...

    • linkagelibrary.icpsr.umich.edu
    • da-ra.de
    Updated Nov 26, 2020
    Cite
    Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Product Datasets from the MWPD2020 Challenge at the ISWC2020 Conference (Task 1) [Dataset]. http://doi.org/10.3886/E127482V1
    Dataset updated
    Nov 26, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Ralph Peeters; Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with the corresponding label "match" or "no match") from the product category computers. The data is available in the form of training, validation and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets because it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g. products without training data in the training set or products which have had typos introduced. These can be used to measure the performance of methods on these kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites that mark up their offers with schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.
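
    As a rough illustration of the task setup (not the official baseline), the sketch below scores pairs with TF-IDF cosine similarity; the file name and the fields title_left, title_right and label are placeholders for whatever the released files actually use:

    import json
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import f1_score

    def load_pairs(path):
        # assumes one JSON object per line; adjust to the actual file format
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]

    pairs = load_pairs("train.json")                     # hypothetical file name
    left = [p["title_left"] for p in pairs]              # placeholder field names
    right = [p["title_right"] for p in pairs]
    y = np.array([int(p["label"]) for p in pairs])

    vec = TfidfVectorizer().fit(left + right)
    a, b = vec.transform(left), vec.transform(right)
    sim = np.asarray(a.multiply(b).sum(axis=1)).ravel()  # cosine similarity (rows are L2-normalised)
    pred = (sim >= 0.5).astype(int)
    print("F1 on training pairs:", f1_score(y, pred))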

  4. DataSheet1_Comparative analysis of tissue-specific genes in maize based on...

    • frontiersin.figshare.com
    docx
    Updated Jun 2, 2023
    Cite
    Zijie Wang; Yuzhi Zhu; Zhule Liu; Hongfu Li; Xinqiang Tang; Yi Jiang (2023). DataSheet1_Comparative analysis of tissue-specific genes in maize based on machine learning models: CNN performs technically best, LightGBM performs biologically soundest.docx [Dataset]. http://doi.org/10.3389/fgene.2023.1190887.s001
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Zijie Wang; Yuzhi Zhu; Zhule Liu; Hongfu Li; Xinqiang Tang; Yi Jiang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: With the advancement of RNA-seq technology and machine learning, training large-scale RNA-seq data from databases with machine learning models can generally identify genes with important regulatory roles that were previously missed by standard linear analytic methodologies. Finding tissue-specific genes could improve our comprehension of the relationship between tissues and genes. However, few machine learning models for transcriptome data have been deployed and compared to identify tissue-specific genes, particularly for plants.

    Methods: In this study, an expression matrix was processed with linear models (Limma), machine learning models (LightGBM), and deep learning models (CNN) with information gain and the SHAP strategy, based on 1,548 maize multi-tissue RNA-seq data obtained from a public database, to identify tissue-specific genes. For validation, V-measure values were computed based on k-means clustering of the gene sets to evaluate their technical complementarity. Furthermore, GO analysis and literature retrieval were used to validate the functions and research status of these genes.

    Results: Based on clustering validation, the convolutional neural network outperformed the others with a higher V-measure value of 0.647, indicating that its gene set could cover as many specific properties of various tissues as possible, whereas LightGBM discovered key transcription factors. The combination of the three gene sets produced 78 core tissue-specific genes that had previously been shown in the literature to be biologically significant.

    Discussion: Different tissue-specific gene sets were identified due to the distinct interpretation strategies of the machine learning models, and researchers may use multiple methodologies and strategies for tissue-specific gene sets based on their goals, types of data, and computational resources. This study provides comparative insight for large-scale data mining of transcriptome datasets, shedding light on resolving high-dimensionality and bias difficulties in bioinformatics data processing.
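
    A small sketch of the clustering check described above: cluster samples on a candidate gene set with k-means and score agreement with the known tissue labels using the V-measure (random numbers stand in for the maize expression matrix):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import v_measure_score

    rng = np.random.default_rng(0)
    expr = rng.normal(size=(300, 50))        # samples x selected genes (toy stand-in)
    tissue = rng.integers(0, 5, size=300)    # known tissue label per sample

    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(expr)
    print("V-measure:", v_measure_score(tissue, labels))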

  5. OGBN-Products (Processed for PyG)

    • kaggle.com
    Updated Feb 27, 2021
    Cite
    Redao da Taupl (2021). OGBN-Products (Processed for PyG) [Dataset]. https://www.kaggle.com/datasets/dataup1/ogbn-products/data
    Dataset updated
    Feb 27, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Redao da Taupl
    Description

    OGBN-Products

    Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-products

    Usage in Python

    import os.path as osp
    import pandas as pd
    import datatable as dt
    import torch
    import torch_geometric as pyg
    from ogb.nodeproppred import PygNodePropPredDataset
    # helper shipped with ogb for heterogeneous splits (module path may vary by version)
    from ogb.io.read_graph_raw import read_nodesplitidx_split_hetero
    
    class PygOgbnProducts(PygNodePropPredDataset):
      def __init__(self, meta_csv = None):
        # Point the loader at the Kaggle copy of the dataset and its master CSV.
        root, name, transform = '/kaggle/input', 'ogbn-products', None
        if meta_csv is None:
          meta_csv = osp.join(root, name, 'ogbn-master.csv')
        master = pd.read_csv(meta_csv, index_col = 0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
        super().__init__(name = name, root = root, transform = transform, meta_dict = meta_dict)
      def get_idx_split(self, split_type = None):
        if split_type is None:
          split_type = self.meta_info['split']
        path = osp.join(self.root, 'split', split_type)
        # Shortcut: reuse a previously saved split dictionary if one exists.
        if osp.isfile(osp.join(path, 'split_dict.pt')):
          return torch.load(osp.join(path, 'split_dict.pt'))
        if self.is_hetero:
          train_idx_dict, valid_idx_dict, test_idx_dict = read_nodesplitidx_split_hetero(path)
          for nodetype in train_idx_dict.keys():
            train_idx_dict[nodetype] = torch.from_numpy(train_idx_dict[nodetype]).to(torch.long)
            valid_idx_dict[nodetype] = torch.from_numpy(valid_idx_dict[nodetype]).to(torch.long)
            test_idx_dict[nodetype] = torch.from_numpy(test_idx_dict[nodetype]).to(torch.long)
          return {'train': train_idx_dict, 'valid': valid_idx_dict, 'test': test_idx_dict}
        else:
          # Read the train/valid/test node indices from the split CSV files.
          train_idx = dt.fread(osp.join(path, 'train.csv'), header = None).to_numpy().T[0]
          train_idx = torch.from_numpy(train_idx).to(torch.long)
          valid_idx = dt.fread(osp.join(path, 'valid.csv'), header = None).to_numpy().T[0]
          valid_idx = torch.from_numpy(valid_idx).to(torch.long)
          test_idx = dt.fread(osp.join(path, 'test.csv'), header = None).to_numpy().T[0]
          test_idx = torch.from_numpy(test_idx).to(torch.long)
          return {'train': train_idx, 'valid': valid_idx, 'test': test_idx}
    
    dataset = PygOgbnProducts()
    split_idx = dataset.get_idx_split()
    train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
    graph = dataset[0] # PyG Graph object
    

    Description

    Graph: The ogbn-products dataset is an undirected and unweighted graph, representing an Amazon product co-purchasing network [1]. Nodes represent products sold in Amazon, and edges between two products indicate that the products are purchased together. The authors follow [2] to process node features and target categories. Specifically, node features are generated by extracting bag-of-words features from the product descriptions followed by a Principal Component Analysis to reduce the dimension to 100.
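
    A toy sketch of that feature recipe (bag-of-words over product descriptions, then PCA; ogbn-products reduces to 100 dimensions, the toy corpus below uses 2):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import PCA

    # Toy stand-in for the Amazon product descriptions.
    descriptions = ["usb cable two meters", "wireless optical mouse",
                    "mechanical keyboard rgb", "usb wireless keyboard"]
    bow = CountVectorizer().fit_transform(descriptions).toarray()
    features = PCA(n_components=2).fit_transform(bow)  # use n_components=100 on the real corpus
    print(features.shape)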

    Prediction task: The task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used for target labels.

    Dataset splitting: The authors consider a more challenging and realistic dataset splitting that differs from the one used in [2]. Instead of randomly assigning 90% of the nodes for training and 10% of the nodes for testing (without use of a validation set), the authors use the sales ranking (popularity) to split nodes into training/validation/test sets. Specifically, the authors sort the products according to their sales ranking and use the top 8% for training, the next top 2% for validation, and the rest for testing. This is a more challenging splitting procedure that closely matches the real-world application where labels are first assigned to important nodes in the network and ML models are subsequently used to make predictions on less important ones.
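
    The split can be reproduced in a few lines once a sales-rank array is available (a sketch of the 8%/2%/90% rule described above; rank 1 is taken to mean best-selling):

    import numpy as np

    def sales_rank_split(sales_rank):
        order = np.argsort(sales_rank)             # best-selling products first
        n = len(order)
        n_train, n_valid = int(0.08 * n), int(0.02 * n)
        return {"train": order[:n_train],
                "valid": order[n_train:n_train + n_valid],
                "test":  order[n_train + n_valid:]}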

    Note 1: A very small number of self-connecting edges are repeated (see here); you may remove them if necessary.

    Note 2: For undirected graphs, the loaded graphs will have the doubled number of edges because the bidirectional edges will be added automatically.

    Summary

    Package: ogb>=1.1.1
    #Nodes: 2,449,029
    #Edges: 61,859,140
    Split Type: Sales rank
    Task Type: Multi-class classification
    Metric: Accuracy

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [3] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.

    References

    [1] http://manikvarma.org/downloads/XC/XMLRepository.html
    [2] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 257–266, 2019.
    [3] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

    License: Amazon License

    By accessing the Amazon Customer Reviews Library ("Reviews Library"), you agree that the Reviews Library is an Amazon Service subject to the Amazon.com Conditions of Use (https://www.amazon.com/gp/help/customer/display.html/ref=footer_cou?ie=UTF8&nodeId=508088) and you agree to be bound by them, with the following additional conditions: In addition to the license rights granted under the Conditions of Use, Amazon or its content providers grant you a limited, non-exclusive, non-transferable, non-sublicensable, revocable license to access and use the Reviews Library for purposes of academic research. You may not resell, republish, or make any commercial use of the Reviews Library or its contents, including use of the Reviews Library for commercial research, such as research related to a funding or consultancy contract, internship, or other relationship in which the results are provided for a fee or delivered to a for-profit organization. You may not (a) link or associate content in the Reviews Library with any personal information (including Amazon customer accounts), or (b) attempt to determine the identity of the author of any content in the Reviews Library. If you violate any of the foregoing conditions, your license to access and use the Reviews Library will automatically terminate without prejudice to any of the other rights or remedies Amazon may have.

    Disclaimer

    I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. Any questions, problems or issues, please contact the original authors at their website or their GitHub repo.

  6. Data associated with "A collaborative filtering based approach to biomedical...

    • zenodo.org
    application/gzip
    Updated Jan 24, 2020
    Cite
    Jake Lever; Jake Lever (2020). Data associated with "A collaborative filtering based approach to biomedical knowledge discovery" [Dataset]. http://doi.org/10.5281/zenodo.1227313
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jake Lever; Jake Lever
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the data set associated with the publication: "A collaborative filtering based approach to biomedical knowledge discovery" published in Bioinformatics.

    The data are sets of cooccurrences of biomedical terms extracted from published abstracts and full text articles. The cooccurrences are then represented in sparse matrix form. There are three different splits of this data denoted by the prefix number on the files.

    1. All - All cooccurrences combined in a single file

    2. Training/Validation - All cooccurrences in publications before 2010 go in training; all novel cooccurrences in publications in 2010 go in validation

    3. Training+Validation/Test - All cooccurrences in publications up to and including 2010 go in training+validation. All novel cooccurrences after 2010 are provided in year-by-year increments and also all combined together

    Furthermore, there are subset files which are used in some experiments to deal with the computational cost of evaluating the full set. The associated cuids.txt file contains the mapping between the rows/columns of the matrix and the UMLS Metathesaurus CUIDs: the first row of cuids.txt corresponds to the 0th row/column of the matrix. Note that the matrix is square and symmetric. This work was done with UMLS Metathesaurus 2016AB.
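
    A loading sketch under stated assumptions (that the matrices are SciPy-readable sparse files and that cuids.txt has one identifier per line; the actual file names and formats in the archive may differ):

    import scipy.sparse as sp

    with open("cuids.txt") as f:
        cuids = [line.strip() for line in f]      # row/column i of the matrix <-> cuids[i]

    matrix = sp.load_npz("training.npz")          # hypothetical file name/format
    assert matrix.shape[0] == matrix.shape[1] == len(cuids)

    # The matrix is symmetric: a nonzero at (i, j) means the terms cuids[i] and
    # cuids[j] co-occurred; keep i < j to list each unordered pair once.
    rows, cols = matrix.nonzero()
    pairs = [(cuids[i], cuids[j]) for i, j in zip(rows, cols) if i < j]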

  7. Randomized controlled oncology trials with tumor stage inclusion criteria

    • search.dataone.org
    • data.niaid.nih.gov
    Updated Jul 2, 2024
    Cite
    Paul Windisch; Daniel Zwahlen (2024). Randomized controlled oncology trials with tumor stage inclusion criteria [Dataset]. http://doi.org/10.5061/dryad.g4f4qrfzn
    Dataset updated
    Jul 2, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Paul Windisch; Daniel Zwahlen
    Description

    Background: Extracting inclusion and exclusion criteria in a structured, automated fashion remains a challenge to developing better search functionalities or automating systematic reviews of randomized controlled trials in oncology. The question "Did this trial enroll patients with localized disease, metastatic disease, or both?" could be used to narrow down the number of potentially relevant trials when conducting a search. Dataset collection: 600 randomized controlled trials from high-impact medical journals were classified depending on whether they allowed for the inclusion of patients with localized and/or metastatic disease. The dataset was randomly split into a training/validation and a test set of 500 and 100 trials respectively. However, the sets could be merged to allow for different splits. Data properties: Each trial is a row in the csv file. For each trial there is a doi, a publication date, a title, an abstract, the abstract sections (introduction, methods, results, conclus..., Randomized controlled oncology trials from seven major journals (British Medical Journal, JAMA, JAMA Oncology, Journal of Clinical Oncology, Lancet, Lancet Oncology, New England Journal of Medicine) published between 2005 and 2023 were randomly sampled and annotated with the labels "LOCAL", "METASTATIC", both or none. Trials that allowed for the inclusion of patients with localized disease received the label "LOCAL". Trials that allowed for the inclusion of patients with metastatic disease received the label "METASTATIC". Trials that allowed for the inclusion of patients with either localized or metastatic disease received both labels. Screening trials that enrolled patients without known cancer or trials of interventions to prevent cancer were assigned no label. Trials of tumor entities where the distinction between localized and metastatic disease is usually not made (e.g., hematologic malignancies) were skipped. Annotation was based on the title and abstract. If those were inconclusiv...

    Randomized controlled oncology trials with tumor stage inclusion criteria

    https://doi.org/10.5061/dryad.g4f4qrfzn

    600 randomized controlled oncology trials from high-impact medical journals (British Medical Journal, JAMA, JAMA Oncology, Journal of Clinical Oncology, Lancet, Lancet Oncology, New England Journal of Medicine) published between 2005 and 2023 were randomly sampled and classified depending on whether they allowed for the inclusion of patients with localized and/or metastatic disease. The dataset was randomly split into a training/validation and a test set of 500 and 100 trials respectively. However, the sets could be merged to allow for different splits.

    Description of the data and file structure

    Each trial is a row in the csv file. For each trial there are the following columns:

    • doi: Digital Object Identifier of the trial
    • date: Publication date according to PubMed
    • title: Title of the trial according to PubMed
    • ab...
  8. Raw data outputs 1-18

    • bridges.monash.edu
    • researchdata.edu.au
    xlsx
    Updated May 30, 2023
    Cite
    Abbas Salavaty Hosein Abadi; Sara Alaei; Mirana Ramialison; Peter Currie (2023). Raw data outputs 1-18 [Dataset]. http://doi.org/10.26180/21259491.v1
    Dataset updated
    May 30, 2023
    Dataset provided by
    Monash University
    Authors
    Abbas Salavaty Hosein Abadi; Sara Alaei; Mirana Ramialison; Peter Currie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Raw data outputs 1-18
    Raw data output 1. Differentially expressed genes in AML CSCs compared with GTCs as well as in TCGA AML cancer samples compared with normal ones. This data was generated based on the results of AML microarray and TCGA data analysis.
    Raw data output 2. Commonly and uniquely differentially expressed genes in AML CSC/GTC microarray and TCGA bulk RNA-seq datasets. This data was generated based on the results of AML microarray and TCGA data analysis.
    Raw data output 3. Common differentially expressed genes between training and test set samples of the microarray dataset. This data was generated based on the results of AML microarray data analysis.
    Raw data output 4. Detailed information on the samples of the breast cancer microarray dataset (GSE52327) used in this study.
    Raw data output 5. Differentially expressed genes in breast CSCs compared with GTCs as well as in TCGA BRCA cancer samples compared with normal ones.
    Raw data output 6. Commonly and uniquely differentially expressed genes in breast cancer CSC/GTC microarray and TCGA BRCA bulk RNA-seq datasets. This data was generated based on the results of breast cancer microarray and TCGA BRCA data analysis. CSC and GTC are abbreviations of cancer stem cell and general tumor cell, respectively.
    Raw data output 7. Differential and common co-expression and protein-protein interaction of genes between CSC and GTC samples. This data was generated based on the results of AML microarray and STRING database-based protein-protein interaction data analysis. CSC and GTC are abbreviations of cancer stem cell and general tumor cell, respectively.
    Raw data output 8. Differentially expressed genes between AML dormant and active CSCs. This data was generated based on the results of AML scRNA-seq data analysis.
    Raw data output 9. Uniquely expressed genes in dormant or active AML CSCs. This data was generated based on the results of AML scRNA-seq data analysis.
    Raw data output 10. Intersections between the targeting transcription factors of AML key CSC genes and differentially expressed genes between AML CSCs vs GTCs and between dormant and active AML CSCs, or the uniquely expressed genes in either class of CSCs.
    Raw data output 11. Targeting desirableness score of AML key CSC genes and their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.
    Raw data output 12. CSC-specific targeting desirableness score of AML key CSC genes and their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.
    Raw data output 13. The protein-protein interactions between AML key CSC genes with themselves and their targeting transcription factors. This data was generated based on the results of AML microarray and STRING database-based protein-protein interaction data analysis.
    Raw data output 14. The previously confirmed associations of genes having the highest targeting desirableness and CSC-specific targeting desirableness scores with AML or other cancers’ (stem) cells as well as hematopoietic stem cells. These data were generated based on PubMed database-based literature mining.
    Raw data output 15. Drug score of available drugs and bioactive small molecules targeting AML key CSC genes and/or their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.
    Raw data output 16. CSC-specific drug score of available drugs and bioactive small molecules targeting AML key CSC genes and/or their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.
    Raw data output 17. Candidate drugs for experimental validation. These drugs were selected based on their respective (CSC-specific) drug scores. CSC is the abbreviation of cancer stem cell.
    Raw data output 18. Detailed information on the samples of the AML microarray dataset GSE30375 used in this study.

  9. Dataset for modeling spatial and temporal variation in natural background...

    • s.cnmilf.com
    • catalog.data.gov
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Dataset for modeling spatial and temporal variation in natural background specific conductivity [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/dataset-for-modeling-spatial-and-temporal-variation-in-natural-background-specific-conduct
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This file contains the data set used to develop a random forest model to predict background specific conductivity for stream segments in the contiguous United States. This Excel-readable file contains 56 columns of parameters evaluated during development. The data dictionary provides the definitions of the abbreviations and the measurement units. Each row is a unique sample described as R**, which indicates the NHD Hydrologic Unit (underscore), up to a 7-digit COMID, (underscore) sequential sample month.

    To develop models that make stream-specific predictions across the contiguous United States, we used the StreamCat data set and process (Hill et al. 2016; https://github.com/USEPA/StreamCat). The StreamCat data set is based on a network of stream segments from NHD+ (McKay et al. 2012). These stream segments drain an average area of 3.1 km2 and thus define the spatial grain size of this data set. The data set consists of minimally disturbed sites representing the natural variation in environmental conditions that occur in the contiguous 48 United States.

    More than 2.4 million SC observations were obtained from STORET (USEPA 2016b), state natural resource agencies, the U.S. Geological Survey (USGS) National Water Information System (NWIS) (USGS 2016), and data used in Olson and Hawkins (2012) (Table S1). Data include observations made between 1 January 2001 and 31 December 2015, and are thus coincident with Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data (https://modis.gsfc.nasa.gov/data/). Each observation was related to the nearest stream segment in the NHD+. Data were limited to one observation per stream segment per month. SC observations with ambiguous locations and repeat measurements along a stream segment in the same month were discarded.

    Using estimates of anthropogenic stress derived from the StreamCat database (Hill et al. 2016), segments were selected with minimal amounts of human activity (Stoddard et al. 2006) using criteria developed for each Level II Ecoregion (Omernik and Griffith 2014). Segments were considered as potentially minimally stressed where watersheds had 0 - 0.5% impervious surface, 0 – 5% urban, 0 – 10% agriculture, and population densities from 0.8 – 30 people/km2 (Table S3). Watersheds with observations with large residuals in initial models were identified and inspected for evidence of other human activities not represented in StreamCat (e.g., mining, logging, grazing, or oil/gas extraction). Observations were removed from disturbed watersheds and from those with a tidal influence or unusual geologic conditions such as hot springs.

    About 5% of SC observations in each National Rivers and Stream Assessment (NRSA) region were then randomly selected as independent validation data. The remaining observations became the large training data set for model calibration. This dataset is associated with the following publication: Olson, J., and S. Cormier. Modeling spatial and temporal variation in natural background specific conductivity. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 53(8): 4316-4325, (2019).
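
    A minimal sketch of the modelling setup: a random forest on the covariate columns with a held-out validation split. The file and column names are placeholders for the data dictionary, and the actual validation draw was ~5% per NRSA region rather than a single random split:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score

    df = pd.read_csv("background_sc_dataset.csv")        # hypothetical export of the Excel file
    X = df.drop(columns=["specific_conductivity"])       # placeholder target column name
    y = df["specific_conductivity"]

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.05, random_state=0)
    rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
    print("validation R^2:", r2_score(y_val, rf.predict(X_val)))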

  10. TRIDIS: HTR model for Multilingual Medieval and Early Modern Documentary...

    • zenodo.org
    • data.niaid.nih.gov
    bin, json
    Updated Mar 14, 2024
    Cite
    Sergio Torres Aguilar; Sergio Torres Aguilar; Vincent Jolivet; Vincent Jolivet (2024). TRIDIS: HTR model for Multilingual Medieval and Early Modern Documentary Manuscripts (11th-16th) [Dataset]. http://doi.org/10.5281/zenodo.10800223
    Dataset updated
    Mar 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sergio Torres Aguilar; Sergio Torres Aguilar; Vincent Jolivet; Vincent Jolivet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TRIDIS (Tria Digita Scribunt) is a Handwriting Text Recognition model trained on semi-diplomatic transcriptions from medieval and Early Modern manuscripts. It is suitable for work on documentary manuscripts, that is, manuscripts arising from legal, administrative, and memorial practices, more commonly from the Late Middle Ages (13th century and onwards). It can also show good performance on documents from other domains, such as literature books, scholarly treatises and cartularies, providing a versatile tool for historians and philologists in transforming and analyzing historical texts.

    A paper presenting the first version of the model is available here: Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. Journal of Data Mining and Digital Humanities. 2023. https://hal.science/hal-03892163

    Transcription rules:

    Since the majority of the training documents come from diplomatic editions, the transcriptions were normalized to contemporary reading standards, and abbreviations were expanded with the aim of facilitating a more fluid reading of the document.

    The following rules were applied:

    • The abbreviations have been expanded, both those by suspension (facimꝰ ---> facimus) and by contraction (dñi --> domini). Likewise, those using conventional signs ( --> et ; --> pro) have been resolved.
    • The named entities (names of persons, places and institutions) have been capitalized. The beginning of a block of text as well as the original capitals used by the scribe are also capitalized.
    • The consonantal i and u characters have been transcribed as j and v in both French and Latin.
    • The punctuation marks used in the manuscript like: . or / or | have not been systematically transcribed as the transcription has been standardized with modern punctuation.
    • Corrections and words that appear cancelled in the manuscript have been transcribed surrounded by the sign $ at the beginning and at the end.

    Versions :

    Version 1 of the model was trained on charters and registers dataset from the Late Medieval period (12th-15th centuries). The training and evaluation involved 1855 pages, 120k lines of text, and almost 1M tokens, conducted using three freely available ground-truth corpora:

    Version 2 of the model has added new datasets from feudal books and legal proceedings (14th-16th centuries), incorporating an additional 115k lines and more than 1.2M tokens to the previous version using other corpora like:

    Accuracy

    TRIDIS was trained using a CNN+RNN+CTC architecture within the Kraken suite (https://kraken.re/). This final model operates in a multilingual environment (Latin, Old French, and Old Spanish) and is capable of recognizing several Latin script families (mostly Textualis and Cursiva) in documents produced circa 11th - 16th centuries. During evaluation, the model showed an accuracy of 93.1% on the validation set and a CER (Character Error Ratio) of about 0.11 to 0.15 on four external unseen datasets. Fine-tuning the model with 10 ground-truth pages can improve these results to a CER of between 0.06 and 0.10.
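
    For reference, the CER quoted above is the character-level edit distance between the recognized text and the ground truth, divided by the ground-truth length; a plain-Python sketch:

    def levenshtein(a: str, b: str) -> int:
        # classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def cer(prediction: str, reference: str) -> float:
        return levenshtein(prediction, reference) / max(len(reference), 1)

    print(cer("domini nostri", "domni nostri"))  # one insertion -> CER ~ 0.083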

    Other formats

    The ground truth used for version 2 was also employed to train a Transformer HTR model that combines TrOCR as the encoder with a RoBERTa medieval model as the decoder. This model exhibits slightly better performance in terms of CER than the current TRIDIS version and shows a WER improved by about 25%. The model is available on the Hugging Face Hub: magistermilitum/tridis_HTR

  11. Data from: Predictive modeling for clinical features associated with...

    • data.niaid.nih.gov
    • zenodo.org
    zip
    Updated Mar 10, 2022
    Cite
    Philip Payne; Stephanie Morris; Aditi Gupta; Seunghwan Kim; Randi Foraker; David Gutmann (2022). Predictive modeling for clinical features associated with Neurofibromatosis Type 1 [Dataset]. http://doi.org/10.5061/dryad.nvx0k6drn
    Dataset updated
    Mar 10, 2022
    Dataset provided by
    Washington University in St. Louis
    Authors
    Philip Payne; Stephanie Morris; Aditi Gupta; Seunghwan Kim; Randi Foraker; David Gutmann
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: Perform a longitudinal analysis of clinical features associated with Neurofibromatosis Type 1 (NF1) based on demographic and clinical characteristics, and to apply a machine learning strategy to determine feasibility of developing exploratory predictive models of optic pathway glioma (OPG) and attention-deficit/hyperactivity disorder (ADHD) in a pediatric NF1 cohort.

    Methods: Using NF1 as a model system, we perform retrospective data analyses utilizing a manually-curated NF1 clinical registry and electronic health record (EHR) information, and develop machine-learning models. Data for 798 individuals were available, with 578 comprising the pediatric cohort used for analysis.

    Results: Males and females were evenly represented in the cohort. White children were more likely to develop OPG (OR: 2.11, 95%CI: 1.11-4.00, p=0.02) relative to their non-white peers. Median age at diagnosis of OPG was 6.5 years (1.7-17.0), irrespective of sex. Males were more likely than females to have a diagnosis of ADHD (OR: 1.90, 95%CI: 1.33-2.70, p<0.001), and earlier diagnosis in males relative to females was observed. The gradient boosting classification model predicted diagnosis of ADHD with an AUROC of 0.74, and predicted diagnosis of OPG with an AUROC of 0.82.

    Conclusions: Using readily available clinical and EHR data, we successfully recapitulated several important and clinically-relevant patterns in NF1 semiology specifically based on demographic and clinical characteristics. Naïve machine learning techniques can be potentially used to develop and validate predictive phenotype complexes applicable to risk stratification and disease management in NF1.

    Methods Patients and Data Description

    This study was performed using retrospective clinical data extracted from two sources within the Washington University Neurofibromatosis (NF) Center. First, data were extracted from an existing longitudinal clinical registry that was manually curated using clinical data obtained from patients followed in the Washington University NF Clinical Program at St. Louis Children’s Hospital. All individuals included in this database had a clinical diagnosis of NF1 based on current National Institutes of Health Consensus Development Conference diagnostic criteria,9 and had been assessed over multiple visits from 2002 to 2016 for the presence of clinical features associated with NF1. Data points in this registry included demographic information, such as age, race, and sex, in addition to NF1-related clinical features and associated conditions, such as café-au-lait macules, skinfold freckling, cutaneous neurofibromas, Lisch nodules, OPG, hypertension, ADHD, and cognitive impairment. These data were maintained in a semi-structured format containing textual and binary fields, capturing each individual’s data over multiple clinical visits. From these data, clinical features and phenotypes were extracted using data manipulation, imputation, and text mining techniques. Data obtained from this NF1 clinical registry were converted to data tables, which captured each patient visit and the presence/absence of specific clinical features at each visit. Clinical features which were once marked as present were assumed to be present for all future visits, and missing data were assumed absent for that specific visit. Categorical variables are reported as frequencies and proportions, and compared using odds ratios (ORs). Continuously distributed traits, adhering to both conventional normality assumptions and homogeneity of variances, are reported as mean and standard deviations, and compared using analysis of variance methods. Non-parametric equivalents were used for data with non-normative distributions.

    Clinical Feature Extraction from Clinical Registry and EHR

    The NF1 Clinical Registry comprised string-based clinical feature values, such as ADHD, OPG, and asthma. From these data, we extracted 27 unique clinical features in addition to longitudinal data on the development of NF1-related clinical features and associated diagnoses. For each clinical feature, age at initial presentation and/or diagnosis was computed, and median age of occurrence was calculated for each sex. The exact age of presentation and/or diagnosis could not be definitively ascertained for any feature that was present at a child’s initial clinic visit. As such, we computed the age of diagnosis only for those clinical features for which we have at least one visit documenting feature absence prior to the manifestation of that feature.

    Diagnosis codes from the EHR-derived data set were also extracted. Diagnosis codes were recorded as 15,890 unique ICD 9/10 codes. Given the large number of ICD 9/10 codes, a consistent, concept-level “roll up” of relevant codes to a single phenotype description was created by mapping the extracted ICD 9/10 values to phenome-wide association (PheWAS) codes called Phecodes, which have been demonstrated to better align with clinical disease compared to individual ICD codes.

    Machine Learning Analyses

    Using a combination of clinical features obtained from the NF1 Clinical Registry and EHR-derived data sets, we developed prediction models using a gradient boosting platform for identifying patients with specific NF1-related diagnoses to establish the usefulness of clinical history and documentation of clinical findings in predicting phenotypic variability of NF1. Initial analyses used a state-of-the-art classification algorithm, gradient boosting model, which uses a tree-based algorithm to produce a predictive model from an ensemble of weak predictive models. Gradient boosting model was selected, as it supports identifying importance of features used in the final prediction model. Subsequent analyses employed training each model for three different feature sets: (1) demographic features for all patients, including race, sex, and family history of NF1 [5 features]; (2) clinical features associated with NF1 [27 features] extracted from the NF1 Clinical Registry; and (3) diagnosis codes extracted from the EHR data, which were reduced to 50 Phecodes. Four-fold cross validation was then applied for the three models, and comparisons for the prediction accuracies of each model determined. Positive predictive value (PPV), F1 score and the area under the receiver operator characteristic (AUROC) curve were used as evaluation metrics. Scikit Learn, a machine learning library in Python, was employed to implement all analyses.
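
    A compact sketch of that evaluation scheme using scikit-learn (synthetic features stand in for the registry and EHR feature sets; the numbers below are illustrative only):

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_validate

    rng = np.random.default_rng(0)
    X = rng.normal(size=(578, 27))            # e.g. the 27 clinical-registry features
    y = rng.integers(0, 2, size=578)          # e.g. ADHD diagnosis yes/no

    scores = cross_validate(GradientBoostingClassifier(random_state=0), X, y, cv=4,
                            scoring=["precision", "f1", "roc_auc"])   # precision = PPV
    print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})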

    Standard Protocol Approvals, Registrations, and Patient Consents

    The NF1 Clinical Registry is an existing longitudinal clinical registry that was manually curated using clinical data obtained from patients followed in the Washington University NF Clinical Program at St. Louis Children’s Hospital. All individuals included in this database have a clinical diagnosis of NF1 based on current National Institutes of Health criteria and have provided informed consent for participation in the clinical registry. All data collection, usage and analysis for this study were approved by the Institutional Review Board (IRB) at the Washington University School of Medicine.

  12. Table1_Identifying oral disease variables associated with pneumonia...

    • frontiersin.figshare.com
    docx
    Updated Jul 18, 2024
    Cite
    Neel Shimpi; Ingrid Glurich; Aloksagar Panny; Harshad Hegde; Frank A. Scannapieco; Amit Acharya (2024). Table1_Identifying oral disease variables associated with pneumonia emergence by application of machine learning to integrated medical and dental big data to inform eHealth approaches.docx [Dataset]. http://doi.org/10.3389/fdmed.2022.1005140.s002
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Frontiers
    Authors
    Neel Shimpi; Ingrid Glurich; Aloksagar Panny; Harshad Hegde; Frank A. Scannapieco; Amit Acharya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The objective of this study was to build models that define variables contributing to pneumonia risk by applying supervised Machine Learning (ML) to medical and oral disease data, to define key risk variables contributing to pneumonia emergence for any pneumonia/pneumonia subtypes.

    Methods: Retrospective medical and dental data were retrieved from the Marshfield Clinic Health System's data warehouse and the integrated electronic medical-dental health records (iEHR). Retrieved data were preprocessed prior to conducting analyses and included matching of cases to controls by (a) race/ethnicity and (b) a 1:1 case:control ratio. Variables with >30% missing data were excluded from analysis. Datasets were divided into four subsets: (1) All Pneumonia (all cases and controls); (2) community (CAP)/healthcare-associated (HCAP) pneumonias; (3) ventilator-associated (VAP)/hospital-acquired (HAP) pneumonias; and (4) aspiration pneumonia (AP). Performance of five algorithms was compared across the four subsets: Naïve Bayes, Logistic Regression, Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Random Forests. Feature (input variable) selection and 10-fold cross-validation were performed on all the datasets. An evaluation set (10%) was extracted from the subsets for further validation. Model performance was evaluated in terms of total accuracy, sensitivity, specificity, F-measure, Matthews correlation coefficient, and area under the receiver operating characteristic curve (AUC).

    Results: In total, 6,034 records (cases and controls) met eligibility for inclusion in the main dataset. After feature selection, the variables retained in the subsets were: All Pneumonia (n = 29 variables), CAP-HCAP (n = 26 variables), VAP-HAP (n = 40 variables), and AP (n = 37 variables). Variables retained (n = 22) were common across all four pneumonia subsets. Of these, the number of missing teeth, periodontal status, periodontal pocket depth of more than 5 mm, and number of restored teeth contributed to all the subsets and were retained in the model. MLP outperformed other predictive models for the All Pneumonia, CAP-HCAP, and AP subsets, while SVM outperformed other models in the VAP-HAP subset.

    Conclusion: This study validates previously described associations between poor oral health and pneumonia. Benefits of an integrated medical-dental record and care delivery environment for modeling pneumonia risk are highlighted. Based on the findings, risk score development could inform referrals and follow-up in integrated healthcare delivery environments and coordinated patient management.
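
    A sketch of the five-algorithm comparison with 10-fold cross-validation in scikit-learn (synthetic data as a stand-in for the matched case-control subsets; the original work also used feature selection and a held-out 10% evaluation set):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.neural_network import MLPClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 29))            # e.g. the 29 "All Pneumonia" variables
    y = rng.integers(0, 2, size=600)

    models = {
        "Naive Bayes": GaussianNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "SVM": SVC(probability=True),
        "MLP": MLPClassifier(max_iter=1000),
        "Random Forest": RandomForestClassifier(),
    }
    for name, model in models.items():
        auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
        print(f"{name}: AUC={auc:.3f}")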

  13. Table_1_Cytokine TGFβ Gene Polymorphism in Asthma: TGF-Related SNP Analysis...

    • frontiersin.figshare.com
    docx
    Updated Jun 11, 2023
    Cite
    Michał Panek; Konrad Stawiski; Marcin Kaszkowiak; Piotr Kuna (2023). Table_1_Cytokine TGFβ Gene Polymorphism in Asthma: TGF-Related SNP Analysis Enhances the Prediction of Disease Diagnosis (A Case-Control Study With Multivariable Data-Mining Model Development).docx [Dataset]. http://doi.org/10.3389/fimmu.2022.746360.s003
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Frontiers
    Authors
    Michał Panek; Konrad Stawiski; Marcin Kaszkowiak; Piotr Kuna
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: TGF-β and its receptors play a crucial role in asthma pathogenesis and bronchial remodeling in the course of the disease. The TGF-β1, TGF-β2, and TGF-β3 isoforms are responsible for chronic inflammation, bronchial hyperreactivity, myofibroblast activation, fibrosis, bronchial remodeling, and change the expression of approximately 1000 genes in asthma. TGF-β SNPs are associated with an elevated plasma level of TGF-β1, an increased level of total IgE, and an increased risk of bronchial remodeling.

    Methods: The analysis of selected TGF-β1, TGF-β2, and TGF-β3-related single-nucleotide polymorphisms (SNPs) was conducted on 652 DNA samples with an application of the MassARRAY® system using mass spectrometry (MALDI-TOF MS). The dataset was randomly split into training (80%) and validation (20%) sets. For both asthma diagnosis and severity prediction, C5.0 modelling with hyperparameter optimization was conducted on: clinical and SNP data (Clinical+TGF), only clinical data (OnlyClinical), and a minimum-redundancy feature selection set (MRMR). Areas under the ROC curve (AUCROC) were compared using DeLong's test.

    Results: Minor allele carriers (MACs) of SNPs rs2009112 [OR=1.85 (95%CI:1.11-3.1), p=0.016], rs2796821 [OR=1.72 (95%CI:1.1-2.69), p=0.017] and rs2796822 [OR=1.71 (95%CI:1.07-2.71), p=0.022] demonstrated increased odds of severe asthma. The Clinical+TGF model presented better diagnostic potential than the OnlyClinical model in both training (p=0.0009) and validation (AUCROC=0.87 vs. 0.80, p=0.0052). At the same time, the MRMR model was not worse than the Clinical+TGF model (p=0.3607 on the training set, p=0.1590 on the validation set), while it was better than the OnlyClinical model (p=0.0010 on the training set, p=0.0235 on the validation set, AUCROC=0.85 vs. 0.87). On the validation set the Clinical+TGF model allowed for asthma diagnosis prediction with 88.4% sensitivity and 73.8% specificity.

    Discussion: The derived predictive models suggest that the analysis of selected SNPs in TGF-β genes in combination with clinical factors could predict asthma diagnosis with high sensitivity and specificity; however, the benefit of SNP analysis in severity prediction was not shown.

  14. Soil and Landscape Grid National Soil Attribute Maps - Depth of Regolith (3"...

    • researchdata.edu.au
    • data.csiro.au
    datadownload
    Updated Aug 28, 2024
    Cite
    Mike Grundy; Mark Thomas; Ross Searle; John Wilford; Searle, Ross (2024). Soil and Landscape Grid National Soil Attribute Maps - Depth of Regolith (3" resolution) - Release 2 [Dataset]. http://doi.org/10.4225/08/55C9472F05295
    Dataset updated
    Aug 28, 2024
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Mike Grundy; Mark Thomas; Ross Searle; John Wilford; Searle, Ross
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1900 - Dec 31, 2013
    Area covered
    Description

    This is Version 2 of the Depth of Regolith product of the Soil and Landscape Grid of Australia (produced 2015-06-01).

    The Soil and Landscape Grid of Australia has produced a range of digital soil attribute products. The digital soil attribute maps are in raster format at a resolution of 3 arc sec (~90 x 90 m pixels).

    Attribute Definition: The regolith is the in situ and transported material overlying unweathered bedrock.

    • Units: metres
    • Spatial prediction method: data mining using piecewise linear regression
    • Period (temporal coverage; approximately): 1900-2013
    • Spatial resolution: 3 arc seconds (approx. 90 m)
    • Total number of gridded maps for this attribute: 3
    • Number of pixels with coverage per layer: 2007M (49200 * 40800)
    • Total size before compression: about 8 GB
    • Total size after compression: about 4 GB
    • Data license: Creative Commons Attribution 4.0 (CC BY)
    • Variance explained (cross-validation): R^2 = 0.38
    • Target data standard: GlobalSoilMap specifications
    • Format: GeoTIFF

    Lineage: The methodology consisted of the following steps: (i) drillhole data preparation, (ii) compilation and selection of the environmental covariate raster layers and (iii) model implementation and evaluation.

    Drillhole data preparation: Drillhole data was sourced from the National Groundwater Information System (NGIS) database. This spatial database holds nationally consistent information about bores that were drilled as part of the Bore Construction Licensing Framework (http://www.bom.gov.au/water/groundwater/ngis/). The database contains 357,834 bore locations with associated lithology, bore construction and hydrostratigraphy records. This information was loaded into a relational database to facilitate analysis.

    Regolith depth extraction: The first step was to recognise and extract the boundary between the regolith and bedrock within each drillhole record. This was done using a keyword look-up table of bedrock or lithology-related words from the record descriptions. 1,910 unique descriptors were discovered. Using this list of new standardised terms, analysis of the drillholes was conducted, and the depth value associated with the word in the description that unequivocally pointed to reaching fresh bedrock material was extracted from each record using a tool developed in C# code.

    The second step of regolith depth extraction involved removal of drillhole bedrock depth records, which was deemed necessary because of the “noisiness” in depth records resulting from inconsistencies in drilling and description standards identified in the legacy database.

    On completion of the filtering and removal of outliers, the drillhole database used in the model comprised 128,033 depth sites.

    Selection and preparation of environmental covariates: The environmental correlations style of DSM applies environmental covariate datasets to predict target variables, here regolith depth. Strongly performing environmental covariates operate as proxies for the factors that control regolith formation, including climate, relief, parent material, organisms and time.

    Depth modelling was implemented using the PC-based R statistical software (R Core Team, 2014), and relied on the R Cubist package (Kuhn et al. 2013). To generate modelling uncertainty estimates, the following procedures were followed: (i) random withholding of a subset comprising 20% of the whole depth record dataset for external validation; (ii) bootstrap sampling of the remaining dataset 100 times to produce repeated model training datasets. The Cubist model was then run repeatedly to produce a unique rule set for each of these training sets. Repeated model runs using different training sets, a procedure referred to as bagging or bootstrap aggregating, is a machine learning ensemble procedure designed to improve the stability and accuracy of the model. The Cubist rule sets generated were then evaluated and applied spatially, calculating a mean predicted value (i.e. the final map). The 5% and 95% confidence intervals were estimated for each grid cell (pixel) in the prediction dataset by combining the variance from the bootstrapping process and the variance of the model residuals. Version 2 differs from version 1 in that the modelling of depths was performed on the log scale to better conform to assumptions of normality used in calculating the confidence intervals. The method to estimate the confidence intervals was improved to better represent the full range of variability in the modelling process. (Wilford et al., in press)
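
    A sketch of the bagging-with-uncertainty recipe on the log scale, using regression trees as a stand-in for Cubist (which has no scikit-learn equivalent); the interval construction below is a simplification of the variance-combination step described above:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.utils import resample

    def bagged_depth_model(X, y_log, X_new, n_boot=100, seed=0):
        rng = np.random.default_rng(seed)
        preds, residual_var = [], []
        for _ in range(n_boot):
            # bootstrap resample of the training data, one model per resample
            Xb, yb = resample(X, y_log, random_state=int(rng.integers(2**31 - 1)))
            model = DecisionTreeRegressor(min_samples_leaf=20).fit(Xb, yb)
            preds.append(model.predict(X_new))
            residual_var.append(np.var(yb - model.predict(Xb)))
        preds = np.array(preds)                       # (n_boot, n_new)
        mean = preds.mean(axis=0)
        sd = np.sqrt(preds.var(axis=0) + np.mean(residual_var))
        # back-transform from the log scale; approximate 5%/95% bounds
        return np.exp(mean), np.exp(mean - 1.645 * sd), np.exp(mean + 1.645 * sd)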

  15. f

    Dataset 2 accuracy.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    + more versions
    Cite
    Neva J. Bull; Bridget Honan; Neil J. Spratt; Simon Quilty (2023). Dataset 2 accuracy. [Dataset]. http://doi.org/10.1371/journal.pone.0284965.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Neva J. Bull; Bridget Honan; Neil J. Spratt; Simon Quilty
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classifying free text from historical databases into research-compatible formats is a barrier for clinicians undertaking audit and research projects. The aims of this study were to (a) develop an interactive, active machine-learning model-training methodology using readily available software that was (b) easily adaptable to a wide range of natural-language databases and allowed customised researcher-defined categories, and then (c) evaluate the accuracy and speed of this model for classifying free text from two unique and unrelated sets of clinical notes into coded data. A user interface for medical experts to train and evaluate the algorithm was created. The data requiring coding took the form of two independent databases of free-text clinical notes, each with a unique natural-language structure. Medical experts defined categories relevant to research projects and performed 'label-train-evaluate' loops on the training data set. A separate dataset was used for validation, with the medical experts blinded to the label given by the algorithm. The first dataset comprised 32,034 death certificate records from Northern Territory Births, Deaths and Marriages, which were coded into 3 categories: haemorrhagic stroke, ischaemic stroke or no stroke. The second dataset comprised 12,039 recorded episodes of aeromedical retrieval from two prehospital and retrieval services in the Northern Territory, Australia, which were coded into 5 categories: medical, surgical, trauma, obstetric or psychiatric. For the first dataset, the macro-accuracy of the algorithm was 94.7%; for the second dataset, macro-accuracy was 92.4%. The time taken to develop and train the algorithm was 124 minutes for the death certificate coding and 144 minutes for the aeromedical retrieval coding. This machine-learning training method was able to classify free-text clinical notes quickly and accurately from two different health datasets into categories of relevance to clinicians undertaking health service research. A generic sketch of such a label-train-evaluate loop is given below.
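
    The sketch assumes a scikit-learn text pipeline with uncertainty sampling; the example notes, categories, and batch size are invented and this is not the authors' software.

        # Active-learning sketch: train on expert-labelled notes, then pick the
        # least-confident unlabelled notes for the expert to label next.
        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        notes = ["cerebral haemorrhage found at autopsy", "occlusion of middle cerebral artery",
                 "died of myocardial infarction", "no evidence of stroke"] * 50
        labels = np.array(["haemorrhagic", "ischaemic", "no stroke", "no stroke"] * 50)

        labelled = list(range(8))                                    # coded by the expert so far
        pool = [i for i in range(len(notes)) if i not in labelled]   # still uncoded

        for round_ in range(3):                                      # label-train-evaluate rounds
            model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
            model.fit([notes[i] for i in labelled], labels[labelled])
            proba = model.predict_proba([notes[i] for i in pool])
            ask = [pool[j] for j in np.argsort(proba.max(axis=1))[:4]]  # least confident notes
            labelled += ask                                          # expert labels these next
            pool = [i for i in pool if i not in ask]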

  16. f

    Data from: Evidence-Based Prediction of Cellular Toxicity for Amorphous...

    • figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Martin; Reiko Watanabe; Kosuke Hashimoto; Kazuma Higashisaka; Yuya Haga; Yasuo Tsutsumi; Kenji Mizuguchi (2023). Evidence-Based Prediction of Cellular Toxicity for Amorphous Silica Nanoparticles [Dataset]. http://doi.org/10.1021/acsnano.2c11968.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Martin; Reiko Watanabe; Kosuke Hashimoto; Kazuma Higashisaka; Yuya Haga; Yasuo Tsutsumi; Kenji Mizuguchi
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Developing a generalized model for a robust prediction of nanotoxicity is critical for designing safe nanoparticles. However, complex toxicity mechanisms of nanoparticles in biological environments, such as biomolecular corona formation, prevent a reliable nanotoxicity prediction. This is exacerbated by the potential evaluation bias caused by internal validation, which is not fully appreciated. Herein, we propose an evidence-based prediction method for distinguishing between cytotoxic and noncytotoxic nanoparticles at a given condition by uniting literature data mining and machine learning. We illustrate the proposed method for amorphous silica nanoparticles (SiO2-NPs). SiO2-NPs are currently considered a safety concern; however, they are still widely produced and used in various consumer products. We generated the most diverse attributes of SiO2-NP cellular toxicity to date, using >100 publications, and built predictive models, with algorithms ranging from linear to nonlinear (deep neural network, kernel, and tree-based) classifiers. These models were validated using internal (4124-sample) and external (905-sample) data sets. The resultant categorical boosting (CatBoost) model outperformed other algorithms. We then identified 13 key attributes, including concentration, serum, cell, size, time, surface, and assay type, which can explain SiO2-NP toxicity, using the Shapley Additive exPlanation values in the CatBoost model. The serum attribute underscores the importance of nanoparticle–corona complexes for nanotoxicity prediction. We further show that internal validation does not guarantee generalizability. In general, safe SiO2-NPs can be obtained by modifying their surfaces and using low concentrations. Our work provides a strategy for predicting and explaining the toxicity of any type of engineered nanoparticles in real-world practice.
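
    As a hedged, generic illustration of the modelling approach described above (a CatBoost classifier whose attributes are ranked by SHAP values), with entirely synthetic data and assumed attribute names rather than the authors' curated literature dataset:

        # Train a CatBoost classifier on nanoparticle attributes and rank the
        # attributes by mean absolute SHAP value (data and names are synthetic).
        import numpy as np
        import pandas as pd
        import shap
        from catboost import CatBoostClassifier

        rng = np.random.default_rng(1)
        X = pd.DataFrame({
            "concentration_ug_ml": rng.uniform(1, 500, 1000),
            "size_nm": rng.uniform(10, 500, 1000),
            "serum": rng.integers(0, 2, 1000),               # with / without serum proteins
            "exposure_time_h": rng.choice([24, 48, 72], 1000),
        })
        y = ((X["concentration_ug_ml"] > 200) & (X["serum"] == 0)).astype(int)  # toy label

        model = CatBoostClassifier(iterations=200, depth=4, verbose=False).fit(X, y)
        shap_values = shap.TreeExplainer(model).shap_values(X)
        importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
        print(importance.sort_values(ascending=False))       # most influential attributes first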

  17. f

    Table3_Comparative analysis of tissue-specific genes in maize based on...

    • figshare.com
    xlsx
    Updated Jun 2, 2023
    Cite
    Zijie Wang; Yuzhi Zhu; Zhule Liu; Hongfu Li; Xinqiang Tang; Yi Jiang (2023). Table3_Comparative analysis of tissue-specific genes in maize based on machine learning models: CNN performs technically best, LightGBM performs biologically soundest.xlsx [Dataset]. http://doi.org/10.3389/fgene.2023.1190887.s006
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Zijie Wang; Yuzhi Zhu; Zhule Liu; Hongfu Li; Xinqiang Tang; Yi Jiang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: With the advancement of RNA-seq technology and machine learning, training large-scale RNA-seq data from databases with machine learning models can generally identify genes with important regulatory roles that were previously missed by standard linear analytic methodologies. Finding tissue-specific genes could improve our comprehension of the relationship between tissues and genes. However, few machine learning models for transcriptome data have been deployed and compared to identify tissue-specific genes, particularly for plants. Methods: In this study, an expression matrix was processed with linear models (Limma), machine learning models (LightGBM), and deep learning models (CNN), using information gain and the SHAP strategy, based on 1,548 maize multi-tissue RNA-seq samples obtained from a public database, to identify tissue-specific genes. For validation, V-measure values were computed based on k-means clustering of the gene sets to evaluate their technical complementarity. Furthermore, GO analysis and literature retrieval were used to validate the functions and research status of these genes. Results: Based on the clustering validation, the convolutional neural network outperformed the others with the highest V-measure value of 0.647, indicating that its gene set could cover as many specific properties of various tissues as possible, whereas LightGBM discovered key transcription factors. The combination of the three gene sets produced 78 core tissue-specific genes that had previously been shown in the literature to be biologically significant. Discussion: Different tissue-specific gene sets were identified due to the distinct interpretation strategies of the machine learning models, and researchers may use multiple methodologies and strategies for tissue-specific gene sets based on their goals, types of data, and computational resources. This study provided comparative insight for large-scale data mining of transcriptome datasets, shedding light on resolving high dimensionality and bias difficulties in bioinformatics data processing. A generic sketch of the V-measure validation step follows.
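
    The sketch assumes a synthetic expression matrix and tissue labels in place of the 1,548-sample maize data.

        # Cluster samples using only a candidate tissue-specific gene set and
        # score agreement between the clusters and the known tissue labels.
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.metrics import v_measure_score

        rng = np.random.default_rng(2)
        n_tissues, per_tissue, n_genes = 5, 60, 400
        tissue = np.repeat(np.arange(n_tissues), per_tissue)
        expr = rng.normal(size=(n_tissues * per_tissue, n_genes))
        expr[:, :50] += tissue[:, None]                 # first 50 genes carry tissue signal

        def v_measure_for(gene_idx):
            clusters = KMeans(n_clusters=n_tissues, n_init=10, random_state=0)
            return v_measure_score(tissue, clusters.fit_predict(expr[:, gene_idx]))

        print(v_measure_for(np.arange(50)))        # informative gene set -> high V-measure
        print(v_measure_for(np.arange(350, 400)))  # uninformative gene set -> low V-measure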

  18. Exact precision, recall, and F1 score of the BiLSTM-CRF and BlueBERT on each...

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Shang Gao; Olivera Kotevska; Alexandre Sorokine; J. Blair Christian (2023). Exact precision, recall, and F1 score of the BiLSTM-CRF and BlueBERT on each of our target datasets when fine-tuning on different amounts of labeled sentences, with and without semi-supervised self-training. [Dataset]. http://doi.org/10.1371/journal.pone.0246310.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Shang Gao; Olivera Kotevska; Alexandre Sorokine; J. Blair Christian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A fully supervised version is included for comparison. For all sets of training data, 80% of the available data is used for training and the remaining 20% for validation; a generic sketch of this split, together with one round of self-training, follows.
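
    The sketch uses a simple scikit-learn classifier as a stand-in for the BiLSTM-CRF and BlueBERT models; the sentences, labels, and confidence threshold are invented.

        # 80/20 train/validation split plus one round of self-training, where
        # confident predictions on unlabelled text become pseudo-labels.
        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import make_pipeline

        sentences = ["tumor located in upper lobe", "no metastasis observed",
                     "carcinoma of the left breast", "patient denies chest pain"] * 25
        labels = np.array([1, 0, 1, 0] * 25)
        unlabelled = ["mass in lower lobe", "no acute distress noted"] * 25

        X_tr, X_val, y_tr, y_val = train_test_split(sentences, labels, test_size=0.2, random_state=0)

        model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
        proba = model.predict_proba(unlabelled)
        confident = proba.max(axis=1) > 0.9                  # pseudo-label threshold
        X_aug = list(X_tr) + [u for u, c in zip(unlabelled, confident) if c]
        y_aug = np.concatenate([y_tr, model.predict(unlabelled)[confident]])
        model.fit(X_aug, y_aug)                              # retrain on the augmented set
        print("validation accuracy:", model.score(X_val, y_val))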

  19. Data from: Investigating the contributors to hit-and-run crashes using...

    • figshare.com
    xlsx
    Updated Oct 7, 2024
    Cite
    Gen Li (2024). Investigating the contributors to hit-and-run crashes using gradient boosting decision trees [Dataset]. http://doi.org/10.6084/m9.figshare.27178305.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Oct 7, 2024
    Dataset provided by
    figshare
    Authors
    Gen Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper uses the 2021 traffic crash data from the NHTSA CRSS as a sample for model training and validation. The CRSS data collect crash reports provided by police departments from all 50 states in the United States, detailing various factors of each traffic crash, including crash information, driver information, vehicle information, road information, and environmental information. The crash data provided by CRSS include details such as the location, time, cause, type of crash, driver's age, gender, attention level, injury status, risky driving behavior, vehicle type, usage, damage, and hit-and-run status. However, because the data are recorded in separate tables and contain systematic errors and redundant information, the CRSS 2021 data underwent the following merging and filtering processes: 1) separately recorded data were matched and merged on the unique case number "CASENUM"; 2) records with missing values in critical variables (e.g., whether the crash involved a hit-and-run) were removed to avoid bias in the analysis, while for non-critical variables missing values were imputed using the mean or mode depending on the variable type (mean imputation for continuous variables such as speed limits; mode imputation for categorical variables such as weather and road surface conditions); 3) noise in the dataset, arising from both human error in crash reporting and random fluctuations in recorded variables, was handled by using z-scores to detect and remove extreme outliers in numerical variables (e.g., speed limits, crash angle), excluding data points with a z-score beyond ±3 standard deviations, and by applying a symmetrical exponential moving average (EMA) filter to smooth noisy fluctuations in continuous variables. After processing, the CRSS 2021 data include a total of 54,187 crashes, of which 5,944 are hit-and-run crashes, accounting for 10.97% of the total. The hit-and-run and non-hit-and-run categories therefore face a serious class imbalance, and data balancing was applied to the target variable during parameter calibration using the resampling techniques available in the data mining software: random undersampling was applied to the majority class (non-hit-and-run crashes), while the Synthetic Minority Over-sampling Technique (SMOTE) was used for the minority class. This ensured a balanced class distribution in the training set, improving model performance and preventing the classifier from being biased toward the majority class. An outline of this preprocessing is sketched below.
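
    The sketch uses placeholder column names rather than official CRSS field names, synthetic data, and the imbalanced-learn package for the resampling step; the one-sided pandas EMA stands in for the symmetrical filter described above.

        # Mean imputation, z-score outlier removal, EMA smoothing, and class
        # balancing via undersampling of the majority class plus SMOTE.
        import numpy as np
        import pandas as pd
        from scipy import stats
        from imblearn.over_sampling import SMOTE
        from imblearn.under_sampling import RandomUnderSampler

        rng = np.random.default_rng(3)
        df = pd.DataFrame({
            "speed_limit": rng.choice([40.0, 60.0, 80.0, 100.0, np.nan], 5000),
            "crash_angle": rng.normal(90, 30, 5000),
            "hit_and_run": rng.choice([0, 1], 5000, p=[0.89, 0.11]),
        })

        # Mean imputation for a continuous variable (mode would be used for categoricals)
        df["speed_limit"] = df["speed_limit"].fillna(df["speed_limit"].mean())

        # Remove extreme outliers (|z| > 3) in numeric variables
        z = np.abs(stats.zscore(df[["speed_limit", "crash_angle"]]))
        df = df[(z < 3).all(axis=1)]

        # One-sided EMA as a simple stand-in for the symmetrical EMA filter
        df["speed_limit"] = df["speed_limit"].ewm(span=5).mean()

        # Undersample the majority class, then oversample the minority with SMOTE
        X, y = df[["speed_limit", "crash_angle"]], df["hit_and_run"]
        X_bal, y_bal = RandomUnderSampler(sampling_strategy=0.5, random_state=0).fit_resample(X, y)
        X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_bal, y_bal)
        print(y_bal.value_counts())                        # balanced classes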

