16 datasets found
  1. Data from: Time-Split Cross-Validation as a Method for Estimating the...

    • acs.figshare.com
    • figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
    Explore at: txt (available download formats)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
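
    The contrast between random and time-split selection can be illustrated with a short sketch (a minimal illustration, not the paper's code; the DataFrame layout with a per-compound 'date' column, activity 'y', and numeric descriptor columns is an assumption):

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    # df is assumed to hold one row per compound: a registration 'date',
    # an observed activity 'y', and numeric descriptor columns.
    def time_split_r2(df, frac_test=0.25):
        df = df.sort_values("date")                         # oldest compounds first
        n_test = int(len(df) * frac_test)
        train, test = df.iloc[:-n_test], df.iloc[-n_test:]  # newest compounds held out
        x_cols = [c for c in df.columns if c not in ("date", "y")]
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(train[x_cols], train["y"])
        return r2_score(test["y"], model.predict(test[x_cols]))

    def random_split_r2(df, frac_test=0.25):
        train, test = train_test_split(df, test_size=frac_test, random_state=0)
        x_cols = [c for c in df.columns if c not in ("date", "y")]
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(train[x_cols], train["y"])
        return r2_score(test["y"], model.predict(test[x_cols]))
    ```

    On typical QSAR data the time-split estimate is expected to be lower than the random-split estimate, reflecting the optimism of random selection described above.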

  2. Long Covid Risk

    • figshare.com
    txt
    Updated Apr 13, 2024
    Cite
    Ahmed Shaheen (2024). Long Covid Risk [Dataset]. http://doi.org/10.6084/m9.figshare.25599591.v1
    Explore at: txt (available download formats)
    Dataset updated
    Apr 13, 2024
    Dataset provided by
    figshare
    Authors
    Ahmed Shaheen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Feature preparation

    Preprocessing was applied to the data, such as creating dummy variables and performing transformations (centering, scaling, Yeo-Johnson) using the preProcess() function from the “caret” package in R. The correlation among the variables was examined and no serious multicollinearity problems were found. A stepwise variable selection was performed using a logistic regression model. The final set of variables included:

    • Demographic: age, body mass index, sex, ethnicity, smoking
    • History of disease: heart disease, migraine, insomnia, gastrointestinal disease
    • COVID-19 history: COVID-19 vaccination, rashes, conjunctivitis, shortness of breath, chest pain, cough, runny nose, dysgeusia, muscle and joint pain, fatigue, fever, COVID-19 reinfection, and ICU admission

    These variables were used to train and test various machine-learning models.

    Model selection and training

    The data was randomly split into 80% training and 20% testing subsets. The “h2o” package in R version 4.3.1 was employed to implement different algorithms. AutoML was first used, which automatically explored a range of models with different configurations. Gradient Boosting Machines (GBM), Random Forest (RF), and Regularized Generalized Linear Model (GLM) were identified as the best-performing models on our data, and their parameters were fine-tuned. An ensemble method that stacked different models together was also used, as it can sometimes improve accuracy. The models were evaluated using the area under the curve (AUC) and C-statistics as diagnostic measures. The model with the highest AUC was selected for further analysis using the confusion matrix, accuracy, sensitivity, specificity, and F1 and F2 scores. The optimal prediction threshold was determined by plotting the sensitivity, specificity, and accuracy and choosing the point of intersection, as it balanced the trade-off between the three metrics. The model’s predictions were also plotted, and the quantile ranges were used to classify the predictions as follows: > 1st quantile, > 2nd quantile, > 3rd quartile and < 3rd quartile (very low, low, moderate, high), respectively.

    The metrics were defined as follows:

    • C-statistic: (TPR + TNR - 1) / 2
    • Sensitivity/Recall: TP / (TP + FN)
    • Specificity: TN / (TN + FP)
    • Accuracy: (TP + TN) / (TP + TN + FP + FN)
    • F1 score: 2 * (precision * recall) / (precision + recall)

    Model interpretation

    We used the variable importance plot, which is a measure of how much each variable contributes to the predictive power of a machine learning model. In the h2o package, variable importance for GBM and RF is calculated by measuring the decrease in the model's error when a variable is split on. The more a variable's split decreases the error, the more important that variable is considered to be. The error is calculated as SE = MSE * N = VAR * N, and is then scaled between 0 and 1 and plotted. We also used the SHAP summary plot, a graphical tool to visualize the impact of input features on the prediction of a machine learning model. SHAP stands for SHapley Additive exPlanations, a method to calculate the contribution of each feature to the prediction by averaging over all possible subsets of features [28]. The SHAP summary plot shows the distribution of the SHAP values for each feature across the data instances. We use the h2o.shap_summary_plot() function in R to generate the SHAP summary plot for our GBM model. We pass the model object and the test data as arguments, and optionally specify the columns (features) we want to include in the plot. The plot shows the SHAP values for each feature on the x-axis, and the features on the y-axis. The color indicates whether the feature value is low (blue) or high (red). The plot also shows the distribution of the feature values as a density plot on the right.
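
    As a minimal illustration of the evaluation metrics listed above (a sketch of the formulas only, not the authors' code; the confusion-matrix counts would come from a fitted classifier at a chosen threshold):

    ```python
    def classification_metrics(tp, tn, fp, fn):
        """Compute the metrics from the list above from confusion-matrix counts."""
        sensitivity = tp / (tp + fn)        # recall / true positive rate
        specificity = tn / (tn + fp)        # true negative rate
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp)
        f1 = 2 * (precision * sensitivity) / (precision + sensitivity)
        c_statistic = (sensitivity + specificity - 1) / 2   # as defined above
        return {"sensitivity": sensitivity, "specificity": specificity,
                "accuracy": accuracy, "F1": f1, "C-statistic": c_statistic}

    # Example with made-up counts:
    print(classification_metrics(tp=80, tn=90, fp=10, fn=20))
    ```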

  3. S083

    • zenodo.org
    tar
    Updated Oct 6, 2020
    Cite
    Martin Zurowietz; Martin Zurowietz (2020). S083 [Dataset]. http://doi.org/10.5281/zenodo.3600132
    Explore at: tar (available download formats)
    Dataset updated
    Oct 6, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martin Zurowietz; Martin Zurowietz
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A fully annotated subset of the SO242/1_83-1_AUV10 image dataset. The annotations are given as train and test splits that can be used to evaluate machine learning methods. The following classes of fauna were used for annotation:

    • anemone
    • coral
    • crustacean
    • ipnops fish
    • litter
    • ophiuroid
    • other fauna
    • sea cucumber
    • sponge
    • stalked crinoid

    For a definition of the classes see [1].

    Related datasets:

    This dataset contains the following files:

    • annotations/test.csv: The BIIGLE CSV annotation report of the annotations of the test split of this dataset. These annotations are used to test the performance of the trained Mask R-CNN model.
    • annotations/train.csv: The BIIGLE CSV annotation report of the annotations of the train split of this dataset. These annotations are used to generate the annotation patches which are transformed with scale and style transfer to be used to train the Mask R-CNN model.
    • images/: Directory that contains all the original image files.
    • dataset.json: JSON file that contains information about the dataset.
      • name: The name of the dataset.
      • images_dir: Name of the directory that contains the original image files.
      • metadata_file: Path to the CSV file that contains image metadata.
      • test_annotations_file: Path to the CSV file that contains the test annotations.
      • train_annotations_file: Path to the CSV file that contains the train annotations.
      • annotation_patches_dir: Name of the directory that should contain the scale- and style-transferred annotation patches.
      • crop_dimension: Edge length of an annotation or style patch in pixels.
    • metadata.csv: A CSV file that contains metadata for each original image file. In this case the distance of the camera to the sea floor is given for each image.
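
    A minimal sketch of how these files might be read together (assuming pandas is available and that the BIIGLE CSV reports parse with default settings; no column names are assumed):

    ```python
    import json
    import pandas as pd

    # dataset.json points at the metadata and annotation files described above.
    with open("dataset.json") as f:
        info = json.load(f)

    train = pd.read_csv(info["train_annotations_file"])  # BIIGLE report, train split
    test = pd.read_csv(info["test_annotations_file"])    # BIIGLE report, test split
    metadata = pd.read_csv(info["metadata_file"])        # per-image camera-to-seafloor distance

    print(info["name"], "-", len(train), "train and", len(test), "test annotations")
    ```
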
  4. S171

    • zenodo.org
    tar
    Updated Oct 6, 2020
    + more versions
    Cite
    Martin Zurowietz; Martin Zurowietz (2020). S171 [Dataset]. http://doi.org/10.5281/zenodo.3603809
    Explore at: tar (available download formats)
    Dataset updated
    Oct 6, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martin Zurowietz; Martin Zurowietz
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A fully annotated subset of the SO242/2_171-1 image dataset. The annotations are given as train and test splits that can be used to evaluate machine learning methods. The following classes of fauna were used for annotation:

    • anemone
    • coral
    • crustacean
    • ipnops fish
    • litter
    • ophiuroid
    • other fauna
    • sea cucumber
    • sponge
    • stalked crinoid

    For a definition of the classes see [1].

    Related datasets:

    This dataset contains the following files:

    • annotations/test.csv: The BIIGLE CSV annotation report of the annotations of the test split of this dataset. These annotations are used to test the performance of the trained Mask R-CNN model.
    • annotations/train.csv: The BIIGLE CSV annotation report of the annotations of the train split of this dataset. These annotations are used to generate the annotation patches which are transformed with scale and style transfer to be used to train the Mask R-CNN model.
    • images/: Directory that contains all the original image files.
    • dataset.json: JSON file that contains information about the dataset.
      • name: The name of the dataset.
      • images_dir: Name of the directory that contains the original image files.
      • metadata_file: Path to the CSV file that contains image metadata.
      • test_annotations_file: Path to the CSV file that contains the test annotations.
      • train_annotations_file: Path to the CSV file that contains the train annotations.
      • annotation_patches_dir: Name of the directory that should contain the scale- and style-transferred annotation patches.
      • crop_dimension: Edge length of an annotation or style patch in pixels.
    • metadata.csv: A CSV file that contains metadata for each original image file. In this case the distance of the camera to the sea floor is given for each image.
  5. libritts-r-mimi

    • huggingface.co
    Updated Dec 31, 2024
    Cite
    Jacob Keisling (2024). libritts-r-mimi [Dataset]. https://huggingface.co/datasets/jkeisling/libritts-r-mimi
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 31, 2024
    Authors
    Jacob Keisling
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LibriTTS-R Mimi encoding

    This dataset converts all audio in the dev.clean, test.clean, train.100 and train.360 splits of the LibriTTS-R dataset from waveforms to tokens in Kyutai's Mimi neural codec. These tokens are intended as targets for DualAR audio models, but also allow you to simply download all audio in ~50-100x less space, if you're comfortable decoding later on with rustymimi or Transformers. This does NOT contain the original audio, please use the regular LibriTTS-R for… See the full description on the dataset page: https://huggingface.co/datasets/jkeisling/libritts-r-mimi.
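
    A minimal loading sketch (the repository name is taken from the citation above; the split name "dev.clean" is assumed from the description, so check the dataset card for the exact split names and column layout):

    ```python
    from datasets import load_dataset

    # Load one split of the Mimi-token version of LibriTTS-R.
    ds = load_dataset("jkeisling/libritts-r-mimi", split="dev.clean")  # split name assumed
    print(ds)

    # Each row holds Mimi codec tokens rather than waveforms; decode later with
    # rustymimi or the Mimi codec in Transformers if audio is needed.
    ```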

  6. RoBo6 Dataset

    • paperswithcode.com
    Updated Nov 29, 2024
    Cite
    Daniel Kyselica; Marek Šuppa; Jiří Šilha; Roman Ďurikovič (2024). RoBo6 Dataset [Dataset]. https://paperswithcode.com/dataset/robo6
    Explore at:
    Dataset updated
    Nov 29, 2024
    Authors
    Daniel Kyselica; Marek Šuppa; Jiří Šilha; Roman Ďurikovič
    Description

    The dataset contains light curves of 6 rocket body types from the Mini Mega Tortora (MMT) database [1]. It was created to be used as a benchmark for rocket body light curve classification. For more information, see the original paper: RoBo6: Standardized MMT Light Curve Dataset for Rocket Body Classification [2].

    Class labels:

    • ARIANE 5 R/B
    • ATLAS 5 CENTAUR R/B
    • CZ-3B R/B
    • DELTA 4 R/B
    • FALCON 9 R/B
    • H-2A R/B

    Dataset description

    Usage

    ```python
    from datasets import load_dataset

    dataset = load_dataset("kyselica/RoBo6", data_files={"train": "train.csv", "test": "test.csv"})
    dataset
    DatasetDict({
        train: Dataset({
            features: ['label', ' id', ' part', ' period', ' mag', ' phase', ' time'],
            num_rows: 5676
        })
        test: Dataset({
            features: ['label', ' id', ' part', ' period', ' mag', ' phase', ' time'],
            num_rows: 1404
        })
    })
    ```

    • label - class name
    • id - unique identifier of the light curve from MMT
    • part - part number of the light curve
    • period - rotational period of the object
    • mag - relative path to the magnitude values file
    • phase - relative path to the phase values file
    • time - relative path to the time values file

    Mean and standard deviation of magnitudes are stored in mean_std.csv file.

    File structure

    The data directory contains 5 subdirectories, one for each class. Light curves are stored in file triplets, organised as follows:

    MMT Rocket Bodies
    ├── README.md
    ├── train.csv
    ├── test.csv
    ├── mean_std.csv
    ├── data
    │   ├── ARIANE 5 R_B
    │   │   ├──

    Data preprocessing

    To create data suitable for both CNN- and RNN-based models, the light curves were preprocessed in the following way:

    1. Split the light curves if the gap between two consecutive measurements is larger than the object's rotational period.
    2. Split the light curves to have a maximum span of 1,000 seconds.
    3. Filter out light curves whose folded form, divided into 100 bins, has more than 25% of the bins empty.
    4. Resample the light curves to 10,000 points with a step of 0.1 seconds.
    5. Filter out light curves with fewer than 100 measurements.

    Citation

    @article{kyselica2024robo6,
      title={RoBo6: Standardized MMT Light Curve Dataset for Rocket Body Classification},
      author={Kyselica, Daniel and {\v{S}}uppa, Marek and {\v{S}}ilha, Ji{\v{r}}{\'\i} and {\v{D}}urikovi{\v{c}}, Roman},
      journal={arXiv preprint arXiv:2412.00544},
      year={2024}
    }

    References

    1. Karpov, S., et al. "Mini-Mega-TORTORA wide-field monitoring system with sub-second temporal resolution: first year of operation." Revista Mexicana de Astronomía y Astrofísica 48 (2016): 91-96. 

    2. RoBo6: Standardized MMT Light Curve Dataset for Rocket Body Classification 

  7. MERGE Dataset

    • zenodo.org
    zip
    Updated Feb 7, 2025
    + more versions
    Cite
    Pedro Lima Louro; Pedro Lima Louro; Hugo Redinho; Hugo Redinho; Ricardo Santos; Ricardo Santos; Ricardo Malheiro; Ricardo Malheiro; Renato Panda; Renato Panda; Rui Pedro Paiva; Rui Pedro Paiva (2025). MERGE Dataset [Dataset]. http://doi.org/10.5281/zenodo.13939205
    Explore at: zip (available download formats)
    Dataset updated
    Feb 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pedro Lima Louro; Pedro Lima Louro; Hugo Redinho; Hugo Redinho; Ricardo Santos; Ricardo Santos; Ricardo Malheiro; Ricardo Malheiro; Renato Panda; Renato Panda; Rui Pedro Paiva; Rui Pedro Paiva
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The MERGE dataset is a collection of audio, lyrics, and bimodal datasets for conducting research on Music Emotion Recognition. A complete version is provided for each modality. The audio datasets provide 30-second excerpts for each sample, while full lyrics are provided in the relevant datasets. The number of available samples in each dataset is as follows:

    • MERGE Audio Complete: 3554
    • MERGE Audio Balanced: 3232
    • MERGE Lyrics Complete: 2568
    • MERGE Lyrics Balanced: 2400
    • MERGE Bimodal Complete: 2216
    • MERGE Bimodal Balanced: 2000

    Additional Contents

    Each dataset contains the following additional files:

    • av_values: File containing the arousal and valence values for each sample sorted by their identifier;
    • tvt_dataframes: Train, validate, and test splits for each dataset. Both a 70-15-15 and a 40-30-30 split are provided.

    Metadata

    A metadata spreadsheet is provided for each dataset with the following information for each sample, if available:

    • Song (Audio and Lyrics datasets) - Song identifiers. Identifiers starting with MT were extracted from the AllMusic platform, while those starting with A or L were collected from private collections;
    • Quadrant - Label corresponding to one of the four quadrants from Russell's Circumplex Model;
    • AllMusic Id - For samples starting with A or L, the matching AllMusic identifier is also provided. This was used to complement the available information for the samples originally obtained from the platform;
    • Artist - First performing artist or band;
    • Title - Song title;
    • Relevance - AllMusic metric representing the relevance of the song in relation to the query used;
    • Duration - Song length in seconds;
    • Moods - User-generated mood tags extracted from the AllMusic platform and available in Warriner's affective dictionary;
    • MoodsAll - User-generated mood tags extracted from the AllMusic platform;
    • Genres - User-generated genre tags extracted from the AllMusic platform;
    • Themes - User-generated theme tags extracted from the AllMusic platform;
    • Styles - User-generated style tags extracted from the AllMusic platform;
    • AppearancesTrackIDs - All AllMusic identifiers related with a sample;
    • Sample - Availability of the sample in the AllMusic platform;
    • SampleURL - URL to the 30-second excerpt in AllMusic;
    • ActualYear - Year of song release.

    Citation

    If you use some part of the MERGE dataset in your research, please cite the following article:

    Louro, P. L. and Redinho, H. and Santos, R. and Malheiro, R. and Panda, R. and Paiva, R. P. (2024). MERGE - A Bimodal Dataset For Static Music Emotion Recognition. arxiv. URL: https://arxiv.org/abs/2407.06060.

    BibTeX:

    @misc{louro2024mergebimodaldataset,
    title={MERGE -- A Bimodal Dataset for Static Music Emotion Recognition},
    author={Pedro Lima Louro and Hugo Redinho and Ricardo Santos and Ricardo Malheiro and Renato Panda and Rui Pedro Paiva},
    year={2024},
    eprint={2407.06060},
    archivePrefix={arXiv},
    primaryClass={cs.SD},
    url={https://arxiv.org/abs/2407.06060},
    }

    Acknowledgements

    This work is funded by FCT - Foundation for Science and Technology, I.P., within the scope of the projects: MERGE - DOI: 10.54499/PTDC/CCI-COM/3171/2021 financed with national funds (PIDDAC) via the Portuguese State Budget; and project CISUC - UID/CEC/00326/2020 with funds from the European Social Fund, through the Regional Operational Program Centro 2020.

    Renato Panda was supported by Ci2 - FCT UIDP/05567/2020.

  8. rwave-4096 - Retinal Wave Dataset

    • data.niaid.nih.gov
    Updated Apr 10, 2023
    + more versions
    Cite
    Cappell, Benjamin (2023). rwave-4096 - Retinal Wave Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7779498
    Explore at:
    Dataset updated
    Apr 10, 2023
    Dataset authored and provided by
    Cappell, Benjamin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    4096 classes of retinal waves, 2000 images per class.

    4096_split.tar.gz: Images split into train/test/val sets (80%/10%/10%).

    4096_info.zip: Retinal Wave Simulator Parameters per Class.

    Generated using Retinal Wave Simulator https://github.com/BennyCa/Retinal-Wave-Simulator adapted from https://swindale.ecc.ubc.ca/home-page/software/retinal-wave-models/

    Used in project Retinal Waves for Pre-Training Artificial Neural Networks Mimicking Real Prenatal Development: https://github.com/BennyCa/ReWaRD

  9. Training.gov.au - Web service access to sandbox environment

    • researchdata.edu.au
    • cloud.csiss.gmu.edu
    • +3more
    Updated Sep 17, 2014
    Cite
    Department of Employment and Workplace Relations (2014). Training.gov.au - Web service access to sandbox environment [Dataset]. https://researchdata.edu.au/traininggovau-web-service-sandbox-environment/2996152
    Explore at:
    Dataset updated
    Sep 17, 2014
    Dataset provided by
    data.gov.au
    Authors
    Department of Employment and Workplace Relations
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Introduction

    Training.gov.au (TGA) is the National Register of Vocational Education and Training in Australia and contains authoritative information about Registered Training Organisations (RTOs), Nationally Recognised Training (NRT) and the approved scope of each RTO to deliver NRT as required in national and jurisdictional legislation.

    TGA web-services overview

    TGA has a web service available to allow external systems to access and utilise information stored in TGA. The TGA web service is exposed through a single interface, and web service users are assigned a data reader role which applies to all data stored in TGA.

    The web service can be broadly split into three categories:

    1. RTOs and other organisation types;
    2. Training components, including Accredited Courses, Accredited Course Modules, Training Packages, Qualifications, Skill Sets and Units of Competency;
    3. System metadata, including static data and statistical classifications.

    Users gain access to the TGA web service by first passing a user name and password to the web server. The web server then authenticates the user against the TGA security provider before passing the request to the application that supplies the web services.

    There are two web services environments:

    1. Production - ws.training.gov.au - National Register production web services
    2. Sandbox - ws.sandbox.training.gov.au - National Register sandbox web services

    The National Register sandbox web service is used to test against the current version of the web services, where the functionality is identical to the current production release. The web service definition and schema of the National Register sandbox database will also be identical to that of the production release at any given point in time. The National Register sandbox database will be cleared down at regular intervals and realigned with the National Register production environment.

    Each environment has three configured services:

    1. Organisation Service;
    2. Training Component Service; and
    3. Classification Service.

    Sandbox environment access

    To access the download area for web services, navigate to http://tga.hsd.com.au and use the name and password below:

    Username: WebService.Read (case sensitive)

    Password: Asdf098 (case sensitive)

    This download area contains various versions of the following artefacts that you may find useful:

    • Training.gov.au web service specification document;
    • Training.gov.au logical data model and definitions document;
    • .NET web service SDK sample app (with source code);
    • Java sample client (with source code);
    • How to set up a web service client in VS 2010 (video); and
    • Web services WSDLs and XSDs.

    For business areas, the specification/definition documents and the sample application are a good place to start, while IT areas will find the sample source code and the video useful to start developing against the TGA web services.

    The web services sandbox end point is: https://ws.sandbox.training.gov.au/Deewr.Tga.Webservices

    Production web service access

    Once you are ready to access the production web service, please email the TGA team at tgaproject@education.gov.au to obtain a unique user name and password.

  10. Data from: Transfer learning reveals sequence determinants of the...

    • zenodo.org
    application/gzip, zip
    Updated May 29, 2024
    Cite
    Sahin Naqvi; Sahin Naqvi (2024). Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage [Dataset]. http://doi.org/10.5281/zenodo.11224809
    Explore at: application/gzip, zip (available download formats)
    Dataset updated
    May 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sahin Naqvi; Sahin Naqvi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Processed data and code for "Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage," Naqvi et al 2024.

    Directory is organized into 4 subfolders, each tar'ed and gzipped:

    data_analysis.tar.gz - Processed data for modulation of TWIST1 levels and calculation of RE responsiveness to TWIST1 dosage

    • atac_design.txt - design matrix for ATAC-seq TWIST1 titration samples
    • all.sub.150bpclust.greater2.500bp.merge.TWIST1.titr.ATAC.counts.txt - ATAC-seq counts from all samples over all reproducible ATAC-seq peak regions, as defined in Naqvi et al 2023
    • atac_deseq_fitmodels_moded50.R - R code for calculating new version of ED50 and response to full depletion from TWIST1 titration data (note, uses drm.R function from 10.5281/zenodo.7689948, install drc() with this version to avoid errors)

    baseline_models.tar.gz - Code and data for training baseline models to predict RE responsiveness to SOX9/TWIST1 dosage

    • {sox9|twist1}.{0v100|ed50}.{train|valid|test}.txt - Training/testing/validation data (ED50 or full TF depletion effect for SOX9 or TWIST1), split into train/test/validation folds
    • HOCOMOCOv11_core_HUMAN_mono_jaspar_format.all.sub.150bpclust.greater2.500bp.merge.minus300bp.p01.maxscore.mat.cpg.gc.basemean.txt.gz - matrix of predictors for all REs. Quantitative encoding of PWM match for all HOCOMOCO motifs + CpG + GC content, plus unperturbed ATAC-seq signal
    • train_baseline.R - R code to train baseline (LASSO regression or random forest) models using predictor matrix and the provided training data.
      • Note: training the random forest to predict full TF depletion is computationally intensive because it runs across all REs; if doing this on CPU, expect a run time of ~6 hrs.

    chrombpnet_models.tar.gz - Remainder of code, data, and models for fine-tuning and interpreting ChromBPNet models to predict RE responsiveness to SOX9/TWIST1 dosage

    • Fine-tuning code, data, models
      • {all|sox9.direct|twist1.bound.down}.{train|valid|test}.{ed50|0v100.log2fc}.txt - Training/testing/validation data (ED50 or full TF depletion effect for SOX9 or TWIST1), split into train/test/validation folds
      • pretrained.unperturbed.chrombpnet.h5 - Pretrained model of unperturbed ATAC-seq signal in CNCCs, obtained by running ChromBPNet (https://github.com/kundajelab/chrombpnet) on DMSO-treated SOX9/TWIST1-tagged ATAC-seq data
      • finetune_chrombpnet.py - code for fine-tuning the pretrained model for any of the relevant prediction tasks (ED50/ effect of full TF depletion for SOX9/TWIST1)
      • best.model.chrombpnet.{0v100|ed50}.{sox9|twist1}.h5 - output of finetune_chrombpnet.py, best model after 10 training epochs for the indicated task
      • chrombpnet.{0v100|ed50}.{sox9|twist1}.contrib.{h5|bw} - contribution scores for the indicated predictive model, obtained by running chrombpnet contribs_bw on the corresponding model h5 file.
      • chrombpnet.{0v100|ed50}.{sox9|twist1}.contrib.modisco.{h5|bw} - TF-MoDIsCo output from the corresponding contribution score file
    • Interpretation code, data, models
      • contrib_h5_to_projshap_npy.py - code to convert contrib .h5 files into .npy files containing projected SHAP scores (required because the CWM matching code takes this format of contribution scores)
      • sox9.direct.10col.bed, twist1.bound.down.10col.uniq.bed - regions over which CWMs will be matched (likely direct targets of each TF)
      • match_cwms.py - Python code to match individual CWM instances. Takes as input: modisco .h5 file, SHAP .npy file, bed file of regions to be matched. Output is a bed file of all CWM matches (not pruned, contains many redundant matches).
      • chrombpnet.ed50.{sox9|twist1}.contrib.perc05.matchperc10.allmatch.bed - output of match_cwms.py
      • take_max_overlap.py - code to merge output of match_cwms.py into clusters, and then take the maximum (length-normalized) match score in each cluster as the representative CWM match of that cluster. Requires upstream bedtools commands to be piped in, see example usage in file.
      • chrombpnet.ed50.{sox9|twist1}.contrib.perc05.matchperc10.allmatch.maxoverlap.bed - output of take_max_overlap.py. These CWM instances are the ones used throughout the paper.

    modisco_reports.zip - TF-MoDIsCo reports from running on the fine-tuned ChromBPNet models

    • modisco_report_{sox9|twist1}_{0v100|ed50}: folders containing images of discovered CWMs and HTMLs/PDFs of summarized reports from running TF-MoDisCo on the indicated fine-tuned ChromBPNet model

    mirny_model.tar.gz - Code and data for analyzing and fitting Mirny model of TF-nucleosome competition to observed RE dosage response curves

    • twist1.strong.multi.only.ed50.cutoff.true.hill.txt - ED50 and signed hill coefficients for all TWIST1-dependent REs with only buffering Coordinators (mostly one or two) and no other TFs' binding sites. "ed50_new" is the ED50 calculation used in this paper.
    • twist1.strong.weak{1|2|3}.ed50.cutoff.true.hill.txt - ED50 and signed hill coefficients for all TWIST1-dependent REs with only buffering Coordinators (mostly one or two) and the indicated number of sensitizing (weak) Coordinators and no other TFs' binding sites. "ed50_new" is the ED50 calculation used in this paper.
    • MirnyModelAnalysis.py - Python code for analysis of Mirny model of TF-nucleosome competition. Contains implementations of analytic solutions, as well as code to fit model to observed ED50 and hill coefficients in the provided data files.
  11. OrbNet Denali Training Data

    • figshare.com
    application/x-gzip
    Updated Jun 4, 2023
    Cite
    Anders S. Christensen; Sai Krishna Sirumalla; Zhuoran Qiao; Michael B. O'Connor; Daniel G. A. Smith; Feizhi Ding; Peter J. Bygrave; Animashree Anandkumar; Matthew Welborn; Frederick R. Manby; Thomas F. Miller III (2023). OrbNet Denali Training Data [Dataset]. http://doi.org/10.6084/m9.figshare.14883867.v2
    Explore at: application/x-gzip (available download formats)
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Anders S. Christensen; Sai Krishna Sirumalla; Zhuoran Qiao; Michael B. O'Connor; Daniel G. A. Smith; Feizhi Ding; Peter J. Bygrave; Animashree Anandkumar; Matthew Welborn; Frederick R. Manby; Thomas F. Miller III
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OrbNet Denali Training Data

    This repository contains the data for the paper "OrbNet Denali: A machine learning potential for biological and organic chemistry with semi-empirical cost and DFT accuracy". The data set consists of geometries of molecules and the corresponding energy labels calculated at the DFT and semi-empirical levels.

    Citation

    Anders S. Christensen (1,a), Sai Krishna Sirumalla (1,a), Zhuoran Qiao (2), Michael B. O'Connor (1), Daniel G. A. Smith (1), Feizhi Ding (1), Peter J. Bygrave (1), Animashree Anandkumar (3,4), Matthew Welborn (1), Frederick R. Manby (1), and Thomas F. Miller III (1,2), "OrbNet Denali: A machine learning potential for biological and organic chemistry with semi-empirical cost and DFT accuracy" (2021). https://arxiv.org/abs/2107.00299

    a) Indicates equal contribution

    1. Entos, Inc., Los Angeles, CA 90027, USA
    2. Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125, USA
    3. Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA 91125, USA
    4. NVIDIA, Santa Clara, CA 95051, USA

    Contents

    The following files are included:

    • denali_labels.tar.gz - .csv file with energy labels and other metadata (MD5: bc9b612f75373d1d191ce7493eebfd62)
    • denali_xyz_files.tar.gz - Archive with .xyz geometry files (MD5: edd35e95a018836d5f174a3431a751df)

    Geometry data

    The geometries are stored in XYZ+ format, which is compatible with the standard .xyz format but additionally has the multiplicity and charge annotated in the comment (2nd) line. The coordinates are in units of Ångstrøm. For example, a water molecule with a charge of 0 and a spin multiplicity of 1 (i.e. singlet) can be specified in this format as:

    3
    0 1
    O -1.08201 1.07900 -0.02472
    H -0.09268 1.08664 0.01745
    H -1.37137 1.24781 0.90715

    The directory structure of the geometry data contained within denali_xyz_files.tar.gz is as follows:

    xyz_files/
    ├── mol_id1/
    │   ├── sample_id0.xyz
    │   ├── sample_id1.xyz
    │   ├── sample_id2.xyz
    │   ├── sample_id3.xyz
    │   └── sample_id4.xyz
    ├── mol_id2/
    │   ├── sample_id0.xyz
    │   ├── sample_id1.xyz
    │   ├── sample_id2.xyz
    │   └── sample_id3.xyz
    ├── ... etc

    Each mol_id uniquely identifies a molecule, with the various conformer geometries for that molecule stored in the corresponding folder. Those geometries are in turn identified by a unique sample_id. Grouping the geometries by mol_id is used in the OrbNet loss function; see Eqn. 3 in the paper. Note that not all molecules have multiple geometries.

    Training labels

    The training labels (i.e. the wB97X-D3/def2-TZVP and GFN1-xTB energies) and the training and test/validation splits are provided in the file denali_labels.csv in units of Hartree. All molecules are singlet states. The .csv file contains the following columns:

    • sample_id - A unique hash generated from the QM input; also corresponds to the .xyz filename of that geometry
    • subset - The data source for that geometry; please refer to the paper for a detailed description of the various subsets
    • mol_id - Identifier for the parent molecule
    • test_set - True if the geometry is part of the test/validation set of neutral molecules
    • test_set_plus - True if the geometry is part of the test/validation set of charged molecules
    • prelim_1 - True if the geometry is part of the 10% OrbNet Denali training set
    • training_set_plus - True if the geometry is part of the full OrbNet Denali training set
    • charge - The charge of the molecule
    • dft_energy - wB97X-D3/def2-TZVP energy calculated with Qcore 0.8.17, in Hartree
    • xtb1_energy - GFN1-xTB energy calculated with Qcore 0.8.17, in Hartree

    The .csv file can be loaded in Python, for example using pandas.
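
    For example (a sketch assuming denali_labels.csv has been extracted from denali_labels.tar.gz into the working directory, and that the boolean split columns parse as booleans):

    ```python
    import pandas as pd

    labels = pd.read_csv("denali_labels.csv")

    # Select the full OrbNet Denali training set and the neutral test/validation set,
    # using the boolean columns described above.
    train = labels[labels["training_set_plus"]]
    test = labels[labels["test_set"]]

    print(len(train), "training geometries,", len(test), "test geometries")
    # Energies are in Hartree; dft_energy is the wB97X-D3/def2-TZVP label.
    print(train[["sample_id", "mol_id", "charge", "dft_energy", "xtb1_energy"]].head())
    ```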

  12. Downsized camera trap images for automated classification

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Dec 1, 2022
    Cite
    Danielle L Norman; Danielle L Norman; Oliver R Wearne; Oliver R Wearne; Philip M Chapman; Sui P Heon; Robert M Ewers; Philip M Chapman; Sui P Heon; Robert M Ewers (2022). Downsized camera trap images for automated classification [Dataset]. http://doi.org/10.5281/zenodo.6627707
    Explore at: bin, zip (available download formats)
    Dataset updated
    Dec 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Danielle L Norman; Danielle L Norman; Oliver R Wearne; Oliver R Wearne; Philip M Chapman; Sui P Heon; Robert M Ewers; Philip M Chapman; Sui P Heon; Robert M Ewers
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description:

    Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.

    Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions

    Funding: These data were collected as part of research funded by:

    This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.

    XML metadata: GEMINI compliant metadata for this dataset is available here

    Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip

    CT_image_data_info2.xlsx

    This file contains dataset metadata and 1 data table:

    1. Dataset Images (described in worksheet Dataset_images)

      Description: This worksheet details the composition of each dataset used in the analyses

      Number of fields: 69

      Number of data rows: 270287

      Fields:

      • filename: Root ID (Field type: id)
      • camera_trap_site: Site ID for the camera trap location (Field type: location)
      • taxon: Taxon recorded by camera trap (Field type: taxa)
      • dist_level: Level of disturbance at site (Field type: ordered categorical)
      • baseline: Label as to whether image is included in the baseline training, validation (val) or test set, or not included (NA) (Field type: categorical)
      • increased_cap: Label as to whether image is included in the 'increased cap' training, validation (val) or test set, or not included (NA) (Field type: categorical)
      • dist_individ_event_level: Label as to whether image is included in the 'individual disturbance level datasets split at event level' training, validation (val) or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_1: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 1' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 2' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 3' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 4' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 5' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 2 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 3 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 3 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 4 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 3 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_quad_1_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 4 (quad)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_quad_1_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_quad_1_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 4 and 5 (quad)' training set, or not included (NA) (Field type:

  13. Fish Dataset

    • kaggle.com
    Updated May 20, 2021
    Cite
    Alin Cijov (2021). Fish Dataset [Dataset]. https://www.kaggle.com/alincijov/fish-dataset/tasks
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 20, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alin Cijov
    Description

    Dataset

    Camper dataset from https://stats.idre.ucla.edu/r/dae/zip/. The dataset contains data on 250 groups that went to a park. Each group was questioned about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), whether they used live bait, and whether or not they brought a camper to the park (camper). You split the data into train and test datasets.
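
    A minimal sketch of such a split (the file name "fish.csv" is an assumption for whatever the downloaded file is called; an 80/20 split is likewise just an example):

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split

    fish = pd.read_csv("fish.csv")  # columns include count, child, persons, camper

    # Hold out 20% of the 250 groups as a test set; the rest is used for training.
    train, test = train_test_split(fish, test_size=0.2, random_state=42)
    print(len(train), "training rows,", len(test), "test rows")
    ```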

    Acknowledgements

    University of California, Los Angeles (UCLA) Dataset.

  14. RRegrs study for Growth Yield

    • figshare.com
    txt
    Updated Jun 5, 2016
    Cite
    Cristian Robert Munteanu (2016). RRegrs study for Growth Yield [Dataset]. http://doi.org/10.6084/m9.figshare.3409804.v2
    Explore at: txt (available download formats)
    Dataset updated
    Jun 5, 2016
    Dataset provided by
    figshare
    Authors
    Cristian Robert Munteanu
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    RRegrs study for Growth Yield for the original and corrected/filtered datasets: input training and test files, R scripts to split the datasets, and a plot for outlier removal.

  15. Link-prediction on Biomedical Knowledge Graphs

    • zenodo.org
    zip
    Updated Jun 25, 2024
    Cite
    Alberto Cattaneo; Daniel Justus; Stephen Bonner; Stephen Bonner; Thomas Martynec; Thomas Martynec; Alberto Cattaneo; Daniel Justus (2024). Link-prediction on Biomedical Knowledge Graphs [Dataset]. http://doi.org/10.5281/zenodo.12097377
    Explore at: zip (available download formats)
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alberto Cattaneo; Daniel Justus; Stephen Bonner; Stephen Bonner; Thomas Martynec; Thomas Martynec; Alberto Cattaneo; Daniel Justus
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Time period covered
    Jun 25, 2021
    Description

    Release of the experimental data from the paper Towards Linking Graph Topology to Model Performance for Biomedical Knowledge Graph Completion (accepted at Machine Learning for Life and Material Sciences workshop @ ICML2024).

    Knowledge Graph Completion has been increasingly adopted as a useful method for several tasks in biomedical research, like drug repurposing or drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models have been proposed over the years. However, little is known about the properties that render a dataset useful for a given task and, even though theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. We conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions we invite the community to build upon our work and continue improving the understanding of these crucial applications.
    Experiments were conducted on six datasets: five from the biomedical domain (Hetionet, PrimeKG, PharmKG, OpenBioLink2020 HQ, PharMeBINet) and one trivia KG (FB15k-237). All datasets were randomly split into training, validation and test set (80% / 10% / 10%; in the case of PharMeBINet, 99.3% / 0.35% / 0.35% to mitigate the increased inference cost on the larger dataset).
    On each dataset, four different KGE models were compared: TransE, DistMult, RotatE, TripleRE. Hyperparameters were tuned on the validation split, and we release results for tail predictions on the test split. In particular, each test query (h,r,?) is scored against all entities in the KG and we compute the rank of the score of the correct completion (h,r,t), after masking out scores of other (h,r,t') triples contained in the graph.
    Note: the ranks provided are computed as the average between the optimistic and pessimistic ranks of triple scores.
    Inside experimental_data.zip, the following files are provided for each dataset:
    • {dataset}_preprocessing.ipynb: a Jupyter notebook for downloading and preprocessing the dataset. In particular, this generates the custom label->ID mapping for entities and relations, and the numerical tensor of (h_ID,r_ID,t_ID) triples for all edges in the graph, which can be used to compute graph topological metrics (e.g., using kg-topology-toolbox) and compare them with the edge prediction accuracy.
    • test_ranks.csv: csv table with columns ["h", "r", "t"] specifying the head, relation, tail IDs of the test triples, and columns ["DistMult", "TransE", "RotatE", "TripleRE"] with the rank of the ground-truth tail in the ordered list of predictions made by the four models;
    • entity_dict.csv: the list of entity labels, ordered by entity ID (as generated in the preprocessing notebook);
    • relation_dict.csv: the list of relation labels, ordered by relation ID (as generated in the preprocessing notebook).
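
    As a quick example of using test_ranks.csv (a sketch assuming pandas; column names as listed above), mean reciprocal rank and Hits@10 can be computed per model:

    ```python
    import pandas as pd

    ranks = pd.read_csv("test_ranks.csv")  # columns: h, r, t, DistMult, TransE, RotatE, TripleRE

    for model in ["DistMult", "TransE", "RotatE", "TripleRE"]:
        r = ranks[model]
        mrr = (1.0 / r).mean()     # mean reciprocal rank of the ground-truth tail
        hits10 = (r <= 10).mean()  # fraction of test queries with rank <= 10
        print(f"{model}: MRR={mrr:.3f}, Hits@10={hits10:.3f}")
    ```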

    The separate top_100_tail_predictions.zip archive contains, for each of the test queries in the corresponding test_ranks.csv table, the IDs of the top-100 tail predictions made by each of the four KGE models, ordered by decreasing likelihood. The predictions are released in a .npz archive of numpy arrays (one array of shape (n_test_triples, 100) for each of the KGE models).

    All experiments (training and inference) have been run on Graphcore IPU hardware using the BESS-KGE distribution framework.

  16. EternaBrain CNN accuracies on eternamoves-select with different splits of...

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Rohan V. Koodli; Benjamin Keep; Katherine R. Coppess; Fernando Portela; Rhiju Das (2023). EternaBrain CNN accuracies on eternamoves-select with different splits of training and test sets. [Dataset]. http://doi.org/10.1371/journal.pcbi.1007059.t002
    Explore at: xls (available download formats)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rohan V. Koodli; Benjamin Keep; Katherine R. Coppess; Fernando Portela; Rhiju Das
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EternaBrain CNN accuracies on eternamoves-select with different splits of training and test sets.

