29 datasets found
  1. Super.Complex: A supervised machine learning pipeline for molecular complex...

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Feb 4, 2024
    Cite
    Meghana V. Palukuri; Edward M. Marcotte (2024). Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks: Experiment data [Dataset]. http://doi.org/10.5281/zenodo.4814944
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Meghana V. Palukuri; Edward M. Marcotte
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Details of experiments are given in the paper, titled 'Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks.': https://doi.org/10.1371/journal.pone.0262056

    For additional details, please see https://sites.google.com/view/supercomplex/super-complex-v3-0

    Supporting code is available on GitHub at: https://github.com/marcottelab/super.complex

    Details of files provided for each experiment are given below:

    Toy network experiment

    Input data:

    • Toy network, available as a weighted edge list. Format: node1 node2 edge-weight

    • All raw toy communities, available as node lists. Format: node1 node2 node3 .. Each line represents a community.

    Intermediate output results:

    • Training toy communities, available as node lists. Format: node1 node2 node3 .. Each line represents a community.

    • Testing toy communities, available as node lists. Format: node1 node2 node3 .. Each line represents a community.

    • Training toy communities feature matrix, available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative community. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative community, indicated by 1 or 0 respectively)

    • Testing toy communities feature matrix, available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative community. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative community, indicated by 1 or 0 respectively)

    Output results:

    • Trained toy community fitness function, available as pickled files of a data pre-processor and a machine learning model from sklearn, which can be loaded in Python with the pickle module, e.g., import pickle; model = pickle.load(open(filename, 'rb'))

    • Learned toy communities, available as node lists. Format: node1 node2 node3 .. nodeN Score. Each line represents a community. The score is the community fitness function of the community.

    • Learned toy communities, available as edge lists. Format: node1 node2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one community from another community's edges.
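    The following minimal Python sketch shows one way to consume these outputs: loading the pickled pre-processor and sklearn model, and reading a blank-line-separated edge list of learned communities. The file names here are hypothetical placeholders, and the same formats recur in the hu.MAP and yeast experiments below.

    ```python
    import pickle

    import networkx as nx

    # Load the pickled pre-processor and sklearn model
    # (file names are hypothetical; use the paths from the experiment folder).
    with open("toy_preprocessor.pkl", "rb") as f:
        preprocessor = pickle.load(f)
    with open("toy_model.pkl", "rb") as f:
        model = pickle.load(f)

    def read_community_edge_lists(path):
        """Yield one community at a time from a blank-line-separated edge list."""
        edges = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:  # a blank line ends the current community
                    if edges:
                        yield edges
                        edges = []
                    continue
                node1, node2, weight = line.split()
                edges.append((node1, node2, float(weight)))
        if edges:
            yield edges

    for edges in read_community_edge_lists("learned_toy_communities_edges.txt"):
        g = nx.Graph()
        g.add_weighted_edges_from(edges)
        print(g.number_of_nodes(), g.number_of_edges())
    ```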

    hu.MAP experiment:

    Input data:

    • hu.MAP PPI (protein-protein interaction) network, available as a weighted edge list. Format: gene_ID1 gene_ID2 edge-weight

    • All raw human protein complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

    Intermediate output results:

    • Training complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

    • Testing complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

    • Training data, i.e. feature matrix of CORUM complexes (with edge weights from hu.MAP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative protein complex, indicated by 1 or 0 respectively)

    • Testing data, i.e. feature matrix of CORUM complexes (with edge weights from hu.MAP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative protein complex, indicated by 1 or 0 respectively)

    Output results:

    • Trained community fitness function of CORUM complexes (with edge weights from hu.MAP), available as pickled files of a data pre-processor and a machine learning model from sklearn, which can be loaded in Python with the pickle module, e.g., import pickle; model = pickle.load(open(filename, 'rb'))

    • Learned protein complexes from hu.MAP PPI network, available as node lists. Format: Excel file, where the columns are - Learned complex name (named as the most similar CORUM complex, prepended by the Jaccard coefficient similarity), Proteins in learned complex (gene names, i.e. gene_name1 gene_name2 gene_name3 .. gene_nameN), Proteins in learned complex (gene IDs, i.e. gene_ID1 gene_ID2 gene_ID3 .. gene_IDN) and Score (community fitness function of the learned protein complex)

    • Learned protein complexes from hu.MAP PPI network, available as gene ID edge lists. Format: gene_ID1 gene_ID2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one complex from another complex's edges.

    • Learned protein complexes from hu.MAP PPI network, available as gene name edge lists. Format: gene_name1 gene_name2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one complex from another complex's edges.

    Yeast experiments:

    Input data:

    • DIP yeast PPI network, available as a weighted edge list. Format: gene_ID1* gene_ID2 edge-weight

    • Yeast protein complexes from MIPS, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

    • Yeast protein complexes from TAP-MS, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

    Experiment 1: Training on TAP-MS and Testing on MIPS:

    Experiment 1 Intermediate output results:

    • Training data, i.e. feature matrix of TAP-MS complexes (with edge weights from the DIP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative protein complex, indicated by 1 or 0 respectively)

    • Testing data, i.e. feature matrix of MIPS complexes (with edge weights from the DIP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative protein complex, indicated by 1 or 0 respectively)

    Experiment 1 output results:

    • Trained community fitness function of TAP-MS complexes (with edge weights from DIP PPI network), available as pickled files of a data pre-processor and a machine learning model from sklearn, which can be loaded in Python with the pickle module, e.g., import pickle; model = pickle.load(open(filename, 'rb'))

    • Learned protein complexes from DIP PPI network, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. gene_IDN Score. Each line represents a protein complex. The score is the community fitness function of the protein complex.

    • Learned protein complexes from DIP PPI network, available as edge lists. Format: gene_ID1 gene_ID2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one complex from another complex's edges.

    Experiment 2: Training on MIPS and Testing on TAP-MS:

    Experiment 2 Intermediate output results:

    • Training data, i.e. feature matrix of MIPS complexes (with edge weights from the DIP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular

  2. Defeating human IQ Tests with Machine Learning

    • kaggle.com
    Updated Oct 31, 2021
    Cite
    Yam Peleg (2021). Defeating human IQ Tests with Machine Learning [Dataset]. http://doi.org/10.34740/kaggle/dsv/2762635
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 31, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yam Peleg
    Description

    From https://openpsychometrics.org/tests/FSIQ/

    This data comprises the results collected from a test that measures IQ with an overall score as well as scores for component abilities.

    Introduction

    The Intelligence Quotient (IQ) is a measure of human cognitive ability. Scores are set so that the average is 100. There is controversy about how IQ scores should be broken down; this test uses the 3 domains from Hampshire, Highfield, Parkin and Owen (2012): (1) Short-Term Memory, (2) Reasoning, and (3) Verbal. This model seems to fit best on internet populations.

    WARNING: Every on-line IQ test is bad

    While a lot of work has been put into making sure this has good measurement properties, it is not a replacement for a real IQ test. No on-line test is. A main reason is that no one in the on-line context has the attention span for a reliable assessment. The Wechsler Adult Intelligence Scale (a reasonable candidate for the title of 'gold standard IQ test') takes more than an hour [1]. This test was designed with a cap of 20 minutes, because we like engagement, and engagement drops off fast after that. This should be understood as a demonstration of how an IQ test can work, rather than a score you should make decisions based on.

    Procedure

    This test has six sections. Each section has its own instructions. This test is meant to be taken solo, without references or materials. It will only be valid the first time the taker takes it, so takers who want an accurate result must not start the test until they are ready. The average person takes between 10 and 15 minutes to finish the test, and this is important.

    Dataset

    This experimental IQ test has 25 items, each a matrix with one tile missing and eight possibilities for that tile.

    The items are in the /questions/ folder, in subfolders **1...25**, which correspond to **Q1...Q25** in the data file. In the folder for each question you will find the files:

    • q.png - the incomplete matrix
    • 1.png ... 7.png - wrong answers
    • a.png - the right answer

    In the data file, if Q1 is 4, that means they chose 4.png as their answer. If it is 10, that means they chose a.png as their answer.

    At the end they were also directed to indicate:

    • **gender** - chosen from a drop-down list where 1=male, 2=female, 3=other
    • **age** - entered as free response (ages < 18 removed)

    NOTES:
    1. The possible answers were presented in two rows of four with a random order for each participant.
    2. The collection of this data was of mediocre quality.
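    As a hedged illustration of this encoding (the file name and derived score column are our assumptions, not part of the dataset documentation), the number of correct answers per respondent can be computed with pandas, since a response of 10 marks the correct tile a.png:

    ```python
    import pandas as pd

    # Hypothetical file name; the data file holds columns Q1..Q25, where
    # 10 encodes the right answer (a.png) and 1..7 encode wrong tiles.
    df = pd.read_csv("data.csv")

    question_cols = [f"Q{i}" for i in range(1, 26)]
    df["raw_score"] = (df[question_cols] == 10).sum(axis=1)  # correct answers per row
    print(df["raw_score"].describe())
    ```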

    So, can we train a model to defeat this test like a human?

  3. Z

    Data from: Dataset for train and test BRITTANY (Biometric RecognITion...

    • data.niaid.nih.gov
    • portalcienciaytecnologia.jcyl.es
    Updated Jan 7, 2022
    Cite
    Guerrero-Higueras, Ángel Manuel (2022). Dataset for train and test BRITTANY (Biometric RecognITion Through gAit aNalYsis) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5825884
    Explore at:
    Dataset updated
    Jan 7, 2022
    Dataset provided by
    Álvarez-Aparicio, Claudia
    Guerrero-Higueras, Ángel Manuel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset can be used to train and test the BRITTANY tool. The information contained in the dataset is especially suitable for use as training and test data for neural network-based classifiers.

    This dataset contains 198 Rosbag files, each of 5 seconds duration, recorded in different locations (kitchen, livingroom-window and livingroom-door) with the Orbi-One robot standing still. Two sorts of Rosbag files have been recorded. In 90 Rosbag files (train*.bag), the recorded data correspond to a person walking in a straight line in front of the robot. Data from five different people have been recorded. For each location and person, six Rosbag files have been recorded.

    In 108 Rosbag files (test*.bag), the recorded data correspond to a person walking in a straight line in front of the robot. Data from six different people have been recorded. Five of those six people are the same as in the other rosbags; the sixth is not registered in the system, in order to evaluate false-positive cases.
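    A minimal sketch for inspecting one of these recordings, assuming a ROS1 environment with the rosbag Python package available (the file name is a placeholder):

    ```python
    import rosbag  # requires a ROS1 installation

    # Iterate over all messages in one 5-second recording.
    with rosbag.Bag("train_example.bag") as bag:
        for topic, msg, t in bag.read_messages():
            print(topic, t.to_sec())
    ```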

  4. Model performance results based on random forest, gradient boosting,...

    • figshare.com
    xls
    Updated Mar 28, 2024
    Cite
    Junying Wang; David D. Wu; Christine DeLorenzo; Jie Yang (2024). Model performance results based on random forest, gradient boosting, penalized logistic regression, XGBoost, SVM, neural network, and stacking for EMBARC data as training set and APAT data as testing set after multiple imputation for 10 times. [Dataset]. http://doi.org/10.1371/journal.pone.0299625.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Mar 28, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Junying Wang; David D. Wu; Christine DeLorenzo; Jie Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Model performance results based on random forest, gradient boosting, penalized logistic regression, XGBoost, SVM, neural network, and stacking for EMBARC data as training set and APAT data as testing set after multiple imputation for 10 times.

  5. Twitter cascade dataset

    • researchdata.smu.edu.sg
    • smu.edu.sg
    pdf
    Updated May 31, 2023
    Cite
    Living Analytics Research Centre (2023). Twitter cascade dataset [Dataset]. http://doi.org/10.25440/smu.12062709.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    Living Analytics Research Centre
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Description

    This dataset comprises a set of information cascades generated by Singapore Twitter users. Here a cascade is defined as a set of tweets about the same topic. This dataset was collected via the Twitter REST and streaming APIs in the following way. Starting from popular seed users (i.e., users having many followers), we crawled their follow, retweet, and user mention links. We then added those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. With this, we have a total of 184,794 Twitter user accounts. Tweets were then crawled from these users from 1 April to 31 August 2012. In all, we got 32,479,134 tweets.

    To identify cascades, we extracted all the URL links and hashtags from the above tweets. These URL links and hashtags are considered the identities of cascades. In other words, all the tweets which contain the same URL link (or the same hashtag) represent a cascade. Mathematically, a cascade is represented as a set of user-timestamp pairs. Figure 1 provides an example, i.e. cascade C = {< u1, t1 >, < u2, t2 >, < u1, t3 >, < u3, t4 >, < u4, t5 >}.

    For evaluation, the dataset was split into two parts: four months of data for training and the last month of data for testing. Table 1 summarizes the basic (count) statistics of the dataset. Each line in each file represents a cascade. The first term in each line is a hashtag or URL; the second term is a list of user-timestamp pairs. Due to privacy concerns, all user identities are anonymized.
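    A hedged parsing sketch for the cascade files; the exact delimiters are assumptions (a tab between the cascade identity and the pair list, and user:timestamp pairs separated by spaces), so adjust to the actual files:

    ```python
    def read_cascades(path):
        """Map each cascade identity (hashtag or URL) to its user-timestamp pairs."""
        cascades = {}
        with open(path) as f:
            for line in f:
                identity, pair_str = line.rstrip("\n").split("\t", 1)
                pairs = []
                for token in pair_str.split():
                    user, timestamp = token.split(":")
                    pairs.append((user, timestamp))
                cascades[identity] = pairs
        return cascades
    ```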

  6. Data from: Improving antibody language models with native pairing

    • zenodo.org
    application/gzip
    Updated Jun 12, 2025
    Cite
    Sarah Burbach; Bryan Briney (2025). Improving antibody language models with native pairing [Dataset]. http://doi.org/10.5281/zenodo.8237396
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sarah Burbach; Bryan Briney
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Motivation. Existing large language models designed to predict antibody structure and function have been trained exclusively with unpaired antibody sequences. This is a substantial drawback, as each antibody represents a unique pairing of heavy and light chains that both contribute to antigen recognition. The cost of generating large datasets of natively paired antibody sequences is orders of magnitude higher than the cost of unpaired sequences, and the paucity of available paired antibody sequence datasets precludes training a state-of-the-art language model using only paired training data. Here, we sought to determine whether and to what extent natively paired training data improves model performance.

    Results. Using a unique and recently reported dataset of approximately 1.6 × 10^6 natively paired human antibody sequences, we trained two baseline antibody language model (BALM) variants: BALM-paired and BALM-unpaired. We quantify the superiority of BALM-paired over BALM-unpaired, and we show that BALM-paired's improved performance can be attributed at least in part to its ability to learn cross-chain features that span natively paired heavy and light chains. Additionally, we fine-tuned the general protein language model ESM-2 using these paired antibody sequences and report that the fine-tuned model, but not base ESM-2, demonstrates a similar understanding of cross-chain features.

    Files. The following files are included in this repository:

    • BALM-paired.tar.gz: Model weights for the BALM-paired model.
    • BALM-unpaired.tar.gz: Model weights for the BALM-unpaired model.
    • ESM2-650M_paired-fine-tuned.tar.gz: Model weights for the 650M-parameter ESM-2 model after fine-tuning with natively paired antibody sequences.
    • jaffe-paired-dataset_airr-annotation.tar.gz: All natively paired antibody sequences from the Jaffe dataset were annotated with abstar and subsequently filtered to remove duplicates or unproductive sequences. The annotated sequences are provided in an AIRR-compliant format.
    • train-test-eval_paired.tar.gz: Datasets used to train, test, and evaluate the BALM-paired model. Compressed folder containing three files: train.txt, test.txt, and eval.txt. Each file has one input sequence per line. This dataset was also used to fine-tune the 650M-parameter ESM-2 variant.
    • train-test-eval_unpaired.tar.gz: Datasets used to train, test, and evaluate the BALM-unpaired model. Compressed folder containing three files: train.txt, test.txt, and eval.txt. Each file has one input sequence per line.

    Code: All code used for model training, testing, and figure generation is available under the MIT license on GitHub.
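    A hedged loading sketch, assuming the weight archives unpack into standard Hugging Face transformers checkpoint directories; verify against the GitHub repository's instructions before relying on this:

    ```python
    import tarfile

    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # Unpack the released BALM-paired weights; the resulting directory
    # layout is an assumption about the archive contents.
    with tarfile.open("BALM-paired.tar.gz") as tar:
        tar.extractall("BALM-paired")

    tokenizer = AutoTokenizer.from_pretrained("BALM-paired")
    model = AutoModelForMaskedLM.from_pretrained("BALM-paired")
    ```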

  7. Effects of Non-Symbolic Approximate Number Practice on Symbolic Numerical...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated May 31, 2023
    Cite
    Saeeda Khanum; Rubina Hanif; Elizabeth S. Spelke; Ilaria Berteletti; Daniel C. Hyde (2023). Effects of Non-Symbolic Approximate Number Practice on Symbolic Numerical Abilities in Pakistani Children [Dataset]. http://doi.org/10.1371/journal.pone.0164436
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Saeeda Khanum; Rubina Hanif; Elizabeth S. Spelke; Ilaria Berteletti; Daniel C. Hyde
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Pakistan
    Description

    Current theories of numerical cognition posit that uniquely human symbolic number abilities connect to an early developing cognitive system for representing approximate numerical magnitudes, the approximate number system (ANS). In support of this proposal, recent laboratory-based training experiments with U.S. children show enhanced performance on symbolic addition after brief practice comparing or adding arrays of dots without counting: tasks that engage the ANS. Here we explore the nature and generality of this effect through two brief training experiments. In Experiment 1, elementary school children in Pakistan practiced either a non-symbolic numerical addition task or a line-length addition task with no numerical content, and then were tested on symbolic addition. After training, children in the numerical training group completed the symbolic addition test faster than children in the line length training group, suggesting a causal role of brief, non-symbolic numerical training on exact, symbolic addition. These findings replicate and extend the core findings of a recent U.S. laboratory-based study to non-Western children tested in a school setting, attesting to the robustness and generalizability of the observed training effects. Experiment 2 tested whether ANS training would also enhance the consistency of performance on a symbolic number line task. Over several analyses of the data there was some evidence that approximate number training enhanced symbolic number line placements relative to control conditions. Together, the findings suggest that engagement of the ANS through brief training procedures enhances children's immediate attention to number and engagement with symbolic number tasks.

  8. Comedians transcript

    • kaggle.com
    Updated Apr 7, 2024
    Cite
    Abhinandan Sharma19 (2024). Comedians transcript [Dataset]. https://www.kaggle.com/datasets/abhinandansharma19/comedians-transcript
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 7, 2024
    Dataset provided by
    Kaggle
    Authors
    Abhinandan Sharma19
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The Comedy Scraps Corpus is a comprehensive collection of stand-up comedy scripts meticulously compiled from performances held at "The Loft" comedy club from 2020 to 2022. This corpus serves as a treasure trove for natural language processing (NLP) enthusiasts, researchers, and comedy fans alike, offering a unique lens into the evolution of comedic content over time.

    Spanning two years, the Comedy Scraps Corpus encapsulates the diverse comedic styles, themes, and linguistic nuances employed by many talented comedians who graced the stage at The Loft. From witty one-liners to elaborate storytelling, this corpus captures the essence of comedic expression in its rawest form.

    One of the distinctive features of the Comedy Scraps Corpus is its chronological arrangement, allowing researchers and analysts to trace the evolution of comedic trends, topics, and language usage over the designated period. By examining scripts from different years, users can discern the subtle shifts in comedic sensibilities, audience preferences, and cultural influences that have shaped the comedy landscape during this time frame.

    The corpus comprises various comedic material, ranging from observational humor to political satire, from surrealistic narratives to cultural commentary. Each script is meticulously annotated, providing valuable metadata such as performer name, performance date, and audience response, facilitating in-depth analyses and comparative studies.

    Moreover, the Comedy Scraps Corpus is a valuable resource for training and testing NLP models, enabling researchers to develop sophisticated algorithms for tasks such as joke generation, sentiment analysis, and humor recognition. By leveraging the rich and varied content within the corpus, developers can explore innovative approaches to computational humor and language understanding.

    In summary, the Comedy Scraps Corpus is a testament to stand-up comedy's vibrancy and diversity, offering a comprehensive glimpse into the evolution of comedic discourse from 2020 to 2022. Whether for academic research, algorithmic development, or simply for the joy of comedic exploration, this corpus provides an invaluable resource for anyone interested in the intersection of language, humor, and culture.

  9. polyOne Data Set - 100 million hypothetical polymers including 29 properties...

    • zenodo.org
    • data.niaid.nih.gov
    bin, txt
    Updated Mar 24, 2023
    Cite
    Christopher Kuenneth; Rampi Ramprasad (2023). polyOne Data Set - 100 million hypothetical polymers including 29 properties [Dataset]. http://doi.org/10.5281/zenodo.7124188
    Explore at:
    Available download formats: bin, txt
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christopher Kuenneth; Rampi Ramprasad
    Description

    polyOne Data Set

    The data set contains 100 million hypothetical polymers, each with 29 properties predicted using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.

    Full data set including the properties

    The data files are in Apache Parquet format. The files start with `polyOne_*.parquet`.

    I recommend using dask (`pip install dask`) to load and process the data set. Pandas also works but is slower.

    Load sharded data set with dask
    ```python
    import dask.dataframe as dd
    ddf = dd.read_parquet("*.parquet", engine="pyarrow")
    ```

    For example, compute the description of the data set:
    ```python
    df_describe = ddf.describe().compute()
    df_describe
    ```

    PSMILES strings only

    • generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
    • generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
  10. Data and Results for paper End-to-end on-line rescheduling from Gantt chart...

    • narcis.nl
    • data.mendeley.com
    Updated Oct 27, 2021
    Cite
    Palombarini, J (via Mendeley Data) (2021). Data and Results for paper End-to-end on-line rescheduling from Gantt chart images using Deep Reinforcement Learning [Dataset]. http://doi.org/10.17632/x9vdrdwyfh.1
    Explore at:
    Dataset updated
    Oct 27, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Palombarini, J (via Mendeley Data)
    Description

    This dataset contains the initial schedules used to perform the training and testing processes for the reference paper. In addition, archives summarizing the results are available.

  11. Data Prediction using ML

    • kaggle.com
    Updated Aug 30, 2020
    Cite
    Karthik Garimella (2020). Data Prediction using ML [Dataset]. https://www.kaggle.com/karthikgarimella/data-prediction-using-ml/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 30, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Karthik Garimella
    Description

    Content

    The dataset comprises the following columns:

    "code", "clientType", "registrationMode", "planName", "accident", "duration", "country", "netSales", "netProfit", "gender", "age"

    Accident is the column to be predicted.

    The test set is available in the file test_data.csv.

    Constraints

    Output should match the number of test cases available in test_data.csv

    Output Format

    The output prediction should be a probability of class "1" (a value between 0 and 1), newline-delimited, one for each row in the test data set.

    No column names need to be entered.
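    A hedged end-to-end sketch of producing a submission in this format; the training file name, the numeric feature subset, and the model choice are assumptions, and categorical columns would need encoding in practice:

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    train = pd.read_csv("train_data.csv")  # hypothetical training file name
    test = pd.read_csv("test_data.csv")

    features = ["duration", "netSales", "netProfit", "age"]  # numeric columns only
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(train[features], train["accident"])

    # One probability of class "1" per line, no header, newline-delimited.
    probs = clf.predict_proba(test[features])[:, 1]
    pd.Series(probs).to_csv("submission.txt", index=False, header=False)
    ```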

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

  12. Data from: A Computational Theory for the Emergence of Grammatical...

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    Updated Jun 9, 2021
    Cite
    (2021). A Computational Theory for the Emergence of Grammatical Categories in Cortical Dynamics [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/939
    Explore at:
    Dataset updated
    Jun 9, 2021
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The file Corpora.txt keeps the corpus used to train the model and the different instances of the classifier. It is basically a text file with one sentence per line, derived from the original corpus file test.tsv available at https://github.com/google-research-datasets/wiki-split.git. We eliminated punctuation marks and special characters from the original file and put each sentence on its own line.

    Enju_Output.txt holds the outputs generated by Enju in -so mode (output in stand-off format) using Corpora.txt as input. This file contains a per-sentence parse of natural-language English produced with a wide-coverage probabilistic HPSG grammar.

    The file Supervision.txt keeps the grammatical tags of the corpus. This file holds a tag per word and each tag is situated in a single line. Sentences are separated by one empty line while tags from words in the same sentence are located in adjacent lines.

    The file Word_Category.txt carries the coarse-grained word category information needed by the model and introduced in it by apical dendrites. Each word in the corpus has a word-category tag which provides additional constraints to those provided by lateral dendrites. This file contains a tag per word and each tag is situated in a single line. Sentences are separated by one empty line while tags from words in the same sentence are located in adjacent lines.
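    A minimal sketch for reading these one-tag-per-line files (Supervision.txt or Word_Category.txt), where an empty line separates sentences:

    ```python
    def read_tag_file(path):
        """Return a list of sentences, each a list of per-word tags."""
        sentences, current = [], []
        with open(path) as f:
            for line in f:
                tag = line.strip()
                if tag:
                    current.append(tag)
                elif current:  # empty line: sentence boundary
                    sentences.append(current)
                    current = []
        if current:
            sentences.append(current)
        return sentences

    tags = read_tag_file("Supervision.txt")
    ```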

    The file SynSemTests.xlsx keeps all the grammar classification results as well as the statistical analysis in the classification tests.

  13. ReMEA paper Supplementary data by Higgins et al.

    • figshare.com
    bin
    Updated May 12, 2025
    Cite
    Pedro Rodriguez Cutillas; Luke Higgins (2025). ReMEA paper Supplementary data by Higgins et al. [Dataset]. http://doi.org/10.6084/m9.figshare.29040821.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    May 12, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Pedro Rodriguez Cutillas; Luke Higgins
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Usage Notes

    • Data is provided under CC BY-NC, which permits reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator.
    • Please cite this dataset using the Figshare DOI if used in any publications or derivative work.

    Contents

    Supplementary data 1: Cell line ReMEA scores
    Supplementary data 2: Fold changes for the LSD1i proteomics
    Supplementary data 3: ReMEA scores for the LSD1i proteomics
    Supplementary data 4: Average ReMEA scores for primary AML cells
    Supplementary data 5: Fold changes from chemotherapy phospho
    Supplementary data 6: ReMEA score for chemotherapy phospho

    Description

    This dataset contains supporting data for the publication titled "Response Marker Enrichment Analysis (ReMEA) predicts the antiproliferative impact of pharmacological and genetic perturbagens on cancer cells" submitted to Nucleic Acids Research. The data include the collection of ReMEA scores computed with the four large cell line proteomics datasets referenced in the paper. Scores were made using the cell line signatures whose construction is described in the publication. All files are provided in [format(s), e.g., CSV, TXT, PDF, .R, .py, .zip].

    File Inventory

    Below is a description of each of the 9 files included in dataset 1:

    1. STAT_matched_DRUG_scores.csv - Single Train And Test Set scores for drug perturbations. Signatures made from and tested with the same proteomics dataset.
    2. STAT_matched_RNAi_scores.csv - Single Train And Test Set scores for RNAi perturbations. Signatures made from and tested with the same proteomics dataset.
    3. STAT_matched_CRISPR_scores.csv - Single Train And Test Set scores for CRISPR perturbations. Signatures made from and tested with the same proteomics dataset.
    4. LODO_AVG_DRUG_scores.csv - Leave One Dataset Out scores for drug perturbations.
    5. LODO_AVG_RNAi_scores.csv - Leave One Dataset Out scores for RNAi perturbations.
    6. LODO_AVG_CRISPR_scores.csv - Leave One Dataset Out scores for CRISPR perturbations.
    7. k_fold_goncalves_DRUG_remea_scores.csv - K-fold ReMEA drug scores using the Goncalves et al. 2022 proteomics dataset, as described in the publication.
    8. k_fold_goncalves_RNAi_remea_scores.csv - K-fold ReMEA RNAi scores using the Goncalves et al. 2022 proteomics dataset, as described in the publication.
    9. k_fold_goncalves_CRISPR_remea_scores.csv - K-fold ReMEA CRISPR scores using the Goncalves et al. 2022 proteomics dataset, as described in the publication.

  14. Data from: WiFi CSI-Based Long-Range Through-Wall Human Activity Recognition...

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Sep 27, 2023
    Cite
    Julian Strohmayer; Martin Kampel (2023). WiFi CSI-Based Long-Range Through-Wall Human Activity Recognition with the ESP32 [Dataset]. http://doi.org/10.5281/zenodo.8021099
    Explore at:
    Dataset updated
    Sep 27, 2023
    Authors
    Julian Strohmayer; Martin Kampel
    Description

    WiFi CSI-Based Long-Range Through-Wall Human Activity Recognition with the ESP32

    This repository contains the WiFi CSI human presence detection and activity recognition datasets proposed in [1].

    Datasets

    • DP_LOS - Line-of-sight (LOS) presence detection dataset, comprised of 392 CSI amplitude spectrograms.
    • DP_NLOS - Non-line-of-sight (NLOS) presence detection dataset, comprised of 384 CSI amplitude spectrograms.
    • DA_LOS - LOS activity recognition dataset, comprised of 392 CSI amplitude spectrograms.
    • DA_NLOS - NLOS activity recognition dataset, comprised of 384 CSI amplitude spectrograms.

    Table 1: Characteristics of presence detection and activity recognition datasets.

    Dataset | Scenario | #Rooms | #Persons | #Classes | Packet Sending Rate | Interval | #Spectrograms
    DP_LOS | LOS | 1 | 1 | 6 | 100Hz | 4s (400 packets) | 392
    DP_NLOS | NLOS | 5 | 1 | 6 | 100Hz | 4s (400 packets) | 384
    DA_LOS | LOS | 1 | 1 | 3 | 100Hz | 4s (400 packets) | 392
    DA_NLOS | NLOS | 5 | 1 | 3 | 100Hz | 4s (400 packets) | 384

    Data Format

    Each dataset employs an 8:1:1 training-validation-test split, defined in the provided label files trainLabels.csv, validationLabels.csv, and testLabels.csv. Label files use the sample format [i c], with i corresponding to the spectrogram index (i.png) and c corresponding to the class. For presence detection datasets (DP_LOS, DP_NLOS), c in {0 = "no presence", 1 = "presence in room 1", ..., 5 = "presence in room 5"}. For activity recognition datasets (DA_LOS, DA_NLOS), c in {0 = "no activity", 1 = "walking", 2 = "walking + arm-waving"}. Furthermore, the mean and standard deviation of a given dataset are provided in meanStd.csv.

    Download and Use

    This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

    [1] Strohmayer, Julian, and Martin Kampel. "WiFi CSI-Based Long-Range Through-Wall Human Activity Recognition with the ESP32." International Conference on Computer Vision Systems. Cham: Springer Nature Switzerland, 2023.

    BibTeX citation:

    @inproceedings{strohmayer2023wifi,
      title={WiFi CSI-Based Long-Range Through-Wall Human Activity Recognition with the ESP32},
      author={Strohmayer, Julian and Kampel, Martin},
      booktitle={International Conference on Computer Vision Systems},
      pages={41--50},
      year={2023},
      organization={Springer}
    }
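    A hedged sketch for reading the label files described above, assuming one index-class pair per line (comma- or space-separated):

    ```python
    def read_labels(path):
        """Map spectrogram index i (i.png) to class c."""
        labels = {}
        with open(path) as f:
            for line in f:
                parts = line.replace(",", " ").split()
                if len(parts) == 2:
                    labels[int(parts[0])] = int(parts[1])
        return labels

    train_labels = read_labels("trainLabels.csv")
    ```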

  15. ABC2018 dataset

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Toru Tamaki; Ken Yoda (2023). ABC2018 dataset [Dataset]. http://doi.org/10.6084/m9.figshare.7941134.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Toru Tamaki; Ken Yoda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset for the competition ABC2018. The competition website is: https://competitions.codalab.org/competitions/16283

    File

    A zip file contains all the training and test trajectories, and a ground truth label file for the training set. When you unzip the dataset, you find:

    ./test/***.csv
    ./train/***.csv
    ./train_labels.csv

    where *** is the trajectory number (000, 001, ..., 630 for train; 000, 001, ..., 274 for test).

    Task

    Classifying GPS trajectories of birds into male or female.

    Trajectory file format

    A single CSV file (000.csv, 001.csv, ...) contains the trajectory of a trip, and each line represents the information of a GPS location of a shearwater. In addition to longitude and latitude, some other information is provided: elapsed time and local clock time, and solar azimuth and elevation angles.

    • float: longitude
    • float: latitude
    • float: sun azimuth [degree] clockwise from the North
    • float: sun elevation [degree] upward from the horizon
    • int: (1) daytime (between sunrise and sunset), or (0) nighttime
    • int: elapsed time [second] after starting the trip
    • clock: local time (hh:mm:ss)
    • int: days (starts from 0, and increments by 1 when the local time passes 23:59:59)

    Float values are of the format %.5f, and fields are separated by a single comma. Here is an example:

    139.29220,38.56632,76.42170,-4.45122,0,0,04:54:03,0
    139.29300,38.56763,76.58196,-4.25726,0,60,04:55:03,0
    139.29400,38.57053,76.73674,-4.06880,0,118,04:56:01,0
    139.29620,38.57563,76.89729,-3.87201,0,178,04:57:01,0
    ...

    • Different trajectories have different numbers of GPS locations.
    • The time interval between two successive GPS locations is approximately one minute (60 seconds) when GPS works well; otherwise the interval may vary from one to several minutes, even hours and days.
    • Trajectories in the training and test sets are in the same format.
    • Ground truth labels for the training set are given in a separate file.

    Labels: gender, or male/female

    A single txt file of ground truth labels of the training set is provided. Each line has the label of the corresponding training trajectory; that is, line 0 is the label of the training trajectory file 000.csv. The label is binary (character):

    • male: 0
    • female: 1

    Here is an example:

    1
    1
    1
    0
    1
    0
    0
    ...

    Stats: numbers of the dataset

    Training set:
    • 326 male trajectories
    • 305 female trajectories
    • 631 in total

    Test set:
    • 275 trajectories

    Disclaimer

    The procedures used in the field study for collecting the data were approved by the Animal Experimental Committee of Nagoya University.

    License of the dataset

    The dataset was collected by scientific teams for scientific purposes. If you use the dataset for any scientific purposes except this competition, please refer to the following paper:

    Sakiko MATSUMOTO, Takashi YAMAMOTO, Maki YAMAMOTO, Carlos B ZAVALAGA and Ken YODA (2017) Sex-related differences in the foraging movement of Streaked Shearwaters Calonectris leucomelas breeding on Awashima Island in the Sea of Japan. Ornithological Science 16(1):23-32. doi: http://dx.doi.org/10.2326/osj.16.23

    Please contact the corresponding researcher, Ken Yoda (http://yoda-ken.sakura.ne.jp/yoda_lab/English.html), if you would like to use the dataset for any other purposes, or to access un-preprocessed original raw data.
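    A minimal loading sketch for one trajectory file; the column names below are ours, since the CSV files themselves carry no header:

    ```python
    import pandas as pd

    cols = ["longitude", "latitude", "sun_azimuth", "sun_elevation",
            "daytime", "elapsed_s", "local_time", "days"]
    traj = pd.read_csv("train/000.csv", header=None, names=cols)
    print(traj.head())
    ```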

  16. Data from: Automatic delineation of glacier grounding lines in differential...

    • datadryad.org
    • zenodo.org
    zip
    Updated Mar 9, 2021
    Cite
    Yara Mohajerani; Seongsu Jeong; Bernd Scheuchl; Isabella Velicogna; Eric Rignot; Pietro Milillo (2021). Automatic delineation of glacier grounding lines in differential interferometric synthetic-aperture radar data using deep learning [Dataset]. http://doi.org/10.7280/D1VD6G
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 9, 2021
    Dataset provided by
    Dryad
    Authors
    Yara Mohajerani; Seongsu Jeong; Bernd Scheuchl; Isabella Velicogna; Eric Rignot; Pietro Milillo
    Time period covered
    Mar 4, 2021
    Description

    The grounding lines for the entire Antarctic coastline for available Sentinel1-a/b tracks in 2018 are provided as Shapefiles for the 6-day and 12-day tracks separately, as "AllTracks_6d_GL.shp" and "AllTracks_12d_GL.shp" respectively. The corresponding uncertainty estimates are also provided, as described in the manuscript, which are labelled as "AllTracks_6d_uncertainty.shp" and "AllTracks_12d_uncertainty.shp".

    Each grounding line in the Shapefile contains 6 attributes:

    ID: grounding line ID for each DInSAR scene 
    Type: whether the line was used as training or testing data.
    Class: whether each identified line is a grounding line or a pinning point
    Length: length of the enclosing polygon determining the uncertainty
    Width: width of the enclosing polygon determining the uncertainty
    FILENAME: name of the original shapefile for the grounding line (before all files were combined into one), which gives all relevant information of the DInSAR data, in the fo...
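    A minimal sketch for inspecting the attribute table described above, assuming the geopandas package is installed (the attribute names follow the list above; verify them against the actual shapefile):

    ```python
    import geopandas as gpd

    gdf = gpd.read_file("AllTracks_6d_GL.shp")
    print(gdf[["ID", "Type", "Class", "Length", "Width"]].head())
    ```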
    
  17. Data from: Czech Text Document Corpus v 2.0

    • live.european-language-grid.eu
    binary format
    Updated May 6, 2018
    Cite
    (2018). Czech Text Document Corpus v 2.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1253
    Explore at:
    Available download formats: binary format
    Dataset updated
    May 6, 2018
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    BASIC INFORMATION

    --------------------

    Czech Text Document Corpus v 2.0 is a collection of text documents for automatic document classification in the Czech language. It is composed of text documents provided by the Czech News Agency and is freely available for research purposes. This corpus was created to facilitate a straightforward comparison of document classification approaches on Czech data. It is particularly dedicated to the evaluation of multi-label document classification approaches, because one document is usually labelled with more than one label. Besides the information about the document classes, the corpus is also annotated at the morphological layer.

    The main part (for training and testing) is composed of 11,955 real newspaper articles. We also provide a development set, which is intended to be used for tuning the hyper-parameters of the created models. This set contains 2,735 additional articles.

    There are 60 categories in total, of which the 37 most frequent are used for classification. The reason for this reduction is to keep only the classes with a sufficient number of occurrences to train the models.

    Technical Details

    ------------------------

    Text documents are stored in the individual text files using UTF-8 encoding. Each filename is composed of the serial number and the list of the categories abbreviations separated by the underscore symbol and the .txt suffix. Serial numbers are composed of five digits and the numerical series starts from the value one.

    For instance the file 00046_kul_nab_mag.txt represents the document file number 46 annotated by the categories kul (culture), nab (religion) and mag (magazine selection). The content of the document, i.e. the word tokens, is stored in one line. The tokens are separated by the space symbols.
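    A small sketch for recovering the serial number and category abbreviations from such a filename:

    ```python
    import os

    def parse_filename(name):
        # "00046_kul_nab_mag.txt" -> (46, ["kul", "nab", "mag"])
        stem, _ = os.path.splitext(name)
        serial, *categories = stem.split("_")
        return int(serial), categories

    print(parse_filename("00046_kul_nab_mag.txt"))
    ```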

    Every text document was further automatically morphologically analyzed. This analysis includes lemmatization, POS tagging and syntactic parsing. The fully annotated files are stored in .conll files. We also provide the lemmatized form in files with the suffix .lemma, and the corresponding POS tags in .pos files. The tokenized version of the documents is also available in .tok files.

    This corpus is available only for research purposes for free. Commercial use in any form is strictly excluded.

  18. Data from: SemEval-2021 Task 12: Learning with Disagreements

    • data.niaid.nih.gov
    Updated Jul 27, 2021
    Cite
    Fornaciari, Tommaso (2021). SemEval-2021 Task 12: Learning with Disagreements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5130736
    Explore at:
    Dataset updated
    Jul 27, 2021
    Dataset provided by
    Fornaciari, Tommaso
    Dumitrache, Anca
    Uma, Alexandra Nnemamaka
    Chamberlain, Jon
    Poesio, Massimo
    Simpson, Edwin
    Miller, Tristan
    Plank, Barbara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the Post-Evaluation data for SemEval-2021 Task 12: Learning with Disagreement, a shared task on learning to classify with datasets containing disagreements.

    The aim of this shared task is to provide a unified testing framework for learning from disagreements using the best-known datasets containing information about disagreements for interpreting language and classifying images:

    1. LabelMe-IC: Image Classification using a subset of LabelMe images (Russell et al., 2008), is a widely used, community-created image classification dataset where images are assigned to one of 8 categories: highway, inside city, tall building, street, forest, coast, mountain, open country. Rodrigues and Pereira (2017) collected crowd labels for these images using Amazon Mechanical Turk (AMT).
    
    
    2. CIFAR10-IC: Image Classification using a subset of CIFAR-10 dataset, https://www.cs.toronto.edu/~kriz/cifar.html. The entire dataset consists of colour images in 10 categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck). Crowdsourced labels for this dataset were collected by Peterson et al (2019).
    
    
    3. PDIS: Information Status Classification using the Phrase Detectives corpus. Information Status (IS) classification in the Phrase Detectives dataset (Poesio et al., 2019) involves identifying the information status of a noun phrase: whether that noun phrase refers to new information or to old information.
    
    
    4. Gimpel-POS: Part-of-Speech tagging using the Gimpel dataset (Gimpel et al., 2011) for Twitter posts. Plank et al.(2014b) mapped the Gimpel tags to the universal tag set (Petrov et al., 2011), using these tags as gold, and collected crowdsourced labels.
    
    
    5. Humour: ranking one-line texts using pairwise funniness judgements (Simpson et al., 2019). Crowdworkers have annotated pairs of puns to indicate which is funniest. A gold standard ranking was produced using a large number of redundant annotations. The goal is to infer the gold standard ranking from a reduced number of crowdsourced judgements.
    

    The files contained in this data collection are as follows:

    • starting_kit.zip - base models provided for the shared task.
    • practice_phase_data.zip - the training and development data used during the Practice Phase of the competition.
    • test_phase_data.zip - the test data, used during the Evaluation Phase of the competition.

    Details of format of each dataset for each task can be found on Codalab.

  19. Data from: Research data supporting "On-line Active Reward Learning for...

    • repository.cam.ac.uk
    bin, zip
    Updated May 17, 2016
    Cite
    Su, Pei-Hao; Gasic, Milica; Mrksic, Nikola; Rojas-Barahona, Lina; Ultes, Stefan; Vandyke, David; Wen, Tsung-Hsien; Young, Steve (2016). Research data supporting "On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems" [Dataset]. https://www.repository.cam.ac.uk/items/6910626c-bc6f-46c0-b40a-a43913cb48a9
    Explore at:
    Available download formats: zip (124821449 bytes), bin (1953 bytes)
    Dataset updated
    May 17, 2016
    Dataset provided by
    University of Cambridge
    Apollo
    Authors
    Su, Pei-Hao; Gasic, Milica; Mrksic, Nikola; Rojas-Barahona, Lina; Ultes, Stefan; Vandyke, David; Wen, Tsung-Hsien; Young, Steve
    License

    Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
    License information was derived automatically

    Description

    This repository contains the data presented in the paper "On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems" in ACL 2016. Two separate datasets, as described in section 4 of the paper, are presented:

    1. DialogueEmbedding/ contains the [train|valid|test] data for the unsupervised dialogue embedding creation, each with *.[feature|reward|turn|subjsuc]. Note that *.turn includes the lines to be read for each dialogue in *.[feature|reward|subjsuc], and *.subjsuc is the user's subjective rating. The feature size is 74.

    2. DialoguePolicy/ includes four contrasting systems with different reward models: [GP|RNN|ObjSubj|Subj]. Inside each system directory is the data obtained in interaction with Amazon Mechanical Turk users while training three policies with the same config: policy_[1|2|3], plus a .csv file with the evaluation results along the training process. In each policy_[1|2|3]/ there is a list of calls with a time stamp in the name, each containing a session.xml file for the dialogue log and a feedback.xml file for user feedback.

  20. Data from: Neural Reverse Engineering of Stripped Binaries using Augmented...

    • zenodo.org
    application/gzip
    Updated Nov 15, 2020
    Cite
    Yaniv David; Uri Alon; Eran Yahav (2020). Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs [Dataset]. http://doi.org/10.5281/zenodo.4099685
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Nov 15, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Yaniv David; Uri Alon; Eran Yahav
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset and pre-trained models are released as a companion to our OOPSLA '20 publication: "Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs":

    1. The dataset file (nero_dataset_binaries.tar.gz) is composed of packages of binary executables created by compiling several GNU source-code packages. We used these executables to evaluate our approach as implemented in our prototype "Nero" and to compare it to other approaches. All executables contain debug information, which serves as the ground truth for the procedure name predictions. The packages are split into three sets: training, validation and test.
      1. The executable file name structure is: "
    2. The procedure representation file (procedure_representations.tar.gz) contains:
      1. The raw representations for all the binary procedures in the above dataset. Each procedure is represented by one line in the relevant file for each set (training.json, validation.json and test.json)
      2. The above representations preprocessed for training.
    3. The pre-trained model file (nero_gnn_model.tar.gz) was created using the above preprocessed dataset and contains:
      1. Pre-trained model.
      2. Training log.
      3. Prediction results log.

    For the code of the "Nero" prototype and more information about the above artifacts, see our GitHub repo.
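
    Since each procedure occupies one line of the relevant .json file, the raw representations can presumably be consumed as JSON Lines. A minimal reading sketch in Python (the per-procedure field names are not documented above, so none are assumed):

      import json

      # Sketch: read one JSON object per line from a representation file.
      def load_procedures(path="training.json"):
          with open(path) as f:
              return [json.loads(line) for line in f if line.strip()]

      procedures = load_procedures()
      print(len(procedures), "procedures loaded")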

Meghana V. Palukuri; Edward M. Marcotte (2024). Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks: Experiment data [Dataset]. http://doi.org/10.5281/zenodo.4814944

Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks: Experiment data

Explore at:
zip
Available download formats
Dataset updated
Feb 4, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Meghana V. Palukuri; Edward M. Marcotte
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Details of experiments are given in the paper, titled 'Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks.': https://doi.org/10.1371/journal.pone.0262056

For additional details, please see https://sites.google.com/view/supercomplex/super-complex-v3-0

Supporting code is available on github at: https://github.com/marcottelab/super.complex

Details of files provided for each experiment are given below:

Toy network experiment

Input data:

  • Toy network, available as a weighted edge list. Format: node1 node2 edge-weight

  • All raw toy communities, available as node lists. Format: node1 node2 node3 .. Each line represents a community.

Intermediate output results:

  • Training toy communities, available as node lists. Format: node1 node2 node3 .. Each line represents a community.

  • Testing toy communities, available as node lists. Format: node1 node2 node3 .. Each line represents a community.

  • Training toy communities feature matrix, available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative community. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative community, indicated by 1 or 0 respectively)

  • Testing toy communities feature matrix, available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative community. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative community, indicated by 1 or 0 respectively)

Output results:

  • Trained toy community fitness function, available as pickled files of a data pre-processor and a machine learning model from sklearn, which can be loaded in Python with the pickle module, e.g. import pickle; model = pickle.load(open(filename, 'rb')); see the sketch after this list.

  • Learned toy communities, available as node lists. Format: node1 node2 node3 .. nodeN Score. Each line represents a community. The score is the community fitness function of the community.

  • Learned toy communities, available as edge lists. Format: node1 node2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one community from another community's edges.
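
As an illustrative sketch only (the file names are hypothetical placeholders, and treating the model as an sklearn classifier with predict_proba is an assumption), the pickled pre-processor and model might be loaded and applied to one feature vector as follows:

  import pickle

  # Sketch: load the pickled pre-processor and trained sklearn model.
  # File names are placeholders, not the dataset's actual file names.
  with open("toy_preprocessor.pkl", "rb") as f:
      preprocessor = pickle.load(f)
  with open("toy_model.pkl", "rb") as f:
      model = pickle.load(f)

  # One feature vector in the documented 18-value order: density, number of
  # nodes, 4 degree stats, 3 CC stats, 3 edge-weight stats, 3 DC stats and
  # 3 singular values of the subgraph's adjacency matrix.
  features = [[0.8, 5, 4, 3.2, 3.0, 0.4, 0.9, 0.85, 0.01,
               0.7, 0.9, 0.02, 0.1, 0.05, 0.2, 2.9, 1.1, 0.4]]

  # ASSUMPTION: the model is a classifier exposing predict_proba.
  score = model.predict_proba(preprocessor.transform(features))[0, 1]
  print("community fitness:", score)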

hu.MAP experiment:

Input data:

  • hu.MAP PPI (protein-protein interaction) network, available as a weighted edge list. Format: gene_ID1 gene_ID2 edge-weight

  • All raw human protein complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

Intermediate output results:

  • Training complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

  • Testing complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

  • Training data, i.e. feature matrix of CORUM complexes (with edge weights from hu.MAP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative protein complex, indicated by 1 or 0 respectively)

  • Testing data, i.e. feature matrix of CORUM complexes (with edge weights from hu.MAP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative protein complex, indicated by 1 or 0 respectively)

Output results:

  • Trained community fitness function of CORUM complexes (with edge weights from hu.MAP), available as pickled files of a data pre-processor and a machine learning model from sklearn, which can be loaded in Python with the pickle module, e.g. import pickle; model = pickle.load(open(filename, 'rb'))

  • Learned protein complexes from the hu.MAP PPI network, available as node lists in an Excel file, where the columns are: Learned complex name (named after the most similar CORUM complex, prepended with the Jaccard coefficient similarity), Proteins in learned complex (gene names, i.e. gene_name1 gene_name2 gene_name3 .. gene_nameN), Proteins in learned complex (gene IDs, i.e. gene_ID1 gene_ID2 gene_ID3 .. gene_IDN), and Score (the community fitness function of the learned protein complex); a pandas reading sketch follows this list.

  • Learned protein complexes from hu.MAP PPI network, available as gene ID edge lists. Format: gene_ID1 gene_ID2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one complex from another complex's edges.

  • Learned protein complexes from hu.MAP PPI network, available as gene name edge lists. Format: gene_name1 gene_name2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one complex from another complex's edges.
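
A minimal sketch for loading the Excel file of learned complexes with pandas (the file name is a placeholder and the column labels are taken from the description above, so treat both as assumptions):

  import pandas as pd

  # Sketch: read learned complexes and rank them by score.
  # File name and exact column labels are assumed, not documented.
  df = pd.read_excel("learned_humap_complexes.xlsx")
  df["genes"] = df["Proteins in learned complex (gene names)"].str.split()
  top = df.sort_values("Score", ascending=False).head(10)
  print(top[["Learned complex name", "Score"]])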

Yeast experiments:

Input data:

  • DIP yeast PPI network, available as a weighted edge list. Format: gene_ID1 gene_ID2 edge-weight

  • Yeast protein complexes from MIPS, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

  • Yeast protein complexes from TAP-MS, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

Experiment 1: Training on TAP-MS and Testing on MIPS:

Experiment 1 Intermediate output results:

  • Training data, i.e. feature matrix of TAP-MS complexes (with edge weights from the DIP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative protein complex, indicated by 1 or 0 respectively)

  • Testing data, i.e. feature matrix of MIPS complexes (with edge weights from the DIP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative protein complex, indicated by 1 or 0 respectively)

Experiment 1 output results:

  • Trained community fitness function of TAP-MS complexes (with edge weights from the DIP PPI network), available as pickled files of a data pre-processor and a machine learning model from sklearn, which can be loaded in Python with the pickle module, e.g. import pickle; model = pickle.load(open(filename, 'rb'))

  • Learned protein complexes from DIP PPI network, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. gene_IDN Score. Each line represents a protein complex. The score is the community fitness function of the protein complex.

  • Learned protein complexes from DIP PPI network, available as edge lists. Format: gene_ID1 gene_ID2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one complex from another complex's edges.
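
The blank-line-separated edge lists used throughout (learned communities and complexes) can be parsed in a few lines of Python; a minimal sketch, with a hypothetical file name:

  # Sketch: split a blank-line-separated edge-list file into communities,
  # each a list of (node1, node2, weight) tuples.
  def read_communities(path="learned_complexes_edges.txt"):
      communities, current = [], []
      with open(path) as f:
          for line in f:
              if line.strip():
                  n1, n2, w = line.split()
                  current.append((n1, n2, float(w)))
              elif current:  # blank line closes the current community
                  communities.append(current)
                  current = []
      if current:  # in case the file does not end with a blank line
          communities.append(current)
      return communities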

Experiment 2: Training on MIPS and Testing on TAP-MS:

Experiment 2 Intermediate output results:

  • Training data, i.e. feature matrix of MIPS complexes (with edge weights from the DIP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative protein complex, indicated by 1 or 0 respectively)
