91 datasets found
  1. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    Updated Dec 27, 2024
    + more versions
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. http://doi.org/10.34740/kaggle/dsv/10315204
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 27, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rafael Gallo
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."


    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets and tools, such as:
    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:
    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
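    The three steps above can be sketched in a few lines of Python. This is a minimal, stdlib-only illustration (the dataset itself was built with Biopython); the function names and the use of the Kyte-Doolittle scale for hydrophobicity are assumptions consistent with the sources listed above.

    ```python
    import random

    # Kyte-Doolittle hydropathy values for the 20 standard amino acids
    KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
          "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
          "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
          "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

    def random_sequence(rng, min_len=50, max_len=300):
        # Step 1: random amino acid chain of 50-300 residues
        length = rng.randint(min_len, max_len)
        return "".join(rng.choice("ACDEFGHIKLMNPQRSTVWY") for _ in range(length))

    def avg_hydrophobicity(seq):
        # Step 2 (one property): mean Kyte-Doolittle hydropathy over the sequence
        return sum(KD[aa] for aa in seq) / len(seq)

    def assign_class(rng):
        # Step 3: classes are assigned at random, as in the dataset
        return rng.choice(["Enzyme", "Transport", "Structural", "Receptor", "Other"])

    rng = random.Random(0)
    seq = random_sequence(rng)
    print(len(seq), round(avg_hydrophobicity(seq), 3), assign_class(rng))
    ```

    Molecular weight, isoelectric point, and charge follow the same per-residue-table pattern; Biopython's Bio.SeqUtils.ProtParam.ProteinAnalysis provides all of them.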

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:
    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  2. Alternative Splicing Annotation Project II Database

    • rrid.site
    • neuinfo.org
    • +2more
    Updated Feb 24, 2025
    Cite
    (2025). Alternative Splicing Annotation Project II Database [Dataset]. http://identifiers.org/RRID:SCR_000322
    Explore at:
    Dataset updated
    Feb 24, 2025
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented on 8/12/13. An expanded version of the Alternative Splicing Annotation Project (ASAP) database with a new interface and integration of comparative features using UCSC BLASTZ multiple alignments. It supports 9 vertebrate species, 4 insects, and nematodes, and provides extensive alternative splicing analysis and splicing variants. For human alternative splicing data, newly added EST libraries were classified and folded into the previous tissue and cancer classification, and lists of tissue- and cancer (normal)-specific alternatively spliced genes were recalculated and updated. The authors created novel orthologous exon and intron databases, with their splice variants, based on multiple alignments among several species. These orthologous exon and intron databases can give more comprehensive homologous gene information than protein-similarity-based methods. Furthermore, splice junction and exon identity among species can be valuable resources for elucidating species-specific genes.

    The ASAP II database can be easily integrated with pygr (the Python Graph Database Framework for Bioinformatics, unpublished) and its powerful features, such as graph queries and multi-genome alignment queries. ASAP II can be searched by several different criteria, such as gene symbol, gene name, and ID (UniGene, GenBank, etc.). The web interface provides 7 kinds of views: (I) user query, UniGene annotation, orthologous genes, and genome browsers; (II) genome alignment; (III) exons and orthologous exons; (IV) introns and orthologous introns; (V) alternative splicing; (VI) isoform and protein sequences; (VII) tissue and cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in a UCSC-like genome browser. All alternative splicing relationships, with supporting evidence, types of alternative splicing patterns, and inclusion rates for skipped exons, are listed in separate tables. Users can also search human data for tissue- and cancer-specific splice forms at the bottom of the gene summary page. P-values for tissue specificity are reported as log-odds (LOD) scores, and results with LOD >= 3 and at least 3 EST sequences are highlighted.

  3. Scorpio Gene-Taxa Benchmark Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, txt
    Updated Dec 8, 2024
    + more versions
    Cite
    Mohammad Saleh Refahi (2024). Scorpio Gene-Taxa Benchmark Dataset [Dataset]. http://doi.org/10.5281/zenodo.12964684
    Explore at:
    bin, csv, txt (Available download formats)
    Dataset updated
    Dec 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mohammad Saleh Refahi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We used the Woltka pipeline to compile the complete Basic genome dataset, consisting of 4634 genomes, with each genus represented by a single genome. After downloading all coding sequences (CDS) from the NCBI database, we extracted 8 million distinct CDS, focusing on bacteria and archaea and excluding viruses and fungi due to inadequate gene information.

    To maintain accuracy, we excluded hypothetical proteins, uncharacterized proteins, and sequences without gene labels. We addressed issues with gene name inconsistencies in NCBI by keeping only genes with more than 1000 samples and ensuring each phylum had at least 350 sequences. This resulted in a curated dataset of 800,318 gene sequences from 497 genes across 2046 genera.

    We created four datasets to evaluate our model: a training set (Train_set); a test set (Test_set) with different samples but the same genera and genes as the training set; a Taxa_out_set containing sequences from 18 phyla held out of the training set; and a Gene_out_set containing 60 genes held out of the training set but drawn from the same phyla. We ensured each CDS had only one representation per genome, removing genes with multiple representations within the same species.
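    The two frequency filters used in the curation (genes with more than 1000 samples, phyla with at least 350 sequences) can be sketched with collections.Counter. This is an illustrative reconstruction, assuming each record reduces to a (gene, phylum) pair; the paper's actual pipeline may apply the thresholds differently.

    ```python
    from collections import Counter

    def curate(records, min_gene_samples=1000, min_phylum_seqs=350):
        """Keep genes with more than `min_gene_samples` records, then keep only
        phyla with at least `min_phylum_seqs` of the surviving sequences.
        Each record is a (gene, phylum) tuple; names are illustrative."""
        gene_counts = Counter(g for g, _ in records)
        kept = [(g, p) for g, p in records if gene_counts[g] > min_gene_samples]
        phylum_counts = Counter(p for _, p in kept)
        return [(g, p) for g, p in kept if phylum_counts[p] >= min_phylum_seqs]
    ```

    With the thresholds quoted above this would be called as curate(records, 1000, 350).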

  4. European Nucleotide Archive (ENA)

    • rrid.site
    • scicrunch.org
    • +2more
    Updated Feb 9, 2025
    Cite
    (2025). European Nucleotide Archive (ENA) [Dataset]. http://identifiers.org/RRID:SCR_006515
    Explore at:
    Dataset updated
    Feb 9, 2025
    Description

    Public archive providing a comprehensive record of the world's nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation. All submitted data, once public, will be exchanged with the NCBI and DDBJ as part of the INSDC data exchange agreement.

    The European Nucleotide Archive (ENA) captures and presents information relating to experimental workflows that are based around nucleotide sequencing. A typical workflow includes the isolation and preparation of material for sequencing, a run of a sequencing machine in which sequencing data are produced, and a subsequent bioinformatic analysis pipeline. ENA records this information in a data model that covers input information (sample, experimental setup, machine configuration), output machine data (sequence traces, reads and quality scores) and interpreted information (assembly, mapping, functional annotation). Data arrive at ENA from a variety of sources, including submissions of raw data, assembled sequences and annotation from small-scale sequencing efforts, data provision from the major European sequencing centers, and routine and comprehensive exchange with their partners in the International Nucleotide Sequence Database Collaboration (INSDC).

    Provision of nucleotide sequence data to ENA or its INSDC partners has become a central and mandatory step in the dissemination of research findings to the scientific community. ENA works with publishers of scientific literature and funding bodies to ensure compliance with these principles and to provide optimal submission systems and data access tools that work seamlessly with the published literature. ENA is made up of a number of distinct databases that include the EMBL Nucleotide Sequence Database (EMBL-Bank), the newly established Sequence Read Archive (SRA) and the Trace Archive. The main tool for downloading ENA data is the ENA Browser, which is available through REST URLs for easy programmatic use. All ENA data are available through the ENA Browser. Note: the EMBL Nucleotide Sequence Database (EMBL-Bank) is entirely included within this resource.
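    As a sketch of that programmatic use, the snippet below builds an ENA Browser REST URL for a FASTA record. The base URL reflects ENA's public browser API and is an assumption not stated in this record; the accession is a placeholder.

    ```python
    from urllib.request import urlopen  # only needed for the optional download

    # Assumed ENA Browser REST base URL (see ENA's own API documentation)
    ENA_API = "https://www.ebi.ac.uk/ena/browser/api"

    def ena_fasta_url(accession):
        """Build an ENA Browser REST URL for a FASTA record."""
        return f"{ENA_API}/fasta/{accession}"

    url = ena_fasta_url("AB000123")  # placeholder accession
    # To actually download (network access required):
    # with urlopen(url) as r:
    #     fasta = r.read().decode()
    ```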

  5. Data, R code and output Seurat Objects for single cell RNA-seq analysis of...

    • figshare.com
    application/gzip
    Updated May 31, 2023
    Cite
    Yunshun Chen; Gordon Smyth (2023). Data, R code and output Seurat Objects for single cell RNA-seq analysis of human breast tissues [Dataset]. http://doi.org/10.6084/m9.figshare.17058077.v1
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Yunshun Chen; Gordon Smyth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all the Seurat objects that were used for generating all the figures in Pal et al. 2021 (https://doi.org/10.15252/embj.2020107333). All the Seurat objects were created under R v3.6.1 using the Seurat package v3.1.1. The detailed information of each object is listed in a table in Chen et al. 2021.

  6. IntEnz - Integrated relational Enzyme database

    • neuinfo.org
    • rrid.site
    • +2more
    Updated Jan 29, 2022
    Cite
    (2022). IntEnz- Integrated relational Enzyme database [Dataset]. http://identifiers.org/RRID:SCR_002992
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    IntEnz (Integrated relational Enzyme database) is a freely available resource focused on enzyme nomenclature. IntEnz is created in collaboration with the Swiss Institute of Bioinformatics (SIB). This collaboration is responsible for the production of the ENZYME resource. IntEnz contains the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) on the nomenclature and classification of enzyme-catalysed reactions.

  7. Data from: EukProt: a database of genome-scale predicted proteins across the...

    • explore.openaire.eu
    Updated Jan 1, 2022
    Cite
    Daniel Richter; Cédric Berney; Jürgen Strassert; Yu-Ping Poh; Emily K. Herman; Sergio A. Muñoz-Gómez; Jeremy G. Wideman; Fabien Burki; Colomban de Vargas (2022). EukProt: a database of genome-scale predicted proteins across the diversity of eukaryotes [Dataset]. http://doi.org/10.6084/m9.figshare.12417881
    Explore at:
    Dataset updated
    Jan 1, 2022
    Authors
    Daniel Richter; Cédric Berney; Jürgen Strassert; Yu-Ping Poh; Emily K. Herman; Sergio A. Muñoz-Gómez; Jeremy G. Wideman; Fabien Burki; Colomban de Vargas
    Description

    Version 3 (22 November, 2021)

    See https://doi.org/10.24072/pcjournal.173 for a detailed description of the database. See http://evocellbio.com/eukprot/ for a BLAST database, interactive plots of BUSCO scores and ‘The Comparative Set’ (TCS): a selected subset of EukProt for comparative genomics investigations. Protein sequence FASTA files of the TCS are available at https://doi.org/10.6084/m9.figshare.21586065. See https://github.com/beaplab/EukProt for utility scripts, annotations, and all the files necessary to build the tree in Figures 1 and 3 (from the DOI above). Scroll to the end of this page for changes since version 2. Are we missing anything? Please let us know!

    EukProt is a database of published and publicly available predicted protein sets selected to represent the breadth of eukaryotic diversity, currently including 993 species from all major supergroups as well as orphan taxa. The goal of the database is to provide a single, convenient resource for gene-based research across the spectrum of eukaryotic life, such as phylogenomics and gene family evolution. Each species is placed within the UniEuk taxonomic framework in order to facilitate downstream analyses, and each data set is associated with a unique, persistent identifier to facilitate comparison and replication among analyses. The database is regularly updated, and all versions will be permanently stored and made available via FigShare. The current version has a number of updates, notably ‘The Comparative Set’ (TCS), a reduced taxonomic set with high estimated completeness while maintaining a substantial phylogenetic breadth, which comprises 196 predicted proteomes. A BLAST web server and graphical displays of data set completeness are available at http://evocellbio.com/eukprot/. We invite the community to provide suggestions for new data sets and new annotation features to be included in subsequent versions, with the goal of building a collaborative resource that will promote research to understand eukaryotic diversity and diversification.

    This release contains 5 files:
    • EukProt_proteins.v03.2021_11_22.tgz: 993 protein data sets, for species with either a genome (375) or single-cell genome (56), a transcriptome (498), a single-cell transcriptome (47), or an EST assembly (17).
    • EukProt_genome_annotations.v03.2021_11_22.tgz: gene annotations, in GFF format, as produced by EukMetaSanity (https://github.com/cjneely10/EukMetaSanity) for 40 genomes lacking publicly available protein annotations. The proteins predicted from these annotations are included in the proteins file.
    • EukProt_included_data_sets.v03.2021_11_22.txt and EukProt_not_included_data_sets.v03.2021_11_22.txt: tables of information on data sets either included (993 data sets) or not included (163) in the database. Tab-delimited; multiple entries in the same cell are comma-delimited; missing data is represented with the “N/A” value. With the following columns:
      - EukProt_ID: the unique identifier associated with the data set. This will not change among versions. If a new data set becomes available for the species, it will be assigned a new unique identifier.
      - Name_to_Use: the name of the species for protein/genome annotation/assembled transcriptome files.
      - Strain: the strain(s) of the species sequenced.
      - Previous_Names: any previous names that this species was known by.
      - Replaces_EukProt_ID/Replaced_by_EukProt_ID: if the data set changes with respect to an earlier version, the EukProt ID of the data set that it replaces (in the included table) or that it is replaced by (in the not_included table).
      - Genus_UniEuk, Epithet_UniEuk, Supergroup_UniEuk, Taxogroup1_UniEuk, Taxogroup2_UniEuk: taxonomic identifiers at different levels of the UniEuk taxonomy (Berney et al. 2017, DOI: 10.1111/jeu.12414, based on Adl et al. 2019, DOI: 10.1111/jeu.12691).
      - Taxonomy_UniEuk: the full lineage of the species in the UniEuk taxonomy (semicolon-delimited).
      - Merged_Strains: whether multiple strains of the same species were merged to create the data set.
      - Data_Source_URL: the URL(s) from which the data were downloaded.
      - Data_Source_Name: the name of the data set (as assigned by the data source).
      - Paper_DOI: the DOI(s) of the paper(s) that published the data set.
      - Actions_Prior_to_Use: the action(s) that were taken to process the publicly available files in order to produce the data set in this database. Actions taken (see our manuscript for more details): ‘assemble mRNA’: Trinity v. 2.8.4, http://trinityrnaseq.github.io/; ‘CD-HIT’: v. 4.6, http://weizhongli-lab.org/cd-hit/; ‘extractfeat’, ‘seqret’, ‘transeq’, ‘trimseq’: from EMBOSS package v. 6.6.0.0, http://emboss.sourceforge.net/; ‘translate mRNA’: Transdecoder v. 5.3.0, http://transdecoder.github.io/; ‘gffread’: v.0.12.3, https://github.com/gpertea/gffread; ‘predict genes’: EukMetaSanity, https://github.com/cjneely10/EukMetaSanity (cloned on 21 September, 2021). All parameter values were default, unless otherwise specified.
      - Data_Source_Type: the type o...
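    Given the table format described above (tab-delimited columns, comma-delimited multi-entries, “N/A” for missing data), a minimal Python reader might look like this. It is a sketch only: splitting every present field into a list is an illustrative choice, and the sample row is invented.

    ```python
    import csv
    import io

    def parse_eukprot_table(text):
        """Parse a EukProt metadata table: tab-delimited rows,
        comma-delimited multi-entries, "N/A" for missing data."""
        rows = []
        for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
            # Missing cells become None; present cells become lists,
            # since any cell may hold comma-delimited multiple entries.
            rows.append({k: (None if v == "N/A" else v.split(","))
                         for k, v in row.items()})
        return rows

    # Invented sample row using column names from the description above
    sample = ("EukProt_ID\tName_to_Use\tPaper_DOI\n"
              "EP00001\tSpecies one\tN/A\n")
    ```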

  8. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Dec 7, 2023
    + more versions
    Cite
    Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. https://data.niaid.nih.gov/resources?id=dryad_w3r2280w0
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Prevention Trials Network (http://www.hptn.org/)
    HIV Vaccine Trials Network (http://www.hvtn.org/)
    National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
    PEPFAR
    Authors
    Dylan Westfall; James Mullins
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimization to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on the PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies". Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005. For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline, or CCS read files from the datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz).

    Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each contains an .Rmd file with the same name, used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results. Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper. Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.

  9. Replication data for: "Gene Regulatory Network Inference Methodology for...

    • entrepot.recherche.data.gouv.fr
    bin, csv, text/tsv +2
    Updated Oct 23, 2024
    + more versions
    Cite
    Lise Pomiès; Céline Brouard; Harold Duruflé; Élise Maigné; Clément Carré; Louise Gody; Fulya Trösser; George Katsirelos; Brigitte Mangin; Nicolas Langlade; Simon de Givry (2024). Replication data for: "Gene Regulatory Network Inference Methodology for Genomic and Transcriptomic Data Acquired in Genetically Related Heterozygote Individuals", 100 simulated datasets of RNA gene expressions of sunflower hybrids [Dataset]. http://doi.org/10.15454/VRGWZ2
    Explore at:
    text/tsv, tsv, csv, bin, txt (Available download formats)
    Dataset updated
    Oct 23, 2024
    Dataset provided by
    Recherche Data Gouv
    Authors
    Lise Pomiès; Céline Brouard; Harold Duruflé; Élise Maigné; Clément Carré; Louise Gody; Fulya Trösser; George Katsirelos; Brigitte Mangin; Nicolas Langlade; Simon de Givry
    License

    https://spdx.org/licenses/etalab-2.0.html

    Description

    Replication data for: "Gene regulatory network inference methodology for genomic and transcriptomic data acquired in genetically related heterozygote individuals": 100 simulated datasets of RNA gene expressions of sunflower hybrids. This data set includes the 100 simulated datasets used in the paper "Gene regulatory network inference methodology for genomic and transcriptomic data acquired in genetically related heterozygote individuals", Bioinformatics, 2022, https://doi.org/10.1093/bioinformatics/btac445. They are artificial expression datasets created with the data simulator SysGenSIM (modified) from the same gene regulatory network: artificialDataSet_network.csv. The files used to generate the 100 expression datasets are also included (the activation/repression sign networkSign and heterosis effect zMatrix directories). The "networks" directory contains the learned networks. For the description and dimensions of the files, see README.txt.

  10. Data from: Challenge for Deep Learning: Protein Structure Prediction of...

    • acs.figshare.com
    xlsx
    Updated Nov 1, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gustav Olanders; Giulia Testa; Alessandro Tibo; Eva Nittinger; Christian Tyrchan (2024). Challenge for Deep Learning: Protein Structure Prediction of Ligand-Induced Conformational Changes at Allosteric and Orthosteric Sites [Dataset]. http://doi.org/10.1021/acs.jcim.4c01475.s003
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Nov 1, 2024
    Dataset provided by
    ACS Publications
    Authors
    Gustav Olanders; Giulia Testa; Alessandro Tibo; Eva Nittinger; Christian Tyrchan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In the realm of biomedical research, understanding the intricate structure of proteins is crucial, as these structures determine how proteins function within our bodies and interact with potential drugs. Traditionally, methods like X-ray crystallography and cryo-electron microscopy have been used to unravel these structures, but they are often challenging, time-consuming and costly. Recently, a breakthrough in computational biology has emerged with the development of deep learning algorithms capable of predicting protein structures from their amino acid sequences (Jumper, J., et al. Nature 2021, 596, 583. Lane, T. J. Nature Methods 2023, 20, 170. Kryshtafovych, A., et al. Proteins: Structure, Function and Bioinformatics 2021, 89, 1607). This study focuses on predicting the dynamic changes that proteins undergo upon ligand binding, specifically when they bind to allosteric sites, i.e. a pocket different from the active site. Allosteric modulators are particularly important for drug discovery, as they open new avenues for designing drugs that can target proteins more effectively and with fewer side effects (Nussinov, R.; Tsai, C. J. Cell 2013, 153, 293). To study this, we curated a data set of 578 X-ray structures of proteins displaying orthosteric and allosteric binding, as well as a general framework to evaluate deep learning-based structure prediction methods. Our findings demonstrate the potential and current limitations of deep learning methods such as AlphaFold2 (Jumper, J., et al. Nature 2021, 596, 583), NeuralPLexer (Qiao, Z., et al. Nat Mach Intell 2024, 6, 195), and RoseTTAFold All-Atom (Krishna, R., et al. Science 2024, 384, eadl2528) to predict not just static protein structures but also dynamic conformational changes. Herein we show that predicting the allosteric induced-fit conformation still poses a challenge to deep learning methods, as they predict the orthosteric bound conformation more accurately than the allosteric induced-fit conformation. For AlphaFold2, we observed that conformational diversity and sampling between the apo and holo states could be increased by modifying the MSA depth, but this did not enhance the ability to generate conformations close to the allosteric induced-fit conformation. To further support advancements in the protein structure prediction field, the curated data set and evaluation framework are made publicly available.

  11. BOriS Training Datasets

    • figshare.com
    txt
    Updated Jun 4, 2023
    Theodor Sperlea (2023). BOriS Training Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.8108357.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    Theodor Sperlea
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Positive and negative dataset created for training of machine learning models to classify/identify gammaproteobacterial oriC sequences.

  12. ChEMBL EBI Small Molecules Database

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Google BigQuery (2019). ChEMBL EBI Small Molecules Database [Dataset]. https://www.kaggle.com/bigquery/ebi-chembl
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQueryhttps://cloud.google.com/bigquery
    Authors
    Google BigQuery
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    ChEMBL is maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL), based at the Wellcome Trust Genome Campus, Hinxton, UK.

    Content

    ChEMBL is a manually curated database of bioactive molecules with drug-like properties used in drug discovery, including information about existing patented drugs.

    Schema: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/chembl_23_schema.png

    Documentation: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/schema_documentation.html

    Fork this notebook to get started on accessing data in the BigQuery dataset using the BQhelper package to write SQL queries.
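As a sketch, approved drugs could be pulled from the BigQuery copy with standard SQL. The table and column names below follow the ChEMBL 23 schema linked above; how they are named in the BigQuery dataset is an assumption and should be verified before use:

```sql
-- Illustrative query; verify table and column names against the
-- patents-public-data.ebi_chembl dataset before relying on it.
SELECT
  pref_name,
  max_phase
FROM `patents-public-data.ebi_chembl.molecule_dictionary_23`
WHERE SAFE_CAST(max_phase AS INT64) = 4  -- phase 4 = approved drugs
LIMIT 10;
```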

    Acknowledgements

    “ChEMBL” by the European Bioinformatics Institute (EMBL-EBI), used under CC BY-SA 3.0. Modifications have been made to add normalized publication numbers.

    Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:ebi_chembl

    Banner photo by rawpixel on Unsplash

  13. References and test datasets for the Cactus pipeline

    • figshare.scilifelab.se
    txt
    Updated Jan 15, 2025
    Jerome Salignon; Lluis Milan Arino; Maxime Garcia; Christian Riedel (2025). References and test datasets for the Cactus pipeline [Dataset]. http://doi.org/10.17044/scilifelab.20171347.v4
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Karolinska Institutet
    Authors
    Jerome Salignon; Lluis Milan Arino; Maxime Garcia; Christian Riedel
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview This item contains references and test datasets for the Cactus pipeline. Cactus (Chromatin ACcessibility and Transcriptomics Unification Software) is an mRNA-Seq and ATAC-Seq analysis pipeline that aims to provide advanced molecular insights on the conditions under study.

    Test datasets The test datasets contain all data needed to run Cactus in each of the 4 supported organisms. This includes ATAC-Seq and mRNA-Seq data (*.fastq.gz), parameter files (*.yml) and design files (*.tsv). They were created for each species by downloading publicly available datasets with fetchngs (Ewels et al., 2020) and subsampling reads to the minimum required to have enough DAS (Differential Analysis Subsets) for enrichment analysis. Datasets downloaded: - Worm and human: GSE98758 - Fly: GSE149339 - Mouse: GSE193393

    References One of the goals of Cactus is to make the analysis as simple and fast as possible for the user while providing detailed insights into molecular mechanisms. This is achieved by parsing all needed references for the 4 ENCODE (Dunham et al., 2012; Stamatoyannopoulos et al., 2012; Luo et al., 2020) and modENCODE (THE MODENCODE CONSORTIUM et al., 2010; Gerstein et al., 2010) organisms (human, M. musculus, D. melanogaster and C. elegans). This parsing step was done with a Nextflow pipeline with most tools encapsulated within containers for improved efficiency and reproducibility and to allow the creation of customized references. Genomic sequences and annotations were downloaded from Ensembl (Cunningham et al., 2022). The ENCODE API (Luo et al., 2020) was used to download the ChIP-Seq profiles of 2,714 Transcription Factors (TFs) (Landt et al., 2012; Boyle et al., 2014) and chromatin states in the form of 899 ChromHMM profiles (Boix et al., 2021; van der Velde et al., 2021) and 6 HiHMM profiles (Ho et al., 2014). Slim annotations (cell, organ, development, and system) were parsed and used to create groups of ChIP-Seq profiles that share the same annotations, allowing users to analyze only ChIP-Seq profiles relevant to their study. 2,779 TF motifs were obtained from the Cis-BP database (Lambert et al., 2019). GO terms and KEGG pathways were obtained via the R packages AnnotationHub (Morgan and Shepherd, 2021) and clusterProfiler (Yu et al., 2012; Wu et al., 2021), respectively.

    Documentation More information on how to use Cactus and how references and test datasets were created is available on the documentation website: https://github.com/jsalignon/cactus.

  14. Data for: 'FAS: assessing the similarity between proteins using...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 4, 2023
    Ingo Ebersberger (2023). Data for: 'FAS: assessing the similarity between proteins using multi-layered feature architectures' [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7896005
    Explore at:
    Dataset updated
    May 4, 2023
    Dataset provided by
    Julian Dosch
    Ingo Ebersberger
    Holger Bergmann
    Vinh Tran
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Raw data and result data for the analyses made for the manuscript:

    'FAS: assessing the similarity between proteins using multi-layered feature architectures'

    https://doi.org/10.1093/bioinformatics/btad226

    This dataset contains raw data obtained from QFO Orthobench and the Gene Ontology database. Analyses were made to showcase the different uses of the FAS algorithm.

  15. Data from: CottonGen Synteny Viewer

    • agdatacommons.nal.usda.gov
    • catalog.data.gov
    bin
    Updated Feb 13, 2024
    + more versions
    Taein Lee; Sook Jung; Ksenija Gasic; Todd Campbell; Jing Yu; Jodi Humann; Heidi Hough; Dorrie Main (2024). CottonGen Synteny Viewer [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/CottonGen_Synteny_Viewer/24853278
    Explore at:
    binAvailable download formats
    Dataset updated
    Feb 13, 2024
    Dataset provided by
    MainLab, Washington State University
    Authors
    Taein Lee; Sook Jung; Ksenija Gasic; Todd Campbell; Jing Yu; Jodi Humann; Heidi Hough; Dorrie Main
    License

    U.S. Government Workshttps://www.usa.gov/government-works
    License information was derived automatically

    Description

    Conserved syntenic regions among publicly available cotton genomes were analyzed by CottonGen and made available using the Tripal Synteny Viewer developed by the Fei Bioinformatics Lab at the Boyce Thompson Institute at Cornell University. Analysis was done using MCScanX (Wang et al. 2012) with default settings; blast files were made using blastp with an expectation value cutoff < 1e-10, maximum alignment of 5, and maximum scores of 5. The synteny viewer displays all the conserved syntenic blocks between a selected chromosome of one genome and another genome in a circular and tabular layout. Once a block is chosen in either layout, all the genes in the block are shown in graphic and tabular format. The gene names have hyperlinks to gene pages where detailed information on each gene can be accessed. The 'synteny' section of the gene page displays all the orthologs and paralogs with links to the corresponding syntenic blocks or gene pages. Resources in this dataset: Website Pointer for CottonGen Synteny Viewer (Web Page, url: https://www.cottongen.org/synview/search).

  16. NeonatalPortugal2018

    • data.mendeley.com
    Updated Dec 7, 2019
    Francisco Machado e Costa (2019). NeonatalPortugal2018 [Dataset]. http://doi.org/10.17632/br8tnh3h47.1
    Explore at:
    Dataset updated
    Dec 7, 2019
    Authors
    Francisco Machado e Costa
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Portuguese National Registry on low weight newborns between 2013 and 2018, made available for research purposes. Dataset is composed of 3823 unique entries registering birthweight, biological sex of the infant (1-Male; 2-Female), CRIB score (0-21) and survival (0-Survival; 1-Death).

  17. Data from: CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research...

    • search.datacite.org
    • data.mendeley.com
    • +1more
    Updated Dec 4, 2018
    Stian Soiland-Reyes (2018). CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research Object) [Dataset]. http://doi.org/10.17632/xnwncxpw42
    Explore at:
    Dataset updated
    Dec 4, 2018
    Dataset provided by
    DataCitehttps://www.datacite.org/
    Mendeley
    Authors
    Stian Soiland-Reyes
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This workflow adapts the approach and parameter settings of Trans-Omics for Precision Medicine (TOPMed). The RNA-seq pipeline originated from the Broad Institute. The workflow has five steps:

    1. Read alignment using STAR, which produces aligned BAM files including the Genome BAM and Transcriptome BAM.
    2. The Genome BAM file is processed with Picard MarkDuplicates, producing an updated BAM file annotated with duplicate reads (such reads can indicate biased interpretation).
    3. SAMtools index generates an index for the BAM file, in preparation for the next step.
    4. The indexed BAM file is processed further with RNA-SeQC, which takes the BAM file, the human genome reference sequence and a Gene Transfer Format (GTF) file as inputs to generate transcriptome-level expression quantifications and standard quality-control metrics.
    5. In parallel with transcript quantification, isoform expression levels are quantified by RSEM. This step depends only on the output of STAR, plus additional RSEM reference sequences.

    For testing and analysis, the workflow authors provided example data created by down-sampling the read files of a TOPMed public-access dataset. Chromosome 12 was extracted from the Homo sapiens assembly 38 reference sequence and provided by the workflow authors, along with the required GTF and RSEM reference data files. The workflow is well documented, and detailed instructions for the steps performed to down-sample the data are provided for transparency. The availability of example input data, the use of containerization for the underlying software and the detailed documentation were important factors in choosing this specific CWL workflow for CWLProv evaluation. This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance, see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwl
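For illustration, a single step such as the SAMtools indexing could be described as a minimal CWL CommandLineTool. This is a hypothetical sketch, not the workflow's actual file; a real tool description would typically also stage the BAM into the working directory so the .bai index lands next to it:

```yaml
# Hypothetical sketch of a SAMtools index step; not the workflow's own file.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools, index]
inputs:
  bam:
    type: File
    inputBinding:
      position: 1
outputs:
  bam_index:
    type: File
    outputBinding:
      glob: "*.bai"
```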

  18. Artificial single-cell datasets with cell clusters used for...

    • figshare.com
    application/x-gzip
    Updated Jun 11, 2023
    Alexis Vandenbon (2023). Artificial single-cell datasets with cell clusters used for singleCellHaystack manuscript [Dataset]. http://doi.org/10.6084/m9.figshare.12319787.v1
    Explore at:
    application/x-gzipAvailable download formats
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    Alexis Vandenbon
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    An archive containing 100 artificial single-cell datasets. Each dataset is an R data file (.rda). The file names have the format splatter_<thousands_of_cells>kCells_groups_set_<set_id>.rda. For example, "splatter_1kCells_groups_set_9.rda" is the 9th set containing 1000 cells made using the "groups" option of splatSimulate. You can import the data into R using the load() function. Each dataset includes the following R objects: 1) counts: the number of reads or UMIs for each gene in each cell; 2) gene.data: a summary of the data generated by Splatter; 3) params: the parameters used to generate the dataset.

  19. Data from: Aligner optimization increases accuracy and decreases compute...

    • zenodo.org
    • data.niaid.nih.gov
    • +2more
    application/gzip
    Updated May 27, 2022
    Kelly M. Robinson; Aziah S. Hawkins; Ivette Santana-Cruz; Ricky S. Adkins; Amol C. Shetty; Sushma Nagaraj; Lisa Sadzewicz; Luke J. Tallon; David A. Rasko; Claire M. Fraser; Anup Mahurkar; Joana C. Silva; Julie C. Dunning Hotopp (2022). Data from: Aligner optimization increases accuracy and decreases compute times in multi-species sequence data [Dataset]. http://doi.org/10.5061/dryad.m1m0p
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    May 27, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Kelly M. Robinson; Aziah S. Hawkins; Ivette Santana-Cruz; Ricky S. Adkins; Amol C. Shetty; Sushma Nagaraj; Lisa Sadzewicz; Luke J. Tallon; David A. Rasko; Claire M. Fraser; Anup Mahurkar; Joana C. Silva; Julie C. Dunning Hotopp
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    As sequencing technologies have evolved, the tools to analyze these sequences have made similar advances. However, for multi-species samples, we observed important and adverse differences in alignment specificity and computation time for bwa-mem (Burrows–Wheeler aligner-maximum exact matches) relative to bwa-aln. Therefore, we sought to optimize bwa-mem for alignment of data from multi-species samples in order to reduce alignment time and increase the specificity of alignments. In the multi-species cases examined, there was one majority member (i.e. Plasmodium falciparum or Brugia malayi) and one minority member (i.e. human or the Wolbachia endosymbiont wBm) of the sequence data. Increasing bwa-mem seed length from the default value reduced the number of read pairs from the majority sequence member that incorrectly aligned to the reference genome of the minority sequence member. Combining both source genomes into a single reference genome increased the specificity of mapping, while also reducing the central processing unit (CPU) time. In Plasmodium, at a seed length of 18 nt, 24.1 % of reads mapped to the human genome using 1.7±0.1 CPU hours, while 83.6 % of reads mapped to the Plasmodium genome using 0.2±0.0 CPU hours (total: 107.7 % reads mapping; in 1.9±0.1 CPU hours). In contrast, 97.1 % of the reads mapped to a combined Plasmodium–human reference in only 0.7±0.0 CPU hours. Overall, the results suggest that combining all references into a single reference database and using a 23 nt seed length reduces the computational time, while maximizing specificity. Similar results were found for simulated sequence reads from a mock metagenomic data set. We found similar improvements to computation time in a publicly available human-only data set.
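The combined-reference approach can be sketched in a few lines. The file names and FASTA records below are made up for illustration; only the idea of concatenating the two genomes into one reference comes from the description:

```python
import os
import tempfile

def combine_references(paths, out_path):
    """Concatenate FASTA references into a single combined reference."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as fh:
                out.write(fh.read())

# Demo with two tiny, made-up FASTA records (e.g. host + parasite).
tmp = tempfile.mkdtemp()
host = os.path.join(tmp, "host.fa")
parasite = os.path.join(tmp, "parasite.fa")
with open(host, "w") as fh:
    fh.write(">chr1_host\nACGTACGT\n")
with open(parasite, "w") as fh:
    fh.write(">chr1_parasite\nTTGGCCAA\n")

combined = os.path.join(tmp, "combined.fa")
combine_references([host, parasite], combined)
print(open(combined).read().count(">"))  # 2: both genomes in one reference
```

Indexing and alignment would then run once against the combined file, e.g. `bwa index combined.fa` followed by `bwa mem -k 23 combined.fa r1.fq r2.fq`, where `-k` raises bwa mem's minimum seed length from its default.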

  20. 1000-Character Data Sets with Missing Data

    • figshare.com
    application/gzip
    Updated Jan 19, 2016
    + more versions
    April Wright (2016). 1000-Character Data Sets with Missing Data [Dataset]. http://doi.org/10.6084/m9.figshare.1164185.v1
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    April Wright
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Publication: Wright AM and Hillis DM (2014). Bayesian analysis using a simple likelihood model outperforms parsimony for estimation of phylogeny from discrete morphological data. PLOS ONE. Contents: 1000-character data sets with missing data, and the phylogenetic trees estimated from these sets. Details: These data sets were simulated along the tree in Fig. 1 of the paper, and contain 1000 characters. To assess the effects of missing data on phylogenetic estimation, we used several schemes for character deletion. We sorted the characters by rate of change, and divided them into three categories: fast-, intermediate-, and slow-evolving sites. Within each class of sites, we created data sets in which we removed between 10% and 100% of sites to investigate the effects of underrepresentation of certain classes of characters. Missing data were concentrated in fossil taxa, as seen in Figure 2.

Bioinformatics Protein Dataset - Simulated


Proposed Uses

This dataset is ideal for:

  • Training classification models for proteins.
  • Exploratory analysis of physicochemical properties of proteins.
  • Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

  1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
  2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
  3. Class Assignment: Classes were randomly assigned for classification purposes.
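As a rough, dependency-free sketch of steps 1 and 2 (the dataset itself reports Biopython-computed values; the 50–300 length range and the Kyte-Doolittle hydrophobicity come from the description, while the helper names are illustrative):

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
      "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
      "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
      "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def random_protein(rng, min_len=50, max_len=300):
    """Step 1: a random amino acid chain of 50-300 residues."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice("ACDEFGHIKLMNPQRSTVWY") for _ in range(length))

def hydrophobicity(seq):
    """Step 2 (partial): mean Kyte-Doolittle hydropathy, as in the
    Hydrophobicity column."""
    return sum(KD[aa] for aa in seq) / len(seq)

rng = random.Random(0)
seq = random_protein(rng)
print(len(seq), round(hydrophobicity(seq), 3))  # Sequence_Length, Hydrophobicity
```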

Limitations

  • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
  • The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:

  • Training: 16,000 samples (proteinas_train.csv).
  • Testing: 4,000 samples (proteinas_test.csv).
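The 80/20 split could be reproduced with a simple index partition; the code below is an illustrative sketch (only the 16,000/4,000 counts and file names come from the description):

```python
import random

random.seed(42)  # any fixed seed makes the split reproducible
indices = list(range(20_000))                  # one index per protein
test_idx = set(random.sample(indices, 4_000))  # 20% held out for proteinas_test.csv
train_idx = [i for i in indices if i not in test_idx]

print(len(train_idx), len(test_idx))  # 16000 4000
```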

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
