10 datasets found
  1. Supplementary Table 3. Knowledge and attitudes among life scientists towards...

    • figshare.com
    pdf
    Updated Aug 11, 2020
    Cite
    Evanthia Kaimaklioti Samota (2020). Supplementary Table 3. Knowledge and attitudes among life scientists towards reproducibility within journal articles: a research survey. [Dataset]. http://doi.org/10.6084/m9.figshare.7706753.v3
    Available download formats: pdf
    Dataset updated
    Aug 11, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Evanthia Kaimaklioti Samota
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Table 3. Profiles of individuals in the eLIFE cohort who replied "No have never tried reproducing any published results", stratified by how they responded to each of questions 13, 15, 16 and 17.

  2. Drosophila Melanogaster Genome

    • kaggle.com
    • ieee-dataport.org
    zip
    Updated Nov 17, 2019
    Cite
    Myles O'Neill (2019). Drosophila Melanogaster Genome [Dataset]. https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome
    Available download formats: zip (136202106 bytes)
    Dataset updated
    Nov 17, 2019
    Authors
    Myles O'Neill
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Drosophila Melanogaster

    Drosophila Melanogaster, the common fruit fly, is a model organism which has been extensively used in entomological research. It is one of the most studied organisms in biological research, particularly in genetics and developmental biology.

    When it's not being used for scientific research, D. melanogaster is a common pest in homes, restaurants, and anywhere else that serves food. It is not to be confused with Tephritidae flies (also known as fruit flies).

    https://en.wikipedia.org/wiki/Drosophila_melanogaster

    About the Genome

    This genome was first sequenced in 2000. It contains four pairs of chromosomes (2, 3, 4, and X/Y). More than 60% of the genome appears to be functional non-protein-coding DNA.

    ![D. melanogaster chromosomes][1]

    The genome is maintained and frequently updated at [FlyBase][2]. This dataset is sourced from the UCSC Genome Bioinformatics download page. It uses the August 2014 version of the D. melanogaster genome (dm6, BDGP Release 6 + ISO1 MT). http://hgdownload.soe.ucsc.edu/downloads.html#fruitfly

    Files were modified by Kaggle to be a better fit for analysis on Scripts. This primarily involved turning files into CSV format, with a header row, as well as converting the genome itself from 2bit format into a FASTA sequence file.

    Bioinformatics

    Genomic analysis can be daunting to data scientists who haven't had much experience with bioinformatics before. We have tried to give basic explanations for each of the files in this dataset, as well as links to further reading on the biological basis for each. If you haven't had the chance to study much biology before, some light reading (e.g., Wikipedia) on the following topics may be helpful to understand the nuances of the data provided here: [Genetics][3], [Genomics][4], [Chromosomes][7], [DNA][8], [RNA][9], [Genes][12], [Alleles][13], [Exons][14], [Introns][15], [Transcription][16], [Translation][17], [Peptides][18], [Proteins][19], [Gene Regulation][20], [Mutation][21], [Phylogenetics][22], and [SNPs][23].

    Of course, if you've got some idea of the basics already - don't be afraid to jump right in!

    Learning Bioinformatics

    There are a lot of great resources for learning bioinformatics on the web. One cool site is [Rosalind][24] - a platform that gives you bioinformatic coding challenges to complete. You can use Kaggle Scripts on this dataset to easily complete the challenges on Rosalind (and see [Myles' solutions here][25] if you get stuck). We have set up [Biopython][26] on Kaggle's docker image which is a great library to help you with your analyses. Check out their [tutorial here][27] and we've also created [a python notebook with some of the tutorial applied to this dataset][28] as a reference.

    Files in this Dataset

    Drosophila Melanogaster Genome

    • genome.fa

    The assembled genome itself is presented here in [FASTA format][29]. Each chromosome is a different sequence of nucleotides. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.
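The lower-case convention described above can be explored with a few lines of plain Python. The following is a minimal sketch, not the dataset's own tooling: the function names `read_fasta` and `soft_masked_fraction` are hypothetical, and the two short records are made-up stand-ins for `genome.fa`, not real Drosophila sequence.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def soft_masked_fraction(sequence):
    """Fraction of bases written in lower case (flagged as repeats)."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence if base.islower()) / len(sequence)


# Made-up example records standing in for genome.fa (assumption, not real data).
example = StringIO(">chrDemo1\nACGTacgtAC\nGT\n>chrDemo2\nttttGGGG\n")
fractions = {name: soft_masked_fraction(seq) for name, seq in read_fasta(example)}
print(fractions)
```

For the real file, the `StringIO` object would be replaced by an open file handle. Biopython, which the description mentions, offers a more complete FASTA parser.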

    Meta Information

    There are 3 additional files with meta information about the genome.

    • meta-cpg-island-ext-unmasked.csv

    This file contains descriptive information about CpG Islands in the genome.

    https://en.wikipedia.org/wiki/CpG_site

    • meta-cytoband.csv

    This file describes the positions of cytogenetic bands on each chromosome.

    https://en.wikipedia.org/wiki/Cytogenetics

    • meta-simple-repeat.csv

    This file describes simple tandem repeats in the genome.

    https://en.wikipedia.org/wiki/Repeated_sequence_(DNA) https://en.wikipedia.org/wiki/Tandem_repeat

    Drosophila Melanogaster mRNA Sequences

    Messenger RNA (mRNA) is an intermediate molecule created as part of the cellular process of converting genomic information into proteins. Some mRNA are never translated into proteins and have functional roles in the cell on their own. Collectively, an organism's mRNA is known as its transcriptome. The mRNA files included in this dataset give insight into the activity of genes in the organism.

    https://en.wikipedia.org/wiki/Messenger_RNA

    • mrna-genbank.fa

    This file includes all mRNA sequences from GenBank associated with Drosophila Melanogaster.

    http://www.ncbi.nlm.nih.gov/genbank/

    • mrna-refseq.fa

    This file includes all mRNA sequences from RefSeq associated with Drosophila Melanogaster.

    http://www.ncbi.nlm.nih.gov/refseq/

    Gene Predictions

    A gene is a segment of DNA on the genome which, through mRNA, is used to create proteins in the organism. Knowing which parts of DNA are coding (genes) or non-coding is difficult, and a number of different systems for prediction exist. This da...

  3. Protein Crystallography Services Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Sep 1, 2025
    Cite
    Growth Market Reports (2025). Protein Crystallography Services Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/protein-crystallography-services-market
    Available download formats: pdf, csv, pptx
    Dataset updated
    Sep 1, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Protein Crystallography Services Market Outlook



    According to our latest research, the global protein crystallography services market size reached USD 1.21 billion in 2024, reflecting robust demand across multiple end-user segments. The market is anticipated to grow at a CAGR of 8.4% from 2025 to 2033, propelled by technological advancements and the expanding applications of protein crystallography in drug discovery and structural biology. By 2033, the market is forecasted to attain a value of USD 2.51 billion. This growth trajectory is primarily driven by increasing investments in pharmaceutical R&D, the rising prevalence of chronic diseases necessitating novel therapeutics, and the integration of automation and artificial intelligence in structural biology workflows.
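The figures quoted above can be sanity-checked with the standard compound-growth formula. This is a minimal sketch that assumes the 8.4% rate compounds annually from the 2024 base over nine years (2024 to 2033). The report does not state its exact convention, so the helper `project_value` and the choice of nine periods are assumptions.

```python
def project_value(base, annual_rate, years):
    """Compound a base value forward at a constant annual growth rate."""
    return base * (1 + annual_rate) ** years


base_2024 = 1.21        # USD billion, as quoted in the description
cagr = 0.084            # 8.4% per year, as quoted
years = 2033 - 2024     # assumption: nine compounding periods

projected_2033 = project_value(base_2024, cagr, years)
print(f"Projected 2033 market size: USD {projected_2033:.2f} billion")
```

Under these assumptions the projection lands within about USD 0.01 billion of the quoted USD 2.51 billion, so the stated base, rate, and forecast are mutually consistent.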




    A key growth factor for the protein crystallography services market is the surging demand for structure-based drug design in the pharmaceutical and biotechnology sectors. Drug discovery processes have become increasingly reliant on high-resolution protein structures to identify, validate, and optimize drug targets. Protein crystallography, especially X-ray crystallography, remains the gold standard for elucidating atomic-level details of biomolecules, enabling the rational design of more effective and selective therapeutics. The growing pipeline of biologics and small-molecule drugs, coupled with the need to shorten drug development timelines, has led to a significant uptick in outsourcing crystallography services to specialized providers. These providers offer advanced instrumentation, experienced personnel, and comprehensive data analysis, allowing pharmaceutical companies to focus their resources on core competencies while accelerating their R&D initiatives.




    Another major driver is the rapid evolution of crystallography technologies, including the adoption of cryo-electron microscopy (cryo-EM), neutron crystallography, and state-of-the-art synchrotron facilities. These advancements have expanded the range of proteins and complexes amenable to structural analysis, including membrane proteins and large macromolecular assemblies that were previously challenging to crystallize. The integration of automation, robotics, and artificial intelligence into sample preparation, data collection, and structure determination has dramatically increased throughput and accuracy, reducing costs and turnaround times. Furthermore, collaborations between academic institutions, research organizations, and industry players have fostered innovation in crystallization techniques, data processing algorithms, and structural databases, further fueling market growth.




    The increasing prevalence of chronic and infectious diseases, such as cancer, diabetes, and emerging viral infections, has underscored the need for novel therapeutic targets and vaccines. Protein crystallography services play a pivotal role in the structural characterization of pathogenic proteins, antigen-antibody complexes, and enzyme-inhibitor interactions, facilitating the rational design of next-generation drugs and vaccines. Government initiatives to promote biomedical research, coupled with rising investments from venture capital and pharmaceutical giants, are creating a conducive environment for market expansion. Additionally, the emergence of personalized medicine and precision therapeutics is driving the demand for structural insights into patient-specific protein variants, further boosting the uptake of crystallography services globally.



    The role of Structural Bioinformatics Software is becoming increasingly pivotal in the field of protein crystallography. These software tools facilitate the modeling and simulation of protein structures, enabling researchers to predict molecular interactions and optimize crystallization conditions. By integrating structural bioinformatics with experimental data, scientists can enhance the accuracy of protein models and streamline the drug discovery process. The synergy between computational and experimental approaches is driving innovation in structural biology, allowing for more efficient identification of drug targets and the development of novel therapeutics. As the demand for high-resolution protein structures grows, the adoption of advanced bioinformatics software is expected to rise, further propelling the market forward.




    Regionally, North America con

  4. SARS-sequence-data

    • kaggle.com
    zip
    Updated Mar 31, 2020
    Cite
    Scott Powers (2020). SARS-sequence-data [Dataset]. https://www.kaggle.com/datasets/spowers/sarssequencedata
    Available download formats: zip (8071988 bytes)
    Dataset updated
    Mar 31, 2020
    Authors
    Scott Powers
    Description

    Context

    SARS-CoV-2 is the causative agent of the current global pandemic. SARS-CoV-2, also called the novel coronavirus, is related to both SARS and bat SARS. Many datasets related to this epidemic exist on Kaggle; however, genomics data had yet to be added. NCBI is an open repository of biomedical data, including sequencing data from laboratories around the world. Many sequences have been collected for all three families of viruses mentioned; however, the data is not presented in an easy-to-use format for data scientists. This dataset is a collection of those sequences, which will be updated periodically as new sequencing data is added.

    Content

    This dataset contains sequence data obtained from NCBI for various Coronaviridae. Of specific interest at this time are the causative agents of SARS and COVID-19 and the related family that causes bat SARS. The data specific to those three groups is contained in a CSV file along with the full text description and NCBI accession number. Additional information about each can be obtained by searching NCBI for the specific accession number.

    In addition to the CSV file, the original FASTA files for those sequences are included, along with another for related coronaviruses.

    Acknowledgements

    These FASTA files were collected using a script maintained by the BioStars Handbook authors. The actual sequence data has been generated by various research and clinical groups around the world dealing with infectious diseases.

    Inspiration

    The BioStars Handbook nCov Analysis text is a great starting point to look at these data from a general bioinformatics perspective. Of further interest, however, is how we can look beyond those methods and incorporate general data science techniques to gain more insight into these agents.

    Sequence similarity is a good place to start to understand the evolutionary history of these organisms. This is well studied in the literature; however, it can be useful as a starting point.

    For features, I would recommend looking into k-mer counts as well as one-hot encoding the sequence. To be one-hot encoded, the sequences might need to have their length padded, and the classic placeholder in bioinformatics is the character N.
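The two feature ideas above, k-mer counts and one-hot encoding with N padding, can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`kmer_counts`, `pad_sequences`, `one_hot`), the five-letter alphabet, and the toy sequences are hypothetical and do not come from the dataset.

```python
from collections import Counter

# Assumed alphabet: the four bases plus N as the padding/unknown symbol.
ALPHABET = "ACGTN"
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def pad_sequences(sequences, pad_char="N"):
    """Right-pad all sequences with pad_char to the length of the longest."""
    width = max(len(s) for s in sequences)
    return [s.upper().ljust(width, pad_char) for s in sequences]


def one_hot(sequence):
    """Encode each base as a 5-element 0/1 row; unknown symbols map to N."""
    rows = []
    for base in sequence:
        row = [0] * len(ALPHABET)
        row[INDEX.get(base, INDEX["N"])] = 1
        rows.append(row)
    return rows


# Toy sequences (made up for illustration).
toy = ["ACGTAC", "ACG"]
counts = kmer_counts(toy[0], 2)
padded = pad_sequences(toy)
encoded = [one_hot(s) for s in padded]

print(counts)
print(padded)
```

After padding, every encoded sequence has the same number of rows, so the result can be stacked into a fixed-size array for a model. Whether padding with N is appropriate depends on the downstream method.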

  5. Data from: BBGD: an online database for blueberry genomic data

    • data.wu.ac.at
    • agdatacommons.nal.usda.gov
    • +1 more
    html, xls
    Updated Dec 21, 2017
    Cite
    Department of Agriculture (2017). Data from: BBGD: an online database for blueberry genomic data [Dataset]. https://data.wu.ac.at/schema/data_gov/MmM3MTAyNTktNTYwMS00M2Q5LWI1OGEtNzFkNzA0NDkwYzEz
    Available download formats: html, xls
    Dataset updated
    Dec 21, 2017
    Dataset provided by
    Department of Agriculture
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset is supplemental to the article "BBGD: an online database for blueberry genomic data," (2007); it is titled "list of genes printed on microarray slides."

    The article, "BBGD: an online database for blueberry genomic data," (2007) involving blueberry cold hardiness experiments has a list of all the genes that were printed on microarray slides. This dataset, supplemental to the article, is called: "list of genes printed on microarray slides." 1471-2229-7-5-s1.xls 663k.
    By using the BBGD database, researchers developed EST-based markers for mapping, and have identified a number of "candidate" cold tolerance genes that are highly expressed in blueberry flower buds after exposure to low temperatures.

    BBGD (http://bioinformatics.towson.edu/BBGD/) is a public online database, and was developed for blueberry genomics. BBGD is both a sequence and gene expression database: it stores both EST and microarray data, and allows scientists to correlate expression profiles with gene function. Presently, the main focus of the database is the identification of genes in blueberry that are significantly induced or suppressed after low temperature exposure.

  6. DataSheet_1_Read Mapping and Transcript Assembly: A Scalable and...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    + more versions
    Cite
    Sateesh Peri; Sarah Roberts; Isabella R. Kreko; Lauren B. McHan; Alexandra Naron; Archana Ram; Rebecca L. Murphy; Eric Lyons; Brian D. Gregory; Upendra K. Devisetty; Andrew D. L. Nelson (2023). DataSheet_1_Read Mapping and Transcript Assembly: A Scalable and High-Throughput Workflow for the Processing and Analysis of Ribonucleic Acid Sequencing Data.pdf [Dataset]. http://doi.org/10.3389/fgene.2019.01361.s001
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Sateesh Peri; Sarah Roberts; Isabella R. Kreko; Lauren B. McHan; Alexandra Naron; Archana Ram; Rebecca L. Murphy; Eric Lyons; Brian D. Gregory; Upendra K. Devisetty; Andrew D. L. Nelson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Next-generation RNA-sequencing is an incredibly powerful means of generating a snapshot of the transcriptomic state within a cell, tissue, or whole organism. As the questions addressed by RNA-sequencing (RNA-seq) become both more complex and greater in number, there is a need to simplify RNA-seq processing workflows, make them more efficient and interoperable, and capable of handling both large and small datasets. This is especially important for researchers who need to process hundreds to tens of thousands of RNA-seq datasets. To address these needs, we have developed a scalable, user-friendly, and easily deployable analysis suite called RMTA (Read Mapping, Transcript Assembly). RMTA can easily process thousands of RNA-seq datasets with features that include automated read quality analysis, filters for lowly expressed transcripts, and read counting for differential expression analysis. RMTA is containerized using Docker for easy deployment within any compute environment [cloud, local, or high-performance computing (HPC)] and is available as two apps in CyVerse's Discovery Environment, one for normal use and one specifically designed for introducing undergraduates and high school students to RNA-seq analysis. For extremely large datasets (tens of thousands of FASTQ files) we developed a high-throughput, scalable, and parallelized version of RMTA optimized for launching on the Open Science Grid (OSG) from within the Discovery Environment. OSG-RMTA allows users to utilize the Discovery Environment for data management, parallelization, and submitting jobs to OSG, and finally, employ the OSG for distributed, high throughput computing. Alternatively, OSG-RMTA can be run directly on the OSG through the command line. RMTA is designed to be useful for data scientists of any skill level interested in rapidly and reproducibly analyzing their large RNA-seq data sets.

  7. ClinVar_BRCA_Mutation_Filtering_Ensembl_VEP

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). ClinVar_BRCA_Mutation_Filtering_Ensembl_VEP [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/clinvar-brca-mutation-filtering-ensembl-vep
    Available download formats: zip (3655498 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset focuses on BRCA gene mutations linked to breast and ovarian cancer.

    It contains mutation information sourced from ClinVar, a public database of clinically relevant variants.

    Variants are filtered and annotated using Ensembl Variant Effect Predictor (VEP).

    The dataset includes information about mutation types, clinical significance, and genomic coordinates.

    It is suitable for bioinformatics analysis, variant interpretation, and cancer research.

    Researchers and data scientists can use this dataset to explore pathogenicity of BRCA variants.

    The dataset supports reproducible workflows for variant filtering and annotation in Python.

  8. Median science identity and intent to pursue bioinformatics for the Virtual...

    • plos.figshare.com
    xls
    Updated Feb 27, 2024
    Cite
    Niquo Ceberio; Peter Le; Jasmón Bailey; Sonthonax Vernard; Nichole Coleman; Yazmin P. Carrasco; Telisa King; Kirsten Bibbins-Domingo; Tung Nguyen; Audrey Parangan-Smith; Kelechi Uwaezuoke; Robert C. Rivers; Kenjus Watson; Leticia Márquez-Magaña; Kala M. Mehta (2024). Median science identity and intent to pursue bioinformatics for the Virtual BUILD Research Collaboratory 2020. [Dataset]. http://doi.org/10.1371/journal.pone.0294307.t004
    Available download formats: xls
    Dataset updated
    Feb 27, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Niquo Ceberio; Peter Le; Jasmón Bailey; Sonthonax Vernard; Nichole Coleman; Yazmin P. Carrasco; Telisa King; Kirsten Bibbins-Domingo; Tung Nguyen; Audrey Parangan-Smith; Kelechi Uwaezuoke; Robert C. Rivers; Kenjus Watson; Leticia Márquez-Magaña; Kala M. Mehta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Median science identity and intent to pursue bioinformatics for the Virtual BUILD Research Collaboratory 2020.

  9. TreeCluster: Clustering biological sequences using phylogenetic trees

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Jun 2, 2023
    Cite
    Metin Balaban; Niema Moshiri; Uyen Mai; Xingfan Jia; Siavash Mirarab (2023). TreeCluster: Clustering biological sequences using phylogenetic trees [Dataset]. http://doi.org/10.1371/journal.pone.0221068
    Available download formats: pdf
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Metin Balaban; Niema Moshiri; Uyen Mai; Xingfan Jia; Siavash Mirarab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints, limiting (1) the diameter of each cluster, (2) the sum of its branch lengths, or (3) chains of pairwise distances. These three problems can be solved in time that increases linearly with the size of the tree, and for two of the three criteria, the algorithms have been known in the theoretical computer science literature. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
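To make the first constraint concrete, the sketch below checks whether a proposed clustering respects a diameter threshold, where a cluster's diameter is the largest path distance between any two of its leaves on the tree. It is a brute-force check written for illustration only. It is not the linear-time algorithm from the paper and does not use the TreeCluster tool or its API. The toy tree, the function names, and the threshold value are all assumptions.

```python
# Hypothetical toy tree as an undirected adjacency map:
# node -> {neighbor: branch length}. Leaves are A, B, C, D.
TREE = {
    "root": {"x": 0.01, "y": 0.04},
    "x": {"root": 0.01, "A": 0.01, "B": 0.02},
    "y": {"root": 0.04, "C": 0.01, "D": 0.01},
    "A": {"x": 0.01},
    "B": {"x": 0.02},
    "C": {"y": 0.01},
    "D": {"y": 0.01},
}


def distances_from(tree, start):
    """Path length from start to every node (paths in a tree are unique)."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor, length in tree[node].items():
            if neighbor not in dist:
                dist[neighbor] = dist[node] + length
                stack.append(neighbor)
    return dist


def cluster_diameter(tree, leaves):
    """Largest pairwise path distance among the given leaves."""
    best = 0.0
    for leaf in leaves:
        dist = distances_from(tree, leaf)
        best = max(best, max(dist[other] for other in leaves))
    return best


def satisfies_diameter(tree, clusters, threshold):
    """True if every cluster's diameter is at most the threshold."""
    return all(cluster_diameter(tree, c) <= threshold for c in clusters)


threshold = 0.045  # illustrative value only
print(satisfies_diameter(TREE, [["A", "B"], ["C", "D"]], threshold))
print(satisfies_diameter(TREE, [["A", "B", "C", "D"]], threshold))
```

On this toy tree, splitting the leaves into {A, B} and {C, D} keeps both diameters under the threshold, while a single cluster of all four leaves does not, because the path from B to C or D crosses the long branch to `y`.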

  10. Number of singleton clusters (σ), total number of clusters (Σ), and maximum...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 2, 2023
    Cite
    Metin Balaban; Niema Moshiri; Uyen Mai; Xingfan Jia; Siavash Mirarab (2023). Number of singleton clusters (σ), total number of clusters (Σ), and maximum cluster size (max) for TreeCluster and GreenGenes for various thresholds. [Dataset]. http://doi.org/10.1371/journal.pone.0221068.t001
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Metin Balaban; Niema Moshiri; Uyen Mai; Xingfan Jia; Siavash Mirarab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the Greengenes database, OTU definitions for thresholds α = 0.015 and α = 0.045 are not available.
