10 datasets found

Supplementary Table 3. Knowledge and attitudes among life scientists towards...
figshare.com
pdf
Updated Aug 11, 2020
Cite
Evanthia Kaimaklioti Samota (2020). Supplementary Table 3. Knowledge and attitudes among life scientists towards reproducibility within journal articles: a research survey. [Dataset]. http://doi.org/10.6084/m9.figshare.7706753.v3
Available download formats
pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.7706753.v3
Dataset updated
Aug 11, 2020
Dataset provided by
Figshare (http://figshare.com/)
Authors
Evanthia Kaimaklioti Samota
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary Table 3. Profiles of individuals in the eLIFE cohort who replied "No have never tried reproducing any published results", stratified by how they responded to each of questions 13, 15, 16 and 17.
Drosophila Melanogaster Genome
kaggle.com
ieee-dataport.org
zip
Updated Nov 17, 2019
Cite
Myles O'Neill (2019). Drosophila Melanogaster Genome [Dataset]. https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome
Available download formats
zip (136202106 bytes)
Dataset updated
Nov 17, 2019
Authors
Myles O'Neill
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Drosophila Melanogaster

Drosophila Melanogaster, the common fruit fly, is a model organism which has been extensively used in entomological research. It is one of the most studied organisms in biological research, particularly in genetics and developmental biology.

When it's not being used for scientific research, D. melanogaster is a common pest in homes, restaurants, and anywhere else that serves food. They are not to be confused with Tephritidae flies (also known as fruit flies).

https://en.wikipedia.org/wiki/Drosophila_melanogaster

About the Genome

This genome was first sequenced in 2000. It contains four pairs of chromosomes (2,3,4 and X/Y). More than 60% of the genome appears to be functional non-protein-coding DNA.

![D. melanogaster chromosomes][1]

The genome is maintained and frequently updated at [FlyBase][2]. This dataset is sourced from the UCSC Genome Bioinformatics download page. It uses the August 2014 version of the D. melanogaster genome (dm6, BDGP Release 6 + ISO1 MT). http://hgdownload.soe.ucsc.edu/downloads.html#fruitfly

Files were modified by Kaggle to be a better fit for analysis on Scripts. This primarily involved turning files into CSV format, with a header row, as well as converting the genome itself from 2bit format into a FASTA sequence file.

Bioinformatics

Genomic analysis can be daunting to data scientists who haven't had much experience with bioinformatics before. We have tried to give basic explanations for each of the files in this dataset, as well as links to further reading on the biological basis for each. If you haven't had the chance to study much biology before, some light reading (e.g. Wikipedia) on the following topics may help you understand the nuances of the data provided here: [Genetics][3], [Genomics][4], [Chromosomes][7], [DNA][8], [RNA][9], [Genes][12], [Alleles][13], [Exons][14], [Introns][15], [Transcription][16], [Translation][17], [Peptides][18], [Proteins][19], [Gene Regulation][20], [Mutation][21], [Phylogenetics][22], and [SNPs][23].

Of course, if you've got some idea of the basics already - don't be afraid to jump right in!

Learning Bioinformatics

There are a lot of great resources for learning bioinformatics on the web. One cool site is [Rosalind][24] - a platform that gives you bioinformatic coding challenges to complete. You can use Kaggle Scripts on this dataset to easily complete the challenges on Rosalind (and see [Myles' solutions here][25] if you get stuck). We have set up [Biopython][26] on Kaggle's docker image which is a great library to help you with your analyses. Check out their [tutorial here][27] and we've also created [a python notebook with some of the tutorial applied to this dataset][28] as a reference.

Files in this Dataset

Drosophila Melanogaster Genome

genome.fa

The assembled genome itself is presented here in [FASTA format][29]. Each chromosome is a different sequence of nucleotides. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.
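
As a rough illustration of the lower-case repeat convention described above, the sketch below reads FASTA records and reports the fraction of lower-case (soft-masked) bases in each sequence. It uses only the Python standard library rather than Biopython. The function names and the tiny in-memory records (`chrDemo1`, `chrDemo2`) are hypothetical; with the real dataset you would open genome.fa instead.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from a FASTA-formatted text handle."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def soft_masked_fraction(sequence: str) -> float:
    """Fraction of bases written in lower case, i.e. flagged as repeats."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence if base.islower()) / len(sequence)


def repeat_summary(handle: TextIO) -> Dict[str, float]:
    """Map each record id to its soft-masked fraction."""
    return {name: soft_masked_fraction(seq) for name, seq in read_fasta(handle)}


# Hypothetical two-record example standing in for genome.fa.
example = io.StringIO(">chrDemo1\nACGTacgt\nACGT\n>chrDemo2\nnnnnACGT\n")
summary = repeat_summary(example)
print(summary)
```

Reading the file record by record keeps the logic simple; for a whole genome you may prefer to accumulate counts line by line instead of joining each chromosome into one string.
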
Meta Information

There are 3 additional files with meta information about the genome.

meta-cpg-island-ext-unmasked.csv

This file contains descriptive information about CpG Islands in the genome.

https://en.wikipedia.org/wiki/CpG_site

meta-cytoband.csv

This file describes the positions of cytogenetic bands on each chromosome.

https://en.wikipedia.org/wiki/Cytogenetics

meta-simple-repeat.csv

This file describes simple tandem repeats in the genome.

https://en.wikipedia.org/wiki/Repeated_sequence_(DNA) https://en.wikipedia.org/wiki/Tandem_repeat
Drosophila Melanogaster mRNA Sequences

Messenger RNA (mRNA) is an intermediate molecule created as part of the cellular process of converting genomic information into proteins. Some mRNA are never translated into proteins and have functional roles in the cell on their own. Collectively, an organism's mRNA is known as its transcriptome. The mRNA files included in this dataset give insight into the activity of genes in the organism.

https://en.wikipedia.org/wiki/Messenger_RNA

mrna-genbank.fa

This file includes all mRNA sequences from GenBank associated with Drosophila Melanogaster.

http://www.ncbi.nlm.nih.gov/genbank/

mrna-refseq.fa

This file includes all mRNA sequences from RefSeq associated with Drosophila Melanogaster.

http://www.ncbi.nlm.nih.gov/refseq/
Gene Predictions

A gene is a segment of DNA on the genome which, through mRNA, is used to create proteins in the organism. Knowing which parts of DNA are coding (genes) or non-coding is difficult, and a number of different systems for prediction exist. This da...
Protein Crystallography Services Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Sep 1, 2025
Cite
Growth Market Reports (2025). Protein Crystallography Services Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/protein-crystallography-services-market
Available download formats
pdf, csv, pptx
Dataset updated
Sep 1, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Protein Crystallography Services Market Outlook

According to our latest research, the global protein crystallography services market size reached USD 1.21 billion in 2024, reflecting robust demand across multiple end-user segments. The market is anticipated to grow at a CAGR of 8.4% from 2025 to 2033, propelled by technological advancements and the expanding applications of protein crystallography in drug discovery and structural biology. By 2033, the market is forecasted to attain a value of USD 2.51 billion. This growth trajectory is primarily driven by increasing investments in pharmaceutical R&D, the rising prevalence of chronic diseases necessitating novel therapeutics, and the integration of automation and artificial intelligence in structural biology workflows.
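
The figures quoted above can be sanity-checked with the standard compound-growth formula, value = base × (1 + rate)^periods. The sketch below assumes nine compounding periods (the 2024 base year through 2033); that period count and the function name are assumptions for illustration, not something the report states.

```python
def project_value(base_value: float, annual_rate: float, periods: int) -> float:
    """Compound a base value at a constant annual growth rate."""
    return base_value * (1.0 + annual_rate) ** periods


base_2024 = 1.21        # USD billion, as quoted in the report summary
cagr = 0.084            # 8.4% per year, as quoted
periods = 2033 - 2024   # assumption: growth compounds over nine years

projected_2033 = project_value(base_2024, cagr, periods)
print(f"Projected 2033 market size: USD {projected_2033:.2f} billion")
```

Under that assumption the projection comes out at roughly USD 2.50 billion, which is consistent with the stated USD 2.51 billion once rounding of the inputs is taken into account.
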

A key growth factor for the protein crystallography services market is the surging demand for structure-based drug design in the pharmaceutical and biotechnology sectors. Drug discovery processes have become increasingly reliant on high-resolution protein structures to identify, validate, and optimize drug targets. Protein crystallography, especially X-ray crystallography, remains the gold standard for elucidating atomic-level details of biomolecules, enabling the rational design of more effective and selective therapeutics. The growing pipeline of biologics and small-molecule drugs, coupled with the need to shorten drug development timelines, has led to a significant uptick in outsourcing crystallography services to specialized providers. These providers offer advanced instrumentation, experienced personnel, and comprehensive data analysis, allowing pharmaceutical companies to focus their resources on core competencies while accelerating their R&D initiatives.

Another major driver is the rapid evolution of crystallography technologies, including the adoption of cryo-electron microscopy (cryo-EM), neutron crystallography, and state-of-the-art synchrotron facilities. These advancements have expanded the range of proteins and complexes amenable to structural analysis, including membrane proteins and large macromolecular assemblies that were previously challenging to crystallize. The integration of automation, robotics, and artificial intelligence into sample preparation, data collection, and structure determination has dramatically increased throughput and accuracy, reducing costs and turnaround times. Furthermore, collaborations between academic institutions, research organizations, and industry players have fostered innovation in crystallization techniques, data processing algorithms, and structural databases, further fueling market growth.

The increasing prevalence of chronic and infectious diseases, such as cancer, diabetes, and emerging viral infections, has underscored the need for novel therapeutic targets and vaccines. Protein crystallography services play a pivotal role in the structural characterization of pathogenic proteins, antigen-antibody complexes, and enzyme-inhibitor interactions, facilitating the rational design of next-generation drugs and vaccines. Government initiatives to promote biomedical research, coupled with rising investments from venture capital and pharmaceutical giants, are creating a conducive environment for market expansion. Additionally, the emergence of personalized medicine and precision therapeutics is driving the demand for structural insights into patient-specific protein variants, further boosting the uptake of crystallography services globally.

The role of Structural Bioinformatics Software is becoming increasingly pivotal in the field of protein crystallography. These software tools facilitate the modeling and simulation of protein structures, enabling researchers to predict molecular interactions and optimize crystallization conditions. By integrating structural bioinformatics with experimental data, scientists can enhance the accuracy of protein models and streamline the drug discovery process. The synergy between computational and experimental approaches is driving innovation in structural biology, allowing for more efficient identification of drug targets and the development of novel therapeutics. As the demand for high-resolution protein structures grows, the adoption of advanced bioinformatics software is expected to rise, further propelling the market forward.

Regionally, North America con
SARS-sequence-data
kaggle.com
zip
Updated Mar 31, 2020
Cite
Scott Powers (2020). SARS-sequence-data [Dataset]. https://www.kaggle.com/datasets/spowers/sarssequencedata
Available download formats
zip (8071988 bytes)
Dataset updated
Mar 31, 2020
Authors
Scott Powers
Description
Context

SARS-CoV-2 is the causative agent of the current global pandemic. SARS-CoV-2, also called the novel coronavirus, is related to both SARS and bat SARS. Many datasets related to this epidemic exist on Kaggle; however, genomics data had yet to be added. NCBI is an open repository of biomedical data, including sequencing data from laboratories around the world. Many sequences have been collected for all three families of viruses mentioned; however, the data is presented in an easy-to-use format for data scientists. This dataset is a collection of those sequences, which will be updated periodically as new sequencing data is added.

Content

This dataset contains sequence data obtained from NCBI for various Coronaviridae. Of specific interest at this time are the causative agents of SARS and COVID-19 and the related family that causes bat SARS. The data specific to those three groups is contained in a CSV file, along with the full text description and NCBI accession number. Additional information about each sequence can be obtained by searching NCBI for the specific accession number.

In addition to the CSV file, the dataset includes the original FASTA files for those sequence data, along with another for related coronaviruses.

Acknowledgements

These FASTA files were collected using a script maintained by the BioStars Handbook authors. The actual sequence data has been generated by various research and clinical groups around the world dealing with infectious diseases.

Inspiration

The BioStars Handbook nCov Analysis text is a great starting point to look at these data from a general bioinformatics perspective. Of further interest, however, is how we can look beyond those methods and incorporate general data science techniques to gain more insight into these agents.

Sequence similarity is a good place to start to understand the evolutionary history of these organisms. This is well studied in the literature, but it can still be useful as a starting point.

For features, I would recommend looking into k-mer counts as well as one-hot encoding the sequence. To help with one-hot encoding, the sequences might need to have their lengths padded; the classic placeholder in bioinformatics is the character N.
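
The two feature ideas above can be sketched in a few lines of standard-library Python. The function names are hypothetical, and two choices here are assumptions rather than anything the dataset prescribes: padding is applied on the right, and N (or any non-ACGT character) is encoded as an all-zero row.

```python
from collections import Counter
from typing import List

ALPHABET = "ACGT"  # assumption: N and other symbols become an all-zero row


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def pad_sequence(sequence: str, length: int, pad_char: str = "N") -> str:
    """Right-pad with pad_char up to `length` (longer inputs are truncated)."""
    return sequence.upper().ljust(length, pad_char)[:length]


def one_hot(sequence: str) -> List[List[int]]:
    """Encode each base as a 4-element indicator row over ALPHABET."""
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in sequence.upper()]


# Hypothetical toy sequences standing in for records from the FASTA files.
sequences = ["ACGTAC", "GGT"]
target_length = max(len(s) for s in sequences)
padded = [pad_sequence(s, target_length) for s in sequences]
encoded = [one_hot(s) for s in padded]
counts = kmer_counts(sequences[0], 2)

print(padded)
print(counts.most_common(1))
```

After padding, every encoded sequence has the same number of rows, so the results can be stacked into a single array for a model. The k-mer counts do not need padding because they are already fixed-length once a vocabulary of k-mers is chosen.
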
Data from: BBGD: an online database for blueberry genomic data
data.wu.ac.at
agdatacommons.nal.usda.gov
+1 more
html, xls
Updated Dec 21, 2017
Cite
Department of Agriculture (2017). Data from: BBGD: an online database for blueberry genomic data [Dataset]. https://data.wu.ac.at/schema/data_gov/MmM3MTAyNTktNTYwMS00M2Q5LWI1OGEtNzFkNzA0NDkwYzEz
Available download formats
html, xls
Dataset updated
Dec 21, 2017
Dataset provided by
Department of Agriculture
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset is supplemental to the article "BBGD: an online database for blueberry genomic data," (2007); it is titled "list of genes printed on microarray slides."

The article, "BBGD: an online database for blueberry genomic data," (2007) involving blueberry cold hardiness experiments has a list of all the genes that were printed on microarray slides. This dataset, supplemental to the article, is called: "list of genes printed on microarray slides." 1471-2229-7-5-s1.xls 663k.
By using the BBGD database, researchers developed EST-based markers for mapping, and have identified a number of "candidate" cold tolerance genes that are highly expressed in blueberry flower buds after exposure to low temperatures.

BBGD (http://bioinformatics.towson.edu/BBGD/) is a public online database, and was developed for blueberry genomics. BBGD is both a sequence and gene expression database: it stores both EST and microarray data, and allows scientists to correlate expression profiles with gene function. Presently, the main focus of the database is the identification of genes in blueberry that are significantly induced or suppressed after low temperature exposure.
DataSheet_1_Read Mapping and Transcript Assembly: A Scalable and...
frontiersin.figshare.com
pdf
Updated Jun 1, 2023
+ more versions
Cite
Sateesh Peri; Sarah Roberts; Isabella R. Kreko; Lauren B. McHan; Alexandra Naron; Archana Ram; Rebecca L. Murphy; Eric Lyons; Brian D. Gregory; Upendra K. Devisetty; Andrew D. L. Nelson (2023). DataSheet_1_Read Mapping and Transcript Assembly: A Scalable and High-Throughput Workflow for the Processing and Analysis of Ribonucleic Acid Sequencing Data.pdf [Dataset]. http://doi.org/10.3389/fgene.2019.01361.s001
Available download formats
pdf
Unique identifier
https://doi.org/10.3389/fgene.2019.01361.s001
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Sateesh Peri; Sarah Roberts; Isabella R. Kreko; Lauren B. McHan; Alexandra Naron; Archana Ram; Rebecca L. Murphy; Eric Lyons; Brian D. Gregory; Upendra K. Devisetty; Andrew D. L. Nelson
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Next-generation RNA-sequencing is an incredibly powerful means of generating a snapshot of the transcriptomic state within a cell, tissue, or whole organism. As the questions addressed by RNA-sequencing (RNA-seq) become both more complex and greater in number, there is a need to simplify RNA-seq processing workflows, make them more efficient and interoperable, and capable of handling both large and small datasets. This is especially important for researchers who need to process hundreds to tens of thousands of RNA-seq datasets. To address these needs, we have developed a scalable, user-friendly, and easily deployable analysis suite called RMTA (Read Mapping, Transcript Assembly). RMTA can easily process thousands of RNA-seq datasets with features that include automated read quality analysis, filters for lowly expressed transcripts, and read counting for differential expression analysis. RMTA is containerized using Docker for easy deployment within any compute environment [cloud, local, or high-performance computing (HPC)] and is available as two apps in CyVerse's Discovery Environment, one for normal use and one specifically designed for introducing undergraduates and high school students to RNA-seq analysis. For extremely large datasets (tens of thousands of FASTQ files) we developed a high-throughput, scalable, and parallelized version of RMTA optimized for launching on the Open Science Grid (OSG) from within the Discovery Environment. OSG-RMTA allows users to utilize the Discovery Environment for data management, parallelization, and submitting jobs to OSG, and finally, employ the OSG for distributed, high throughput computing. Alternatively, OSG-RMTA can be run directly on the OSG through the command line. RMTA is designed to be useful for data scientists, of any skill level, interested in rapidly and reproducibly analyzing their large RNA-seq data sets.
ClinVar_BRCA_Mutation_Filtering_Ensembl_VEP
kaggle.com
zip
Updated Nov 29, 2025
Cite
Dr. Nagendra (2025). ClinVar_BRCA_Mutation_Filtering_Ensembl_VEP [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/clinvar-brca-mutation-filtering-ensembl-vep
Available download formats
zip (3655498 bytes)
Dataset updated
Nov 29, 2025
Authors
Dr. Nagendra
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset focuses on BRCA gene mutations linked to breast and ovarian cancer.

It contains mutation information sourced from ClinVar, a public database of clinically relevant variants.

Variants are filtered and annotated using Ensembl Variant Effect Predictor (VEP).

The dataset includes information about mutation types, clinical significance, and genomic coordinates.

It is suitable for bioinformatics analysis, variant interpretation, and cancer research.

Researchers and data scientists can use this dataset to explore pathogenicity of BRCA variants.

The dataset supports reproducible workflows for variant filtering and annotation in Python.
Median science identity and intent to pursue bioinformatics for the Virtual...
plos.figshare.com
xls
Updated Feb 27, 2024
Cite
Niquo Ceberio; Peter Le; Jasmón Bailey; Sonthonax Vernard; Nichole Coleman; Yazmin P. Carrasco; Telisa King; Kirsten Bibbins-Domingo; Tung Nguyen; Audrey Parangan-Smith; Kelechi Uwaezuoke; Robert C. Rivers; Kenjus Watson; Leticia Márquez-Magaña; Kala M. Mehta (2024). Median science identity and intent to pursue bioinformatics for the Virtual BUILD Research Collaboratory 2020. [Dataset]. http://doi.org/10.1371/journal.pone.0294307.t004
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0294307.t004
Dataset updated
Feb 27, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Niquo Ceberio; Peter Le; Jasmón Bailey; Sonthonax Vernard; Nichole Coleman; Yazmin P. Carrasco; Telisa King; Kirsten Bibbins-Domingo; Tung Nguyen; Audrey Parangan-Smith; Kelechi Uwaezuoke; Robert C. Rivers; Kenjus Watson; Leticia Márquez-Magaña; Kala M. Mehta
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Median science identity and intent to pursue bioinformatics for the Virtual BUILD Research Collaboratory 2020.
TreeCluster: Clustering biological sequences using phylogenetic trees
plos.figshare.com
datasetcatalog.nlm.nih.gov
pdf
Updated Jun 2, 2023
Cite
Metin Balaban; Niema Moshiri; Uyen Mai; Xingfan Jia; Siavash Mirarab (2023). TreeCluster: Clustering biological sequences using phylogenetic trees [Dataset]. http://doi.org/10.1371/journal.pone.0221068
Available download formats
pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0221068
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Metin Balaban; Niema Moshiri; Uyen Mai; Xingfan Jia; Siavash Mirarab
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints, limiting (1) the diameter of each cluster, (2) the sum of its branch lengths, or (3) chains of pairwise distances. These three problems can be solved in time that increases linearly with the size of the tree, and for two of the three criteria, the algorithms have been known in the theoretical computer science literature. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
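
To make the first constraint (a limit on the diameter of each cluster) concrete, here is a minimal greedy sketch in Python. It is not the TreeCluster implementation and makes no claim to match its behavior or its linear running time. It assumes a rooted tree given as a plain dictionary mapping each internal node to its children and branch lengths, visits nodes bottom-up, and cuts the deepest branch whenever the two deepest branches below a node would put two leaves further apart than the threshold. The data structure, function name, and toy tree are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed input format: internal node -> [(child, branch length), ...].
# Nodes that do not appear as keys are treated as leaves.
Tree = Dict[str, List[Tuple[str, float]]]


def max_diameter_clusters(tree: Tree, root: str, threshold: float) -> List[List[str]]:
    """Greedily split leaves into clusters whose pairwise path length <= threshold."""
    clusters: List[List[str]] = []

    def visit(node: str) -> Tuple[float, List[str]]:
        children = tree.get(node, [])
        if not children:
            return 0.0, [node]
        # For each child: (distance from `node` to its deepest attached leaf, leaves).
        branches = []
        for child, length in children:
            depth, leaves = visit(child)
            branches.append((depth + length, leaves))
        branches.sort(key=lambda item: item[0], reverse=True)
        # If the two deepest branches together exceed the threshold,
        # detach the deepest one as its own cluster and re-check.
        while len(branches) > 1 and branches[0][0] + branches[1][0] > threshold:
            _, cut_leaves = branches.pop(0)
            clusters.append(cut_leaves)
        kept = [leaf for _, leaves in branches for leaf in leaves]
        return branches[0][0], kept

    _, remaining = visit(root)
    clusters.append(remaining)
    return clusters


# Hypothetical toy tree: leaves c and d are close together, leaf e is distant.
toy_tree: Tree = {
    "root": [("a", 1.0), ("e", 5.0)],
    "a": [("c", 1.0), ("d", 1.0)],
}
tight = max_diameter_clusters(toy_tree, "root", threshold=3.0)
loose = max_diameter_clusters(toy_tree, "root", threshold=10.0)
print(tight)
print(loose)
```

With the tighter threshold the distant leaf is split off into its own cluster, while the looser threshold keeps all three leaves together. The sort at each node makes this sketch slightly slower than linear on trees with many children per node, and the recursion would need to be made iterative for very deep trees.
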
Number of singleton clusters (σ), total number of clusters (Σ), and maximum...
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jun 2, 2023
Cite
Metin Balaban; Niema Moshiri; Uyen Mai; Xingfan Jia; Siavash Mirarab (2023). Number of singleton clusters (σ), total number of clusters (Σ), and maximum cluster size (max) for TreeCluster and GreenGenes for various thresholds. [Dataset]. http://doi.org/10.1371/journal.pone.0221068.t001
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0221068.t001
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Metin Balaban; Niema Moshiri; Uyen Mai; Xingfan Jia; Siavash Mirarab
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In the Greengenes database, OTU definitions for thresholds α = 0.015 and α = 0.045 are not available.