Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary Table 3. Profiles of individuals in the eLIFE cohort who replied "No have never tried reproducing any published results" stratified by how they responded to each of questions 13, 15, 16 and 17.
https://creativecommons.org/publicdomain/zero/1.0/
Drosophila melanogaster, the common fruit fly, is a model organism that has been used extensively in entomological research. It is one of the most studied organisms in biological research, particularly in genetics and developmental biology.
When it's not being used for scientific research, D. melanogaster is a common pest in homes, restaurants, and anywhere else that serves food. It is not to be confused with Tephritidae flies (also known as fruit flies).
https://en.wikipedia.org/wiki/Drosophila_melanogaster
This genome was first sequenced in 2000. It contains four pairs of chromosomes (2,3,4 and X/Y). More than 60% of the genome appears to be functional non-protein-coding DNA.
![D. melanogaster chromosomes][1]
The genome is maintained and frequently updated at [FlyBase][2]. This dataset is sourced from the UCSC Genome Bioinformatics download page. It uses the August 2014 version of the D. melanogaster genome (dm6, BDGP Release 6 + ISO1 MT). http://hgdownload.soe.ucsc.edu/downloads.html#fruitfly
Files were modified by Kaggle to be a better fit for analysis on Scripts. This primarily involved turning files into CSV format, with a header row, as well as converting the genome itself from 2bit format into a FASTA sequence file.
Genomic analysis can be daunting to data scientists who haven't had much experience with bioinformatics before. We have tried to give basic explanations of each of the files in this dataset, as well as links to further reading on the biological basis for each. If you haven't had the chance to study much biology before, some light reading (e.g. Wikipedia) on the following topics may be helpful to understand the nuances of the data provided here: [Genetics][3], [Genomics][4], [Chromosomes][7], [DNA][8], [RNA][9], [Genes][12], [Alleles][13], [Exons][14], [Introns][15], [Transcription][16], [Translation][17], [Peptides][18], [Proteins][19], [Gene Regulation][20], [Mutation][21], [Phylogenetics][22], and [SNPs][23].
Of course, if you've got some idea of the basics already - don't be afraid to jump right in!
There are a lot of great resources for learning bioinformatics on the web. One cool site is [Rosalind][24], a platform that gives you bioinformatics coding challenges to complete. You can use Kaggle Scripts on this dataset to easily complete the challenges on Rosalind (and see [Myles' solutions here][25] if you get stuck). We have set up [Biopython][26], a great library to help you with your analyses, on Kaggle's Docker image. Check out their [tutorial here][27]; we've also created [a Python notebook with some of the tutorial applied to this dataset][28] as a reference.
Drosophila Melanogaster Genome
The assembled genome itself is presented here in [FASTA format][29]. Each chromosome is a different sequence of nucleotides. Repeats from RepeatMasker and Tandem Repeats Finder (with a period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.
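The lower-case convention described above can be used to estimate how much of each sequence is flagged as repeats. The following is a minimal sketch using only the Python standard library; the helper names `read_fasta` and `soft_masked_fraction` and the toy records are assumptions made for illustration, not part of the dataset.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def soft_masked_fraction(sequence):
    """Fraction of bases written in lower case (flagged as repeats)."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence if base.islower()) / len(sequence)


# Toy records for illustration only; this is not real dm6 sequence.
toy = io.StringIO(">chrToy1\nACGTacgtAC\nGT\n>chrToy2\nttttGGGG\n")
fractions = {name: soft_masked_fraction(seq) for name, seq in read_fasta(toy)}
```

To run this on the real genome, the `io.StringIO` object would be replaced by an open file handle on the FASTA file. Biopython's `SeqIO.parse` could be used in place of the hand-written parser.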
Meta Information
There are 3 additional files with meta information about the genome.
This file contains descriptive information about CpG Islands in the genome.
https://en.wikipedia.org/wiki/CpG_site
This file describes the positions of cytogenetic bands on each chromosome.
https://en.wikipedia.org/wiki/Cytogenetics
This file describes simple tandem repeats in the genome.
https://en.wikipedia.org/wiki/Repeated_sequence_(DNA) https://en.wikipedia.org/wiki/Tandem_repeat
Drosophila Melanogaster mRNA Sequences
Messenger RNA (mRNA) is an intermediate molecule created as part of the cellular process of converting genomic information into proteins. Some mRNAs are never translated into proteins and have functional roles in the cell on their own. Collectively, an organism's mRNA is known as its transcriptome. The mRNA files included in this dataset give insight into the activity of genes in the organism.
https://en.wikipedia.org/wiki/Messenger_RNA
This file includes all mRNA sequences from GenBank associated with Drosophila melanogaster.
http://www.ncbi.nlm.nih.gov/genbank/
This file includes all mRNA sequences from RefSeq associated with Drosophila melanogaster.
http://www.ncbi.nlm.nih.gov/refseq/
Gene Predictions
A gene is a segment of DNA on the genome which, through mRNA, is used to create proteins in the organism. Knowing which parts of DNA are coding (genes) or non-coding is difficult, and a number of different systems for prediction exist. This da...
According to our latest research, the global protein crystallography services market size reached USD 1.21 billion in 2024, reflecting robust demand across multiple end-user segments. The market is anticipated to grow at a CAGR of 8.4% from 2025 to 2033, propelled by technological advancements and the expanding applications of protein crystallography in drug discovery and structural biology. By 2033, the market is forecasted to attain a value of USD 2.51 billion. This growth trajectory is primarily driven by increasing investments in pharmaceutical R&D, the rising prevalence of chronic diseases necessitating novel therapeutics, and the integration of automation and artificial intelligence in structural biology workflows.
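The figures quoted above can be checked for internal consistency with the standard compound-growth formula. This is a small sketch; the function names are hypothetical, and it assumes nine compounding periods from the 2024 base value to 2033, which the text does not state explicitly.

```python
def project(value, rate, years):
    """Compound a starting value at a fixed annual rate."""
    return value * (1 + rate) ** years


def cagr(start, end, years):
    """Compound annual growth rate implied by a start and end value."""
    return (end / start) ** (1 / years) - 1


# Assumption: 2024 is the base year, so 2033 is nine periods later.
periods = 2033 - 2024
projected_2033 = project(1.21, 0.084, periods)  # USD billion
implied_rate = cagr(1.21, 2.51, periods)
```

Under that assumption the projection comes to about USD 2.50 billion, and the rate implied by the two endpoints is close to 8.4%, so the quoted numbers agree to within rounding.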
A key growth factor for the protein crystallography services market is the surging demand for structure-based drug design in the pharmaceutical and biotechnology sectors. Drug discovery processes have become increasingly reliant on high-resolution protein structures to identify, validate, and optimize drug targets. Protein crystallography, especially X-ray crystallography, remains the gold standard for elucidating atomic-level details of biomolecules, enabling the rational design of more effective and selective therapeutics. The growing pipeline of biologics and small-molecule drugs, coupled with the need to shorten drug development timelines, has led to a significant uptick in outsourcing crystallography services to specialized providers. These providers offer advanced instrumentation, experienced personnel, and comprehensive data analysis, allowing pharmaceutical companies to focus their resources on core competencies while accelerating their R&D initiatives.
Another major driver is the rapid evolution of crystallography technologies, including the adoption of cryo-electron microscopy (cryo-EM), neutron crystallography, and state-of-the-art synchrotron facilities. These advancements have expanded the range of proteins and complexes amenable to structural analysis, including membrane proteins and large macromolecular assemblies that were previously challenging to crystallize. The integration of automation, robotics, and artificial intelligence into sample preparation, data collection, and structure determination has dramatically increased throughput and accuracy, reducing costs and turnaround times. Furthermore, collaborations between academic institutions, research organizations, and industry players have fostered innovation in crystallization techniques, data processing algorithms, and structural databases, further fueling market growth.
The increasing prevalence of chronic and infectious diseases, such as cancer, diabetes, and emerging viral infections, has underscored the need for novel therapeutic targets and vaccines. Protein crystallography services play a pivotal role in the structural characterization of pathogenic proteins, antigen-antibody complexes, and enzyme-inhibitor interactions, facilitating the rational design of next-generation drugs and vaccines. Government initiatives to promote biomedical research, coupled with rising investments from venture capital and pharmaceutical giants, are creating a conducive environment for market expansion. Additionally, the emergence of personalized medicine and precision therapeutics is driving the demand for structural insights into patient-specific protein variants, further boosting the uptake of crystallography services globally.
The role of Structural Bioinformatics Software is becoming increasingly pivotal in the field of protein crystallography. These software tools facilitate the modeling and simulation of protein structures, enabling researchers to predict molecular interactions and optimize crystallization conditions. By integrating structural bioinformatics with experimental data, scientists can enhance the accuracy of protein models and streamline the drug discovery process. The synergy between computational and experimental approaches is driving innovation in structural biology, allowing for more efficient identification of drug targets and the development of novel therapeutics. As the demand for high-resolution protein structures grows, the adoption of advanced bioinformatics software is expected to rise, further propelling the market forward.
Regionally, North America con
SARS-CoV-2 is the causative agent of the current global pandemic. SARS-CoV-2, also called the novel coronavirus, is related to both SARS and bat SARS. Many datasets related to this epidemic exist on Kaggle; however, genomics data had yet to be added. NCBI is an open repository of biomedical data, including sequencing data from laboratories around the world. Many sequences have been collected for all three families of viruses mentioned; however, the data is not presented in an easy-to-use format for data scientists. This dataset is a collection of those sequences, which will be updated periodically as new sequencing data is added.
This dataset contains sequence data obtained from NCBI for various Coronaviridae. Of specific interest at this time are the causative agents of SARS and COVID-19 and the related family that causes bat SARS. The data specific to those three groups is contained in a CSV file, along with the full text description and NCBI accession number. Additional information about each sequence can be obtained by searching NCBI for the specific accession number.
In addition to the CSV file, the original FASTA files for those sequences are included, along with another for related coronaviruses.
These FASTA files were collected using a script maintained by the BioStars Handbook authors. The actual sequence data has been generated by various research and clinical groups around the world dealing with infectious diseases.
The BioStars Handbook nCov Analysis text is a great starting point for looking at these data from a general bioinformatics perspective. Of more interest, however, is how we can look beyond those methods and incorporate general data science techniques to gain more insight into these agents.
Sequence similarity is a good place to start to understand the evolutionary history of these organisms. Although this is well studied in the literature, it can be useful as a starting point.
For features, I would recommend looking into k-mer counts as well as one-hot encoding the sequences. To one-hot encode them, the sequences might need to be padded to a common length; the classic placeholder in bioinformatics is the character N.
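The two feature ideas above can be sketched in a few lines of plain Python. The function names and the toy sequences are assumptions for illustration. Encoding the N placeholder as an all-zero row is one possible design choice, not something the dataset prescribes.

```python
from collections import Counter

ALPHABET = "ACGT"


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def pad(sequence, length, placeholder="N"):
    """Right-pad a sequence with the placeholder up to a common length."""
    return sequence.upper().ljust(length, placeholder)


def one_hot(sequence):
    """One row per base; N (or any other symbol) becomes an all-zero row."""
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in sequence.upper()]


# Toy sequences for illustration only.
toy_sequences = ["ACGTAC", "GGT"]
width = max(len(s) for s in toy_sequences)
encoded = [one_hot(pad(s, width)) for s in toy_sequences]
counts = kmer_counts("ACGTAC", 2)
```

After padding, every encoded sequence has the same number of rows, so the results can be stacked into a fixed-shape array for a model. Real genomes are tens of thousands of bases long, so k-mer counts are usually the cheaper feature to start with.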
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is supplemental to the article "BBGD: an online database for blueberry genomic data," (2007); it is titled "list of genes printed on microarray slides."
The article, which involves blueberry cold hardiness experiments, includes a list of all the genes that were printed on microarray slides. That list is provided as the supplemental file 1471-2229-7-5-s1.xls (663k).
By using the BBGD database, researchers developed EST-based markers for mapping, and have identified a number of "candidate" cold tolerance genes that are highly expressed in blueberry flower buds after exposure to low temperatures.
BBGD (http://bioinformatics.towson.edu/BBGD/) is a public online database, and was developed for blueberry genomics. BBGD is both a sequence and gene expression database: it stores both EST and microarray data, and allows scientists to correlate expression profiles with gene function. Presently, the main focus of the database is the identification of genes in blueberry that are significantly induced or suppressed after low temperature exposure.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Next-generation RNA-sequencing is an incredibly powerful means of generating a snapshot of the transcriptomic state within a cell, tissue, or whole organism. As the questions addressed by RNA-sequencing (RNA-seq) become both more complex and greater in number, there is a need to simplify RNA-seq processing workflows, make them more efficient and interoperable, and make them capable of handling both large and small datasets. This is especially important for researchers who need to process hundreds to tens of thousands of RNA-seq datasets. To address these needs, we have developed a scalable, user-friendly, and easily deployable analysis suite called RMTA (Read Mapping, Transcript Assembly). RMTA can easily process thousands of RNA-seq datasets with features that include automated read quality analysis, filters for lowly expressed transcripts, and read counting for differential expression analysis. RMTA is containerized using Docker for easy deployment within any compute environment [cloud, local, or high-performance computing (HPC)] and is available as two apps in CyVerse's Discovery Environment, one for normal use and one specifically designed for introducing undergraduates and high school students to RNA-seq analysis. For extremely large datasets (tens of thousands of FASTQ files) we developed a high-throughput, scalable, and parallelized version of RMTA optimized for launching on the Open Science Grid (OSG) from within the Discovery Environment. OSG-RMTA allows users to utilize the Discovery Environment for data management, parallelization, and submitting jobs to OSG, and finally, employ the OSG for distributed, high-throughput computing. Alternatively, OSG-RMTA can be run directly on the OSG through the command line. RMTA is designed to be useful for data scientists, of any skill level, interested in rapidly and reproducibly analyzing their large RNA-seq data sets.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset focuses on BRCA gene mutations linked to breast and ovarian cancer.
It contains mutation information sourced from ClinVar, a public database of clinically relevant variants.
Variants are filtered and annotated using Ensembl Variant Effect Predictor (VEP).
The dataset includes information about mutation types, clinical significance, and genomic coordinates.
It is suitable for bioinformatics analysis, variant interpretation, and cancer research.
Researchers and data scientists can use this dataset to explore pathogenicity of BRCA variants.
The dataset supports reproducible workflows for variant filtering and annotation in Python.
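A variant-filtering step of the kind described above might look like the following sketch. The column names (`gene`, `variant_type`, `clinical_significance`, `chromosome`, `position`) and every row in `TOY_CSV` are hypothetical placeholders, since the dataset's actual schema is not given here. The real file's header would need to be checked before adapting this.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; positions are not real coordinates.
TOY_CSV = """gene,variant_type,clinical_significance,chromosome,position
BRCA1,single nucleotide variant,Pathogenic,17,100
BRCA1,deletion,Benign,17,200
BRCA2,single nucleotide variant,Uncertain significance,13,300
BRCA2,duplication,Pathogenic,13,400
"""


def load_variants(handle):
    """Read variants from a CSV handle into a list of dictionaries."""
    return list(csv.DictReader(handle))


def filter_by_significance(variants, wanted):
    """Keep variants whose clinical significance is in the wanted set."""
    wanted = {label.lower() for label in wanted}
    return [v for v in variants
            if v["clinical_significance"].lower() in wanted]


variants = load_variants(io.StringIO(TOY_CSV))
pathogenic = filter_by_significance(variants, {"Pathogenic"})
per_gene = Counter(v["gene"] for v in pathogenic)
```

Keeping the load and filter steps as separate functions makes the workflow easy to rerun when the source file is refreshed.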
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Median science identity and intent to pursue bioinformatics for the Virtual BUILD Research Collaboratory 2020.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints, limiting (1) the diameter of each cluster, (2) the sum of its branch lengths, or (3) chains of pairwise distances. These three problems can be solved in time that increases linearly with the size of the tree, and for two of the three criteria, the algorithms have been known in the theoretical computer science literature. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
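To make the diameter constraint (1) concrete, here is a rough greedy sketch that cuts a rooted tree bottom-up so that no two leaves in the same cluster are farther apart than a threshold. It is not TreeCluster's code and makes no optimality claim. The `Node` class, the function name, and the toy tree are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str = ""
    length: float = 0.0  # length of the branch leading to the parent
    children: list = field(default_factory=list)


def max_diameter_clusters(root, threshold):
    """Greedily split leaves so each cluster's diameter is <= threshold."""
    clusters = []

    def visit(node):
        # Returns (height of the still-attached part, its leaf names).
        if not node.children:
            return 0.0, [node.name]
        parts = []
        for child in node.children:
            height, leaves = visit(child)
            parts.append((height + child.length, leaves))
        parts.sort(key=lambda part: part[0], reverse=True)
        # The longest leaf-to-leaf path through this node joins the two
        # deepest parts; while it is too long, cut off the deepest one.
        while len(parts) > 1 and parts[0][0] + parts[1][0] > threshold:
            clusters.append(parts.pop(0)[1])
        merged = [leaf for _, leaves in parts for leaf in leaves]
        return parts[0][0], merged

    clusters.append(visit(root)[1])
    return clusters


# Toy tree ((A:1,B:1):1,(C:1,D:1):5), for illustration only.
toy_tree = Node(children=[
    Node(length=1.0, children=[Node("A", 1.0), Node("B", 1.0)]),
    Node(length=5.0, children=[Node("C", 1.0), Node("D", 1.0)]),
])
tight = max_diameter_clusters(toy_tree, threshold=3.0)
loose = max_diameter_clusters(toy_tree, threshold=10.0)
```

Each node is visited once, which matches the spirit of the linear-time claim, although the per-node sort adds overhead on highly multifurcating trees. The recursion is fine for small examples, but a large phylogeny would need an iterative traversal to avoid Python's recursion limit.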
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the Greengenes database, OTU definitions for thresholds α = 0.015 and α = 0.045 are not available.