Modules showing how the NCBI database classifies and organizes information on DNA sequences, evolutionary relationships, and scientific publications, plus a module on identifying a nucleotide sequence from an insect endosymbiont using BLAST.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(a,b) The sequences ADY42356 and ADY43217 share 99% sequence identity within the predicted NodB homology domain, but only 32% sequence identity outside this region. ADY42356 has a truncated portion of the catalytic domain. We are unable to discern these as 1 or 2 homologs, but given the catalytic domain identity we use only ADY43217 for alignments.
(c) Based on our sequencing of this region of the gene we have shown that predicted residues 1534–1556 (relative to predicted sequence associated with Accession No. CCD67046) would not be present in the protein sequence.
(d) M. incognita Msp9 is a potential homolog which has sequence verification from two clones available from www.nematode.net. Msp30 (AY142120) shows some homology as well, but there is less supporting data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The file "viral.genomic.gbk.tar.gz" contains the full RefSeq viral database in GenBank format, used as the gold standard for the comparisons. It is used as is by the script "genecounter.py" to count the number of genes, and it is the second (mandatory) input file for "coordinateschecker.py", which counts true positives (TP), false positives (FP) and false negatives (FN). It can also be used for other evaluation purposes.
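As a rough illustration of the TP/FP/FN counting mentioned above, the sketch below compares predicted genes against a gold standard. It assumes that a prediction counts as correct only when its (start, end, strand) tuple matches exactly; the matching rule actually used by "coordinateschecker.py" is not described here and may differ. The function name and the sample coordinates are hypothetical.

```python
def count_matches(gold, predicted):
    """Count TP, FP and FN, assuming exact (start, end, strand) matches."""
    gold_set = set(gold)
    pred_set = set(predicted)
    tp = len(gold_set & pred_set)   # predicted and present in the gold standard
    fp = len(pred_set - gold_set)   # predicted but absent from the gold standard
    fn = len(gold_set - pred_set)   # in the gold standard but not predicted
    return tp, fp, fn


# Hypothetical coordinates, for illustration only.
gold = [(100, 400, "+"), (500, 900, "-"), (1000, 1300, "+")]
predicted = [(100, 400, "+"), (520, 900, "-"), (1000, 1300, "+"), (1500, 1700, "+")]

tp, fp, fn = count_matches(gold, predicted)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, precision, recall)
```

An exact-match rule is the strictest choice; a tool could equally accept a match on the stop coordinate alone, which would change the counts.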
(A) Bioinformatics summary statistics and (B) sequence identity matrix between strains. (XLSX)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The NCBI taxonomy, from a database dump downloaded from the NCBI FTP server on 2017-02-3, imported into a SQLite database table as a phylogenetic tree.
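A taxonomy stored as a tree in a single SQLite table can be walked with a recursive query. The sketch below is a minimal illustration under stated assumptions: the table name "taxonomy", its column names, and the handful of toy rows (which skip intermediate ranks) are all hypothetical and do not reflect the schema of this dataset. It only assumes the NCBI convention that the root node is its own parent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per node, each pointing at its parent.
conn.execute(
    "CREATE TABLE taxonomy ("
    "tax_id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT, tax_rank TEXT)"
)
# Toy rows that skip intermediate ranks; the root is its own parent.
conn.executemany(
    "INSERT INTO taxonomy VALUES (?, ?, ?, ?)",
    [
        (1, 1, "root", "no rank"),
        (2, 1, "Bacteria", "superkingdom"),
        (561, 2, "Escherichia", "genus"),
        (562, 561, "Escherichia coli", "species"),
    ],
)


def lineage(conn, tax_id):
    """Return the names from the root down to the given node."""
    query = """
        WITH RECURSIVE walk(tax_id, parent_id, name, depth) AS (
            SELECT tax_id, parent_id, name, 0
            FROM taxonomy WHERE tax_id = ?
            UNION ALL
            SELECT t.tax_id, t.parent_id, t.name, walk.depth + 1
            FROM taxonomy AS t JOIN walk ON t.tax_id = walk.parent_id
            WHERE walk.tax_id != walk.parent_id  -- stop once the root is reached
        )
        SELECT name FROM walk ORDER BY depth DESC
    """
    return [row[0] for row in conn.execute(query, (tax_id,))]


print(lineage(conn, 562))
```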
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This FASTA file is the NCBI Nt (Nucleotide) database (public domain) used for holistic metagenomic screening of ancient DNA data at the Department of Archaeogenetics at the Max Planck Institute for the Science of Human History. We offer here the FASTA file used to construct MALT databases (https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/malt/), which are generally too large for uploading. Please see the relevant publications that use the database for the MALT database construction commands.
NCBI does not retain older versions of this database, which is why it has been uploaded here. It was downloaded on 2017-10-26 12:39 from: ftp://ftp-trace.ncbi.nih.gov/blast/db/FASTA/nt.gz. The NCBI Nt database is released into the public domain as per https://www.ncbi.nlm.nih.gov/home/about/policies/.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Five known sequences are accurately classified by ORFanID.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotation of 112 NCBI GenBank/RefSeq genomes for the preprint "The high-throughput gene prediction of more than 1,700 eukaryote genomes using the software package EukMetaSanity" by Neely, Hu, Alexander, and Tully, 2021.
Extract files with the command:
cat eukmetasanity.20210722.NCBI.parta* | tar -xzvf -
This is my first resource. Visit https://dataone.org/datasets/sha256%3A0af80cf07445202b04e935d34961b969a43ea995a5a2f323352ab2ed5915c188 for complete metadata about this dataset.
Repository of raw sequencing data from next-generation sequencing platforms, including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, Helicos Heliscope, Complete Genomics, and Pacific Biosciences SMRT. In addition to raw sequence data, SRA now stores alignment information in the form of read placements on a reference sequence. Data submissions are welcome. SRA is an archive of high-throughput sequencing data and part of the international partnership of archives (INSDC) at NCBI, the European Bioinformatics Institute and the DNA Data Bank of Japan. Data submitted to any of these three organizations are shared among them.
https://creativecommons.org/publicdomain/zero/1.0/
This is the raw RNA data (the genome) for HIV-1 and HIV-2 from the NCBI. Data is available here for both viruses in FASTA and GenBank formats.
Human Immunodeficiency Virus (HIV) is a virus that infects a person's immune system, which can potentially destroy said immune system and lead to immunocompromise or AIDS.
The RNA of a virus is literally its blueprint and is what gets replicated when a virus infects a cell. A virus's goal is to make millions of copies of itself by hijacking the machinery of living cells. You can think of all viruses as floating blueprints that trick a cell into making more of that blueprint, which then infect other cells to make more copies, and so on.
Human Immunodeficiency Virus (HIV) is a type of retrovirus, which not only infects cells to make copies of itself, but also inserts a copy of itself into that cell's DNA, which makes it harder to eradicate.
Like most viruses, HIV has more than one type (HIV-1 and HIV-2). There are also different strains and subtypes.
HIV-1 is more prevalent worldwide than HIV-2 and is also more deadly. HIV-2 is mostly found in West Africa, and it is less likely to progress to immune system failure and AIDS (Nyamweya et al., 2013). The two viruses share a 55% similarity in their RNA sequences (Motomura, Chen, & Hu, 2007). Both genomes are included in this dataset.
Read more about the differences between HIV-1 and HIV-2 here.
You can see how HIV-1's RNA sequence leads to its ultimate physical shape here on NCBI.
GenBank and FASTA are two of the most popular file formats in bioinformatics. Both contain a DNA or RNA sequence, along with an accession number (or ID) and the virus's name.
However, GenBank is a more detailed bioinformatics-type file than FASTA. While FASTA only has a name, accession number/ID, and the RNA itself, GenBank also has named gene or protein sequences, which is crucial to understanding what the RNA actually makes and thus how the virus actually works.
FASTA files technically have all of this info, since we can deduce the genes and/or proteins from the RNA, but GenBank files already contain the work of other scientists who have done this gene/protein-identification for us.
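To make the FASTA layout concrete, here is a minimal sketch of a FASTA reader in plain Python. A record is a header line starting with ">" (holding the accession/ID and a description) followed by one or more sequence lines. The accession and sequence in the example are hypothetical placeholders, not records from this dataset; for real work, a library such as Biopython is the usual choice and also reads GenBank files.

```python
def parse_fasta(text):
    """Map each FASTA header (without the leading '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequences may wrap over many lines
    return {h: "".join(parts) for h, parts in records.items()}


# Hypothetical record, for illustration only.
example = """>ACC0001.1 Hypothetical virus, complete genome
GGTCTCTCTG
GTTAGACCAG
"""

records = parse_fasta(example)
for header, sequence in records.items():
    accession, _, description = header.partition(" ")
    print(accession, "|", description, "|", len(sequence), "bases")
```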
This data was downloaded from the National Center for Biotechnology Information (NCBI) by the bio command line tool.
This data is in the public domain per the NCBI. Their statement on data licensing and copyright is as follows:
Databases of molecular data on the NCBI Web site include such examples as nucleotide sequences (GenBank), protein sequences, macromolecular structures, molecular variation, gene expression, and mapping data. They are designed to provide and encourage access within the scientific community to sources of current and comprehensive information. Therefore, NCBI itself places no restrictions on the use or distribution of the data contained therein. Nor do we accept data when the submitter has requested restrictions on reuse or redistribution. However, some submitters of the original data (or the country of origin of such data) may claim patent, copyright, or other intellectual property rights in all or a portion of the data (that has been submitted). NCBI is not in a position to assess the validity of such claims and since there is no transfer of rights from submitters to NCBI, NCBI has no rights to transfer to a third party. Therefore, NCBI cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in the molecular databases.
Thank you to the National Cancer Institute on Unsplash for the banner image.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used the Woltka pipeline to compile the complete Basic genome dataset, consisting of 4634 genomes, with each genus represented by a single genome. After downloading all coding sequences (CDS) from the NCBI database, we extracted 8 million distinct CDS, focusing on bacteria and archaea and excluding viruses and fungi due to inadequate gene information.
To maintain accuracy, we excluded hypothetical proteins, uncharacterized proteins, and sequences without gene labels. We addressed issues with gene name inconsistencies in NCBI by keeping only genes with more than 1000 samples and ensuring each phylum had at least 350 sequences. This resulted in a curated dataset of 800,318 gene sequences from 497 genes across 2046 genera.
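The two count-based filters described above can be sketched as follows. This is only an illustration under stated assumptions: records are reduced to hypothetical (gene, phylum) pairs, the gene filter is applied before the phylum filter (the actual order is not stated), and the toy data uses much smaller thresholds than the 1000 and 350 quoted in the text.

```python
from collections import Counter


def filter_records(records, min_gene_count=1000, min_phylum_count=350):
    """Keep genes seen more than min_gene_count times, then phyla
    that still have at least min_phylum_count sequences."""
    gene_counts = Counter(gene for gene, _ in records)
    kept = [r for r in records if gene_counts[r[0]] > min_gene_count]
    phylum_counts = Counter(phylum for _, phylum in kept)
    return [r for r in kept if phylum_counts[r[1]] >= min_phylum_count]


# Hypothetical toy data: (gene label, phylum label).
toy = [("rpoB", "PhylumA")] * 3 + [("rpoB", "PhylumB")] + [("gyrA", "PhylumA")]
result = filter_records(toy, min_gene_count=2, min_phylum_count=2)
print(result)
```

In the toy run, "gyrA" is dropped because it appears only once, and "PhylumB" is dropped because only one of its sequences survives the gene filter.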
We created four datasets to evaluate our model: a training set (Train_set); a test set (Test_set) with different samples but the same genera and genes as the training set; a Taxa_out_set covering 18 phyla excluded from the training set (genes present in the training set, but from different phyla); and a Gene_out_set covering 60 genes excluded from the training set (but from the same phyla). We ensured each CDS had only one representation per genome, removing genes with multiple representations within the same species.
The dbRBC database provides an open, publicly accessible platform for DNA and clinical data related to human red blood cells (RBC). A new bioinformatics resource, dbRBC, has been installed at the National Center for Biotechnology Information (NCBI). This resource combines the well-established Blood Group Antigen Gene Mutation Database (BGMUT) with tools and interlinked resources developed at the NCBI. The main task of dbRBC is to provide access to publicly available genomic, protein and structural information linked to the red blood cell antigens. The site offers a number of resources:
* BGMUT Database
* Alignment Viewer
* SBT Tool
* Probe/Primer Resource
* Typing Kit Interface
* Obstacle
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset consists of whole genome DNA sequences, generated from invertebrate species from the Gulf of Mexico during the Benthic Invertebrate Taxonomy, Metagenomics, and Bioinformatics Workshop (BITMaB) in 2017 in Corpus Christi, Texas, USA. All genomic data sets were deposited in and distributed by GenBank (NCBI), the European Nucleotide Archive (ENA)- European Bioinformatics Institute (EMBL-EBI), DNA Data Bank of Japan, NemATOL, the Global Genome Initiative, and Ocean Genome Legacy.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NCBI plasmids database for the HyAsP plasmid detection and binning tool.
See the Armadillo website for the complete list of included applications (a).
(a) Up-to-date list of included applications is available at: http://adn.bioinfo.uqam.ca/armadillo/included.html.
(b) NCBI EUtil is available at: http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html.
Background: The application of reduced metagenomic sequencing approaches holds promise as a middle ground between targeted amplicon sequencing and whole metagenome sequencing approaches but has not been widely adopted as a technique. A major barrier to adoption is the lack of read simulation software built to handle characteristic features of these novel approaches. Reduced metagenomic sequencing (RMS) produces unique patterns of fragmentation per genome that are sensitive to restriction enzyme choice, and the non-uniform size selection of these fragments may introduce novel challenges to taxonomic assignment as well as relative abundance estimates.
Results: Through the development and application of simulation software, readsynth, we compare simulated metagenomic sequencing libraries with existing RMS data to assess the influence of multiple library preparation and sequencing steps on downstream analytical results. Based on read depth per position, readsynth achieved 0.79 Pearson’s corre...
Sequence data were collected and aggregated from publicly available NCBI SRA databases for raw sequence data (https://www.ncbi.nlm.nih.gov/sra) and NCBI RefSeq databases for reference genome assemblies (https://www.ncbi.nlm.nih.gov/refseq/). Downloaded reference genomes have been concatenated and indexed using the command-line "cat" command and the "bwa index" command.
# readsynth_analysis
https://doi.org/10.5061/dryad.nzs7h44zk
The dataset contained here provides the necessary raw sequence data to perform analyses for the simulation software readsynth.
The dataset includes the genomes and databases necessary to reproduce the steps in the GitHub repository readsynth_analysis and corresponds to that repository's "raw_data" directory.
The genome directory "raw_data" is broken into the following subdirectories (further descriptions below):
.
├── helius
│   └── all_2084
│       ├── genomes
│       └── genomes_combined
├── kraken_dbs
│   ├── k2_pluspfp_20220607
│   ├── snipen_bei_db
│   │   └── library
│   │       └── added
│   └── sun_atcc_db
│       └── library
│           └── added
├── liu_RMS
│   └── mock_community_estimate
│       ├── 10M_bracken_profile
│ ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The complete fileset for analysis of GC content and genome size in completed Bacteria and Euryarchaeota genomes from the NCBI database.
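For readers unfamiliar with the two quantities, the sketch below shows one common way to compute GC content and genome size from a sequence string. It is an illustration only: ignoring ambiguous bases such as N in the denominator is an assumption, and the method used to produce this fileset is not described here.

```python
def gc_content(sequence):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    sequence = sequence.upper()
    counted = sum(sequence.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / counted


def genome_size(sequences):
    """Total length in bases over all replicons (chromosomes, plasmids)."""
    return sum(len(s) for s in sequences)


# Hypothetical toy sequences, for illustration only.
replicons = ["ATGCGCGTTA", "GGCCNNAT"]
print(genome_size(replicons), [gc_content(s) for s in replicons])
```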
RNA expression analysis was performed on the corpus luteum tissue at five time points after prostaglandin F2 alpha treatment of midcycle cows using an Affymetrix Bovine Gene v1 Array. The normalized linear microarray data was uploaded to the NCBI GEO repository (GSE94069). Subsequent statistical analysis determined differentially expressed transcripts ± 1.5-fold change from saline control with P ≤ 0.05. Gene ontology of differentially expressed transcripts was annotated by DAVID and Panther. Physiological characteristics of the study animals are presented in a figure. Bioinformatic analysis by Ingenuity Pathway Analysis was curated, compiled, and presented in tables. A dataset comparison with similar microarray analyses was performed, and bioinformatics analyses by Ingenuity Pathway Analysis, DAVID, Panther, and String of the differentially expressed genes from each dataset, as well as of the differentially expressed genes common to all three datasets, were curated, compiled, and presented in tables. Finally, a table comparing four bioinformatics tools' predictions of functions associated with genes common to all three datasets is presented. These data have been further analyzed and interpreted in the companion article "Early transcriptome responses of the bovine mid-cycle corpus luteum to prostaglandin F2 alpha includes cytokine signaling". Resources in this dataset: Resource Title: Supporting information as Excel spreadsheets and tables. File Name: Web Page, url: http://www.sciencedirect.com/science/article/pii/S2352340917304031?via=ihub#s0070
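The selection rule quoted above (± 1.5-fold change from control with P ≤ 0.05) can be sketched as a simple filter. This is an illustration under stated assumptions: expression values are taken to be on a linear scale, fold change is computed as a plain treated/control ratio, and the function name and example values are hypothetical. The statistical test that produced the P values is not reproduced here.

```python
def is_differentially_expressed(treated, control, p_value, fold=1.5, alpha=0.05):
    """True when the change is at least `fold` up or down and P <= alpha."""
    ratio = treated / control  # linear-scale fold change (assumed)
    changed = ratio >= fold or ratio <= 1 / fold
    return changed and p_value <= alpha


# Hypothetical transcripts: (name, treated signal, control signal, P value).
transcripts = [
    ("transcript_up", 300.0, 100.0, 0.01),
    ("transcript_down", 50.0, 100.0, 0.05),
    ("transcript_small_change", 120.0, 100.0, 0.01),
    ("transcript_not_significant", 300.0, 100.0, 0.20),
]
selected = [name for name, t, c, p in transcripts
            if is_differentially_expressed(t, c, p)]
print(selected)
```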
https://creativecommons.org/publicdomain/zero/1.0/
Drosophila melanogaster, the common fruit fly, is a model organism that has been used extensively in entomological research. It is one of the most studied organisms in biological research, particularly in genetics and developmental biology.
When it's not being used for scientific research, D. melanogaster is a common pest in homes, restaurants, and anywhere else that serves food. It is not to be confused with Tephritidae flies (also known as fruit flies).
https://en.wikipedia.org/wiki/Drosophila_melanogaster
This genome was first sequenced in 2000. It contains four pairs of chromosomes (2,3,4 and X/Y). More than 60% of the genome appears to be functional non-protein-coding DNA.
![D. melanogaster chromosomes][1]
The genome is maintained and frequently updated at [FlyBase][2]. This dataset is sourced from the UCSC Genome Bioinformatics download page. It uses the August 2014 version of the D. melanogaster genome (dm6, BDGP Release 6 + ISO1 MT). http://hgdownload.soe.ucsc.edu/downloads.html#fruitfly
Files were modified by Kaggle to be a better fit for analysis on Scripts. This primarily involved turning files into CSV format, with a header row, as well as converting the genome itself from 2bit format into a FASTA sequence file.
Genomic analysis can be daunting to data scientists who haven't had much experience with bioinformatics before. We have tried to give basic explanations of each of the files in this dataset, as well as links to further reading on the biological basis for each. If you haven't had the chance to study much biology before, some light reading (i.e., Wikipedia) on the following topics may be helpful to understand the nuances of the data provided here: [Genetics][3], [Genomics][4], [Chromosomes][7], [DNA][8], [RNA][9], [Genes][12], [Alleles][13], [Exons][14], [Introns][15], [Transcription][16], [Translation][17], [Peptides][18], [Proteins][19], [Gene Regulation][20], [Mutation][21], [Phylogenetics][22], and [SNPs][23].
Of course, if you've got some idea of the basics already - don't be afraid to jump right in!
There are a lot of great resources for learning bioinformatics on the web. One cool site is [Rosalind][24] - a platform that gives you bioinformatic coding challenges to complete. You can use Kaggle Scripts on this dataset to easily complete the challenges on Rosalind (and see [Myles' solutions here][25] if you get stuck). We have set up [Biopython][26] on Kaggle's docker image which is a great library to help you with your analyses. Check out their [tutorial here][27] and we've also created [a python notebook with some of the tutorial applied to this dataset][28] as a reference.
Drosophila Melanogaster Genome
The assembled genome itself is presented here in [FASTA format][29]. Each chromosome is a different sequence of nucleotides. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.
Meta Information
There are 3 additional files with meta information about the genome.
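Because repeats are marked only by letter case ("soft-masking"), they can be recovered directly from the sequence string. The sketch below is a minimal illustration on a hypothetical toy sequence; the function names are assumptions and are not part of this dataset or of Biopython.

```python
import re


def soft_masked_fraction(sequence):
    """Fraction of bases written in lower case (i.e., marked as repeats)."""
    letters = [c for c in sequence if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.islower() for c in letters) / len(letters)


def masked_intervals(sequence):
    """Half-open (start, end) positions of each lower-case run."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", sequence)]


# Hypothetical toy sequence, for illustration only.
toy = "ACgtTTaa"
print(soft_masked_fraction(toy), masked_intervals(toy))
```

Note that calling `.upper()` on a sequence before analysis discards this repeat annotation, so any masking statistics should be computed first.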
This file contains descriptive information about CpG Islands in the genome.
https://en.wikipedia.org/wiki/CpG_site
This file describes the positions of cytogenetic bands on each chromosome.
https://en.wikipedia.org/wiki/Cytogenetics
This file describes simple tandem repeats in the genome.
https://en.wikipedia.org/wiki/Repeated_sequence_(DNA) https://en.wikipedia.org/wiki/Tandem_repeat
Drosophila Melanogaster mRNA Sequences
Messenger RNA (mRNA) is an intermediate molecule created as part of the cellular process of converting genomic information into proteins. Some mRNA are never translated into proteins and have functional roles in the cell on their own. Collectively, an organism's mRNA information is known as its transcriptome. The mRNA files included in this dataset give insight into the activity of genes in the organism.
https://en.wikipedia.org/wiki/Messenger_RNA
This file includes all mRNA sequences from GenBank associated with Drosophila Melanogaster.
http://www.ncbi.nlm.nih.gov/genbank/
This file includes all mRNA sequences from RefSeq associated with Drosophila Melanogaster.
http://www.ncbi.nlm.nih.gov/refseq/
Gene Predictions
A gene is a segment of DNA on the genome which, through mRNA, is used to create proteins in the organism. Knowing which parts of DNA are coding (genes) or non-coding is difficult, and a number of different systems for prediction exist. This da...