100+ datasets found
  1. Bioinformatics: An Interactive Introduction to NCBI

    • qubeshub.org
    Updated Jan 3, 2019
    Cite
    Seth Bordenstein (2019). Bioinformatics: An Interactive Introduction to NCBI [Dataset]. http://doi.org/10.25334/Q4915C
    Explore at:
    Dataset updated
    Jan 3, 2019
    Dataset provided by
    QUBES
    Authors
    Seth Bordenstein
    Description

    Modules showing how the NCBI database classifies and organizes information on DNA sequences, evolutionary relationships, and scientific publications, plus a module on identifying a nucleotide sequence from an insect endosymbiont using BLAST.

  2. Nematode PDAs Identified Through Bioinformatics using resources from NCBI,...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 8, 2023
    Cite
    Ronald J. Heustis; Hong K. Ng; Kenneth J. Brand; Meredith C. Rogers; Linda T. Le; Charles A. Specht; Juliet A. Fuhrman (2023). Nematode PDAs Identified Through Bioinformatics using resources from NCBI, Sanger, and www.nematode.net. [Dataset]. http://doi.org/10.1371/journal.pone.0040426.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ronald J. Heustis; Hong K. Ng; Kenneth J. Brand; Meredith C. Rogers; Linda T. Le; Charles A. Specht; Juliet A. Fuhrman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    a,b The sequences ADY42356 and ADY43217 share 99% sequence identity within the predicted NodB homology domain, but only 32% sequence identity outside this region. ADY42356 has a truncated portion of the catalytic domain. We are unable to discern these as 1 or 2 homologs, but given the catalytic domain identity we use only ADY43217 for alignments.

    c Based on our sequencing of this region of the gene we have shown that predicted residues 1534–1556 (relative to predicted sequence associated with Accession No. CCD67046) would not be present in the protein sequence.

    d M. incognita Msp9 is a potential homolog which has sequence verification from two clones available from www.nematode.net. Msp30 (AY142120) shows some homology as well, but there is less supporting data.

  3. Viral genomes from GenBank (reference) - Comparative analysis of gene...

    • figshare.com
    application/x-gzip
    Updated Jun 3, 2023
    Cite
    Enrique Gonzalez Tortuero; Revathy Krishnamurthi; Heather Allison; Ian Goodhead; Chloë James (2023). Viral genomes from GenBank (reference) - Comparative analysis of gene prediction tools for viral genome annotation [Dataset]. http://doi.org/10.6084/m9.figshare.21353829.v1
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Enrique Gonzalez Tortuero; Revathy Krishnamurthi; Heather Allison; Ian Goodhead; Chloë James
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The file "viral.genomic.gbk.tar.gz" contains the full RefSeq viral database in GenBank format, used as the gold standard for the comparisons. It should be passed unchanged to the script "genecounter.py" to count the number of genes, and it is the second (mandatory) input file for counting true positives (TP), false positives (FP), and false negatives (FN) via "coordinateschecker.py". It can also be used for other evaluation purposes.
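    To make the TP/FP/FN counting concrete, here is a minimal Python sketch. It is not the "coordinateschecker.py" script, whose matching rules are not described in this listing. It assumes, purely for illustration, that a predicted gene counts as a true positive only when its sequence ID, start, end, and strand exactly match a reference gene. The names `compare_predictions` and `precision_recall` are hypothetical.

```python
from typing import Iterable, NamedTuple, Set, Tuple

# (sequence id, start, end, strand); a hypothetical record layout for this sketch
Gene = Tuple[str, int, int, str]


class Confusion(NamedTuple):
    tp: int  # predicted genes that match a reference gene
    fp: int  # predicted genes with no matching reference gene
    fn: int  # reference genes that were not predicted


def compare_predictions(reference: Iterable[Gene], predicted: Iterable[Gene]) -> Confusion:
    """Count TP, FP and FN under the exact-coordinate-match assumption."""
    ref: Set[Gene] = set(reference)
    pred: Set[Gene] = set(predicted)
    return Confusion(tp=len(ref & pred), fp=len(pred - ref), fn=len(ref - pred))


def precision_recall(c: Confusion) -> Tuple[float, float]:
    """Derive precision and recall, returning 0.0 when a denominator is zero."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    return precision, recall
```

    A real comparison would likely need a looser criterion (for example, matching stop codons only), so the exact-match rule should be read as a placeholder.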

  4. Bioinformatics Summary statistics together with NCBI accession numbers.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 1, 2020
    Cite
    Tapia, Sebastián M.; Saenz-Agudelo, Pablo; Nespolo, Roberto F.; Villarroel, Carlos A.; Thompson, Dawn; Mikhalev, Ekaterina; Liti, Gianni; De Chiara, Matteo; Cubillos, Francisco A.; Urbina, Kamila; Mozzachiodi, Simone; Larrondo, Luis F.; Vega-Macaya, Franco; Oporto, Christian I. (2020). Bioinformatics Summary statistics together with NCBI accession numbers. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000455946
    Explore at:
    Dataset updated
    May 1, 2020
    Authors
    Tapia, Sebastián M.; Saenz-Agudelo, Pablo; Nespolo, Roberto F.; Villarroel, Carlos A.; Thompson, Dawn; Mikhalev, Ekaterina; Liti, Gianni; De Chiara, Matteo; Cubillos, Francisco A.; Urbina, Kamila; Mozzachiodi, Simone; Larrondo, Luis F.; Vega-Macaya, Franco; Oporto, Christian I.
    Description

    (A) Bioinformatics Summary statistics and (B) Sequence identity matrix between strains. (XLSX)

  5. NCBI taxonomy database files

    • figshare.com
    Updated Oct 11, 2017
    Cite
    Rutger Vos (2017). NCBI taxonomy database files [Dataset]. http://doi.org/10.6084/m9.figshare.4620733.v1
    Explore at:
    Available download formats: application/x-sqlite3
    Dataset updated
    Oct 11, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Rutger Vos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The NCBI taxonomy, from a database dump downloaded from the NCBI FTP server on 2017-02-3, imported in a SQLite database table as a phylogenetic tree.

  6. NCBI Nt (Nucleotide) database FASTA file from 2017-10-26

    • zenodo.org
    application/gzip
    Updated Dec 23, 2020
    Cite
    James Fellows Yates; James Fellows Yates (2020). NCBI Nt (Nucleotide) database FASTA file from 2017-10-26 [Dataset]. http://doi.org/10.5281/zenodo.4382154
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Dec 23, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    James Fellows Yates; James Fellows Yates
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    This FASTA file is the NCBI Nt (Nucleotide) database (public domain) used for holistic metagenomic screening of ancient DNA data at the Department of Archaeogenetics at the Max Planck Institute for the Science of Human History. We offer here the FASTA file used to construct MALT databases (https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/malt/), which are generally too large for uploading. Please see the relevant publications that use the database for the MALT database construction commands.

    NCBI does not retain older versions of this database which is why this has been uploaded here. It was downloaded on 2017-10-26 12:39 from: ftp://ftp-trace.ncbi.nih.gov/blast/db/FASTA/nt.gz. The NCBI Nt database is released into the public domain as per https://www.ncbi.nlm.nih.gov/home/about/policies/.

  7. Table showing the ORFanID classification of different genes.

    • plos.figshare.com
    xls
    Updated Oct 25, 2023
    Cite
    Richard S. Gunasekera; Komal K. B. Raja; Suresh Hewapathirana; Emanuel Tundrea; Vinodh Gunasekera; Thushara Galbadage; Paul A. Nelson (2023). Table showing the ORFanID classification of different genes. [Dataset]. http://doi.org/10.1371/journal.pone.0291260.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 25, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Richard S. Gunasekera; Komal K. B. Raja; Suresh Hewapathirana; Emanuel Tundrea; Vinodh Gunasekera; Thushara Galbadage; Paul A. Nelson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Five known sequences are accurately classified by ORFanID.

  8. NCBI Annotation Results

    • figshare.com
    bin
    Updated Jul 23, 2021
    Cite
    Christopher Neely; Benjamin Tully (2021). NCBI Annotation Results [Dataset]. http://doi.org/10.6084/m9.figshare.15040554.v2
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 23, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Christopher Neely; Benjamin Tully
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotation of 112 NCBI GenBank/RefSeq genomes for the preprint "The high-throughput gene prediction of more than 1,700 eukaryote genomes using the software package EukMetaSanity" by Neely, Hu, Alexander, and Tully, 2021.

    Extract files with the command: cat eukmetasanity.20210722.NCBI.parta* | tar -xzvf -
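    For readers without a Unix shell, the extraction command in this description (concatenate the split parts, then untar the gzip stream) can be approximated in Python. This is a sketch using only the standard library; the class `ChainedReader` and the function `extract_split_archive` are hypothetical names, and the sketch assumes the parts sort correctly by filename, as the shell glob does.

```python
import tarfile
from pathlib import Path
from typing import Iterable, List


class ChainedReader:
    """Minimal file-like object that reads several files back to back."""

    def __init__(self, paths: Iterable[Path]):
        self._paths = iter(paths)
        self._current = None

    def read(self, size: int = -1) -> bytes:
        chunks = []
        remaining = size
        while size < 0 or remaining > 0:
            if self._current is None:
                nxt = next(self._paths, None)
                if nxt is None:
                    break  # no more parts
                self._current = open(nxt, "rb")
            data = self._current.read(remaining if size >= 0 else -1)
            if not data:
                # Current part is exhausted; move on to the next one.
                self._current.close()
                self._current = None
                continue
            chunks.append(data)
            if size >= 0:
                remaining -= len(data)
        return b"".join(chunks)

    def close(self) -> None:
        if self._current is not None:
            self._current.close()
            self._current = None


def extract_split_archive(parts: Iterable[Path], dest: Path) -> List[str]:
    """Stream the sorted parts through tarfile and extract every member into dest."""
    reader = ChainedReader(sorted(parts))
    names: List[str] = []
    try:
        # "r|gz" reads a gzip-compressed tar as a forward-only stream.
        with tarfile.open(fileobj=reader, mode="r|gz") as tar:
            for member in tar:
                names.append(member.name)
                tar.extract(member, path=dest)
    finally:
        reader.close()
    return names
```

    A call such as `extract_split_archive(Path(".").glob("eukmetasanity.20210722.NCBI.parta*"), Path("out"))` would mirror the shell command. Only extract archives from sources you trust.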

  9. UNC-workshop

    • search.dataone.org
    Updated Dec 5, 2021
    Cite
    ANDRES ESPINDOLA (2021). UNC-workshop [Dataset]. https://search.dataone.org/view/sha256:0af80cf07445202b04e935d34961b969a43ea995a5a2f323352ab2ed5915c188
    Explore at:
    Dataset updated
    Dec 5, 2021
    Dataset provided by
    Hydroshare
    Authors
    ANDRES ESPINDOLA
    Description

    This is my first resource. Visit https://dataone.org/datasets/sha256%3A0af80cf07445202b04e935d34961b969a43ea995a5a2f323352ab2ed5915c188 for complete metadata about this dataset.

  10. NCBI Sequence Read Archive (SRA)

    • neuinfo.org
    • rrid.site
    Updated Oct 7, 2019
    + more versions
    Cite
    (2019). NCBI Sequence Read Archive (SRA) [Dataset]. http://identifiers.org/RRID:SCR_004891
    Explore at:
    Dataset updated
    Oct 7, 2019
    Description

    Repository of raw sequencing data from next-generation sequencing platforms, including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, Helicos Heliscope, Complete Genomics, and Pacific Biosciences SMRT. In addition to raw sequence data, SRA now stores alignment information in the form of read placements on a reference sequence. Data submissions are welcome. Archive of high-throughput sequencing data, part of the international partnership of archives (INSDC) at NCBI, the European Bioinformatics Institute, and the DNA Data Bank of Japan. Data submitted to any of these three organizations are shared among them.

  11. HIV-1 and HIV-2 RNA Sequences

    • kaggle.com
    Updated Mar 14, 2024
    Cite
    Proto Bioengineering (2024). HIV-1 and HIV-2 RNA Sequences [Dataset]. http://doi.org/10.34740/kaggle/dsv/7846544
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 14, 2024
    Dataset provided by
    Kaggle
    Authors
    Proto Bioengineering
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the raw RNA data (the genome) for HIV-1 and HIV-2 from the NCBI. Data is available here for both viruses in FASTA and GenBank formats.

    What is HIV?

    Human Immunodeficiency Virus (HIV) is a virus that infects a person's immune system, which can potentially destroy said immune system and lead to immunocompromise or AIDS.

    What is RNA?

    The RNA of a virus is literally its blueprint and is what gets replicated when a virus infects a cell. A virus's goal is to make millions of copies of itself by hijacking the machinery of living cells. You can think of all viruses as floating blueprints that trick a cell into making more of that blueprint, which then infect other cells to make more copies, and so on.

    Human Immunodeficiency Virus (HIV) is a type of retrovirus, which not only infects cells to make copies of itself, but also inserts a copy of itself into that cell's DNA, which makes it harder to eradicate.

    HIV-1 vs. HIV-2

    Like most viruses, HIV has more than one type (HIV-1 and HIV-2). There are also different strains and subtypes.

    HIV-1 is more prevalent worldwide than HIV-2 and is also more deadly. HIV-2 is mostly found in West Africa, and it is less likely to progress to immune system failure and AIDS (Nyamweya et al., 2013). The two viruses share a 55% similarity in their RNA sequences (Motomura, Chen, & Hu, 2007). Both genomes are included in this dataset.

    Read more about the differences between HIV-1 and HIV-2 here.

    What HIV looks like

    You can see how HIV-1's RNA sequence leads to its ultimate physical shape here on NCBI.

    GenBank vs. FASTA files

    GenBank and FASTA are two of the most popular file formats in Bioinformatics. They both have DNA or RNA, as well as an accession number (or ID), and the virus's name.

    However, GenBank is a more detailed bioinformatics-type file than FASTA. While FASTA only has a name, accession number/ID, and the RNA itself, GenBank also has named gene or protein sequences, which is crucial to understanding what the RNA actually makes and thus how the virus actually works.

    FASTA files technically have all of this info, since we can deduce the genes and/or proteins from the RNA, but GenBank files already contain the work of other scientists who have done this gene/protein-identification for us.
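    To illustrate how little a FASTA record carries (an ID, a free-text description, and the sequence), here is a minimal pure-Python reader. It is only a sketch: the function name `read_fasta` and the demo record are made up for illustration, and a library such as Biopython would normally be the better choice, especially for GenBank files with their annotated features.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def _build(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    """Split a header into (id, description) and join the sequence lines."""
    record_id, _, description = header.partition(" ")
    return record_id, description, "".join(chunks)


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (record id, description, sequence) for each FASTA record."""
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _build(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _build(header, chunks)


# Made-up record, not taken from the dataset.
DEMO = ">DEMO_1 example virus, complete genome\nACGT\nGGCC\n"
records = list(read_fasta(io.StringIO(DEMO)))
print(records)
```

    Everything a GenBank file adds on top of this (named genes, coding regions, protein translations) has to be parsed from its feature table, which this sketch does not attempt.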

    Data Source

    This data was downloaded from the National Center for Biotechnology Information (NCBI) by the bio command line tool.

    License

    This data is in the public domain per the NCBI. Their statement on data licensing and copyright is as follows:

    Databases of molecular data on the NCBI Web site include such examples as nucleotide sequences (GenBank), protein sequences, macromolecular structures, molecular variation, gene expression, and mapping data. They are designed to provide and encourage access within the scientific community to sources of current and comprehensive information. Therefore, NCBI itself places no restrictions on the use or distribution of the data contained therein. Nor do we accept data when the submitter has requested restrictions on reuse or redistribution. However, some submitters of the original data (or the country of origin of such data) may claim patent, copyright, or other intellectual property rights in all or a portion of the data (that has been submitted). NCBI is not in a position to assess the validity of such claims and since there is no transfer of rights from submitters to NCBI, NCBI has no rights to transfer to a third party. Therefore, NCBI cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in the molecular databases.

    Thank you to the National Cancer Institute on Unsplash for the banner image.

  12. Scorpio Gene-Taxa Benchmark Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 3, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Refahi, Mohammad Saleh (2025). Scorpio Gene-Taxa Benchmark Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12175912
    Explore at:
    Dataset updated
    Apr 3, 2025
    Dataset provided by
    Drexel University
    Authors
    Refahi, Mohammad Saleh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We used the Woltka pipeline to compile the complete Basic genome dataset, consisting of 4634 genomes, with each genus represented by a single genome. After downloading all coding sequences (CDS) from the NCBI database, we extracted 8 million distinct CDS, focusing on bacteria and archaea and excluding viruses and fungi due to inadequate gene information.

    To maintain accuracy, we excluded hypothetical proteins, uncharacterized proteins, and sequences without gene labels. We addressed issues with gene name inconsistencies in NCBI by keeping only genes with more than 1000 samples and ensuring each phylum had at least 350 sequences. This resulted in a curated dataset of 800,318 gene sequences from 497 genes across 2046 genera.
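    The two count-based filters in this paragraph can be sketched in Python as follows. This is an illustration, not the authors' pipeline: the record layout, the function name `filter_records`, and the order of the two steps (gene filter first, then phylum filter) are assumptions. The thresholds are parameters so that the description's values (more than 1000 samples per gene, at least 350 sequences per phylum) can be passed in.

```python
from collections import Counter
from typing import List, Tuple

# (gene name, phylum); a hypothetical record layout for this sketch
Record = Tuple[str, str]


def filter_records(records: List[Record], min_per_gene: int, min_per_phylum: int) -> List[Record]:
    """Keep genes with MORE than min_per_gene samples, then phyla with AT LEAST min_per_phylum sequences."""
    gene_counts = Counter(gene for gene, _ in records)
    kept = [r for r in records if gene_counts[r[0]] > min_per_gene]
    # Phylum counts are taken after the gene filter (an assumption about ordering).
    phylum_counts = Counter(phylum for _, phylum in kept)
    return [r for r in kept if phylum_counts[r[1]] >= min_per_phylum]
```

    With the description's thresholds the call would be `filter_records(records, min_per_gene=1000, min_per_phylum=350)`.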

    We created four datasets to evaluate our model: a training set (Train_set), a test set (Test_set) with different samples but the same genus and gene as the training set, a Taxa_out_set excluding 18 phyla present in the training set but from different phyla, and a Gene_out_set excluding 60 genes from the training set but from the same phyla. We ensured each CDS had only one representation per genome, removing genes with multiple representations within the same species.

  13. NCBI dbRBC

    • neuinfo.org
    • scicrunch.org
    • +1more
    Updated Jul 1, 2002
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2002). NCBI dbRBC [Dataset]. http://identifiers.org/RRID:SCR_005959
    Explore at:
    Dataset updated
    Jul 1, 2002
    Description

    The dbRBC database provides an open, publicly accessible platform for DNA and clinical data related to human red blood cells (RBC). A new bioinformatics resource, dbRBC, has been installed at the National Center for Biotechnology Information (NCBI). This resource combines the well-established Blood Group Antigen Gene Mutation Database (BGMUT) with tools and interlinked resources developed at the NCBI. The main task of dbRBC is to provide access to publicly available genomic, protein, and structural information linked to the red blood cell antigens. The site offers a number of resources:

    • BGMUT Database
    • Alignment Viewer
    • SBT Tool
    • Probe/Primer Resource
    • Typing Kit Interface
    • Obstacle

  14. Whole genome DNA sequences of Gulf of Mexico invertebrates

    • data.griidc.org
    • search.dataone.org
    Updated Aug 5, 2020
    Cite
    W. Kelley Thomas (2020). Whole genome DNA sequences of Gulf of Mexico invertebrates [Dataset]. http://doi.org/10.7266/n7-pchj-dh15
    Explore at:
    Dataset updated
    Aug 5, 2020
    Dataset provided by
    GRIIDC
    Authors
    W. Kelley Thomas
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Description

    The dataset consists of whole genome DNA sequences, generated from invertebrate species from the Gulf of Mexico during the Benthic Invertebrate Taxonomy, Metagenomics, and Bioinformatics Workshop (BITMaB) in 2017 in Corpus Christi, Texas, USA. All genomic data sets were deposited in and distributed by GenBank (NCBI), the European Nucleotide Archive (ENA)- European Bioinformatics Institute (EMBL-EBI), DNA Data Bank of Japan, NemATOL, the Global Genome Initiative, and Ocean Genome Legacy.

  15. HyAsP NCBI RefSeq database

    • figshare.com
    txt
    Updated Apr 16, 2019
    Cite
    cedric chauve (2019). HyAsP NCBI RefSeq database [Dataset]. http://doi.org/10.6084/m9.figshare.8001827.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Apr 16, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    cedric chauve
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NCBI plasmids database for the HyAsP plasmid detection and binning tool.

  16. Bioinformatics applications and services included in Armadillo v1.1.

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Feb 20, 2013
    Cite
    Lord, Etienne; Leclercq, Mickael; Boc, Alix; Diallo, Abdoulaye Baniré; Makarenkov, Vladimir (2013). Bioinformatics applications and services included in Armadillo v1.1. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001687054
    Explore at:
    Dataset updated
    Feb 20, 2013
    Authors
    Lord, Etienne; Leclercq, Mickael; Boc, Alix; Diallo, Abdoulaye Baniré; Makarenkov, Vladimir
    Description

    See the Armadillo website for the complete list of included applications (a).

    a Up-to-date list of included applications is available at: http://adn.bioinfo.uqam.ca/armadillo/included.html.

    b NCBI EUtil is available at: http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html.

  17. Data from Readsynth: short-read simulation for consideration of...

    • search.dataone.org
    • datadryad.org
    Updated Jul 30, 2025
    Cite
    Ryan Kuster (2025). Data from Readsynth: short-read simulation for consideration of composition-biases in reduced metagenome sequencing approaches [Dataset]. http://doi.org/10.5061/dryad.nzs7h44zk
    Explore at:
    Dataset updated
    Jul 30, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Ryan Kuster
    Description

    Background

    The application of reduced metagenomic sequencing approaches holds promise as a middle ground between targeted amplicon sequencing and whole metagenome sequencing approaches but has not been widely adopted as a technique. A major barrier to adoption is the lack of read simulation software built to handle characteristic features of these novel approaches. Reduced metagenomic sequencing (RMS) produces unique patterns of fragmentation per genome that are sensitive to restriction enzyme choice, and the non-uniform size selection of these fragments may introduce novel challenges to taxonomic assignment as well as relative abundance estimates.

    Results

    Through the development and application of simulation software, readsynth, we compare simulated metagenomic sequencing libraries with existing RMS data to assess the influence of multiple library preparation and sequencing steps on downstream analytical results. Based on read depth per position, readsynth achieved 0.79 Pearson’s corre...

    Sequence data were collected and aggregated from publicly available NCBI SRA databases for raw sequence data (https://www.ncbi.nlm.nih.gov/sra) and NCBI RefSeq databases for reference genome assemblies (https://www.ncbi.nlm.nih.gov/refseq/). Downloaded reference genomes have been concatenated and indexed using the command-line "cat" command and the bwa index command.

    readsynth_analysis

    https://doi.org/10.5061/dryad.nzs7h44zk

    The dataset contained here provides the necessary raw sequence data to perform analyses for the simulation software readsynth.

    The dataset includes the genomes and databases necessary to reproduce the steps in the github repository readsynth_analysis and correspond with that repository's "raw_data" directory.

    Description of the data and file structure

    The genome directory "raw_data" is broken into the following subdirectories (further descriptions below):

    .
    ├── helius
    │  └── all_2084
    │    ├── genomes
    │    └── genomes_combined
    ├── kraken_dbs
    │  ├── k2_pluspfp_20220607
    │  ├── snipen_bei_db
    │  │  └── library
    │  │    └── added
    │  └── sun_atcc_db
    │    └── library
    │      └── added
    ├── liu_RMS
    │  └── mock_community_estimate
    │    ├── 10M_bracken_profile
    │ ...
    
  18. GC content and genome size - Completed Bacteria and Euryarchaeota Genomes

    • figshare.com
    pdf
    Updated Jan 19, 2016
    Cite
    Ara Kooser (2016). GC content and genome size - Completed Bacteria and Euryarchaeota Genomes [Dataset]. http://doi.org/10.6084/m9.figshare.1022807.v7
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ara Kooser
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The complete fileset for analysis of GC content and genome size in completed Bacteria and Euryarchaeota genomes from the NCBI database.

  19. Data from: Transcriptomic and bioinformatics analysis of the early...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +1more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data from: Transcriptomic and bioinformatics analysis of the early time-course of the response to prostaglandin F2 alpha in the bovine corpus luteum [Dataset]. https://catalog.data.gov/dataset/data-from-transcriptomic-and-bioinformatics-analysis-of-the-early-time-course-of-the-respo-cd938
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    RNA expression analysis was performed on the corpus luteum tissue at five time points after prostaglandin F2 alpha treatment of midcycle cows using an Affymetrix Bovine Gene v1 Array. The normalized linear microarray data was uploaded to the NCBI GEO repository (GSE94069). Subsequent statistical analysis determined differentially expressed transcripts ± 1.5-fold change from saline control with P ≤ 0.05. Gene ontology of differentially expressed transcripts was annotated by DAVID and Panther. Physiological characteristics of the study animals are presented in a figure. Bioinformatic analysis by Ingenuity Pathway Analysis was curated, compiled, and presented in tables. A dataset comparison with similar microarray analyses was performed and bioinformatics analysis by Ingenuity Pathway Analysis, DAVID, Panther, and String of differentially expressed genes from each dataset as well as the differentially expressed genes common to all three datasets were curated, compiled, and presented in tables. Finally, a table comparing four bioinformatics tools' predictions of functions associated with genes common to all three datasets is presented. These data have been further analyzed and interpreted in the companion article "Early transcriptome responses of the bovine mid-cycle corpus luteum to prostaglandin F2 alpha includes cytokine signaling".

    Resources in this dataset:
    Resource Title: Supporting information as Excel spreadsheets and tables.
    File Name: Web Page, url: http://www.sciencedirect.com/science/article/pii/S2352340917304031?via=ihub#s0070
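    The differential-expression criterion in this description (± 1.5-fold change from control with P ≤ 0.05) can be written as a small predicate. The sketch below is one reading of that criterion, not the authors' analysis code: it assumes linear-scale (not log-transformed) expression values and treats "± 1.5-fold" as a ratio of at least 1.5 or at most 1/1.5. The function name and parameters are hypothetical.

```python
def is_differentially_expressed(
    treated: float,
    control: float,
    p_value: float,
    fold: float = 1.5,
    alpha: float = 0.05,
) -> bool:
    """Return True when the change is at least `fold` up or down and P <= alpha."""
    if treated <= 0 or control <= 0:
        raise ValueError("linear-scale expression values must be positive")
    ratio = treated / control
    changed = ratio >= fold or ratio <= 1.0 / fold
    return changed and p_value <= alpha
```

    For example, a transcript at 200 units in treated tissue versus 100 in the saline control with P = 0.01 passes, while one at 120 versus 100 does not, whatever its P value.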

  20. Drosophila Melanogaster Genome

    • kaggle.com
    • ieee-dataport.org
    zip
    Updated Nov 17, 2019
    Cite
    Myles O'Neill (2019). Drosophila Melanogaster Genome [Dataset]. https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome
    Explore at:
    Available download formats: zip (136202106 bytes)
    Dataset updated
    Nov 17, 2019
    Authors
    Myles O'Neill
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Drosophila Melanogaster

    Drosophila Melanogaster, the common fruit fly, is a model organism which has been extensively used in entomological research. It is one of the most studied organisms in biological research, particularly in genetics and developmental biology.

    When it's not being used for scientific research, D. melanogaster is a common pest in homes, restaurants, and anywhere else that serves food. It is not to be confused with Tephritidae flies (also known as fruit flies).

    https://en.wikipedia.org/wiki/Drosophila_melanogaster

    About the Genome

    This genome was first sequenced in 2000. It contains four pairs of chromosomes (2,3,4 and X/Y). More than 60% of the genome appears to be functional non-protein-coding DNA.

    The genome is maintained and frequently updated at [FlyBase][2]. This dataset is sourced from the UCSC Genome Bioinformatics download page. It uses the August 2014 version of the D. melanogaster genome (dm6, BDGP Release 6 + ISO1 MT). http://hgdownload.soe.ucsc.edu/downloads.html#fruitfly

    Files were modified by Kaggle to be a better fit for analysis on Scripts. This primarily involved turning files into CSV format, with a header row, as well as converting the genome itself from 2bit format into a FASTA sequence file.

    Bioinformatics

    Genomic analysis can be daunting to data scientists who haven't had much experience with bioinformatics before. We have tried to give basic explanations of each of the files in this dataset, as well as links to further reading on the biological basis for each. If you haven't had the chance to study much biology before, some light reading (e.g., Wikipedia) on the following topics may be helpful to understand the nuances of the data provided here: [Genetics][3], [Genomics]4, [Chromosomes][7], [DNA][8], [RNA]9, [Genes][12], [Alleles][13], [Exons][14], [Introns][15], [Transcription][16], [Translation][17], [Peptides][18], [Proteins][19], [Gene Regulation][20], [Mutation][21], [Phylogenetics][22], and [SNPs][23].

    Of course, if you've got some idea of the basics already - don't be afraid to jump right in!

    Learning Bioinformatics

    There are a lot of great resources for learning bioinformatics on the web. One cool site is [Rosalind][24] - a platform that gives you bioinformatic coding challenges to complete. You can use Kaggle Scripts on this dataset to easily complete the challenges on Rosalind (and see [Myles' solutions here][25] if you get stuck). We have set up [Biopython][26] on Kaggle's docker image which is a great library to help you with your analyses. Check out their [tutorial here][27] and we've also created [a python notebook with some of the tutorial applied to this dataset][28] as a reference.

    Files in this Dataset

    Drosophila Melanogaster Genome

    • genome.fa

    The assembled genome itself is presented here in [FASTA format][29]. Each chromosome is a different sequence of nucleotides. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.
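    Because repeats are marked only by letter case, the repeat content of a sequence can be estimated by counting lower-case bases. A minimal Python sketch follows; the function name `masked_fraction` is hypothetical, and the sequence is assumed to be already loaded as a string (for a whole chromosome you would read it from genome.fa first).

```python
def masked_fraction(sequence: str) -> float:
    """Fraction of bases written in lower case (soft-masked repeats)."""
    bases = [c for c in sequence if c.isalpha()]  # ignore whitespace and digits
    if not bases:
        return 0.0
    return sum(c.islower() for c in bases) / len(bases)
```

    For instance, `masked_fraction("ACGTacgtNN")` counts four lower-case letters out of ten. Note that upper-case N (unknown base) is counted as unmasked sequence here, which may or may not be what an analysis wants.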

    Meta Information

    There are 3 additional files with meta information about the genome.

    • meta-cpg-island-ext-unmasked.csv

    This file contains descriptive information about CpG Islands in the genome.

    https://en.wikipedia.org/wiki/CpG_site

    • meta-cytoband.csv

    This file describes the positions of cytogenetic bands on each chromosome.

    https://en.wikipedia.org/wiki/Cytogenetics

    • meta-simple-repeat.csv

    This file describes simple tandem repeats in the genome.

    https://en.wikipedia.org/wiki/Repeated_sequence_(DNA) https://en.wikipedia.org/wiki/Tandem_repeat

    Drosophila Melanogaster mRNA Sequences

    Messenger RNA (mRNA) is an intermediate molecule created as part of the cellular process of converting genomic information into proteins. Some mRNA are never translated into proteins and have functional roles in the cell on their own. Collectively, organism mRNA information is known as a Transcriptome. mRNA files included in this dataset give insight into the activity of genes in the organism.

    https://en.wikipedia.org/wiki/Messenger_RNA

    • mrna-genbank.fa

    This file includes all mRNA sequences from GenBank associated with Drosophila Melanogaster.

    http://www.ncbi.nlm.nih.gov/genbank/

    • mrna-refseq.fa

    This file includes all mRNA sequences from RefSeq associated with Drosophila Melanogaster.

    http://www.ncbi.nlm.nih.gov/refseq/

    Gene Predictions

    A gene is a segment of DNA on the genome which, through mRNA, is used to create proteins in the organism. Knowing which parts of DNA are coding (genes) or non-coding is difficult, and a number of different systems for prediction exist. This da...
