This dataset has BioSample (SAMN accession id) submitter and date information, gathered from NCBI. Only BioSample accession ids mapped by EuropePMC as of 04.09.2025 have been included. If you want to include accession ids from newer mappings, you need to re-download biosample.csv from https://europepmc.org/pub/databases/pmc/TextMinedTerms/
Here is how to recreate the biosample_info.parquet dataset:
import pandas as pd
from Bio import Entrez
import time
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
import requests
from urllib.parse import urlencode
import os
from dotenv import load_dotenv
import joblib

# Load environment variables from .env file
load_dotenv()

# Load the BioSample IDs from the joblib file; these are the BioSample-mapped
# accession_ids from EuropePMC, downloaded on 04.09.2025.
# If you want newer data, download the EuropePMC biosample.csv file again and
# extract the unique BioSample IDs like this:
# df_biosample = pd.read_csv("biosample.csv")  # Edit the path if needed
# biosample_unique_in_eupmc = df_biosample.biosample.unique()
biosample_unique_in_eupmc = joblib.load('biosample_unique_in_eupmc.joblib')  # Edit the path if needed, or skip this if you re-downloaded biosample.csv

# Set email (required by NCBI) and API key for higher rate limits
Entrez.email = "your_email"  # Make sure to add your e-mail here
Entrez.api_key = os.getenv('NCBI_API_KEY')  # Add this to your .env file; you get a key when you sign up at NCBI, it's free

if Entrez.api_key:
    print("✓ NCBI API key loaded successfully - using 10 requests/second limit")
else:
    print("⚠ No NCBI API key found - using 3 requests/second limit")

def process_batch(batch_ids):
    """Process a batch of biosample IDs with NCBI compliance"""
    try:
        id_list = ','.join(batch_ids)
        # Call the EFetch endpoint directly with NCBI-compliant parameters
        base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
        params = {
            'db': 'biosample',
            'id': id_list,
            'retmode': 'xml',
            'email': Entrez.email,
            'tool': 'python_script'  # Identify your tool
        }
        # Add API key if available
        if hasattr(Entrez, 'api_key') and Entrez.api_key:
            params['api_key'] = Entrez.api_key
        response = requests.get(base_url, params=params, timeout=30)
        response.raise_for_status()
        # Parse XML
        root = ET.fromstring(response.content)
        batch_results = {}
        for biosample in root.findall('.//BioSample'):
            acc = biosample.get('accession')
            # Simplified submitter extraction
            submitter_parts = []
            # Organization
            owner = biosample.find('.//Owner/Name')
            if owner is not None and owner.text:
                submitter_parts.append(owner.text)
            # Contact person (simplified)
            contact = biosample.find('.//Owner/Contacts/Contact')
            if contact is not None:
                first = contact.find('.//Name/First')
                last = contact.find('.//Name/Last')
                if first is not None and last is not None and first.text and last.text:
                    submitter_parts.append(f"{first.text} {last.text}")
            submitter = ", ".join(submitter_parts) if submitter_parts else "Unknown"
            # Date (simplified): prefer submission_date, fall back to publication_date
            date = biosample.get('submission_date', biosample.get('publication_date', 'Unknown'))
            if date != 'Unknown':
                date = date.split('T')[0]
            if acc:
                batch_results[acc] = f"{submitter}; {date}"
        return batch_results
    except Exception as e:
        print(f"Error in batch: {e}")
        return {}

# For faster processing with all IDs
def process_all_biosamples_fast(biosample_ids, batch_size=200, max_workers=1):
    """Process all biosamples with NCBI-compliant rate limiting"""
    print(f"Processing {len(biosample_ids)} biosamples...")
    print("Following NCBI guidelines: max 3 requests/second (or 10 with API key)")
    # Create batches (NCBI recommends 200-500 IDs per batch)
    batches = [biosample_ids[i:i+batch_size] for i in range(0, len(biosample_ids), batch_size)]
    all_results = {}
    processed = 0
    # Calculate delay based on whether API key is available
    has_api_key = hasattr(Entrez, 'api_key') and Entrez.api_key
    delay = 0.1 if has_api_key else 0.34  # 10/sec with key, 3/sec without
    print(f"Using {delay}s delay between requests ({'with' if has_api_key else 'without'} API key)")
    # Process batches sequentially to respect rate limits
    for i, batch in enumerate(batches):
        # The original snippet is truncated from here on; the loop body below is
        # a plausible reconstruction, not the author's exact code.
        print(f"Batch {i + 1}/{len(batches)}")
        all_results.update(process_batch(list(batch)))
        processed += len(batch)
        time.sleep(delay)
    return all_results
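To produce the final parquet file, a minimal usage sketch follows (an assumption, not part of the original snippet; the column names are illustrative):

results = process_all_biosamples_fast(list(biosample_unique_in_eupmc))
df_info = pd.DataFrame(list(results.items()), columns=['biosample', 'submitter_and_date'])  # column names assumed
df_info.to_parquet('biosample_info.parquet', index=False)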
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🧬 PMC Corpus Harvester
Build a disease-specific literature corpus in one evening! Gene-agnostic PubMed Central corpus builder for rare disease research. Downloads HTML + metadata from PMC for building RAG (Retrieval-Augmented Generation) systems. Originally developed for STXBP1-ARIA and SNAP25-ARIA projects.
✨ Features
Feature | Description
Gene-Agnostic | Configure for any gene or disease with YAML/JSON
NCBI API Support | 10 req/sec with API key (vs 3/sec… See the full description on the dataset page: https://huggingface.co/datasets/SkyWhal3/PMC-Corpus-Harvester.
This dataset was collected using the NCBI esearch API. It was processed by selecting the relevant fields (title, authors, and abstract), which were written into a single large text file (.txt) using a custom Bash script.
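As a rough illustration of this kind of collection step, here is a minimal Python sketch using Biopython's Entrez and Medline modules (the query term, retmax, and output file are placeholders; the original processing used a custom Bash script):

from Bio import Entrez, Medline

Entrez.email = "your_email@example.com"  # required by NCBI
handle = Entrez.esearch(db="pubmed", term="example query", retmax=100)  # placeholder query
pmids = Entrez.read(handle)["IdList"]
handle = Entrez.efetch(db="pubmed", id=",".join(pmids), rettype="medline", retmode="text")
with open("corpus.txt", "w") as out:
    for rec in Medline.parse(handle):
        # TI = title, AU = authors, AB = abstract in the MEDLINE format
        out.write(f"{rec.get('TI', '')}\n{'; '.join(rec.get('AU', []))}\n{rec.get('AB', '')}\n\n")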
License: https://www.bco-dmo.org/dataset/747872/license
This dataset includes metadata associated with NCBI BioProject PRJNA377729, "Impacts of Evolution on the Response of Phytoplankton Populations to Rising CO2" (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA377729). The alga Heterosigma akashiwo was grown at CO2 levels from about 200 to 1000 ppm, and then the DNA and RNA were sequenced.
Access formats: .htmlTable, .csv, .json, .mat, .nc, .tsv
Acquisition description: Uni-algal, non-axenic cultures of Heterosigma akashiwo (CCMP2393) were grown in L1 medium (without silicate) made with a Long Island Sound seawater base collected from Avery Point, CT, USA (salinity 32) at 18°C with a 14:10 (light:dark) cycle and an irradiance of approximately 100 µmol m-2 s-1. Cells were acclimated in exponential growth phase to different carbonate chemistries in 1.2 L of L1 media in 2.5-L polycarbonate bottles. To control the carbonate chemistry of the water, the headspace of each bottle was purged continuously with a custom gas mixture of ~21% oxygen, ~79% nitrogen and either 200, 400, 600, 800 or 1000 ppmv CO2 (TechAir, NY).
At the point of harvest, 150 mL (~6 x 10^6 cells) were filtered onto a 5 µm pore size, 25 mm polycarbonate filter and flash frozen in liquid nitrogen. Genetic material from samples was extracted with the RNeasy Mini kit (Qiagen, Valencia, CA) and DNA was removed on-column using the RNase-free DNase Set (Qiagen), yielding total RNA. Total RNA extracts of the triplicate cultures were quantified on a 2100 Bioanalyzer (Agilent, Santa Clara, CA). Libraries were prepared using poly-A pull-down with the TruSeq Stranded mRNA Library Prep kit (Illumina, San Diego, CA). Library preparation, barcoding, and sequencing from each library were performed by the JP Sulzberger Columbia University Genome Center (New York, NY).
Sequence reads were de-multiplexed and trimmed to remove sequencing barcodes. Reads were aligned using Bowtie2 (Langmead and Salzberg 2012) to the MMETSP consensus contigs for Heterosigma akashiwo CCMP2393 (https://omictools.com/marine-microbial-eukaryotic-transcriptome-sequencing-project-tool).
Significant differences between physiological parameters by CO2 treatment were assessed with analysis of variance (ANOVA) and Tukey's honestly significant differences test (aov and TukeyHSD, stats, R). Differential expression of genes in any CO2 treatment compared to modern was determined using the general linear model (GLM) exact test (edgeR, R). Briefly, the read counts were normalized by trimmed mean of M-values (TMM) using the function calcNormFactors, tagwise dispersions were calculated with the function estimateGLMTagwiseDisp, a GLM was fit using glmFit, and log2 fold change (logFC) for each treatment was calculated relative to average expression at modern CO2. P-values from likelihood ratio tests were corrected for multiple testing using the false discovery rate (fdr) method.
Award: OCE-1314336, NSF Division of Ocean Sciences (program manager: David L. Garrison); http://www.nsf.gov/awardsearch/showAward?AWD_ID=1314336
DOI: 10.1575/1912/bco-dmo.747872.1
Dataset page: https://www.bco-dmo.org/dataset/747872
Comment: Hak_acclim — the harmful alga Heterosigma akashiwo (CCMP2393) grown under a range of CO2 concentrations from 200-1000 ppm. PIs: S. Dyhrman (LDEO), J. Morris (U Alabama). Version: 2018-10-11. See also: https://www.ncbi.nlm.nih.gov/bioproject/377729
Instrument: Automated DNA Sequencer — Illumina HiSeq 2500 paired-end sequencing (PE100) with TruSeq RNA Sample Prep Kit (Illumina, San Diego, CA); used to prepare the mRNA libraries. Samples were barcoded for multiplex sequencing and run in a single lane by the Columbia University Genome Center (CUGC) (New York, NY).
People: Sonya T. Dyhrman (Lamont-Doherty Earth Observatory, Principal Investigator); James Jeffrey Morris (University of Alabama at Birmingham, Co-Principal Investigator); Gwenn Hennon (Lamont-Doherty Earth Observatory, Scientist); Nancy Copley (WHOI BCO-DMO, Data Manager).
Project: P-ExpEv — Impacts of Evolution on the Response of Phytoplankton Populations to Rising CO2 (2013-06 to 2017-05; experiment housed in laboratories at Michigan State University). Note: this project is also affiliated with the NSF BEACON Center for the Study of Evolution in Action.
Project description from the NSF award: Human activities are driving up atmospheric carbon dioxide concentrations at an unprecedented rate, perturbing the ocean's carbonate buffering system, lowering oceanic pH, and changing the concentration and composition of dissolved inorganic carbon. Recent studies have shown that this ocean acidification has many short-term effects on phytoplankton, including changes in carbon fixation among others. These physiological changes could have profound effects on phytoplankton metabolism and community structure, with concomitant effects on Earth's carbon cycle and, hence, global climate. However, extrapolation of present understanding to the field is complicated by the possibility that natural populations might evolve in response to their changing environments, leading to different outcomes than those predicted from short-term studies. Indeed, evolution experiments demonstrate that microbes are often able to rapidly adapt to changes in the environment, and that beneficial mutations are capable of sweeping large populations on time scales relevant to predictions of environmental dynamics in the coming decades. This project addresses two major areas of uncertainty for phytoplankton populations with the following questions: 1) What adaptive mutations to elevated CO2 are easily accessible to extant species, how often do they arise, and how large are their effects on fitness? 2) How will physical and ecological interactions affect the expansion of those mutations into standing populations? This study will address these questions by coupling experimental evolution with computational modeling of ocean biogeochemical cycles. First, cultured unicellular phytoplankton, representative of major functional groups (e.g. cyanobacteria, diatoms, coccolithophores), will be evolved under simulated year 2100 CO2 concentrations. From these experiments, estimates will be made of a) the rate of beneficial mutations, b) the magnitude of fitness gains conferred by these mutations, and c) secondary phenotypes (i.e., trade-offs) associated with these mutations, assayed using both physiological and genetic approaches. Second, an existing numerical model of the global ocean system will be modified to a) simulate the effects of changing atmospheric CO2 concentrations on ocean chemistry, and b) allow the introduction of CO2-specific adaptive mutants into the extant populations of virtual phytoplankton. The model will be used to explore the ecological and biogeochemical impacts of beneficial mutations in realistic environmental situations (e.g. resource availability, predation, etc.). Initially, the model will be applied to idealized sensitivity studies; then, as experimental results become available, the implications of the specific beneficial mutations observed in our experiments will be explored. This interdisciplinary study will provide novel, transformative understanding of the extent to which evolutionary processes influence phytoplankton diversity, physiological ecology, and carbon cycling in the near-future ocean. One of many important outcomes will be the development and testing of nearly-neutral genetic markers useful for competition studies in major phytoplankton functional groups, which has applications well beyond the current proposal.
=== Genome sequences ===
These are the different genome references (FASTA format) available for:
The two genome assemblies of S. habrochaites LA1777 and PI127826 were obtained through a combination of 10X Linked-Reads and BioNano optical mapping. This sequencing was funded by the DTL Technology Hotel 2018 funding scheme.
=== Transcriptomes and proteomes ===
=== Genome annotations files ===
Solanum lycopersicum:
Solanum lycopersicoides:
Solanum habrochaites:
References:
Tomato Genome Sequencing Consortium. 2012. The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485: 635–641.
Bolger et al. 2014. The genome of the stress-tolerant wild tomato species Solanum pennellii. Nature Genetics 46(9). http://www.nature.com/ng/journal/v46/n9/full/ng.3046.html
Hosmani et al. 2019. An improved de novo assembly and annotation of the tomato reference genome using single-molecule sequencing, Hi-C proximity ligation and optical maps. bioRxiv. https://www.biorxiv.org/content/10.1101/767764v1
Aflitos et al. 2014. Exploring genetic variation in the tomato (Solanum section Lycopersicon) clade by whole-genome sequencing. The Plant Journal. https://onlinelibrary.wiley.com/doi/full/10.1111/tpj.12616
Stam et al. 2019. The de novo reference genome and transcriptome assemblies of the wild tomato species Solanum chilense highlights birth and death of NLR genes between tomato species. G3: Genes, Genomes, Genetics 9(12): 3933-3941. https://doi.org/10.1534/g3.119.400529
PubMed-IV Dataset
The PubMed-IV dataset is derived from PubMed abstracts and metadata, collected using the NCBI E-utilities API. It includes structured fields such as title, abstract text (including structured sections like Conclusions when available), authors, journal metadata, and identifiers (PMID, DOI, etc.). No full-text articles are included. Data from PubMed, a service of the U.S. National Library of Medicine (NLM). PubMed data is in the public domain. NLM does not endorse… See the full description on the dataset page: https://huggingface.co/datasets/KevinZonda/PubMed-IV.
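Since this is a Hugging Face dataset, one plausible way to load it is with the datasets library (a sketch; the split name is an assumption, and the field names follow the description above):

from datasets import load_dataset

pubmed_iv = load_dataset("KevinZonda/PubMed-IV", split="train")  # split name assumed
print(pubmed_iv[0])  # expected fields include title, abstract, authors, and identifiers (PMID, DOI)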
This dataset uses the LitVar API, a service provided by NCBI, to retrieve relevant literature for a given variant.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This submission presents metagenomic sequencing identifications and mass-spectrometry datasets for the following publication:
Brinkmann, B.W.; Zhiling, G.; Vijver, M.G.; Peijnenburg, W.J.G.M.; Chetwynd, A.J. Host and microbiome proteins in eco-coronas: abundance, physicochemical properties and binding partners. Environ. Sci.: Nano. 2025, DOI: 10.1039/d5en00493d.
1. Metagenomic sequencing identifications
Tab-delimited text files with genus-level read abundances identified in whole-body metagenomes of:
The data in both files were generated using the Pavian webtool (https://fbreitwieser.shinyapps.io/pavian/), accessed on 25 October 2024. Columns with read abundances present results for 3 biological replicates and an extraction kit metagenome (blank). The TaxId column presents the Taxonomy Identifier from the NCBI Taxonomy Browser. The associated metagenomic data are deposited in the NCBI Sequence Read Archive under BioProject ID PRJNA1336773 (http://www.ncbi.nlm.nih.gov/bioproject/1336773).
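A minimal sketch of reading one of these genus-level tables with pandas (the file name is hypothetical; the column layout follows the description above):

import pandas as pd

genus = pd.read_csv("genus_read_abundances.tsv", sep="\t")  # hypothetical file name
# Per the description, columns hold read abundances for 3 biological replicates,
# an extraction-kit blank, and a TaxId column with NCBI Taxonomy identifiers.
print(genus.head())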
2. Proteomic datasets
Tab-delimited mass-spectrometry datasets and metadata obtained for experiments with:
Sample names in the mass spectrometry datasets consist of the following three elements: {sample type}_{microbiome condition}_{replicate}
where:
Accession numbers were obtained from the UniProt KB protein knowledgebase.
Column names for metadata present:
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Honey bees (Apis mellifera), a critical agricultural pollinator in many areas, have a high rate of infection with a large DNA virus, Apis mellifera filamentous virus (AmFV), yet little is known about its ecology or impact on honey bee colonies, other than its ubiquity and apparent low virulence. This study scanned over 5,000 public data sets to detect AmFV sequences in honey bees as well as in a parasitic mite of honey bees, Varroa destructor, which is a potential vector of AmFV. The data release consists of these files:
1. AmFV.genome.assemblies.aligned.fas, which contains new AmFV draft genome sequences generated by this study, aligned with existing reference genome accessions downloaded from the National Center for Biotechnology Information (NCBI).
2. kmer.list.txt, a list of kmers that were extracted from reference sequences and searched for in Sequence Read Archive (SRA) accessions.
3. sample.metadata.txt, which lists all accessions of the SRA, and NCBI database of high-throughp ...
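As a rough illustration of the k-mer step described above, a minimal Python sketch of extracting the set of k-mers from a reference sequence follows (the k value and input string are placeholders; the study's actual extraction and SRA search pipeline is not part of this record):

def kmers(sequence, k=31):  # k=31 is a common choice, assumed here
    # Return the set of all k-length substrings of a DNA sequence
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

reference = "ATGGCGTACGTTAGCATGGC"  # placeholder; real input would come from the AmFV reference FASTA
print(len(kmers(reference, k=5)))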
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by extracting academic research papers from PubMed (https://pubmed.ncbi.nlm.nih.gov) with the official PubMed API. It was created solely for educational purposes; the creator intends no malpractice with this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains a comprehensive dataset focused on Parkinson's Disease. We provide data extracted via web scraping, along with metadata resulting from the extraction process using the NCBI API. The data pertains to the article titled 'A bibliometric study on Parkinson's Disease based on the open access data of the Michael J. Fox Foundation'.
Metadata Description
Analisys_MJFF_05_04_2024.xlsx
Field | Description | Data Type
AU | List of authors in abbreviated format. | Text
AF | List of authors with full names. | Text
TI | Full title of the article. | Text
SO | Name of the journal or publication. | Text
SO_CO | Country of origin of the publication. | Text
LA | Language of the article. | Text
DT | Type of document, such as "Journal Article". | Text
DE | Keywords or descriptors associated with the article. | Text
MESH | MeSH terms that describe the content of the article. | Text
DI | Digital Object Identifier (DOI). | Text
PG | Number of pages or page range. | Numeric
GRANT_ID | Identification of funding, when available. | Text
GRANT_ORG | Organization that provided the funding. | Text
UT, PMID | Unique identifiers of the article. | Numeric
DB | Name of the database where the article is indexed. | Text
AU_UN | Information about the academic unit or institution of the authors. | Text
References_MJFF_v2_Final_Corrected.csv
Field | Description | Data Type
Title | Name of the article or publication. | Text
Authors | List of authors who contributed to the article. | Text
Journal Name | Name of the journal or periodical where the article was published. | Text
Publisher | Name of the publisher who published the article. | Text
Volume | Volume number of the journal in which the article appears. | Numeric or Text
Edition Number | Number of the edition of the journal in which the article is found. | Numeric or Text
Starting Page | Number of the first page of the article in the publication. | Numeric
Ending Page | Number of the last page of the article. | Numeric
Publication Date | Date on which the article was published. | Date
Open Access Status | Indicates whether the article is available in open access. | Text
License | Type of license under which the article was published. | Text
DOI (Digital Object Identifier) | Unique identifier for the article that provides a permanent link to online access. | Text
OA Location URL | Direct URL to the article, if available in open access. | Text
Citation Count | Number of times the article has been cited by other publications. | Numeric
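A minimal sketch of loading the references file and inspecting citation counts with pandas (the column spellings follow the field table above but should be checked against the actual CSV):

import pandas as pd

refs = pd.read_csv("References_MJFF_v2_Final_Corrected.csv")
top_cited = refs.sort_values("Citation Count", ascending=False).head(10)  # most-cited references
print(top_cited[["Title", "Citation Count"]])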
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Automatically identifying chemical and drug names in scientific publications advances information access for this important class of entities in a variety of biomedical disciplines by enabling improved retrieval and linkage to related concepts. While current methods for tagging chemical entities were developed for the article title and abstract, their performance in the full article text is substantially lower. However, the full text frequently contains more detailed chemical information, such as the properties of chemical compounds, their biological effects, and interactions with diseases, genes, and other chemicals.
We, therefore, present the NLM-Chem corpus, a full-text resource to support the development and evaluation of automated chemical entity taggers. The NLM-Chem corpus consists of 150 full-text articles, doubly annotated by ten expert NLM indexers, with ~5000 unique chemical name annotations, mapped to ~2000 MeSH identifiers. Using this corpus, we built a substantially improved chemical entity tagger, with automated annotations for all of PubMed and PMC freely accessible through the PubTator web-based interface and API.
Methods
The NLM-Chem corpus consists of 150 full-text articles from the PubMed Central Open Access dataset, comprising 67 different chemical journals, aiming to cover a general distribution of usage of chemical names in the biomedical literature. Articles were selected so that human annotation was most valuable (meaning that they were rich in bio-entities, and current state-of-the-art named entity recognition systems disagreed on bio-entity recognition).
Ten indexing experts at the National Library of Medicine manually annotated the corpus using the TeamTat annotation system, which allows swift annotation project management. The corpus was annotated in three batches, and each batch of articles was annotated in three annotation rounds. Annotators were randomly paired for each article, and pairings were randomly shuffled for each subsequent batch. In this manner, the workload was distributed fairly. To control for bias, annotator identities were hidden for the first two annotation rounds. In the final annotation round, annotators worked collaboratively to resolve the final few annotation disagreements and reach 100% consensus.
The full-text articles were fully annotated for all chemical name occurrences in text, and the chemicals were mapped to Medical Subject Heading (MeSH) entries to facilitate indexing and other downstream article processing tasks at the National Library of Medicine. MeSH is part of the UMLS and as such, chemical entities can be mapped to other standard vocabularies.
The data has been evaluated for high annotation quality, and its use as training data has already improved chemical named entity recognition in PubMed. The newly improved system has already been incorporated in the PubTator API tools (https://www.ncbi.nlm.nih.gov/research/pubtator/api.html).
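As a sketch of retrieving those automated annotations programmatically, here is a minimal call to the PubTator export service (the endpoint, parameters, and response structure below are assumptions to verify against the PubTator API documentation linked above):

import requests

url = "https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson"
resp = requests.get(url, params={"pmids": "29446767"}, timeout=30)  # example PMID, arbitrarily chosen
resp.raise_for_status()
doc = resp.json()  # one BioC JSON document for the article
for passage in doc.get("passages", []):
    for ann in passage.get("annotations", []):
        if ann.get("infons", {}).get("type") == "Chemical":
            print(ann["text"], ann["infons"].get("identifier"))  # chemical mention and its MeSH id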
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NCBIfam is a collection of protein families featuring curated multiple sequence alignments, hidden Markov models (HMMs), and annotation; it provides a tool for identifying functionally related proteins based on sequence homology. NCBIfam is maintained at the National Center for Biotechnology Information (Bethesda, MD). NCBIfam includes models from TIGRFAMs, another database of protein families, developed at The Institute for Genomic Research and later at the J. Craig Venter Institute (Rockville, MD, US).
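Collections of profile HMMs like NCBIfam are typically searched with HMMER; a minimal sketch of driving HMMER's hmmscan from Python follows (the file paths are placeholders, and the HMM database must first be prepared with hmmpress):

import subprocess

# Search protein sequences against the NCBIfam profile HMMs (paths are hypothetical)
subprocess.run(
    ["hmmscan", "--tblout", "hits.tbl", "NCBIfam.hmm", "proteins.faa"],
    check=True,
)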
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We analysed the field of expression profiling by high-throughput sequencing (HT-seq) in terms of replicability and reproducibility, using data from the NCBI GEO (Gene Expression Omnibus) repository.
- This release includes GEO series published up to Dec 31, 2020.
The geo-htseq.tar.gz archive contains the following files:
- output/parsed_suppfiles.csv, p-value histograms, histogram classes, estimated number of true null hypotheses (pi0).
- output/document_summaries.csv, document summaries of NCBI GEO series.
- output/suppfilenames.txt, list of all supplementary file names of NCBI GEO submissions.
- output/suppfilenames_filtered.txt, list of supplementary file names used for downloading files from NCBI GEO.
- output/publications.csv, publication info of NCBI GEO series.
- output/scopus_citedbycount.csv, Scopus citation info of NCBI GEO series.
- output/spots.csv, NCBI SRA sequencing run metadata.
- output/cancer.csv, cancer related experiment accessions.
- output/transcription_factor.csv, TF related experiment accessions.
- output/single-cell.csv, single cell experiment accessions.
- blacklist.txt, list of supplementary files that were either too large to import or were causing computing environment crash during import.
The workflow to produce this dataset is available on GitHub at rstats-tartu/geo-htseq.
The geo-htseq-updates.tar.gz archive contains the following files:
- results/detools_from_pmc.csv, differential expression analysis programs inferred from published articles
- results/n_data.csv, manually curated sample size info for NCBI GEO HT-seq series
- results/simres_df_parsed.csv, pi0 values estimated from differential expression results obtained from simulated RNA-seq data
- results/data/parsed_suppfiles_rerun.csv, pi0 values estimated using smoother method from anti-conservative p-value sets
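The pi0 values reported in these files estimate the proportion of true null hypotheses behind a p-value set; as a generic illustration (not necessarily the exact estimator used for this dataset), a Storey-type estimate at a fixed lambda looks like this:

import numpy as np

def pi0_estimate(p_values, lam=0.5):
    """Share of p-values above lam, rescaled by the width of that tail."""
    p = np.asarray(p_values)
    return min(1.0, (p > lam).mean() / (1.0 - lam))

rng = np.random.default_rng(0)
p_toy = np.concatenate([rng.uniform(size=800), rng.beta(0.1, 10, size=200)])  # 80% true nulls
print(pi0_estimate(p_toy))  # close to 0.8 for this toy mixture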
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Next-generation sequencing (NGS) analysis of cell-free DNA provides valuable insights into a spectrum of pathogenic species (particularly bacterial) in blood. Patients with sepsis often face delays in treatment regimens (combinations or cocktails of antibiotics) due to the long turnaround time (TAT) of classical, standard blood-culture procedures. NGS gives results with a lower TAT along with high-depth coverage, so it may offer a way to decide treatment regimens for patients more accurately and without losing precious time, possibly saving lives.
Our curated dataset lists the bacterial species or strains detected, along with their genome sizes, in 107 AML patients clinically diagnosed with sepsis. Cell-free DNA profiles of patients were built, and sequencing was done on Illumina instruments (NovaSeq and NextSeq). Bioinformatic analysis was performed using two classification algorithms, kraken2 and kaiju. For kraken2-based classification, the reference bacterial index developed by Carlo Ferravante et al. (Zenodo 2020; https://zenodo.org/records/4055180) was used, while for kaiju-based classification the reference database named "nr_euk", dated 2023-05-10 (https://bioinformatics-centre.github.io/kaiju/downloads.html), was used.
Genome size annotation is important in metagenomics because computing depth of coverage (abundance) requires the genome size. Metagenomic classification algorithms like kraken/kraken2 and kaiju output only the reads assigned, not abundance. With kaiju the problem is more complicated, since the reference database does not include a FASTA file, only an index file from which alignment is done.
To address the above challenges and compute "depth of coverage" (or simply abundance), we built a genome size annotator tool (https://github.com/patkarlab/Genome-Size-Annotation) which provides the genome size for each species detected, given that its taxid is available. This tool uses the NCBI Datasets tool, an NCBI Genome API check tool, and data mining from AI search engines like perplexity.ai.
We have curated two datasets:
- Kraken2 dataset, named "FINAL METAGENOMIC DATA MASTERSHEET - kraken_genome_annotation"
- Kaiju dataset, named "FINAL METAGENOMIC DATA MASTERSHEET - kaiju_genome_annotation"
*Please note that for the kraken2-curated dataset we used data mining from the AI search engine perplexity.ai, while for kaiju we did not use perplexity.ai; any species whose genome size was not found was labeled "NA".
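As a sketch of the abundance computation that these genome sizes enable, depth of coverage can be estimated from the reads a classifier assigns to a species (a generic formula; the read length and counts below are toy values, and the tool's exact computation is not given in this record):

def depth_of_coverage(reads_assigned, read_length_bp, genome_size_bp):
    # Average depth = total sequenced bases attributed to the species / genome size
    return reads_assigned * read_length_bp / genome_size_bp

# Toy example: 200,000 assigned reads of 150 bp against a 5 Mb genome
print(depth_of_coverage(200_000, 150, 5_000_000))  # 6.0x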
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Islamaj, Rezarta; Leaman, Robert; Lu, Zhiyong (2021). NLM-Chem, a new resource for chemical entity recognition in PubMed full-text literature. Dryad, Dataset. https://doi.org/10.5061/dryad.3tx95x6dz
License: https://www.bco-dmo.org/dataset/745518/license
Seawater was collected via Niskin bottles mounted with a CTD at the San Pedro Ocean Time-series (SPOT) station off the coast of Southern California, near the surface (5 m) and at 150 and 890 m, in late May 2015. Raw sequence data were generated as part of a metatranscriptome study targeting the protistan community. Raw sequences are available at the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database (SRA Study ID: SRP110974, BioProject: PRJNA391503). Sequences for BioProject PRJNA608423 will be available at NCBI on Jan 1st, 2021. These data were published in Hu et al. (2018).
Access formats: .htmlTable, .csv, .json, .mat, .nc, .tsv
Acquisition description: Seawater was collected from the San Pedro Ocean Time-series (SPOT) station off the coast of Southern California near the surface (5 m), 150 and 890 m, in late May 2015. Briefly, seawater was pre-filtered (80 µm) into 20 L carboys to minimize the presence of multicellular eukaryotes. Replicate samples (ranging in volume from 1.5-3.5 L) from each depth were filtered onto sterile GF/F filters (nominal pore size 0.7 µm, Whatman International Ltd., Florham Park, NJ). While we cannot avoid some impact that sample handling (i.e., bringing samples to the surface) may have had on our results, filters were immediately placed in 1.5 mL of lysis buffer and flash frozen in liquid nitrogen in < 40 min and away from light to minimize RNA degradation.
Total RNA was extracted from each filter using a DNA/RNA AllPrep kit (Qiagen, Valencia, CA, #80204) with an in-line genomic DNA removal step (RNase-free DNase reagents, Qiagen #79254) (dx.doi.org/10.17504/protocols.io.hk3b4yn). Extracted RNA was quality checked and low-biomass samples were pooled. Six replicates were processed and sequenced from the surface, while pairs of filters were pooled for either 150 or 890 m, yielding 3 and 4 replicates respectively (Supporting Information Table S1). RNA concentrations were normalized before library preparation (Supporting Information). ERCC spike-in was added before sequence library preparation with Kapa's Stranded mRNA Library Preparation Kit, using poly-A tail selection beads to select for eukaryotic mRNA (Kapa Biosystems, Inc., Wilmington, MA, #KK8420).
Also see:
"%5C%22https://www.protocols.io/view/sample-collection-from-the-field-%0Afor-downstream-mo-hisb4eehttps://www.protocols.io/view/rna-and-optional-dna-%0Aextraction-from-environmental-hk3b4yn%5C%22">https://www.protocols.io/view/sample-collection-from-the-field-for- downs...
The associated assembly files can be found at Zenodo (see Hu, S. K. (2017), DOI: 10.5281/zenodo.1202041). The assembly files were also published in the journal publication Hu et al. (2018).
Related code can be found in the GitHub repository https://github.com/shu251/SPOT_metatranscriptome. The version of the code used for these publications can be found in the Supplemental Files section of this page.
Award: OCE-1737409, NSF Division of Ocean Sciences (program manager: David L. Garrison); http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=1737409
DOI: 10.26008/1912/bco-dmo.745518.2 (data version 2: 2020-02-26; final, no updates)
Dataset page: https://www.bco-dmo.org/dataset/745518
Comment: Microbial eukaryotic focused metatranscriptome data. PI: David Caron.
Instruments: Niskin bottles combined with a CTD, used to collect discrete water samples; Automated DNA Sequencer — HiSeq High Output 125 bp PE sequencing was performed at the UPC Genome Core at the University of Southern California, Los Angeles, CA (BioProject: PRJNA391503).
People: David Caron (University of Southern California, Principal Investigator); Sarah K. Hu (University of Southern California, Co-Principal Investigator and Contact); Amber D. York (WHOI BCO-DMO, Data Manager).
Project: SPOT — Protistan, prokaryotic, and viral processes at the San Pedro Ocean Time-series (end date 2021-07; San Pedro Channel off the coast of Los Angeles).
Project description: Planktonic marine microbial communities consist of a diverse collection of bacteria, archaea, viruses, protists (phytoplankton and protozoa) and small animals (metazoans). Collectively, these species are responsible for virtually all marine pelagic primary production, where they form the basis of food webs and carry out a large fraction of respiratory processes. Microbial interactions include the traditional role of predation, but recent research recognizes the importance of parasitism, symbiosis and viral infection. Characterizing the response of pelagic microbial communities and processes to environmental influences is fundamental to understanding and modeling carbon flow and energy utilization in the ocean, but very few studies have attempted to study all of these assemblages in the same study. This project comprises long-term (monthly) and short-term (daily) sampling at the San Pedro Ocean Time-series (SPOT) site. Analysis of the resulting datasets investigates co-occurrence patterns of microbial taxa (e.g. protist-virus and protist-prokaryote interactions, both positive and negative), indicating which species consistently co-occur and potentially interact, followed by examination of gene expression to help define the underlying mechanisms. This study augments 20 years of baseline studies of microbial abundance, diversity, and rates at the site, and will enable detection of low-frequency changes in composition and potential ecological interactions among microbes, and their responses to changing environmental forcing factors. These responses have important consequences for higher trophic levels and ocean-atmosphere feedbacks. The broader impacts of this project include training graduate and undergraduate students, providing local high school students with summer lab experiences, and PI presentations at local K-12 schools, museums, aquaria and informal learning centers in the region. Additionally, the PIs advise at the local, county and state level regarding coastal marine water quality. This research project is unique in that it is a holistic study (including all microbes from viruses to small metazoa) of microbial species diversity and ecological activities, carried out at the SPOT site off the coast of southern California. In studying all microbes simultaneously, this work aims to identify important ecological interactions among microbial species, and identify the basis(es) for those interactions. This research involves (1) extensive analyses of prokaryote (archaean and bacterial) and eukaryote (protistan and micro-metazoan) diversity via the sequencing of marker genes, (2) studies of whole-community gene expression by eukaryotes and prokaryotes in order to identify key functional characteristics of microorganismal groups and the detection of active viral infections, and (3) metagenomic analysis of viruses and bacteria to aid interpretation of transcriptomic analyses using genome-encoded information. The project includes exploratory metatranscriptomic analysis of poorly-understood aphotic and hypoxic-zone protists, to examine their stratification, functions and hypothesized prokaryotic symbioses.
License: https://www.bco-dmo.org/dataset/817436/license
To document the effects of storm-driven freshwater runoff on sponge-associated microbiomes, we leveraged the heavy rainfall associated with Tax Day Flooding (July 2016) and Hurricane Harvey (August 2017) to characterize sponge-associated bacterial communities at five time points: in July 2016 (at detection of the mortality event), one month after the mortality event (August 2016), immediately after Hurricane Harvey (September 2017), one month after Hurricane Harvey (October 2017), and approximately one year following Hurricane Harvey (October 2018).
These data contain Sequence Read Archive (SRA) and BioSample accession numbers associated with BioProject PRJNA605902 (see https://www.ncbi.nlm.nih.gov/bioproject/605902) at the National Center for Biotechnology Information.
Access formats: .htmlTable, .csv, .json, .mat, .nc, .tsv, .esriCsv, .geoJson
Location:
East and West Banks of the Flower Garden Banks National Marine Sanctuary (FGBNMS)
Sampling Events:
NOAA FGBNMS Cruise July 2016, NOAA FGBNMS Cruise August 2016
Hurricane Harvey FGB October 2017, Hurricane Harvey FGB October 2018
Methodology:
V4-16S bacterial community libraries were prepared, and PE 250 bp reads were generated using the Illumina MiSeq platform.
Sampling and analytical procedures:
Samples were flash frozen in liquid nitrogen and stored at -20°C until further processing. DNA was extracted from 250 mg of sponge sample using the NucleoSpin Soil DNA extraction kit (Takara Bio) or the DNeasy PowerSoil DNA extraction kit (QIAGEN).
Award: OCE-1800914, NSF Division of Ocean Sciences (program manager: Daniel Thornhill); http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=1800914
DOI: 10.26008/1912/bco-dmo.817436.1 (data version 1: 2020-07-23; final, no updates)
Dataset page: https://www.bco-dmo.org/dataset/817436
Comment: Sponge-associated microbial communities (via 16S-V4 rRNA amplicon sequencing) following storm-driven flooding. PI: Adrienne Simoes Correa.
Geospatial coverage: 27.8819 to 27.9078 degrees north, -93.62829 to -93.6002 degrees east, 16.5 to 28.0 m depth.
Instrument: Automated DNA Sequencer (Illumina MiSeq platform).
People: Adrienne Simoes Correa (Rice University, Principal Investigator); Lory Santiago-Vazquez (University of Houston-Clear Lake, Co-Principal Investigator); Amanda Shore (Rice University, Contact); Amber D. York (WHOI BCO-DMO, Data Manager).
Project: Rapid Reefs Harvey — RAPID: Collaborative Research: Impact of freshwater runoff from Hurricane Harvey on coral reef benthic organisms and associated microbial communities (2017-12 to 2019-11; Flower Garden Banks National Marine Sanctuary, northwest Gulf of Mexico).
NSF award abstract: Coral reefs are ecologically and economically important ecosystems, and are threatened by a variety of global (climate change) and local (overfishing, pollution) stressors. Anthropogenic climate change is increasing the frequency and severity of storms, which can physically damage reef structures and reduce reef health through changes in seawater quality. In August of 2017, Hurricane Harvey caused widespread flooding in southeast Texas when it released more than 50 trillion liters of rain, which then accumulated along the Texas Shelf. This runoff is expected to impact nearby coral reefs in the Flower Garden Banks National Marine Sanctuary (FGBNMS, northwest Gulf of Mexico) via eddies and jets that transport coastal waters offshore. Findings from this project will allow managers to quickly predict whether extreme storm events are likely to induce reef mortality and ecosystem decline due to freshwater accumulation, by tracking low-salinity water masses coupled with microbial community characterization and metrics of coral health. These data are critical to managing coastal ecosystems, including the high coral cover reefs in the FGBNMS, and will help stakeholders (e.g., diving and fishing communities) plan for and minimize disruption to their livelihoods following these storms. Results will be communicated broadly across scientific arenas, in graduate and undergraduate education and training programs, and to the general public through outreach. The investigators have seven 7-square-meter 2-D Reef Replicas from 2014 depicting representative FGBNMS reef bottoms, and will construct additional 2-D Reef Replicas from both banks following the arrival of Harvey runoff, allowing the public to directly experience and quantify the effects of Hurricane Harvey on local reefs using quadrats and identification guides. This project will also synergize with NSF REU programs at Boston University and Texas A&M University, providing transformative research experiences for undergraduates. One post-doctoral scholar, four graduate students, a technician and more than 5 undergraduates will be involved in all aspects of the research. All datasets will be made freely available to the public, and will serve as an important set of baselines for future lines of inquiry into the processes by which hurricanes and other extreme storms impact reef health. Hurricanes and other extreme storm events can decimate coral reefs through wave-driven physical damage. Freshwater runoff from extreme storms is also potentially detrimental to reefs but has received comparatively less attention. This research will provide unprecedented resolution on how hurricanes and other extreme storm events may trigger cascading interactions among water chemistry, declines in metazoan health and shifts in their associated microbial communities, ultimately resulting in coral reef decline. The freshwater runoff initiated by Hurricane Harvey is likely to impact reefs within the FGBNMS, one of the few remaining coral-dominated reefs in the greater Caribbean. The effects of Harvey runoff will be compared to a previously documented storm-driven runoff event that was associated with invertebrate mortality on the same reef system. Sampling seawater chemistry, microbial communities (water column and benthic), and host gene expression and proteomics before, immediately after, and six months after Harvey runoff enters the FGBNMS will allow us to identify commonalities among large-scale freshwater runoff events and track the response of benthic invertebrate health, microbial community diversity, and the trajectory of reef community recovery or decline. The investigators will determine if changes in water chemistry induce pelagic microbial shifts, if microbial communities typically associated with corals and sponges are altered, and whether feedbacks occur between these potential drivers of benthic invertebrate mortality.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Cannabis is a genus of flowering plants in the family Cannabaceae.
Source: https://en.wikipedia.org/wiki/Cannabis
In October 2016, Phylos Bioscience released a genomic open dataset of approximately 850 strains of Cannabis via the Open Cannabis Project. In combination with other genomics datasets made available by Courtagen Life Sciences, Michigan State University, NCBI, Sunrise Medicinal, University of Calgary, University of Toronto, and Yunnan Academy of Agricultural Sciences, the total amount of publicly available data exceeds 1,000 samples taken from nearly as many unique strains.
These data were retrieved from the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA), processed using the BWA aligner and FreeBayes variant caller, indexed with the Google Genomics API, and exported to BigQuery for analysis. Data are available directly from Google Cloud Storage at gs://gcs-public-data--genomics/cannabis, as well as via the Google Genomics API as dataset ID 918853309083001239, and an additional duplicated subset of only transcriptome data as dataset ID 94241232795910911, as well as in the BigQuery dataset bigquery-public-data:genomics_cannabis.
All tables in the Cannabis Genomes Project dataset have a suffix like _201703. The suffix is referred to as [BUILD_DATE] in the descriptions below. The dataset is updated frequently as new releases become available.
The following tables are included in the Cannabis Genomes Project dataset:
Sample_info contains fields extracted for each SRA sample, including the SRA sample ID and other data that give indications about the type of sample. Sample types include: strain, library prep methods, and sequencing technology. See SRP008673 for an example of upstream sample data. SRP008673 is the University of Toronto sequencing of Cannabis Sativa subspecies Purple Kush.
MNPR01_reference_[BUILD_DATE] contains reference sequence names and lengths for the draft assembly of Cannabis Sativa subspecies Cannatonic produced by Phylos Bioscience. This table contains contig identifiers and their lengths.
MNPR01_[BUILD_DATE] contains variant calls for all included samples and types (genomic, transcriptomic) aligned to the MNPR01_reference_[BUILD_DATE] table. Samples can be found in the sample_info table. The MNPR01_[BUILD_DATE] table is exported using the Google Genomics BigQuery variants schema. This table is useful for general analysis of the Cannabis genome.
MNPR01_transcriptome_[BUILD_DATE] is similar to the MNPR01_[BUILD_DATE] table, but it includes only the subset transcriptomic samples. This table is useful for transcribed gene-level analysis of the Cannabis genome.
Fork this kernel to get started with this dataset.
Dataset Source: http://opencannabisproject.org/
Category: Genomics
Use: This dataset is publicly available for anyone to use under the terms provided by the Dataset Source (https://www.ncbi.nlm.nih.gov/home/about/policies.shtml) and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Update frequency: As additional data are released to GenBank
View in BigQuery: https://bigquery.cloud.google.com/dataset/bigquery-public-data:genomics_cannabis
View in Google Cloud Storage: gs://gcs-public-data--genomics/cannabis
Banner photo by Rick Proctor from Unsplash.
Which Cannabis samples are included in the variants table?
Which contigs in the MNPR01_reference_[BUILD_DATE] table have the highest density of variants?
How many variants does each sample have at the THC Synthase gene (THCA1) locus?
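As a sketch of how such questions can be answered, here is a minimal query using the Google Cloud BigQuery Python client (the _201703 build-date suffix is taken from the example above; the call.call_set_name field follows the Google Genomics variants schema mentioned in the table description, but should be verified against the actual table):

from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT call.call_set_name AS sample, COUNT(1) AS n_variants
FROM `bigquery-public-data.genomics_cannabis.MNPR01_201703` AS v,
     UNNEST(v.call) AS call
GROUP BY sample
ORDER BY n_variants DESC
"""
for row in client.query(query).result():  # counts variant calls per sample
    print(row.sample, row.n_variants)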
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this repository we keep internal data for the microbetag microbial co-occurrence network annotator.
microbetag makes use of 2-column files for each genome, indicating each KO term found and a KEGG module in which this term takes part.
As a single KO term might participate in more than one KEGG module, the same KO may appear more than once in an annotation file.
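A minimal sketch of reading such a 2-column annotation file and grouping the KO terms per module (the file name and column order are assumptions based on the description above):

import pandas as pd

ann = pd.read_csv("genome_annotation.tsv", sep="\t", names=["ko", "module"])  # hypothetical file name
kos_per_module = ann.groupby("module")["ko"].apply(set)  # KO terms observed in each module
print(kos_per_module.head())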
Repository contents:
- md5:e3e62b305e64b27da7b80655d7f92f2c — for all the GTDB genomes, their corresponding PATRIC annotations were gathered; then, using modelseedpy, we constructed their genome-scale metabolic reconstructions.
- md5:cbcc9aa1a28a5bd5f6661f832d27bcbf — all representative genomes of GTDB (v.202) were parsed and their corresponding `.faa` files were retrieved from the NCBI FTP. Then the kofam_scan tool was used to annotate them, and finally a script was used to keep the KOs of each genome per module.
- A pickle file with the seeds of each GEM included in the gtdb_modelseed_gems.zip file, related to the KEGG MODULES based on the seedId_keggId_module.tsv file you can find on microbetag's GitHub page. Example: PATRIC SeedSet
- A pickle file with the non-seeds of each GEM included in the gtdb_modelseed_gems.zip file, related to the KEGG MODULES based on the seedId_keggId_module.tsv file you can find on microbetag's GitHub page. Example: PATRIC NonSeedSet
- md5:9e3f7a84fe7409ef0282ca5424797976 — a list of pickle files with the re-trained classes of phenDB for the prediction of functional traits of a genome.