This is a searchable historical collection of standards referenced in regulations - voluntary consensus standards, government-unique standards, industry standards, and international standards referenced in the Code of Federal Regulations (CFR).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline 10 lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers. We also outline the important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.
The Toxicity Reference Database (ToxRefDB) contains approximately 30 years and $2 billion worth of animal studies. ToxRefDB allows scientists and the interested public to search and download thousands of animal toxicity testing results for hundreds of chemicals that were previously found only in paper documents. Currently, there are 474 chemicals in ToxRefDB, primarily the data-rich pesticide active ingredients, but the number will continue to expand.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a complete dataset of linked bibliography and index data, partially disambiguated and augmented with references to external resources, extracted from Brill’s archive in the field of Classics. Processed book identifiers are listed in a separate text file. Text fragments extracted from different books via this process are parsed and compared using a string-based similarity metric to form clusters of bibliographic references to the same published work or (variants of) the same subjects discussed in these books. The entire set of references was then disambiguated using the Google Books and Crossref APIs.
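A minimal sketch of this kind of string-similarity clustering in Python. The metric here (difflib's ratio) and the 0.7 threshold are assumptions for illustration; the pipeline's actual metric and cutoffs are described in the linked paper.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1] (a stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_references(refs, threshold=0.7):
    """Greedy single-pass clustering: each reference joins the first
    cluster whose representative (its first member) is similar enough,
    otherwise it starts a new cluster."""
    clusters = []
    for ref in refs:
        for cluster in clusters:
            if similarity(ref, cluster[0]) >= threshold:
                cluster.append(ref)
                break
        else:
            clusters.append([ref])
    return clusters

# Two spelling variants of the same work, plus an unrelated one.
refs = [
    "Mommsen, Th., Römische Geschichte, Berlin 1854",
    "Mommsen, T. Roemische Geschichte. Berlin, 1854",
    "Wilamowitz-Moellendorff, U. von, Homerische Untersuchungen, 1884",
]
clusters = cluster_references(refs)
```

The two Mommsen variants fall into one cluster; the unrelated reference forms its own.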
Paper about extraction pipeline
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for the TotalControl reference wind farm database: simulation of a pressure-driven, high-Reynolds-number boundary layer flow with a 90 degree inflow wind direction angle (Casename PDk 90). Python files for loading and visualizing the data are included; use the plot_*.py files. Further information, including a description of the case and dataset, can be found in the deliverable report "Database for reference wind farms part 2: windfarm simulations" at: https://cordis.europa.eu/project/id/727680/results
The Human Protein Reference Database (HPRD) represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MDMcleaner reference database used at the time of publication.
Based on:
GTDB release r95
RefSeq release 203
SILVA version 138.1
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
built as described here: https://github.com/nasa/GeneLab_Data_Processing/blob/master/Metagenomics/Estimate_host_reads_in_raw_data/Workflow_Documentation/SW_MGEstHostReads/reference-database-info.md
Test fastq files hold 4 read pairs: 1 phage, 1 E. coli, 1 human, 1 mouse.
Background: The EST database provides a rich resource for gene discovery and in silico expression analysis. We report a novel computational approach to identify co-expressed genes using the EST database, and its application to IL-8. Results: IL-8 is represented in 53 dbEST cDNA libraries. We calculated the frequency of occurrence of all the genes represented in these cDNA libraries and ranked the candidates based on a Z-score. Additional analysis suggests that most IL-8-related genes are differentially expressed between non-tumor and tumor tissues. To focus on IL-8's function in tumor tissues, we further analyzed and ranked the genes in 16 IL-8-related tumor libraries. Conclusions: This method generated a reference database for genes co-expressed with IL-8 and could facilitate further characterization of functional associations among genes.
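The ranking step above can be illustrated with a standard z-score over library-occurrence counts. The gene names and counts below are invented for the example, and the exact statistic used in the study is defined in the article itself.

```python
from statistics import mean, stdev

def zscore_rank(counts):
    """Rank genes by the z-score of their library-occurrence counts.
    `counts` maps gene -> number of cDNA libraries the gene appears in."""
    mu = mean(counts.values())
    sigma = stdev(counts.values())
    z = {gene: (c - mu) / sigma for gene, c in counts.items()}
    # Highest z-score first: the strongest co-occurrence candidates.
    return sorted(z.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical occurrence counts across the 53 IL-8-positive libraries.
counts = {"IL8": 53, "GENE_A": 40, "GENE_B": 12, "GENE_C": 5, "GENE_D": 3}
ranking = zscore_rank(counts)
```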
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Microsoft Access Database for bibliometric analysis found in the article: Elaine M. Lasda Bergman, Finding Citations to Social Work Literature: The Relative Benefits of Using Web of Science, Scopus, or Google Scholar, The Journal of Academic Librarianship, Volume 38, Issue 6, November 2012, Pages 370-379, ISSN 0099-1333, http://dx.doi.org/10.1016/j.acalib.2012.08.002. (http://www.sciencedirect.com/science/article/pii/S009913331200119X) Abstract: Past studies of citation coverage of Web of Science, Scopus, and Google Scholar do not demonstrate a consistent pattern that can be applied to the interdisciplinary mix of resources used in social work research. To determine the utility of these tools to social work researchers, an analysis of citing references to well-known social work journals was conducted. Web of Science had the fewest citing references and almost no variety in source format. Scopus provided higher citation counts, but the pattern of coverage was similar to Web of Science. Google Scholar provided substantially more citing references, but only a relatively small percentage of them were unique scholarly journal articles. The patterns of database coverage were replicated when the citations were broken out for each journal separately. The results of this analysis demonstrate the need to determine what resources constitute scholarly research and reflect the need for future researchers to consider the merits of each database before undertaking their research. This study will be of interest to scholars in library and information science as well as social work, as it facilitates a greater understanding of the strengths and limitations of each database and brings to light important considerations for conducting future research. Keywords: Citation analysis; Social work; Scopus; Web of Science; Google Scholar
This data submission links to a user reference guide for the SOLTHERM thermodynamic database maintained by the University of Oregon. The data at this link are not 'data results' from sampling. These data are derived from SOLTHERM as a reference for the user, showing balanced reactions and equilibrium constants log K(T,P) along the liquid-vapor saturation curve only, up to 350 degrees C, for aqueous species and minerals (including REE) and gases. These data are more easily read by the user than those in the SOLTHERM thermodynamic database.
Revision Date of Geo-Reference Database - iG1000
The TIGER/Line Shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) System. The MAF/TIGER System represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line Shapefile is designed to stand alone as an independent data set, or the shapefiles can be combined to cover the entire nation. The TIGERweb REST Services allow users to integrate the Census Bureau's TIGER data into their own GIS or custom web-based applications. For a more detailed description of the areas or terms listed, refer to the TIGER/Line documentation or the Geographic Areas Reference Manual (GARM).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction:
This sequence database (MARMICRODB) was introduced in the publication JW Becker, SL Hogle, K Rosendo, and SW Chisholm. 2019. Co-culture and biogeography of Prochlorococcus and SAR11. ISME J. doi:10.1038/s41396-019-0365-4. Please see the original publication and its associated supplementary material for the original description of this resource.
Motivation:
We needed a reference database to annotate shotgun metagenomes from the Tara Oceans project [1], the GEOTRACES cruises GA02, GA03, GA10, and GP13, and the HOT and BATS time series [2]. Our interests are primarily in quantifying and annotating the free-living, oligotrophic bacterial groups Prochlorococcus, Pelagibacterales/SAR11, SAR116, and SAR86 from these samples using the protein classifier tool Kaiju [3]. Kaiju’s sensitivity and classification accuracy depend on the composition of the reference database, and the highest sensitivity is achieved when the reference database contains a comprehensive representation of expected taxa from the environment/sample of interest. However, the speed of the algorithm decreases as database size increases. Therefore, we aimed to create a reference database that maximized the representation of sequences from marine bacteria, archaea, and microbial eukaryotes, while minimizing (but not excluding) the sequences from clinical, industrial, and terrestrial host-associated samples.
Results/Description:
MARMICRODB consists of 56 million non-redundant protein sequences from 18769 bacterial/archaeal/eukaryote genome and transcriptome bins and 7492 viral genomes, optimized for use with the protein homology classifier Kaiju [3]. To ensure maximum representation of marine bacteria, archaea, and microbial eukaryotes, we included translated genes/transcripts from 5397 representative “specI” species clusters from the proGenomes database [4]; 113 transcriptomes from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) [5]; 10509 metagenome assembled genomes from the Tara Oceans expedition [6,7], the Red Sea [8], the Baltic Sea [9], and other aquatic and terrestrial sources [10]; 994 isolate genomes from the Genomic Encyclopedia of Bacteria and Archaea [11]; 7492 viral genomes from NCBI RefSeq [12]; 786 bacterial and archaeal genomes from MarRef [13]; and 677 marine single cell genomes [14]. In order to annotate metagenomic reads at the clade/ecotype level (subspecies) for the focal taxa Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116, we generated custom MARMICRODB taxonomies based on curated genome phylogenies for each group. The curated phylogenies, Kaiju-formatted Burrows-Wheeler index, translated genes, custom taxonomy hierarchy, an interactive krona plot of the taxonomic composition, and scripts and instructions for how to use or rebuild the resource are available from 10.5281/zenodo.3520509.
Methods:
The curation and quality control of MARMICRODB single cell, metagenome assembled, and isolate genomes was performed as described in [15]. Briefly, we downloaded all MARMICRODB genomes as raw nucleotide assemblies from NCBI. We determined an initial genome taxonomy for these assemblies using checkM with the default lineage workflow [16]. All genome bins met the completion/contamination thresholds outlined in prior studies [7,17]. For single cell and metagenome assembled genomes, especially those from Tara Oceans Mediterranean Sea samples [18], we used the GTDB-Tk classification workflow [19] to verify the taxonomic fidelity of each genome bin. We then selected genomes with a checkM taxonomic assignment of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, or SAR116 for further analysis and confirmed taxonomic assignments using BLAST matches to known Prochlorococcus/Synechococcus ITS sequences and by matching 16S sequences to the SILVA database [20]. To refine our estimates of completeness/contamination of Prochlorococcus genome bins, we created a custom set of 730 single copy protein families (available from 10.5281/zenodo.3719132) from closed, isolate Prochlorococcus genomes [21] for quality assessments with checkM. For Synechococcus we used the checkM taxonomic-specific workflow with the genus Synechococcus. After the custom checkM quality control, we excluded from downstream analysis any genome bins with an estimated quality < 30, defined as %completeness - 5x %contamination, resulting in 18769 genome/transcriptome bins. We predicted genes in the resulting genome bins using prodigal [22]; excluded protein sequences with lengths less than 20 or greater than 20000 amino acids; removed non-standard amino acid residues; and condensed redundant protein sequences to a single representative sequence, to which we assigned a lowest common ancestor (LCA) taxonomy identifier from the NCBI taxonomy database [23].
The resulting protein sequences were compiled and used to build a Kaiju [3] search database.
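The bin-quality cutoff and protein filters described above reduce to a few lines of Python. This is an illustrative re-implementation under the stated definitions (quality = %completeness - 5x %contamination, length bounds 20-20000 aa), not the pipeline's actual code.

```python
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def bin_quality(completeness: float, contamination: float) -> float:
    """Estimated quality as defined above: %completeness - 5 * %contamination."""
    return completeness - 5.0 * contamination

def keep_bin(completeness: float, contamination: float,
             threshold: float = 30.0) -> bool:
    """Genome bins with estimated quality < 30 were excluded."""
    return bin_quality(completeness, contamination) >= threshold

def keep_length(seq: str, min_len: int = 20, max_len: int = 20000) -> bool:
    """Predicted proteins outside 20-20000 aa were excluded."""
    return min_len <= len(seq) <= max_len

def clean_protein(seq: str) -> str:
    """Non-standard residues were removed from the remaining sequences."""
    return "".join(c for c in seq if c in STANDARD)
```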
The above filtering criteria resulted in 605 Prochlorococcus, 96 Synechococcus, 186 SAR11/Pelagibacterales, 60 SAR86, and 59 SAR116 high-quality genome bins. We constructed a high-quality fixed reference phylogenetic tree for each taxonomic group based on genomes manually selected for completeness and phylogenetic diversity. For example, the Prochlorococcus and Synechococcus genomes in the fixed reference phylogeny are estimated > 90% complete, and the SAR11 genomes are estimated > 70% complete. We created multiple sequence alignments of phylogenetically conserved genes from these genomes using the GTDB-Tk pipeline [19] with default settings. The pipeline identifies conserved proteins (120 bacterial proteins) and generates concatenated multi-protein alignments [17] from the genome assemblies using hmmalign from the hmmer software suite. We further filtered the resulting alignment columns using the bacterial and archaeal alignment masks from [17] (http://gtdb.ecogenomic.org/downloads). We removed columns represented by fewer than 50% of all taxa and/or columns with no single amino acid residue occurring at a frequency greater than 25%. We trimmed the alignments using trimal [24] with the automated -gappyout option to trim columns based on their gap distribution. We inferred reference phylogenies using multithreaded RAxML [25] with the GAMMA model of rate heterogeneity, empirically determined base frequencies, and the LG substitution model [26] (PROTGAMMALGF). Branch support is based on 250 resampled bootstrap trees. This tree was then pruned to allow a maximum average distance to the closest leaf (ADCL) of 0.003 to reduce phylogenetic redundancy in the tree [27]. We then “placed” genomes that either did not pass the completeness threshold or were considered phylogenetically redundant by ADCL into the fixed reference phylogeny for each group using pplacer [28], representing each placed genome as a pendant edge in the final tree.
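The column-filtering rules above (drop columns under 50% occupancy, or with no residue above 25% frequency) can be sketched as follows. One assumption here: residue frequency is computed over all taxa, whereas the original masks may compute it over non-gap positions only.

```python
def filter_columns(alignment, min_occupancy=0.5, min_consensus=0.25):
    """Drop alignment columns that are less than 50% occupied (non-gap),
    or in which no single residue occurs at a frequency above 25%.
    `alignment` is a list of equal-length aligned sequences."""
    n = len(alignment)
    keep = []
    for col in zip(*alignment):
        residues = [c for c in col if c not in "-."]
        if len(residues) / n < min_occupancy:
            continue  # too gappy
        top = max(residues.count(r) for r in set(residues))
        if top / n <= min_consensus:
            continue  # no sufficiently dominant residue
        keep.append(col)
    if not keep:
        return ["" for _ in alignment]
    # Transpose the surviving columns back into row sequences.
    return ["".join(row) for row in zip(*keep)]
```

For example, a column present in only one of four taxa is removed, while well-occupied columns survive.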
We then examined the resulting tree and manually selected clade/ecotype cutoffs to be as consistent as possible with clade definitions previously outlined for these groups [29–32]. We then gave clades from each taxonomic group custom taxonomic identifiers and we added these identifiers to the MARMICRODB Kaiju taxonomic hierarchy.
Software/databases used:
checkM v1.0.11[16]
HMMERv3.1b2 (http://hmmer.org/)
prodigal v2.6.3 [22]
trimAl v1.4.rev22 [24]
AliView v1.18.1 [33] [34]
Phyx v0.1 [35]
RAxML v8.2.12 [36]
Pplacer v1.1alpha [28]
GTDB-Tk v0.1.3 [19]
Kaiju v1.6.0 [3]
GTDB RS83 (https://data.ace.uq.edu.au/public/gtdb/data/releases/release83/83.0/)
NCBI Taxonomy (accessed 2018-07-02) [23]
TIGRFAM v14.0 [37]
PFAM v31.0 [38]
Discussion/Caveats:
MARMICRODB is optimized for metagenomic samples from the marine environment, in particular planktonic microbes from the pelagic euphotic zone. We expect this database may also be useful for classifying other types of marine metagenomic samples (for example mesopelagic, bathypelagic, or even benthic or marine host-associated), but it has not been tested as such. The original purpose of this database was to quantify clades/ecotypes of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 in metagenomes from the Tara Oceans expedition and the GEOTRACES project. We carefully annotated and quality controlled genomes from these five groups, but the processing of the other marine taxa was largely automated and unsupervised. Taxonomy for other groups was copied over from the Genome Taxonomy Database (GTDB) [19,39] and NCBI Taxonomy [23], so any inconsistencies in those databases will be propagated to MARMICRODB. For most use cases MARMICRODB can probably be used unmodified, but if the user’s goal is to focus on a particular organism/clade that we did not curate in the database, then the user may wish to spend some time curating those genomes (i.e., checking for contamination, dereplicating, and building a genome phylogeny for custom taxonomy node assignment). Currently the custom taxonomy is hardcoded in the MARMICRODB.fmi index, but if users wish to modify MARMICRODB by adding or removing genomes, or by reconfiguring taxonomic ranks, the names.dmp and nodes.dmp files can easily be modified, as can the fasta file of protein sequences. However, the Kaiju index will then need to be rebuilt, and the user will require a high-memory machine to do so.
Beginning in 2023, certain agencies are required to submit one week of service data on a monthly basis to comply with FTA’s Weekly Reference reporting requirement on form WE-20. This data release will therefore present the limited set of key indicators reported by transit agencies on this form and will be updated each month with the most current data.
The resulting dataset provides data users with data shortly after the transit service was provided and consumed, more than one month ahead of FTA’s routine update to the Monthly Ridership Time Series dataset. One use of this data is as a reference for understanding ridership patterns (e.g., to develop a full-month estimate ahead of FTA’s release, at the end of the following month, of the data reflecting the given month of service).
Generally, FTA has defined the reference week to be the second or third full week of the month. All sampled agencies will report data referencing the same reference week.
The form collects the following service data points, as described in the metadata below:
• Weekday 5-day UPT total for the reference week;
• Weekday 5-day VRM total for the reference week;
• Weekend 2-day UPT total for either the weekend preceding or following the reference week;
• Weekend 2-day VRM total for either the weekend preceding or following the reference week; and
• Vehicles Operated in Maximum Service (vanpool mode only) for the reference week.
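One way to turn these reference-week totals into a full-month estimate is to scale the average weekday and weekend-day values by the month's day counts. This is an illustrative sketch only, not FTA's methodology:

```python
import calendar

def month_estimate(weekday_total_5day, weekend_total_2day, year, month):
    """Scale reference-week totals (e.g., UPT) to a full-month estimate:
    average weekday value times the number of weekdays in the month,
    plus average weekend-day value times the number of weekend days."""
    n_days = calendar.monthrange(year, month)[1]
    weekdays = sum(1 for d in range(1, n_days + 1)
                   if calendar.weekday(year, month, d) < 5)  # Mon-Fri
    weekend_days = n_days - weekdays
    return (weekday_total_5day / 5 * weekdays
            + weekend_total_2day / 2 * weekend_days)

# January 2023 has 22 weekdays and 9 weekend days.
est = month_estimate(50000, 8000, 2023, 1)
```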
FTA has also derived the change from the prior month for the same agency/mode/type of service/data point. Users should take caution when aggregating this measure and are encouraged to use the dataset export to measure service trends at a higher level (i.e., by reporter or nationally).
For any questions regarding this dataset, please contact the NTD helpdesk at ntdhelp@dot.gov .
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FASTA file with the sequences for amphibian, fish and reptile species registered to occur in Mexico City, functioning as a custom reference database for the Meta16S metabarcoding library. Sanger sequences were obtained through DNA extractions from preserved tissues (obtained through the project's collaborators) using the QIAGEN DNeasy Blood & Tissue kit. The processing of the sequences (including de novo assembly) was done using Geneious Prime. Sequences were used during the Taxonomy Assignment step of the bioinformatic pipeline using Python packages BLAST+ and BASTA.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for our study on the coverage of software engineering articles in open citation databases:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database offers 4564 references dealing with computer-assisted language comparison in a broad sense. In addition, it offers 8364 distinct quotes collected from 5063 references. The majority of the references in the quote database overlap with those in the bibliographic database. The quotes are organized by keywords and can be browsed with a full-text and a keyword search.
The data (references and quotes) underlying each new release are provided here; the data can be browsed at https://evobib.digling.org/.
If you use the database, I would appreciate it if you could cite it in your research:
List, Johann-Mattis (2024): EvoBib: A bibliographic database and quote collection [Database, Version 1.8.0]. Passau: Chair for Multilingual Computational Linguistics. URL: https://evobib.digling.org/
The pathway representation consists of segments and intersection elements. A segment is a linear graphic element that represents a continuous physical travel path terminated by a path end (dead end) or a physical intersection with other travel paths. Segments have one street name, one address range, and one set of segment characteristics. A segment may have no or multiple alias street names. Segment types included are Freeways, Highways, Streets, Alleys (named only), Railroads, Walkways, and Bike lanes. SNDSEG_PV is a linear feature class representing the SND Segment Feature, with attributes for Street name, Address Range, Alias Street name, and segment Characteristics objects. Part of the Address Range object and all of the Street name objects are logically shared with the Discrete Address Point-Master Address File layer.
Appropriate uses include:
• Cartography: depicting the City's transportation network location and connections, typically on smaller-scale maps or images where a single-line representation is appropriate; depicting specific classifications of roadway use, also typically at smaller scales; labeling transportation network feature names, and address ranges associated with those features, typically on larger-scale maps.
• Geocode reference: a source for derived reference data for address validation and theoretical address location.
• Address Range data repository: this data store is the City's address range repository, defining address ranges in association with transportation network features.
• Polygon boundary reference: defining various area boundaries in other feature classes where coincident with the transportation network. Does not contain polygon features.
• Address-based extracts: creating flat-file extracts, typically indexed by address, with reference to business data associated with transportation network features.
• Thematic linear location reference: by providing unique, stable identifiers for each linear feature, thematic data is associated with specific transportation network features via these identifiers.
• Thematic intersection location reference: by providing unique, stable identifiers for each intersection feature, thematic data is associated with specific transportation network features via these identifiers.
• Network route tracing: a source for derived reference data used to determine point-to-point travel paths or optimal stop allocation along a travel path.
• Topological connections with segments: provides a specific definition of location for each transportation network feature, and of the connections between features (defines where the streets are and the relationships between them, i.e., 4th Ave is west of 5th Ave and 4th Ave does intersect Cherry St).
• Event location reference: a source for derived reference data used to locate events and for linear referencing.
Data source is TRANSPO.SNDSEG_PV. Updated weekly.
Various non-redundant databases with different sequence identity cut-offs created by clustering closely similar sequences to yield a representative subset of sequences. In the UniRef90 and UniRef50 databases no pair of sequences in the representative set has >90% or >50% mutual sequence identity. The UniRef100 database presents identical sequences and sub-fragments as a single entry with protein IDs, sequences, bibliography, and links to protein databases. The two major objectives of UniRef are: (i) to facilitate sequence merging in UniProt, and (ii) to allow faster and more informative sequence similarity searches. Although the UniProt Knowledgebase is much less redundant than UniParc, it still contains a certain level of redundancy because it is not possible to use fully automatic merging without risking merging of similar sequences from different proteins. However, such automatic procedures are extremely useful in compiling the UniRef databases to obtain complete coverage of sequence space while hiding redundant sequences (but not their descriptions) from view. A high level of redundancy results in several problems, including slow database searches and long lists of similar or identical alignments that can obscure novel matches in the output. Thus, a more even sampling of sequence space is advantageous. You may access NREF via the FTP server.
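The representative-set idea (no pair of representatives above the identity cutoff) can be sketched greedily. The identity proxy below (difflib's ratio) and the longest-first scheme are simplifications for illustration; real UniRef builds use dedicated clustering tools such as CD-HIT or MMseqs2.

```python
from difflib import SequenceMatcher

def identity(a: str, b: str) -> float:
    """Crude proxy for pairwise sequence identity, for illustration only."""
    return SequenceMatcher(None, a, b).ratio()

def select_representatives(seqs, max_identity=0.9):
    """Greedy selection, longest first: a sequence becomes a new
    representative only if it is at or below max_identity to every
    existing representative, so no representative pair exceeds the cutoff
    (the UniRef90-style invariant)."""
    reps = []
    for seq in sorted(seqs, key=len, reverse=True):
        if all(identity(seq, r) <= max_identity for r in reps):
            reps.append(seq)
    return reps

# Two near-identical sequences (one mismatch) and one unrelated sequence.
seqs = [
    "MKTAYIAKQRMKTAYIAKQR",
    "MKTAYIAKQRMKTAYIAKQL",
    "GGGGGGGGGG",
]
reps = select_representatives(seqs)
```

The second sequence is absorbed by the first representative; the unrelated sequence remains its own representative.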