This is a searchable historical collection of standards referenced in regulations - voluntary consensus standards, government-unique standards, industry standards, and international standards referenced in the Code of Federal Regulations (CFR).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline 10 lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers. We also outline the important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.
The Toxicity Reference Database (ToxRefDB) contains approximately 30 years and $2 billion worth of animal studies. ToxRefDB allows scientists and the interested public to search and download thousands of animal toxicity testing results for hundreds of chemicals that were previously found only in paper documents. Currently, there are 474 chemicals in ToxRefDB, primarily the data-rich pesticide active ingredients, but the number will continue to expand.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a complete dataset of linked bibliography and index data, partially disambiguated and augmented with references to external resources, extracted from Brill’s archive in the field of Classics. Processed book identifiers are listed in a separate text file. Text fragments extracted from different books via this process are parsed and compared using a string-based similarity metric to form clusters of bibliographic references to the same published work or (variants of) the same subjects discussed in these books. The entire set of references was then disambiguated using the Google Books and Crossref APIs.
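A minimal sketch of this kind of string-similarity clustering in Python. The metric here (difflib's ratio) and the 0.7 threshold are assumptions for illustration; the pipeline's actual metric and cutoffs are described in the linked paper.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1] (a stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_references(refs, threshold=0.7):
    """Greedy single-pass clustering: each reference joins the first
    cluster whose representative (its first member) is similar enough,
    otherwise it starts a new cluster."""
    clusters = []
    for ref in refs:
        for cluster in clusters:
            if similarity(ref, cluster[0]) >= threshold:
                cluster.append(ref)
                break
        else:
            clusters.append([ref])
    return clusters

# Two spelling variants of the same work, plus an unrelated one.
refs = [
    "Mommsen, Th., Römische Geschichte, Berlin 1854",
    "Mommsen, T. Roemische Geschichte. Berlin, 1854",
    "Wilamowitz-Moellendorff, U. von, Homerische Untersuchungen, 1884",
]
clusters = cluster_references(refs)
```

The two Mommsen variants fall into one cluster; the unrelated reference forms its own.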
Paper about extraction pipeline
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for the TotalControl reference wind farm database: simulation of a pressure-driven, high-Reynolds-number boundary layer flow with a 90 degree inflow wind direction angle (Casename PDk 90). Python files for loading and visualizing the data are included; use the plot_*.py files. Further information, including a description of the case and dataset, can be found in the deliverable report "Database for reference wind farms part 2: windfarm simulations" at: https://cordis.europa.eu/project/id/727680/results
The Human Protein Reference Database (HPRD) represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MDMcleaner reference database used at the time of publication.
Based on:
GTDB release r95
RefSeq release 203
SILVA version 138.1
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
built as described here: https://github.com/nasa/GeneLab_Data_Processing/blob/master/Metagenomics/Estimate_host_reads_in_raw_data/Workflow_Documentation/SW_MGEstHostReads/reference-database-info.md
Test fastq files hold 4 read pairs: 1 phage, 1 E. coli, 1 human, 1 mouse.
Background: The EST database provides a rich resource for gene discovery and in silico expression analysis. We report a novel computational approach to identify co-expressed genes using the EST database, and its application to IL-8. Results: IL-8 is represented in 53 dbEST cDNA libraries. We calculated the frequency of occurrence of all the genes represented in these cDNA libraries and ranked the candidates based on a Z-score. Additional analysis suggests that most IL-8-related genes are differentially expressed between non-tumor and tumor tissues. To focus on IL-8's function in tumor tissues, we further analyzed and ranked the genes in 16 IL-8-related tumor libraries. Conclusions: This method generated a reference database for genes co-expressed with IL-8 and could facilitate further characterization of functional associations among genes.
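The ranking step above can be illustrated with a standard z-score over library-occurrence counts. The gene names and counts below are invented for the example, and the exact statistic used in the study is defined in the article itself.

```python
from statistics import mean, stdev

def zscore_rank(counts):
    """Rank genes by the z-score of their library-occurrence counts.
    `counts` maps gene -> number of cDNA libraries the gene appears in."""
    mu = mean(counts.values())
    sigma = stdev(counts.values())
    z = {gene: (c - mu) / sigma for gene, c in counts.items()}
    # Highest z-score first: the strongest co-occurrence candidates.
    return sorted(z.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical occurrence counts across the 53 IL-8-positive libraries.
counts = {"IL8": 53, "GENE_A": 40, "GENE_B": 12, "GENE_C": 5, "GENE_D": 3}
ranking = zscore_rank(counts)
```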
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Microsoft Access Database for bibliometric analysis found in the article: Elaine M. Lasda Bergman, Finding Citations to Social Work Literature: The Relative Benefits of Using Web of Science, Scopus, or Google Scholar, The Journal of Academic Librarianship, Volume 38, Issue 6, November 2012, Pages 370-379, ISSN 0099-1333, http://dx.doi.org/10.1016/j.acalib.2012.08.002. (http://www.sciencedirect.com/science/article/pii/S009913331200119X) Abstract: Past studies of citation coverage of Web of Science, Scopus, and Google Scholar do not demonstrate a consistent pattern that can be applied to the interdisciplinary mix of resources used in social work research. To determine the utility of these tools to social work researchers, an analysis of citing references to well-known social work journals was conducted. Web of Science had the fewest citing references and almost no variety in source format. Scopus provided higher citation counts, but the pattern of coverage was similar to Web of Science. Google Scholar provided substantially more citing references, but only a relatively small percentage of them were unique scholarly journal articles. The patterns of database coverage were replicated when the citations were broken out for each journal separately. The results of this analysis demonstrate the need to determine what resources constitute scholarly research and reflect the need for future researchers to consider the merits of each database before undertaking their research. This study will be of interest to scholars in library and information science as well as social work, as it facilitates a greater understanding of the strengths and limitations of each database and brings to light important considerations for conducting future research. Keywords: Citation analysis; Social work; Scopus; Web of Science; Google Scholar
This data submission links to a user reference guide for the SOLTHERM thermodynamic database maintained by the University of Oregon. The data at this link are not 'data results' from sampling. These data are derived from SOLTHERM as a reference for the user, showing balanced reactions and equilibrium constants log K(T,P) along the liquid-vapor saturation curve only, up to 350 degrees C, for aqueous species and minerals (including REE) and gases. These data are more easily read by the user than those in the SOLTHERM thermodynamic database.
Revision Date of Geo-Reference Database - iG1000
The TIGER/Line Shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) System. The MAF/TIGER System represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line Shapefile is designed to stand alone as an independent data set, or the shapefiles can be combined to cover the entire nation. The TIGERweb REST Services allow users to integrate the Census Bureau's TIGER data into their own GIS or custom web-based applications. For a more detailed description of the areas or terms listed, refer to the TIGER/Line documentation or the Geographic Areas Reference Manual (GARM).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction:
This sequence database (MARMICRODB) was introduced in the publication JW Becker, SL Hogle, K Rosendo, and SW Chisholm. 2019. Co-culture and biogeography of Prochlorococcus and SAR11. ISME J. doi:10.1038/s41396-019-0365-4. Please see the original publication and its associated supplementary material for the original description of this resource.
Motivation:
We needed a reference database to annotate shotgun metagenomes from the Tara Oceans project [1], the GEOTRACES cruises GA02, GA03, GA10, and GP13, and the HOT and BATS time series [2]. Our interests are primarily in quantifying and annotating the free-living, oligotrophic bacterial groups Prochlorococcus, Pelagibacterales/SAR11, SAR116, and SAR86 from these samples using the protein classifier tool Kaiju [3]. Kaiju’s sensitivity and classification accuracy depend on the composition of the reference database, and the highest sensitivity is achieved when the reference database contains a comprehensive representation of expected taxa from the environment/sample of interest. However, the speed of the algorithm decreases as database size increases. Therefore, we aimed to create a reference database that maximized the representation of sequences from marine bacteria, archaea, and microbial eukaryotes, while minimizing (but not excluding) the sequences from clinical, industrial, and terrestrial host-associated samples.
Results/Description:
MARMICRODB consists of 56 million non-redundant protein sequences from 18769 bacterial/archaeal/eukaryote genome and transcriptome bins and 7492 viral genomes, optimized for use with the protein homology classifier Kaiju [3]. To ensure maximum representation of marine bacteria, archaea, and microbial eukaryotes, we included translated genes/transcripts from 5397 representative “specI” species clusters from the proGenomes database [4]; 113 transcriptomes from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) [5]; 10509 metagenome assembled genomes from the Tara Oceans expedition [6,7], the Red Sea [8], the Baltic Sea [9], and other aquatic and terrestrial sources [10]; 994 isolate genomes from the Genomic Encyclopedia of Bacteria and Archaea [11]; 7492 viral genomes from NCBI RefSeq [12]; 786 bacterial and archaeal genomes from MarRef [13]; and 677 marine single cell genomes [14]. In order to annotate metagenomic reads at the clade/ecotype level (subspecies) for the focal taxa Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116, we generated custom MARMICRODB taxonomies based on curated genome phylogenies for each group. The curated phylogenies, Kaiju-formatted Burrows-Wheeler index, translated genes, custom taxonomy hierarchy, an interactive krona plot of the taxonomic composition, and scripts and instructions for how to use or rebuild the resource are available from 10.5281/zenodo.3520509.
Methods:
The curation and quality control of MARMICRODB single cell, metagenome assembled, and isolate genomes was performed as described in [15]. Briefly, we downloaded all MARMICRODB genomes as raw nucleotide assemblies from NCBI. We determined an initial genome taxonomy for these assemblies using checkM with the default lineage workflow [16]. All genome bins met the completion/contamination thresholds outlined in prior studies [7,17]. For single cell and metagenome assembled genomes, especially those from Tara Oceans Mediterranean Sea samples [18], we used the GTDB-Tk classification workflow [19] to verify the taxonomic fidelity of each genome bin. We then selected genomes with a checkM taxonomic assignment of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, or SAR116 for further analysis and confirmed taxonomic assignments using BLAST matches to known Prochlorococcus/Synechococcus ITS sequences and by matching 16S sequences to the SILVA database [20]. To refine our estimates of completeness/contamination of Prochlorococcus genome bins, we created a custom set of 730 single copy protein families (available from 10.5281/zenodo.3719132) from closed, isolate Prochlorococcus genomes [21] for quality assessments with checkM. For Synechococcus we used the checkM taxonomic-specific workflow with the genus Synechococcus. After the custom checkM quality control, we excluded from downstream analysis any genome bins with an estimated quality < 30, defined as %completeness - 5x %contamination, resulting in 18769 genome/transcriptome bins. We predicted genes in the resulting genome bins using prodigal [22]; excluded protein sequences with lengths less than 20 or greater than 20000 amino acids; removed non-standard amino acid residues; and condensed redundant protein sequences to a single representative sequence, to which we assigned a lowest common ancestor (LCA) taxonomy identifier from the NCBI taxonomy database [23].
The resulting protein sequences were compiled and used to build a Kaiju [3] search database.
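The bin-quality cutoff and protein filters described above reduce to a few lines of Python. This is an illustrative re-implementation under the stated definitions (quality = %completeness - 5x %contamination, length bounds 20-20000 aa), not the pipeline's actual code.

```python
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def bin_quality(completeness: float, contamination: float) -> float:
    """Estimated quality as defined above: %completeness - 5 * %contamination."""
    return completeness - 5.0 * contamination

def keep_bin(completeness: float, contamination: float,
             threshold: float = 30.0) -> bool:
    """Genome bins with estimated quality < 30 were excluded."""
    return bin_quality(completeness, contamination) >= threshold

def keep_length(seq: str, min_len: int = 20, max_len: int = 20000) -> bool:
    """Predicted proteins outside 20-20000 aa were excluded."""
    return min_len <= len(seq) <= max_len

def clean_protein(seq: str) -> str:
    """Non-standard residues were removed from the remaining sequences."""
    return "".join(c for c in seq if c in STANDARD)
```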
The above filtering criteria resulted in 605 Prochlorococcus, 96 Synechococcus, 186 SAR11/Pelagibacterales, 60 SAR86, and 59 SAR116 high-quality genome bins. We constructed a high-quality fixed reference phylogenetic tree for each taxonomic group based on genomes manually selected for completeness and phylogenetic diversity. For example, the Prochlorococcus and Synechococcus genomes in the fixed reference phylogeny are estimated > 90% complete, and the SAR11 genomes are estimated > 70% complete. We created multiple sequence alignments of phylogenetically conserved genes from these genomes using the GTDB-Tk pipeline [19] with default settings. The pipeline identifies conserved proteins (120 bacterial proteins) and generates concatenated multi-protein alignments [17] from the genome assemblies using hmmalign from the hmmer software suite. We further filtered the resulting alignment columns using the bacterial and archaeal alignment masks from [17] (http://gtdb.ecogenomic.org/downloads). We removed columns represented by fewer than 50% of all taxa and/or columns with no single amino acid residue occurring at a frequency greater than 25%. We trimmed the alignments using trimal [24] with the automated -gappyout option to trim columns based on their gap distribution. We inferred reference phylogenies using multithreaded RAxML [25] with the GAMMA model of rate heterogeneity, empirically determined base frequencies, and the LG substitution model [26] (PROTGAMMALGF). Branch support is based on 250 resampled bootstrap trees. This tree was then pruned to allow a maximum average distance to the closest leaf (ADCL) of 0.003 to reduce phylogenetic redundancy in the tree [27]. We then “placed” genomes that either did not pass the completeness threshold or were considered phylogenetically redundant by ADCL into the fixed reference phylogeny for each group using pplacer [28], representing each placed genome as a pendant edge in the final tree.
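The column-filtering rules above (drop columns under 50% occupancy, or with no residue above 25% frequency) can be sketched as follows. One assumption here: residue frequency is computed over all taxa, whereas the original masks may compute it over non-gap positions only.

```python
def filter_columns(alignment, min_occupancy=0.5, min_consensus=0.25):
    """Drop alignment columns that are less than 50% occupied (non-gap),
    or in which no single residue occurs at a frequency above 25%.
    `alignment` is a list of equal-length aligned sequences."""
    n = len(alignment)
    keep = []
    for col in zip(*alignment):
        residues = [c for c in col if c not in "-."]
        if len(residues) / n < min_occupancy:
            continue  # too gappy
        top = max(residues.count(r) for r in set(residues))
        if top / n <= min_consensus:
            continue  # no sufficiently dominant residue
        keep.append(col)
    if not keep:
        return ["" for _ in alignment]
    # Transpose the surviving columns back into row sequences.
    return ["".join(row) for row in zip(*keep)]
```

For example, a column present in only one of four taxa is removed, while well-occupied columns survive.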
We then examined the resulting tree and manually selected clade/ecotype cutoffs to be as consistent as possible with clade definitions previously outlined for these groups [29–32]. We then gave clades from each taxonomic group custom taxonomic identifiers and we added these identifiers to the MARMICRODB Kaiju taxonomic hierarchy.
Software/databases used:
checkM v1.0.11[16]
HMMERv3.1b2 (http://hmmer.org/)
prodigal v2.6.3 [22]
trimAl v1.4.rev22 [24]
AliView v1.18.1 [33] [34]
Phyx v0.1 [35]
RAxML v8.2.12 [36]
Pplacer v1.1alpha [28]
GTDB-Tk v0.1.3 [19]
Kaiju v1.6.0 [3]
GTDB RS83 (https://data.ace.uq.edu.au/public/gtdb/data/releases/release83/83.0/)
NCBI Taxonomy (accessed 2018-07-02) [23]
TIGRFAM v14.0 [37]
PFAM v31.0 [38]
Discussion/Caveats:
MARMICRODB is optimized for metagenomic samples from the marine environment, in particular planktonic microbes from the pelagic euphotic zone. We expect this database may also be useful for classifying other types of marine metagenomic samples (for example mesopelagic, bathypelagic, or even benthic or marine host-associated), but it has not been tested as such. The original purpose of this database was to quantify clades/ecotypes of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 in metagenomes from the Tara Oceans expedition and the GEOTRACES project. We carefully annotated and quality controlled genomes from these five groups, but the processing of the other marine taxa was largely automated and unsupervised. Taxonomy for other groups was copied over from the Genome Taxonomy Database (GTDB) [19,39] and NCBI Taxonomy [23], so any inconsistencies in those databases will be propagated to MARMICRODB. For most use cases MARMICRODB can probably be used unmodified, but if the user’s goal is to focus on a particular organism/clade that we did not curate in the database, then the user may wish to spend some time curating those genomes (i.e., checking for contamination, dereplicating, and building a genome phylogeny for custom taxonomy node assignment). Currently the custom taxonomy is hardcoded in the MARMICRODB.fmi index, but if users wish to modify MARMICRODB by adding or removing genomes, or by reconfiguring taxonomic ranks, the names.dmp and nodes.dmp files can easily be modified, as can the fasta file of protein sequences. However, the Kaiju index will then need to be rebuilt, and the user will require a high-memory machine to do so.
Beginning in 2023, certain agencies are required to submit one week of service data on a monthly basis to comply with FTA’s Weekly Reference reporting requirement on form WE-20. This data release will therefore present the limited set of key indicators reported by transit agencies on this form and will be updated each month with the most current data.
The resulting dataset provides data users with data shortly after the transit service was provided and consumed, more than one month ahead of FTA’s routine update to the Monthly Ridership Time Series dataset. One use of this data is as a reference for understanding ridership patterns (e.g., to develop a full-month estimate ahead of FTA’s release, at the end of the following month, of the data reflecting the given month of service).
Generally, FTA has defined the reference week to be the second or third full week of the month. All sampled agencies will report data referencing the same reference week.
The form collects the following service data points, as described in the metadata below:
• Weekday 5-day UPT total for the reference week;
• Weekday 5-day VRM total for the reference week;
• Weekend 2-day UPT total for either the weekend preceding or following the reference week;
• Weekend 2-day VRM total for either the weekend preceding or following the reference week; and
• Vehicles Operated in Maximum Service (vanpool mode only) for the reference week.
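One way to turn these reference-week totals into a full-month estimate is to scale the average weekday and weekend-day values by the month's day counts. This is an illustrative sketch only, not FTA's methodology:

```python
import calendar

def month_estimate(weekday_total_5day, weekend_total_2day, year, month):
    """Scale reference-week totals (e.g., UPT) to a full-month estimate:
    average weekday value times the number of weekdays in the month,
    plus average weekend-day value times the number of weekend days."""
    n_days = calendar.monthrange(year, month)[1]
    weekdays = sum(1 for d in range(1, n_days + 1)
                   if calendar.weekday(year, month, d) < 5)  # Mon-Fri
    weekend_days = n_days - weekdays
    return (weekday_total_5day / 5 * weekdays
            + weekend_total_2day / 2 * weekend_days)

# January 2023 has 22 weekdays and 9 weekend days.
est = month_estimate(50000, 8000, 2023, 1)
```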
FTA has also derived the change from the prior month for the same agency/mode/type of service/data point. Users should take caution when aggregating this measure and are encouraged to use the dataset export to measure service trends at a higher level (i.e., by reporter or nationally).
For any questions regarding this dataset, please contact the NTD helpdesk at ntdhelp@dot.gov .
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FASTA file with the sequences for amphibian, fish and reptile species registered to occur in Mexico City, functioning as a custom reference database for the Meta16S metabarcoding library. Sanger sequences were obtained through DNA extractions from preserved tissues (obtained through the project's collaborators) using the QIAGEN DNeasy Blood & Tissue kit. The processing of the sequences (including de novo assembly) was done using Geneious Prime. Sequences were used during the Taxonomy Assignment step of the bioinformatic pipeline using Python packages BLAST+ and BASTA.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for our study on the coverage of software engineering articles in open citation databases:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database offers 4564 references dealing with computer-assisted language comparison in a broad sense. In addition, it offers 8364 distinct quotes collected from 5063 references. The majority of the references in the quote database overlap with those in the bibliographic database. The quotes are organized by keywords and can be browsed with a full-text and a keyword search.
The data (references and quotes) underlying each new release are provided here; the data can be browsed at https://evobib.digling.org/.
If you use the database, I would appreciate it if you could cite it in your research:
List, Johann-Mattis (2024): EvoBib: A bibliographic database and quote collection [Database, Version 1.8.0]. Passau: Chair for Multilingual Computational Linguistics. URL: https://evobib.digling.org/
The pathway representation consists of segments and intersection elements. A segment is a linear graphic element that represents a continuous physical travel path terminated by a path end (dead end) or a physical intersection with other travel paths. Segments have one street name, one address range, and one set of segment characteristics. A segment may have no or multiple alias street names. Segment types included are Freeways, Highways, Streets, Alleys (named only), Railroads, Walkways, and Bike lanes. SNDSEG_PV is a linear feature class representing the SND Segment Feature, with attributes for Street name, Address Range, Alias Street name, and segment Characteristics objects. Part of the Address Range object and all of the Street name objects are logically shared with the Discrete Address Point-Master Address File layer.
Appropriate uses include:
• Cartography: depicting the City's transportation network location and connections, typically on smaller-scale maps or images where a single-line representation is appropriate; depicting specific classifications of roadway use, also typically at smaller scales; labeling transportation network feature names, and address ranges associated with those features, typically on larger-scale maps.
• Geocode reference: a source for derived reference data for address validation and theoretical address location.
• Address Range data repository: this data store is the City's address range repository, defining address ranges in association with transportation network features.
• Polygon boundary reference: defining various area boundaries in other feature classes where coincident with the transportation network. Does not contain polygon features.
• Address-based extracts: creating flat-file extracts, typically indexed by address, with reference to business data associated with transportation network features.
• Thematic linear location reference: by providing unique, stable identifiers for each linear feature, thematic data is associated with specific transportation network features via these identifiers.
• Thematic intersection location reference: by providing unique, stable identifiers for each intersection feature, thematic data is associated with specific transportation network features via these identifiers.
• Network route tracing: a source for derived reference data used to determine point-to-point travel paths or optimal stop allocation along a travel path.
• Topological connections with segments: provides a specific definition of location for each transportation network feature, and of the connections between features (defines where the streets are and the relationships between them, i.e., 4th Ave is west of 5th Ave and 4th Ave does intersect Cherry St).
• Event location reference: a source for derived reference data used to locate events and for linear referencing.
Data source is TRANSPO.SNDSEG_PV. Updated weekly.
Various non-redundant databases with different sequence identity cut-offs created by clustering closely similar sequences to yield a representative subset of sequences. In the UniRef90 and UniRef50 databases no pair of sequences in the representative set has >90% or >50% mutual sequence identity. The UniRef100 database presents identical sequences and sub-fragments as a single entry with protein IDs, sequences, bibliography, and links to protein databases. The two major objectives of UniRef are: (i) to facilitate sequence merging in UniProt, and (ii) to allow faster and more informative sequence similarity searches. Although the UniProt Knowledgebase is much less redundant than UniParc, it still contains a certain level of redundancy because it is not possible to use fully automatic merging without risking merging of similar sequences from different proteins. However, such automatic procedures are extremely useful in compiling the UniRef databases to obtain complete coverage of sequence space while hiding redundant sequences (but not their descriptions) from view. A high level of redundancy results in several problems, including slow database searches and long lists of similar or identical alignments that can obscure novel matches in the output. Thus, a more even sampling of sequence space is advantageous. You may access NREF via the FTP server.
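The representative-set idea (no pair of representatives above the identity cutoff) can be sketched greedily. The identity proxy below (difflib's ratio) and the longest-first scheme are simplifications for illustration; real UniRef builds use dedicated clustering tools such as CD-HIT or MMseqs2.

```python
from difflib import SequenceMatcher

def identity(a: str, b: str) -> float:
    """Crude proxy for pairwise sequence identity, for illustration only."""
    return SequenceMatcher(None, a, b).ratio()

def select_representatives(seqs, max_identity=0.9):
    """Greedy selection, longest first: a sequence becomes a new
    representative only if it is at or below max_identity to every
    existing representative, so no representative pair exceeds the cutoff
    (the UniRef90-style invariant)."""
    reps = []
    for seq in sorted(seqs, key=len, reverse=True):
        if all(identity(seq, r) <= max_identity for r in reps):
            reps.append(seq)
    return reps

# Two near-identical sequences (one mismatch) and one unrelated sequence.
seqs = [
    "MKTAYIAKQRMKTAYIAKQR",
    "MKTAYIAKQRMKTAYIAKQL",
    "GGGGGGGGGG",
]
reps = select_representatives(seqs)
```

The second sequence is absorbed by the first representative; the unrelated sequence remains its own representative.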