Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
High-throughput sequencing has created an exponential increase in the amount of gene expression data, much of which is freely, publicly available in repositories such as NCBI's Gene Expression Omnibus (GEO). Querying this data for patterns such as similarity and distance, however, becomes increasingly challenging as the total amount of data increases. Furthermore, vectorization of the data is commonly required in Artificial Intelligence and Machine Learning (AI/ML) approaches. We present BioVDB, a vector database for storage and analysis of gene expression data, which enhances the potential for integrating biological studies with AI/ML tools. We used a previously developed approach called Automatic Label Extraction (ALE) to extract sample labels from metadata, including age, sex, and tissue/cell-line. BioVDB stores 438,562 samples from eight microarray GEO platforms. We show that it allows for efficient querying of data using similarity search, which can also be useful for identifying and inferring missing labels of samples, and for rapid similarity analysis.
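The similarity search and label inference described above can be illustrated with a minimal sketch. It is not BioVDB's implementation: the cosine metric, the majority vote, the function names and the toy samples are all assumptions made for illustration, and a real vector database would use an index rather than a linear scan.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length expression vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_similar(query, samples, k=3):
    """Return the k (sample_id, similarity) pairs most similar to the query."""
    scored = [(sid, cosine_similarity(query, vec)) for sid, (vec, _label) in samples.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def infer_label(query, samples, k=3):
    """Guess a missing label by majority vote among the k nearest labelled samples."""
    labelled = {sid: (vec, lab) for sid, (vec, lab) in samples.items() if lab is not None}
    neighbours = top_k_similar(query, labelled, k)
    votes = Counter(labelled[sid][1] for sid, _sim in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: sample id -> (expression vector, tissue label or None)
samples = {
    "S1": ([5.0, 1.0, 0.2], "liver"),
    "S2": ([4.8, 1.2, 0.1], "liver"),
    "S3": ([0.3, 0.9, 6.0], "brain"),
    "S4": ([0.2, 1.1, 5.5], "brain"),
    "S5": ([4.9, 0.8, 0.3], None),  # sample with a missing label
}
query = [5.1, 1.1, 0.2]
nearest = top_k_similar(query, samples, k=2)
predicted = infer_label(query, samples, k=3)
```

Here `nearest` holds the two stored samples closest to the query, and `predicted` is the tissue label most common among its three nearest labelled neighbours.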
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic biology seeks to create new biological parts, devices, and systems, and to reconfigure existing natural biological systems for custom-designed purposes. The standardized BioBrick parts are the foundation of synthetic biology. The incomplete and flawed metadata of BioBrick parts, however, are a major obstacle to designing genetic circuits easily, quickly, and accurately. Here, a database termed BioMaster (http://www.biomaster-uestc.cn) was developed to extensively complement information about BioBrick parts. It includes 47,934 BioBrick parts from the International Genetically Engineered Machine (iGEM) Registry, with more comprehensive information integrated from 10 databases, providing corresponding information about functions, activities, interactions, and related literature. Moreover, BioMaster is also a user-friendly platform for retrieval and analysis of relevant information on BioBrick parts.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov models libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College, London, UK.
Of interest to pharmaceutical, nutritional, and biomedical researchers, as well as individuals and companies involved with alternative therapies and herbal products, this database is one of the world's leading repositories of ethnobotanical data. It evolved out of the extensive compilations by the former Chief of USDA's Economic Botany Laboratory in the Agricultural Research Service in Beltsville, Maryland, in particular his popular Handbook of phytochemical constituents of GRAS herbs and other economic plants (CRC Press, Boca Raton, FL, 1992). In addition to Duke's own publications, the database documents phytochemical information and quantitative data collected over many years through research results presented at meetings and symposia, and findings from the published scientific literature. The current Phytochemical and Ethnobotanical databases facilitate plant, chemical, bioactivity, and ethnobotany searches. A large number of plants and their chemical profiles are covered, and data are structured to support browsing and searching in several user-focused ways. For example, users can:
get a list of chemicals and activities for a specific plant of interest, using either its scientific or common name
download a list of chemicals and their known activities in PDF or spreadsheet form
find plants with chemicals known for a specific biological activity
display a list of chemicals with their LD toxicity data
find plants with potential cancer-preventing activity
display a list of plants for a given ethnobotanical use
find out which plants have the highest levels of a specific chemical
References to the supporting scientific publications are provided for each specific result.
Resources in this dataset:
Resource Title: Duke-Source-CSV.zip. File Name: Duke-Source-CSV.zip
Resource Description: Dr. Duke's Phytochemistry and Ethnobotany - raw database tables for archival purposes. Visit https://phytochem.nal.usda.gov/phytochem/search for the interactive web version of the database.
Resource Title: Data Dictionary (preliminary). File Name: DrDukesDatabaseDataDictionary-prelim.csv
Resource Description: This Data Dictionary describes the columns for each table. [Note that this is in progress and some variables are yet to be defined or are unused in the current implementation. Please send comments/suggestions to nal-adc-curator@ars.usda.gov ]
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PIRSF protein classification system is a network with multiple levels of sequence diversity from superfamilies to subfamilies that reflects the evolutionary relationship of full-length proteins and domains. PIRSF is based at the Protein Information Resource, Georgetown University Medical Centre, Washington DC, US.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cancer is a complex disease with a high rate of mortality. The characteristics of tumor masses are very heterogeneous; thus, the appropriate classification of tumors is a critical point in the effective treatment. A high level of heterogeneity has also been observed in breast cancer. Therefore, detecting the molecular subtypes of this disease is an essential issue for medicine that could be facilitated using bioinformatics. This study aims to discover the molecular subtypes of breast cancer using somatic mutation profiles of tumors. Nonetheless, the somatic mutation profiles are very sparse. Therefore, a network propagation method is used in the gene interaction network to make the mutation profiles dense. Afterward, the deep embedded clustering (DEC) method is used to classify the breast tumors into four subtypes. In the next step, gene signature of each subtype is obtained using Fisher's exact test. Besides the enrichment of gene signatures in numerous biological databases, clinical and molecular analyses verify that the proposed method using mutation profiles can efficiently detect the molecular subtypes of breast cancer. Finally, a supervised classifier is trained based on the discovered subtypes to predict the molecular subtype of a new patient. The code and material of the method are available at: https://github.com/nrohani/MolecularSubtypes.
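To make the densification step concrete, the following is a minimal sketch of one common form of network propagation, a random walk with restart over a row-normalized gene network. The abstract does not specify which propagation scheme the authors use, so the update rule, the `alpha` value and the four-gene toy network are assumptions for illustration only.

```python
def normalize_adjacency(adj):
    """Row-normalize so each gene spreads its signal evenly to its neighbours."""
    normalized = []
    for row in adj:
        degree = sum(row)
        normalized.append([v / degree if degree else 0.0 for v in row])
    return normalized

def propagate(profile, adj, alpha=0.7, tol=1e-8, max_iter=1000):
    """Iterate F <- alpha * F.W + (1 - alpha) * F0 until the update is below tol."""
    w = normalize_adjacency(adj)
    n = len(profile)
    current = list(profile)
    for _ in range(max_iter):
        # Signal received by gene j from every gene i that links to it.
        spread = [sum(current[i] * w[i][j] for i in range(n)) for j in range(n)]
        updated = [alpha * spread[j] + (1 - alpha) * profile[j] for j in range(n)]
        change = max(abs(u - c) for u, c in zip(updated, current))
        current = updated
        if change < tol:
            break
    return current

# Hypothetical network: four genes connected in a chain 0-1-2-3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
sparse_profile = [1.0, 0.0, 0.0, 0.0]  # only gene 0 is mutated
smoothed = propagate(sparse_profile, adjacency)
```

After propagation every gene in the connected network carries a non-zero score that decreases with distance from the mutated gene, which is what turns a sparse binary profile into a dense one suitable for clustering.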
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY is based at the University of Bristol, UK.
This is a database of marine invertebrate dispersal parameters and species ranges along the East Coast of North America, with latitude and longitude calculated and added programmatically.
The raw data for range was gathered from occurrence data in the GBIF dataset.
Life history was gathered from a Literature Review.
The complete dataset methodology is detailed in Pappalardo P, Pringle J, Wares J, and J Byers (2015): The location, strength, and mechanisms behind marine biogeographic boundaries of the east coast of North America. Ecography 38: 001–010, 2015
There are two other datasets associated with this coordinate system:
http://www.bco-dmo.org/dataset/554871: Database of marine invertebrate dispersal parameters and species ranges (NE Coast N. America)
and
http://www.bco-dmo.org/dataset/554893: A series of coordinates and ranges from South and North America to which species occurrences are mapped according to a model.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a database of marine invertebrate dispersal parameters and species ranges along the East Coast of North America.
The raw data for range was gathered from occurrence data in the GBIF dataset.
Life history was gathered from a Literature Review.
The complete dataset methodology is detailed in Pappalardo P, Pringle J, Wares J, and J Byers (2015): The location, strength, and mechanisms behind marine biogeographic boundaries of the east coast of North America. Ecography 38: 001–010, 2015
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Review_MNNP_Fish
Data, Rcode and Supplementary Materials for: Effects of virgin micro- and nano-plastics on fish: Trends, meta-analysis and perspectives
Supplementary_Information_1.pdf:
This document contains a list of names that are used in the database (see Supplementary_Information_2) to describe the biological endpoints investigated within the 46 studies that we have reviewed
Supplementary_Information_2.xlsx:
This document contains 6 sheets:
Plabib_Fig2.xlsx:
This document contains the data necessary to build Figure 2 of this review paper. Data has been extracted from Supplementary_Information_2 database.
Plabib_Fig3_pie.txt:
This document contains the data necessary to build Figure 3 pie chart. Data has been extracted from Supplementary_Information_2 database.
PlaBib_Rcode_forFigures.R:
This document contains the Rcode necessary to build all figures from this manuscript
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Tables S1–S5, Report for each genome used in this article the most similar genome based on which the provisional genome code was assigned, the ANIb% value, the % of aligned fragments, and the assigned genome code. (PDF)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Jellyfish Database Initiative (JeDI) is a scientifically-coordinated global database dedicated to gelatinous zooplankton (members of the Cnidaria, Ctenophora and Thaliacea) and associated environmental data. The database holds 476,000 quantitative, categorical, presence-absence and presence only records of gelatinous zooplankton spanning the past four centuries (1790-2011) assembled from a variety of published and unpublished sources. Gelatinous zooplankton data are reported to species level, where identified, but taxonomic information on phylum, family and order are reported for all records. Other auxiliary metadata, such as physical, environmental and biometric information relating to the gelatinous zooplankton metadata, are included with each respective entry. JeDI has been developed and designed as an open access research tool for the scientific community to quantitatively define the global baseline of gelatinous zooplankton populations and to describe long-term and large-scale trends in gelatinous zooplankton populations and blooms. It has also been constructed as a future repository of datasets, thus allowing retrospective analyses of the baseline and trends in global gelatinous zooplankton populations to be conducted in the future.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Albatross and petrel populations have declined globally due to interactions with fishing operations. The survival of four albatross and two giant petrel species breeding on Macquarie Island is threatened, and ongoing monitoring is essential to assess their conservation status and mitigate negative influences. Long-term studies are required to obtain reliable information on population size, productivity, and age- and sex-related survival parameters. The birds' oceanic movements are also being investigated so that questions regarding temporal and spatial overlap with fisheries can be addressed.
Demographic and population data collected for the 2012-13 breeding season on Macquarie Island for 4 species of albatross and 2 species of giant petrel are summarised in the annual report (pdf), and all data are contained in the tables therein or in the attached xlsx spreadsheets and Access database. Data collected include breeding census, breeding success, nest location, banding and resight data for the 2012-13 season. The Access database contains data from 1950-2012.
2013-2014 information is held in the 2013-2014 folder, which includes several Excel spreadsheets, an updated Access database, and a copy of the final report.
2014-2015 information is held in the 2014-2015 folder, which includes several Excel spreadsheets, a copy of the report, and updated database tables.
2015-2016 information is held in the 2015-2016 folder, which includes several Excel spreadsheets, a copy of the report, and updated database tables.
2016-2017 information is held in the 2016-2017 folder, which includes several Excel spreadsheets.
2017-2018 information is held in the 2017-2018 folder, which includes several Excel spreadsheets and a pdf document showing the location of nesting sites (waypoints provided in the Excel files).
2018-2019 information is held in the 2018-2019 folder, which includes several Excel spreadsheets and a pdf document showing the location of nesting sites (waypoints provided in the Excel files).
This project has replaced project 2569 (which in turn replaced project 751).
This database was designed to provide a coherent, single source of NMR data of lignin and other plant cell wall model compounds. The database exists as an Adobe pdf cross-platform file for viewing and printing. This is the latest public version of the Database, version 2024/08 updated from the 2009 version.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family or domain. PRINTS is based at the University of Manchester, UK.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The White House Subcommittee on Biodiversity and Ecosystem Dynamics has identified systematics as a research priority that is fundamental to ecosystem management and biodiversity conservation. This primary need identified by the Subcommittee requires improvements in the organization of, and access to, standardized nomenclature. ITIS (originally referred to as the Interagency Taxonomic Information System) was designed to fulfill these requirements. In the future, the ITIS will provide taxonomic data and a directory of taxonomic expertise that will support the system. The ITIS is the result of a partnership of federal agencies formed to satisfy their mutual needs for scientifically credible taxonomic information. Since its inception, ITIS has gained valuable new partners and undergone a name change; ITIS now stands for the Integrated Taxonomic Information System. The goal is to create an easily accessible database with reliable information on species names and their hierarchical classification. The database will be reviewed periodically to ensure high quality with valid classifications, revisions, and additions of newly described species. The ITIS includes documented taxonomic information of flora and fauna from both aquatic and terrestrial habitats.
The original ITIS partners include:
Department of Commerce, National Oceanic and Atmospheric Administration (NOAA)
Department of Interior (DOI), Geological Survey (USGS)
Environmental Protection Agency (EPA)
Department of Agriculture (USDA), Agriculture Research Service (ARS) and Natural Resources Conservation Service (NRCS)
Smithsonian Institution, National Museum of Natural History (NMNH)
These agencies signed a Memorandum of Understanding and have formed a Steering Committee that directs two technical work groups - the Database Work Group (DWG) and the Taxonomy Work Group (TWG). The DWG is responsible for the database design and overseeing development of the system to meet the requirements of the ITIS partners.
The TWG is responsible for the quality and integrity of the database information. In addition to the database, the working groups have created "Taxonomic Workbench" software designed for easy entry and manipulation of taxonomic data. Primary objectives of the TWG include the review of data prior to incorporation into the ITIS and the establishment of a process for periodic peer review to ensure data quality. The TWG has evaluated the taxonomic information priorities of the agencies and is locating data sources for the highest priority groups. Efforts to gather data are helping to identify gaps in taxonomic coverage in both scientific expertise and available information. The TWG hopes to promote collaboration among, and provide a point of focus for, taxonomists, scientific institutions, and taxonomic information users. For each scientific name, ITIS will include the authority (author and date), taxonomic rank, associated synonyms and vernacular names where available, a unique taxonomic serial number, data source information (publications, experts, etc.) and data quality indicators. Expert reviews and changes to taxonomic information in the database will be tracked. Geographic coverage will be worldwide with initial emphasis on North American taxa. The TWG is coordinating its efforts with several national and international biodiversity programs. ITIS will be a significant contribution to the scientific infrastructure that is fundamental to the description, conservation, and management of the nation's biodiversity. Use of the ITIS and the taxonomic serial numbers will facilitate sharing of biological information among researchers and cooperating agencies by providing a common framework for taxonomic data. Agencies that typically cannot afford to maintain taxonomic data will have access to high quality taxonomic information through ITIS. This project allows the coordination of efforts among federal agencies, thereby increasing productivity and saving resources. 
Status reports on ITIS system development may be found in the What's New section. You can also contact Gerald Guala, Ph.D., Director, Integrated Taxonomic Information System (ITIS) at U.S. Geological Survey, 12201 Sunrise Valley Drive, MS 302, Reston, VA 20192 or via email at itiswebmaster@itis.gov .
An outline of the blue whale voyages of 2012 can be found here: http://www.marinemammals.gov.au/sorp/antarctic-blue-whale-project/bonney-upwelling-acoustic-testing-expeditions with further information here: http://www.marinemammals.gov.au/_data/assets/pdf_file/0005/135617/SC-64-SH11.pdf
The 'Logger' data entry system was developed by the International Fund for Animal Welfare (IFAW) and is a flexible system to record information during a voyage. This system was the primary data entry system for the voyage and all events were recorded in Logger’s database.
Blue whale voyage 1 datasets: 12 - 25 January 2012
Sightings from the first blue whale voyage are recorded across three Access databases:
20120117LoggerFinalPart1Updated.mdb
20120121LoggerFinalPart2Updated.mdb
20120125LoggerFinalPart3Updated.mdb
These databases contain tables describing:
Comments: details additional to sightings entered or data entry omissions, time stamped (UTC)
Observer effort: codes found in lookup table, date/time in UTC
GPS data: time stamped (UTC), and heading
Lookup: contains all topic codes to apply to all other tables
Resights: resighting details for sightings already recorded, time/date in UTC, initial sighting number, blow count and notes
Cetacean sightings: date/time in UTC, sighting number, observer name, vessel, estimate of distance, bearing, heading, species code, sighting cue code, estimate of number of individuals (low, best and high), group behaviour, pod compaction, surface synchronicity and comments
Weather: date/time in UTC, sightability, glare, sea state, wind strength, swell, weather, cloud cover, cloud height, notes
Blue whale voyage 2 datasets: 13 - 30 March 2012
GPS data is stored in the file called 'gps_meld_data_exp.csv'. This is an amalgam dataset of two GPS data streams that has been checked and corrected (see 'Quality' for further details). Date/time is stored in two formats. The first is %Y-%m-%d %H:%M:%S format, as in "2012-03-16 17:54:32". The second format is a concatenated, orderable numeric string, as in 20120316175432.
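As a minimal sketch, both date/time representations can be parsed into the same Python `datetime` value. The readable format string below assumes colons between hours, minutes and seconds, as in the quoted example "2012-03-16 17:54:32"; the function names are illustrative and not part of the dataset.

```python
from datetime import datetime

READABLE_FORMAT = "%Y-%m-%d %H:%M:%S"  # e.g. "2012-03-16 17:54:32"
COMPACT_FORMAT = "%Y%m%d%H%M%S"        # e.g. 20120316175432

def parse_readable(text):
    """Parse the human-readable date/time column."""
    return datetime.strptime(text, READABLE_FORMAT)

def parse_compact(value):
    """Parse the concatenated numeric form; accepts an int or a string."""
    return datetime.strptime(str(value), COMPACT_FORMAT)

def to_compact(moment):
    """Render a datetime back into the orderable numeric string."""
    return moment.strftime(COMPACT_FORMAT)

readable = parse_readable("2012-03-16 17:54:32")
compact = parse_compact(20120316175432)
```

Because the compact form is zero-padded and ordered from year down to second, sorting it as a number or as a string gives chronological order.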
### The small file 'trip_db.csv' contains a quick reference as to when the four trips of blue whale voyage 2 started, to the minute. These times have been corrected for the minor (i.e., 2 min 15 s) error (see 'Quality' below).
### Effort database is contained in the file 'VWhale2_database_effort_corrected.csv'. A fair amount of 'correction' has gone on with this data, as there were great variations in the way different people were adding new information into Logger. Furthermore, there were 'innovations' made to the Logger system, particularly after the first couple of trips. In particular, effort was entered into Logger on the first trip exactly as it had been on the first voyage (the VL was too seasick to make any amendments). So, according to the older effort classification, effort for the first trip started and ended, but there were no observer rotations or notes taken as to what platform the observers were perched on. Given there was quite a bit of seasickness that first day, the only observers likely to be working would have been PE, PO and DD. These observers favoured the Fly Bridge, so all sighting effort for the first trip has been allocated to these observers on the Fly Bridge.
The subsequent innovations were: observers were not told how far away a potential calling whale was. If, however, the acousticians thought that we were almost upon the animal(s), they would indicate this to the observing team.
Acoustic.search == 1 indicates when the acousticians have notified observers that there was a group of blue whales in the area.
Local.Search == 1 indicates that after an initial sighting was made, sighting effort and boat movement converted into a search to get closer to the animal(s) in order to confirm their species (not usually such a huge issue with blue whales, admittedly), group size and to get photo-ID.
FD == 1 when effort on the foredeck either started or continued. FB == 1 when effort on the fly bridge either started or continued.
For the effort types, the effort interval is defined as the time between the row the '1' value first appears and the date/time of the next row of the similar effort type.
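One possible reading of this interval definition can be sketched as follows. It assumes rows are sorted by time, that an interval opens at the first row of a run where the flag equals 1, and that it closes at the time of the next row where that flag is no longer 1. The `time` key, the row layout and the function name are hypothetical; the actual CSV columns may need to be joined to the GPS data first to obtain full date/time values.

```python
from datetime import datetime

def effort_intervals(rows, flag):
    """Return (start, end) pairs for one effort type.

    rows: list of dicts sorted by time, each with a 'time' datetime
    and 0/1 flag columns such as 'FD' or 'FB'.
    """
    intervals = []
    start = None
    for row in rows:
        active = row.get(flag) == 1
        if active and start is None:
            start = row["time"]          # the '1' value first appears
        elif not active and start is not None:
            intervals.append((start, row["time"]))
            start = None
    return intervals                     # an interval still open at the end is dropped

# Hypothetical effort rows for the foredeck flag.
rows = [
    {"time": datetime(2012, 3, 16, 8, 0), "FD": 1},
    {"time": datetime(2012, 3, 16, 8, 30), "FD": 1},   # effort continued
    {"time": datetime(2012, 3, 16, 9, 0), "FD": 0},
    {"time": datetime(2012, 3, 16, 10, 0), "FD": 1},
    {"time": datetime(2012, 3, 16, 10, 45), "FD": 0},
]
fd_intervals = effort_intervals(rows, "FD")
minutes = [(end - start).total_seconds() / 60 for start, end in fd_intervals]
```

If the intended definition instead closes an interval at the next row of any effort type, only the closing condition in the loop needs to change.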
Index.new: Because two databases were merged to form the one effort dataset (the first trip had its own Logger MS-Access database), an overall index, Index.new, was created for continuity.
Index: Effort index as it appears in the original Logger MS-Access databases.
GpsIndex: In Logger, each Effort (or sighting) row is tagged with the accompanying GPS index number. This ties an effort event with the date/time and geographical location information displayed in the GPS data.
GPSIndex.cor: As with GpsIndex but, again, as the databases were merged, a new GPSIndex value was created (.cor == corrected) to account for this, and for the added BPM GPS data.
GpsTime: Date (only), as derived from GPS. Has been abbreviated to only the date due to the joys of how Microsoft packages deal with date/time objects; the full date/time value for each effort row can be derived from the GPS data via the GPSIndex.cor value.
EffortNo: Each effort row has been assigned a unique number within each respective MS-Access Logger file. This is somewhat redundant with the Index value.
Local time: When Logger records an event, it also takes a date/time value from the local computer. It's not really clear to me what this value actually represents.
Observer: The head observer at the time the effort event was logged. Basically, this just means the person driving the Logger computer (i.e., physically entering values and making weather obs).
Event: Each event has a unique descriptor number. See the 'Lookup' table in the MS-Access database.
Event.cor: This column should be completely ignored.
Notes: Any comments that accompanied particular effort entries. See also the Comments table for notes not specifically related to any Effort entries.
Platform: Which sighting platforms observers either started or stopped effort on, or rotated through. Unfortunately, this information wasn't always consistently recorded. See the FB and FD columns for a more correct record of when sighting effort was on and off.
Platform.cor: This column should be ignored.
Observers: All observers on rotation.
Sonobuoy: When the launching of a sonobuoy was noted in Logger, here are the numbers (this is not a complete list).
Trip: Which trip it was.
##### Sightings for all species are given in 'sightings.csv'.
##### Weather observations are in 'weather.csv'. Recording of glare angles (i.e., start and end bearing) started on third trip.
##### Comments in 'comments.csv'. Please note there were no comments recorded during the first trip.
Location of RYR2 Associated CPVT Variants Dataset

Catecholaminergic polymorphic ventricular tachycardia (CPVT) is a rare inherited arrhythmia caused by pathogenic RYR2 variants. CPVT is characterized by exercise/stress-induced syncope and cardiac arrest in the absence of resting ECG and structural cardiac abnormalities. Here, we present a database collected from 225 clinical papers, published from 2001 to October 2020, about CPVT-associated RYR2 variants. The database contains 1355 patients with RYR2 variants, both with and without CPVT, of whom 968 are CPVT patients or suspected CPVT patients. It includes information regarding genetic diagnosis, location of the RYR2 variant(s), clinical history and presentation, and treatment strategies for each patient. Patients will have a varying depth of information in each of the provided fields.

Database website: https://cpvtdb.port5000.com/

Dataset Information

This dataset includes:

eTable2.xlsx
  Tabular version of the database.
  Most relevant tables in the PostgreSQL database regarding patient sex, conditions, treatments, family history, and variant information were joined to create this database.
  Views calculating the affected RYR2 exons, domains and subdomains have been joined to patient information.
  m-n tables for patient's conditions and treatments have been converted to pivot tables - every condition and treatment that has at least 1 person with that condition or treatment is a column.
  NOTE: This was created using a LEFT JOIN of the individuals and individual_variants tables. Individuals with more than 1 recorded variant will be listed on multiple rows. There is only 1 person in this database as of the current version with multiple recorded variants.

_.gz.sql
  PostgreSQL database dump.
  Expands to about 4.1 GB after loading the database dump.
  The database includes two schemas:
    public: Includes all information in patients and variants. Also includes all RYR2 variants in ClinVar.
    uta: Contains the biocommons/uta database required for the hgvs Python package to work locally. See https://github.com/biocommons/uta for more information.
  NOTE: It is recommended to use this version of the database only for development or analysis purposes.

database_tables.pdf
  Contains information on most of the database tables in the public schema.

00_globals.sql
  Required to load the PostgreSQL database dump.
  Creates a user named anonymous for the uta schema.

How To Load Database Using Docker

First, download the 00_globals.sql and _.gz.sql files and move them into a directory. The default postgres image will load files from the /docker-entrypoint-initdb.d directory if the database is empty. See Docker Hub for more information.

Example using docker compose with pgadmin and a volume to persist the data:

    # Use postgres/example user/password credentials
    version: '3.9'
    volumes:
      mydatabasevolume: null
    services:
      db:
        image: postgres:16
        restart: always
        environment:
          POSTGRES_PASSWORD: mysecretpassword
          POSTGRES_USER: postgres
        volumes:
          - ':/docker-entrypoint-initdb.d/'
          - 'mydatabasevolume:/var/lib/postgresql/data'
      pgadmin:
        image: dpage/pgadmin4
        environment:
          PGADMIN_DEFAULT_EMAIL: user@domain.com
          PGADMIN_DEFAULT_PASSWORD: SuperSecret

Creating the Database from Scratch

See https://github.com/alexdaiii/cpvt-database-loader for source code to create the database from scratch.
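The note about the LEFT JOIN of the individuals and individual_variants tables can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite database rather than the PostgreSQL dump, and the column names and rows are hypothetical; only the two table names come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE individuals (individual_id INTEGER PRIMARY KEY, sex TEXT);
CREATE TABLE individual_variants (individual_id INTEGER, variant TEXT);
INSERT INTO individuals VALUES (1, 'F'), (2, 'M'), (3, 'F');
-- Individual 1 has two variants, individual 3 has none recorded.
INSERT INTO individual_variants VALUES
    (1, 'variant_A'), (1, 'variant_B'), (2, 'variant_C');
""")

# LEFT JOIN keeps every individual and repeats those with several variants.
rows = conn.execute("""
SELECT i.individual_id, i.sex, v.variant
FROM individuals AS i
LEFT JOIN individual_variants AS v ON v.individual_id = i.individual_id
ORDER BY i.individual_id, v.variant
""").fetchall()

# Count distinct people rather than rows when summarising the joined table.
distinct_individuals = len({row[0] for row in rows})
conn.close()
```

In this toy example the join yields four rows for three individuals, so any per-patient statistic computed from the tabular file should deduplicate on the individual identifier first.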
CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.