Database of non-redundant sets of protein–small-molecule complexes that are especially suitable for structure-based drug design and protein–small-molecule interaction research. The PSMDB supports:
* Frequent updates - The number of new structures in the PDB is growing rapidly, and frequent updates are required to make use of them. In contrast to manual procedures, which require significant time and effort per update, generation of the PSMDB is fully automatic, which facilitates frequent database updates.
* Both protein and ligand structural redundancy - Two complexes are considered redundant if they share a similar protein and a similar ligand (the protein–small-molecule non-redundant set). This allows the database to contain structural information for the same protein bound to several different ligands (and vice versa). Additionally, for completeness, the database contains a set of complexes that is non-redundant when only protein structural redundancy is considered (the protein non-redundant set). The following images demonstrate the structural redundancy of protein complexes in the PDB compared to the PSMDB.
* Efficient handling of covalent bonds - Many protein complexes contain covalently bound ligands. Typically, protein-ligand databases discard these complexes; the PSMDB instead removes the covalently bound ligand from the complex and retains any non-covalently bound ligands, which increases the number of usable complexes in the database.
* Separate protein and ligand files - The PSMDB contains individual structure files for both the protein and all non-covalently bound ligands. The unbound proteins are in PDB format, while the individual ligands are in SDF format (in their native coordinate frame).
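For orientation only, here is a minimal Python sketch of how one such protein/ligand pair could be loaded downstream; the file names are placeholders rather than actual PSMDB paths, and Biopython/RDKit are simply one reasonable choice of parsers for the PDB and SDF formats mentioned above.

# Minimal sketch (placeholder file names) for reading one PSMDB protein/ligand pair.
from Bio.PDB import PDBParser   # Biopython: parses the unbound protein (PDB format)
from rdkit import Chem          # RDKit: parses the ligands (SDF format, native coordinates)

protein = PDBParser(QUIET=True).get_structure("protein", "psmdb_example_protein.pdb")
ligands = [m for m in Chem.SDMolSupplier("psmdb_example_ligands.sdf", removeHs=False) if m is not None]

print("chains:", [chain.id for chain in protein.get_chains()])
for ligand in ligands:
    print("ligand heavy atoms:", ligand.GetNumHeavyAtoms())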
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our research reports a systematic literature review of 49 publications on security studies with software developer participants. The attached files are: - A BibTeX file: all 49 references in BibTeX format. - An Excel spreadsheet: our analysis of each publication. Each row represents a publication and the columns represent the features that we analysed, such as the number of participants, whether there was a clear research question, or whether the paper reports ethics. - Database queries: the actual queries that we executed on the databases.
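As a small, hedged illustration of how the spreadsheet could be inspected (the file name below is an assumption, not the actual name in the archive):

# Minimal sketch (assumed file name) for loading the per-publication analysis spreadsheet.
import pandas as pd

analysis = pd.read_excel("slr_analysis.xlsx")   # one row per publication, one column per analysed feature
print(analysis.shape)                           # expected: 49 rows
print(analysis.columns.tolist())                # e.g. participants, research question, ethics reporting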
The scientific community has entered an era of big data. However, with big data come big responsibilities, and best practices for how data are contributed to databases have not kept pace with the collection, aggregation, and analysis of big data. Here, we rigorously assess the quantity of data for specific leaf area (SLA) available within the largest and most frequently used global plant trait database, the TRY Plant Trait Database, exploring how much of the data were applicable (i.e., original, representative, logical, and comparable) and traceable (i.e., published, cited, and consistent). Over three-quarters of the SLA data in TRY either lacked applicability or traceability, leaving only 22.9% of the original data usable compared to the 64.9% typically deemed usable by standard data cleaning protocols. The remaining usable data differed markedly from the original for many species, which led to altered interpretation of ecological analyses. Though the data we consider here make up onl... SLA data were downloaded from TRY (traits 3115, 3116, and 3117) for all conifer (Araucariaceae, Cupressaceae, Pinaceae, Podocarpaceae, Sciadopityaceae, and Taxaceae), Plantago, Poa, and Quercus species. The data have not been processed in any way, but additional columns have been added to the dataset that provide the viewer with information about where each data point came from, how it was cited, how it was measured, whether it was uploaded correctly, whether it had already been uploaded to TRY, and whether it was uploaded by the individual who collected the data. There are two additional documents associated with this publication. One is a Word document that includes a description of each of the 120 datasets that contained SLA data for the four plant groups within the study (conifers, Plantago, Poa, and Quercus). The second is an Excel document that contains the SLA data that were downloaded from TRY and all associated metadata.
Missing data codes: NA and N/A
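A hedged sketch of how the Excel export might be filtered to the trait IDs and plant groups described above; the file name and the TraitID/Family column names are assumptions about the spreadsheet layout, while the missing-data codes come from the record above.

# Minimal sketch (assumed file and column names) for subsetting the SLA data.
import pandas as pd

sla = pd.read_excel("TRY_SLA_with_metadata.xlsx", na_values=["NA", "N/A"])  # missing data codes NA and N/A
sla = sla[sla["TraitID"].isin([3115, 3116, 3117])]                          # the three SLA trait definitions
conifer_families = ["Araucariaceae", "Cupressaceae", "Pinaceae",
                    "Podocarpaceae", "Sciadopityaceae", "Taxaceae"]
conifers = sla[sla["Family"].isin(conifer_families)]
print(len(sla), "SLA records,", len(conifers), "from conifer families")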
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this research, the characteristics of a usable Graphical User Interface (GUI) are determined in the context of a historical database. A GUI is an interface that enables users to directly interact with the content the GUI is built upon and the functionalities the GUI offers. The historical database concerns former German citizens residing in the Netherlands and the process of removing their Enemy of the State status. This status was given by the Dutch government in the aftermath of WWII, as retribution for the German atrocities during the war. The operation ended due to resistance among Dutch citizens, after which the affected citizens could have their Enemy of the State status removed. The mockup GUI incorporated the following usability characteristics: giving users the information they seek, with justification; clear and useful functionalities; simplicity of use; and a structured layout. The mockup GUI was evaluated by average internet users, who tested it interactively and reviewed their experience using usability statements. The mockup GUI was evaluated as good, indicating that these characteristics make the GUI usable.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These databases consolidate a variety of datasets related to the model organism Ruegeria pomeroyi DSS-3. The data were primarily generated by members of the Moran Lab at the University of Georgia, and put together in this format using anvi'o v7.1-dev through the collaborative efforts of Zac Cooper, Sam Miller, and Iva Veseli (special thanks to Christa Smith and Lidimarie Trujillo Rodriguez for their help with gene annotations). The data includes:
(R_POM_DSS3-contigs.db) the complete genome and megaplasmid sequence of R. pomeroyi, along with highly-curated gene annotations established by the Moran Lab and automatically-generated annotations from NCBI COGs, KEGG KOfam/BRITE, Pfams, and anvi'o single-copy core gene sets. It also contains annotations for the Moran Lab's TnSeq mutant library (https://doi.org/10.1101/2022.09.11.507510; https://doi.org/10.1038/s43705-023-00244-6).
(PROFILE-VER_01.db) read-mapping data from multiple transcriptome and metatranscriptome samples generated by the Moran Lab, mapped to the R. pomeroyi genome. Some coverage data is stored in the AUXILIARY-DATA.db file. This data can be visualized using anvi-interactive. Publicly available samples are labeled with their SRA accession number.
(DEFAULT-EVERYTHING.db) gene-level coverage data from the transcriptome and metatranscriptome samples stored in the profile database, as well as per-gene normalized spectral abundance counts from proteomes matched to a subset of the transcriptomes, and gene mutant fitness data from https://doi.org/10.1073/pnas.2217200120. This data can also be visualized using anvi-interactive (see instructions below). The proteome data layers are labeled according to their matching transcriptome samples.
(R_pom_reproducible_workflow.md) a reproducible workflow describing how the databases were generated.
Please note that using these databases requires the development version of anvi'o (v8-dev), or a later version of anvi'o if available. They are not usable with anvi'o v8 or earlier.
Instructions for visualizing the genes database in the anvi'o interactive interface: Anvi'o expects genes databases to be located in a folder called GENES, so in order to use the specific database included in this datapack, you must move it to the expected location by running the following commands in your terminal:
mkdir GENES
mv DEFAULT-EVERYTHING.db GENES/
Once that is done, you can use the following command to visualize the gene-level information:
anvi-interactive -c R_POM_DSS3-contigs.db -p PROFILE-VER_01.db -C DEFAULT -b EVERYTHING --gene-mode
To view only the proteomic data and its matched transcriptomes, you can add the flag --state-autoload proteomes to the above command.
To view all transcriptomes and the proteomes organized by study of origin, you can add the flag --state-autoload figure to the above command.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The FOPPA database (French Open Public Procurement Award notices) is a database constituted in the framework of the ANR DeCoMaP project (ANR-19-CE38-0004). It contains public procurement notices published in France from 2010 to 2020 and relies on a subset of the TED database. These data have a number of issues, the most serious being that the unique IDs of most of the involved agents are missing. We performed a number of operations to solve these issues and obtain a usable database. These operations and their outcomes are described in detail in the following technical report:
L. Potin, V. Labatut, R. Figueiredo, C. Largeron & P. H. Morand. FOPPA: A database of French Open Public Procurement Award notices, Technical Report, Avignon Université, 2022. ⟨hal-03796734⟩
The report also describes the database itself.
Source code: The source code implementing these operations is publicly available as a GitHub repository:
https://github.com/CompNet/FoppaInit/releases/tag/v1.0.3
Cite as: If you use these data or the associated source code, please cite the following article:
Potin, L.; Labatut, V.; Morand, P.-H. & Largeron, C. FOPPA: an Open Database of French Public Procurement Award Notices From 2010-2020. Scientific Data 10:303, 2023. DOI: 10.1038/s41597-023-02213-z ⟨hal-04101350⟩
@Article{Potin2023,
  author  = {Potin, Lucas and Labatut, Vincent and Morand, Pierre-Henri and Largeron, Christine},
  title   = {{FOPPA}: an Open Database of French Public Procurement Award Notices From 2010-2020},
  journal = {Scientific Data},
  year    = {2023},
  volume  = {10},
  pages   = {303},
  doi     = {10.1038/s41597-023-02213-z},
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The FOPPA database (French Open Public Procurement Award notices) is a database constituted in the framework of the ANR DeCoMaP project (ANR-19-CE38-0004). It contains public procurement notices published in France from 2010 to 2020 and relies on a subset of the TED database. These data have a number of issues, the most serious being that the unique IDs of most of the involved agents are missing. We performed a number of operations to solve these issues and obtain a usable database. These operations and their outcomes are described in detail in the following technical report:
L. Potin, V. Labatut, R. Figueiredo, C. Largeron & P. H. Morand. FOPPA: A database of French Open Public Procurement Award notices, Technical Report, Avignon Université, 2022. ⟨hal-03796734⟩
The report also describes the database itself.
Source code: The source code implementing these operations is publicly available as a GitHub repository:
https://github.com/CompNet/FoppaInit/releases/tag/v1.0.0
Cite as: If you use these data or the associated source code, please cite the above report.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Automatically extracted Catalan word database built using alignment techniques (Montreal Forced Alignment, MFA) from speech databases with transcriptions, specifically Mozilla Common Voice, ParlamentParla, and OpenSLR-69. It is usable for training keyword-spotting models for home automation. MFA leverages algorithms to accurately synchronize speech signals with the corresponding text at the phoneme level. Two versions of the database have been created: general: this version encompasses all data, providing a comprehensive dataset for various analyses and applications. split: this version is divided into train, dev, and test sets to ease the task of training a keyword-spotting model; speaker-wise, it is divided 80%, 10%, and 10%.
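For readers who want to reproduce a comparable split themselves, the sketch below shows one way to partition utterances 80/10/10 by speaker; the metadata file name and the speaker_id column are assumptions, not part of the released database.

# Minimal sketch of a speaker-wise 80/10/10 train/dev/test split (assumed metadata layout).
import random
import pandas as pd

meta = pd.read_csv("catalan_words_metadata.csv")      # hypothetical: one row per extracted word utterance
speakers = sorted(meta["speaker_id"].unique())
random.Random(0).shuffle(speakers)                    # fixed seed so the split is reproducible

n = len(speakers)
train_spk = set(speakers[:int(0.8 * n)])
dev_spk = set(speakers[int(0.8 * n):int(0.9 * n)])
test_spk = set(speakers[int(0.9 * n):])

splits = {name: meta[meta["speaker_id"].isin(spk)] for name, spk in
          [("train", train_spk), ("dev", dev_spk), ("test", test_spk)]}
print({name: len(df) for name, df in splits.items()})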
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The (polyphone-like) English SpeechDat(M) database was recorded within the framework of the SPEECHDAT(M) project. It consists of 1,000 speakers, chosen according to their individual demographics, who were recorded over digital telephone lines using fixed telephone sets. The material to be spoken was provided to the caller via a prompt sheet. The database is divided into two sub-sets: the phonetically rich sentences (one CD), known as DB2, and the application-oriented utterances (two CDs), known as DB1. The recorded material in DB1 comprises immediately usable and relevant speech, including number and letter sequences, common control keywords, dates, times, money amounts, etc. This provides a realistic basis for using these resources for the training and assessment of speaker-independent recognition of both isolated and continuous speech utterances, employing either whole-word modeling and/or phoneme-based approaches. The sample rate for speech is 8 kHz, quantisation is 8 bit, and A-law encoding is used. This results in a data rate of 64 kbit/s. A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
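The quoted data rate follows directly from the recording parameters; the lines below only make that arithmetic explicit and assume nothing about the SpeechDat file layout.

# Data rate implied by 8 kHz sampling with 8-bit A-law samples.
sample_rate_hz = 8000
bits_per_sample = 8
bit_rate = sample_rate_hz * bits_per_sample   # 64,000 bit/s
print(bit_rate, "bit/s =", bit_rate // 1000, "kbit/s =", bit_rate // 8 // 1000, "kB/s")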
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: The ability to apply standard and interoperable solutions for implementing and managing medical registries, as well as to aggregate, reproduce, and access data sets from legacy formats and platforms to advanced standard formats and operating systems, is crucial for both clinical healthcare and biomedical research settings. Purpose: Our study describes a reproducible, highly scalable, standard framework for a device registry implementation addressing both local data-quality components and global linking problems. Methods and Results: We developed a device registry framework involving the following steps: (1) data standards definition and representation of the research workflow, (2) development of electronic case report forms using REDCap (Research Electronic Data Capture), (3) data collection according to the clinical research workflow, (4) data augmentation by enriching the registry database with local electronic health records, a governmental database, and linked open data collections, (5) data quality control, and (6) data dissemination through the registry web site. Our registry adopted all applicable standardized data elements proposed by the American College of Cardiology / American Heart Association Clinical Data Standards, as well as variables derived from cardiac device randomized trials and the Clinical Data Interchange Standards Consortium. Local interoperability was achieved between REDCap and data derived from the Electronic Health Record system. The original data set was also augmented by incorporating the reimbursed values paid by the Brazilian government during a hospitalization for pacemaker implantation. By linking our registry to the open data collection repository Linked Clinical Trials (LinkedCT), we found 130 clinical trials which are potentially correlated with our pacemaker registry. Conclusion: This study demonstrates how standard and reproducible solutions can be applied in the implementation of medical registries to constitute a re-usable framework. Such an approach has the potential to facilitate data integration between healthcare and research settings, and is also a useful framework for other biomedical registries.
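As a hedged illustration of step (3), this is roughly what a record export from a REDCap project looks like via its API; the URL and token are placeholders, and the field set depends on the registry's data dictionary.

# Minimal sketch of exporting registry records from REDCap through its API (placeholder URL/token).
import requests

payload = {
    "token": "YOUR_API_TOKEN",   # project-specific API token
    "content": "record",
    "format": "json",
    "type": "flat",
}
response = requests.post("https://redcap.example.org/api/", data=payload, timeout=30)
response.raise_for_status()
records = response.json()        # list of dicts, one per record/instrument instance
print(len(records), "records exported")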
A machine-usable dictionary containing thousands of words, each with linguistic and psycholinguistic attributes (psychological measures are recorded for a small percentage of words). The dictionary may be of use to researchers in psychology or linguistics to develop sets of experimental stimuli, or to those in artificial intelligence and computer science who require psychological and linguistic descriptions of words.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Reaction classification has important applications, and many approaches to classification have been applied. Our own algorithm tests all maximum common substructures (MCS) between all reactant and product molecules in order to find an atom mapping containing the minimum chemical distance (MCD). Recent publications have concluded that new MCS algorithms need to be compared with existing methods in a reproducible environment, preferably on a generalized test set, yet the number of test sets available is small, and they are not truly representative of the range of reactions that occur in real reaction databases. We have designed a challenging test set of reactions and are making it publicly available and usable with InfoChem’s software or other classification algorithms. We supply a representative set of example reactions, grouped into different levels of difficulty, from a large number of reaction databases that chemists actually encounter in practice, in order to demonstrate the basic requirements for a mapping algorithm to detect the reaction centers in a consistent way. We invite the scientific community to contribute to the future extension and improvement of this data set, to achieve the goal of a common standard.
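To make the MCS idea concrete, here is a small sketch using RDKit's generic rdFMCS module on an arbitrary reactant/product pair; this is a stand-in illustration of maximum-common-substructure matching, not the InfoChem algorithm or the test set described above.

# Minimal sketch: maximum common substructure between a reactant and a product (generic RDKit MCS).
from rdkit import Chem
from rdkit.Chem import rdFMCS

reactant = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, as an arbitrary example
product = Chem.MolFromSmiles("Oc1ccccc1C(=O)O")          # salicylic acid, its hydrolysis product

result = rdFMCS.FindMCS([reactant, product])
print(result.numAtoms, "atoms,", result.numBonds, "bonds in the MCS:", result.smartsString)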
The HCV Immunology Database contains a curated inventory of immunological epitopes in HCV and their interaction with the immune system, with associated retrieval and analysis tools. The funding for the HCV database project has stopped, and this website and the HCV immunology database are no longer maintained. The site will stay up, but problems will not be fixed. The database was last updated in September 2007. The HIV immunology website contains the same tools and may be usable for non-HCV-specific analyses. For new epitope information, users of this database can try the Immune Epitope Database (http://www.immuneepitope.org).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here, we focus on the implementation of the Convention on Biological Diversity (CBD) National Biodiversity Strategies and Action Plans (NBSAPs), for a usable set of NBSAPs covering 30% of all Parties. We analysed 58 pairs of NBSAPs and Sixth National Reports after aligning measures taken against initial plans. Fewer than half of the commitments made with deadlines of 2020 or earlier were evidenced. Moreover, the largest source of missing commitment reports was "omissions": the commitment was not referenced in the Report. The number and proportion of commitments evidenced varied between the Aichi Targets, with more losses in "high-profile" and "institutionally challenging" targets, and between Human Development Index categories. Megadiverse countries had higher rates of reported commitment implementation and effectiveness than others, and fewer omissions. Our results are important for informing the monitoring of commitment implementation in the Kunming-Montreal "global biodiversity package".
Under the direction of the state-owned company Geobase Information and Surveying Saxony (GeoSN), the area data of the soil estimation are made available digitally and handed over to the LfULG for further evaluation. As of December 2022, digitisation covers about 99% of the soil estimation area. The data stock contains the class symbols of the soil estimation with the values for arable and grassland soils. In addition, the LfULG derived the "usable field capacity" (nFK) and the "field capacity" (FK) from this database.
3D web repositories are a hot topic for the research community in general. In the Cultural Heritage (CH) context, 3D repositories pose a difficult challenge due to the complexity and variability of models and to the need for structured and coherent metadata for browsing and searching. This paper presents one of the efforts of the ArchAIDE project: to create a structured and semantically rich 3D database of pottery types, usable by archaeologists and other communities, for example researchers working on shape-based analysis and automatic classification. The automated workflow described here starts from pages of a printed catalog, extracts the textual and graphical description of a pottery type, and processes those data to produce structured metadata information and a 3D representation. This information is then ingested into the database, where it becomes accessible to the community through dynamically created web presentation pages, showing 3D, 2D and metadata information in a common context.
https://www.imperial.ac.uk/neonatal-data-analysis-unit/neonatal-data-analysis-unit/utilising-the-nnrd/
The National Neonatal Research Database is an award-winning resource, a dynamic relational database containing information extracted from the electronic patient records of babies admitted to NHS neonatal units in England, Wales and Scotland (Northern Ireland is currently addressing regulatory requirements for participation). The NNRD-AI is a version of the NNRD curated for machine learning and artificial intelligence applications.
A team led by Professor Neena Modi at the Chelsea and Westminster Hospital campus of Imperial College London established the NNRD in 2007 as a resource to support clinical teams, managers, professional organisations, policy makers, and researchers who wish to evaluate and improve neonatal care and services. Recently, supported by an award from the Medical Research Council, the neonatal team and collaborating data scientists at the Institute for Translational Medicine and Therapeutics, Data Science Group at Imperial College London, created NNRD-AI.
The NNRD-AI is a subset of the full NNRD with around 200 baby variables, 100 daily variables and 450 additional aggregate variables. The guiding principle underpinning the creation of the NNRD-AI is to make available data that require minimal input from domain experts. Raw electronic patient record data are heavily influenced by the collection process. Additional processing is required to construct higher-order data representations suitable for modelling and application of machine learning/artificial intelligence techniques. In NNRD-AI, data are encoded as readily usable numeric and string variables. Imputation methods, derived from domain knowledge, are utilised to reduce missingness. Out-of-range values are removed and clinical consistency algorithms are applied. A wide range of definitions of complex major neonatal morbidities (e.g. necrotising enterocolitis, bronchopulmonary dysplasia, retinopathy of prematurity), aggregations of daily data and clinically meaningful representations of anthropometric variables and treatments are also available.
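A hedged sketch of the kind of cleaning described above (out-of-range removal followed by imputation); the file name, variable name and plausibility bounds are illustrative assumptions, not the NNRD-AI algorithms themselves.

# Minimal sketch of out-of-range removal and simple imputation (illustrative column and bounds).
import numpy as np
import pandas as pd

df = pd.read_csv("nnrd_ai_subset.csv")                                 # hypothetical extract
implausible = ~df["birthweight_g"].between(200, 7000)                  # assumed plausibility range
df.loc[implausible, "birthweight_g"] = np.nan                          # remove out-of-range values
df["birthweight_g"] = df["birthweight_g"].fillna(df["birthweight_g"].median())  # crude imputation
print(df["birthweight_g"].describe())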
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Excel file is Table S1: Contamination levels for 111,088 bacterial genomes of NCBI RefSeq. This table includes all CheckM classic output fields of contamination estimation (option lineage_wf). Estimates from Physeter in k-fold mode are further provided for 12,326 genomes for which CheckM values were not usable. The reasons for the rejection of CheckM estimates in these cases are given.
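A brief, hedged sketch of how the table could be split into genomes with usable CheckM estimates and the 12,326 genomes that fall back on Physeter; the file name and column label are assumptions about the supplementary table layout.

# Minimal sketch (assumed file/column names) for separating rejected CheckM estimates.
import pandas as pd

table_s1 = pd.read_excel("Table_S1_contamination.xlsx")
checkm_rejected = table_s1[table_s1["CheckM contamination"].isna()]   # rows relying on Physeter k-fold estimates
print(len(table_s1), "genomes total,", len(checkm_rejected), "with rejected CheckM estimates")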
Overview: The Lower Nooksack Water Budget Project involved assembling a wide range of existing data related to WRIA 1 and specifically the Lower Nooksack Subbasin, updating existing data sets and generating new data sets. This Data Management Plan provides an overview of the data sets, formats and collaboration environment that was used to develop the project. Use of a plan during development of the technical work products provided a forum for the data development and management to be conducted with transparent methods and processes. At project completion, the Data Management Plan provides an accessible archive of the data resources used and supporting information on the data storage, intended access, sharing and re-use guidelines.
One goal of the Lower Nooksack Water Budget project is to make this “usable technical information” as accessible as possible across technical, policy and general public users. The project data, analyses and documents will be made available through the WRIA 1 Watershed Management Project website http://wria1project.org. This information is intended for use by the WRIA 1 Joint Board and partners working to achieve the adopted goals and priorities of the WRIA 1 Watershed Management Plan.
Model outputs for the Lower Nooksack Water Budget are summarized by sub-watersheds (drainages) and point locations (nodes). In general, due to changes in land use over time and changes to available streamflow and climate data, the water budget for any watershed needs to be updated periodically. Further detailed information about data sources is provided in review packets developed for specific technical components including climate, streamflow and groundwater level, soils and land cover, and water use.
Purpose: This project involves assembling a wide range of existing data related to WRIA 1 and specifically the Lower Nooksack Subbasin, updating existing data sets and generating new data sets. Data will be used as input to various hydrologic, climatic and geomorphic components of the Topnet-Water Management (WM) model, but will also be available to support other modeling efforts in WRIA 1. Much of the data used as input to the Topnet model is publicly available and maintained by others (e.g., USGS DEMs and streamflow data, SSURGO soils data, University of Washington gridded meteorological data). Pre-processing is performed to convert these existing data into a format that can be used as input to the Topnet model. Topnet model ASCII-text file outputs are subsequently post-processed and combined with spatial data to generate GIS data that can be used to create maps and illustrations of the spatial distribution of water information. Other products generated during this project include documentation of methods, input by the WRIA 1 Joint Board Staff Team during review and comment periods, and communication tools developed for public engagement and public comment on the project.
In order to maintain an organized system of developing and distributing data, Lower Nooksack Water Budget project collaborators should be familiar with the standards for data management described in this document and with the following issues related to generating and distributing data:
1. Standards for metadata and data formats
2. Plans for short-term storage and data management (i.e., file formats, local storage and back-up procedures, and security)
3. Legal and ethical issues (i.e., intellectual property, confidentiality of study participants)
4. Access policies and provisions (i.e., how the data will be made available to others, any restrictions needed)
5. Provisions for long-term archiving and preservation (i.e., establishment of a new data archive or utilization of an existing archive)
6. Assigned data management responsibilities (i.e., persons responsible for ensuring data management and monitoring compliance with the Data Management Plan)
This resource is a subset of the LNWB Ch03 Data Processes Collection Resource.