100+ datasets found

d
Data Management Plan Examples Database
search.dataone.org
borealisdata.ca
Updated Sep 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Evering, Danica; Acharya, Shrey; Pratt, Isaac; Behal, Sarthak (2024). Data Management Plan Examples Database [Dataset]. http://doi.org/10.5683/SP3/SDITUG
Explore at:
Unique identifier
https://doi.org/10.5683/SP3/SDITUG
Dataset updated
Sep 4, 2024
Dataset provided by
Borealis
Authors
Evering, Danica; Acharya, Shrey; Pratt, Isaac; Behal, Sarthak
Time period covered
Jan 1, 2011 - Jan 1, 2023
Description
This dataset is comprised of a collection of example DMPs from a wide array of fields; obtained from a number of different sources outlined below. Data included/extracted from the examples include the discipline and field of study, author, institutional affiliation and funding information, location, date created, title, research and data-type, description of project, link to the DMP, and where possible external links to related publications or grant pages. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.
d
Data from: Database used for the evaluation of data used to identify...
catalog.data.gov
data.usgs.gov
+1more
Updated Nov 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Database used for the evaluation of data used to identify groundwater sources under the direct influence of surface water in Pennsylvania [Dataset]. https://catalog.data.gov/dataset/database-used-for-the-evaluation-of-data-used-to-identify-groundwater-sources-under-the-di
Explore at:
Dataset updated
Nov 21, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Pennsylvania
Description
The U.S. Geological Survey (USGS), in cooperation with the Pennsylvania Department of Environmental Protection (PADEP), conducted an evaluation of data used by the PADEP to identify groundwater sources under the direct influence of surface water (GUDI) in Pennsylvania (Gross and others, 2022). The data used in this evaluation and the processes used to compile them from multiple sources are described and provided herein. Data were compiled primarily but not exclusively from PADEP resources, including (1) source-information for public water-supply systems and Microscopic Particulate Analysis (MPA) results for public water-supply system groundwater sources from the agency’s Pennsylvania Drinking Water Information System (PADWIS) database (Pennsylvania Department of Environmental Protection, 2016), and (2) results associated with MPA testing from the PADEP Bureau of Laboratories (BOL) files and water-quality analyses obtained from the PADEP BOL, Sample Information System (Pennsylvania Department of Environmental Protection, written commun., various dates). Information compiled from sources other than the PADEP includes anthropogenic (land cover and PADEP region) and naturogenic (geologic and physiographic, hydrologic, soil characterization, and topographic) spatial data. Quality control (QC) procedures were applied to the PADWIS database to verify spatial coordinates, verify collection type information, exclude sources not designated as wells, and verify or remove values that were either obvious errors or populated as zero rather than as “no data.” The QC process reduced the original PADWIS dataset to 12,147 public water-supply system wells (hereafter referred to as the PADWIS database). An initial subset of the PADWIS database, termed the PADWIS database subset, was created to include 4,018 public water-supply system community wells that have undergone the Surface Water Identification Protocol (SWIP), a protocol used by the PADEP to classify sources as GUDI or non-GUDI (Gross and others, 2022). A second subset of the PADWIS database, termed the MPA database subset, represents MPA results for 631 community and noncommunity wells and includes water-quality data (alkalinity, chloride, Escherichia coli, fecal coliform, nitrate, pH, sodium, specific conductance, sulfate, total coliform, total dissolved solids, total residue, and turbidity) associated with groundwater-quality samples typically collected concurrently with the MPA sample. The PADWIS database and two subsets (PADWIS database subset and MPA database subset) are compiled in a single data table (DR_2022_Table.xlsx), with the two subsets differentiated using attributes that are defined in the associated metadata table (DR_2022_Metadata_Table_Variables.xlsx). This metadata file (DR_2022_Metadata.xml) describes data resources, data compilation, and QC procedures in greater detail.
Z
The Surface Water Chemistry (SWatCh) database
data.niaid.nih.gov
zenodo.org
Updated Apr 26, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rotteveel, Lobke; Heubach, Franz (2022). The Surface Water Chemistry (SWatCh) database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4559695
Explore at:
Dataset updated
Apr 26, 2022
Dataset provided by
Department of Mechanical Engineering, Dalhousie University
Sterling Hydrology Research Group, Dalhousie University
Authors
Rotteveel, Lobke; Heubach, Franz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset presented in the following manuscript: The Surface Water Chemistry (SWatCh) database: A standardized global database of water chemistry to facilitate large-sample hydrological research, which is currently under review at Earth System Science Data.

Openly accessible global scale surface water chemistry datasets are urgently needed to detect widespread trends and problems, to help identify their possible solutions, and determine critical spatial data gaps where more monitoring is required. Existing datasets are limited in availability, sample size/sampling frequency, and geographic scope. These limitations inhibit the answering of emerging transboundary water chemistry questions, for example, the detection and understanding of delayed recovery from freshwater acidification. Here, we begin to address these limitations by compiling the global surface water chemistry (SWatCh) database. We collect, clean, standardize, and aggregate open access data provided by six national and international agencies to compile a database containing information on sites, methods, and samples, and a GIS shapefile of site locations. We remove poor quality data (for example, values flagged as “suspect” or “rejected”), standardize variable naming conventions and units, and perform other data cleaning steps required for statistical analysis. The database contains water chemistry data for streams, rivers, canals, ponds, lakes, and reservoirs across seven continents, 24 variables, 33,722 sites, and over 5 million samples collected between 1960 and 2022. Similar to prior research, we identify critical spatial data gaps on the African and Asian continents, highlighting the need for more data collection and sharing initiatives in these areas, especially considering freshwater ecosystems in these environs are predicted to be among the most heavily impacted by climate change. We identify the main challenges associated with compiling global databases – limited data availability, dissimilar sample collection and analysis methodology, and reporting ambiguity – and provide recommended solutions. By addressing these challenges and consolidating data from various sources into one standardized, openly available, high quality, and trans-boundary database, SWatCh allows users to conduct powerful and robust statistical analyses of global surface water chemistry.
H
Current Population Survey (CPS)
dataverse.harvard.edu
search.dataone.org
Updated May 30, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anthony Damico (2013). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/AK4FDD
Dataset updated
May 30, 2013
Dataset provided by
Harvard Dataverse
Authors
Anthony Damico
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
analyze the current population survey (cps) annual social and economic supplement (asec) with r the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics ( bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups b y state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be t reated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population. the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show. this new github repository contains three scripts: 2005-2012 asec - download all microdata.R down load the fixed-width file containing household, family, and person records import by separating this file into three tables, then merge 'em together at the person-level download the fixed-width file containing the person-level replicate weights merge the rectangular person-level file with the replicate weights, then store it in a sql database create a new variable - one - in the data table 2012 asec - analysis examples.R connect to the sql database created by the 'download all microdata' progr am create the complex sample survey object, using the replicate weights perform a boatload of analysis examples replicate census estimates - 2011.R connect to the sql database created by the 'download all microdata' program create the complex sample survey object, using the replicate weights match the sas output shown in the png file below 2011 asec replicate weight sas output.png statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts for more detail about the current population survey - annual social and economic supplement (cps-asec), visit: the census bureau's current population survey page the bureau of labor statistics' current population survey page the current population survey's wikipedia article notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current populat ion survey to talk about america, subract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research. confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
d
Open access practices of selected library science journals
search.dataone.org
data.niaid.nih.gov
+2more
Updated May 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jennifer Jordan; Blair Solon; Stephanie Beene (2025). Open access practices of selected library science journals [Dataset]. http://doi.org/10.5061/dryad.pvmcvdnt3
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.pvmcvdnt3
Dataset updated
May 8, 2025
Dataset provided by
Dryad Digital Repository
Authors
Jennifer Jordan; Blair Solon; Stephanie Beene
Description
The data in this set was culled from the Directory of Open Access Journals (DOAJ), the Proquest database Library and Information Science Abstracts (LISA), and a sample of peer reviewed scholarly journals in the field of Library Science. The data include journals that are open access, which was first defined by the Budapest Open Access Initiative:Â By â€˜open accessâ€™ to [scholarly] literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. Starting with a batch of 377 journals, we focused our dataset to include journals that met the following criteria: 1) peer-reviewed 2) written in English or abstracted in English, 3) actively published at the time of..., Data Collection In the spring of 2023, researchers gathered 377 scholarly journals whose content covered the work of librarians, archivists, and affiliated information professionals. This data encompassed 221 journals from the Proquest database Library and Information Science Abstracts (LISA), widely regarded as an authoritative database in the field of librarianship. From the Directory of Open Access Journals, we included 144 LIS journals. We also included 12 other journals not indexed in DOAJ or LISA, based on the researchersâ€™ knowledge of existing OA library journals.Â The data is separated into several different sets representing the different indices and journals we searched. The first set includes journals from the database LISA. The following fields are in this dataset:

Journal: title of the journal

Publisher: title of the publishing company

Open Data Policy: lists whether an open data exists and what the policy is

Country of publication: country where the journal is publ..., , # Open access practices of selected library science journals

The data in this set was culled from the Directory of Open Access Journals (DOAJ), the Proquest database Library and Information Science Abstracts (LISA), and a sample of peer reviewed scholarly journals in the field of Library Science.

The data include journals that are open access, which was first defined by the Budapest Open Access Initiative:Â

By â€˜open accessâ€™ to [scholarly] literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.

Starting with a batch of 377 journals, we focused our dataset to include journals that met the following criteria: 1) peer-reviewed 2) written in Engli...
Z
Up-to-date mapping of COVID-19 treatment and vaccine development...
data.niaid.nih.gov
zenodo.org
Updated Jul 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wagner, Tomáš; Mišová, Ivana; Frankovský, Ján (2024). Up-to-date mapping of COVID-19 treatment and vaccine development (covid19-help.org data dump) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4601445
Explore at:
Dataset updated
Jul 19, 2024
Dataset provided by
Direct Impact s.r.o.
Authors
Wagner, Tomáš; Mišová, Ivana; Frankovský, Ján
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The free database mapping COVID-19 treatment and vaccine development based on the global scientific research is available at https://covid19-help.org/.

Files provided here are curated partial data exports in the form of .csv files or full data export as .sql script generated with pg_dump from our PostgreSQL 12 database. You can also find .png file with our ER diagram of tables in .sql file in this repository.

Structure of CSV files

*On our site, compounds are named as substances

compounds.csv

Id - Unique identifier in our database (unsigned integer)

Name - Name of the Substance/Compound (string)

Marketed name - The marketed name of the Substance/Compound (string)

Synonyms - Known synonyms (string)

Description - Description (HTML code)

Dietary sources - Dietary sources where the Substance/Compound can be found (string)

Dietary sources URL - Dietary sources URL (string)

Formula - Compound formula (HTML code)

Structure image URL - Url to our website with the structure image (string)

Status - Status of approval (string)

Therapeutic approach - Approach in which Substance/Compound works (string)

Drug status - Availability of Substance/Compound (string)

Additional data - Additional data in stringified JSON format with data as prescribing information and note (string)

General information - General information about Substance/Compound (HTML code)

references.csv

Id - Unique identifier in our database (unsigned integer)

Impact factor - Impact factor of the scientific article (string)

Source title - Title of the scientific article (string)

Source URL - URL link of the scientific article (string)

Tested on species - What testing model was used for the study (string)

Published at - Date of publication of the scientific article (Date in ISO 8601 format)

clinical-trials.csv

Id - Unique identifier in our database (unsigned integer)

Title - Title of the clinical trial study (string)

Acronym title - Acronym of title of the clinical trial study (string)

Source id - Unique identifier in the source database

Source id optional - Optional identifier in other databases (string)

Interventions - Description of interventions (string)

Study type - Type of the conducted study (string)

Study results - Has results? (string)

Phase - Current phase of the clinical trial (string)

Url - URL to clinical trial study page on clinicaltrials.gov (string)

Status - Status in which study currently is (string)

Start date - Date at which study was started (Date in ISO 8601 format)

Completion date - Date at which study was completed (Date in ISO 8601 format)

Additional data - Additional data in the form of stringified JSON with data as locations of study, study design, enrollment, age, outcome measures (string)

compound-reference-relations.csv

Reference id - Id of a reference in our DB (unsigned integer)

Compound id - Id of a substance in our DB (unsigned integer)

Note - Id of a substance in our DB (unsigned integer)

Is supporting - Is evidence supporting or contradictory (Boolean, true if supporting)

compound-clinical-trial.csv

Clinical trial id - Id of a clinical trial in our DB (unsigned integer)

Compound id - Id of a Substance/Compound in our DB (unsigned integer)

tags.csv

Id - Unique identifier in our database (unsigned integer)

Name - Name of the tag (string)

tags-entities.csv

Tag id - Id of a tag in our DB (unsigned integer)

Reference id - Id of a reference in our DB (unsigned integer)

API Specification

Our project also has an Open API that gives you access to our data in a format suitable for processing, particularly in JSON format.

https://covid19-help.org/api-specification

Services are split into five endpoints:

Substances - /api/substances

References - /api/references

Substance-reference relations - /api/substance-reference-relations

Clinical trials - /api/clinical-trials

Clinical trials-substances relations - /api/clinical-trials-substances

Method of providing data

All dates are text strings formatted in compliance with ISO 8601 as YYYY-MM-DD

If the syntax request is incorrect (missing or incorrectly formatted parameters) an HTTP 400 Bad Request response will be returned. The body of the response may include an explanation.

Data updated_at (used for querying changed-from) refers only to a particular entity and not its logical relations. Example: If a new substance reference relation is added, but the substance detail has not changed, this is reflected in the substance reference relation endpoint where a new entity with id and current dates in created_at and updated_at fields will be added, but in substances or references endpoint nothing has changed.

The recommended way of sequential download

During the first download, it is possible to obtain all data by entering an old enough date in the parameter value changed-from, for example: changed-from=2020-01-01 It is important to write down the date on which the receiving the data was initiated let’s say 2020-10-20

For repeated data downloads, it is sufficient to receive only the records in which something has changed. It can therefore be requested with the parameter changed-from=2020-10-20 (example from the previous bullet). Again, it is important to write down the date when the updates were downloaded (eg. 2020-10-20). This date will be used in the next update (refresh) of the data.

Services for entities

List of endpoint URLs:

/api/substances

/api/references

/api/substance-reference-relations

/api/clinical-trials

/api/clinical-trials-substances

Format of the request

All endpoints have these parameters in common:

changed-from - a parameter to return only the entities that have been modified on a given date or later.

continue-after-id - a parameter to return only the entities that have a larger ID than specified in the parameter.

limit - a parameter to return only the number of records specified (up to 1000). The preset number is 100.

Request example:

/api/references?changed-from=2020-01-01&continue-after-id=1&limit=100

Format of the response

The response format is the same for all endpoints.

number_of_remaining_ids - the number of remaining entities that meet the specified criteria but are not displayed on the page. An integer of virtually unlimited size.

entities - an array of entity details in JSON format.

Response example:

{

"number_of_remaining_ids" : 100, "entities" : [ { "id": 3, "url": "https://www.ncbi.nlm.nih.gov/pubmed/32147628", "title": "Discovering drugs to treat coronavirus disease 2019 (COVID-19).", "impact_factor": "Discovering drugs to treat coronavirus disease 2019 (COVID-19).", "tested_on_species": "in silico", "publication_date": "2020-22-02", "created_at": "2020-30-03", "updated_at": "2020-31-03", "deleted_at": null }, { "id": 4, "url": "https://www.ncbi.nlm.nih.gov/pubmed/32157862", "title": "CT Manifestations of Novel Coronavirus Pneumonia: A Case Report", "impact_factor": "CT Manifestations of Novel Coronavirus Pneumonia: A Case Report", "tested_on_species": "Patient", "publication_date": "2020-06-03", "created_at": "2020-30-03", "updated_at": "2020-30-03", "deleted_at": null }, ]

}

Endpoint details

Substances

URL: /api/substances

Substances endpoint returns data in the format specified in Response example as an array of entities in JSON format specified in the entity format section.

Entity format:

id - Unique identifier in our database (unsigned integer)

name - Name of the Substance (string)

description - Description (HTML code)

phase_of_research - Phase of research (string)

how_it_helps - How it helps (string)

drug_status - Drug status (string)

general_information - General information (HTML code)

synonyms - Synonyms (string)

marketed_as - "Marketed as" (string)

dietary_sources - Dietary sources name (string)

dietary_sources_url - Dietary sources URL (string)

prescribing_information - Prescribing information as an array of JSON objects with description and URL attributes as strings

formula - Formula (HTML code)

created_at - Date when the entity was added to our database (Date in ISO 8601 format)

updated_at - Date when the entity was last updated in our database (Date in ISO 8601 format)

deleted_at - Date when the entity was deleted in our database (Date in ISO 8601 format)

References

URL: /api/references

References endpoint returns data in the format specified in Response example as an array of entities in JSON format specified in the entity format section.

Entity format:

id - Unique identifier in our database (unsigned integer)

url - URL link of the scientific article (string)

title - Title of the scientific article (string)

impact_factor - Impact factor of the scientific article (string)

tested_on_species - What testing model was used for the study (string)

publication_date - Date of publication of the scientific article (Date in ISO 8601 format)

created_at - Date when the entity was added to our database (Date in ISO 8601 format)

updated_at - Date when the entity was last updated in our database (Date in ISO 8601

Data from: A consensus compound/bioactivity dataset for data-driven drug...

zenodo.org
data.niaid.nih.gov
+1more

zip

Updated May 13, 2022

Facebook

Twitter

Click to copy link

Link copied

Cite

Laura Isigkeit; Laura Isigkeit; Apirat Chaikuad; Apirat Chaikuad; Daniel Merk; Daniel Merk (2022). A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics [Dataset]. http://doi.org/10.5281/zenodo.6320761

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.6320761

Dataset updated

May 13, 2022

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Laura Isigkeit; Laura Isigkeit; Apirat Chaikuad; Apirat Chaikuad; Daniel Merk; Daniel Merk

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Information

The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144803 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell-lines and phenotypic readouts). It also provides simplified information on assay types underlying the bioactivity data and on bioactivity confidence by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, enabling an ease for generic uses in multiple applications such as chemogenomics and data-driven drug design.

The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks with flags for divergent data from different sources may help data selection and further accurate curation.

Structure and content of the dataset

**Dataset structure**
ChEMBL ID	PubChem ID	IUPHAR ID	Target	Activity type	Assay type	Unit	Mean C (0)	...	Mean PC (0)	...	Mean B (0)	...	Mean I (0)	...	Mean PD (0)	...	Activity check annotation	Ligand names	Canonical SMILES C	...	Structure check	Source

The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.

Except for the canonical SMILES columns, all columns are filled with the datatype ‘string’. The datatype for the canonical SMILES columns is the smiles-format. We recommend the File Reader node for using the dataset in KNIME. With the help of this node the data types of the columns can be adjusted exactly. In addition, only this node can read the compressed format.

Column content:

ChEMBL ID, PubChem ID, IUPHAR ID: chemical identifier of the databases
Target: biological target of the molecule expressed as the HGNC gene symbol
Activity type: for example, pIC₅₀
Assay type: Simplification/Classification of the assay into cell-free, cellular, functional and unspecified
Unit: unit of bioactivity measurement
Mean columns of the databases: mean of bioactivity values or activity comments denoted with the frequency of their occurrence in the database, e.g. Mean C = 7.5 *(15) -> the value for this compound-target pair occurs 15 times in ChEMBL database
Activity check annotation: a bioactivity check was performed by comparing values from the different sources and adding an activity check annotation to provide automated activity validation for additional confidence
- no comment: bioactivity values are within one log unit;
- check activity data: bioactivity values are not within one log unit;
- only one data point: only one value was available, no comparison and no range calculated;
- no activity value: no precise numeric activity value was available;
- no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration
Ligand names: all unique names contained in the five source databases are listed
Canonical SMILES columns: Molecular structure of the compound from each database
Structure check: To denote matching or differing compound structures in different source databases
- match: molecule structures are the same between different sources;
- no match: the structures differ;
- 1 source: no structure comparison is possible, because the molecule comes from only one source database.
Source: From which databases the data come from

d
Data from: Database for the U.S. Geological Survey Woods Hole Science...
catalog.data.gov
data.usgs.gov
Updated Nov 26, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Database for the U.S. Geological Survey Woods Hole Science Center's marine sediment samples, including locations, sample data and collection information (SED_ARCHIVE) [Dataset]. https://catalog.data.gov/dataset/database-for-the-u-s-geological-survey-woods-hole-science-centers-marine-sediment-samples-
Explore at:
Dataset updated
Nov 26, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Woods Hole
Description
The U.S. Geological Survey (USGS), Woods Hole Science Center (WHSC) has been an active member of the Woods Hole research community for over 40 years. In that time there have been many sediment collection projects conducted by USGS scientists and technicians for the research and study of seabed environments and processes. These samples are collected at sea or near shore and then brought back to the WHSC for study. While at the Center, samples are stored in ambient temperature, cold or freezing conditions, depending on the best mode of preparation for the study being conducted or the duration of storage planned for the samples. Recently, storage methods and available storage space have become a major concern at the WHSC. The shapefile sed_archive.shp, gives a geographical view of the samples in the WHSC's collections, and where they were collected along with images and hyperlinks to useful resources.
E
[JeDI] - Jellyfish Database Initiative: Global records on gelatinous...
erddap.bco-dmo.org
Updated Apr 3, 2018
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
BCO-DMO (2018). [JeDI] - Jellyfish Database Initiative: Global records on gelatinous zooplankton for the past 200 years, collected from global sources and literature (Trophic BATS project) (Plankton Community Composition and Trophic Interactions as Modifiers of Carbon Export in the Sargasso Sea ) [Dataset]. https://erddap.bco-dmo.org/erddap/info/bcodmo_dataset_526852/index.html
Explore at:
Dataset updated
Apr 3, 2018
Dataset provided by
Biological and Chemical Oceanographic Data Management Office (BCO-DMO)
Authors
BCO-DMO
License
https://www.bco-dmo.org/dataset/526852/licensehttps://www.bco-dmo.org/dataset/526852/license
Area covered
Sargasso Sea,
Variables measured
day, date, year, depth, month, taxon, contact, density, latitude, net_mesh, and 27 more
Description
The Jellyfish Database Initiative (JeDI) is a scientifically-coordinated global database dedicated to gelatinous zooplankton (members of the Cnidaria, Ctenophora and Thaliacea) and associated environmental data. The database holds 476,000 quantitative, categorical, presence-absence and presence only records of gelatinous zooplankton spanning the past four centuries (1790-2011) assembled from a variety of published and unpublished sources. Gelatinous zooplankton data are reported to species level, where identified, but taxonomic information on phylum, family and order are reported for all records. Other auxiliary metadata, such as physical, environmental and biometric information relating to the gelatinous zooplankton metadata, are included with each respective entry. JeDI has been developed and designed as an open access research tool for the scientific community to quantitatively define the global baseline of gelatinous zooplankton populations and to describe long-term and large-scale trends in gelatinous zooplankton populations and blooms. It has also been constructed as a future repository of datasets, thus allowing retrospective analyses of the baseline and trends in global gelatinous zooplankton populations to be conducted in the future. access_formats=.htmlTable,.csv,.json,.mat,.nc,.tsv,.esriCsv,.geoJson acquisition_description=This information has been synthesized by members of the Global Jellyfish Group from online databases, unpublished and published datasets. More specific details may be found in\u00a0"%5C%22http://dmoserv3.bco-%0Admo.org/data_docs/JeDI/Lucas_et_al_2014_GEB.pdf%5C%22">Lucas, C.J., et al. 2014. Gelatinous zooplankton biomass in the global oceans: geographic variation and environmental drivers. Global Ecol. Biogeogr. (DOI: 10.1111/geb.12169) in the\u00a0methods section. awards_0_award_nid=54810 awards_0_award_number=OCE-1030149 awards_0_data_url=http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=1030149 awards_0_funder_name=NSF Division of Ocean Sciences awards_0_funding_acronym=NSF OCE awards_0_funding_source_nid=355 awards_0_program_manager=David L. Garrison awards_0_program_manager_nid=50534 cdm_data_type=Other comment=JeDI: Jellyfish Database Initiative, associated with the Trophic BATS project PIs: R. Condon, C. Lucas, C. Duarte, K. Pitt version 2015.01.08 Note: The displayed view of this dataset is subject to updates Note: Duplicate records were removed on 2015.01.08 See: Dataset term legend for full text of abbreviations. Conventions=COARDS, CF-1.6, ACDD-1.3 data_source=extract_data_as_tsv version 2.3 19 Dec 2019 defaultDataQuery=&time<now doi=10.1575/1912/7191 Easternmost_Easting=180.0 geospatial_lat_max=88.74 geospatial_lat_min=-78.5 geospatial_lat_units=degrees_north geospatial_lon_max=180.0 geospatial_lon_min=-180.0 geospatial_lon_units=degrees_east geospatial_vertical_max=7632.0 geospatial_vertical_min=-10191.48 geospatial_vertical_positive=down geospatial_vertical_units=m infoUrl=https://www.bco-dmo.org/dataset/526852 institution=BCO-DMO metadata_source=https://www.bco-dmo.org/api/dataset/526852 Northernmost_Northing=88.74 param_mapping={'526852': {'lat': 'master - latitude', 'depth': 'master - depth', 'lon': 'master - longitude'}} parameter_source=https://www.bco-dmo.org/mapserver/dataset/526852/parameters people_0_affiliation=University of North Carolina - Wilmington people_0_affiliation_acronym=UNC-Wilmington people_0_person_name=Robert Condon people_0_person_nid=51335 people_0_role=Principal Investigator people_0_role_type=originator people_1_affiliation=University of Western Australia people_1_person_name=Carlos M. Duarte people_1_person_nid=526857 people_1_role=Co-Principal Investigator people_1_role_type=originator people_2_affiliation=National Oceanography Centre people_2_affiliation_acronym=NOC people_2_person_name=Cathy Lucas people_2_person_nid=526856 people_2_role=Co-Principal Investigator people_2_role_type=originator people_3_affiliation=Griffith University people_3_person_name=Kylie Pitt people_3_person_nid=526858 people_3_role=Co-Principal Investigator people_3_role_type=originator people_4_affiliation=Woods Hole Oceanographic Institution people_4_affiliation_acronym=WHOI BCO-DMO people_4_person_name=Danie Kinkade people_4_person_nid=51549 people_4_role=BCO-DMO Data Manager people_4_role_type=related project=Trophic BATS projects_0_acronym=Trophic BATS projects_0_description=Fluxes of particulate carbon from the surface ocean are greatly influenced by the size, taxonomic composition and trophic interactions of the resident planktonic community. Large and/or heavily-ballasted phytoplankton such as diatoms and coccolithophores are key contributors to carbon export due to their high sinking rates and direct routes of export through large zooplankton. The potential contributions of small, unballasted phytoplankton, through aggregation and/or trophic re-packaging, have been recognized more recently. This recognition comes as direct observations in the field show unexpected trends. In the Sargasso Sea, for example, shallow carbon export has increased in the last decade but the corresponding shift in phytoplankton community composition during this time has not been towards larger cells like diatoms. Instead, the abundance of the picoplanktonic cyanobacterium, Synechococccus, has increased significantly. The trophic pathways that link the increased abundance of Synechococcus to carbon export have not been characterized. These observations helped to frame the overarching research question, "How do plankton size, community composition and trophic interactions modify carbon export from the euphotic zone". Since small phytoplankton are responsible for the majority of primary production in oligotrophic subtropical gyres, the trophic interactions that include them must be characterized in order to achieve a mechanistic understanding of the function of the biological pump in the oligotrophic regions of the ocean. This requires a complete characterization of the major organisms and their rates of production and consumption. Accordingly, the research objectives are: 1) to characterize (qualitatively and quantitatively) trophic interactions between major plankton groups in the euphotic zone and rates of, and contributors to, carbon export and 2) to develop a constrained food web model, based on these data, that will allow us to better understand current and predict near-future patterns in export production in the Sargasso Sea. The investigators will use a combination of field-based process studies and food web modeling to quantify rates of carbon exchange between key components of the ecosystem at the Bermuda Atlantic Time-series Study (BATS) site. Measurements will include a novel DNA-based approach to characterizing and quantifying planktonic contributors to carbon export. The well-documented seasonal variability at BATS and the occurrence of mesoscale eddies will be used as a natural laboratory in which to study ecosystems of different structure. This study is unique in that it aims to characterize multiple food web interactions and carbon export simultaneously and over similar time and space scales. A key strength of the proposed research is also the tight connection and feedback between the data collection and modeling components. Characterizing the complex interactions between the biological community and export production is critical for predicting changes in phytoplankton species dominance, trophic relationships and export production that might occur under scenarios of climate-related changes in ocean circulation and mixing. The results from this research may also contribute to understanding of the biological mechanisms that drive current regional to basin scale variability in carbon export in oligotrophic gyres. projects_0_end_date=2014-09 projects_0_geolocation=Sargasso Sea, BATS site projects_0_name=Plankton Community Composition and Trophic Interactions as Modifiers of Carbon Export in the Sargasso Sea projects_0_project_nid=2150 projects_0_start_date=2010-10 sourceUrl=(local files) Southernmost_Northing=-78.5 standard_name_vocabulary=CF Standard Name Table v55 version=1 Westernmost_Easting=-180.0 xml_source=osprey2erddap.update_xml() v1.3
d
Data from: Compiled Database and Results of the Analysis of Multiple...
catalog.data.gov
data.usgs.gov
+1more
Updated Nov 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Compiled Database and Results of the Analysis of Multiple Groundwater-Quality Datasets for Idaho [Dataset]. https://catalog.data.gov/dataset/compiled-database-and-results-of-the-analysis-of-multiple-groundwater-quality-datasets-for
Explore at:
Dataset updated
Nov 27, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Idaho
Description
Groundwater is an important source of drinking and irrigation water throughout Idaho, and groundwater quality is monitored by various Federal, State, and local agencies. The historical, multi-agency records of groundwater quality include a valuable dataset that has yet to be compiled or analyzed on a statewide level. The purpose of this study is to combine groundwater-quality data from multiple sources into a single database, to summarize this dataset, and to perform bulk analyses to reveal spatial and temporal patterns of water quality throughout Idaho. Data were retrieved from the Water Quality Portal (www.waterqualitydata.us), the Idaho Department of Environmental Quality, and the Idaho Department of Water Resources. Analyses included counting the number of times a sample location had concentrations above Maximum Contaminant Levels (MCL), performing trends tests, and calculating correlations between water-quality analytes.
d
Data from: usSEABED: Offshore Surficial-Sediment Database for Samples...
catalog.data.gov
data.usgs.gov
+1more
Updated Nov 25, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). usSEABED: Offshore Surficial-Sediment Database for Samples Collected within the United States Exclusive Economic Zone [Dataset]. https://catalog.data.gov/dataset/usseabed-offshore-surficial-sediment-database-for-samples-collected-within-the-united-stat
Explore at:
Dataset updated
Nov 25, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
United States
Description
Since the second half of the 20th century, there has been an increase in scientific interest, research effort, and information gathered on the geologic sedimentary character of the continental margins of the United States. Data and information from thousands of sources have increased our scientific understanding of the geologic origins of the margin surface but rarely have those data been combined into a unified database. Initially, usSEABED was created by the U.S. Geological Survey (USGS), in cooperation with the Institute of Arctic and Alpine Research at the University of Colorado Boulder, for assessments of marine-based aggregates and for studies of sea-floor habitats by the U.S. Geological Survey (USGS). Since then, the USGS has continued to build up the database as a nationwide resource for many uses and applications. Previously published data derived from the usSEABED database have been released as three USGS data series publications containing data covering the U.S. Atlantic margin, the Gulf of Mexico and Caribbean regions, and the Pacific coast (Reid and others, 2005; Buczkowski and others, 2006; and Reid and others, 2006). This expanded USGS data release unifies the data from these three publications and includes an additional 54 data sources added to usSEABED since the original data series, provides revised output files, and expands the data coverage to include usSEABED data from all areas within the U.S. Exclusive Economic Zone (EEZ) as of the time of publication (including Alaska, Hawaii, and U.S. overseas territories). The usSEABED database was created using the most recent stable version of the dbSEABED software available to the USGS at the time of release (specifically, dbSEABED software [NMEv, version date 4/23/2010] using the dbSEABED thesaurus [db9 dict.rtf, version date 8/21/2009], the component set up file for U.S. waters [SET ABUN 2016.txt, version date 5/29/2016], and the facies set up file for U.S. waters [SET FACI.txt, version date 3/16/2012]). The USGS Open-File Report "Sediments and the sea floor of the continental shelves and coastal waters of the United States: About the usSEABED integrated sea-floor-characterization database, built with the dbSEABED processing system" (Buczkowski and others, 2020) accompanies this data release and provides information on the usSEABED database as well as the dbSEABED data processing system. Users are encouraged to read this companion report to learn more about how usSEABED is built, how the data should be interpreted, and how they are best used.
d
Idaho Groundwater Quality Dataset [Relational Database Table: Samples]
catalog.data.gov
data.usgs.gov
+1more
Updated Oct 29, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Idaho Groundwater Quality Dataset [Relational Database Table: Samples] [Dataset]. https://catalog.data.gov/dataset/idaho-groundwater-quality-dataset-relational-database-table-samples
Explore at:
Dataset updated
Oct 29, 2025
Dataset provided by
U.S. Geological Survey
Area covered
Idaho
Description
This dataset is a compilation of data obtained from the Idaho Department of Water Quality, the Idaho Department of Water Resources, and the Water Quality Portal. The 'Samples' table stores information about individual groundwater samples, including what was being sampled, when it was sampled, the results of the sample, etc. This table is related to the 'MonitoringLocation' table (which contains information about the well being sampled).
AWS Tickit Database
kaggle.com
zip
Updated Oct 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abraham Ajibade (2024). AWS Tickit Database [Dataset]. https://www.kaggle.com/datasets/abrahamajibade/aws-tickit-database/code
Explore at:
zip(32612225 bytes)Available download formats
Dataset updated
Oct 28, 2024
Authors
Abraham Ajibade
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset provides all the necessary files to set up the AWS Tickit database. It includes one SQL source file, a folder of .csv files and a folder of .txt files. Each can be used to create the database based on user preferences.

The database consists of seven tables: two fact tables and five dimensions. The two fact tables each contain less than 200,000 rows, and the dimensions range from 11 rows in the CATEGORY table up to about 50,000 rows in the USERS table.

This dataset is ideal for practicing SQL operations, setting up data pipelines, and learning how to integrate different file formats for database initialization.
EXOSAT Bibliography - Dataset - NASA Open Data Portal
data.nasa.gov
Updated Apr 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov (2025). EXOSAT Bibliography - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/exosat-bibliography
Explore at:
Dataset updated
Apr 1, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
This database table contains information about all EXOSAT publications in refereed journals that make use of EXOSAT data. Each entry is unique for every combination of publication and X-ray source. For example, a paper which discusses five X-ray sources will have generated five distinct entries in the database, each referring to a different X-ray source. Unlike EXOLOG, the EXOPUBS database also includes entries for serendipitous sources. In addition to standard database parameters such as source name, coordinates, object class, etc., the EXOPUBS includes the full reference (authors, journal, volume, page, year) and title of each publication. Note the information is not complete after the year 1991. This is a service provided by NASA HEASARC .
H
Hydroinformatics Instruction Module Example Code: Databases and SQL in...
hydroshare.org
beta.hydroshare.org
+1more
zip
Updated Mar 21, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amber Spackman Jones; Jeffery S. Horsburgh; Camilo J. Bastidas Pacheco (2022). Hydroinformatics Instruction Module Example Code: Databases and SQL in Python [Dataset]. https://www.hydroshare.org/resource/63694236207440e6a970920e818a9f66
Explore at:
zip(358.6 MB)Available download formats
Dataset updated
Mar 21, 2022
Dataset provided by
HydroShare
Authors
Amber Spackman Jones; Jeffery S. Horsburgh; Camilo J. Bastidas Pacheco
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This resource contains Jupyter Notebooks with examples that illustrate how to work with SQLite databases in Python including database creation and viewing and querying with SQL. The resource is part of set of materials for hydroinformatics and water data science instruction. Complete learning module materials are found in HydroLearn: Jones, A.S., Horsburgh, J.S., Bastidas Pacheco, C.J. (2022). Hydroinformatics and Water Data Science. HydroLearn. https://edx.hydrolearn.org/courses/course-v1:USU+CEE6110+2022/about..

This resources consists of 3 example notebooks and a SQLite database.

Notebooks: 1. Example 1: Querying databases using SQL in Python 2. Example 2: Python functions to query SQLite databases 3. Example 3: SQL join, aggregate, and subquery functions

Data files: These examples use a SQLite database that uses the Observations Data Model structure and is pre-populated with Logan River temperature data.
c
EXOSAT Bibliography
s.cnmilf.com
catalog.data.gov
Updated Sep 19, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
High Energy Astrophysics Science Archive Research Center (2025). EXOSAT Bibliography [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/exosat-bibliography
Explore at:
Dataset updated
Sep 19, 2025
Dataset provided by
High Energy Astrophysics Science Archive Research Center
Description
This database table contains information about all EXOSAT publications in refereed journals that make use of EXOSAT data. Each entry is unique for every combination of publication and X-ray source. For example, a paper which discusses five X-ray sources will have generated five distinct entries in the database, each referring to a different X-ray source. Unlike EXOLOG, the EXOPUBS database also includes entries for serendipitous sources. In addition to standard database parameters such as source name, coordinates, object class, etc., the EXOPUBS includes the full reference (authors, journal, volume, page, year) and title of each publication. Note the information is not complete after the year 1991. This is a service provided by NASA HEASARC .
d
Data from: USGS North American Packrat Midden Database, Version 5.0
catalog.data.gov
data.usgs.gov
+3more
Updated Nov 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). USGS North American Packrat Midden Database, Version 5.0 [Dataset]. https://catalog.data.gov/dataset/usgs-north-american-packrat-midden-database-version-5-0
Explore at:
Dataset updated
Nov 21, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
This data release contains the data tables for the USGS North American Packrat Midden Database (version 5.0). This version of the Midden Database contains data for 3,331 packrat midden samples obtained from published sources (journal articles, book chapters, theses, dissertations, government and private industry reports, conference proceedings) as well as unpublished data contributed by researchers. Compared to the previous version of the Midden Database (i.e., ver. 4), this version of the database (ver. 5.0) has been expanded to include more precise midden-sample site location data, calibrated midden-sample age data, and plant functional type (PFT) assignments for the taxa in each midden sample. In addition, World Wildlife Fund ecoregion and major habitat type (MHT) assignments (Ricketts and others, 1999, Terrestrial ecoregions of North America—A conservation assessment) and modern climate and bioclimate data (New and others, 2002; Davis and others, 2017) are provided for each midden-sample site location.
O*NET Database
onetcenter.org
excel, mysql, oracle +2
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Center for O*NET Development, O*NET Database [Dataset]. https://www.onetcenter.org/database.html
Explore at:
oracle, sql server, text, mysql, excelAvailable download formats
Dataset provided by
Occupational Information Network
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
United States
Dataset funded by
US Department of Labor, Employment and Training Administration
Description
The O*NET Database contains hundreds of standardized and occupation-specific descriptors on almost 1,000 occupations covering the entire U.S. economy. The database, which is available to the public at no cost, is continually updated by a multi-method data collection program. Sources of data include: job incumbents, occupational experts, occupational analysts, employer job postings, and customer/professional association input.
Data content areas include:
Worker Characteristics (e.g., Abilities, Interests, Work Styles)
Worker Requirements (e.g., Education, Knowledge, Skills)
Experience Requirements (e.g., On-the-Job Training, Work Experience)
Occupational Requirements (e.g., Detailed Work Activities, Work Context)
Occupation-Specific Information (e.g., Job Titles, Tasks, Technology Skills)
E-commerce dataset by Olist (SQLite)
kaggle.com
zip
Updated Apr 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Terenci Claramunt (2024). E-commerce dataset by Olist (SQLite) [Dataset]. https://www.kaggle.com/datasets/terencicp/e-commerce-dataset-by-olist-as-an-sqlite-database
Explore at:
zip(51085670 bytes)Available download formats
Dataset updated
Apr 28, 2024
Authors
Terenci Claramunt
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
I imported the two Olist Kaggle datasets into an SQLite database. I modified the original table names to make them shorter and easier to understand. Here's the Entity-Relationship Diagram of the resulting SQLite database:

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2473556%2F23a7d4d8cd99e36e32e57303eb804fff%2Fdb-schema.png?generation=1714391550829633&alt=media" alt="Database Schema">

Data sources:

https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce

https://www.kaggle.com/datasets/olistbr/marketing-funnel-olist

I used this database as a data source for my notebook:

SQL Challenge: E-commerce data analysis
Z
Source Code Archiving to the Rescue of Reproducible Deployment — Replication...
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated May 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Courtès, Ludovic; Sample, Timothy; Simon, Tournier; Zacchiroli, Stefano (2024). Source Code Archiving to the Rescue of Reproducible Deployment — Replication Package [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11243113
Explore at:
Dataset updated
May 23, 2024
Dataset provided by
Centre de Recherche Inria Bordeaux - Sud-Ouest
Université Paris Cité
Institut Polytechnique de Paris
Authors
Courtès, Ludovic; Sample, Timothy; Simon, Tournier; Zacchiroli, Stefano
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Replication package for the paper:

Ludovic Courtès, Timothy Sample, Simon Tournier, Stefano Zacchiroli.Source Code Archiving to the Rescue of Reproducible DeploymentACM REP'24, June 18-20, 2024, Rennes, Francehttps://doi.org/10.1145/3641525.3663622

Generating the paper

The paper can be generated using the following command:

guix time-machine -C channels.scm
-- shell -C -m manifest.scm
-- make

This uses GNU Guix to run make in the exact same computational environment used when preparing the paper. The computational environment is described by two files. The channels.scm file specifies the exact version of the Guix package collection to use. The manifest.scm file selects a subset of those packages to include in the environment.

It may be possible to generate the paper without Guix. To do so, you will need the following software (on top of a Unix-like environment):

GNU Make

SQLite 3

GNU AWK

Rubber

Graphviz

TeXLive

Structure

data/ contains the data examined in the paper

scripts/ contains dedicated code for the paper

logs/ contains logs generated during certain computations

Preservation of Guix

Some of the claims in the paper come from analyzing the Preservation of Guix (PoG) database as published on January 26, 2024. This database is the result of years of monitoring the extent to which the source code referenced by Guix packages is archived. This monitoring has been carried out by Timothy Sample who occasionally publishes reports on his personal website: https://ngyro.com/pog-reports/latest/. The database included in this package (data/pog.sql) was downloaded from https://ngyro.com/pog-reports/2024-01-26/pog.db and then exported to SQL format. In addition to the SQL file, the database schema is also included in this package as data/schema.sql.

The database itself is largely the result of scripts, but also of manual adjustments (where necessary or convenient). The scripts are available at https://git.ngyro.com/preservation-of-guix/, which is preserved in the Software Heritage archive as well: https://archive.softwareheritage.org/swh:1:snp:efba3456a4aff0bc25b271e128aa8340ae2bc816;origin=https://git.ngyro.com/preservation-of-guix. These scripts rely on the availability of source code in certain locations on the Internet, and therefore will not yield exactly the same result when run again.

Analysis

Here is an overview of how we use the PoG database in the paper. The exact way it is queried to produce graphs and tables for the paper is laid out in the Makefile.

The pog-types.sql query gives the counts of each source type (e.g. “git” or “tar-gz”) for each commit covered by the database.

The pog-status.sql query gives the archival status of the sources by commit. For each commit, it produces a count of how many sources are stored in the Software Heritage archive, missing from it, or unknown if stored or missing. The pog-status-total.sql query does the same thing but over all sources without sorting them into individual commits.

The disarchive-ratio.sql query estimates the success rate of Disarchive disassembly.

Finally, the swhid-ratio.sql query gives the proportion of sources for which the PoG database has an SWHID.

Estimating missing sources

The Preservation of Guix database only covers sources from a sample of commits to the Guix repository. This greatly simplifies the process of collecting the sources at the risk of missing a few. We estimate how many are missed by searching Guix’s Git history for Nix-style base-32 hashes. The result of this search is compared to the hashes in the PoG database.

A naïve search of Git history results in an over estimate due to Guix’s branch development model. We find hashes that were never exposed to users of ‘guix pull’. To work around this, we also approximate the history of commits available to ‘guix pull’. We do this by scraping push events from the guix-commits mailing list archives (data/guix-commits.mbox). Unfortunately, those archives are not quite complete. Missing history is reconstructed in the data/missing-links.txt file.

This estimate requires a copy of the Guix Git repository (not included in this package). The repository can be obtained from GNU at https://git.savannah.gnu.org/git/guix.git or from the Software Heritage archive: https://archive.softwareheritage.org/swh:1:snp:9d7b8dcf5625c17e42d51357848baa226b70e4bb;origin=https://git.savannah.gnu.org/git/guix.git. Once obtained, its location must be specified in the Makefile.

To generate the estimate, use:

guix time-machine -C channels.scm
-- shell -C -m manifest.scm
-- make data/missing-sources.txt

If not using Guix, you will need additional software beyond what is used to generate the paper:

GNU Guile

GNU Bash

GNU Mailutils

GNU Parallel

Measuring link rot

In order to measure link rot, we ran Guix Scheme scripts, i.e., scripts that exploit Guix as a Scheme library. The scripts depend on the state of world at the very specific moment when they ran. Hence, it is not possible to reproduce the exact same outputs. However, their tendency over the passing of time should be very similar. For running them, you need an installation of Guix. For instance,

guix repl -q scripts/table-per-origin.scm

When running these scripts for the paper, we tracked their output and saved it inside the logs directory.

Facebook

Twitter

Click to copy link

Link copied

Cite

Evering, Danica; Acharya, Shrey; Pratt, Isaac; Behal, Sarthak (2024). Data Management Plan Examples Database [Dataset]. http://doi.org/10.5683/SP3/SDITUG

Data Management Plan Examples Database

Explore at:

Unique identifier

https://doi.org/10.5683/SP3/SDITUG

Dataset updated

Sep 4, 2024

Dataset provided by

Borealis

Authors

Evering, Danica; Acharya, Shrey; Pratt, Isaac; Behal, Sarthak

Time period covered

Jan 1, 2011 - Jan 1, 2023

Description

This dataset is comprised of a collection of example DMPs from a wide array of fields; obtained from a number of different sources outlined below. Data included/extracted from the examples include the discipline and field of study, author, institutional affiliation and funding information, location, date created, title, research and data-type, description of project, link to the DMP, and where possible external links to related publications or grant pages. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.

Clear search

Close search

Google apps

Main menu

Data Management Plan Examples Database

Data from: Database used for the evaluation of data used to identify...

The Surface Water Chemistry (SWatCh) database

Current Population Survey (CPS)

Open access practices of selected library science journals

Up-to-date mapping of COVID-19 treatment and vaccine development...

Data from: A consensus compound/bioactivity dataset for data-driven drug...

Data from: Database for the U.S. Geological Survey Woods Hole Science...

[JeDI] - Jellyfish Database Initiative: Global records on gelatinous...

Data from: Compiled Database and Results of the Analysis of Multiple...

Data from: usSEABED: Offshore Surficial-Sediment Database for Samples...

Idaho Groundwater Quality Dataset [Relational Database Table: Samples]

AWS Tickit Database

EXOSAT Bibliography - Dataset - NASA Open Data Portal

Hydroinformatics Instruction Module Example Code: Databases and SQL in...

EXOSAT Bibliography

Data from: USGS North American Packrat Midden Database, Version 5.0

O*NET Database

E-commerce dataset by Olist (SQLite)

Source Code Archiving to the Rescue of Reproducible Deployment — Replication...

Data Management Plan Examples Database