85 datasets found

ESS-DIVE Reporting Format for Comma-separated Values (CSV) File Structure
dataone.org
data.ess-dive.lbl.gov
Updated Apr 4, 2022
Cite
Terri Velliquette; Jessica Welch; Michael Crow; Ranjeet Devarakonda; Susan Heinz; Robert Crystal-Ornelas (2022). ESS-DIVE Reporting Format for Comma-separated Values (CSV) File Structure [Dataset]. http://doi.org/10.15485/1734841
Explore at:
Unique identifier
https://doi.org/10.15485/1734841
Dataset updated
Apr 4, 2022
Dataset provided by
ESS-DIVE
Authors
Terri Velliquette; Jessica Welch; Michael Crow; Ranjeet Devarakonda; Susan Heinz; Robert Crystal-Ornelas
Time period covered
Jan 1, 2020 - Sep 30, 2021
Description
The ESS-DIVE reporting format for comma-separated values (CSV) file structure is based on a combination of existing guidelines and recommendations, including some found within the Earth Science Community, with valuable input from the Environmental Systems Science (ESS) Community. The CSV reporting format is designed to promote interoperability and machine-readability of CSV data files while also facilitating the collection of some file-level metadata content. Tabular data in the form of rows and columns should be archived in its simplest form, and we recommend submitting these tabular data following the ESS-DIVE reporting format for generic comma-separated values (CSV) text format files. In general, the CSV file format is more likely to be accessible by future systems than a proprietary format, and CSV files are preferred because the format is easier to exchange between different programs, which increases the interoperability of a data file. Defining the reporting format and providing guidelines for how to structure CSV files and some of the field content within them can increase the machine-readability of the data file for extracting, compiling, and comparing the data across files and systems. Data package files are in .csv, .png, and .md. Open the .csv with e.g. Microsoft Excel, LibreOffice, or Google Sheets. Open the .md files by downloading and using a text editor (e.g., Notepad or TextEdit). Open the .png in e.g. a web browser, photo viewer/editor, or Google Drive.
Tidal Daily Discharge and Quality Assurance Data Supporting an Assessment of...
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). Tidal Daily Discharge and Quality Assurance Data Supporting an Assessment of Water Quality and Discharge in the Herring River, Wellfleet, Massachusetts, November 2015–September 2017 [Dataset]. https://catalog.data.gov/dataset/tidal-daily-discharge-and-quality-assurance-data-supporting-an-assessment-of-water-quality
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
Herring River, Wellfleet, Massachusetts
Description
This data release provides data in support of an assessment of water quality and discharge in the Herring River at the Chequessett Neck Road dike in Wellfleet, Massachusetts, from November 2015 to September 2017. The assessment was a cooperative project among the U.S. Geological Survey, National Park Service, Cape Cod National Seashore, and the Friends of Herring River to characterize environmental conditions prior to a future removal of the dike. It is described in U.S. Geological Survey (USGS) Scientific Investigations Report "Assessment of Water Quality and Discharge in the Herring River, Wellfleet, Massachusetts, November 2015 – September 2017." This data release is structured as a set of comma-separated values (CSV) files, each of which contains information on data source (or laboratory used for analysis), USGS site identification (ID) number, beginning date and time of observation or sampling, ending date and time of observation or sampling, and data such as flow rate and analytical results. The CSV files include calculated tidal daily flows (Flood_Tide_Tidal_Day.csv and Ebb_Tide_Tidal_Day.csv) that were used in Huntington and others (2020) for estimation of nutrient loads. Tidal daily flows are the estimated mean daily discharges for two consecutive flood and ebb tide cycles (average duration: 24 hours, 48 minutes). The associated date is the day on which most of the flow occurred. CSV files contain quality assurance data for water-quality samples including blanks (Blanks.csv), replicates (Replicates.csv), standard reference materials (Standard_Reference_Material.csv), and atmospheric ammonium contamination (NH4_Atmospheric_Contamination.csv). One CSV file (EWI_vs_ISCO.csv) contains data comparing composite samples collected by an automatic sampler (ISCO) at a fixed point with depth-integrated samples collected at equal width increments (EWI).
One CSV file (Cross_Section_Field_Parameters.csv) contains field parameter data (specific conductance, temperature, pH, and dissolved oxygen) collected at a fixed location and data collected along the cross sections at variable water depths and horizontal distances across the openings of the culverts at the Chequessett Neck Road dike. One CSV file (LOADEST_Bias_Statistics.csv) contains data that include estimated natural log of load, model residuals, Z-scores, and seasonal model residuals for winter (December, January, and February); spring (March, April and May); summer (June, July and August); and fall (September, October, and November). The data release also includes a data dictionary (Data_Dictionary.csv) that provides detailed descriptions of each field in each CSV file, including: data filename; laboratory or data source; U.S. Geological Survey site ID numbers; data types; constituent (analyte) U.S. Geological Survey parameter codes; descriptions of parameters; units; methods; minimum reporting limits; limits of quantitation, if appropriate; method reference citations; and minimum, maximum, median, and average values for each analyte. The data release also includes an abbreviations file (Abbreviations.pdf) that defines all the abbreviations in the data dictionary and CSV files. Note that the USGS site ID includes a leading zero (011058798) and some of the parameter codes contain leading zeros, so care must be taken in opening and subsequently saving these files in other formats where leading zeros may be dropped.
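The leading-zero caution above can be illustrated with a minimal Python sketch. It assumes a tiny in-memory excerpt whose column names (site_id, parameter_code, value) and measurement value are hypothetical and not taken from the data release; only the site ID 011058798 comes from the description. The standard-library csv module returns every field as text, so the zeros are kept unless a field is explicitly converted to a number.

```python
import csv
import io

# Hypothetical excerpt: the column names and the value are illustrative only.
sample = io.StringIO(
    "site_id,parameter_code,value\n"
    "011058798,00095,45200\n"
)

# csv.DictReader yields every field as a string, so leading zeros survive.
rows = list(csv.DictReader(sample))
site_id = rows[0]["site_id"]
parameter_code = rows[0]["parameter_code"]

# Converting an identifier to a number is what drops the leading zero.
as_number = int(site_id)

print(site_id, parameter_code, as_number)
```

Spreadsheet programs often perform the numeric conversion automatically on open, which is why identifier columns are safer when imported as text.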
Data from: Dataset from : Browsing is a strong filter for savanna tree...
data.niaid.nih.gov
Updated Oct 1, 2021
Cite
Wayne Twine (2021). Dataset from : Browsing is a strong filter for savanna tree seedlings in their first growing season [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4972083
Explore at:
Dataset updated
Oct 1, 2021
Dataset provided by
Archibald, Sally
Craddock Mthabini
Wayne Twine
Nicola Stevens
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data presented here were used to produce the following paper:

Archibald, Twine, Mthabini, Stevens (2021) Browsing is a strong filter for savanna tree seedlings in their first growing season. J. Ecology.

The project under which these data were collected is: Mechanisms Controlling Species Limits in a Changing World. NRF/SASSCAL Grant number 118588

For information on the data or analysis please contact Sally Archibald: sally.archibald@wits.ac.za

Description of file(s):

File 1: cleanedData_forAnalysis.csv (required to run the R code: "finalAnalysis_PostClipResponses_Feb2021_requires_cleanData_forAnalysis_.R")

The data represent monthly survival and growth data for ~740 seedlings from 10 species under various levels of clipping.

The data consist of one .csv file with the following column names:

treatment: Clipping treatment (1 - 5 months clip plus control unclipped)
plot_rep: One of three randomised plots per treatment
matrix_no: Where in the plot the individual was placed
species_code: First three letters of the genus name, and first three letters of the species name uniquely identifies the species
species: Full species name
sample_period: Classification of sampling period into time since clip.
status: Alive or Dead
standing.height: Vertical height above ground (in mm)
height.mm: Length of the longest branch (in mm)
total.branch.length: Total length of all the branches (in mm)
stemdiam.mm: Basal stem diameter (in mm)
maxSpineLength.mm: Length of the longest spine
postclipStemNo: Number of resprouting stems (only recorded AFTER clipping)
date.clipped: Date clipped
date.measured: Date measured
date.germinated: Date germinated
Age.of.plant: Date measured - Date germinated
newtreat: Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)

File 2: Herbivory_SurvivalEndofSeason_march2017.csv (required to run the R code: "FinalAnalysisResultsSurvival_requires_Herbivory_SurvivalEndofSeason_march2017.R")

The data consist of one .csv file with the following column names:

treatment: Clipping treatment (1 - 5 months clip plus control unclipped)
plot_rep: One of three randomised plots per treatment
matrix_no: Where in the plot the individual was placed
species_code: First three letters of the genus name, and first three letters of the species name uniquely identifies the species
species: Full species name
sample_period: Classification of sampling period into time since clip.
status: Alive or Dead
standing.height: Vertical height above ground (in mm)
height.mm: Length of the longest branch (in mm)
total.branch.length: Total length of all the branches (in mm)
stemdiam.mm: Basal stem diameter (in mm)
maxSpineLength.mm: Length of the longest spine
postclipStemNo: Number of resprouting stems (only recorded AFTER clipping)
date.clipped: Date clipped
date.measured: Date measured
date.germinated: Date germinated
Age.of.plant: Date measured - Date germinated
newtreat: Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)
genus: Genus
MAR: Mean Annual Rainfall for that Species distribution (mm)
rainclass: High/medium/low

File 3: allModelParameters_byAge.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R")

Consists of a .csv file with the following column headings

Age.of.plant: Age in days
species_code: Species
pred_SD_mm: Predicted stem diameter in mm
pred_SD_up: top 75th quantile of stem diameter in mm
pred_SD_low: bottom 25th quantile of stem diameter in mm
treatdate: date when clipped
pred_surv: Predicted survival probability
pred_surv_low: Predicted 25th quantile survival probability
pred_surv_high: Predicted 75th quantile survival probability
species_code: species code
Bite.probability: Daily probability of being eaten
max_bite_diam_duiker_mm: Maximum bite diameter of a duiker for this species
duiker_sd: standard deviation of bite diameter for a duiker for this species
max_bite_diameter_kudu_mm: Maximum bite diameter of a kudu for this species
kudu_sd: standard deviation of bite diameter for a kudu for this species
mean_bite_diam_duiker_mm: mean etc
duiker_mean_sd: standard deviation etc
mean_bite_diameter_kudu_mm: mean etc
kudu_mean_sd: standard deviation etc
genus: genus
rainclass: low/med/high

File 4: EatProbParameters_June2020.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R")

Consists of a .csv file with the following column headings

shtspec: species name
species_code: species code
genus: genus
rainclass: low/medium/high
seed mass: mass of seed (g per 1000 seeds)
Surv_intercept: coefficient of the model predicting survival from age of clip for this species
Surv_slope: coefficient of the model predicting survival from age of clip for this species
GR_intercept: coefficient of the model predicting stem diameter from seedling age for this species
GR_slope: coefficient of the model predicting stem diameter from seedling age for this species
species_code: species code
max_bite_diam_duiker_mm: Maximum bite diameter of a duiker for this species
duiker_sd: standard deviation of bite diameter for a duiker for this species
max_bite_diameter_kudu_mm: Maximum bite diameter of a kudu for this species
kudu_sd: standard deviation of bite diameter for a kudu for this species
mean_bite_diam_duiker_mm: mean etc
duiker_mean_sd: standard deviation etc
mean_bite_diameter_kudu_mm: mean etc
kudu_mean_sd: standard deviation etc
AgeAtEscape_duiker[t]: age of plant when its stem diameter is larger than a mean duiker bite
AgeAtEscape_duiker_min[t]: age of plant when its stem diameter is larger than a min duiker bite
AgeAtEscape_duiker_max[t]: age of plant when its stem diameter is larger than a max duiker bite
AgeAtEscape_kudu[t]: age of plant when its stem diameter is larger than a mean kudu bite
AgeAtEscape_kudu_min[t]: age of plant when its stem diameter is larger than a min kudu bite
AgeAtEscape_kudu_max[t]: age of plant when its stem diameter is larger than a max kudu bite
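To make the relationship between the growth coefficients and the AgeAtEscape columns concrete, here is a rough Python sketch. It assumes a linear growth model (stem diameter = GR_intercept + GR_slope × age); the description does not state the model form, so this is an illustrative assumption rather than the authors' method, and the function name and numeric inputs are hypothetical.

```python
def age_at_escape(gr_intercept: float, gr_slope: float, bite_diameter_mm: float) -> float:
    """Age at which the predicted stem diameter reaches a given bite diameter.

    ASSUMPTION: diameter = gr_intercept + gr_slope * age (linear form chosen
    for illustration only; the source does not specify the model).
    """
    if gr_slope <= 0:
        raise ValueError("growth slope must be positive for an escape age to exist")
    # Solve gr_intercept + gr_slope * age = bite_diameter_mm for age,
    # clamping at zero when the seedling starts out thicker than the bite.
    return max(0.0, (bite_diameter_mm - gr_intercept) / gr_slope)


# Hypothetical coefficients and bite diameter, not values from the dataset.
example_age = age_at_escape(gr_intercept=1.0, gr_slope=0.25, bite_diameter_mm=3.0)
print(example_age)
```

Under this assumption, the min, mean, and max escape ages would follow from plugging in the corresponding bite diameters.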
Dataset metadata of known Dataverse installations
search.dataone.org
dataverse.harvard.edu
Updated Nov 22, 2023
Cite
Gautier, Julian (2023). Dataset metadata of known Dataverse installations [Dataset]. http://doi.org/10.7910/DVN/DCDKZQ
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/DCDKZQ
Dataset updated
Nov 22, 2023
Dataset provided by
Harvard Dataverse
Authors
Gautier, Julian
Description
This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

How the metadata was downloaded

The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

How the files are organized

├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation).csv
│   ├── basic.csv
│   ├── contributor(citation).csv
│   ├── ...
│   └── topic_classification(citation).csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2022.10.02_17.11.19.zip
│   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
│   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
│   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
│   │   │   ├── ...
│   │   ├── metadatablocks_v5.6
│   │   │   ├── astrophysics_v5.6.json
│   │   │   ├── biomedical_v5.6.json
│   │   │   ├── citation_v5.6.json
│   │   │   ├── ...
│   │   │   ├── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
│   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
│   ├── Arca_Dados_2022.10.02_17.44.35.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
└── dataset_pids_from_most_known_dataverse_installations.csv
└── licenses_used_by_dataverse_installations.csv
└── metadatablocks_from_most_known_dataverse_installations.csv

This dataset contains two directories and three CSV files not in a directory.

One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned, versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories: The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions.
The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files. The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files. The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ... Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
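As a rough illustration of the two-column token file described above, the following Python sketch loads a CSV with "hostname" and "apikey" columns into a dictionary. This is not the author's script: the function name load_api_tokens, the example hostname, and the placeholder token are all hypothetical, and only the two column names come from the description.

```python
import csv
import io


def load_api_tokens(csv_file) -> dict[str, str]:
    """Map each installation hostname to its API token.

    Expects a header row with the columns "hostname" and "apikey",
    as named in the dataset description.
    """
    reader = csv.DictReader(csv_file)
    return {
        row["hostname"].strip(): row["apikey"].strip()
        for row in reader
        if row["hostname"].strip()  # skip rows with an empty hostname
    }


# Hypothetical file contents; the URL and token are placeholders.
example = io.StringIO(
    "hostname,apikey\n"
    "https://demo.example.org,xxxx-placeholder-token\n"
)
tokens = load_api_tokens(example)
print(tokens)
```

A lookup such as tokens.get(hostname) would then return None for installations that need no token.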
machine learning models on the WDBC dataset
scidb.cn
Updated Apr 15, 2025
Cite
Mahdi Aghaziarati (2025). machine learning models on the WDBC dataset [Dataset]. http://doi.org/10.57760/sciencedb.23537
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.23537
Dataset updated
Apr 15, 2025
Dataset provided by
Science Data Bank
Authors
Mahdi Aghaziarati
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension. All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling. The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability. All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. 
Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically-relevant dataset.
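The two preprocessing steps named in the description, z-score standardization and label encoding, can be sketched in plain Python as follows. This is not the accompanying breast_cancer_classification_models.py: the function names and the sample feature values are hypothetical, and the "M"/"B" label strings are an assumption about how the diagnosis column is coded.

```python
from statistics import mean, pstdev


def z_score(values: list[float]) -> list[float]:
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def encode_diagnosis(labels: list[str]) -> list[int]:
    """Encode malignant as 1 and benign as 0 (ASSUMES labels are 'M' and 'B')."""
    mapping = {"M": 1, "B": 0}
    return [mapping[label] for label in labels]


# Hypothetical feature values, not taken from the WDBC dataset.
radius = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = z_score(radius)
encoded = encode_diagnosis(["M", "B", "B", "M"])
print(standardized, encoded)
```

The population standard deviation is used here because that is the convention scikit-learn's StandardScaler follows by default.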
ESS-DIVE Reporting Format for File-level Metadata
dataone.org
knb.ecoinformatics.org
Updated May 4, 2023
Cite
Terri Velliquette; Jessica Welch; Michael Crow; Ranjeet Devarakonda; Susan Heinz; Robert Crystal-Ornelas (2023). ESS-DIVE Reporting Format for File-level Metadata [Dataset]. http://doi.org/10.15485/1734840
Explore at:
Unique identifier
https://doi.org/10.15485/1734840
Dataset updated
May 4, 2023
Dataset provided by
ESS-DIVE
Authors
Terri Velliquette; Jessica Welch; Michael Crow; Ranjeet Devarakonda; Susan Heinz; Robert Crystal-Ornelas
Time period covered
Jan 1, 2020 - Sep 30, 2021
Description
The ESS-DIVE reporting format for file-level metadata (FLMD) provides granular information at the data file level to describe the contents, scope, and structure of the data file to enable comparison of data files within a data package. The FLMD are fully consistent with and augment the metadata collected at the data package level. We developed the FLMD template based on a review of a small number of existing FLMD in use at other agencies and repositories with valuable input from the Environmental Systems Science (ESS) Community. Also included is a template for a CSV Data Dictionary where users can provide file-level information about the contents of a CSV data file (e.g., define column names, provide units). Files are in .csv, .xlsx, and .md. Templates are in both .csv and .xlsx (open with e.g. Microsoft Excel, LibreOffice, or Google Sheets). Open the .md files by downloading and using a text editor (e.g. Notepad or TextEdit). Though we provide Excel templates for the file-level metadata reporting format, our instructions encourage users to 'Save the FLMD template as a CSV following the CSV Reporting Format guidance'. In addition, we developed the ESS-DIVE File Level Metadata Extractor which is a lightweight python script that can extract some FLMD fields following the recommended FLMD format and structure.
Human Resources.csv
figshare.com
csv
Updated Apr 11, 2025
Cite
anurag pardiash (2025). Human Resources.csv [Dataset]. http://doi.org/10.6084/m9.figshare.28780886.v1
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.6084/m9.figshare.28780886.v1
Dataset updated
Apr 11, 2025
Dataset provided by
figshare
Authors
anurag pardiash
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset, titled Human Resources.csv, contains anonymized employee data collected for internal HR analysis and research purposes. It includes fields such as employee ID, department, gender, age, job role, and employment status. The data can be used for workforce trend analysis, HR benchmarking, diversity studies, and training models in human resource analytics. The file is provided in CSV format (3.05 MB) and adheres to general data privacy standards, with no personally identifiable information (PII). Last updated: April 11, 2025. Uploaded by Anurag Pardiash.
ISO 639-1 Language Codes
kaggle.com
Updated May 7, 2023
Cite
Mahesh Jadhav (2023). ISO 639-1 Language Codes [Dataset]. https://www.kaggle.com/datasets/ursmaheshj/iso-639-1-language-codes
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 7, 2023
Dataset provided by
Kaggle
Authors
Mahesh Jadhav
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
ISO 639-1 is a standard for language codes that assigns a two-letter code to represent a language. This code is used to identify languages in computer systems, websites, and other applications that require language tagging. The codes are based on the English names of languages, and each language is assigned a unique code that consists of two letters. For example, "en" represents English, "fr" represents French, and "es" represents Spanish. ISO 639-1 language codes are commonly used in multilingual environments to ensure consistent and accurate representation of languages across different systems and platforms.

You can easily use this dataset to combine and replace the ISO 639-1 codes in your dataset with appropriate language names.
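A minimal Python sketch of that replacement step follows. The column names "code" and "language" and the sample records are assumptions made for illustration; the actual headers in the Kaggle file may differ.

```python
import csv
import io

# Hypothetical excerpt of the lookup table; the real column names may differ.
codes_csv = io.StringIO(
    "code,language\n"
    "en,English\n"
    "fr,French\n"
    "es,Spanish\n"
)
code_to_name = {row["code"]: row["language"] for row in csv.DictReader(codes_csv)}

# Hypothetical records carrying ISO 639-1 codes.
records = [
    {"text": "hello", "lang": "en"},
    {"text": "bonjour", "lang": "fr"},
    {"text": "?", "lang": "xx"},
]

# Add the language name, falling back to the raw code when it is unknown.
for record in records:
    record["language_name"] = code_to_name.get(record["lang"], record["lang"])

print([record["language_name"] for record in records])
```

Falling back to the original code keeps unmatched rows visible instead of silently dropping them.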

This data was collected from the sites below:

https://www.iso.org/iso-639-language-codes.html

https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

https://www.loc.gov/standards/iso639-2/php/code_list.php

https://lingohub.com/academy/best-practices/iso-639-1-list

https://localizely.com/iso-639-1-list
CSV file - Tsunami activity - Structure Locations - Geo 1.8
resources-gisinschools-nz.hub.arcgis.com
Updated Jan 15, 2016
Cite
GIS in Schools - Teaching Materials - New Zealand (2016). CSV file - Tsunami activity - Structure Locations - Geo 1.8 [Dataset]. https://resources-gisinschools-nz.hub.arcgis.com/datasets/6f238744d0c044a6b1d6dc4bfc229e8e
Explore at:
Dataset updated
Jan 15, 2016
Dataset authored and provided by
GIS in Schools - Teaching Materials - New Zealand
Description
Tauranga proposed vertical evacuation structures CSV file. For use with the Tsunami assessment activity. Achievement Standard 91014
Types, open citations, closed citations, publishers, and participation...
data.niaid.nih.gov
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hiebi, Ivan (2020). Types, open citations, closed citations, publishers, and participation reports of Crossref entities [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2558257
Explore at:
Dataset updated
Jan 24, 2020
Dataset provided by
Hiebi, Ivan
Peroni, Silvio
Shotton, David
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This publication contains several datasets that have been used in the paper "Crowdsourcing open citations with CROCI – An analysis of the current status of open citations, and a proposal" submitted to the 17th International Conference on Scientometrics and Bibliometrics (ISSI 2019), available at https://opencitations.wordpress.com/2019/02/07/crowdsourcing-open-citations-with-croci/.

Additional information about the analyses described in the paper, including the code and the data we have used to compute all the figures, is available as a Jupyter notebook at https://github.com/sosgang/pushing-open-citations-issi2019/blob/master/script/croci_nb.ipynb. The datasets contain the following information.

non_open.zip: it is a zipped (~5 GB unzipped) CSV file containing the numbers of open citations and closed citations received by the entities in the Crossref dump used in our computation, dated October 2018. All the entity types retrieved from Crossref were aligned to one of the following five categories: journal, book, proceedings, dataset, other. The open CC0 citation data we used came from the CSV dump of the most recent release of COCI, dated 12 November 2018. The number of closed citations was calculated by subtracting the number of open citations to each entity available within COCI from the value “is-referenced-by-count” available in the Crossref metadata for that particular cited entity, which reports all the DOI-to-DOI citation links that point to the cited entity from within the whole Crossref database (including those present in the Crossref ‘closed’ dataset).

The columns of the CSV file are the following ones:

doi: the DOI of the publication in Crossref;

type: the type of the publication as indicated in Crossref;

cited_by: the number of open citations received by the publication according to COCI;

non_open: the number of closed citations received by the publication according to Crossref + COCI.
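The subtraction that produces the non_open column can be sketched in a few lines of Python. The function name, the DOI, and the counts below are hypothetical; only the column names (doi, type, cited_by, non_open) and the rule itself come from the description.

```python
def closed_citations(is_referenced_by_count: int, open_citations: int) -> int:
    """Closed citations = Crossref "is-referenced-by-count" minus COCI open citations."""
    return is_referenced_by_count - open_citations


# Hypothetical entity; the DOI and the counts are placeholders.
row = {
    "doi": "10.1234/example",
    "type": "journal",
    "cited_by": 12,  # open citations according to COCI
}
row["non_open"] = closed_citations(is_referenced_by_count=30, open_citations=row["cited_by"])
print(row)
```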

croci_types.csv: it is a CSV file that contains the numbers of open citations and closed citations received by the entities in the Crossref dump used in our computation, as collected in the previous CSV file, aligned in five classes depending on the entity types retrieved from Crossref: journal (Crossref types: journal-article, journal-issue, journal-volume, journal), book (Crossref types: book, book-chapter, book-section, monograph, book track, book-part, book-set, reference-book, dissertation, book series, edited book), proceedings (Crossref types: proceedings-article, proceedings, proceedings-series), dataset (Crossref types: dataset), other (Crossref types: other, report, peer review, reference-entry, component, report-series, standard, posted-content, standard-series).

The columns of the CSV file are the following ones:

type: the publication type, one of "journal", "book", "proceedings", "dataset", "other";

label: the label assigned to the type for visualisation purposes;

coci_open_cit: the number of open citations received by the publication type according to COCI;

crossref_close_cit: the number of closed citations received by the publication according to Crossref + COCI.

publishers_cits.csv: it is a CSV file that contains the top twenty publishers that received the greatest number of open citations. The columns of the CSV file are the following ones:

publisher: the name of the publisher;

doi_prefix: the list of DOI prefixes used by the publisher;

coci_open_cit: the number of open citations received by the publications of the publisher according to COCI;

crossref_close_cit: the number of closed citations received by the publications of the publishers according to Crossref + COCI;

total_cit: the total number of citations received by the publications of the publisher (= coci_open_cit + crossref_close_cit).

20publishers_cr.csv: it is a CSV file that contains the numbers of the contributions to open citations made by the twenty publishers introduced in the previous CSV file as of 24 January 2018, according to the data available through the Crossref API. The counts listed in this file refer to the number of publications for which each publisher has submitted metadata to Crossref that include the publication’s reference list. The categories 'closed', 'limited' and 'open' refer to publications for which the reference lists are not visible to anyone outside the Crossref Cited-by membership, are visible only to them and to Crossref Metadata Plus members, or are visible to all, respectively. In addition, the file also records the total number of publications for which the publisher has submitted metadata to Crossref, whether or not those metadata include the reference lists of those publications.

The columns of the CSV file are the following ones:

publisher: the name of the publisher;

open: the number of publications in Crossref with an 'open' visibility for their reference lists;

limited: the number of publications in Crossref with a 'limited' visibility for their reference lists;

closed: the number of publications in Crossref with a 'closed' visibility for their reference lists;

overall_deposited: the overall number of publications for which the publisher has submitted metadata to Crossref.
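One way to read these columns is to compute, per publisher, the share of deposited reference lists that are open. This is a sketch under the assumption that open + limited + closed counts every publication deposited with a reference list; the sample rows are invented and not taken from 20publishers_cr.csv.

```python
import csv
import io

# Hypothetical sample in the documented column layout (values are made up).
SAMPLE = """publisher,open,limited,closed,overall_deposited
Publisher A,600,300,100,2000
Publisher B,0,0,0,500
"""


def open_share(row):
    """Fraction of deposited reference lists that are open, or None if none exist."""
    with_refs = int(row["open"]) + int(row["limited"]) + int(row["closed"])
    if with_refs == 0:
        return None
    return int(row["open"]) / with_refs


shares = {
    row["publisher"]: open_share(row)
    for row in csv.DictReader(io.StringIO(SAMPLE))
}
print(shares)
```

Returning None for publishers without any deposited reference list avoids a division by zero and keeps "no data" distinct from "0% open".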
Data from: "A guide to using GitHub for developing and versioning data...
knb.ecoinformatics.org
dataone.org
Updated May 4, 2023
Cite
Robert Crystal-Ornelas; Charuleka Varadharajan; Ben Bond-Lamberty; Kristin Boye; Shreyas Cholia; Michael Crow; Ranjeet Devarakonda; Kim S. Ely; Amy Goldman; Susan Heinz; Valerie Hendrix; Joan Damerow; Stephanie Pennington; Madison Burrus; Zarine Kakalia; Emily Robles; Maegen Simmonds; Alistair Rogers; Terri Velliquette; Helen Weierbach; Pamela Weisenhorn; Jessica N. Welch; Deborah A. Agarwal (2023). Data from: "A guide to using GitHub for developing and versioning data standards and reporting formats" [Dataset]. http://doi.org/10.15485/1780565
Explore at:
Unique identifier
https://doi.org/10.15485/1780565
Dataset updated
May 4, 2023
Dataset provided by
ESS-DIVE
Authors
Robert Crystal-Ornelas; Charuleka Varadharajan; Ben Bond-Lamberty; Kristin Boye; Shreyas Cholia; Michael Crow; Ranjeet Devarakonda; Kim S. Ely; Amy Goldman; Susan Heinz; Valerie Hendrix; Joan Damerow; Stephanie Pennington; Madison Burrus; Zarine Kakalia; Emily Robles; Maegen Simmonds; Alistair Rogers; Terri Velliquette; Helen Weierbach; Pamela Weisenhorn; Jessica N. Welch; Deborah A. Agarwal
Time period covered
Sep 1, 2020 - Dec 3, 2020
Description
These data are the results of a systematic review that investigated how data standards and reporting formats are documented on the version control platform GitHub. Our systematic review identified 32 data standards in earth science, environmental science, and ecology that use GitHub for version control of data standard documents. In our analysis, we characterized the documents and content within each of the 32 GitHub repositories to identify common practices for groups that version control their documents on GitHub. In this data package, there are 8 CSV files that contain data that we characterized from each repository, according to the location within the repository. For example, in 'readme_pages.csv' we characterize the content that appears across the 32 GitHub repositories included in our systematic review. Each of the 8 CSV files has an associated data dictionary file (names appended with '_dd.csv'), in which we describe each content category within the CSV files. There is one file-level metadata file (flmd.csv) that provides a description of each file within the data package.
Data articles in journals
zenodo.org
csv, txt, xls
Updated May 30, 2025
Cite
Carlota Balsa-Sanchez; Vanesa Loureiro (2025). Data articles in journals [Dataset]. http://doi.org/10.5281/zenodo.15553313
Explore at:
Available download formats: txt, csv, xls
Unique identifier
https://doi.org/10.5281/zenodo.15553313
Dataset updated
May 30, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Carlota Balsa-Sanchez; Vanesa Loureiro
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
2025
Description
Version: 6

Date of data collection: May 2025

General description: Publication of datasets according to the FAIR principles can be achieved by publishing a data paper (and/or a software paper) in data journals as well as in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
File list:

- data_articles_journal_list_v6.xlsx: full list of 177 academic journals in which data papers or/and software papers could be published
- data_articles_journal_list_v6.csv: full list of 177 academic journals in which data papers or/and software papers could be published
- readme_v6.txt: detailed description of the dataset and its variables

Relationship between files: both files have the same information. Two different formats are offered to improve reuse

Type of version of the dataset: final processed version

Versions of the files: 6th version
- Information updated: number of journals (17 were added and 4 were deleted), URL, document types associated to a specific journal.
- Information added: diamond journals were identified.

Version: 5

Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

Date of data collection: 2023/09/05

General description: Publication of datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
File list:

- data_articles_journal_list_v5.xlsx: full list of 162 academic journals in which data papers or/and software papers could be published
- data_articles_journal_list_v5.csv: full list of 162 academic journals in which data papers or/and software papers could be published

Relationship between files: both files have the same information. Two different formats are offered to improve reuse

Type of version of the dataset: final processed version

Versions of the files: 5th version
- Information updated: number of journals, URL, document types associated to a specific journal.
163 journals (Excel and CSV)

Version: 4

Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

Date of data collection: 2022/12/15

General description: Publication of datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
File list:

- data_articles_journal_list_v4.xlsx: full list of 140 academic journals in which data papers or/and software papers could be published
- data_articles_journal_list_v4.csv: full list of 140 academic journals in which data papers or/and software papers could be published

Relationship between files: both files have the same information. Two different formats are offered to improve reuse

Type of version of the dataset: final processed version

Versions of the files: 4th version
- Information updated: number of journals, URL, document types associated to a specific journal, publishers normalization and simplification of document types
- Information added : listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR), Scopus and Web of Science (WOS), Journal Master List.

Version: 3

Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

Date of data collection: 2022/10/28

General description: Publication of datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
File list:

- data_articles_journal_list_v3.xlsx: full list of 124 academic journals in which data papers or/and software papers could be published
- data_articles_journal_list_3.csv: full list of 124 academic journals in which data papers or/and software papers could be published

Relationship between files: both files have the same information. Two different formats are offered to improve reuse

Type of version of the dataset: final processed version

Versions of the files: 3rd version
- Information updated: number of journals, URL, document types associated to a specific journal, publishers normalization and simplification of document types
- Information added : listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR).

Erratum - Data articles in journals Version 3:

Botanical Studies -- ISSN 1999-3110 -- JCR (JIF) Q2
Data -- ISSN 2306-5729 -- JCR (JIF) n/a
Data in Brief -- ISSN 2352-3409 -- JCR (JIF) n/a

Version: 2

Author: Francisco Rubio, Universitat Politècnica de València.

Date of data collection: 2020/06/23

General description: Publication of datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
File list:

- data_articles_journal_list_v2.xlsx: full list of 56 academic journals in which data papers or/and software papers could be published
- data_articles_journal_list_v2.csv: full list of 56 academic journals in which data papers or/and software papers could be published

Relationship between files: both files have the same information. Two different formats are offered to improve reuse

Type of version of the dataset: final processed version

Versions of the files: 2nd version
- Information updated: number of journals, URL, document types associated to a specific journal, publishers normalization and simplification of document types
- Information added : listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Scimago Journal and Country Rank (SJR)

Total size: 32 KB

Version 1: Description

This dataset contains a list of journals that publish data articles, code, software articles and database articles.

The search strategy in DOAJ and Ulrichsweb was the search for the word data in the title of the journals.
Acknowledgements:
Xaquín Lores Torres for his invaluable help in preparing this dataset.
UNI-CEN Standardized Census Data Table - Census Subdivision (CSD) - 2021 -...
search.dataone.org
borealisdata.ca
Updated Dec 28, 2023
Cite
UNI-CEN Project (2023). UNI-CEN Standardized Census Data Table - Census Subdivision (CSD) - 2021 - Wide Format (CSV) (Version 2023-03) [Dataset]. http://doi.org/10.5683/SP3/1QYMNB
Explore at:
Unique identifier
https://doi.org/10.5683/SP3/1QYMNB
Dataset updated
Dec 28, 2023
Dataset provided by
Borealis
Authors
UNI-CEN Project
Time period covered
Jan 1, 2021
Description
UNI-CEN Standardized Census Data Tables contain Census data that have been reformatted into a common table format with standardized variable names and codes. The data are provided in two tabular formats for different use cases. "Long" tables are suitable for use in statistical environments, while "wide" tables are commonly used in GIS environments. The long tables are provided in Stata Binary (dta) format, which is readable by all statistics software. The wide tables are provided in comma-separated values (csv) and dBase 3 (dbf) formats with codebooks. The wide tables are easily joined to the UNI-CEN Digital Boundary Files. For the csv files, a .csvt file is provided to ensure that column data formats are correctly formatted when importing into QGIS. A schema.ini file does the same when importing into ArcGIS environments. As the DBF file format supports a maximum of 250 columns, tables with a larger number of variables are divided into multiple DBF files. For more information about file sources, the methods used to create them, and how to use them, consult the documentation at https://borealisdata.ca/dataverse/unicen_docs. For more information about the project, visit https://observatory.uwo.ca/unicen.
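The 250-column limit mentioned above implies that a wide table has to be partitioned by column before it can be written as DBF. The sketch below illustrates one such partition. It assumes that every part repeats a key column so the parts can be joined back together; that convention, the function name split_columns, and the column names geo_id and var_N are hypothetical and not described by UNI-CEN.

```python
def split_columns(columns, key, limit=250):
    """Split column names into groups of at most `limit`, each starting with `key`."""
    others = [c for c in columns if c != key]
    step = limit - 1  # one slot per group is reserved for the key column
    return [[key] + others[i:i + step] for i in range(0, len(others), step)]


# Hypothetical wide table: one key column plus 600 variables.
columns = ["geo_id"] + [f"var_{i}" for i in range(600)]
groups = split_columns(columns, key="geo_id")
print([len(g) for g in groups])
```

Repeating the key in every group is what makes each part joinable to a boundary file on its own.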
Well Completion Reports from the California Department of Water Resource
redivis.com
application/jsonl +7
Updated Aug 8, 2022
Cite
Environmental Impact Data Collaborative (2022). Well Completion Reports from the California Department of Water Resource [Dataset]. https://redivis.com/datasets/h89s-2rga9nge3
Explore at:
Available download formats: spss, csv, sas, parquet, stata, avro, arrow, application/jsonl
Dataset updated
Aug 8, 2022
Dataset provided by
Redivis Inc.
Authors
Environmental Impact Data Collaborative
Time period covered
Jul 4, 1776 - Aug 11, 8202
Area covered
California
Description
Methodology

This Well Completion Report dataset represents an index of records from the California Department of Water Resources' (DWR) Online System for Well Completion Reports (OSWCR). This dataset is for informational purposes only. All attribute values should be verified by reviewing the original Well Completion Report. Known issues include:

- Missing and duplicate records

- Missing values (either missing on original Well Completion Report, or not key entered into database)

- Incorrect values (e.g. incorrect Latitude, Longitude, Record Type, Planned Use, Total Completed Depth)

- Limited spatial resolution: The majority of well completion reports have been spatially registered to the center of the 1x1 mile Public Land Survey System section that the well is located in.

Usage

Date data was updated: 7/5/2022

OSWCR.csv: Records from the California Department of Water Resources' Online System of Well Completion Reports.

OSWCR_DataDictionary.csv: Data dictionary for OSWCR.csv

WCRLinks.csv: Table of links to Well Completion Report PDFs. This table is related to OSWCR via the WCRNumber field.

WCRLinks_DataDictionary.csv: Data dictionary for WCRLinks.csv

WellNumbers.csv: Table of state and local well numbers that are stored in OSWCR

GeologicLog_FreeForm.csv: OSWCR provides three different methods to enter lithologic information. The "Free Form" method allows users to enter any material description they wish for each depth interval.

GeologicLog_QuickPick.csv: OSWCR provides three different methods to enter lithologic information. The "Quick Pick" method provides a set of standard values for material type, color, and texture, but also allows the user to enter any additional descriptive information.

GeologicLog_USCS.csv: OSWCR provides three different methods to enter lithologic information. The "USCS/ASTM D2488" method provides standard values for soil classification, along with soil color and other descriptive information.

GeologicLog_GeneralizedLithology: This table provides generalized lithology descriptions and texture classifications that have been interpreted from well completion reports by various programs. This table is under development.

CasingData.csv: Casing data that have been entered into OSWCR including the weld type, casing material type, and casing specifications.

AnnularMaterial.csv: Annular fill data that have been entered into OSWCR including the Fill Type and Fill Type Details.

BoreholeInformation.csv: Diameter of the boreholes and the depth range over which that diameter applies

Submittal_Template_For_Suggested_Corrections.xlsx: Template for submitting suggested Well Completion Report database corrections.
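Because WCRLinks.csv is related to OSWCR.csv through the WCRNumber field, the two tables can be combined with a simple keyed lookup. This is a minimal sketch: only WCRNumber comes from the description above, while the PlannedUse and Link column names, the sample values, and the example.org URLs are placeholders. The data dictionaries should be consulted for the real column names.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpts; only the WCRNumber field name is documented.
OSWCR = """WCRNumber,PlannedUse
WCR0001,Domestic
WCR0002,Irrigation
"""
WCRLINKS = """WCRNumber,Link
WCR0001,https://example.org/WCR0001.pdf
WCR0001,https://example.org/WCR0001_b.pdf
"""


def links_by_record(oswcr_handle, links_handle):
    """Map every WCRNumber in OSWCR to the list of PDF links found for it."""
    links = defaultdict(list)
    for row in csv.DictReader(links_handle):
        links[row["WCRNumber"]].append(row["Link"])
    return {
        row["WCRNumber"]: links.get(row["WCRNumber"], [])
        for row in csv.DictReader(oswcr_handle)
    }


joined = links_by_record(io.StringIO(OSWCR), io.StringIO(WCRLINKS))
print(joined)
```

Records without any link keep an empty list, which makes missing PDFs easy to spot.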

Well owner information has been redacted from the attached tables and from the PDFs that these tables link to.

The attached data and data structure are subject to change, and are for informational purposes only.

All values listed in these tables should be verified by reviewing the original Well Completion Report.

Known issues include:

- Missing values (either missing on original Well Completion Report, or not key entered into database)

- Incorrect values (e.g. wrong Latitude, Longitude, Record Type, Planned Use, Total Completed Depth)

The California Department of Water Resources welcomes feedback to improve the accuracy of the Well Completion Report database.

Please email suggested corrections to benjamin.brezing@water.ca.gov using the attached Submittal_Template_For_Suggested_Corrections.xlsx file.
Ransomware and user samples for training and validating ML models
data.mendeley.com
Updated Sep 17, 2021
Cite
Eduardo Berrueta (2021). Ransomware and user samples for training and validating ML models [Dataset]. http://doi.org/10.17632/yhg5wk39kf.2
Explore at:
Unique identifier
https://doi.org/10.17632/yhg5wk39kf.2
Dataset updated
Sep 17, 2021
Authors
Eduardo Berrueta
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ransomware has been considered a significant threat to most enterprises for the past few years. In scenarios wherein users can access all files on a shared server, one infected host is capable of locking the access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2500 h of ‘not infected’ traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). This paper provides a validation of the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.

This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each one in a separate folder.

The folders are named NxSy, where x is the number of 1-second intervals per sample and y is the sliding step in seconds.

Each folder (for example N10S10/) contains:

- tree.py -> Python script with the Tree model.
- ensemble.json -> JSON file with the information about the Ensemble model.
- NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (1, 2 or 3).
- N10S10.csv -> All samples used for training each model in this folder. It is in CSV format for use in the BigML application.
- zeroDays.csv -> All zero-day samples used for testing each model in this folder. It is in CSV format for use in the BigML application.
- userSamples_test -> All samples used for validating each model in this folder. It is in CSV format for use in the BigML application.
- userSamples_train -> User samples used for training the models.
- ransomware_train -> Ransomware samples used for training the models.
- scaler.scaler -> Standard Scaler from a Python library, used to scale the samples.
- zeroDays_notFiltered -> Folder with the zero-day samples.

In the case of N30S30 folder, there is an additional folder (SMBv2SMBv3NFS) with the samples extracted from the SMBv2, SMBv3 and NFS traffic traces. There are more binaries than the ones presented in the article, but it is because some of them are not "unseen" binaries (the families are present in the training set).

The files containing samples (NxSy.csv, zeroDays.csv and userSamples_test.csv) are structured as follows:

- Each line is one sample.
- Each sample has 3*T features and the label (1 if it is an 'infected' sample and 0 if it is not).
- The features are separated by ',' because it is a CSV file.
- The last column is the label of the sample.
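The folder naming scheme and the sample layout described above can be read with a few lines of code. This is a sketch only: it assumes the sample files have no header row and that all feature values are numeric, neither of which is stated in the description. The helper names parse_folder and split_sample and the example line are invented for illustration.

```python
import re


def parse_folder(name):
    """Return (intervals_per_sample, sliding_step_seconds) from a name like 'N10S10'."""
    match = re.fullmatch(r"N(\d+)S(\d+)", name)
    if match is None:
        raise ValueError(f"not an NxSy folder name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def split_sample(line):
    """Split one CSV line into (features, label); the label is the last column."""
    values = line.strip().split(",")
    features = [float(v) for v in values[:-1]]
    label = int(float(values[-1]))  # 1 = 'infected', 0 = 'not infected'
    if len(features) % 3 != 0:
        raise ValueError("expected 3*T features per sample")
    return features, label


# Made-up line with 6 features (T = 2) followed by the label.
features, label = split_sample("1.0,0.0,3.5,2.0,0.0,1.5,1")
print(parse_folder("N10S10"), len(features), label)
```

The check on the feature count only verifies divisibility by 3, since the description does not say how T relates to the x in the folder name.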

Additionally, we have placed two pcap files in the root directory. These are the traces used to compare both versions of SMB.
Students Test Data
kaggle.com
Updated Sep 12, 2023
Cite
ATHARV BHARASKAR (2023). Students Test Data [Dataset]. https://www.kaggle.com/datasets/atharvbharaskar/students-test-data
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Sep 12, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
ATHARV BHARASKAR
License
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Description
Dataset Overview: This dataset pertains to the examination results of students who participated in a series of academic assessments at a fictitious educational institution named "University of Exampleville." The assessments were administered across various courses and academic levels, with a focus on evaluating students' performance in general management and domain-specific topics.

Columns: The dataset comprises 12 columns, each representing specific attributes and performance indicators of the students. These columns encompass information such as the students' names (which have been anonymized), their respective universities, academic program names (including BBA and MBA), specializations, the semester of the assessment, the type of examination domain (general management or domain-specific), general management scores (out of 50), domain-specific scores (out of 50), total scores (out of 100), student ranks, and percentiles.

Data Collection: The examination data was collected during a standardized assessment process conducted by the University of Exampleville. The exams were designed to assess students' knowledge and skills in general management and their chosen domain-specific subjects. It involved students from both BBA and MBA programs who were in their final year of study.

Data Format: The dataset is available in a structured format, typically as a CSV file. Each row represents a unique student's performance in the examination, while columns contain specific information about their results and academic details.

Data Usage: This dataset is valuable for analyzing and gaining insights into the academic performance of students pursuing BBA and MBA degrees. It can be used for various purposes, including statistical analysis, performance trend identification, program assessment, and comparison of scores across domains and specializations. Furthermore, it can be employed in predictive modeling or decision-making related to curriculum development and student support.

Data Quality: The dataset has undergone preprocessing and anonymization to protect the privacy of individual students. Nevertheless, it is essential to use the data responsibly and in compliance with relevant data protection regulations when conducting any analysis or research.

Data Format: The exam data is typically provided in a structured format, commonly as a CSV (Comma-Separated Values) file. Each row in the dataset represents a unique student's examination performance, and each column contains specific attributes and scores related to the examination. The CSV format allows for easy import and analysis using various data analysis tools and programming languages like Python, R, or spreadsheet software like Microsoft Excel.

Here's a column-wise description of the dataset:

Name OF THE STUDENT: The full name of the student who took the exam. (Anonymized)

UNIVERSITY: The university where the student is enrolled.

PROGRAM NAME: The name of the academic program in which the student is enrolled (BBA or MBA).

Specialization: If applicable, the specific area of specialization or major that the student has chosen within their program.

Semester: The semester or academic term in which the student took the exam.

Domain: Indicates whether the exam was divided into two parts: general management and domain-specific.

GENERAL MANAGEMENT SCORE (OUT of 50): The score obtained by the student in the general management part of the exam, out of a maximum possible score of 50.

Domain-Specific Score (Out of 50): The score obtained by the student in the domain-specific part of the exam, also out of a maximum possible score of 50.

TOTAL SCORE (OUT of 100): The total score obtained by adding the scores from the general management and domain-specific parts, out of a maximum possible score of 100.
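Since the total score is defined as the sum of the two part scores, that relationship can serve as a quick consistency check on the file. This is a minimal sketch, assuming the CSV header uses exactly the column names shown above; the two sample rows are invented.

```python
import csv
import io

GM = "GENERAL MANAGEMENT SCORE (OUT of 50)"
DS = "Domain-Specific Score (Out of 50)"
TOTAL = "TOTAL SCORE (OUT of 100)"

# Hypothetical rows; the second one is deliberately inconsistent.
SAMPLE = f"""Name OF THE STUDENT,{GM},{DS},{TOTAL}
Student 1,40,35,75
Student 2,22,30,50
"""


def total_mismatches(handle):
    """Return the names of rows whose total is not the sum of both part scores."""
    bad = []
    for row in csv.DictReader(handle):
        expected = float(row[GM]) + float(row[DS])
        if abs(expected - float(row[TOTAL])) > 1e-9:
            bad.append(row["Name OF THE STUDENT"])
    return bad


flagged = total_mismatches(io.StringIO(SAMPLE))
print(flagged)
```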
Données de réplication pour : Towards the improvement of thermodynamic...
entrepot.recherche.data.gouv.fr
Updated Dec 4, 2024
Cite
Pierre Llompart; Claire Minoletti; Shamkhal Baybekov; Dragos Horvath; Gilles Marcou; Alexandre Varnek (2024). Données de réplication pour : Towards the improvement of thermodynamic solubility prediction – a review [Dataset]. http://doi.org/10.57745/CZVZIA
Explore at:
Available download formats: tsv(100787), tsv(205089), application/x-ipynb+json(72137), tsv(519218), tsv(1009535), txt(3373), tsv(993073), txt(2104), tsv(2857901), tsv(3164076), tsv(786411), tsv(5527968), tsv(12558825), txt(1257), application/x-ipynb+json(792128), tsv(1135906), tsv(2513344)
Unique identifier
https://doi.org/10.57745/CZVZIA
Dataset updated
Dec 4, 2024
Dataset provided by
Recherche Data Gouv
Authors
Pierre Llompart; Claire Minoletti; Shamkhal Baybekov; Dragos Horvath; Gilles Marcou; Alexandre Varnek
License
https://spdx.org/licenses/etalab-2.0.html
Dataset funded by
ANRT Cifre
Description
Evaluating thermodynamic solubility is crucial to design successful drug candidates. Yet, predicting it with in silico approaches remains a challenge. Machine learning methods are used to develop regression models leveraged on molecular descriptors. Recently, powerful solubility predictive models have been published using feature- and graph-based neural networks. These models often display attractive performances, yet their reliability may be deceiving when used for prospective prediction. This review investigates the origins of these discrepancies, following three directions: a historical perspective, an analysis of the structure of the aqueous solubility dataverse, and data quality. We demonstrate that new models are not ready for public usage because they lack a well-defined applicability domain and they overlook some historical data sources. On the basis of a carefully reviewed dataset, we are able to illustrate the influence of data quality on model predictivity. We comprehensively investigated over 20 years of published solubility datasets and models, highlighting overlooked and interconnected datasets. We benchmarked recently published models on a Sanofi dataset, as an example of pharmaceutical context, and they performed poorly. We observed the impact of factors influencing the performances of the models: interlaboratory standard deviation, ionic state of the solute and source of the solubility data. As a consequence, we draw a general workflow to curate aqueous solubility data with the aim of producing predictive models. Our results show how data quality and applicability domain of public models have an impact on their utility in a real context in pharmaceutical industry. We found that some data sources may appear less reliable than initially expected, for instance the eChem dataset. This exhaustive aqueous solubility data analysis led to the development of a curation workflow; the resulting models and datasets are publicly available.
Data are available as CSV files.

File AqSolDBc.csv and AqSolDB_Enriched

AqSolDBc is the final curated dataset after filtering of the AqSolDB_Enriched dataset. AqSolDBc is the curated data from the AqSolDB. The available columns are:

- Source: If in AqSolDBc, the value is "AqSolDBc"
- ID: Compound ID (string)
- Name: Name of the compound (string)
- InChI: InChI code of the chemical structure (string)
- InChIKey: InChI hash code of the chemical structure (string)
- ExperimentalLogS: Mole/L logarithm in decimal basis of the thermodynamic solubility in water at pH 7 (+/-1) at ~300K (float)
- SMILEScurated: Curated SMILES code of the chemical structure (string)
- SD: Standard laboratory Deviation, default value: -1 (float)
- Group: Data quality label imported from AqSolDB (string)
- Dataset: Source of the data point (string)
- Composition: Purity of the substance: mono-constituent, multi-constituent, UVCB (Categorical)
- Origin: Either organic, organometallic, NaN (Categorical)
- HasError: Yes or No, see ErrorType for details (boolean)
- ErrorType: Identifier error on the data point, default value: None (String)
- AtomCount: Number of atoms (integer)
- AlertAtoms: True if the molecule contains one of the AlertAtom - see curation (boolean)
- DuplicateGroup: ID used to regroup duplicate structures (integer)
- DuplicateSD: Standard deviation of the measurements for each unique structure, based on duplicate observations (double)
- DuplicateOccurrence: Number of measurement for each unique structure (integer)
- SD: Experimental standard deviation as given in the original AqSolDB (double)

File AqSolDB.csv

Original data from the AqSolDB. The available columns are:

- ID: Compound ID (string)
- Name: Name of the compound (string)
- SMILES: Original SMILES code of the chemical structure (string)
- SmilesCurated: Curated SMILES code of the chemical structure (string)
- InChI: InChI code of the chemical structure (string)
- InChIKey: InChI hash code of the chemical structure (string)
- Composition: Purity of the substance: mono-constituent, multi-constituent, UVCB (Categorical)
- Origin: Either organic, organometallic, NaN (Categorical)
- Dataset: Source of the data point (string)
- HasError: Yes or No, see ErrorType for details (boolean)
- ErrorType: Identifier error on the data point, default value: None (String)
- AtomCount: Number of atoms (integer)
- AlertAtoms: True if the molecule contains one of the AlertAtom - see curation (boolean)
- SD: Experimental standard deviation as given in the original AqSolDB (double)

File AqSolDB_Enriched_for_AqSolDBc.csv

An extended version of AqSolDB_Enriched supplemented with molecular descriptors. Available columns:

- ID: Compound ID (string)
- Name: Name of the compound (string)
- InChI: InChI code of the chemical structure (string)
- InChIKey: InChI hash code of the chemical structure (string)
- Solubility: Mole/L logarithm in decimal basis of the thermodynamic solubility in water at pH 7 (+/-1) at ~300K (float)
- SMILES: Original SMILES code of the chemical structure (string)
- SD: Standard laboratory Deviation, default value: -1 (float)
- ...
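The SD column uses -1 as a default value and HasError flags problematic data points, so a reader will usually want to handle both before analysis. This is a sketch under stated assumptions: it treats -1 as "no standard deviation reported", assumes HasError is stored as the text Yes or No, and uses an invented comma-separated excerpt with only four of the documented columns. The delimiter may need to change for the tab-separated downloads.

```python
import csv
import io

# Hypothetical excerpt with a subset of the documented columns (values are made up).
SAMPLE = """ID,ExperimentalLogS,SD,HasError
A-1,-2.50,0.30,No
A-2,-4.10,-1,No
A-3,-1.20,0.10,Yes
"""


def load_clean(handle, delimiter=","):
    """Drop rows flagged with HasError and turn the SD default of -1 into None."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        if row["HasError"].strip().lower() == "yes":
            continue
        sd = float(row["SD"])
        rows.append({
            "id": row["ID"],
            "log_s": float(row["ExperimentalLogS"]),
            "sd": None if sd == -1 else sd,
        })
    return rows


clean = load_clean(io.StringIO(SAMPLE))
print(clean)
```

Mapping the sentinel to None keeps it out of any mean or variance computed over SD.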
A Comprehensive Surface Water Quality Monitoring Dataset (1940-2023):...
figshare.com
csv
Updated Feb 23, 2025
Cite
Md. Rajaul Karim; Mahbubul Syeed; Ashifur Rahman; Khondkar Ayaz Rabbani; Kaniz Fatema; Razib Hayat Khan; Md Shakhawat Hossain; Mohammad Faisal Uddin (2025). A Comprehensive Surface Water Quality Monitoring Dataset (1940-2023): 2.82Million Record Resource for Empirical and ML-Based Research [Dataset]. http://doi.org/10.6084/m9.figshare.27800394.v2
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.6084/m9.figshare.27800394.v2
Dataset updated
Feb 23, 2025
Dataset provided by
figshare
Authors
Md. Rajaul Karim; Mahbubul Syeed; Ashifur Rahman; Khondkar Ayaz Rabbani; Kaniz Fatema; Razib Hayat Khan; Md Shakhawat Hossain; Mohammad Faisal Uddin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data Description
Water Quality Parameters: Ammonia, BOD, DO, Orthophosphate, pH, Temperature, Nitrogen, Nitrate.
Countries/Regions: United States, Canada, Ireland, England, China.
Years Covered: 1940-2023.
Data Records: 2.82 million.

Definition of Columns
Country: Name of the water-body region.
Area: Name of the area in the region.
Waterbody Type: Type of the water-body source.
Date: Date of the sample collection (dd-mm-yyyy).
Ammonia (mg/l): Ammonia concentration.
Biochemical Oxygen Demand (BOD) (mg/l): Oxygen demand measurement.
Dissolved Oxygen (DO) (mg/l): Concentration of dissolved oxygen.
Orthophosphate (mg/l): Orthophosphate concentration.
pH (pH units): pH level of water.
Temperature (°C): Temperature in Celsius.
Nitrogen (mg/l): Total nitrogen concentration.
Nitrate (mg/l): Nitrate concentration.
CCME_Values: Calculated water quality index values using the CCME WQI model.
CCME_WQI: Water Quality Index classification based on CCME_Values.

Data Directory Description
Category 1: Dataset
Combined Data: This folder contains two files: Combined_dataset.csv and Summary.xlsx. The Combined_dataset.csv file includes all eight water quality parameter readings across five countries, with additional data for initial preprocessing steps like missing value handling, outlier detection, and other operations. It also contains the CCME Water Quality Index calculation for empirical analysis and ML-based research. The Summary.xlsx file provides a brief description of the datasets, including data distributions (e.g., maximum, minimum, mean, standard deviation).
Files: Combined_dataset.csv, Summary.xlsx
Country-wise Data: This folder contains separate country-based datasets in CSV files. Each file includes the eight water quality parameters for regional analysis. The Summary_country.xlsx file presents country-wise dataset descriptions with data distributions (e.g., maximum, minimum, mean, standard deviation).
Files: England_dataset.csv, Canada_dataset.csv, USA_dataset.csv, Ireland_dataset.csv, China_dataset.csv, Summary_country.xlsx
Category 2: Code
Data processing and harmonization code (e.g., Language Conversion, Date Conversion, Parameter Naming and Unit Conversion, Missing Value Handling, WQI Measurement and Classification).
File: Data_Processing_Harmonnization.ipynb
The code used for Technical Validation (e.g., assessing the Data Distribution, Outlier Detection, Water Quality Trend Analysis, and Verifying the Application of the Dataset for the ML Models).
File: Technical_Validation.ipynb
Category 3: Data Collection Sources
This category contains links to the data collection sources that were used to create the dataset. They are provided for further reconstruction or data formation.
File: DataCollectionSources.xlsx

Original Paper Title: A Comprehensive Dataset of Surface Water Quality Spanning 1940-2023 for Empirical and ML Adopted Research
Abstract
Assessment and monitoring of surface water quality are essential for food security, public health, and ecosystem protection. Although water quality monitoring is well established, little effort has been made to offer a comprehensive and harmonized dataset for surface water at the global scale. This study presents a comprehensive surface water quality dataset that preserves spatio-temporal variability, integrity, consistency, and depth of the data to facilitate empirical and data-driven evaluation, prediction, and forecasting. The dataset is assembled from a range of sources, including regional and global water quality databases, water management organizations, and individual research projects from five countries: the USA, Canada, Ireland, England, and China. The resulting dataset consists of 2.82 million measurements of eight water quality parameters that span 1940-2023. This dataset can support meta-analysis of water quality models and can facilitate Machine Learning (ML) based data and model-driven investigation of the spatial and temporal drivers and patterns of surface water quality at a cross-regional to global scale.
Note: Cite this repository and the original paper when using this dataset.
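As a rough illustration of how the column definitions in this description could be used, the following Python sketch parses dates in the stated dd-mm-yyyy format and computes the kind of distribution figures (minimum, maximum, mean, standard deviation) that the summary workbooks are said to contain. The header strings are assumed to match the column definitions exactly, the sample rows are invented for illustration and are not taken from Combined_dataset.csv, and the helper names `load_rows` and `summarize` are hypothetical.

```python
import csv
import io
import statistics
from datetime import datetime

# Invented sample rows for illustration only; not real records from the dataset.
# Header names are assumed to match the column definitions in the description.
SAMPLE = """Country,Date,Dissolved Oxygen (DO) (mg/l)
Ireland,03-02-2001,9.5
Ireland,17-08-2001,
Canada,25-12-1999,11.0
"""

def load_rows(handle, value_column):
    """Yield (country, date, value) tuples, skipping rows with a missing value."""
    for row in csv.DictReader(handle):
        raw = row[value_column].strip()
        if not raw:
            continue  # missing reading: skipped here, other strategies are possible
        # The description states that dates are stored as dd-mm-yyyy.
        date = datetime.strptime(row["Date"], "%d-%m-%Y").date()
        yield row["Country"], date, float(raw)

def summarize(values):
    """Return min, max, mean and sample standard deviation of the values."""
    values = list(values)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

rows = list(load_rows(io.StringIO(SAMPLE), "Dissolved Oxygen (DO) (mg/l)"))
summary = summarize(value for _, _, value in rows)
```

To try this on the real file, the `io.StringIO(SAMPLE)` argument could be replaced by an open file handle, provided the actual header names are checked first.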
Replication data for "High life satisfaction reported among small-scale...
dataverse.csuc.cat
csv, txt
Updated Feb 7, 2024
Cite
Eric Galbraith; Victoria Reyes Garcia (2024). Replication data for "High life satisfaction reported among small-scale societies with low incomes" [Dataset]. http://doi.org/10.34810/data904
Explore at:
Available download formats: csv(1620), csv(7829), txt(7017), csv(227502)
Unique identifier
https://doi.org/10.34810/data904
Dataset updated
Feb 7, 2024
Dataset provided by
CORA.Repositori de Dades de Recerca
Authors
Eric Galbraith; Victoria Reyes Garcia
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 2021 - Oct 24, 2023
Area covered
Mafia Island, United Republic of Tanzania; Darjeeling, India; Mongolia, Bulgan soum; Kumbungu, Ghana; Guatemala, Western highlands; Fiji, Ba; Shangri-la, China; Argentina, Puna; Laprak, Nepal; Bassari country, Senegal
Dataset funded by
European Commission
Description
This dataset was created to document self-reported life evaluations among small-scale societies that exist on the fringes of mainstream industrialized societies. The data were produced as part of the LICCI project, through fieldwork carried out by LICCI partners. The data include individual responses to a life satisfaction question, and household asset values. Data from the Gallup World Poll and the World Values Survey, used for comparison, are also included.

TABULAR DATA-SPECIFIC INFORMATION
---------------------------------
1. File name: LICCI_individual.csv
Number of rows and columns: 2814,7
Variable list:
Variable names: User, Site, village
Description: identification of investigator and location
Variable name: Well.being.general
Description: numerical score for life satisfaction question
Variable names: HH_Assets_US, HH_Assets_USD_capita
Description: estimated value of representative assets in the household of respondent, total and per capita (accounting for number of household inhabitants)

2. File name: LICCI_bySite.csv
Number of rows and columns: 19,8
Variable list:
Variable names: Site, N
Description: site name and number of respondents at the site
Variable names: SWB_mean, SWB_SD
Description: mean and standard deviation of life satisfaction score
Variable names: HHAssets_USD_mean, HHAssets_USD_sd
Description: Site mean and standard deviation of household asset value
Variable names: PerCapAssets_USD_mean, PerCapAssets_USD_sd
Description: Site mean and standard deviation of per capita asset value

3. File name: gallup_WVS_GDP_pk.csv
Number of rows and columns: 146,8
Variable list:
Variable names: Happiness Score, Whisker-high, Whisker-low
Description: from Gallup World Poll as documented in World Happiness Report 2022.
Variable name: GDP-PPP2017
Description: Gross Domestic Product per capita for year 2020 at PPP (constant 2017 international $). Accessed May 2022.
Variable name: pk
Description: Produced capital per capita for year 2018 (in 2018 US$) for available countries, as estimated by the World Bank (accessed February 2022).
Variable names: WVS7_mean, WVS7_std
Description: Results of Question 49 in the World Values Survey, Wave 7.
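The relationship between the individual-level file and the per-site file described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it takes the `Site` and `Well.being.general` column names from the variable list, treats `SWB_SD` as a sample standard deviation and `N` as the count of non-missing responses (neither is confirmed by the description), and uses invented rows rather than real LICCI data. The function name `by_site` is hypothetical.

```python
import csv
import io
import statistics
from collections import defaultdict

# Invented rows for illustration only; not real responses from LICCI_individual.csv.
SAMPLE = """Site,Well.being.general
SiteA,6
SiteA,8
SiteB,7
SiteB,
SiteB,9
SiteB,5
"""

def by_site(handle):
    """Group life satisfaction scores by site and compute N, mean and SD."""
    scores = defaultdict(list)
    for row in csv.DictReader(handle):
        raw = row["Well.being.general"].strip()
        if raw:  # assumption: rows with a missing score are left out of N
            scores[row["Site"]].append(float(raw))
    return {
        site: {
            "N": len(values),
            "SWB_mean": statistics.mean(values),
            # Assumption: sample (n - 1) standard deviation; undefined for one value.
            "SWB_SD": statistics.stdev(values) if len(values) > 1 else float("nan"),
        }
        for site, values in scores.items()
    }

site_stats = by_site(io.StringIO(SAMPLE))
```

The same grouping could be applied to the household asset columns to approximate the remaining per-site variables, again subject to checking the conventions actually used by the authors.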
Soil benchmark data sets 2021 district-free city of Brandenburg an der Havel...
service.tib.eu
Updated Feb 4, 2025
Cite
(2025). Soil benchmark data sets 2021 district-free city of Brandenburg an der Havel - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/govdata_94ee7f6f-d384-4499-b592-718f6027da98
Explore at:
Dataset updated
Feb 4, 2025
Area covered
Brandenburg
Description
Ground benchmark datasets are issued annually in the standard file formats Text (CSV) and XML in relation to EPSG code 25833. Depending on the file format, ground benchmark data sets are provided in full for the areas of competence of the expert committees and for the State of Brandenburg in a zipped file with a statistical indication and a description of the elements. The CSV file is based on VBORIS2. A key bridge to the old format can be extracted from the data. On request, soil benchmark datasets for municipal areas can be cut out or provided in shape format. Furthermore, the delivery of soil benchmarks in the form of web-based geoservices is possible.

Cite

Terri Velliquette; Jessica Welch; Michael Crow; Ranjeet Devarakonda; Susan Heinz; Robert Crystal-Ornelas (2022). ESS-DIVE Reporting Format for Comma-separated Values (CSV) File Structure [Dataset]. http://doi.org/10.15485/1734841

ESS-DIVE Reporting Format for Comma-separated Values (CSV) File Structure

Explore at:

17 scholarly articles cite this dataset (View in Google Scholar)

