90 datasets found
  1. ESS-DIVE Reporting Format for Comma-separated Values (CSV) File Structure

    • knb.ecoinformatics.org
    • dataone.org
    • +2 more
    Updated May 4, 2023
    Cite
    Terri Velliquette; Jessica Welch; Michael Crow; Ranjeet Devarakonda; Susan Heinz; Robert Crystal-Ornelas (2023). ESS-DIVE Reporting Format for Comma-separated Values (CSV) File Structure [Dataset]. http://doi.org/10.15485/1734841
    Dataset updated
    May 4, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Terri Velliquette; Jessica Welch; Michael Crow; Ranjeet Devarakonda; Susan Heinz; Robert Crystal-Ornelas
    Time period covered
    Jan 1, 2020 - Sep 30, 2021
    Description

    The ESS-DIVE reporting format for Comma-separated Values (CSV) file structure is based on a combination of existing guidelines and recommendations, including some found within the Earth Science Community, with valuable input from the Environmental Systems Science (ESS) Community. The CSV reporting format is designed to promote interoperability and machine-readability of CSV data files while also facilitating the collection of some file-level metadata content. Tabular data in the form of rows and columns should be archived in its simplest form, and we recommend submitting these tabular data following the ESS-DIVE reporting format for generic comma-separated values (CSV) text format files. In general, the CSV file format is more likely to remain accessible to future systems than a proprietary format, and CSV files are preferred because they are easier to exchange between different programs, which increases the interoperability of a data file. Defining the reporting format and providing guidelines for how to structure CSV files and some of the field content within them can increase the machine-readability of the data file for extracting, compiling, and comparing the data across files and systems. Data package files are in .csv, .png, and .md. Open the .csv with e.g. Microsoft Excel, LibreOffice, or Google Sheets. Open the .md files by downloading and using a text editor (e.g., Notepad or TextEdit). Open the .png in e.g. a web browser, photo viewer/editor, or Google Drive.

  2. Data from: "A guide to using GitHub for developing and versioning data...

    • knb.ecoinformatics.org
    • dataone.org
    • +1 more
    Updated May 4, 2023
    Cite
    Robert Crystal-Ornelas; Charuleka Varadharajan; Ben Bond-Lamberty; Kristin Boye; Shreyas Cholia; Michael Crow; Ranjeet Devarakonda; Kim S. Ely; Amy Goldman; Susan Heinz; Valerie Hendrix; Joan Damerow; Stephanie Pennington; Madison Burrus; Zarine Kakalia; Emily Robles; Maegen Simmonds; Alistair Rogers; Terri Velliquette; Helen Weierbach; Pamela Weisenhorn; Jessica N. Welch; Deborah A. Agarwal (2023). Data from: "A guide to using GitHub for developing and versioning data standards and reporting formats" [Dataset]. http://doi.org/10.15485/1780565
    Dataset updated
    May 4, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Robert Crystal-Ornelas; Charuleka Varadharajan; Ben Bond-Lamberty; Kristin Boye; Shreyas Cholia; Michael Crow; Ranjeet Devarakonda; Kim S. Ely; Amy Goldman; Susan Heinz; Valerie Hendrix; Joan Damerow; Stephanie Pennington; Madison Burrus; Zarine Kakalia; Emily Robles; Maegen Simmonds; Alistair Rogers; Terri Velliquette; Helen Weierbach; Pamela Weisenhorn; Jessica N. Welch; Deborah A. Agarwal
    Time period covered
    Sep 1, 2020 - Dec 3, 2020
    Description

    These data are the results of a systematic review that investigated how data standards and reporting formats are documented on the version control platform GitHub. Our systematic review identified 32 data standards in earth science, environmental science, and ecology that use GitHub for version control of data standard documents. In our analysis, we characterized the documents and content within each of the 32 GitHub repositories to identify common practices for groups that version control their documents on GitHub. This data package contains 8 CSV files with the data that we characterized from each repository, organized by location within the repository. For example, in 'readme_pages.csv' we characterize the content that appears across the 32 GitHub repositories included in our systematic review. Each of the 8 CSV files has an associated data dictionary file (names appended with '_dd.csv') that describes each content category within the CSV file. There is one file-level metadata file (flmd.csv) that provides a description of each file within the data package.

  3. machine learning models on the WDBC dataset

    • scidb.cn
    Updated Apr 15, 2025
    Cite
    Mahdi Aghaziarati (2025). machine learning models on the WDBC dataset [Dataset]. http://doi.org/10.57760/sciencedb.23537
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Mahdi Aghaziarati
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension. All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling. The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability. All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. 
Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically-relevant dataset.
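    The preprocessing steps named in this description (z-score standardization, label encoding, and a stratified 80/20 split) can be sketched as follows. This is a minimal standard-library illustration, not the contents of breast_cancer_classification_models.py; the function names, the "M"/"B" label spelling, the use of the population standard deviation, and the toy values are all assumptions.

```python
import random
import statistics
from collections import defaultdict


def zscore(values):
    """Standardize one feature column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (an assumption)
    return [(v - mean) / std for v in values]


def encode_labels(labels):
    """Map diagnosis labels to integers: malignant -> 1, benign -> 0."""
    mapping = {"M": 1, "B": 0}  # assumed spelling of the raw labels
    return [mapping[label] for label in labels]


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps its share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy values for illustration only (not taken from the WDBC file).
radius_mean = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 11.0, 13.0, 15.0, 17.0]
diagnosis = ["B", "B", "B", "B", "B", "M", "M", "M", "M", "M"]

scaled = zscore(radius_mean)
y = encode_labels(diagnosis)
train_idx, test_idx = stratified_split(y, test_fraction=0.2, seed=0)
```

    With five samples per class and a 20% test fraction, one sample of each class lands in the test set, so both parts keep the original class ratio.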

  4. ESS-DIVE Reporting Format for File-level Metadata

    • dataone.org
    • knb.ecoinformatics.org
    Updated May 4, 2023
    Cite
    Terri Velliquette; Jessica Welch; Michael Crow; Ranjeet Devarakonda; Susan Heinz; Robert Crystal-Ornelas (2023). ESS-DIVE Reporting Format for File-level Metadata [Dataset]. http://doi.org/10.15485/1734840
    Dataset updated
    May 4, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Terri Velliquette; Jessica Welch; Michael Crow; Ranjeet Devarakonda; Susan Heinz; Robert Crystal-Ornelas
    Time period covered
    Jan 1, 2020 - Sep 30, 2021
    Description

    The ESS-DIVE reporting format for file-level metadata (FLMD) provides granular information at the data file level to describe the contents, scope, and structure of the data file to enable comparison of data files within a data package. The FLMD are fully consistent with and augment the metadata collected at the data package level. We developed the FLMD template based on a review of a small number of existing FLMD in use at other agencies and repositories, with valuable input from the Environmental Systems Science (ESS) Community. Also included is a template for a CSV Data Dictionary where users can provide file-level information about the contents of a CSV data file (e.g., define column names, provide units). Files are in .csv, .xlsx, and .md. Templates are in both .csv and .xlsx (open with e.g. Microsoft Excel, LibreOffice, or Google Sheets). Open the .md files by downloading and using a text editor (e.g., Notepad or TextEdit). Though we provide Excel templates for the file-level metadata reporting format, our instructions encourage users to 'Save the FLMD template as a CSV following the CSV Reporting Format guidance'. In addition, we developed the ESS-DIVE File Level Metadata Extractor, a lightweight Python script that can extract some FLMD fields following the recommended FLMD format and structure.

  5. Tidal Daily Discharge and Quality Assurance Data Supporting an Assessment of...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Tidal Daily Discharge and Quality Assurance Data Supporting an Assessment of Water Quality and Discharge in the Herring River, Wellfleet, Massachusetts, November 2015–September 2017 [Dataset]. https://catalog.data.gov/dataset/tidal-daily-discharge-and-quality-assurance-data-supporting-an-assessment-of-water-quality
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Wellfleet, Herring River, Massachusetts
    Description

    This data release provides data in support of an assessment of water quality and discharge in the Herring River at the Chequessett Neck Road dike in Wellfleet, Massachusetts, from November 2015 to September 2017. The assessment was a cooperative project among the U.S. Geological Survey, National Park Service, Cape Cod National Seashore, and the Friends of Herring River to characterize environmental conditions prior to a future removal of the dike. It is described in U.S. Geological Survey (USGS) Scientific Investigations Report "Assessment of Water Quality and Discharge in the Herring River, Wellfleet, Massachusetts, November 2015 – September 2017." This data release is structured as a set of comma-separated values (CSV) files, each of which contains information on data source (or laboratory used for analysis), USGS site identification (ID) number, beginning date and time of observation or sampling, ending date and time of observation or sampling, and data such as flow rate and analytical results. The CSV files include calculated tidal daily flows (Flood_Tide_Tidal_Day.csv and Ebb_Tide_Tidal_Day.csv) that were used in Huntington and others (2020) for estimation of nutrient loads. Tidal daily flows are the estimated mean daily discharges for two consecutive flood and ebb tide cycles (average duration: 24 hours, 48 minutes). The associated date is the day on which most of the flow occurred. CSV files contain quality assurance data for water-quality samples including blanks (Blanks.csv), replicates (Replicates.csv), standard reference materials (Standard_Reference_Material.csv), and atmospheric ammonium contamination (NH4_Atmospheric_Contamination.csv). One CSV file (EWI_vs_ISCO.csv) contains data comparing composite samples collected by an automatic sampler (ISCO) at a fixed point with depth-integrated samples collected at equal width increments (EWI). 
One CSV file (Cross_Section_Field_Parameters.csv) contains field parameter data (specific conductance, temperature, pH, and dissolved oxygen) collected at a fixed location and data collected along the cross sections at variable water depths and horizontal distances across the openings of the culverts at the Chequessett Neck Road dike. One CSV file (LOADEST_Bias_Statistics.csv) contains data that include estimated natural log of load, model residuals, Z-scores, and seasonal model residuals for winter (December, January, and February); spring (March, April and May); summer (June, July and August); and fall (September, October, and November). The data release also includes a data dictionary (Data_Dictionary.csv) that provides detailed descriptions of each field in each CSV file, including: data filename; laboratory or data source; U.S. Geological Survey site ID numbers; data types; constituent (analyte) U.S. Geological Survey parameter codes; descriptions of parameters; units; methods; minimum reporting limits; limits of quantitation, if appropriate; method reference citations; and minimum, maximum, median, and average values for each analyte. The data release also includes an abbreviations file (Abbreviations.pdf) that defines all the abbreviations in the data dictionary and CSV files. Note that the USGS site ID includes a leading zero (011058798) and some of the parameter codes contain leading zeros, so care must be taken in opening and subsequently saving these files in other formats where leading zeros may be dropped.
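    The leading-zero caveat above can be illustrated with a short sketch. Python's csv module returns every field as a string, so the zeros survive until a value is converted to a number. Only the site ID 011058798 comes from the description; the column names, parameter codes, and values in the sample are placeholders made up for illustration.

```python
import csv
import io

# Made-up two-row sample; only the site ID is taken from the description.
sample = io.StringIO(
    "site_id,parameter_code,value\n"
    "011058798,00095,45200\n"
    "011058798,00010,12.5\n"
)

rows = list(csv.DictReader(sample))  # the csv module keeps every field as text
site_ids = [row["site_id"] for row in rows]

# Converting an identifier to a number silently drops the leading zero,
# which is what happens when a spreadsheet guesses the column type.
as_number = int(site_ids[0])
```

    Keeping identifier columns as text, and only converting measurement columns to numbers, avoids the problem when the files are re-saved in another format.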

  6. ISO 639-1 Language Codes

    • kaggle.com
    Updated May 7, 2023
    Cite
    Mahesh Jadhav (2023). ISO 639-1 Language Codes [Dataset]. https://www.kaggle.com/datasets/ursmaheshj/iso-639-1-language-codes
    Dataset updated
    May 7, 2023
    Dataset provided by
    Kaggle
    Authors
    Mahesh Jadhav
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    ISO 639-1 is a standard for language codes that assigns a two-letter code to represent a language. This code is used to identify languages in computer systems, websites, and other applications that require language tagging. The codes are based on the English names of languages, and each language is assigned a unique code that consists of two letters. For example, "en" represents English, "fr" represents French, and "es" represents Spanish. ISO 639-1 language codes are commonly used in multilingual environments to ensure consistent and accurate representation of languages across different systems and platforms.

    You can easily use this dataset to combine and replace the ISO 639-1 codes in your dataset with appropriate language names.
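    Replacing codes with language names, as suggested above, amounts to a dictionary lookup. The sketch below is a minimal illustration: the three code/name pairs are the ones quoted in the description, while the function name, the record layout, and the choice to leave unknown codes unchanged are assumptions.

```python
# Small excerpt of a code-to-name table; a full table would be loaded from the dataset.
iso_639_1 = {"en": "English", "fr": "French", "es": "Spanish"}


def language_name(code, table=iso_639_1):
    """Return the language name for a two-letter code, or the code itself if unknown."""
    return table.get(code.strip().lower(), code)


# Hypothetical records whose "lang" field holds ISO 639-1 codes.
records = [
    {"title": "Bonjour", "lang": "fr"},
    {"title": "Hola", "lang": "ES"},
    {"title": "Unknown", "lang": "xx"},
]
named = [{**record, "lang": language_name(record["lang"])} for record in records]
```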

    This data is collected from below sites

  7. Dataset metadata of known Dataverse installations

    • search.dataone.org
    • dataverse.harvard.edu
    • +1 more
    Updated Nov 22, 2023
    Cite
    Gautier, Julian (2023). Dataset metadata of known Dataverse installations [Dataset]. http://doi.org/10.7910/DVN/DCDKZQ
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gautier, Julian
    Description

    This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │ ├── author(citation).csv
    │ ├── basic.csv
    │ ├── contributor(citation).csv
    │ ├── ...
    │ └── topic_classification(citation).csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │ ├── Abacus_2022.10.02_17.11.19.zip
    │ ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
    │ ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
    │ ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
    │ ├── ...
    │ ├── metadatablocks_v5.6
    │ ├── astrophysics_v5.6.json
    │ ├── biomedical_v5.6.json
    │ ├── citation_v5.6.json
    │ ├── ...
    │ ├── socialscience_v5.6.json
    │ ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
    │ ├── ADA_Dataverse_2022.10.02_17.26.57.zip
    │ ├── Arca_Dados_2022.10.02_17.44.35.zip
    │ ├── ...
    │ └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
    └── dataset_pids_from_most_known_dataverse_installations.csv
    └── licenses_used_by_dataverse_installations.csv
    └── metadatablocks_from_most_known_dataverse_installations.csv

    This dataset contains two directories and three CSV files not in a directory.

    One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned, versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories: The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

    The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files.

    The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ... Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
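    The two-column token file described above ("hostname" and "apikey") could be read along these lines. This is only a sketch and not the author's script: the function name, the placeholder hostnames and token, and the choice to skip rows with an empty token are assumptions.

```python
import csv
import io


def load_api_tokens(handle):
    """Read a CSV with 'hostname' and 'apikey' columns into a {hostname: token} dict."""
    reader = csv.DictReader(handle)
    # Rows without a token are skipped (an assumption about how gaps are handled).
    return {row["hostname"]: row["apikey"] for row in reader if row["apikey"]}


# Placeholder contents; real hostnames and tokens would come from the user's own file.
sample = io.StringIO(
    "hostname,apikey\n"
    "https://demo.example.org,00000000-aaaa-bbbb-cccc-000000000000\n"
    "https://other.example.org,\n"
)
tokens = load_api_tokens(sample)
```

    In practice the handle would come from opening the CSV file on disk rather than from an in-memory string.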

  8. Data from: Dataset from : Browsing is a strong filter for savanna tree...

    • data.niaid.nih.gov
    Updated Oct 1, 2021
    Cite
    Wayne Twine (2021). Dataset from : Browsing is a strong filter for savanna tree seedlings in their first growing season [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4972083
    Dataset updated
    Oct 1, 2021
    Dataset provided by
    Archibald, Sally
    Craddock Mthabini
    Wayne Twine
    Nicola Stevens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data presented here were used to produce the following paper:

    Archibald, Twine, Mthabini, Stevens (2021) Browsing is a strong filter for savanna tree seedlings in their first growing season. J. Ecology.

    The project under which these data were collected is: Mechanisms Controlling Species Limits in a Changing World. NRF/SASSCAL Grant number 118588

    For information on the data or analysis please contact Sally Archibald: sally.archibald@wits.ac.za

    Description of file(s):

    File 1: cleanedData_forAnalysis.csv (required to run the R code: "finalAnalysis_PostClipResponses_Feb2021_requires_cleanData_forAnalysis_.R")

    The data represent monthly survival and growth data for ~740 seedlings from 10 species under various levels of clipping.

    The data consist of one .csv file with the following column names:

    treatment: Clipping treatment (1 - 5 months clip plus control unclipped)
    plot_rep: One of three randomised plots per treatment
    matrix_no: Where in the plot the individual was placed
    species_code: First three letters of the genus name, and first three letters of the species name uniquely identifies the species
    species: Full species name
    sample_period: Classification of sampling period into time since clip.
    status: Alive or Dead
    standing.height: Vertical height above ground (in mm)
    height.mm: Length of the longest branch (in mm)
    total.branch.length: Total length of all the branches (in mm)
    stemdiam.mm: Basal stem diameter (in mm)
    maxSpineLength.mm: Length of the longest spine
    postclipStemNo: Number of resprouting stems (only recorded AFTER clipping)
    date.clipped: date.clipped
    date.measured: date.measured
    date.germinated: date.germinated
    Age.of.plant: Date measured - Date germinated
    newtreat: Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)
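    The Age.of.plant column is defined above as the date measured minus the date germinated. A minimal sketch of that subtraction follows; the ISO "YYYY-MM-DD" date format, the function name, and the example dates are assumptions, since the description does not state how dates are stored in the file.

```python
from datetime import date


def age_of_plant(date_measured, date_germinated):
    """Age in days: date measured minus date germinated (ISO date strings assumed)."""
    measured = date.fromisoformat(date_measured)
    germinated = date.fromisoformat(date_germinated)
    return (measured - germinated).days


# Made-up dates for illustration only.
age = age_of_plant("2017-01-15", "2016-11-20")
```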

    File 2: Herbivory_SurvivalEndofSeason_march2017.csv (required to run the R code: "FinalAnalysisResultsSurvival_requires_Herbivory_SurvivalEndofSeason_march2017.R")

    The data consist of one .csv file with the following column names:

    treatment: Clipping treatment (1 - 5 months clip plus control unclipped)
    plot_rep: One of three randomised plots per treatment
    matrix_no: Where in the plot the individual was placed
    species_code: First three letters of the genus name, and first three letters of the species name uniquely identifies the species
    species: Full species name
    sample_period: Classification of sampling period into time since clip.
    status: Alive or Dead
    standing.height: Vertical height above ground (in mm)
    height.mm: Length of the longest branch (in mm)
    total.branch.length: Total length of all the branches (in mm)
    stemdiam.mm: Basal stem diameter (in mm)
    maxSpineLength.mm: Length of the longest spine
    postclipStemNo: Number of resprouting stems (only recorded AFTER clipping)
    date.clipped: date.clipped
    date.measured: date.measured
    date.germinated: date.germinated
    Age.of.plant: Date measured - Date germinated
    newtreat: Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)
    genus: Genus
    MAR: Mean Annual Rainfall for that Species distribution (mm)
    rainclass: High/medium/low

    File 3: allModelParameters_byAge.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R")

    Consists of a .csv file with the following column headings

    Age.of.plant: Age in days
    species_code: Species
    pred_SD_mm: Predicted stem diameter in mm
    pred_SD_up: top 75th quantile of stem diameter in mm
    pred_SD_low: bottom 25th quantile of stem diameter in mm
    treatdate: date when clipped
    pred_surv: Predicted survival probability
    pred_surv_low: Predicted 25th quantile survival probability
    pred_surv_high: Predicted 75th quantile survival probability
    species_code: species code
    Bite.probability: Daily probability of being eaten
    max_bite_diam_duiker_mm: Maximum bite diameter of a duiker for this species
    duiker_sd: standard deviation of bite diameter for a duiker for this species
    max_bite_diameter_kudu_mm: Maximum bite diameter of a kudu for this species
    kudu_sd: standard deviation of bite diameter for a kudu for this species
    mean_bite_diam_duiker_mm: mean etc
    duiker_mean_sd: standard deviation etc
    mean_bite_diameter_kudu_mm: mean etc
    kudu_mean_sd: standard deviation etc
    genus: genus
    rainclass: low/med/high

    File 4: EatProbParameters_June2020.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R")

    Consists of a .csv file with the following column headings

    shtspec: species name
    species_code: species code
    genus: genus
    rainclass: low/medium/high
    seed mass: mass of seed (g per 1000 seeds)
    Surv_intercept: coefficient of the model predicting survival from age of clip for this species
    Surv_slope: coefficient of the model predicting survival from age of clip for this species
    GR_intercept: coefficient of the model predicting stem diameter from seedling age for this species
    GR_slope: coefficient of the model predicting stem diameter from seedling age for this species
    species_code: species code
    max_bite_diam_duiker_mm: Maximum bite diameter of a duiker for this species
    duiker_sd: standard deviation of bite diameter for a duiker for this species
    max_bite_diameter_kudu_mm: Maximum bite diameter of a kudu for this species
    kudu_sd: standard deviation of bite diameter for a kudu for this species
    mean_bite_diam_duiker_mm: mean etc
    duiker_mean_sd: standard deviation etc
    mean_bite_diameter_kudu_mm: mean etc
    kudu_mean_sd: standard deviation etc
    AgeAtEscape_duiker[t]: age of plant when its stem diameter is larger than a mean duiker bite
    AgeAtEscape_duiker_min[t]: age of plant when its stem diameter is larger than a min duiker bite
    AgeAtEscape_duiker_max[t]: age of plant when its stem diameter is larger than a max duiker bite
    AgeAtEscape_kudu[t]: age of plant when its stem diameter is larger than a mean kudu bite
    AgeAtEscape_kudu_min[t]: age of plant when its stem diameter is larger than a min kudu bite
    AgeAtEscape_kudu_max[t]: age of plant when its stem diameter is larger than a max kudu bite
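    The AgeAtEscape columns describe the age at which a seedling's stem diameter exceeds a browser's bite diameter. The sketch below shows how such an age could be derived from the growth coefficients, assuming a linear growth model (stem diameter = GR_intercept + GR_slope × age). The description does not state the model form, so the linear form, the function name, and the numeric values are assumptions for illustration and do not reproduce the authors' R code.

```python
def age_at_escape(gr_intercept, gr_slope, bite_diameter_mm):
    """Age at which the predicted stem diameter reaches the bite diameter.

    Assumes stem_diameter = gr_intercept + gr_slope * age, a linear form
    that is NOT confirmed by the dataset description.
    """
    if gr_slope <= 0:
        return None  # under this model the stem never outgrows the bite
    # Solve gr_intercept + gr_slope * age = bite_diameter_mm for age.
    return max(0.0, (bite_diameter_mm - gr_intercept) / gr_slope)


# Made-up coefficients and bite diameter, for illustration only.
escape_age = age_at_escape(gr_intercept=0.5, gr_slope=0.25, bite_diameter_mm=3.0)
```

    The min, mean, and max variants of the column would follow from passing the corresponding bite diameter for each browser.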

  9. Données de réplication pour : Towards the improvement of thermodynamic...

    • entrepot.recherche.data.gouv.fr
    Updated Dec 4, 2024
    Cite
    Pierre Llompart; Claire Minoletti; Shamkhal Baybekov; Dragos Horvath; Gilles Marcou; Alexandre Varnek (2024). Données de réplication pour : Towards the improvement of thermodynamic solubility prediction – a review [Dataset]. http://doi.org/10.57745/CZVZIA
    Available download formats: tsv(100787), tsv(205089), application/x-ipynb+json(72137), tsv(519218), tsv(1009535), txt(3373), tsv(993073), txt(2104), tsv(2857901), tsv(3164076), tsv(786411), tsv(5527968), tsv(12558825), txt(1257), application/x-ipynb+json(792128), tsv(1135906), tsv(2513344)
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    Recherche Data Gouv
    Authors
    Pierre Llompart; Claire Minoletti; Shamkhal Baybekov; Dragos Horvath; Gilles Marcou; Alexandre Varnek
    License

    https://spdx.org/licenses/etalab-2.0.html

    Dataset funded by
    ANRT Cifre
    Description

    Evaluating thermodynamic solubility is crucial to designing successful drug candidates, yet predicting it with in silico approaches remains a challenge. Machine learning methods are used to develop regression models that leverage molecular descriptors. Recently, powerful solubility predictive models have been published using feature- and graph-based neural networks. These models often display attractive performance, yet their reliability may be deceptive when they are used for prospective prediction. This review investigates the origins of these discrepancies along three directions: a historical perspective, an analysis of the structure of the aqueous solubility dataverse, and data quality. We demonstrate that new models are not ready for public usage because they lack a well-defined applicability domain and overlook some historical data sources. On the basis of a carefully reviewed dataset, we illustrate the influence of data quality on model predictivity. We comprehensively investigated over 20 years of published solubility datasets and models, highlighting overlooked and interconnected datasets. We benchmarked recently published models on a Sanofi dataset, as an example of a pharmaceutical context, and they performed poorly. We observed the impact of factors influencing the performance of the models: interlaboratory standard deviation, ionic state of the solute, and source of the solubility data. As a consequence, we draw up a general workflow to curate aqueous solubility data with the aim of producing predictive models. Our results show how the data quality and applicability domain of public models affect their utility in a real pharmaceutical industry context. We found that some data sources, for instance the eChem dataset, may be less reliable than initially expected. This exhaustive aqueous solubility data analysis led to the development of a curation workflow; the resulting models and datasets are publicly available.
    Data are available as CSV files.

    File AqSolDBc.csv and AqSolDB_Enriched: AqSolDBc is the final curated dataset after filtering of the AqSolDB_Enriched dataset. AqSolDBc is the curated data from the AqSolDB. The available columns are:

    • Source: If in AqSolDBc, the value is "AqSolDBc"
    • ID: Compound ID (string)
    • Name: Name of the compound (string)
    • InChI: InChI code of the chemical structure (string)
    • InChIKey: InChI hash code of the chemical structure (string)
    • ExperimentalLogS: Mole/L logarithm in decimal basis of the thermodynamic solubility in water at pH 7 (+/-1) at ~300K (float)
    • SMILEScurated: Curated SMILES code of the chemical structure (string)
    • SD: Standard laboratory Deviation, default value: -1 (float)
    • Group: Data quality label imported from AqSolDB (string)
    • Dataset: Source of the data point (string)
    • Composition: Purity of the substance: mono-constituent, multi-constituent, UVCB (Categorical)
    • Origin: Either organic, organomettalic, NaN (Categorical)
    • HasError: Yes or No, see ErrorType for details (boolean)
    • ErrorType: Identifier error on the data point, default value: None (String)
    • AtomCount: Number of atoms (integer)
    • AlertAtoms: True if the molecule contains one of the AlertAtom - see curation (boolean)
    • DuplicateGroup: ID used to regroup duplicate structures (integer)
    • DuplicateSD: Standard deviation of the measurements for each unique structure, based on duplicate observations (double)
    • DuplicateOccurrence: Number of measurement for each unique structure (integer)
    • SD: Experimental standard deviation as given in the original AqSolDB (double)

    File AqSolDB.csv: Original data from the AqSolDB. The available columns are:

    • ID: Compound ID (string)
    • Name: Name of the compound (string)
    • SMILES: Original SMILES code of the chemical structure (string)
    • SmilesCurated: Curated SMILES code of the chemical structure (string)
    • InChI: InChI code of the chemical structure (string)
    • InChIKey: InChI hash code of the chemical structure (string)
    • Composition: Purity of the substance: mono-constituent, multi-constituent, UVCB (Categorical)
    • Origin: Either organic, organomettalic, NaN (Categorical)
    • Dataset: Source of the data point (string)
    • HasError: Yes or No, see ErrorType for details (boolean)
    • ErrorType: Identifier error on the data point, default value: None (String)
    • AtomCount: Number of atoms (integer)
    • AlertAtoms: True if the molecule contains one of the AlertAtom - see curation (boolean)
    • SD: Experimental standard deviation as given in the original AqSolDB (double)

    File AqSolDB_Enriched_for_AqSolDBc.csv: An extended version of AqSolDB_Enriched supplemented with molecular descriptors. Available columns:

    • ID: Compound ID (string)
    • Name: Name of the compound (string)
    • InChI: InChI code of the chemical structure (string)
    • InChIKey: InChI hash code of the chemical structure (string)
    • Solubility: Mole/L logarithm in decimal basis of the thermodynamic solubility in water at pH 7 (+/-1) at ~300K (float)
    • SMILES: Original SMILES code of the chemical structure (string)
    • SD: Standard laboratory Deviation, default value: -1 (float)...
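
    The column notes above say that SD defaults to -1 when no laboratory deviation is available, so that sentinel should not be treated as a real measurement. The following is a minimal sketch of one way to handle it, assuming the header names match the column list (ID, ExperimentalLogS, SD); the 0.5 threshold and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Optional


def parse_sd(raw: str) -> Optional[float]:
    """Return the SD as a float, or None when it holds the -1 default."""
    value = float(raw)
    return None if value == -1 else value


def load_reliable(handle, max_sd: float = 0.5):
    """Yield (ID, logS) for rows whose reported SD is at most max_sd.

    max_sd is an arbitrary illustrative threshold, not a value from the dataset.
    """
    for row in csv.DictReader(handle):
        sd = parse_sd(row["SD"])
        if sd is not None and sd <= max_sd:
            yield row["ID"], float(row["ExperimentalLogS"])


# Made-up rows for illustration only; they are not taken from AqSolDBc.csv.
sample = io.StringIO(
    "ID,ExperimentalLogS,SD\n"
    "A-1,-2.10,0.20\n"
    "A-2,-4.75,-1\n"
    "A-3,-0.33,0.90\n"
)
reliable = list(load_reliable(sample))
```

    Rows with the -1 default are skipped rather than kept as "zero deviation"; whether to drop or keep them depends on the modelling goal.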

  10. Human Resources.csv

    • figshare.com
    csv
    Updated Apr 11, 2025
    Cite
    anurag pardiash (2025). Human Resources.csv [Dataset]. http://doi.org/10.6084/m9.figshare.28780886.v1
    Available download formats: csv
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    figshare
    Authors
    anurag pardiash
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset, titled Human Resources.csv, contains anonymized employee data collected for internal HR analysis and research purposes. It includes fields such as employee ID, department, gender, age, job role, and employment status. The data can be used for workforce trend analysis, HR benchmarking, diversity studies, and training models in human resource analytics. The file is provided in CSV format (3.05 MB) and adheres to general data privacy standards, with no personally identifiable information (PII). Last updated: April 11, 2025. Uploaded by Anurag Pardiash.

  11. UNI-CEN Standardized Census Data Table - Census Subdivision (CSD) - 1996 -...

    • search.dataone.org
    Updated Dec 28, 2023
    + more versions
    Cite
    UNI-CEN Project (2023). UNI-CEN Standardized Census Data Table - Census Subdivision (CSD) - 1996 - Wide Format (CSV) (Version 2023-03) [Dataset]. http://doi.org/10.5683/SP3/WJ7WJ6
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    UNI-CEN Project
    Time period covered
    Jan 1, 1996
    Description

    UNI-CEN Standardized Census Data Tables contain Census data that have been reformatted into a common table format with standardized variable names and codes. The data are provided in two tabular formats for different use cases. "Long" tables are suitable for use in statistical environments, while "wide" tables are commonly used in GIS environments. The long tables are provided in Stata Binary (dta) format, which is readable by all statistics software. The wide tables are provided in comma-separated values (csv) and dBase 3 (dbf) formats with codebooks. The wide tables are easily joined to the UNI-CEN Digital Boundary Files. For the csv files, a .csvt file is provided to ensure that column data formats are correctly formatted when importing into QGIS. A schema.ini file does the same when importing into ArcGIS environments. As the DBF file format supports a maximum of 250 columns, tables with a larger number of variables are divided into multiple DBF files. For more information about file sources, the methods used to create them, and how to use them, consult the documentation at https://borealisdata.ca/dataverse/unicen_docs. For more information about the project, visit https://observatory.uwo.ca/unicen.
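
    The .csvt sidecar mentioned above is a one-line file listing a type per column (for example "String","Integer","Real"), which GDAL-based tools such as QGIS read alongside the CSV. Outside of QGIS the same file can be used to convert columns by hand. The sketch below assumes that convention and uses hypothetical column names and values; it is not taken from the UNI-CEN files.

```python
import csv
import io

# Base CSVT type names mapped to Python converters; anything else stays a string.
CONVERTERS = {"integer": int, "integer64": int, "real": float}


def read_csvt(handle):
    """Parse the single line of a .csvt file into lowercase base type names."""
    types = next(csv.reader(handle))
    # Drop optional width/precision suffixes such as "String(20)" or "Real(10.2)".
    return [t.split("(")[0].strip().lower() for t in types]


def read_typed_csv(data_handle, csvt_handle):
    """Return rows as dicts with values converted according to the .csvt types."""
    types = read_csvt(csvt_handle)
    reader = csv.reader(data_handle)
    header = next(reader)
    rows = []
    for record in reader:
        row = {}
        for name, kind, value in zip(header, types, record):
            convert = CONVERTERS.get(kind, str)
            # Keep empty cells as None rather than failing the conversion.
            row[name] = convert(value) if value != "" else None
        rows.append(row)
    return rows


# Hypothetical column names and values, for illustration only.
data = io.StringIO("geo_id,population,median_age\n3520005,2385421,35.1\n")
csvt = io.StringIO('"String","Integer(10)","Real"\n')
rows = read_typed_csv(data, csvt)
```

    Keeping identifier columns as strings, as the .csvt declares here, avoids losing leading zeros when the table is later joined to boundary files.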

  12. Types, open citations, closed citations, publishers, and participation...

    • zenodo.org
    • data.niaid.nih.gov
    csv, zip
    Updated Jan 24, 2020
    Cite
    Ivan Hiebi; Silvio Peroni; David Shotton (2020). Types, open citations, closed citations, publishers, and participation reports of Crossref entities [Dataset]. http://doi.org/10.5281/zenodo.2558258
    Available download formats: csv, zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Ivan Hiebi; Silvio Peroni; David Shotton
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This publication contains several datasets that have been used in the paper "Crowdsourcing open citations with CROCI – An analysis of the current status of open citations, and a proposal" submitted to the 17th International Conference on Scientometrics and Bibliometrics (ISSI 2019), available at https://opencitations.wordpress.com/2019/02/07/crowdsourcing-open-citations-with-croci/.

    Additional information about the analyses described in the paper, including the code and the data we have used to compute all the figures, is available as a Jupyter notebook at https://github.com/sosgang/pushing-open-citations-issi2019/blob/master/script/croci_nb.ipynb. The datasets contain the following information.

    non_open.zip: it is a zipped (~5 GB unzipped) CSV file containing the numbers of open citations and closed citations received by the entities in the Crossref dump used in our computation, dated October 2018. All the entity types retrieved from Crossref were aligned to one of the following five categories: journal, book, proceedings, dataset, other. The open CC0 citation data we used came from the CSV dump of the most recent release of COCI, dated 12 November 2018. The number of closed citations was calculated by subtracting the number of open citations to each entity available within COCI from the value “is-referenced-by-count” available in the Crossref metadata for that particular cited entity, which reports all the DOI-to-DOI citation links that point to the cited entity from within the whole Crossref database (including those present in the Crossref ‘closed’ dataset).

    The columns of the CSV file are the following ones:

    • doi: the DOI of the publication in Crossref;
    • type: the type of the publication as indicated in Crossref;
    • cited_by: the number of open citations received by the publication according to COCI;
    • non_open: the number of closed citations received by the publication according to Crossref + COCI.
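
    The subtraction described for non_open.zip can be sketched as follows. This is only an illustration of the stated rule (closed = is-referenced-by-count minus open citations in COCI); clamping negative results to zero is an assumption added here, and the DOI and counts are hypothetical.

```python
def closed_citations(is_referenced_by_count: int, cited_by: int) -> int:
    """Closed citations = Crossref's total count minus COCI's open citations."""
    # Assumption: clamp at zero in case the two sources disagree on timing.
    return max(is_referenced_by_count - cited_by, 0)


def to_row(doi: str, entity_type: str, is_referenced_by_count: int, cited_by: int) -> dict:
    """Build one record using the four column names listed above."""
    return {
        "doi": doi,
        "type": entity_type,
        "cited_by": cited_by,
        "non_open": closed_citations(is_referenced_by_count, cited_by),
    }


# Hypothetical DOI and counts, for illustration only.
row = to_row("10.1234/example", "journal", 12, 9)
```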

    croci_types.csv: it is a CSV file that contains the numbers of open citations and closed citations received by the entities in the Crossref dump used in our computation, as collected in the previous CSV file, aligned in five classes depending on the entity types retrieved from Crossref: journal (Crossref types: journal-article, journal-issue, journal-volume, journal), book (Crossref types: book, book-chapter, book-section, monograph, book track, book-part, book-set, reference-book, dissertation, book series, edited book), proceedings (Crossref types: proceedings-article, proceedings, proceedings-series), dataset (Crossref types: dataset), other (Crossref types: other, report, peer review, reference-entry, component, report-series, standard, posted-content, standard-series).

    The columns of the CSV file are the following ones:

    • type: the type of publication, one of "journal", "book", "proceedings", "dataset", "other";
    • label: the label assigned to the type for visualisation purposes;
    • coci_open_cit: the number of open citations received by the publication type according to COCI;
    • crossref_close_cit: the number of closed citations received by the publication according to Crossref + COCI.

    publishers_cits.csv: it is a CSV file that contains the top twenty publishers that received the greatest number of open citations. The columns of the CSV file are the following ones:

    • publisher: the name of the publisher;
    • doi_prefix: the list of DOI prefixes assigned by the publisher;
    • coci_open_cit: the number of open citations received by the publications of the publisher according to COCI;
    • crossref_close_cit: the number of closed citations received by the publications of the publishers according to Crossref + COCI;
    • total_cit: the total number of citations received by the publications of the publisher (= coci_open_cit + crossref_close_cit).

    20publishers_cr.csv: it is a CSV file that contains the numbers of the contributions to open citations made by the twenty publishers introduced in the previous CSV file as of 24 January 2018, according to the data available through the Crossref API. The counts listed in this file refer to the number of publications for which each publisher has submitted metadata to Crossref that include the publication’s reference list. The categories 'closed', 'limited' and 'open' refer to publications for which the reference lists are not visible to anyone outside the Crossref Cited-by membership, are visible only to them and to Crossref Metadata Plus members, or are visible to all, respectively. In addition, the file also records the total number of publications for which the publisher has submitted metadata to Crossref, whether or not those metadata include the reference lists of those publications.

    The columns of the CSV file are the following ones:

    • publisher: the name of the publisher;
    • open: the number of publications in Crossref with an 'open' visibility for their reference lists;
    • limited: the number of publications in Crossref with a 'limited' visibility for their reference lists;
    • closed: the number of publications in Crossref with a 'closed' visibility for their reference lists;
    • overall_deposited: the overall number of publications for which the publisher has submitted metadata to Crossref.
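
    Because overall_deposited also counts publications deposited without reference lists, the three visibility counts need not add up to it. A small sketch of how the columns above could be turned into shares per publisher follows; the function name and the counts are hypothetical.

```python
def reference_visibility_shares(open_count: int, limited: int, closed: int,
                                overall_deposited: int) -> dict:
    """Express the visibility counts as fractions of all deposited publications."""
    if overall_deposited <= 0:
        raise ValueError("overall_deposited must be positive")
    with_references = open_count + limited + closed
    return {
        # Share of deposits that include a reference list at all.
        "with_references": with_references / overall_deposited,
        "open": open_count / overall_deposited,
        "limited": limited / overall_deposited,
        "closed": closed / overall_deposited,
    }


# Hypothetical counts for a single publisher, for illustration only.
shares = reference_visibility_shares(600, 100, 100, 1000)
```
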
  13. Python Codes for Data Analysis of The Impact of COVID-19 on Technical...

    • figshare.com
    Updated Aug 1, 2022
    Cite
    Elizabeth Szkirpan (2022). Python Codes for Data Analysis of The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.6084/m9.figshare.20416092.v1
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    Figshare http://figshare.com/
    Authors
    Elizabeth Szkirpan
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).

  14. Cause of Loss Historical Files

    • catalog.data.gov
    • cloud.csiss.gmu.edu
    • +1more
    Updated Apr 21, 2025
    Cite
    U. S. Department of Agriculture (2025). Cause of Loss Historical Files [Dataset]. https://catalog.data.gov/dataset/cause-of-loss-historical-files
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    U. S. Department of Agriculture
    Description

    The Risk Management Agency (RMA) Cause of Loss Historical Files summarize participation information broken down by the causes of loss. Each link contains a ZIP file with compressed CSV flat files that can be imported into any standard spreadsheet or database for further analysis. A record description file is located in each subfolder.
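
    Flat files packaged this way can also be read directly from the archive without unpacking it to disk. The sketch below uses only the Python standard library; the member name, the field values, the comma delimiter, and the UTF-8 encoding are assumptions for illustration, so the record description file should be consulted for the real layout.

```python
import csv
import io
import zipfile


def iter_zip_rows(zip_bytes: bytes, delimiter: str = ","):
    """Yield (member_name, record) for every row of every file in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            with archive.open(name) as raw:
                # Assumption: the flat files are UTF-8 text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for record in csv.reader(text, delimiter=delimiter):
                    yield name, record


# Build a tiny in-memory archive with made-up rows so the sketch runs on its own.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.csv", "2020,Drought,125.5\n2020,Hail,40.0\n")

rows = list(iter_zip_rows(buffer.getvalue()))
```
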

  15. Data articles in journals

    • data.niaid.nih.gov
    Updated Sep 22, 2023
    Cite
    Loureiro, Vanesa (2023). Data articles in journals [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3753373
    Dataset updated
    Sep 22, 2023
    Dataset provided by
    Balsa-Sanchez, Carlota
    Loureiro, Vanesa
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Version: 5

    Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

    Date of data collection: 2023/09/05

    General description: Publishing datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers. File list:

    • data_articles_journal_list_v5.xlsx: full list of 140 academic journals in which data papers and/or software papers could be published
    • data_articles_journal_list_v5.csv: full list of 140 academic journals in which data papers and/or software papers could be published

    Relationship between files: both files have the same information. Two different formats are offered to improve reuse

    Type of version of the dataset: final processed version

    Versions of the files: 5th version - Information updated: number of journals, URL, document types associated to a specific journal.

    Version: 4

    Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

    Date of data collection: 2022/12/15

    General description: Publishing datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers. File list:

    • data_articles_journal_list_v4.xlsx: full list of 140 academic journals in which data papers and/or software papers could be published
    • data_articles_journal_list_v4.csv: full list of 140 academic journals in which data papers and/or software papers could be published

    Relationship between files: both files have the same information. Two different formats are offered to improve reuse

    Type of version of the dataset: final processed version

    Versions of the files: 4th version - Information updated: number of journals, URL, document types associated to a specific journal, publishers normalization and simplification of document types - Information added : listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR), Scopus and Web of Science (WOS), Journal Master List.

    Version: 3

    Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

    Date of data collection: 2022/10/28

    General description: Publishing datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers. File list:

    • data_articles_journal_list_v3.xlsx: full list of 124 academic journals in which data papers and/or software papers could be published
    • data_articles_journal_list_3.csv: full list of 124 academic journals in which data papers and/or software papers could be published

    Relationship between files: both files have the same information. Two different formats are offered to improve reuse

    Type of version of the dataset: final processed version

    Versions of the files: 3rd version - Information updated: number of journals, URL, document types associated to a specific journal, publishers normalization and simplification of document types - Information added : listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR).

    Erratum - Data articles in journals Version 3:

    • Botanical Studies -- ISSN 1999-3110 -- JCR (JIF) Q2
    • Data -- ISSN 2306-5729 -- JCR (JIF) n/a
    • Data in Brief -- ISSN 2352-3409 -- JCR (JIF) n/a

    Version: 2

    Author: Francisco Rubio, Universitat Politècnica de València.

    Date of data collection: 2020/06/23

    General description: Publishing datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers. File list:

    • data_articles_journal_list_v2.xlsx: full list of 56 academic journals in which data papers and/or software papers could be published
    • data_articles_journal_list_v2.csv: full list of 56 academic journals in which data papers and/or software papers could be published

    Relationship between files: both files have the same information. Two different formats are offered to improve reuse

    Type of version of the dataset: final processed version

    Versions of the files: 2nd version - Information updated: number of journals, URL, document types associated to a specific journal, publishers normalization and simplification of document types - Information added : listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Scimago Journal and Country Rank (SJR)

    Total size: 32 KB

    Version 1: Description

    This dataset contains a list of journals that publish data articles, code, software articles and database articles.

    The search strategy in DOAJ and Ulrichsweb was the search for the word data in the title of the journals. Acknowledgements: Xaquín Lores Torres for his invaluable help in preparing this dataset.

  16. Ransomware and user samples for training and validating ML models

    • data.mendeley.com
    Updated Sep 17, 2021
    + more versions
    Cite
    Eduardo Berrueta (2021). Ransomware and user samples for training and validating ML models [Dataset]. http://doi.org/10.17632/yhg5wk39kf.2
    Dataset updated
    Sep 17, 2021
    Authors
    Eduardo Berrueta
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ransomware has been considered a significant threat to most enterprises for the past few years. In scenarios wherein users can access all files on a shared server, one infected host is capable of locking the access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2500 h of ‘not infected’ traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). This paper provides a validation of the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.

    This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each one in a separated folder.

    The folders are named NxSy, where x is the number of 1-second intervals per sample and y is the sliding step in seconds.
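
    The naming rule can be decoded programmatically when iterating over the folders. The following is a small sketch of that parsing; the class and function names are hypothetical and not part of the dataset.

```python
import re
from typing import NamedTuple


class WindowConfig(NamedTuple):
    intervals: int      # x: number of 1-second intervals per sample
    step_seconds: int   # y: sliding step in seconds


_PATTERN = re.compile(r"^N(\d+)S(\d+)$")


def parse_folder_name(name: str) -> WindowConfig:
    """Split a folder name such as 'N10S10/' into its two integer settings."""
    match = _PATTERN.match(name.rstrip("/"))
    if match is None:
        raise ValueError(f"not an NxSy folder name: {name!r}")
    return WindowConfig(int(match.group(1)), int(match.group(2)))


config = parse_folder_name("N10S10/")
```
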

    Each folder (for example N10S10/) contains:

    • tree.py -> Python script with the Tree model.
    • ensemble.json -> JSON file with the information about the Ensemble model.
    • NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (1, 2 or 3).
    • N10S10.csv -> All samples used for training each model in this folder. It is in csv format for use in the bigML application.
    • zeroDays.csv -> All zero-day samples used for testing each model in this folder. It is in csv format for use in the bigML application.
    • userSamples_test -> All samples used for validating each model in this folder. It is in csv format for use in the bigML application.
    • userSamples_train -> User samples used for training the models.
    • ransomware_train -> Ransomware samples used for training the models.
    • scaler.scaler -> Standard Scaler from the python library, used to scale the samples.
    • zeroDays_notFiltered -> Folder with the zeroDay samples.

    In the case of N30S30 folder, there is an additional folder (SMBv2SMBv3NFS) with the samples extracted from the SMBv2, SMBv3 and NFS traffic traces. There are more binaries than the ones presented in the article, but it is because some of them are not "unseen" binaries (the families are present in the training set).

    The files containing samples (NxSy.csv, zeroDays.csv and userSamples_test.csv) are structured as follows:

    • Each line is one sample.
    • Each sample has 3*T features and the label (1 if it is 'infected' sample and 0 if it is not).
    • The features are separated by ',' because it is a csv file.
    • The last column is the label of the sample.
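
    Given that layout, a sample file can be split into a feature matrix and a label vector with the standard library alone. This is a minimal sketch that assumes the files have no header row and that every value is numeric; the two sample rows use T = 2 and are made up.

```python
import csv
import io


def load_samples(handle):
    """Split each CSV line into its feature values and the trailing label."""
    features, labels = [], []
    for record in csv.reader(handle):
        if not record:
            continue  # ignore blank lines
        *values, label = record
        features.append([float(v) for v in values])
        # The label may be written as "1" or "1.0"; both map to the integer 1.
        labels.append(int(float(label)))
    return features, labels


def has_expected_width(features, t: int) -> bool:
    """Check that every sample carries exactly 3*T features."""
    return all(len(row) == 3 * t for row in features)


# Made-up rows with T = 2 (six features plus the label), for illustration only.
sample = io.StringIO("0.1,0.2,0.3,0.4,0.5,0.6,1\n0,0,0,0,0,0,0\n")
features, labels = load_samples(sample)
```

    Checking the width against 3*T before training is a cheap way to catch a file loaded with the wrong T configuration.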

    Additionally, we have placed two pcap files in the root directory. These are the traces used to compare both versions of SMB.

  17. Students Test Data

    • kaggle.com
    Updated Sep 12, 2023
    Cite
    ATHARV BHARASKAR (2023). Students Test Data [Dataset]. https://www.kaggle.com/datasets/atharvbharaskar/students-test-data
    Dataset updated
    Sep 12, 2023
    Dataset provided by
    Kaggle http://kaggle.com/
    Authors
    ATHARV BHARASKAR
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0 http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Dataset Overview: This dataset pertains to the examination results of students who participated in a series of academic assessments at a fictitious educational institution named "University of Exampleville." The assessments were administered across various courses and academic levels, with a focus on evaluating students' performance in general management and domain-specific topics.

    Columns: The dataset comprises 12 columns, each representing specific attributes and performance indicators of the students. These columns encompass information such as the students' names (which have been anonymized), their respective universities, academic program names (including BBA and MBA), specializations, the semester of the assessment, the type of examination domain (general management or domain-specific), general management scores (out of 50), domain-specific scores (out of 50), total scores (out of 100), student ranks, and percentiles.

    Data Collection: The examination data was collected during a standardized assessment process conducted by the University of Exampleville. The exams were designed to assess students' knowledge and skills in general management and their chosen domain-specific subjects. It involved students from both BBA and MBA programs who were in their final year of study.

    Data Format: The dataset is available in a structured format, typically as a CSV file. Each row represents a unique student's performance in the examination, while columns contain specific information about their results and academic details.

    Data Usage: This dataset is valuable for analyzing and gaining insights into the academic performance of students pursuing BBA and MBA degrees. It can be used for various purposes, including statistical analysis, performance trend identification, program assessment, and comparison of scores across domains and specializations. Furthermore, it can be employed in predictive modeling or decision-making related to curriculum development and student support.

    Data Quality: The dataset has undergone preprocessing and anonymization to protect the privacy of individual students. Nevertheless, it is essential to use the data responsibly and in compliance with relevant data protection regulations when conducting any analysis or research.

    Data Format: The exam data is typically provided in a structured format, commonly as a CSV (Comma-Separated Values) file. Each row in the dataset represents a unique student's examination performance, and each column contains specific attributes and scores related to the examination. The CSV format allows for easy import and analysis using various data analysis tools and programming languages like Python, R, or spreadsheet software like Microsoft Excel.

    Here's a column-wise description of the dataset:

    Name OF THE STUDENT: The full name of the student who took the exam. (Anonymized)

    UNIVERSITY: The university where the student is enrolled.

    PROGRAM NAME: The name of the academic program in which the student is enrolled (BBA or MBA).

    Specialization: If applicable, the specific area of specialization or major that the student has chosen within their program.

    Semester: The semester or academic term in which the student took the exam.

    Domain: Indicates whether the exam was divided into two parts: general management and domain-specific.

    GENERAL MANAGEMENT SCORE (OUT of 50): The score obtained by the student in the general management part of the exam, out of a maximum possible score of 50.

    Domain-Specific Score (Out of 50): The score obtained by the student in the domain-specific part of the exam, also out of a maximum possible score of 50.

    TOTAL SCORE (OUT of 100): The total score obtained by adding the scores from the general management and domain-specific parts, out of a maximum possible score of 100.
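
    Since the total is defined as the sum of the two part scores, a quick consistency check is possible after loading the file. The sketch below assumes the CSV header uses exactly the column names given in the description above, which may not hold for the actual file; the two data rows are made up.

```python
import csv
import io

# Assumption: header names match the column-wise description verbatim.
GM = "GENERAL MANAGEMENT SCORE (OUT of 50)"
DS = "Domain-Specific Score (Out of 50)"
TOTAL = "TOTAL SCORE (OUT of 100)"


def inconsistent_rows(handle):
    """Return file line numbers whose scores are out of range or do not add up."""
    bad = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        gm, ds, total = float(row[GM]), float(row[DS]), float(row[TOTAL])
        in_range = 0 <= gm <= 50 and 0 <= ds <= 50
        if not in_range or abs(gm + ds - total) > 1e-9:
            bad.append(line_no)
    return bad


# Made-up rows for illustration only: the second one does not add up.
sample = io.StringIO(f"{GM},{DS},{TOTAL}\n40,35,75\n50,20,80\n")
bad_lines = inconsistent_rows(sample)
```
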

  18. Replication data for "High life satisfaction reported among small-scale...

    • dataverse.csuc.cat
    csv, txt
    Updated Feb 7, 2024
    Cite
    Eric Galbraith; Victoria Reyes Garcia (2024). Replication data for "High life satisfaction reported among small-scale societies with low incomes" [Dataset]. http://doi.org/10.34810/data904
    Available download formats: csv(1620), csv(7829), txt(7017), csv(227502)
    Dataset updated
    Feb 7, 2024
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Eric Galbraith; Victoria Reyes Garcia
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2021 - Oct 24, 2023
    Area covered
    Mongolia, Bulgan soum, Kumbungu, Ghana, Bassari country, Senegal, Shangri-la, China, Fiji, Ba, Tanzania, United Republic of, Mafia Island, Western highlands, Guatemala, Argentina, Puna, Nepal, Laprak, India, Darjeeling
    Dataset funded by
    European Commission
    Description

    This dataset was created in order to document self-reported life evaluations among small-scale societies that exist on the fringes of mainstream industrialized societies. The data were produced as part of the LICCI project, through fieldwork carried out by LICCI partners. The data include individual responses to a life satisfaction question, and household asset values. Data from Gallup World Poll and the World Values Survey are also included, as used for comparison.

    TABULAR DATA-SPECIFIC INFORMATION

    1. File name: LICCI_individual.csv
    Number of rows and columns: 2814, 7
    Variable list:
    • User, Site, village: identification of investigator and location
    • Well.being.general: numerical score for life satisfaction question
    • HH_Assets_US, HH_Assets_USD_capita: estimated value of representative assets in the household of respondent, total and per capita (accounting for number of household inhabitants)

    2. File name: LICCI_bySite.csv
    Number of rows and columns: 19, 8
    Variable list:
    • Site, N: site name and number of respondents at the site
    • SWB_mean, SWB_SD: mean and standard deviation of life satisfaction score
    • HHAssets_USD_mean, HHAssets_USD_sd: site mean and standard deviation of household asset value
    • PerCapAssets_USD_mean, PerCapAssets_USD_sd: site mean and standard deviation of per capita asset value

    3. File name: gallup_WVS_GDP_pk.csv
    Number of rows and columns: 146, 8
    Variable list:
    • Happiness Score, Whisker-high, Whisker-low: from Gallup World Poll as documented in World Happiness Report 2022.
    • GDP-PPP2017: Gross Domestic Product per capita for year 2020 at PPP (constant 2017 international $). Accessed May 2022.
    • pk: Produced capital per capita for year 2018 (in 2018 US$) for available countries, as estimated by the World Bank (accessed February 2022).
    • WVS7_mean, WVS7_std: Results of Question 49 in the World Values Survey, Wave 7.
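
    The site-level columns N, SWB_mean and SWB_SD appear to be aggregates of the individual responses, so they can in principle be recomputed from LICCI_individual.csv. The sketch below shows one way to do that with the standard library. It assumes the sample standard deviation (n - 1 denominator), which the description does not specify, and uses made-up site names and scores.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarise_by_site(handle):
    """Group life-satisfaction scores by Site and compute N, mean and SD."""
    scores = defaultdict(list)
    for row in csv.DictReader(handle):
        value = row["Well.being.general"]
        if value == "":
            continue  # skip respondents without a score
        scores[row["Site"]].append(float(value))

    summary = {}
    for site, values in scores.items():
        # Assumption: sample SD; undefined for a single respondent.
        sd = statistics.stdev(values) if len(values) > 1 else float("nan")
        summary[site] = {
            "N": len(values),
            "SWB_mean": statistics.mean(values),
            "SWB_SD": sd,
        }
    return summary


# Made-up sites and scores, for illustration only.
sample = io.StringIO(
    "Site,Well.being.general\n"
    "SiteA,6\n"
    "SiteA,8\n"
    "SiteB,5\n"
)
summary = summarise_by_site(sample)
```

    Comparing the recomputed values with LICCI_bySite.csv would show whether the published SD uses the sample or the population formula.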

  19. A Comprehensive Surface Water Quality Monitoring Dataset (1940-2023):...

    • figshare.com
    csv
    Updated Feb 23, 2025
    Cite
    Md. Rajaul Karim; Mahbubul Syeed; Ashifur Rahman; Khondkar Ayaz Rabbani; Kaniz Fatema; Razib Hayat Khan; Md Shakhawat Hossain; Mohammad Faisal Uddin (2025). A Comprehensive Surface Water Quality Monitoring Dataset (1940-2023): 2.82Million Record Resource for Empirical and ML-Based Research [Dataset]. http://doi.org/10.6084/m9.figshare.27800394.v2
    Available download formats: csv
    Dataset updated
    Feb 23, 2025
    Dataset provided by
    figshare
    Authors
    Md. Rajaul Karim; Mahbubul Syeed; Ashifur Rahman; Khondkar Ayaz Rabbani; Kaniz Fatema; Razib Hayat Khan; Md Shakhawat Hossain; Mohammad Faisal Uddin
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Description

    Water Quality Parameters: Ammonia, BOD, DO, Orthophosphate, pH, Temperature, Nitrogen, Nitrate.
    Countries/Regions: United States, Canada, Ireland, England, China.
    Years Covered: 1940-2023.
    Data Records: 2.82 million.

    Definition of Columns

    Country: Name of the water-body region.
    Area: Name of the area in the region.
    Waterbody Type: Type of the water-body source.
    Date: Date of the sample collection (dd-mm-yyyy).
    Ammonia (mg/l): Ammonia concentration.
    Biochemical Oxygen Demand (BOD) (mg/l): Oxygen demand measurement.
    Dissolved Oxygen (DO) (mg/l): Concentration of dissolved oxygen.
    Orthophosphate (mg/l): Orthophosphate concentration.
    pH (pH units): pH level of water.
    Temperature (°C): Temperature in Celsius.
    Nitrogen (mg/l): Total nitrogen concentration.
    Nitrate (mg/l): Nitrate concentration.
    CCME_Values: Calculated water quality index values using the CCME WQI model.
    CCME_WQI: Water Quality Index classification based on CCME_Values.

    Data Directory Description

    Category 1: Dataset

    Combined Data: This folder contains two files: Combined_dataset.csv and Summary.xlsx. The Combined_dataset.csv file includes all eight water quality parameter readings across five countries, with additional data for initial preprocessing steps like missing value handling, outlier detection, and other operations. It also contains the CCME Water Quality Index calculation for empirical analysis and ML-based research. The Summary.xlsx file provides a brief description of the datasets, including data distributions (e.g., maximum, minimum, mean, standard deviation).
      Combined_dataset.csv
      Summary.xlsx

    Country-wise Data: This folder contains separate country-based datasets in CSV files. Each file includes the eight water quality parameters for regional analysis. The Summary_country.xlsx file presents country-wise dataset descriptions with data distributions (e.g., maximum, minimum, mean, standard deviation).
      England_dataset.csv
      Canada_dataset.csv
      USA_dataset.csv
      Ireland_dataset.csv
      China_dataset.csv
      Summary_country.xlsx

    Category 2: Code

    Data processing and harmonization code (e.g., Language Conversion, Date Conversion, Parameter Naming and Unit Conversion, Missing Value Handling, WQI Measurement and Classification).
      Data_Processing_Harmonnization.ipynb
    The code used for Technical Validation (e.g., assessing the Data Distribution, Outlier Detection, Water Quality Trend Analysis, and Verifying the Application of the Dataset for the ML Models).
      Technical_Validation.ipynb

    Category 3: Data Collection Sources

    This category includes links to the selected dataset sources, which were used to create the dataset and are provided for further reconstruction or data formation.
      DataCollectionSources.xlsx

    Original Paper Title: A Comprehensive Dataset of Surface Water Quality Spanning 1940-2023 for Empirical and ML Adopted Research

    Abstract

    Assessment and monitoring of surface water quality are essential for food security, public health, and ecosystem protection. Although water quality monitoring is a known phenomenon, little effort has been made to offer a comprehensive and harmonized dataset for surface water at the global scale. This study presents a comprehensive surface water quality dataset that preserves spatio-temporal variability, integrity, consistency, and depth of the data to facilitate empirical and data-driven evaluation, prediction, and forecasting. The dataset is assembled from a range of sources, including regional and global water quality databases, water management organizations, and individual research projects from five prominent countries in the world, e.g., the USA, Canada, Ireland, England, and China. The resulting dataset consists of 2.82 million measurements of eight water quality parameters that span 1940 - 2023. This dataset can support meta-analysis of water quality models and can facilitate Machine Learning (ML) based data and model-driven investigation of the spatial and temporal drivers and patterns of surface water quality at a cross-regional to global scale.

    Note: Cite this repository and the original paper when using this dataset.
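
Because the `Date` column is documented as dd-mm-yyyy, a reader that guesses the format may swap day and month for dates such as 03-02-2015. The sketch below parses the date with an explicit format string and treats empty cells as missing values. It assumes the CSV header strings match the column names listed above exactly; the helper names `to_float` and `read_records` are hypothetical, and the rows in `SAMPLE` are invented for illustration.

```python
import csv
import io
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"  # dd-mm-yyyy, as stated in the column definitions


def to_float(text):
    """Convert a cell to float, mapping an empty cell to None."""
    text = text.strip()
    return float(text) if text else None


def read_records(handle):
    """Read rows, parsing the date explicitly and keeping two parameters."""
    records = []
    for row in csv.DictReader(handle):
        records.append({
            "country": row["Country"],
            "date": datetime.strptime(row["Date"], DATE_FORMAT).date(),
            "ph": to_float(row["pH (pH units)"]),
            "do_mg_l": to_float(row["Dissolved Oxygen (DO) (mg/l)"]),
        })
    return records


# Invented rows, for illustration only (not real measurements).
SAMPLE = """Country,Date,pH (pH units),Dissolved Oxygen (DO) (mg/l)
Ireland,03-02-2015,7.4,9.1
Canada,21-11-1998,,10.3
"""

records = read_records(io.StringIO(SAMPLE))
for record in records:
    print(record)
```

For the real file, passing `open("Combined_dataset.csv", newline="")` in place of the in-memory sample should work if the headers match; reading 2.82 million rows this way keeps everything in memory, so a streaming loop may be preferable.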

  20. Annotated 12 lead ECG dataset

    • zenodo.org
    zip
    Updated Jun 7, 2021
    Cite
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro (2021). Annotated 12 lead ECG dataset [Dataset]. http://doi.org/10.5281/zenodo.3625007
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    # Annotated 12 lead ECG dataset
    
    Contains 827 ECG tracings from different patients, annotated by several cardiologists, residents and medical students.
    It is used as the test set in the paper:
    "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network".
    
    It contains annotations for 6 different ECG abnormalities:
    - 1st degree AV block (1dAVb);
    - right bundle branch block (RBBB);
    - left bundle branch block (LBBB);
    - sinus bradycardia (SB);
    - atrial fibrillation (AF); and,
    - sinus tachycardia (ST).
    
    ## Folder content:
    
    - `ecg_tracings.hdf5`: HDF5 file containing a single dataset named `tracings`. This dataset is a 
    `(827, 4096, 12)` tensor. The first dimension corresponds to the 827 different exams from different 
    patients; the second dimension corresponds to the 4096 signal samples; the third dimension to the 12
    different leads of the ECG exam. 
    
    The signals are sampled at 400 Hz. Some signals originally have a duration of 
    10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples).
    In order to make them all have the same size (4096 samples), we fill them with zeros
    on both sides. For instance, for a 7-second ECG signal with 2800 samples we include 648
    samples at the beginning and 648 samples at the end, yielding 4096 samples that are then saved
    in the hdf5 dataset. All signals are represented as floating-point numbers at the scale 1e-4V: so they should
    be multiplied by 1000 in order to obtain the signals in V.
    
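    In Python, one can read this file using the following sequence:
    ```python
    import h5py
    import numpy as np

    # Path to the HDF5 file shipped with the dataset
    path_to_tracings = "ecg_tracings.hdf5"
    with h5py.File(path_to_tracings, "r") as f:
      x = np.array(f['tracings'])  # shape (827, 4096, 12)
    ```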
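
The zero-padding arithmetic described above can be sketched as follows. This is a minimal illustration for a single lead stored as a plain Python list, not the authors' preprocessing code. The function names `padding_for` and `pad_to_length` are hypothetical, and putting any odd leftover sample on the right-hand side is an assumption (for the two durations mentioned, the padding splits evenly).

```python
def padding_for(duration_s, rate_hz=400, target_len=4096):
    """Return (left, right) zero counts needed to reach target_len samples."""
    n_samples = duration_s * rate_hz
    missing = target_len - n_samples
    if missing < 0:
        raise ValueError("signal is longer than the target length")
    left = missing // 2
    return left, missing - left  # assumption: odd remainder goes to the right


def pad_to_length(signal, target_len=4096):
    """Zero-pad one lead symmetrically so that it has target_len samples."""
    missing = target_len - len(signal)
    if missing < 0:
        raise ValueError("signal is longer than the target length")
    left = missing // 2
    right = missing - left
    return [0.0] * left + list(signal) + [0.0] * right


print(padding_for(7))   # 7 s at 400 Hz  -> 2800 samples
print(padding_for(10))  # 10 s at 400 Hz -> 4000 samples

padded = pad_to_length([1.0] * 2800)
print(len(padded))
```

Knowing the padding widths also makes it possible to slice the zeros back off, for example `padded[648:648 + 2800]` for a 7-second recording.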
    
    - The file `attributes.csv` contains basic patient attributes: sex (M or F) and age. It
    contains 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` corresponds to the i-th line.
    - `annotations/`: folder containing annotations in csv format. Each csv file contains 827 lines (plus the header).
    In all csv files, the i-th line corresponds to the i-th tracing in `ecg_tracings.hdf5`.
    The csv files all have 6 columns `1dAVb, RBBB, LBBB, SB, AF, ST`
    indicating whether the annotator detected the abnormality in the ECG (`=1`) or not (`=0`).
     1. `cardiologist[1,2].csv` contain annotations from two different cardiologists.
     2. `gold_standard.csv` gold standard annotation for this test dataset. When cardiologist 1 and cardiologist 2
     agreed, the common diagnosis was considered the gold standard. In cases where there was any disagreement, a 
     third senior specialist, aware of the annotations from the other two, decided the diagnosis. 
     3. `dnn.csv` prediction from the deep neural network described in 
     "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network". The threshold is set in such a way 
     that it maximizes the F1 score.
     4. `cardiology_residents.csv` annotations from two 4th year cardiology residents (each annotated half of the dataset).
     5. `emergency_residents.csv` annotations from two 3rd year emergency residents (each annotated half of the dataset).
     6. `medical_students.csv` annotations from two 5th year medical students (each annotated half of the dataset).
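
Since every annotation file shares the same six columns and row order, any annotator can be compared against `gold_standard.csv` row by row. The sketch below computes a per-abnormality F1 score, F1 = 2·TP / (2·TP + FP + FN), from two such tables. It assumes every cell is `0` or `1` as described above; the function name `f1_per_label` is hypothetical, returning 0.0 when a label never occurs is a convention chosen here, and both sample tables are invented for illustration.

```python
import csv
import io

LABELS = ["1dAVb", "RBBB", "LBBB", "SB", "AF", "ST"]


def f1_per_label(gold_rows, pred_rows):
    """Per-label F1 between two row-aligned 0/1 annotation tables."""
    scores = {}
    for label in LABELS:
        tp = fp = fn = 0
        for gold, pred in zip(gold_rows, pred_rows):
            g, p = int(gold[label]), int(pred[label])
            if g == 1 and p == 1:
                tp += 1
            elif g == 0 and p == 1:
                fp += 1
            elif g == 1 and p == 0:
                fn += 1
        denom = 2 * tp + fp + fn
        # Convention chosen here: F1 is 0.0 when the label never appears.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Invented tables, for illustration only (not taken from the dataset).
GOLD = """1dAVb,RBBB,LBBB,SB,AF,ST
1,0,0,0,0,0
0,1,0,0,0,0
1,0,0,0,1,0
0,0,0,0,0,0
"""
PRED = """1dAVb,RBBB,LBBB,SB,AF,ST
1,0,0,0,0,0
0,1,0,0,0,0
0,0,0,0,1,1
0,0,0,0,0,0
"""

gold_rows = list(csv.DictReader(io.StringIO(GOLD)))
pred_rows = list(csv.DictReader(io.StringIO(PRED)))
scores = f1_per_label(gold_rows, pred_rows)
print(scores)
```

The same function could be applied to any pair of files in `annotations/`, for instance `gold_standard.csv` against `dnn.csv`, if their headers match the column names listed above.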
    
ESS-DIVE Reporting Format for Comma-separated Values (CSV) File Structure

Explore at:
17 scholarly articles cite this dataset (View in Google Scholar)
