43 datasets found
  1. Software code quality and source code metrics dataset

    • data.mendeley.com
    • narcis.nl
    Updated Feb 17, 2021
    Cite
    Sayed Mohsin Reza (2021). Software code quality and source code metrics dataset [Dataset]. http://doi.org/10.17632/77p6rzb73n.2
    Explore at:
    Dataset updated
    Feb 17, 2021
    Authors
    Sayed Mohsin Reza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains code quality and source code metrics information for 60 versions across 10 different repositories. The dataset is extracted at 3 levels: (1) Class, (2) Method, and (3) Package. The dataset was created by analyzing 9,420,246 lines of code and 173,237 classes. The provided dataset contains one quality_attributes folder and three associated files: repositories.csv, versions.csv, and attribute-details.csv. The first file (repositories.csv) contains general information (repository name, repository URL, number of commits, stars, forks, etc.) to convey size, popularity, and maintainability. The file versions.csv contains general information (version unique ID, number of classes, packages, external classes, external packages, version repository link) to provide an overview of the versions and of how the repository grows over time. The file attribute-details.csv contains detailed information (attribute name, attribute short form, category, and description) about the extracted static analysis metrics and code quality attributes. The short form is used in the real dataset as a unique identifier to show values for packages, classes, and methods.
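
    A minimal sketch of how the three metadata files might be loaded and linked with pandas; the file paths and the join column are assumptions based on the description above, not confirmed details of the dataset.

    import pandas as pd

    # Assumed file locations; adjust to the layout of the downloaded archive.
    repos = pd.read_csv("repositories.csv")            # repository name, URL, commits, stars, forks, ...
    versions = pd.read_csv("versions.csv")             # one row per analyzed version
    attributes = pd.read_csv("attribute-details.csv")  # metric short forms, categories, descriptions

    # Hypothetical join key: versions.csv carries a repository name/link column.
    overview = versions.merge(repos, on="repository", how="left")
    print(overview.head())

    # Look up what each metric short form means before reading the quality_attributes files.
    print(attributes.head())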

  2. Performance Metrics for Workforce Development Programs

    • catalog.data.gov
    • data.cityofnewyork.us
    • +1 more
    Updated Sep 2, 2023
    + more versions
    Cite
    data.cityofnewyork.us (2023). Performance Metrics for Workforce Development Programs [Dataset]. https://catalog.data.gov/dataset/performance-metrics-for-workforce-development-programs
    Explore at:
    Dataset updated
    Sep 2, 2023
    Dataset provided by
    data.cityofnewyork.us
    Description

    The report contains thirteen (13) performance metrics for the City's workforce development programs. Each metric can be broken down by three demographic types (gender, race/ethnicity, and age group) as well as by program target population (e.g., youth and young adults, NYCHA communities). This report is a key output of an integrated data system, built by the Mayor's Office for Economic Opportunity (NYC Opportunity), that collects, integrates, and generates disaggregated data. Currently, the report is generated from the integrated database, which incorporates data from 18 workforce development programs managed by 5 City agencies. There has been no single "workforce development system" in the City of New York. Instead, many discrete public agencies directly manage or fund local partners to deliver a range of different services, sometimes tailored to specific populations. As a result, program data have historically been fragmented as well, making it challenging to develop insights based on a comprehensive picture. To overcome this, NYC Opportunity collects data from the 5 City agencies and built the integrated database, which begins to provide a complete picture of how participants move through the system onto a career pathway. Each row represents a count of unique individuals for a specific performance metric, program target population, demographic group, and period. For example, if the Metric Value is 2000 with Clients Served (Metric Name), NYCHA Communities (Program Target Population), Asian (Subgroup), and 2019 (Period), you can say that "In 2019, 2,000 Asian individuals participated in programs targeting NYCHA communities." Please refer to the Workforce Data Portal for further data guidance (https://workforcedata.nyc.gov/en/data-guidance) and interactive visualizations for this report (https://workforcedata.nyc.gov/en/common-metrics).
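
    As an illustration of how a row is read, the sketch below filters the example described above with pandas; the file name and the exact column headers are assumptions based on the field names in the description.

    import pandas as pd

    # Hypothetical export of the report; column names follow the fields named above.
    df = pd.read_csv("performance_metrics_workforce_development.csv")

    row = df[
        (df["Metric Name"] == "Clients Served")
        & (df["Program Target Population"] == "NYCHA Communities")
        & (df["Subgroup"] == "Asian")
        & (df["Period"] == 2019)
    ]
    # Each row is a count of unique individuals for that metric, population, subgroup, and period.
    print(row["Metric Value"])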

  3. Early Indicator for Data Sharing and Reuse - Supplementary Tables.xlsx

    • figshare.com
    xlsx
    Updated Apr 28, 2023
    Cite
    Agata Piekniewska; Laurel Haak; Darla Henderson; Katherine McNeill; Anita Bandrowski; Yvette Seger (2023). Early Indicator for Data Sharing and Reuse - Supplementary Tables.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.22720399.v1
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Apr 28, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Agata Piekniewska; Laurel Haak; Darla Henderson; Katherine McNeill; Anita Bandrowski; Yvette Seger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These data were generated for an investigation of research data repository (RDR) mentions in biomedical research articles.

    Supplementary Table 1 is a discrete subset of SciCrunch RDRs used to study RDR mentions in the biomedical literature. We generated this list by starting with the top 1000 entries in the SciCrunch database, measured by citations; removing entries for organizations (such as universities without a corresponding RDR) and non-relevant tools (such as reference managers); updating links; and consolidating duplicates resulting from RDR mergers and name variations. The resulting list of 737 RDRs is based on this source list of RDRs in the SciCrunch database. The file includes the Research Resource Identifier (RRID), the RDR name, and a link to the RDR record in the SciCrunch database.

    Supplementary Table 2 shows the RDRs, associated journals, and article-mention pairs (records) with text snippets extracted from mined Methods text in 2020 PubMed articles. The dataset has 4 components. The first shows the list of repositories with RDR mentions, and includes the Research Resource Identifier (RRID), the RDR name, the number of articles that mention the RDR, and a link to the record in the SciCrunch database. The second shows the list of journals in the study set with at least 1 RDR mention, and includes the Journal ID, name, ESSN/ISSN, the total count of publications in 2020, the number of articles that had text available to mine, the number of article-mention pairs (records), the number of articles with RDR mentions, the number of unique RDRs mentioned, and the percentage of articles with minable text. The third shows the top 200 journals by RDR mention, normalized by the proportion of articles with available text to mine, with the same metadata as the second table. The fourth shows text snippets for each RDR mention, and includes the RRID, RDR name, PubMed ID (PMID), DOI, article publication date, journal name, journal ID, ESSN/ISSN, article title, and snippet.

  4. Data from: dices

    • tensorflow.org
    Updated Sep 3, 2024
    Cite
    (2024). dices [Dataset]. https://www.tensorflow.org/datasets/catalog/dices
    Explore at:
    Dataset updated
    Sep 3, 2024
    Description

    The Diversity in Conversational AI Evaluation for Safety (DICES) dataset

    Machine learning approaches are often trained and evaluated with datasets that require a clear separation between positive and negative examples. This approach overly simplifies the natural subjectivity present in many tasks and content items. It also obscures the inherent diversity in human perceptions and opinions. Often, tasks that attempt to preserve the variance in content and diversity in humans are quite expensive and laborious. To fill this gap and facilitate more in-depth model performance analyses, we propose the DICES dataset: a unique dataset with diverse perspectives on the safety of AI-generated conversations. We focus on the task of safety evaluation of conversational AI systems. The DICES dataset contains detailed demographic information about each rater, extremely high replication of unique ratings per conversation to ensure the statistical significance of further analyses, and rater votes encoded as distributions across different demographics to allow for in-depth exploration of different rating aggregation strategies.

    This dataset is well suited to observe and measure variance, ambiguity and diversity in the context of safety of conversational AI. The dataset is accompanied by a paper describing a set of metrics that show how rater diversity influences the safety perception of raters from different geographic regions, ethnicity groups, age groups and genders. The goal of the DICES dataset is to be used as a shared benchmark for safety evaluation of conversational AI systems.

    CONTENT WARNING: This dataset contains adversarial examples of conversations that may be offensive.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the DICES dataset from the TFDS catalog and print a few examples.
    ds = tfds.load('dices', split='train')
    for ex in ds.take(4):
      print(ex)
    

    See the guide for more information on tensorflow_datasets.

  5. Data from: A global high-resolution and bias-corrected dataset of CMIP6...

    • zenodo.org
    bin
    Updated Sep 20, 2024
    Cite
    Qinqin Kong; Matthew Huber (2024). A global high-resolution and bias-corrected dataset of CMIP6 projected heat stress metrics [Dataset]. http://doi.org/10.5281/zenodo.13799897
    Explore at:
    bin (available download formats)
    Dataset updated
    Sep 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Qinqin Kong; Matthew Huber
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Motivation

    Increasing heat stress due to climate change poses significant risks to human health and can lead to widespread social and economic consequences. Evaluating these impacts requires reliable datasets of heat stress projections.

    Data Record

    We present a global dataset projecting future dry-bulb, wet-bulb, and wet-bulb globe temperatures under 1-4°C global warming scenarios (at 0.5°C intervals) relative to the preindustrial era, using outputs from 16 CMIP6 global climate models (GCMs) (Table 1). All variables were retrieved from the historical and SSP585 scenarios which were selected to maximize the warming signal.

    The dataset was bias-corrected against ERA5 reanalysis by incorporating the GCM-simulated climate change signal onto the ERA5 baseline (1950-1976) at a 3-hourly frequency. It therefore includes a 27-year sample for each GCM under each warming target.

    The data is provided at a fine spatial resolution of 0.25° x 0.25° and a temporal resolution of 3 hours, and is stored in a self-describing NetCDF format. Filenames follow the pattern "VAR_bias_corrected_3hr_GCM_XC_yyyy.nc", where:

    • "VAR" represents the variable (Ta, Tw, WBGT for dry-bulb, wet-bulb, and wet-bulb globe temperature, respectively),

    • "GCM" denotes the CMIP6 GCM name,

    • "X" indicates the warming target compared to the preindustrial period,

    • "yyyy" represents the year index (0001-0027) of the 27-year sample
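
    As an illustration of the naming convention, the sketch below assembles one filename and opens it with xarray; the component values are placeholders, and the use of xarray is an assumption (any NetCDF reader will work).

    import xarray as xr

    # Placeholder components for "VAR_bias_corrected_3hr_GCM_XC_yyyy.nc".
    var = "WBGT"         # Ta, Tw, or WBGT
    gcm = "ACCESS-CM2"   # one of the 16 CMIP6 GCMs in Table 1
    warming = "2"        # warming target X in degrees C above preindustrial
    year = "0001"        # year index within the 27-year sample

    fname = f"{var}_bias_corrected_3hr_{gcm}_{warming}C_{year}.nc"
    ds = xr.open_dataset(fname)  # 3-hourly fields on the 0.25 degree x 0.25 degree grid
    print(ds)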

    Table 1. CMIP6 GCMs used for generating the dataset for Ta, Tw and WBGT (GCM, realization, grid spacing):

    • ACCESS-CM2, r1i1p1f1, 1.25° x 1.875°
    • BCC-CSM2-MR, r1i1p1f1, 1.1° x 1.125°
    • CanESM5, r1i1p2f1, 2.8° x 2.8°
    • CMCC-CM2-SR5, r1i1p1f1, 0.94° x 1.25°
    • CMCC-ESM2, r1i1p1f1, 0.94° x 1.25°
    • CNRM-CM6-1, r1i1p1f2, 1.4° x 1.4°
    • EC-Earth3, r1i1p1f1, 0.7° x 0.7°
    • GFDL-ESM4, r1i1p1f1, 1.0° x 1.25°
    • HadGEM3-GC31-LL, r1i1p1f3, 1.25° x 1.875°
    • HadGEM3-GC31-MM, r1i1p1f3, 0.55° x 0.83°
    • KACE-1-0-G, r1i1p1f1, 1.25° x 1.875°
    • KIOST-ESM, r1i1p1f1, 1.9° x 1.9°
    • MIROC-ES2L, r1i1p1f2, 2.8° x 2.8°
    • MIROC6, r1i1p1f1, 1.4° x 1.4°
    • MPI-ESM1-2-HR, r1i1p1f1, 0.93° x 0.93°
    • MPI-ESM1-2-LR, r1i1p1f1, 1.85° x 1.875°

    Data Access

    An inventory of the dataset is available in this repository. The complete dataset, approximately 57 TB in size, is freely accessible via Purdue Fortress' long-term archive through Globus at Globus Link. After clicking the link, users may be prompted to log in with a Purdue institutional Globus account. You can switch to your institutional account, or log in via a personal Globus ID, Gmail, GitHub handle, or ORCID ID. Alternatively, the dataset can be accessed by searching for the universally unique identifier (UUID): "6538f53a-1ea7-4c13-a0cf-10478190b901" in Globus.

    Dataset Validation

    We validate the bias-correction method and show that it significantly enhances the GCMs' accuracy in reproducing both the annual average and the full range of quantiles for all metrics within an ERA5 reference climate state. This dataset is expected to support future research on projected changes in mean and extreme heat stress and the assessment of related health and socio-economic impacts.

    For a detailed introduction to the dataset and its validation, please refer to our data descriptor currently under review at Scientific Data. We will update this information upon publication.









  6. Kudos dataset

    • figshare.com
    txt
    Updated May 30, 2023
    + more versions
    Cite
    Mojisola Helen Erdt; Htet Htet Aung; Ashley Sara Aw; Charlie Rapple; Yin-Leng Theng (2023). Kudos dataset [Dataset]. http://doi.org/10.6084/m9.figshare.4272446.v3
    Explore at:
    txt (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mojisola Helen Erdt; Htet Htet Aung; Ashley Sara Aw; Charlie Rapple; Yin-Leng Theng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Kudos dataset (extracted from Kudos in February 2016) is analysed in the research article titled "Analysing researchers' outreach efforts and the association with publication metrics: A case study of Kudos". This research paper is the result of a joint research collaboration between Kudos and CHESS, Nanyang Technological University, Singapore. Kudos made funds available to CHESS to perform the study and also provided the dataset used for the analysis.

    In recent years, social media and scholarly collaboration networks have become increasingly accepted as effective tools for discovering and sharing research. Altmetrics are also becoming more common, as they reflect impact quickly, are openly accessible, and represent both academic and lay audiences, unlike traditional metrics such as citation counts. As a researcher, it remains challenging to know whether efforts to increase the visibility and outreach of your research on social media are associated with improved publication metrics.

    In this paper, we analyse the effectiveness of common online channels used for sharing publications, using Kudos (https://www.growkudos.com, launched in May 2014), a web-based service that aims to help researchers increase the outreach of their publications, as a case study. We extracted a dataset from Kudos of 20,775 unique publications that had been claimed by authors, and for which actions had been taken to explain or share via Kudos. For 4,867 of these, full text download data from publishers was available. Our findings show that researchers are most likely to share their work on Facebook, but links shared on Twitter are most likely to be clicked on. A Mann-Whitney U test revealed that a treatment group (publications having actions in Kudos) had a significantly higher median of 149 full text downloads per publication (23.1% more) than a control group (having no actions in Kudos), with a median of 121 full text downloads per publication. These findings suggest that performing actions on publications, such as sharing, explaining, or enriching, could help to increase the number of full text downloads of a publication.

    The DOIs of the publications in the dataset have been anonymised to protect the privacy of the users in Kudos. A readme text file is provided describing the data fields of the four datasets. All fields in the CSV files should be imported (e.g., into Excel) as text values.
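
    A minimal sketch of the statistical comparison described above (a Mann-Whitney U test on full text downloads between publications with and without Kudos actions); the file name and column names are hypothetical, and the actual field names are documented in the readme distributed with the dataset.

    import pandas as pd
    from scipy.stats import mannwhitneyu

    df = pd.read_csv("kudos_publications.csv")  # hypothetical file and column names
    treatment = df.loc[df["has_kudos_actions"] == 1, "full_text_downloads"]
    control = df.loc[df["has_kudos_actions"] == 0, "full_text_downloads"]

    # Two-sided Mann-Whitney U test comparing download counts across the two groups.
    stat, p_value = mannwhitneyu(treatment, control, alternative="two-sided")
    print(stat, p_value, treatment.median(), control.median())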

  7. Forest Health Protection Tree Species Metrics Basal Area

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +5 more
    Updated Jul 11, 2025
    + more versions
    Cite
    U.S. Forest Service (2025). Forest Health Protection Tree Species Metrics Basal Area [Dataset]. https://catalog.data.gov/dataset/forest-health-protection-tree-species-metrics-basal-area
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    U.S. Forest Service
    Description

    Basal Area (BA). 30 meter pixel resolution. Data represents forest conditions circa 2002. These data are a product of a multi-year effort by the FHTET (Forest Health Technology Enterprise Team) Remote Sensing Program to develop raster datasets of forest parameters for each of the tree species measured in the Forest Service’s Forest Inventory and Analysis (FIA) program. This dataset was created to support the 2013–2027 National Insect and Disease Risk Map (NIDRM) assessment. The statistical modeling approach used data-mining software and an archive of geospatial information to find the complex relationships between GIS layers and the presence/abundance of tree species as measured in over 300,000 FIA plot locations. Unique statistical models were developed from predictor layers consisting of climate, terrain, soils, and satellite imagery. Modeled basal area (BA) and stand density index (SDI) datasets for individual tree species were further post-processed to 1) match BA and SDI histograms of FIA data, 2) ensure that the sum of individual species BA and SDI on a pixel did not exceed the separately modeled total for all species BA and SDI raster datasets, and 3) derive additional tree parameters like quadratic mean diameter and trees per acre. With Landsat image collection dates ranging from 1985 to 2005, a mean collection date for treed areas of 2002, and FIA plot data generally ranging from 1999 to 2005, the vintage of the base parameter datasets varies based on location, but can be roughly considered as 2002.

  8. Bassins versants dérivés du LiDAR avec mesures - Calvert Island

    • catalogue.hakai.org
    html
    Updated Jan 29, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gordon Frazer; Ian Giesbrecht (2025). Bassins versants dérivés du LiDAR avec mesures - Calvert Island [Dataset]. http://doi.org/10.21966/1.15311
    Explore at:
    html (available download formats)
    Dataset updated
    Jan 29, 2025
    Dataset provided by
    Hakai Institute (https://www.hakai.org/)
    Authors
    Gordon Frazer; Ian Giesbrecht
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Calvert Island
    Variables measured
    Other
    Description

    This dataset provides LiDAR-derived watershed boundaries for all of Calvert and Hecate Islands, British Columbia. The watersheds were delineated from a 3 m digital elevation model. For each watershed polygon, the dataset includes a unique identifier and simple summary statistics describing the topography and hydrology. Watershed polygons: this dataset was produced from the outputs of 'traditional' hydrological modelling conducted using the topographically complete bare-earth DEM based on the 2012 + 2014 lidar, with a 10 m buffer around the coastline to ensure that all modelled watersheds reach the ocean. The watersheds were delineated using pour points created at the intersection of the modelled streams and the shoreline. After watershed delineation, the polygons were clipped to the island shoreline.

  9. Data release: A large-scale database of modeled contemporary and future...

    • datasets.ai
    • data.usgs.gov
    • +3 more
    55
    Updated Aug 7, 2024
    + more versions
    Cite
    Department of the Interior (2024). Data release: A large-scale database of modeled contemporary and future water temperature data for 10,774 Michigan, Minnesota and Wisconsin Lakes [Dataset]. https://datasets.ai/datasets/data-release-a-large-scale-database-of-modeled-contemporary-and-future-water-temperature-d
    Explore at:
    55 (available download formats)
    Dataset updated
    Aug 7, 2024
    Dataset authored and provided by
    Department of the Interior
    Area covered
    Minnesota, Wisconsin, Michigan
    Description

    Climate change has been shown to influence lake temperatures globally. To better understand the diversity of lake responses to climate change and give managers tools to manage individual lakes, we modelled daily water temperature profiles for 10,774 lakes in Michigan, Minnesota and Wisconsin for contemporary (1979-2015) and future (2020-2040 and 2080-2100) time periods with climate models based on the Representative Concentration Pathway 8.5, the worst-case emission scenario. From simulated temperatures, we derived commonly used, ecologically relevant annual metrics of thermal conditions for each lake. We included all available supporting metadata including satellite and in-situ observations of water clarity, maximum observed lake depth, land-cover based estimates of surrounding canopy height and observed water temperature profiles (used here for validation). This unique dataset offers landscape-level insight into the future impact of climate change on lakes. This data set contains the following parameters: Thermal metrics, Spatial data, Temperature data, Model drivers, Model configuration, which are defined below.

  10. ONC Regional Extension Centers (REC) Key Performance Indicators (KPIs) by...

    • s.cnmilf.com
    • healthdata.gov
    • +3 more
    Updated Jul 11, 2025
    + more versions
    Cite
    Office of the National Coordinator for Health Information Technology (2025). ONC Regional Extension Centers (REC) Key Performance Indicators (KPIs) by State [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/onc-regional-extension-centers-rec-key-performance-indicators-kpis-by-state
    Explore at:
    Dataset updated
    Jul 11, 2025
    Description

    The ONC Regional Extension Centers (REC) Program provides assistance to health care providers to adopt and meaningfully use certified EHR technology. The program, funded through the American Recovery and Reinvestment Act (ARRA, or The Recovery Act), provides grants to organizations, the Regional Extension Centers, that assist providers directly in the organization's region. There are 62 unique RECs across the United States. This data set provides county-level health care professional participation in the REC Program. You can track metrics on the total primary care and non-primary care providers that have signed up for REC assistance, gone live with an EHR, and demonstrated meaningful use of certified EHR technology. See ONC's REC data by state to track these metrics at the state level.

  11. cross-dataset-drp-paper

    • zenodo.org
    zip
    Updated Apr 22, 2025
    Cite
    A. Partin (2025). cross-dataset-drp-paper [Dataset]. http://doi.org/10.5281/zenodo.15258451
    Explore at:
    zip (available download formats)
    Dataset updated
    Apr 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    A. Partin
    Description

    This benchmark data was used to train and evaluate the models presented in the paper: A. Partin and P. Vasanthakumari et al. "Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis".

    The benchmark data for Cross-Study Analysis (CSA) include four kinds of data: cell line response data, cell line multi-omics data, drug feature data, and data partitions. The figure below illustrates the curation, processing, and assembly of benchmark data, and a unified schema for data curation. Cell line response data were extracted from five sources, including the Cancer Cell Line Encyclopedia (CCLE), the Cancer Therapeutics Response Portal version 2 (CTRPv2), the Genomics of Drug Sensitivity in Cancer version 1 (GDSC1), the Genomics of Drug Sensitivity in Cancer version 2 (GDSC2), and the Genentech Cell Line Screening Initiative (GCSI). These are five large-scale cell line drug screening studies. We extracted their multi-dose viability data and used a unified dose response fitting pipeline to calculate multiple dose-independent response metrics as shown in the figure below, such as the area under the dose response curve (AUC) and the half-maximal inhibitory concentration (IC50). The multi-omics data of cell lines were extracted from the Dependency Map (DepMap) portal of CCLE, including gene expressions, DNA mutations, DNA methylation, gene copy numbers, protein expressions measured by reverse phase protein array (RPPA), and miRNA expressions. Data preprocessing was performed, such as discretizing gene copy numbers and mapping between different gene identifier systems. Drug information was retrieved from PubChem. Based on the drug SMILES (Simplified Molecular Input Line Entry Specification) strings, we calculated their molecular fingerprints and descriptors using the Mordred and RDKit Python packages. Data partition files were generated using the IMPROVE benchmark data preparation pipeline. They indicate, for each modeling analysis run, which samples should be included in the training, validation, and testing sets for building and evaluating the drug response prediction (DRP) models. The Table below shows the numbers of cell lines, drugs, and experiments in each dataset. Across the five datasets, there are 785 unique cell lines and 749 unique drugs. All cell lines have gene expression, mutation, DNA methylation, and copy number data available. 760 of the cell lines have RPPA protein expressions, and 781 of them have miRNA expressions.
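
    A minimal sketch of the kind of drug featurization described above (a Morgan fingerprint computed from a SMILES string with RDKit); the example SMILES and the fingerprint settings are illustrative assumptions, not the exact configuration used for the benchmark.

    from rdkit import Chem
    from rdkit.Chem import AllChem

    smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, as an arbitrary example
    mol = Chem.MolFromSmiles(smiles)

    # Morgan (ECFP-like) bit-vector fingerprint; radius and length are illustrative choices.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    print(fp.GetNumOnBits())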

    Further description is provided here: https://jdacs4c-improve.github.io/docs/content/app_drp_benchmark.html

  12. Data from: Udacity Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Jun 2, 2024
    Cite
    Bright Data (2024). Udacity Dataset [Dataset]. https://brightdata.com/products/datasets/udacity
    Explore at:
    .json, .csv, .xlsx (available download formats)
    Dataset updated
    Jun 2, 2024
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    We'll tailor a Udacity dataset to meet your unique needs, encompassing course titles, user engagement metrics, completion rates, enrollment numbers, review scores, and other pertinent metrics.

    Leverage our Udacity datasets for diverse applications to bolster strategic planning and market analysis. Scrutinizing these datasets enables organizations to grasp learner preferences and online education trends, facilitating nuanced educational program development and learning initiatives. Customize your access to the entire dataset or specific subsets as per your business requisites.

    Popular use cases involve optimizing educational content based on engagement insights, enhancing learning strategies through targeted learner segmentation, and identifying and forecasting trends to stay ahead in the online education landscape.

  13. Forest Health Protection Tree Species Metrics Stand Density Index

    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    • agdatacommons.nal.usda.gov
    • +5 more
    Updated Aug 5, 2025
    + more versions
    Cite
    U.S. Forest Service (2025). Forest Health Protection Tree Species Metrics Stand Density Index [Dataset]. https://res1catalogd-o-tdatad-o-tgov.vcapture.xyz/dataset/forest-health-protection-tree-species-metrics-stand-density-index
    Explore at:
    Dataset updated
    Aug 5, 2025
    Dataset provided by
    U.S. Forest Service
    Description

    These data are a product of a multi-year effort by the FHTET (Forest Health Technology Enterprise Team) Remote Sensing Program to develop raster datasets of forest parameters for each of the tree species measured in the Forest Service’s Forest Inventory and Analysis (FIA) program. This dataset was created to support the 2013–2027 National Insect and Disease Risk Map (NIDRM) assessment. The statistical modeling approach used data-mining software and an archive of geospatial information to find the complex relationships between GIS layers and the presence/abundance of tree species as measured in over 300,000 FIA plot locations. Unique statistical models were developed from predictor layers consisting of climate, terrain, soils, and satellite imagery. Modeled basal area (BA) and stand density index (SDI) datasets for individual tree species were further post-processed to 1) match BA and SDI histograms of FIA data, 2) ensure that the sum of individual species BA and SDI on a pixel did not exceed the separately modeled total for all species BA and SDI raster datasets, and 3) derive additional tree parameters like quadratic mean diameter and trees per acre. With Landsat image collection dates ranging from 1985 to 2005, a mean collection date for treed areas of 2002, and FIA plot data generally ranging from 1999 to 2005, the vintage of the base parameter datasets varies based on location, but can be roughly considered as 2002.

  14. CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly...

    • zenodo.org
    application/gzip, csv
    Updated Feb 26, 2025
    Cite
    Josef Koumar; Karel Hynek; Tomáš Čejka; Pavel Šiška (2025). CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting [Dataset]. http://doi.org/10.5281/zenodo.13382427
    Explore at:
    csv, application/gzip (available download formats)
    Dataset updated
    Feb 26, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Josef Koumar; Karel Hynek; Tomáš Čejka; Pavel Šiška
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CESNET-TimeSeries24: The dataset for network traffic forecasting and anomaly detection

    The dataset called CESNET-TimeSeries24 was collected by long-term monitoring of selected statistical metrics for 40 weeks for each IP address on the ISP network CESNET3 (Czech Education and Science Network). The dataset encompasses network traffic from more than 275,000 active IP addresses, assigned to a wide variety of devices, including office computers, NATs, servers, WiFi routers, honeypots, and video-game consoles found in dormitories. Moreover, the dataset is also rich in network anomaly types since it contains all types of anomalies, ensuring a comprehensive evaluation of anomaly detection methods.

    Last but not least, the CESNET-TimeSeries24 dataset provides traffic time series on institutional and IP subnet levels to cover all possible anomaly detection or forecasting scopes. Overall, the time series dataset was created from the 66 billion IP flows that contain 4 trillion packets that carry approximately 3.7 petabytes of data. The CESNET-TimeSeries24 dataset is a complex real-world dataset that will finally bring insights into the evaluation of forecasting models in real-world environments.

    Please cite the usage of our dataset as:

    Koumar, J., Hynek, K., Čejka, T. et al. CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting. Sci Data 12, 338 (2025). https://doi.org/10.1038/s41597-025-04603-x

    @Article{cesnettimeseries24,
    author={Koumar, Josef and Hynek, Karel and {\v{C}}ejka, Tom{\'a}{\v{s}} and {\v{S}}i{\v{s}}ka, Pavel},
    title={CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting},
    journal={Scientific Data},
    year={2025},
    month={Feb},
    day={26},
    volume={12},
    number={1},
    pages={338},
    issn={2052-4463},
    doi={10.1038/s41597-025-04603-x},
    url={https://doi.org/10.1038/s41597-025-04603-x}
    }

    Time series

    We create evenly spaced time series for each IP address by aggregating IP flow records into time series datapoints. The created datapoints represent the behavior of IP addresses within a defined time window of 10 minutes. The vector of time-series metrics v_{ip, i} describes the IP address ip in the i-th time window. Thus, IP flows for vector v_{ip, i} are captured in time windows starting at t_i and ending at t_{i+1}. The time series are built from these datapoints.

    Datapoints created by the aggregation of IP flows contain the following time-series metrics:

    • Simple volumetric metrics: the number of IP flows, the number of packets, and the transmitted data size (i.e. number of bytes)
    • Unique volumetric metrics: the number of unique destination IP addresses, the number of unique destination Autonomous System Numbers (ASNs), and the number of unique destination transport layer ports. The aggregation of Unique volumetric metrics is memory intensive since all unique values must be stored in an array. We used a server with 41 GB of RAM, which was enough for 10-minute aggregation on the ISP network.
    • Ratios metrics: the ratio of UDP/TCP packets, the ratio of UDP/TCP transmitted data size, the direction ratio of packets, and the direction ratio of transmitted data size
    • Average metrics: the average flow duration, and the average Time To Live (TTL)

    Multiple time aggregation: The original datapoints in the dataset are aggregated by 10 minutes of network traffic. The size of the aggregation interval influences anomaly detection procedures, mainly the training speed of the detection model. However, the 10-minute intervals can be too short for longitudinal anomaly detection methods. Therefore, we added two more aggregation intervals to the datasets--1 hour and 1 day.

    Time series of institutions: We identify 283 institutions inside the CESNET3 network. These time series aggregated per each institution ID provide a view of the institution's data.

    Time series of institutional subnets: We identify 548 institution subnets inside the CESNET3 network. These time series, aggregated per institution subnet, provide a view of each subnet's data.

    Data Records

    The file hierarchy is described below:

    cesnet-timeseries24/
    |- institution_subnets/
    |  |- agg_10_minutes/
    |  |- agg_1_hour/
    |  |- agg_1_day/
    |  |- identifiers.csv
    |- institutions/
    |  |- agg_10_minutes/
    |  |- agg_1_hour/
    |  |- agg_1_day/
    |  |- identifiers.csv
    |- ip_addresses_full/
    |  |- agg_10_minutes/
    |  |- agg_1_hour/
    |  |- agg_1_day/
    |  |- identifiers.csv
    |- ip_addresses_sample/
    |  |- agg_10_minutes/
    |  |- agg_1_hour/
    |  |- agg_1_day/
    |  |- identifiers.csv
    |- times/
    |  |- times_10_minutes.csv
    |  |- times_1_hour.csv
    |  |- times_1_day.csv
    |- ids_relationship.csv
    |- weekends_and_holidays.csv

    The following list describes time series data fields in CSV files:

    • id_time: Unique identifier for each aggregation interval within the time series, used to segment the dataset into specific time periods for analysis.
    • n_flows: Total number of flows observed in the aggregation interval, indicating the volume of distinct sessions or connections for the IP address.
    • n_packets: Total number of packets transmitted during the aggregation interval, reflecting the packet-level traffic volume for the IP address.
    • n_bytes: Total number of bytes transmitted during the aggregation interval, representing the data volume for the IP address.
    • n_dest_ip: Number of unique destination IP addresses contacted by the IP address during the aggregation interval, showing the diversity of endpoints reached.
    • n_dest_asn: Number of unique destination Autonomous System Numbers (ASNs) contacted by the IP address during the aggregation interval, indicating the diversity of networks reached.
    • n_dest_port: Number of unique destination transport layer ports contacted by the IP address during the aggregation interval, representing the variety of services accessed.
    • tcp_udp_ratio_packets: Ratio of packets sent using TCP versus UDP by the IP address during the aggregation interval, providing insight into the transport protocol usage pattern. This metric belongs to the interval <0, 1> where 1 is when all packets are sent over TCP, and 0 is when all packets are sent over UDP.
    • tcp_udp_ratio_bytes: Ratio of bytes sent using TCP versus UDP by the IP address during the aggregation interval, highlighting the data volume distribution between protocols. This metric belongs to the interval <0, 1> with same rule as tcp_udp_ratio_packets.
    • dir_ratio_packets: Ratio of packet directions (inbound versus outbound) for the IP address during the aggregation interval, indicating the balance of traffic flow directions. This metric belongs to the interval <0, 1>, where 1 is when all packets are sent in the outgoing direction from the monitored IP address, and 0 is when all packets are sent in the incoming direction to the monitored IP address.
    • dir_ratio_bytes: Ratio of byte directions (inbound versus outbound) for the IP address during the aggregation interval, showing the data volume distribution in traffic flows. This metric belongs to the interval <0, 1> with the same rule as dir_ratio_packets.
    • avg_duration: Average duration of IP flows for the IP address during the aggregation interval, measuring the typical session length.
    • avg_ttl: Average Time To Live (TTL) of IP flows for the IP address during the aggregation interval, providing insight into the lifespan of packets.

    Moreover, the time series created by re-aggregation contain the following time-series metrics instead of n_dest_ip, n_dest_asn, and n_dest_port:

    • sum_n_dest_ip: Sum of numbers of unique destination IP addresses.
    • avg_n_dest_ip: The average number of unique destination IP addresses.
    • std_n_dest_ip: Standard deviation of numbers of unique destination IP addresses.
    • sum_n_dest_asn: Sum of numbers of unique destination ASNs.
    • avg_n_dest_asn: The average number of unique destination ASNs.
    • std_n_dest_asn: Standard deviation of numbers of unique destination ASNs.
    • sum_n_dest_port: Sum of numbers of unique destination transport layer ports.
    • avg_n_dest_port: The average number of unique destination transport layer ports.
    • std_n_dest_port: Standard deviation of numbers of unique destination transport layer ports.
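
    A minimal sketch of loading one time series and attaching timestamps with pandas; the example file name is hypothetical (the identifiers.csv files list the actual series), and the join on id_time is an assumption based on the field description above.

    import pandas as pd

    series = pd.read_csv("cesnet-timeseries24/ip_addresses_sample/agg_1_hour/0.csv")
    times = pd.read_csv("cesnet-timeseries24/times/times_1_hour.csv")

    # Attach wall-clock time to each aggregation interval via the shared id_time key.
    df = series.merge(times, on="id_time")

    # Simple volumetric view: flows, packets, and bytes per 1-hour interval.
    print(df[["id_time", "n_flows", "n_packets", "n_bytes"]].head())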

  15. Overtone Journalistic Content Bot/Human Indicator Dataset

    • datarade.ai
    Updated Jan 23, 2023
    Cite
    Overtone (2023). Overtone Journalistic Content Bot/Human Indicator Dataset [Dataset]. https://datarade.ai/data-products/overtone-journalistic-content-bot-human-indicator-dataset-overtone
    Explore at:
    Dataset updated
    Jan 23, 2023
    Dataset authored and provided by
    Overtone
    Area covered
    Brazil, Panama, Aruba, Belarus, Virgin Islands (U.S.), Falkland Islands (Malvinas), Australia, Russian Federation, Finland, Belize
    Description

    We indicate how likely it is that a piece of content is computer-generated or human-written. Content: any text in English or Spanish, from a single sentence to articles thousands of words in length.

    Data uniqueness: we use custom built and trained NLP algorithms to assess human effort metrics that are inherent in text content. We focus on what's in the text, not metadata such as publication or engagement. Our AI algorithms are co-created by NLP & journalism experts. Our datasets have all been human-reviewed and labeled.

    Dataset: CSV containing URL and/or body text, with attributed scoring as an integer and model confidence as a percentage. We ignore metadata such as author, publication, date, word count, shares and so on, to provide a clean and maximally unbiased assessment of how much human effort has been invested in content. Our data is provided in CSV/RSS/JSON format. One row = one scored article.

    Integrity indicators provided as integers on a 1–5 scale. We also have custom models with 35 categories that can be added on request.
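
    A minimal sketch of filtering a delivery by score and confidence with pandas; the file and column names are hypothetical, since only the general CSV layout (one scored article per row) is specified above.

    import pandas as pd

    df = pd.read_csv("overtone_scores.csv")  # hypothetical file and column names

    # Keep articles scored 4 or higher on the 1-5 integrity scale with at least
    # 80% model confidence.
    confident_human = df[(df["score"] >= 4) & (df["confidence"] >= 80)]
    print(confident_human[["url", "score", "confidence"]].head())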

    Data sourcing: public websites, crawlers, scrapers, and other partnerships where available. We can generally assess content behind paywalls as well as content without paywalls. We source from ~4,000 news outlets; examples include Bloomberg, CNN, and the BBC. Countries: all English-speaking markets worldwide. Includes English-language content from non-English-majority regions, such as Germany, Scandinavia, and Japan. Also available in Spanish on request.

    Use-cases: assessing the implicit integrity and reliability of an article. There is correlation between integrity and human value: we have shown that articles scoring highly according to our scales show increased, sustained, ongoing end-user engagement. Clients also use this to assess journalistic output, publication relevance and to create datasets of 'quality' journalism.

    Overtone provides a range of qualitative metrics for journalistic, newsworthy and long-form content. We find, highlight and synthesise content that shows added human effort and, by extension, added human value.

  16. Technical Debt identification in Issue Trackers using Natural Language...

    • zenodo.org
    bin
    Updated Mar 1, 2023
    + more versions
    Cite
    AAAAAA; BBBBB; CCCC; DDD (2023). Technical Debt identification in Issue Trackers using Natural Language Processing based on Transformers [Dataset]. http://doi.org/10.5281/zenodo.7221631
    Explore at:
    bin (available download formats)
    Dataset updated
    Mar 1, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    AAAAAA; BBBBB; CCCC; DDD
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In order to ensure transparency and reproducibility, we have made everything available publicly here, including the Code, Models, Datasets and more. All the files and their functionality used in this paper are explained clearly in the README.md file.

    Background: Technical Debt (TD) needs to be controlled and tracked during software development. Current support, such as static analysis tools and even ML-based automatic tagging, is still ineffective, especially for context-dependent TD.

    Aim: We study the usage of a large TD dataset in combination with cutting-edge Natural Language Processing (NLP) approaches to classify TD automatically in issue trackers, allowing the identification and tracking of informal TD conversations.

    Method: We mine and analyse more than 160GB of textual data from GitHub projects, collecting over 55,600 TD issues and consolidating them into a large dataset (GTD-dataset). We then use our dataset to train state-of-the-art Transformer ML models, before performing a quantitative case study on three projects and evaluating the performance metrics during inference. Additionally, we study the adaptation of our model to classify context-dependent TD in an unseen project, by retraining the model including different percentages of the TD issues in the target project.
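
    A minimal sketch of fine-tuning a Transformer classifier on labelled issue text, in the spirit of the approach described above; the checkpoint, example data, and hyperparameters are illustrative assumptions rather than the paper's actual setup.

    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Tiny illustrative set: issue text labelled as technical debt (1) or not (0).
    data = Dataset.from_dict({
        "text": ["Refactor this hacky workaround before the next release",
                 "Add dark mode to the settings page"],
        "label": [1, 0],
    })

    checkpoint = "distilbert-base-uncased"  # illustrative choice, not the paper's model
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

    tokenized = data.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="td-classifier", num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=tokenized,
    )
    trainer.train()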

    Results: (i) We provide the GTD-dataset, the most comprehensive dataset of TD issues to date, including issues from 6,401 unique public repositories with various contexts;

    (ii) By training state-of-the-art Transformers using the GTD-dataset, we achieve performance metrics that outperform previous approaches;

    (iii) We show that our model can provide a relatively reliable tool to automatically classify TD in issue trackers, especially when adapted to unseen projects where the training includes a small portion of the TD issues in the new project.

    Conclusion: Our results indicate that we have taken significant steps towards closing the gap to practically and semi-automatically track TD issues in issue trackers.

  17. Multi-relational real world manufacturing process data (FCUP, CMF)

    • figshare.com
    txt
    Updated Jul 21, 2020
    Cite
    Rui Balau (2020). Multi-relational real world manufacturing process data (FCUP, CMF) [Dataset]. http://doi.org/10.6084/m9.figshare.12681983.v2
    Explore at:
    txt (available download formats)
    Dataset updated
    Jul 21, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Rui Balau
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset comes from a real-world manufacturing process of a Critical Manufacturing business partner. The manufacturing process is monitored via an IoT system. The dataset has been carefully anonymized due to privacy concerns; for more details on how this was done, see the accompanying thesis. In the case of the process that generates this data, eight different readings are taken each time a particular tool is used. Eventually, once a tool begins underperforming, it is retired and therefore does not appear again in the dataset. We believe that this dataset may be used to estimate and predict tool longevity, as it likely presents time-dependent covariates, and as such may be of use to research on multilevel survival analysis or predictive maintenance models.

    Name             | Type          | Description
    -----------------|---------------|------------
    OperationEndTime | Numerical     | Difference in seconds from the first operation in the dataset.
    ToolId           | Numerical key | The tool used. Its value is unique to each different tool in the dataset.
    Machine          | Numeric       | A categorical variable representing the machine that used the tool. Its value is unique to each different machine in the dataset.
    Process          | Numeric       | A categorical variable representing the process that used the tool. Its value is unique to each different process in the dataset.
    P1DataPoint1     | Numeric       | A concrete value for a reading of parameter one.
    P1DataPoint2     | Numeric       | A concrete value for an error metric associated with the process that generated the value present in P1DataPoint1.
    P2DataPoint1     | Numeric       | A concrete value for a reading of parameter two.
    P2DataPoint2     | Numeric       | A concrete value for an error metric associated with the process that generated the value present in P2DataPoint1.
    ...              | ...           | ...
    P8DataPoint1     | Numeric       | A concrete value for a reading of parameter eight.
    P8DataPoint2     | Numeric       | A concrete value for an error metric associated with the process that generated the value present in P8DataPoint1.
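
    A minimal sketch of the tool-longevity angle mentioned above: approximate each tool's observed service life from its first and last recorded operation with pandas. The file name is hypothetical; the column names come from the table above.

    import pandas as pd

    df = pd.read_csv("manufacturing_process.csv")  # hypothetical file name

    # OperationEndTime is measured in seconds from the first operation in the dataset,
    # so the span between a tool's first and last use approximates its observed lifespan.
    lifetime = (df.groupby("ToolId")["OperationEndTime"]
                  .agg(first_use="min", last_use="max", n_operations="count"))
    lifetime["observed_lifespan_s"] = lifetime["last_use"] - lifetime["first_use"]
    print(lifetime.sort_values("observed_lifespan_s", ascending=False).head())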

  18. Example based metrics.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Aug 10, 2023
    + more versions
    Cite
    Pal, Arpan; Ukil, Arijit; Saha, Soumadeep; Khandelwal, Sundeep; Garain, Utpal (2023). Example based metrics. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000966013
    Explore at:
    Dataset updated
    Aug 10, 2023
    Authors
    Pal, Arpan; Ukil, Arijit; Saha, Soumadeep; Khandelwal, Sundeep; Garain, Utpal
    Description

    When judging the quality of a computational system for a pathological screening task, several factors seem to be important, such as sensitivity, specificity, and accuracy. With machine learning based approaches showing promise in the multi-label paradigm, they are being widely adopted in diagnostics and digital therapeutics. Metrics are usually borrowed from the machine learning literature, and the current consensus is to report results on a diverse set of metrics. It is infeasible to compare the efficacy of computational systems which have been evaluated on different sets of metrics. From a diagnostic utility standpoint, the current metrics themselves are far from perfect, often biased by prevalence of negative samples or other statistical factors and, importantly, they are designed to evaluate general purpose machine learning tasks. In this paper we outline the various parameters that are important in constructing a clinical metric aligned with diagnostic practice, and demonstrate their incompatibility with existing metrics. We propose a new metric, MedTric, that takes into account several factors that are of clinical importance. MedTric is built from the ground up keeping in mind the unique context of computational diagnostics and the principle of risk minimization, penalizing missed diagnosis more harshly than over-diagnosis. MedTric is a unified metric for medical or pathological screening system evaluation. We compare this metric against other widely used metrics and demonstrate how our system outperforms them in key areas of medical relevance.

  19. Data from: Coursera Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated May 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bright Data (2024). Coursera Dataset [Dataset]. https://brightdata.com/products/datasets/coursera
    Explore at:
    .json, .csv, .xlsx (available download formats)
    Dataset updated
    May 7, 2024
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    We'll tailor a Coursera dataset to meet your unique needs, encompassing course titles, user engagement metrics, completion rates, demographic data of learners, enrollment numbers, review scores, and other pertinent metrics.

    Leverage our Coursera datasets for diverse applications to bolster strategic planning and market analysis. Scrutinizing these datasets enables organizations to grasp learner preferences and online education trends, facilitating nuanced educational program development and learning initiatives. Customize your access to the entire dataset or specific subsets as per your business requisites.

    Popular use cases involve optimizing educational content based on engagement insights, enhancing learning strategies through targeted learner segmentation, and identifying and forecasting trends to stay ahead in the online education landscape.

  20. Dataset of the 4 conditions experimentally recorded.

    • plos.figshare.com
    zip
    Updated Aug 5, 2025
    Cite
    Océane Dubois; Agnès Roby-Brami; Ross Parry; Nathanaël Jarrassé (2025). Dataset of the 4 conditions experimentally recorded. [Dataset]. http://doi.org/10.1371/journal.pone.0325792.s004
    Explore at:
    zip (available download formats)
    Dataset updated
    Aug 5, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Océane Dubois; Agnès Roby-Brami; Ross Parry; Nathanaël Jarrassé
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This zip file contains one folder for each condition. For each condition, the 3 repetitions of the movements for the 3 different target heights are presented in individual CSV files. (ZIP)
