Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains quality and source code metrics information for 60 versions across 10 different repositories. The dataset is extracted at 3 levels: (1) Class, (2) Method, (3) Package. The dataset was created by analyzing 9,420,246 lines of code and 173,237 classes. The provided dataset contains one quality_attributes folder and three associated files: repositories.csv, versions.csv, and attribute-details.csv. The first file (repositories.csv) contains general information (repository name, repository URL, number of commits, stars, forks, etc.) to help understand size, popularity, and maintainability. File versions.csv contains general information (version unique ID, number of classes, packages, external classes, external packages, version repository link) to provide an overview of the versions and how the repository grows over time. File attribute-details.csv contains detailed information (attribute name, attribute short form, category, and description) about the extracted static analysis metrics and code quality attributes. The short form is used in the dataset itself as a unique identifier for the values reported for packages, classes, and methods.
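As a minimal sketch of how the three description files relate, one can join version-level rows to repository-level metadata. The file names come from the description above, but the column names here are assumptions for illustration only:

```python
import pandas as pd

# Toy stand-ins for repositories.csv and versions.csv; column names are
# hypothetical, inferred loosely from the description.
repos = pd.DataFrame({
    "repository_name": ["repo-a", "repo-b"],
    "stars": [1200, 300],
})
versions = pd.DataFrame({
    "version_id": ["v1", "v2", "v3"],
    "repository_name": ["repo-a", "repo-a", "repo-b"],
    "number_of_classes": [150, 180, 90],
})

# One row per version, enriched with repository-level popularity info.
merged = versions.merge(repos, on="repository_name", how="left")
print(merged[["version_id", "number_of_classes", "stars"]])
```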
The report contains thirteen (13) performance metrics for the City's workforce development programs. Each metric can be broken down by three demographic types (gender, race/ethnicity, and age group) as well as by the program target population (e.g., youth and young adults, NYCHA communities). This report is a key output of an integrated data system built by the Mayor's Office for Economic Opportunity (NYC Opportunity) that collects, integrates, and generates disaggregated data. Currently, the report is generated from the integrated database, incorporating data from 18 workforce development programs managed by 5 City agencies. There has been no single "workforce development system" in the City of New York. Instead, many discrete public agencies directly manage or fund local partners to deliver a range of different services, sometimes tailored to specific populations. As a result, program data have historically been fragmented as well, making it challenging to develop insights based on a comprehensive picture. To overcome this, NYC Opportunity collects data from 5 City agencies into the integrated database, which begins to build a complete picture of how participants move through the system onto a career pathway. Each row represents a count of unique individuals for a specific performance metric, program target population, demographic group, and period. For example, if the Metric Value is 2000 with Clients Served (Metric Name), NYCHA Communities (Program Target Population), Asian (Subgroup), and 2019 (Period), you can say that "In 2019, 2,000 Asian individuals participated in programs targeting NYCHA communities." Please refer to the Workforce Data Portal for further data guidance (https://workforcedata.nyc.gov/en/data-guidance) and interactive visualizations for this report (https://workforcedata.nyc.gov/en/common-metrics).
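The row-level interpretation above can be sketched as a simple filter. The field names follow the worked example in the description, but this toy frame and the exact column labels are assumptions:

```python
import pandas as pd

# Hypothetical rows following the described schema (column names assumed).
df = pd.DataFrame([
    {"Metric Name": "Clients Served", "Program Target Population": "NYCHA Communities",
     "Subgroup": "Asian", "Period": 2019, "Metric Value": 2000},
    {"Metric Name": "Clients Served", "Program Target Population": "Youth and Young Adults",
     "Subgroup": "Asian", "Period": 2019, "Metric Value": 1500},
])

# Count of unique Asian individuals served by NYCHA-targeted programs in 2019.
row = df[(df["Metric Name"] == "Clients Served")
         & (df["Program Target Population"] == "NYCHA Communities")
         & (df["Subgroup"] == "Asian")
         & (df["Period"] == 2019)]
print(int(row["Metric Value"].iloc[0]))
```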
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data were generated for an investigation of research data repository (RDR) mentions in biomedical research articles.
Supplementary Table 1 is a discrete subset of SciCrunch RDRs used to study RDR mentions in biomedical literature. We generated this list by starting with the top 1000 entries in the SciCrunch database, measured by citations; removing entries for organizations (such as universities without a corresponding RDR) or non-relevant tools (such as reference managers); updating links; and consolidating duplicates resulting from RDR mergers and name variations. The resulting list of 737 RDRs is based on a source list of RDRs in the SciCrunch database. The file includes the Research Resource Identifier (RRID), the RDR name, and a link to the RDR record in the SciCrunch database.
Supplementary Table 2 shows the RDRs, associated journals, and article-mention pairs (records) with text snippets extracted from mined Methods text in 2020 PubMed articles. The dataset has 4 components. The first shows the list of repositories with RDR mentions, and includes the Research Resource Identifier (RRID), the RDR name, the number of articles that mention the RDR, and a link to the record in the SciCrunch database. The second shows the list of journals in the study set with at least 1 RDR mention, and includes the Journal ID, name, ESSN/ISSN, the total count of publications in 2020, the number of articles that had text available to mine, the number of article-mention pairs (records), the number of articles with RDR mentions, the number of unique RDRs mentioned, and the percentage of articles with minable text. The third shows the top 200 journals by RDR mention, normalized by the proportion of articles with available text to mine, with the same metadata as the second table. The fourth shows text snippets for each RDR mention, and includes the RRID, RDR name, PubMedID (PMID), DOI, article publication date, journal name, journal ID, ESSN/ISSN, article title, and snippet.
Machine learning approaches are often trained and evaluated with datasets that require a clear separation between positive and negative examples. This approach overly simplifies the natural subjectivity present in many tasks and content items. It also obscures the inherent diversity in human perceptions and opinions. Often, tasks that attempt to preserve the variance in content and diversity in humans are quite expensive and laborious. To fill this gap and facilitate more in-depth model performance analyses, we propose the DICES dataset - a unique dataset with diverse perspectives on the safety of AI-generated conversations. We focus on the task of safety evaluation of conversational AI systems. The DICES dataset contains detailed demographic information about each rater and extremely high replication of unique ratings per conversation to ensure the statistical significance of further analyses, and it encodes rater votes as distributions across different demographics to allow for in-depth exploration of different rating aggregation strategies.
This dataset is well suited to observe and measure variance, ambiguity and diversity in the context of safety of conversational AI. The dataset is accompanied by a paper describing a set of metrics that show how rater diversity influences the safety perception of raters from different geographic regions, ethnicity groups, age groups and genders. The goal of the DICES dataset is to be used as a shared benchmark for safety evaluation of conversational AI systems.
CONTENT WARNING: This dataset contains adversarial examples of conversations that may be offensive.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('dices', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation
Increasing heat stress due to climate change poses significant risks to human health and can lead to widespread social and economic consequences. Evaluating these impacts requires reliable datasets of heat stress projections.
Data Record
We present a global dataset projecting future dry-bulb, wet-bulb, and wet-bulb globe temperatures under 1-4°C global warming scenarios (at 0.5°C intervals) relative to the preindustrial era, using outputs from 16 CMIP6 global climate models (GCMs) (Table 1). All variables were retrieved from the historical and SSP585 scenarios, which were selected to maximize the warming signal.
The dataset was bias-corrected against ERA5 reanalysis by incorporating the GCM-simulated climate change signal onto the ERA5 baseline (1950-1976) at a 3-hourly frequency. It therefore includes a 27-year sample for each GCM under each warming target.
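The bias-correction step described above is a delta-change style approach. A minimal additive sketch of that idea follows; it is an illustration of the general method, not the authors' exact pipeline, and the toy values are invented:

```python
import numpy as np

def delta_correct(era5_baseline, gcm_hist, gcm_future):
    """Superimpose the GCM-simulated climate change signal on the ERA5 baseline.

    A simplified additive delta method: the difference between the GCM's
    future and historical values is added to the ERA5 reference series.
    """
    signal = gcm_future - gcm_hist          # GCM-simulated change signal
    return era5_baseline + signal

# Toy 3-hourly temperature samples (degrees C).
era5 = np.array([10.0, 12.5, 15.0])         # ERA5 baseline sample
hist = np.array([9.5, 12.0, 14.0])          # GCM historical
future = np.array([12.5, 15.0, 17.5])       # GCM under a warming target

corrected = delta_correct(era5, hist, future)
print(corrected)  # -> [13.  15.5 18.5]
```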
The data is provided at a fine spatial resolution of 0.25° x 0.25° and a temporal resolution of 3 hours, and is stored in a self-describing NetCDF format. Filenames follow the pattern "VAR_bias_corrected_3hr_GCM_XC_yyyy.nc", where:
"VAR" represents the variable (Ta, Tw, WBGT for dry-bulb, wet-bulb, and wet-bulb globe temperature, respectively),
"GCM" denotes the CMIP6 GCM name,
"X" indicates the warming target compared to the preindustrial period,
"yyyy" represents the year index (0001-0027) of the 27-year sample
Table 1 CMIP6 GCMs used for generating the dataset for Ta, Tw and WBGT.
GCM | Realization | GCM grid spacing | Ta | Tw | WBGT
--- | --- | --- | --- | --- | ---
ACCESS-CM2 | r1i1p1f1 | 1.25°×1.875° | ✓ | ✓ | ✓
BCC-CSM2-MR | r1i1p1f1 | 1.1°×1.125° | ✓ | ✓ | ✓
CanESM5 | r1i1p2f1 | 2.8°×2.8° | ✓ | ✓ | ✓
CMCC-CM2-SR5 | r1i1p1f1 | 0.94°×1.25° | ✓ | ✓ | ✓
CMCC-ESM2 | r1i1p1f1 | 0.94°×1.25° | ✓ | ✓ | ✓
CNRM-CM6-1 | r1i1p1f2 | 1.4°×1.4° | ✓ | ✓ |
EC-Earth3 | r1i1p1f1 | 0.7°×0.7° | ✓ | ✓ | ✓
GFDL-ESM4 | r1i1p1f1 | 1.0°×1.25° | ✓ | ✓ | ✓
HadGEM3-GC31-LL | r1i1p1f3 | 1.25°×1.875° | ✓ | ✓ | ✓
HadGEM3-GC31-MM | r1i1p1f3 | 0.55°×0.83° | ✓ | ✓ | ✓
KACE-1-0-G | r1i1p1f1 | 1.25°×1.875° | ✓ | ✓ | ✓
KIOST-ESM | r1i1p1f1 | 1.9°×1.9° | ✓ | ✓ | ✓
MIROC-ES2L | r1i1p1f2 | 2.8°×2.8° | ✓ | ✓ | ✓
MIROC6 | r1i1p1f1 | 1.4°×1.4° | ✓ | ✓ | ✓
MPI-ESM1-2-HR | r1i1p1f1 | 0.93°×0.93° | ✓ | ✓ | ✓
MPI-ESM1-2-LR | r1i1p1f1 | 1.85°×1.875° | ✓ | ✓ | ✓
Data Access
An inventory of the dataset is available in this repository. The complete dataset, approximately 57 TB in size, is freely accessible via Purdue Fortress' long-term archive through Globus at Globus Link. After clicking the link, users may be prompted to log in with a Purdue institutional Globus account. You can switch to your institutional account, or log in via a personal Globus ID, Gmail, GitHub handle, or ORCID ID. Alternatively, the dataset can be accessed by searching for the universally unique identifier (UUID): "6538f53a-1ea7-4c13-a0cf-10478190b901" in Globus.
Dataset Validation
We validate the bias-correction method and show that it significantly enhances the GCMs' accuracy in reproducing both the annual average and the full range of quantiles for all metrics within an ERA5 reference climate state. This dataset is expected to support future research on projected changes in mean and extreme heat stress and the assessment of related health and socio-economic impacts.
For a detailed introduction to the dataset and its validation, please refer to our data descriptor currently under review at Scientific Data. We will update this information upon publication.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Kudos dataset (extracted from Kudos in February 2016) is analysed in the research article titled "Analysing researchers' outreach efforts and the association with publication metrics: A case study of Kudos". This research paper is the result of a joint research collaboration between Kudos and CHESS, Nanyang Technological University, Singapore. Kudos made funds available to CHESS to perform the study and also provided the dataset used for the analysis.

In recent years, social media and scholarly collaboration networks have become increasingly accepted as effective tools for discovering and sharing research. Altmetrics are also becoming more common, as they reflect impact quickly, are openly accessible, and represent both academic and lay audiences, unlike traditional metrics such as citation counts. As a researcher, it still remains challenging to know whether efforts to increase the visibility and outreach of your research on social media are associated with improved publication metrics.

In this paper, we analyse the effectiveness of common online channels used for sharing publications, using Kudos (https://www.growkudos.com, launched in May 2014), a web-based service that aims to help researchers increase the outreach of their publications, as a case study. We extracted a dataset from Kudos of 20,775 unique publications that had been claimed by authors and for which actions had been taken to explain or share via Kudos. For 4,867 of these, full text download data from publishers was available. Our findings show that researchers are most likely to share their work on Facebook, but links shared on Twitter are most likely to be clicked on. A Mann-Whitney U test revealed that a treatment group (publications having actions in Kudos) had a significantly higher median of 149 full text downloads per publication (23.1% more) compared to a control group (having no actions in Kudos) with a median of 121 full text downloads per publication. These findings suggest that performing actions on publications, such as sharing, explaining, or enriching, could help to increase the number of full text downloads of a publication.

The DOIs of the publications in the dataset have been anonymised to protect the privacy of the users of Kudos. A readme text file is provided describing the data fields of the four datasets. All fields in the CSV files should be imported (e.g., into Excel) as text values.
Basal Area (BA). 30 meter pixel resolution. Data represents forest conditions circa 2002. These data are a product of a multi-year effort by the FHTET (Forest Health Technology Enterprise Team) Remote Sensing Program to develop raster datasets of forest parameters for each of the tree species measured in the Forest Service’s Forest Inventory and Analysis (FIA) program. This dataset was created to support the 2013–2027 National Insect and Disease Risk Map (NIDRM) assessment. The statistical modeling approach used data-mining software and an archive of geospatial information to find the complex relationships between GIS layers and the presence/abundance of tree species as measured in over 300,000 FIA plot locations. Unique statistical models were developed from predictor layers consisting of climate, terrain, soils, and satellite imagery. Modeled basal area (BA) and stand density index (SDI) datasets for individual tree species were further post-processed to 1) match BA and SDI histograms of FIA data, 2) ensure that the sum of individual species BA and SDI on a pixel did not exceed separately modeled total for all species BA and SDI raster datasets, and 3) derive additional tree parameters like quadratic mean diameter and trees per acre. With Landsat image collection dates ranging from 1985 to 2005, a mean collection date for treed areas of 2002, and FIA plot data generally ranging from 1999 to 2005, the vintage of the base parameter datasets varies based on location, but can be roughly considered as 2002.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides LiDAR-derived watershed boundaries for all of Calvert and Hecate Islands, British Columbia. Watersheds were delineated from a 3 m digital elevation model. For each watershed polygon, the dataset includes a unique identifier and simple summary statistics describing the topography and hydrology. Watershed polygons: this dataset was produced from the results of "traditional" hydrological modelling conducted using the topographically complete bare-earth DEM based on the 2012 + 2014 lidar, with a 10 m buffer around the coastline to ensure that all modelled watersheds reach the ocean. Watersheds were delineated using pour points created at the intersection of modelled streams and the shoreline. After watershed delineation, the polygons were clipped to the island shoreline.
Climate change has been shown to influence lake temperatures globally. To better understand the diversity of lake responses to climate change and give managers tools to manage individual lakes, we modelled daily water temperature profiles for 10,774 lakes in Michigan, Minnesota and Wisconsin for contemporary (1979-2015) and future (2020-2040 and 2080-2100) time periods with climate models based on the Representative Concentration Pathway 8.5, the worst-case emission scenario. From simulated temperatures, we derived commonly used, ecologically relevant annual metrics of thermal conditions for each lake. We included all available supporting metadata including satellite and in-situ observations of water clarity, maximum observed lake depth, land-cover based estimates of surrounding canopy height and observed water temperature profiles (used here for validation). This unique dataset offers landscape-level insight into the future impact of climate change on lakes. This data set contains the following parameters: Thermal metrics, Spatial data, Temperature data, Model drivers, Model configuration, which are defined below.
The ONC Regional Extension Centers (REC) Program provides assistance to health care providers to adopt and meaningfully use certified EHR technology. The program, funded through the American Recovery and Reinvestment Act (ARRA), or The Recovery Act, provides grants to organizations, Regional Extension Centers, that assist providers directly in the organization's region. There are 62 unique RECs across the United States. This data set provides county-level health care professional participation in the REC Program. You can track metrics on the total primary care and non-primary care providers that have signed up for REC assistance, gone live with an EHR, and demonstrated meaningful use of certified EHR technology. See ONC's REC data by state to track these metrics at the state level.
This benchmark data was used to train and evaluate the models presented in the paper: A. Partin and P. Vasanthakumari et al. "Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis"
The benchmark data for Cross-Study Analysis (CSA) include four kinds of data: cell line response data, cell line multi-omics data, drug feature data, and data partitions. The figure below illustrates the curation, processing, and assembly of benchmark data, and a unified schema for data curation. Cell line response data were extracted from five sources: the Cancer Cell Line Encyclopedia (CCLE), the Cancer Therapeutics Response Portal version 2 (CTRPv2), the Genomics of Drug Sensitivity in Cancer version 1 (GDSC1), the Genomics of Drug Sensitivity in Cancer version 2 (GDSC2), and the Genentech Cell Line Screening Initiative (GCSI). These are five large-scale cell line drug screening studies. We extracted their multi-dose viability data and used a unified dose response fitting pipeline to calculate multiple dose-independent response metrics as shown in the figure below, such as the area under the dose response curve (AUC) and the half-maximal inhibitory concentration (IC50). The multi-omics data of cell lines were extracted from the Dependency Map (DepMap) portal of CCLE, including gene expressions, DNA mutations, DNA methylation, gene copy numbers, protein expressions measured by reverse phase protein array (RPPA), and miRNA expressions. Data preprocessing was performed, such as discretizing gene copy numbers and mapping between different gene identifier systems. Drug information was retrieved from PubChem. Based on the drug SMILES (Simplified Molecular Input Line Entry Specification) strings, we calculated their molecular fingerprints and descriptors using the Mordred and RDKit Python packages. Data partition files were generated using the IMPROVE benchmark data preparation pipeline. They indicate, for each modeling analysis run, which samples should be included in the training, validation, and testing sets for building and evaluating the drug response prediction (DRP) models.
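As an illustration of how dose-independent metrics such as IC50 and AUC are derived from multi-dose viability data, here is a minimal sketch that fits a two-parameter Hill curve. It is a generic example, not the benchmark's unified fitting pipeline, and the viability values are invented:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.integrate import trapezoid

def hill(log_dose, ic50_log, slope):
    """Two-parameter Hill curve: viability falls from 1 to 0 with dose."""
    return 1.0 / (1.0 + 10 ** (slope * (log_dose - ic50_log)))

# Toy multi-dose viability data (log10 molar doses, fractions of control).
log_doses = np.array([-9.0, -8.0, -7.0, -6.0, -5.0])
viability = np.array([0.98, 0.90, 0.55, 0.15, 0.05])

(ic50_log, slope), _ = curve_fit(hill, log_doses, viability, p0=[-7.0, 1.0])

# Dose-independent metrics: IC50 and area under the fitted curve,
# normalized to the tested dose range so AUC lies in [0, 1].
grid = np.linspace(log_doses.min(), log_doses.max(), 200)
auc = trapezoid(hill(grid, ic50_log, slope), grid) / (log_doses.max() - log_doses.min())
print(f"log10(IC50) ~ {ic50_log:.2f}, AUC ~ {auc:.2f}")
```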
The Table below shows the numbers of cell lines, drugs, and experiments in each dataset. Across the five datasets, there are 785 unique cell lines and 749 unique drugs. All cell lines have gene expression, mutation, DNA methylation, and copy number data available. 760 of the cell lines have RPPA protein expressions, and 781 of them have miRNA expressions.
Further description is provided here: https://jdacs4c-improve.github.io/docs/content/app_drp_benchmark.html
https://brightdata.com/license
We'll tailor a Udacity dataset to meet your unique needs, encompassing course titles, user engagement metrics, completion rates, enrollment numbers, review scores, and other pertinent metrics.
Leverage our Udacity datasets for diverse applications to bolster strategic planning and market analysis. Scrutinizing these datasets enables organizations to grasp learner preferences and online education trends, facilitating nuanced educational program development and learning initiatives. Customize your access to the entire dataset or specific subsets as per your business requisites.
Popular use cases involve optimizing educational content based on engagement insights, enhancing learning strategies through targeted learner segmentation, and identifying and forecasting trends to stay ahead in the online education landscape.
These data are a product of a multi-year effort by the FHTET (Forest Health Technology Enterprise Team) Remote Sensing Program to develop raster datasets of forest parameters for each of the tree species measured in the Forest Service’s Forest Inventory and Analysis (FIA) program. This dataset was created to support the 2013–2027 National Insect and Disease Risk Map (NIDRM) assessment. The statistical modeling approach used data-mining software and an archive of geospatial information to find the complex relationships between GIS layers and the presence/abundance of tree species as measured in over 300,000 FIA plot locations. Unique statistical models were developed from predictor layers consisting of climate, terrain, soils, and satellite imagery. Modeled basal area (BA) and stand density index (SDI) datasets for individual tree species were further post-processed to 1) match BA and SDI histograms of FIA data, 2) ensure that the sum of individual species BA and SDI on a pixel did not exceed separately modeled total for all species BA and SDI raster datasets, and 3) derive additional tree parameters like quadratic mean diameter and trees per acre. With Landsat image collection dates ranging from 1985 to 2005, a mean collection date for treed areas of 2002, and FIA plot data generally ranging from 1999 to 2005, the vintage of the base parameter datasets varies based on location, but can be roughly considered as 2002.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset called CESNET-TimeSeries24 was collected by long-term monitoring of selected statistical metrics for 40 weeks for each IP address on the ISP network CESNET3 (Czech Education and Science Network). The dataset encompasses network traffic from more than 275,000 active IP addresses, assigned to a wide variety of devices, including office computers, NATs, servers, WiFi routers, honeypots, and video-game consoles found in dormitories. Moreover, the dataset is also rich in network anomaly types since it contains all types of anomalies, ensuring a comprehensive evaluation of anomaly detection methods.
Last but not least, the CESNET-TimeSeries24 dataset provides traffic time series at the institutional and IP subnet levels to cover all possible anomaly detection and forecasting scopes. Overall, the time series dataset was created from 66 billion IP flows comprising 4 trillion packets that carry approximately 3.7 petabytes of data. The CESNET-TimeSeries24 dataset is a complex real-world dataset that brings insight into the evaluation of forecasting models in real-world environments.
Please cite the usage of our dataset as:
Koumar, J., Hynek, K., Čejka, T. et al. CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting. Sci Data 12, 338 (2025). https://doi.org/10.1038/s41597-025-04603-x
@Article{cesnettimeseries24,
author={Koumar, Josef and Hynek, Karel and {\v{C}}ejka, Tom{\'a}{\v{s}} and {\v{S}}i{\v{s}}ka, Pavel},
title={CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting},
journal={Scientific Data},
year={2025},
month={Feb},
day={26},
volume={12},
number={1},
pages={338},
issn={2052-4463},
doi={10.1038/s41597-025-04603-x},
url={https://doi.org/10.1038/s41597-025-04603-x}
}
We create evenly spaced time series for each IP address by aggregating IP flow records into time series datapoints. The created datapoints represent the behavior of IP addresses within a defined time window of 10 minutes. The vector of time-series metrics v_{ip, i} describes the IP address ip in the i-th time window. Thus, IP flows for vector v_{ip, i} are captured in time windows starting at t_i and ending at t_{i+1}. The time series are built from these datapoints.
Datapoints created by the aggregation of IP flows contain the following time-series metrics:
Multiple time aggregation: The original datapoints in the dataset are aggregated over 10 minutes of network traffic. The size of the aggregation interval influences anomaly detection procedures, mainly the training speed of the detection model. However, the 10-minute intervals can be too short for longitudinal anomaly detection methods. Therefore, we added two more aggregation intervals to the datasets: 1 hour and 1 day.
Time series of institutions: We identify 283 institutions inside the CESNET3 network. These time series aggregated per each institution ID provide a view of the institution's data.
Time series of institutional subnets: We identify 548 institution subnets inside the CESNET3 network. These time series, aggregated per subnet, provide a view of each institution subnet's data.
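The windowed aggregation described above (one datapoint per IP per 10-minute window) can be sketched with pandas. The flow fields and metric names below are illustrative stand-ins, not the dataset's actual columns:

```python
import pandas as pd

# Toy IP flow records: one row per flow, with a timestamp and volumes.
flows = pd.DataFrame({
    "time": pd.to_datetime([
        "2024-01-01 00:01", "2024-01-01 00:07",
        "2024-01-01 00:12", "2024-01-01 00:25",
    ]),
    "ip": ["10.0.0.1"] * 4,
    "packets": [10, 5, 20, 8],
    "bytes": [1000, 500, 2000, 800],
})

# One datapoint per IP per 10-minute window: the vector v_{ip, i}.
datapoints = flows.groupby(["ip", pd.Grouper(key="time", freq="10min")]).agg(
    n_flows=("packets", "size"),
    n_packets=("packets", "sum"),
    n_bytes=("bytes", "sum"),
)
print(datapoints)
```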
The file hierarchy is described below:
cesnet-timeseries24/
|- institution_subnets/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- institutions/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- ip_addresses_full/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- ip_addresses_sample/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- times/
| |- times_10_minutes.csv
| |- times_1_hour.csv
| |- times_1_day.csv
|- ids_relationship.csv
|- weekends_and_holidays.csv
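A minimal loading sketch following the hierarchy above. Only the directory and file names come from the listing; the per-series file naming and the `id_time` join column are hypothetical, so a toy copy of the tree is built first to keep the example runnable:

```python
import tempfile
from pathlib import Path
import pandas as pd

# Build a toy copy of the hierarchy (contents and the id_time column are
# illustrative; consult the dataset itself for the real field names).
root = Path(tempfile.mkdtemp()) / "cesnet-timeseries24"
agg = root / "institutions" / "agg_10_minutes"
agg.mkdir(parents=True)
(root / "times").mkdir()

pd.DataFrame({"id_time": [0, 1]}).to_csv(
    root / "times" / "times_10_minutes.csv", index=False)
pd.DataFrame({"id_time": [0, 1], "n_flows": [12, 7]}).to_csv(
    agg / "inst_0.csv", index=False)

# Join one institution's 10-minute series with its timestamps via the
# shared times file.
series = pd.read_csv(agg / "inst_0.csv")
times = pd.read_csv(root / "times" / "times_10_minutes.csv")
joined = series.merge(times, on="id_time")
print(joined)
```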
The following list describes time series data fields in CSV files:
Moreover, the time series created by re-aggregation contain the following time series metrics instead of n_dest_ip, n_dest_asn, and n_dest_port:
We indicate how likely a piece of content is to be computer generated or human written. Content: any text in English or Spanish, from a single sentence to articles thousands of words long.
Data uniqueness: we use custom built and trained NLP algorithms to assess human effort metrics that are inherent in text content. We focus on what's in the text, not metadata such as publication or engagement. Our AI algorithms are co-created by NLP & journalism experts. Our datasets have all been human-reviewed and labeled.
Dataset: CSV containing URL and/or body text, with attributed scoring as an integer and model confidence as a percentage. We ignore metadata such as author, publication, date, word count, shares, and so on, to provide a clean and maximally unbiased assessment of how much human effort has been invested in content. Our data is provided in CSV/RSS/JSON format. One row = one scored article.
Integrity indicators provided as integers on a 1–5 scale. We also have custom models with 35 categories that can be added on request.
Data sourcing: public websites, crawlers, scrapers, and other partnerships where available. We can generally assess content behind paywalls as well as without paywalls. We source from ~4,000 news outlets; examples include Bloomberg, CNN, and the BBC. Countries: all English-speaking markets worldwide. Includes English-language content from non-English-majority regions, such as Germany, Scandinavia, and Japan. Also available in Spanish on request.
Use-cases: assessing the implicit integrity and reliability of an article. There is correlation between integrity and human value: we have shown that articles scoring highly according to our scales show increased, sustained, ongoing end-user engagement. Clients also use this to assess journalistic output, publication relevance and to create datasets of 'quality' journalism.
Overtone provides a range of qualitative metrics for journalistic, newsworthy and long-form content. We find, highlight and synthesise content that shows added human effort and, by extension, added human value.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In order to ensure transparency and reproducibility, we have made everything available publicly here, including the Code, Models, Datasets and more. All the files and their functionality used in this paper are explained clearly in the README.md file.
Background: Technical Debt (TD) needs to be controlled and tracked during software development. Current support, such as static analysis tools and even ML-based automatic tagging, is still ineffective, especially for context-dependent TD.
Aim: We study the usage of a large TD dataset in combination with cutting-edge Natural Language Processing (NLP) approaches to classify TD automatically in issue trackers, allowing the identification and tracking of informal TD conversations.
Method: We mine and analyse more than 160GB of textual data from GitHub projects, collecting over 55,600 TD issues and consolidating them into a large dataset (GTD-dataset). We then use our dataset to train state-of-the-art Transformer ML models, before performing a quantitative case study on three projects and evaluating the performance metrics during inference. Additionally, we study the adaptation of our model to classify context-dependent TD in an unseen project, by retraining the model including different percentages of the TD issues in the target project.
Results: (i) We provide the GTD-dataset, the most comprehensive dataset of TD issues to date, including issues from 6,401 unique public repositories with various contexts;
(ii) By training state-of-the-art Transformers using the GTD-dataset, we achieve performance metrics that outperform previous approaches;
(iii) We show that our model can provide a relatively reliable tool to automatically classify TD in issue trackers, especially when adapted to unseen projects where the training includes a small portion of the TD issues in the new project.
Conclusion: Our results indicate that we have taken significant steps towards closing the gap to practically and semi-automatically track TD issues in issue trackers.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset comes from a real-world manufacturing process of a Critical Manufacturing business partner. The manufacturing process is monitored via an IoT system. The dataset has been carefully anonymized due to privacy concerns; for more details on how this was done, see the accompanying thesis. In the process that generates this data, eight different readings are taken each time a particular tool is used. Eventually, once a tool begins underperforming, it is retired and does not appear in the dataset again. We believe that this dataset may be used to estimate and predict tool longevity, as it likely presents time-dependent covariates, and as such may be of use to research on multilevel survival analysis or predictive maintenance models.

Name | Type | Description
--- | --- | ---
OperationEndTime | Numerical | Difference in seconds from the first operation in the dataset.
ToolId | Numerical Key | The tool used. Its value is unique to each different tool in the dataset.
Machine | Numeric | A categorical variable representing the machine that used the tool. Its value is unique to each different machine in the dataset.
Process | Numeric | A categorical variable representing the process that used the tool. Its value is unique to each different process in the dataset.
P1DataPoint1 | Numeric | A concrete value for a reading of parameter one.
P1DataPoint2 | Numeric | A concrete value for an error metric associated with the process that generated the value in P1DataPoint1.
P2DataPoint1 | Numeric | A concrete value for a reading of parameter two.
P2DataPoint2 | Numeric | A concrete value for an error metric associated with the process that generated the value in P2DataPoint1.
... | ... | ...
P8DataPoint1 | Numeric | A concrete value for a reading of parameter eight.
P8DataPoint2 | Numeric | A concrete value for an error metric associated with the process that generated the value in P8DataPoint1.
When judging the quality of a computational system for a pathological screening task, several factors are important, such as sensitivity, specificity, and accuracy. With machine learning based approaches showing promise in the multi-label paradigm, they are being widely adopted in diagnostics and digital therapeutics. Metrics are usually borrowed from the machine learning literature, and the current consensus is to report results on a diverse set of metrics. It is infeasible to compare the efficacy of computational systems that have been evaluated on different sets of metrics. From a diagnostic utility standpoint, the current metrics themselves are far from perfect, often biased by the prevalence of negative samples or other statistical factors and, importantly, they are designed to evaluate general purpose machine learning tasks. In this paper we outline the various parameters that are important in constructing a clinical metric aligned with diagnostic practice, and demonstrate their incompatibility with existing metrics. We propose a new metric, MedTric, that takes into account several factors that are of clinical importance. MedTric is built from the ground up keeping in mind the unique context of computational diagnostics and the principle of risk minimization, penalizing missed diagnosis more harshly than over-diagnosis. MedTric is a unified metric for medical or pathological screening system evaluation. We compare this metric against other widely used metrics and demonstrate how our system outperforms them in key areas of medical relevance.
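The risk-minimization principle mentioned above can be illustrated generically. The sketch below is not the MedTric formula (which is defined in the paper); it is only a hypothetical asymmetric-cost score showing what "penalizing missed diagnosis more harshly than over-diagnosis" means in practice, with an assumed weight ratio.

```python
def asymmetric_cost_score(y_true, y_pred, fn_weight=5.0, fp_weight=1.0):
    """Hypothetical screening score in [0, 1]: false negatives (missed
    diagnoses) cost fn_weight, false positives (over-diagnosis) cost
    fp_weight, normalized by the worst possible total cost."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    positives = sum(y_true)
    # Worst case: every positive missed and every negative flagged.
    worst = fn_weight * positives + fp_weight * (len(y_true) - positives)
    return 1.0 if worst == 0 else 1.0 - (fn_weight * fn + fp_weight * fp) / worst
```

With these assumed weights, one missed diagnosis lowers the score five times as much as one over-diagnosis, which is the direction of asymmetry the paper argues a clinical metric should have.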
https://brightdata.com/license
We'll tailor a Coursera dataset to meet your unique needs, encompassing course titles, user engagement metrics, completion rates, demographic data of learners, enrollment numbers, review scores, and other pertinent metrics.
Leverage our Coursera datasets for diverse applications to bolster strategic planning and market analysis. Scrutinizing these datasets enables organizations to grasp learner preferences and online education trends, facilitating nuanced educational program development and learning initiatives. Customize your access to the entire dataset or specific subsets as per your business requisites.
Popular use cases involve optimizing educational content based on engagement insights, enhancing learning strategies through targeted learner segmentation, and identifying and forecasting trends to stay ahead in the online education landscape.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This zip file contains one folder for each condition. For each condition, the 3 repetitions of the movements for the 3 different target heights are presented in individual CSV files. (ZIP)