Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains quality and source code metrics information for 60 versions across 10 different repositories. The dataset is extracted at 3 levels: (1) Class, (2) Method, (3) Package. The dataset was created by analyzing 9,420,246 lines of code and 173,237 classes. The provided dataset contains one quality_attributes folder and three associated files: repositories.csv, versions.csv, and attribute-details.csv. The first file (repositories.csv) contains general information (repository name, repository URL, number of commits, stars, forks, etc.) to help understand size, popularity, and maintainability. File versions.csv contains general information (version unique ID, number of classes, packages, external classes, external packages, version repository link) to provide an overview of the versions and how the repository grows over time. File attribute-details.csv contains detailed information (attribute name, attribute short form, category, and description) about the extracted static analysis metrics and code quality attributes. The short form is used in the dataset itself as a unique identifier for the values reported for packages, classes, and methods.
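As a minimal sketch of how the three description files relate, one can join version-level rows to repository-level metadata. The file names come from the description above, but the column names here are assumptions for illustration only:

```python
import pandas as pd

# Toy stand-ins for repositories.csv and versions.csv; column names are
# hypothetical, inferred loosely from the description.
repos = pd.DataFrame({
    "repository_name": ["repo-a", "repo-b"],
    "stars": [1200, 300],
})
versions = pd.DataFrame({
    "version_id": ["v1", "v2", "v3"],
    "repository_name": ["repo-a", "repo-a", "repo-b"],
    "number_of_classes": [150, 180, 90],
})

# One row per version, enriched with repository-level popularity info.
merged = versions.merge(repos, on="repository_name", how="left")
print(merged[["version_id", "number_of_classes", "stars"]])
```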
The report contains thirteen (13) performance metrics for the City's workforce development programs. Each metric can be broken down by three demographic types (gender, race/ethnicity, and age group) as well as by the program target population (e.g., youth and young adults, NYCHA communities). This report is a key output of an integrated data system built by the Mayor's Office for Economic Opportunity (NYC Opportunity) that collects, integrates, and generates disaggregated data. Currently, the report is generated from the integrated database, incorporating data from 18 workforce development programs managed by 5 City agencies. There has been no single "workforce development system" in the City of New York. Instead, many discrete public agencies directly manage or fund local partners to deliver a range of different services, sometimes tailored to specific populations. As a result, program data have historically been fragmented as well, making it challenging to develop insights based on a comprehensive picture. To overcome this, NYC Opportunity collects data from 5 City agencies into the integrated database, which begins to build a complete picture of how participants move through the system onto a career pathway. Each row represents a count of unique individuals for a specific performance metric, program target population, demographic group, and period. For example, if the Metric Value is 2000 with Clients Served (Metric Name), NYCHA Communities (Program Target Population), Asian (Subgroup), and 2019 (Period), you can say that "In 2019, 2,000 Asian individuals participated in programs targeting NYCHA communities." Please refer to the Workforce Data Portal for further data guidance (https://workforcedata.nyc.gov/en/data-guidance) and interactive visualizations for this report (https://workforcedata.nyc.gov/en/common-metrics).
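The row-level interpretation above can be sketched as a simple filter. The field names follow the worked example in the description, but this toy frame and the exact column labels are assumptions:

```python
import pandas as pd

# Hypothetical rows following the described schema (column names assumed).
df = pd.DataFrame([
    {"Metric Name": "Clients Served", "Program Target Population": "NYCHA Communities",
     "Subgroup": "Asian", "Period": 2019, "Metric Value": 2000},
    {"Metric Name": "Clients Served", "Program Target Population": "Youth and Young Adults",
     "Subgroup": "Asian", "Period": 2019, "Metric Value": 1500},
])

# Count of unique Asian individuals served by NYCHA-targeted programs in 2019.
row = df[(df["Metric Name"] == "Clients Served")
         & (df["Program Target Population"] == "NYCHA Communities")
         & (df["Subgroup"] == "Asian")
         & (df["Period"] == 2019)]
print(int(row["Metric Value"].iloc[0]))
```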
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data were generated for an investigation of research data repository (RDR) mentions in biomedical research articles.
Supplementary Table 1 is a discrete subset of SciCrunch RDRs used to study RDR mentions in biomedical literature. We generated this list by starting with the top 1000 entries in the SciCrunch database, measured by citations; removing entries for organizations (such as universities without a corresponding RDR) or non-relevant tools (such as reference managers); updating links; and consolidating duplicates resulting from RDR mergers and name variations. The resulting list of 737 RDRs is based on a source list of RDRs in the SciCrunch database. The file includes the Research Resource Identifier (RRID), the RDR name, and a link to the RDR record in the SciCrunch database.
Supplementary Table 2 shows the RDRs, associated journals, and article-mention pairs (records) with text snippets extracted from mined Methods text in 2020 PubMed articles. The dataset has 4 components. The first shows the list of repositories with RDR mentions, and includes the Research Resource Identifier (RRID), the RDR name, the number of articles that mention the RDR, and a link to the record in the SciCrunch database. The second shows the list of journals in the study set with at least 1 RDR mention, and includes the Journal ID, name, ESSN/ISSN, the total count of publications in 2020, the number of articles that had text available to mine, the number of article-mention pairs (records), the number of articles with RDR mentions, the number of unique RDRs mentioned, and the percentage of articles with minable text. The third shows the top 200 journals by RDR mention, normalized by the proportion of articles with available text to mine, with the same metadata as the second table. The fourth shows text snippets for each RDR mention, and includes the RRID, RDR name, PubMedID (PMID), DOI, article publication date, journal name, journal ID, ESSN/ISSN, article title, and snippet.
Machine learning approaches are often trained and evaluated with datasets that require a clear separation between positive and negative examples. This approach overly simplifies the natural subjectivity present in many tasks and content items. It also obscures the inherent diversity in human perceptions and opinions. Often, tasks that attempt to preserve the variance in content and diversity in humans are quite expensive and laborious. To fill this gap and facilitate more in-depth model performance analyses, we propose the DICES dataset - a unique dataset with diverse perspectives on the safety of AI-generated conversations. We focus on the task of safety evaluation of conversational AI systems. The DICES dataset contains detailed demographic information about each rater and extremely high replication of unique ratings per conversation to ensure the statistical significance of further analyses, and it encodes rater votes as distributions across different demographics to allow for in-depth exploration of different rating aggregation strategies.
This dataset is well suited to observe and measure variance, ambiguity and diversity in the context of safety of conversational AI. The dataset is accompanied by a paper describing a set of metrics that show how rater diversity influences the safety perception of raters from different geographic regions, ethnicity groups, age groups and genders. The goal of the DICES dataset is to be used as a shared benchmark for safety evaluation of conversational AI systems.
CONTENT WARNING: This dataset contains adversarial examples of conversations that may be offensive.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('dices', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation
Increasing heat stress due to climate change poses significant risks to human health and can lead to widespread social and economic consequences. Evaluating these impacts requires reliable datasets of heat stress projections.
Data Record
We present a global dataset projecting future dry-bulb, wet-bulb, and wet-bulb globe temperatures under 1-4°C global warming scenarios (at 0.5°C intervals) relative to the preindustrial era, using outputs from 16 CMIP6 global climate models (GCMs) (Table 1). All variables were retrieved from the historical and SSP585 scenarios, which were selected to maximize the warming signal.
The dataset was bias-corrected against ERA5 reanalysis by incorporating the GCM-simulated climate change signal onto the ERA5 baseline (1950-1976) at a 3-hourly frequency. It therefore includes a 27-year sample for each GCM under each warming target.
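The bias-correction step described above is a delta-change style approach. A minimal additive sketch of that idea follows; it is an illustration of the general method, not the authors' exact pipeline, and the toy values are invented:

```python
import numpy as np

def delta_correct(era5_baseline, gcm_hist, gcm_future):
    """Superimpose the GCM-simulated climate change signal on the ERA5 baseline.

    A simplified additive delta method: the difference between the GCM's
    future and historical values is added to the ERA5 reference series.
    """
    signal = gcm_future - gcm_hist          # GCM-simulated change signal
    return era5_baseline + signal

# Toy 3-hourly temperature samples (degrees C).
era5 = np.array([10.0, 12.5, 15.0])         # ERA5 baseline sample
hist = np.array([9.5, 12.0, 14.0])          # GCM historical
future = np.array([12.5, 15.0, 17.5])       # GCM under a warming target

corrected = delta_correct(era5, hist, future)
print(corrected)  # -> [13.  15.5 18.5]
```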
The data is provided at a fine spatial resolution of 0.25° x 0.25° and a temporal resolution of 3 hours, and is stored in a self-describing NetCDF format. Filenames follow the pattern "VAR_bias_corrected_3hr_GCM_XC_yyyy.nc", where:
"VAR" represents the variable (Ta, Tw, WBGT for dry-bulb, wet-bulb, and wet-bulb globe temperature, respectively),
"GCM" denotes the CMIP6 GCM name,
"X" indicates the warming target compared to the preindustrial period,
"yyyy" represents the year index (0001-0027) of the 27-year sample
Table 1 CMIP6 GCMs used for generating the dataset for Ta, Tw and WBGT.
GCM | Realization | GCM grid spacing | Ta | Tw | WBGT
--- | --- | --- | --- | --- | ---
ACCESS-CM2 | r1i1p1f1 | 1.25°×1.875° | ✓ | ✓ | ✓
BCC-CSM2-MR | r1i1p1f1 | 1.1°×1.125° | ✓ | ✓ | ✓
CanESM5 | r1i1p2f1 | 2.8°×2.8° | ✓ | ✓ | ✓
CMCC-CM2-SR5 | r1i1p1f1 | 0.94°×1.25° | ✓ | ✓ | ✓
CMCC-ESM2 | r1i1p1f1 | 0.94°×1.25° | ✓ | ✓ | ✓
CNRM-CM6-1 | r1i1p1f2 | 1.4°×1.4° | ✓ | ✓ |
EC-Earth3 | r1i1p1f1 | 0.7°×0.7° | ✓ | ✓ | ✓
GFDL-ESM4 | r1i1p1f1 | 1.0°×1.25° | ✓ | ✓ | ✓
HadGEM3-GC31-LL | r1i1p1f3 | 1.25°×1.875° | ✓ | ✓ | ✓
HadGEM3-GC31-MM | r1i1p1f3 | 0.55°×0.83° | ✓ | ✓ | ✓
KACE-1-0-G | r1i1p1f1 | 1.25°×1.875° | ✓ | ✓ | ✓
KIOST-ESM | r1i1p1f1 | 1.9°×1.9° | ✓ | ✓ | ✓
MIROC-ES2L | r1i1p1f2 | 2.8°×2.8° | ✓ | ✓ | ✓
MIROC6 | r1i1p1f1 | 1.4°×1.4° | ✓ | ✓ | ✓
MPI-ESM1-2-HR | r1i1p1f1 | 0.93°×0.93° | ✓ | ✓ | ✓
MPI-ESM1-2-LR | r1i1p1f1 | 1.85°×1.875° | ✓ | ✓ | ✓
Data Access
An inventory of the dataset is available in this repository. The complete dataset, approximately 57 TB in size, is freely accessible via Purdue Fortress' long-term archive through Globus at Globus Link. After clicking the link, users may be prompted to log in with a Purdue institutional Globus account. You can switch to your institutional account, or log in via a personal Globus ID, Gmail, GitHub handle, or ORCID ID. Alternatively, the dataset can be accessed by searching for the universally unique identifier (UUID): "6538f53a-1ea7-4c13-a0cf-10478190b901" in Globus.
Dataset Validation
We validate the bias-correction method and show that it significantly enhances the GCMs' accuracy in reproducing both the annual average and the full range of quantiles for all metrics within an ERA5 reference climate state. This dataset is expected to support future research on projected changes in mean and extreme heat stress and the assessment of related health and socio-economic impacts.
For a detailed introduction to the dataset and its validation, please refer to our data descriptor currently under review at Scientific Data. We will update this information upon publication.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Kudos dataset (extracted from Kudos in February 2016) is analysed in the research article titled "Analysing researchers' outreach efforts and the association with publication metrics: A case study of Kudos". This research paper is the result of a joint research collaboration between Kudos and CHESS, Nanyang Technological University, Singapore. Kudos made funds available to CHESS to perform the study and also provided the dataset used for the analysis.

In recent years, social media and scholarly collaboration networks have become increasingly accepted as effective tools for discovering and sharing research. Altmetrics are also becoming more common, as they reflect impact quickly, are openly accessible, and represent both academic and lay audiences, unlike traditional metrics such as citation counts. As a researcher, it still remains challenging to know whether efforts to increase the visibility and outreach of your research on social media are associated with improved publication metrics.

In this paper, we analyse the effectiveness of common online channels used for sharing publications, using Kudos (https://www.growkudos.com, launched in May 2014), a web-based service that aims to help researchers increase the outreach of their publications, as a case study. We extracted a dataset from Kudos of 20,775 unique publications that had been claimed by authors and for which actions had been taken to explain or share via Kudos. For 4,867 of these, full text download data from publishers was available. Our findings show that researchers are most likely to share their work on Facebook, but links shared on Twitter are most likely to be clicked on. A Mann-Whitney U test revealed that a treatment group (publications having actions in Kudos) had a significantly higher median of 149 full text downloads per publication (23.1% more) compared to a control group (having no actions in Kudos) with a median of 121 full text downloads per publication. These findings suggest that performing actions on publications, such as sharing, explaining, or enriching, could help to increase the number of full text downloads of a publication.

The DOIs of the publications in the dataset have been anonymised to protect the privacy of the users of Kudos. A readme text file is provided describing the data fields of the four datasets. All fields in the CSV files should be imported (e.g., into Excel) as text values.
Basal Area (BA). 30 meter pixel resolution. Data represents forest conditions circa 2002. These data are a product of a multi-year effort by the FHTET (Forest Health Technology Enterprise Team) Remote Sensing Program to develop raster datasets of forest parameters for each of the tree species measured in the Forest Service’s Forest Inventory and Analysis (FIA) program. This dataset was created to support the 2013–2027 National Insect and Disease Risk Map (NIDRM) assessment. The statistical modeling approach used data-mining software and an archive of geospatial information to find the complex relationships between GIS layers and the presence/abundance of tree species as measured in over 300,000 FIA plot locations. Unique statistical models were developed from predictor layers consisting of climate, terrain, soils, and satellite imagery. Modeled basal area (BA) and stand density index (SDI) datasets for individual tree species were further post-processed to 1) match BA and SDI histograms of FIA data, 2) ensure that the sum of individual species BA and SDI on a pixel did not exceed separately modeled total for all species BA and SDI raster datasets, and 3) derive additional tree parameters like quadratic mean diameter and trees per acre. With Landsat image collection dates ranging from 1985 to 2005, a mean collection date for treed areas of 2002, and FIA plot data generally ranging from 1999 to 2005, the vintage of the base parameter datasets varies based on location, but can be roughly considered as 2002.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides LiDAR-derived watershed boundaries for all of Calvert and Hecate Islands, British Columbia. Watersheds were delineated from a 3 m digital elevation model. For each watershed polygon, the dataset includes a unique identifier and simple summary statistics describing the topography and hydrology. Watershed polygons: this dataset was produced from the results of "traditional" hydrological modelling conducted using the topographically complete bare-earth DEM based on the 2012 + 2014 lidar, with a 10 m buffer around the coastline to ensure that all modelled watersheds reach the ocean. Watersheds were delineated using pour points created at the intersection of modelled streams and the shoreline. After watershed delineation, the polygons were clipped to the island shoreline.
Climate change has been shown to influence lake temperatures globally. To better understand the diversity of lake responses to climate change and give managers tools to manage individual lakes, we modelled daily water temperature profiles for 10,774 lakes in Michigan, Minnesota and Wisconsin for contemporary (1979-2015) and future (2020-2040 and 2080-2100) time periods with climate models based on the Representative Concentration Pathway 8.5, the worst-case emission scenario. From simulated temperatures, we derived commonly used, ecologically relevant annual metrics of thermal conditions for each lake. We included all available supporting metadata including satellite and in-situ observations of water clarity, maximum observed lake depth, land-cover based estimates of surrounding canopy height and observed water temperature profiles (used here for validation). This unique dataset offers landscape-level insight into the future impact of climate change on lakes. This data set contains the following parameters: Thermal metrics, Spatial data, Temperature data, Model drivers, Model configuration, which are defined below.
The ONC Regional Extension Centers (REC) Program provides assistance to health care providers to adopt and meaningfully use certified EHR technology. The program, funded through the American Recovery and Reinvestment Act (ARRA), or The Recovery Act, provides grants to organizations, Regional Extension Centers, that assist providers directly in the organization's region. There are 62 unique RECs across the United States. This data set provides county-level health care professional participation in the REC Program. You can track metrics on the total primary care and non-primary care providers that have signed up for REC assistance, gone live with an EHR, and demonstrated meaningful use of certified EHR technology. See ONC's REC data by state to track these metrics at the state level.
This benchmark data was used to train and evaluate the models presented in the paper: A. Partin and P. Vasanthakumari et al. "Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis"
The benchmark data for Cross-Study Analysis (CSA) include four kinds of data: cell line response data, cell line multi-omics data, drug feature data, and data partitions. The figure below illustrates the curation, processing, and assembly of benchmark data, and a unified schema for data curation. Cell line response data were extracted from five sources: the Cancer Cell Line Encyclopedia (CCLE), the Cancer Therapeutics Response Portal version 2 (CTRPv2), the Genomics of Drug Sensitivity in Cancer version 1 (GDSC1), the Genomics of Drug Sensitivity in Cancer version 2 (GDSC2), and the Genentech Cell Line Screening Initiative (GCSI). These are five large-scale cell line drug screening studies. We extracted their multi-dose viability data and used a unified dose response fitting pipeline to calculate multiple dose-independent response metrics as shown in the figure below, such as the area under the dose response curve (AUC) and the half-maximal inhibitory concentration (IC50). The multi-omics data of cell lines were extracted from the Dependency Map (DepMap) portal of CCLE, including gene expressions, DNA mutations, DNA methylation, gene copy numbers, protein expressions measured by reverse phase protein array (RPPA), and miRNA expressions. Data preprocessing was performed, such as discretizing gene copy numbers and mapping between different gene identifier systems. Drug information was retrieved from PubChem. Based on the drug SMILES (Simplified Molecular Input Line Entry Specification) strings, we calculated their molecular fingerprints and descriptors using the Mordred and RDKit Python packages. Data partition files were generated using the IMPROVE benchmark data preparation pipeline. They indicate, for each modeling analysis run, which samples should be included in the training, validation, and testing sets for building and evaluating the drug response prediction (DRP) models.
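As an illustration of how dose-independent metrics such as IC50 and AUC are derived from multi-dose viability data, here is a minimal sketch that fits a two-parameter Hill curve. It is a generic example, not the benchmark's unified fitting pipeline, and the viability values are invented:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.integrate import trapezoid

def hill(log_dose, ic50_log, slope):
    """Two-parameter Hill curve: viability falls from 1 to 0 with dose."""
    return 1.0 / (1.0 + 10 ** (slope * (log_dose - ic50_log)))

# Toy multi-dose viability data (log10 molar doses, fractions of control).
log_doses = np.array([-9.0, -8.0, -7.0, -6.0, -5.0])
viability = np.array([0.98, 0.90, 0.55, 0.15, 0.05])

(ic50_log, slope), _ = curve_fit(hill, log_doses, viability, p0=[-7.0, 1.0])

# Dose-independent metrics: IC50 and area under the fitted curve,
# normalized to the tested dose range so AUC lies in [0, 1].
grid = np.linspace(log_doses.min(), log_doses.max(), 200)
auc = trapezoid(hill(grid, ic50_log, slope), grid) / (log_doses.max() - log_doses.min())
print(f"log10(IC50) ~ {ic50_log:.2f}, AUC ~ {auc:.2f}")
```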
The Table below shows the numbers of cell lines, drugs, and experiments in each dataset. Across the five datasets, there are 785 unique cell lines and 749 unique drugs. All cell lines have gene expression, mutation, DNA methylation, and copy number data available. 760 of the cell lines have RPPA protein expressions, and 781 of them have miRNA expressions.
Further description is provided here: https://jdacs4c-improve.github.io/docs/content/app_drp_benchmark.html
https://brightdata.com/license
We'll tailor a Udacity dataset to meet your unique needs, encompassing course titles, user engagement metrics, completion rates, enrollment numbers, review scores, and other pertinent metrics.
Leverage our Udacity datasets for diverse applications to bolster strategic planning and market analysis. Scrutinizing these datasets enables organizations to grasp learner preferences and online education trends, facilitating nuanced educational program development and learning initiatives. Customize your access to the entire dataset or specific subsets as per your business requisites.
Popular use cases involve optimizing educational content based on engagement insights, enhancing learning strategies through targeted learner segmentation, and identifying and forecasting trends to stay ahead in the online education landscape.
These data are a product of a multi-year effort by the FHTET (Forest Health Technology Enterprise Team) Remote Sensing Program to develop raster datasets of forest parameters for each of the tree species measured in the Forest Service’s Forest Inventory and Analysis (FIA) program. This dataset was created to support the 2013–2027 National Insect and Disease Risk Map (NIDRM) assessment. The statistical modeling approach used data-mining software and an archive of geospatial information to find the complex relationships between GIS layers and the presence/abundance of tree species as measured in over 300,000 FIA plot locations. Unique statistical models were developed from predictor layers consisting of climate, terrain, soils, and satellite imagery. Modeled basal area (BA) and stand density index (SDI) datasets for individual tree species were further post-processed to 1) match BA and SDI histograms of FIA data, 2) ensure that the sum of individual species BA and SDI on a pixel did not exceed separately modeled total for all species BA and SDI raster datasets, and 3) derive additional tree parameters like quadratic mean diameter and trees per acre. With Landsat image collection dates ranging from 1985 to 2005, a mean collection date for treed areas of 2002, and FIA plot data generally ranging from 1999 to 2005, the vintage of the base parameter datasets varies based on location, but can be roughly considered as 2002.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset called CESNET-TimeSeries24 was collected by long-term monitoring of selected statistical metrics for 40 weeks for each IP address on the ISP network CESNET3 (Czech Education and Science Network). The dataset encompasses network traffic from more than 275,000 active IP addresses, assigned to a wide variety of devices, including office computers, NATs, servers, WiFi routers, honeypots, and video-game consoles found in dormitories. Moreover, the dataset is also rich in network anomaly types since it contains all types of anomalies, ensuring a comprehensive evaluation of anomaly detection methods.
Last but not least, the CESNET-TimeSeries24 dataset provides traffic time series at the institutional and IP subnet levels to cover all possible anomaly detection and forecasting scopes. Overall, the time series dataset was created from 66 billion IP flows comprising 4 trillion packets that carry approximately 3.7 petabytes of data. The CESNET-TimeSeries24 dataset is a complex real-world dataset that brings insight into the evaluation of forecasting models in real-world environments.
Please cite the usage of our dataset as:
Koumar, J., Hynek, K., Čejka, T. et al. CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting. Sci Data 12, 338 (2025). https://doi.org/10.1038/s41597-025-04603-x
@Article{cesnettimeseries24,
author={Koumar, Josef and Hynek, Karel and {\v{C}}ejka, Tom{\'a}{\v{s}} and {\v{S}}i{\v{s}}ka, Pavel},
title={CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting},
journal={Scientific Data},
year={2025},
month={Feb},
day={26},
volume={12},
number={1},
pages={338},
issn={2052-4463},
doi={10.1038/s41597-025-04603-x},
url={https://doi.org/10.1038/s41597-025-04603-x}
}
We create evenly spaced time series for each IP address by aggregating IP flow records into time series datapoints. The created datapoints represent the behavior of IP addresses within a defined time window of 10 minutes. The vector of time-series metrics v_{ip, i} describes the IP address ip in the i-th time window. Thus, IP flows for vector v_{ip, i} are captured in time windows starting at t_i and ending at t_{i+1}. The time series are built from these datapoints.
Datapoints created by the aggregation of IP flows contain the following time-series metrics:
Multiple time aggregation: The original datapoints in the dataset are aggregated over 10 minutes of network traffic. The size of the aggregation interval influences anomaly detection procedures, mainly the training speed of the detection model. However, the 10-minute intervals can be too short for longitudinal anomaly detection methods. Therefore, we added two more aggregation intervals to the datasets: 1 hour and 1 day.
Time series of institutions: We identify 283 institutions inside the CESNET3 network. These time series aggregated per each institution ID provide a view of the institution's data.
Time series of institutional subnets: We identify 548 institution subnets inside the CESNET3 network. These time series, aggregated per subnet, provide a view of each institution subnet's data.
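The windowed aggregation described above (one datapoint per IP per 10-minute window) can be sketched with pandas. The flow fields and metric names below are illustrative stand-ins, not the dataset's actual columns:

```python
import pandas as pd

# Toy IP flow records: one row per flow, with a timestamp and volumes.
flows = pd.DataFrame({
    "time": pd.to_datetime([
        "2024-01-01 00:01", "2024-01-01 00:07",
        "2024-01-01 00:12", "2024-01-01 00:25",
    ]),
    "ip": ["10.0.0.1"] * 4,
    "packets": [10, 5, 20, 8],
    "bytes": [1000, 500, 2000, 800],
})

# One datapoint per IP per 10-minute window: the vector v_{ip, i}.
datapoints = flows.groupby(["ip", pd.Grouper(key="time", freq="10min")]).agg(
    n_flows=("packets", "size"),
    n_packets=("packets", "sum"),
    n_bytes=("bytes", "sum"),
)
print(datapoints)
```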
The file hierarchy is described below:
cesnet-timeseries24/
|- institution_subnets/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- institutions/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- ip_addresses_full/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- ip_addresses_sample/
| |- agg_10_minutes/
| |- agg_1_hour/
| |- agg_1_day/
| |- identifiers.csv
|- times/
| |- times_10_minutes.csv
| |- times_1_hour.csv
| |- times_1_day.csv
|- ids_relationship.csv
|- weekends_and_holidays.csv
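A minimal loading sketch following the hierarchy above. Only the directory and file names come from the listing; the per-series file naming and the `id_time` join column are hypothetical, so a toy copy of the tree is built first to keep the example runnable:

```python
import tempfile
from pathlib import Path
import pandas as pd

# Build a toy copy of the hierarchy (contents and the id_time column are
# illustrative; consult the dataset itself for the real field names).
root = Path(tempfile.mkdtemp()) / "cesnet-timeseries24"
agg = root / "institutions" / "agg_10_minutes"
agg.mkdir(parents=True)
(root / "times").mkdir()

pd.DataFrame({"id_time": [0, 1]}).to_csv(
    root / "times" / "times_10_minutes.csv", index=False)
pd.DataFrame({"id_time": [0, 1], "n_flows": [12, 7]}).to_csv(
    agg / "inst_0.csv", index=False)

# Join one institution's 10-minute series with its timestamps via the
# shared times file.
series = pd.read_csv(agg / "inst_0.csv")
times = pd.read_csv(root / "times" / "times_10_minutes.csv")
joined = series.merge(times, on="id_time")
print(joined)
```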
The following list describes time series data fields in CSV files:
Moreover, the time series created by re-aggregation contain the following time series metrics instead of n_dest_ip, n_dest_asn, and n_dest_port:
We indicate how likely a piece of content is to be computer generated or human written. Content: any text in English or Spanish, from a single sentence to articles thousands of words long.
Data uniqueness: we use custom built and trained NLP algorithms to assess human effort metrics that are inherent in text content. We focus on what's in the text, not metadata such as publication or engagement. Our AI algorithms are co-created by NLP & journalism experts. Our datasets have all been human-reviewed and labeled.
Dataset: CSV containing URL and/or body text, with attributed scoring as an integer and model confidence as a percentage. We ignore metadata such as author, publication, date, word count, shares, and so on, to provide a clean and maximally unbiased assessment of how much human effort has been invested in content. Our data is provided in CSV/RSS/JSON format. One row = one scored article.
Integrity indicators provided as integers on a 1–5 scale. We also have custom models with 35 categories that can be added on request.
Data sourcing: public websites, crawlers, scrapers, and other partnerships where available. We can generally assess content behind paywalls as well as without paywalls. We source from ~4,000 news outlets; examples include Bloomberg, CNN, and the BBC. Countries: all English-speaking markets worldwide. Includes English-language content from non-English-majority regions, such as Germany, Scandinavia, and Japan. Also available in Spanish on request.
Use-cases: assessing the implicit integrity and reliability of an article. There is correlation between integrity and human value: we have shown that articles scoring highly according to our scales show increased, sustained, ongoing end-user engagement. Clients also use this to assess journalistic output, publication relevance and to create datasets of 'quality' journalism.
Overtone provides a range of qualitative metrics for journalistic, newsworthy and long-form content. We find, highlight and synthesise content that shows added human effort and, by extension, added human value.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In order to ensure transparency and reproducibility, we have made everything available publicly here, including the Code, Models, Datasets and more. All the files and their functionality used in this paper are explained clearly in the README.md file.
Background: Technical Debt (TD) needs to be controlled and tracked during software development. Current support, such as static analysis tools and even ML-based automatic tagging, is still ineffective, especially for context-dependent TD.
Aim: We study the usage of a large TD dataset in combination with cutting-edge Natural Language Processing (NLP) approaches to classify TD automatically in issue trackers, allowing the identification and tracking of informal TD conversations.
Method: We mine and analyse more than 160GB of textual data from GitHub projects, collecting over 55,600 TD issues and consolidating them into a large dataset (GTD-dataset). We then use our dataset to train state-of-the-art Transformer ML models, before performing a quantitative case study on three projects and evaluating the performance metrics during inference. Additionally, we study the adaptation of our model to classify context-dependent TD in an unseen project, by retraining the model including different percentages of the TD issues in the target project.
Results: (i) We provide the GTD-dataset, the most comprehensive dataset of TD issues to date, including issues from 6,401 unique public repositories with various contexts;
(ii) By training state-of-the-art Transformers using the GTD-dataset, we achieve performance metrics that outperform previous approaches;
(iii) We show that our model can provide a relatively reliable tool to automatically classify TD in issue trackers, especially when adapted to unseen projects where the training includes a small portion of the TD issues in the new project.
Conclusion: Our results indicate that we have taken significant steps towards closing the gap to practically and semi-automatically track TD issues in issue trackers.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset comes from a real-world manufacturing process of a Critical Manufacturing business partner. The manufacturing process is monitored via an IoT system. The dataset has been carefully anonymized due to privacy concerns; for more details on how this was done, see the accompanying thesis. In the process that generates this data, eight different readings are taken each time a particular tool is used. Eventually, once a tool begins underperforming, it is retired and does not appear in the dataset again. We believe that this dataset may be used to estimate and predict tool longevity, as it likely presents time-dependent covariates, and as such may be of use to research on multilevel survival analysis or predictive maintenance models.

Name | Type | Description
--- | --- | ---
OperationEndTime | Numerical | Difference in seconds from the first operation in the dataset.
ToolId | Numerical Key | The tool used. Its value is unique to each different tool in the dataset.
Machine | Numeric | A categorical variable representing the machine that used the tool. Its value is unique to each different machine in the dataset.
Process | Numeric | A categorical variable representing the process that used the tool. Its value is unique to each different process in the dataset.
P1DataPoint1 | Numeric | A concrete value for a reading of parameter one.
P1DataPoint2 | Numeric | A concrete value for an error metric associated with the process that generated the value in P1DataPoint1.
P2DataPoint1 | Numeric | A concrete value for a reading of parameter two.
P2DataPoint2 | Numeric | A concrete value for an error metric associated with the process that generated the value in P2DataPoint1.
... | ... | ...
P8DataPoint1 | Numeric | A concrete value for a reading of parameter eight.
P8DataPoint2 | Numeric | A concrete value for an error metric associated with the process that generated the value in P8DataPoint1.
When judging the quality of a computational system for a pathological screening task, several factors are important, such as sensitivity, specificity, and accuracy. With machine learning based approaches showing promise in the multi-label paradigm, they are being widely adopted in diagnostics and digital therapeutics. Metrics are usually borrowed from the machine learning literature, and the current consensus is to report results on a diverse set of metrics. It is infeasible to compare the efficacy of computational systems that have been evaluated on different sets of metrics. From a diagnostic utility standpoint, the current metrics themselves are far from perfect, often biased by the prevalence of negative samples or other statistical factors and, importantly, they are designed to evaluate general purpose machine learning tasks. In this paper we outline the various parameters that are important in constructing a clinical metric aligned with diagnostic practice, and demonstrate their incompatibility with existing metrics. We propose a new metric, MedTric, that takes into account several factors that are of clinical importance. MedTric is built from the ground up keeping in mind the unique context of computational diagnostics and the principle of risk minimization, penalizing missed diagnosis more harshly than over-diagnosis. MedTric is a unified metric for medical or pathological screening system evaluation. We compare this metric against other widely used metrics and demonstrate how our system outperforms them in key areas of medical relevance.
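The risk-minimization principle mentioned above can be illustrated generically. The sketch below is not the MedTric formula (which is defined in the paper); it is only a hypothetical asymmetric-cost score showing what "penalizing missed diagnosis more harshly than over-diagnosis" means in practice, with an assumed weight ratio.

```python
def asymmetric_cost_score(y_true, y_pred, fn_weight=5.0, fp_weight=1.0):
    """Hypothetical screening score in [0, 1]: false negatives (missed
    diagnoses) cost fn_weight, false positives (over-diagnosis) cost
    fp_weight, normalized by the worst possible total cost."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    positives = sum(y_true)
    # Worst case: every positive missed and every negative flagged.
    worst = fn_weight * positives + fp_weight * (len(y_true) - positives)
    return 1.0 if worst == 0 else 1.0 - (fn_weight * fn + fp_weight * fp) / worst
```

With these assumed weights, one missed diagnosis lowers the score five times as much as one over-diagnosis, which is the direction of asymmetry the paper argues a clinical metric should have.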
https://brightdata.com/license
We'll tailor a Coursera dataset to meet your unique needs, encompassing course titles, user engagement metrics, completion rates, demographic data of learners, enrollment numbers, review scores, and other pertinent metrics.
Leverage our Coursera datasets for diverse applications to bolster strategic planning and market analysis. Scrutinizing these datasets enables organizations to grasp learner preferences and online education trends, facilitating nuanced educational program development and learning initiatives. Customize your access to the entire dataset or specific subsets as per your business requisites.
Popular use cases involve optimizing educational content based on engagement insights, enhancing learning strategies through targeted learner segmentation, and identifying and forecasting trends to stay ahead in the online education landscape.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This zip file contains one folder for each condition. For each condition, the 3 repetitions of the movements for the 3 different target heights are presented in individual CSV files. (ZIP)