Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of notebooks in this benchmark also include data dependencies, so this benchmark can not only test a model's ability to chain together complex tasks but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant for more details about state-of-the-art results and other properties of the dataset.
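The idea of verifying a generated solution with unit tests can be illustrated with a minimal sketch. This is not the DSP harness: the function name `passes_unit_tests` and the toy problem are hypothetical, and the real benchmark executes notebooks inside its Docker environment rather than calling `exec` in-process.

```python
def passes_unit_tests(candidate_source, test_source):
    """Run a candidate solution, then its unit tests, in one shared namespace.

    Hypothetical helper for illustration only. Executing untrusted model
    output with exec() is unsafe outside a sandbox such as a container.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)  # define the generated solution
        exec(test_source, namespace)       # assertions raise on failure
    except Exception:
        return False
    return True


# Toy problem: "Write a function mean(xs) that returns the arithmetic mean."
unit_tests = "assert mean([1, 2, 3]) == 2\nassert mean([10]) == 10"

good_candidate = "def mean(xs):\n    return sum(xs) / len(xs)"
bad_candidate = "def mean(xs):\n    return sum(xs)"

good_result = passes_unit_tests(good_candidate, unit_tests)
bad_result = passes_unit_tests(bad_candidate, unit_tests)
```

A candidate counts as correct only if every assertion passes; syntax errors and runtime exceptions are treated as failures.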
Peer-to-Peer (P2P) networks are gaining popularity in many distributed applications such as file-sharing, network storage, web caching, searching and indexing of relevant documents, and P2P network-threat analysis. Many of these applications require scalable analysis of data over a P2P network. This paper starts by offering a brief overview of distributed data mining applications and algorithms for P2P environments. Next, it discusses some of the privacy concerns with P2P data mining and points out the problems of existing privacy-preserving multi-party data mining techniques. It further points out that most of the convenient assumptions of these existing privacy-preserving techniques fall apart in real-life applications of privacy-preserving distributed data mining (PPDM). The paper offers a more realistic formulation of the PPDM problem as a multi-party game and points out some recent results.
This statistic shows the countries where American and European organizations face regulatory challenges involving cross-border data issues in 2019. During the survey, 24 percent of respondents mentioned they faced a challenge involving cross-border data issues in the United States.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Armed non-state conflict without the direct involvement of the state government is a common phenomenon. Violence between armed gangs, rebel groups or communal militias is an important source of instability and has gained increasing scholarly attention. In this article, we introduce a data collection on conflict issues and key actor characteristics in armed non-state conflicts that provides new opportunities for investigating the causes, dynamics and consequences of this form of organized violence. The data builds on and extends the UCDP Non-State Conflict dataset by introducing additional information on what the actors in the conflict are fighting over, alongside actor characteristics. It covers Africa from 1989 to 2011. The dataset distinguishes between two main categories of issues, territory and authority, in addition to a residual category of other issues. Furthermore, we specify sub-issues within these categories, such as agricultural land/water as a sub-issue for territory and religious issues as a sub-issue for other issues. As actor characteristics, the dataset notes whether warring parties received military support from external actors and whether religion and the mode of livelihood were salient in the mobilization of the armed group. The article presents coding processes and key features of the dataset, and points to avenues for new research based on these data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Jira is an issue tracking system that supports software companies (among other types of companies) with managing their projects, community, and processes. This dataset is a collection of public Jira repositories downloaded from the internet using the Jira API V2. We collected data from 16 public Jira repositories containing 1822 projects and 2.7 million issues. Included in this data are historical records of 32 million changes, 8 million comments, and 1 million issue links that connect the issues in complex ways. This artefact repository contains the data as a MongoDB dump, the scripts used to download the data, the scripts used to interpret the data, and qualitative work conducted to make the data more approachable.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
GIRT-Data is the first and largest dataset of issue report templates (IRTs) in both YAML and Markdown format. This dataset and its corresponding open-source crawler tool are intended to support research in this area and to encourage more developers to use IRTs in their repositories. The stable version of the dataset contains 1,084,300 repositories, of which 50,032 support IRTs.
For more details see the GitHub page of the dataset: https://github.com/kargaranamir/girt-data
The dataset was accepted at the MSR 2023 conference under the title "GIRT-Data: Sampling GitHub Issue Report Templates".
The Cook County Commission on Women’s Issues hosts an annual public hearing to address issues faced by women and girls in Cook County. The primary purpose of the hearing is educational. The plan is to use the information gathered to develop a set of recommendations for action by the County Board and other interested parties. The Commission issues a Public Hearing Report based on information presented by speakers and research which offers both insight and recommendations for change, including recommendations for how County government may assist or participate in facilitating necessary change.
A comprehensive record of the status of issues identified since 1978, which involve public health and safety, the common defense and security, or the environment, and which could affect multiple entities under NRC jurisdiction.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
A log of dataset alerts open, monitored or resolved on the open data portal. Alerts can include issues as well as deprecation or discontinuation notices.
https://www.icpsr.umich.edu/web/ICPSR/studies/7023/terms
Of the 14 nations included in the original study, these data cover the following ten: Brazil, Cuba, Dominican Republic, India, Israel, Nigeria, Panama, United States, West Germany, and Yugoslavia. (The data for Egypt, Japan, the Philippines, and Poland are not available through ICPSR.) In India and Israel the interviews were conducted in two waves, with different samples. Besides ascertaining the usual personal information, the study employed a "Self-Anchoring Striving Scale," an open-ended scale asking the respondent to define hopes and fears for self and the nation, to determine the two extremes of a self-defined spectrum on each of several variables. After these subjective ratings were obtained, the respondents indicated their perceptions of where they and their nations stood on a hypothetical ladder at three different points in time. Demographic variables include the respondents' age, gender, marital status, and level of education. For more information on the samples, coding, and the means of measurement, see the related publication listed below.
The statistic shows the problems that organizations face when using big data technologies worldwide as of 2017. Around 53 percent of respondents stated that inadequate analytical know-how was a major problem that their organization faced when using big data technologies as of 2017.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the paper titled 'Issues and Their Causes in WebAssembly Applications: An Empirical Study.' The dataset is stored in a Microsoft Excel file, which comprises multiple worksheets. A brief description of each worksheet is provided below.
(1) The 'Selected Systems' worksheet contains information on the 12 chosen open-source WebAssembly applications, along with the URL for each application.
(2) The 'GitHub-Raw Data' worksheet contains information on the initially retrieved 6,667 issues, including the titles, links, and statuses of each individual issue discussion.
(3) The 'SOF-Raw Data' worksheet contains information on the initially retrieved 6,667 questions and answers, including the details of each question and answer, respective links, and associated tags.
(4) The 'GitHubData Random Selected' worksheet contains a list of issues randomly selected from the initial pool of 6,667 issues, as well as extracted data from the discussions associated with these randomly selected issues.
(5) The 'GitHub-(Issues, Causes)' worksheet contains the initial codes categorizing the types of issues and causes.
(6) The 'SOF (Issues, Causes)' worksheet contains information gleaned from a randomly selected subset of 354 Stack Overflow posts. This information includes the title and body of each question, the associated link, tags, as well as key points for types of issues and causes.
(7) The 'Combine (Git and SOF) Data' worksheet contains the compiled issues and causes extracted from both GitHub and Stack Overflow.
(8) The 'Issue Taxonomy' worksheet contains a comprehensive issue taxonomy, which is organized into 9 categories, 20 subcategories, and 120 specific types of issues.
(9) The 'Cause Taxonomy' worksheet contains a comprehensive cause taxonomy, which is organized into 10 categories, 35 subcategories, and 278 specific types of causes.
Problems reported, comments and satisfaction surveys submitted by the general public through focused citizen engagement applications.
The Department of Housing Preservation and Development (HPD) records complaints that are made by the public for conditions which violate the New York City Housing Maintenance Code (HMC) or the New York State Multiple Dwelling Law (MDL).
This dataset was created by junaid wahid
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Modern technologies such as the Internet of Things (IoT) play a key role in Smart Manufacturing and Business Process Management (BPM). In particular, process mining benefits from enriched event logs that incorporate physical sensor data. This dataset presents an IoT-enriched XES event log recorded in a physical smart factory environment. It builds upon the previously published dataset “An IoT-Enriched Event Log for Process Mining in Smart Factories” (available on Zenodo) and follows the DataStream XES extension. In this modified version, three types of common Data Quality Issues (DQIs) - missing sensor values, missing sensors, and time shifts - have been artificially injected into the sensor data. These issues reflect realistic challenges in industrial IoT data processing and are valuable for developing and testing robust data cleaning and analysis methods.
By comparing the original (clean) dataset with this modified version, researchers can systematically evaluate DQI detection, handling, and solving techniques under controlled conditions. Further details for each of the three DQI types are provided in a CSV changelog in the corresponding subfolders.
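The three DQI types named above (missing sensor values, missing sensors, and time shifts) can be sketched on a toy list of sensor readings. This is only an illustration of the kind of perturbation described, not the procedure used to build the dataset: the record layout (`sensor`, `timestamp`, `value`), the sensor names, and all function names are assumptions, and the actual data is stored as XES with the DataStream extension.

```python
import random
from datetime import datetime, timedelta


def inject_missing_values(readings, fraction, rng):
    """Copy the readings, blanking each value with probability `fraction`."""
    out = []
    for reading in readings:
        reading = dict(reading)
        if rng.random() < fraction:
            reading["value"] = None  # missing sensor value
        out.append(reading)
    return out


def drop_sensor(readings, sensor_id):
    """Copy the readings without any record from `sensor_id` (missing sensor)."""
    return [dict(r) for r in readings if r["sensor"] != sensor_id]


def shift_time(readings, sensor_id, offset):
    """Copy the readings, moving one sensor's timestamps by `offset` (time shift)."""
    out = []
    for reading in readings:
        reading = dict(reading)
        if reading["sensor"] == sensor_id:
            reading["timestamp"] = reading["timestamp"] + offset
        out.append(reading)
    return out


# Hypothetical clean log: two sensors, one reading per second for five seconds.
start = datetime(2024, 1, 1, 12, 0, 0)
clean = [
    {"sensor": name, "timestamp": start + timedelta(seconds=i), "value": float(i)}
    for i in range(5)
    for name in ("temperature", "vibration")
]

rng = random.Random(0)  # fixed seed so the injected gaps are reproducible
with_gaps = inject_missing_values(clean, 0.3, rng)
without_vibration = drop_sensor(clean, "vibration")
shifted = shift_time(clean, "temperature", timedelta(seconds=2))
```

Each function returns a modified copy and leaves the clean log untouched, which mirrors the clean-versus-modified comparison the dataset is meant to support.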
The New Security Issues, State and Local Governments tables (1.45) are updated monthly. Data were previously published in the Supplement to the Federal Reserve Bulletin, which ceased publication in December 2008. Data sources have included: Mergent, beginning November 2011; Securities Data Company, from January 1990 to October 2011; and Investment Dealers Digest before then.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset appears to contain a variety of features related to text analysis, sentiment analysis, and psychological indicators, likely derived from posts or text data. Some features include readability indices such as Automated Readability Index (ARI), Coleman Liau Index, and Flesch-Kincaid Grade Level, as well as sentiment analysis scores like sentiment compound, negative, neutral, and positive scores. Additionally, there are features related to psychological aspects such as economic stress, isolation, substance use, and domestic stress. The dataset seems to cover a wide range of linguistic, psychological, and behavioural attributes, potentially suitable for analyzing mental health-related topics in online communities or text data.
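Two of the readability indices mentioned above can be computed from simple character, word, and sentence counts. The sketch below uses the standard published formulas for the Automated Readability Index and the Coleman-Liau Index; the tokenization rules and function names are simplifying assumptions, and the dataset's own feature extraction may count words and sentences differently.

```python
import re


def text_counts(text):
    """Return (characters, words, sentences) using a simple regex tokenizer.

    Assumption: a word is a run of letters, digits, or apostrophes, and a
    sentence ends at '.', '!' or '?'. Only letters and digits are counted
    as characters.
    """
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    chars = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return chars, len(words), len(sentences)


def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    chars, words, sentences = text_counts(text)
    return 4.71 * chars / words + 0.5 * words / sentences - 21.43


def coleman_liau_index(text):
    """CLI = 0.0588 * L - 0.296 * S - 15.8, with L and S per 100 words."""
    chars, words, sentences = text_counts(text)
    letters_per_100_words = 100.0 * chars / words
    sentences_per_100_words = 100.0 * sentences / words
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


simple_text = "The cat sat on the mat. It was happy."
dense_text = (
    "Notwithstanding considerable methodological heterogeneity, "
    "longitudinal investigations demonstrate consistent associations."
)

simple_ari = automated_readability_index(simple_text)
dense_ari = automated_readability_index(dense_text)
simple_cli = coleman_liau_index(simple_text)
```

Both indices approximate a school grade level, so longer words and longer sentences yield higher scores; very simple text can score below zero.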
This dataset was created by DimaVinn
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Yuki S
Released under Apache 2.0