100+ datasets found
  1. Data from: Data Science Problems Dataset

    • paperswithcode.com
    Cite
    Shubham Chandel; Colin B. Clement; Guillermo Serrato; Neel Sundaresan, Data Science Problems Dataset [Dataset]. https://paperswithcode.com/dataset/data-science-problems
    Authors
    Shubham Chandel; Colin B. Clement; Guillermo Serrato; Neel Sundaresan
    Description

    Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of the notebooks in this benchmark also include data dependencies, so the benchmark can not only test a model's ability to chain together complex tasks but also evaluate the solutions on real data! See our paper "Training and Evaluating a Jupyter Notebook Data Science Assistant" for more details about state-of-the-art results and other properties of the dataset.

  2. Data from: Peer-to-Peer Data Mining, Privacy Issues, and Games

    • catalog.data.gov
    • data.nasa.gov
    • +2 more
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Peer-to-Peer Data Mining, Privacy Issues, and Games [Dataset]. https://catalog.data.gov/dataset/peer-to-peer-data-mining-privacy-issues-and-games
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    Peer-to-Peer (P2P) networks are gaining increasing popularity in many distributed applications such as file-sharing, network storage, web caching, searching and indexing of relevant documents and P2P network-threat analysis. Many of these applications require scalable analysis of data over a P2P network. This paper starts by offering a brief overview of distributed data mining applications and algorithms for P2P environments. Next it discusses some of the privacy concerns with P2P data mining and points out the problems of existing privacy-preserving multi-party data mining techniques. It further points out that most of the nice assumptions of these existing privacy preserving techniques fall apart in real-life applications of privacy-preserving distributed data mining (PPDM). The paper offers a more realistic formulation of the PPDM problem as a multi-party game and points out some recent results.

  3. Countries where companies face challenges with international data issues...

    • statista.com
    Updated Jul 6, 2022
    Cite
    Statista (2022). Countries where companies face challenges with international data issues 2019 [Dataset]. https://www.statista.com/statistics/997950/cross-border-data-issues-country/
    Dataset updated
    Jul 6, 2022
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Oct 9, 2018 - Nov 5, 2018
    Area covered
    Worldwide
    Description

    This statistic shows the countries where American and European organizations face regulatory challenges involving cross-border data issues in 2019. During the survey, 24 percent of respondents mentioned they faced a challenge involving cross-border data issues in the United States.

  4. Replication Data for: Issues and Actors in African Non-State Conflicts: A...

    • dataverse.harvard.edu
    Updated Sep 15, 2018
    Cite
    Nina von Uexkull; Therese Pettersson (2018). Replication Data for: Issues and Actors in African Non-State Conflicts: A new Dataset [Dataset]. http://doi.org/10.7910/DVN/IEFDSE
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 15, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Nina von Uexkull; Therese Pettersson
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Armed non-state conflict without the direct involvement of the state government is a common phenomenon. Violence between armed gangs, rebel groups or communal militias is an important source of instability and has gained increasing scholarly attention. In this article, we introduce a data collection on conflict issues and key actor characteristics in armed non-state conflicts that provides new opportunities for investigating the causes, dynamics and consequences of this form of organized violence. The data builds on and extends the UCDP Non-State Conflict dataset by introducing additional information on what the actors in the conflict are fighting over, alongside actor characteristics. It covers Africa 1989-2011. The dataset distinguishes between two main categories of issues (territory or authority), in addition to a residual category of other issues. Furthermore, we specify sub-issues within these categories, such as agricultural land/water as a sub-issue for territory and religious issues for other issues. As actor characteristics, the dataset notes whether warring parties received military support from external actors and whether religion and the mode of livelihood were salient in the mobilization of the armed group. The article presents coding processes and key features of the dataset, and points to avenues for new research based on these data.

  5. The Public Jira Dataset

    • zenodo.org
    Updated May 13, 2025
    Cite
    Lloyd Montgomery; Clara Lüders; Prof. Dr. Walid Maalej (2025). The Public Jira Dataset [Dataset]. http://doi.org/10.5281/zenodo.5882882
    Dataset updated
    May 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lloyd Montgomery; Clara Lüders; Prof. Dr. Walid Maalej
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Jira is an issue tracking system that supports software companies (among other types of companies) with managing their projects, community, and processes. This dataset is a collection of public Jira repositories downloaded from the internet using the Jira API V2. We collected data from 16 public Jira repositories containing 1822 projects and 2.7 million issues. Included in this data are historical records of 32 million changes, 8 million comments, and 1 million issue links that connect the issues in complex ways. This artefact repository contains the data as a MongoDB dump, the scripts used to download the data, the scripts used to interpret the data, and qualitative work conducted to make the data more approachable.

  6. Data from: GIRT-Data: Sampling GitHub Issue Report Templates

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 13, 2023
    Cite
    Amir Hossein Kargaran (2023). GIRT-Data: Sampling GitHub Issue Report Templates [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7724792
    Dataset updated
    Mar 13, 2023
    Dataset provided by
    Nafiseh Nikeghbal
    Hinrich Schütze
    Amir Hossein Kargaran
    Abbas Heydarnoori
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    GIRT-Data is the first and largest dataset of issue report templates (IRTs) in both YAML and Markdown format. This dataset and its corresponding open-source crawler tool are intended to support research in this area and to encourage more developers to use IRTs in their repositories. The stable version of the dataset contains 1,084,300 repositories, of which 50,032 support IRTs.

    For more details see the GitHub page of the dataset: https://github.com/kargaranamir/girt-data

    The dataset was accepted at the MSR 2023 conference under the title "GIRT-Data: Sampling GitHub Issue Report Templates".

  7. Commission on Women’s Issues Public Hearing Reports

    • catalog.data.gov
    • datacatalog.cookcountyil.gov
    • +1 more
    Updated Nov 29, 2021
    Cite
    datacatalog.cookcountyil.gov (2021). Commission on Women’s Issues Public Hearing Reports [Dataset]. https://catalog.data.gov/dataset/commission-on-womens-issues-public-hearing-reports
    Dataset updated
    Nov 29, 2021
    Dataset provided by
    datacatalog.cookcountyil.gov
    Description

    The Cook County Commission on Women’s Issues hosts an annual public hearing to address issues faced by women and girls in Cook County. The primary purpose of the hearing is educational. The plan is to use the information gathered to develop a set of recommendations for action by the County Board and other interested parties. The Commission issues a Public Hearing Report based on information presented by speakers and research which offers both insight and recommendations for change, including recommendations for how County government may assist or participate in facilitating necessary change.

  8. Generic Issues

    • catalog.data.gov
    • data.wu.ac.at
    Updated Nov 18, 2024
    Cite
    Nuclear Regulatory Commission (2024). Generic Issues [Dataset]. https://catalog.data.gov/dataset/generic-issues
    Dataset updated
    Nov 18, 2024
    Dataset provided by
    Nuclear Regulatory Commission (http://www.nrc.gov/)
    Description

    A comprehensive record of the status of issues identified since 1978, which involve public health and safety, the common defense and security, or the environment, and which could affect multiple entities under NRC jurisdiction.

  9. Dataset Alerts - Open and Monitoring

    • datasf.org
    • data.sfgov.org
    • +1 more
    application/rdfxml +5
    Updated May 12, 2025
    Cite
    (2025). Dataset Alerts - Open and Monitoring [Dataset]. https://datasf.org/opendata/
    Available download formats: json, application/rssxml, csv, tsv, xml, application/rdfxml
    Dataset updated
    May 12, 2025
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0 (http://www.opendatacommons.org/licenses/pddl/1.0/)
    License information was derived automatically

    Description

    A log of dataset alerts open, monitored or resolved on the open data portal. Alerts can include issues as well as deprecation or discontinuation notices.

  10. Pattern of Human Concerns Data, 1957-1963

    • icpsr.umich.edu
    ascii, sas, spss
    Updated Jan 12, 2006
    Cite
    Cantril, Hadley (2006). Pattern of Human Concerns Data, 1957-1963 [Dataset]. http://doi.org/10.3886/ICPSR07023.v1
    Available download formats: ascii, spss, sas
    Dataset updated
    Jan 12, 2006
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Cantril, Hadley
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/7023/terms

    Time period covered
    1957 - 1963
    Area covered
    Israel, Nigeria, India, United States, Germany, Yugoslavia, Cuba, Global, Panama, Brazil
    Description

    Of the 14 nations included in the original study, these data cover the following ten: Brazil, Cuba, Dominican Republic, India, Israel, Nigeria, Panama, United States, West Germany, and Yugoslavia. (The data for Egypt, Japan, the Philippines, and Poland are not available through ICPSR.) In India and Israel the interviews were conducted in two waves, with different samples. Besides ascertaining the usual personal information, the study employed a "Self-Anchoring Striving Scale," an open-ended scale asking the respondent to define hopes and fears for self and the nation, to determine the two extremes of a self-defined spectrum on each of several variables. After these subjective ratings were obtained, the respondents indicated their perceptions of where they and their nations stood on a hypothetical ladder at three different points in time. Demographic variables include the respondents' age, gender, marital status, and level of education. For more information on the samples, coding, and the means of measurement, see the related publication listed below.

  11. Top challenges for big data analytics implementation in companies worldwide...

    • statista.com
    Updated May 23, 2022
    Cite
    Statista (2022). Top challenges for big data analytics implementation in companies worldwide 2017 [Dataset]. https://www.statista.com/statistics/933143/worldwide-big-data-implementation-problems/
    Dataset updated
    May 23, 2022
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2017
    Area covered
    Worldwide
    Description

    The statistic shows the problems that organizations face when using big data technologies worldwide as of 2017. Around 53 percent of respondents stated that inadequate analytical know-how was a major problem that their organization faced when using big data technologies as of 2017.

  12. Dataset for the Paper: Issues and Their Causes in WebAssembly Applications:...

    • zenodo.org
    Updated Mar 7, 2024
    Cite
    Muhammad Waseem, Teerath Das, Aakash Ahmad, Peng Liang and Tommi Mikkonen (2024). Dataset for the Paper: Issues and Their Causes in WebAssembly Applications: An Empirical Study [Dataset]. http://doi.org/10.5281/zenodo.10528609
    Dataset updated
    Mar 7, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Muhammad Waseem, Teerath Das, Aakash Ahmad, Peng Liang and Tommi Mikkonen
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset accompanies the paper titled 'Issues and Their Causes in WebAssembly Applications: An Empirical Study.' The dataset is stored in a Microsoft Excel file, which comprises multiple worksheets. A brief description of each worksheet is provided below.

    (1) The 'Selected Systems' worksheet contains information on the 12 chosen open-source WebAssembly applications, along with the URL for each application.

    (2) The 'GitHub-Raw Data' worksheet contains information on the initially retrieved 6,667 issues, including the titles, links, and statuses of each individual issue discussion.

    (3) The 'SOF-Raw Data' worksheet contains information on the initially retrieved 6,667 questions and answers, including the details of each question and answer, respective links, and associated tags.

    (4) The 'GitHubData Random Selected' worksheet contains a list of issues randomly selected from the initial pool of 6,667 issues, as well as extracted data from the discussions associated with these randomly selected issues.

    (5) The 'GitHub-(Issues, Causes)' worksheet contains the initial codes categorizing the types of issues and causes.

    (6) The 'SOF (Issues, Causes)' worksheet contains information gleaned from a randomly selected subset of 354 Stack Overflow posts. This information includes the title and body of each question, the associated link, tags, as well as key points for types of issues and causes.

    (7) The 'Combine (Git and SOF) Data' worksheet contains the compiled issues and causes extracted from both GitHub and Stack Overflow.

    (8) The 'Issue Taxonomy' worksheet contains a comprehensive issue taxonomy, which is organized into 9 categories, 20 subcategories, and 120 specific types of issues.

    (9) The 'Cause Taxonomy' worksheet contains a comprehensive cause taxonomy, which is organized into 10 categories, 35 subcategories, and 278 specific types of causes.

  13. Citizen Problems (Open Data)

    • hub.arcgis.com
    • arc-gis-hub-home-arcgishub.hub.arcgis.com
    Updated Oct 4, 2019
    Cite
    ArcGIS Solutions Demonstration organization (2019). Citizen Problems (Open Data) [Dataset]. https://hub.arcgis.com/maps/2e94dc541827461e9b5827f44b89ec4f
    Dataset updated
    Oct 4, 2019
    Dataset authored and provided by
    ArcGIS Solutions Demonstration organization
    Description

    Problems reported, comments and satisfaction surveys submitted by the general public through focused citizen engagement applications.

  14. Housing Maintenance Code Complaints and Problems

    • data.cityofnewyork.us
    • catalog.data.gov
    application/rdfxml +5
    Updated Jun 7, 2025
    Cite
    Department of Housing Preservation & Development (HPD) (2025). Housing Maintenance Code Complaints and Problems [Dataset]. https://data.cityofnewyork.us/Housing-Development/Housing-Maintenance-Code-Complaints-and-Problems/ygpa-z7cr
    Available download formats: xml, tsv, csv, application/rssxml, application/rdfxml, json
    Dataset updated
    Jun 7, 2025
    Authors
    Department of Housing Preservation & Development (HPD)
    Description

    The Department of Housing Preservation and Development (HPD) records complaints that are made by the public for conditions which violate the New York City Housing Maintenance Code (HMC) or the New York State Multiple Dwelling Law (MDL).

  15. The Housing Data

    • kaggle.com
    Updated Apr 18, 2020
    Cite
    junaid wahid (2020). The Housing Data [Dataset]. https://www.kaggle.com/datasets/junaidwahid/the-housing-data
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 18, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    junaid wahid
    Description

    This dataset was created by junaid wahid.

  16. An IoT-Enriched Event Log for Smart Factories with Injected Data Quality...

    • zenodo.org
    Updated May 22, 2025
    Cite
    Joscha Grüger; Alexander Schultheis; Lukas Malburg; Yannis Bertrand (2025). An IoT-Enriched Event Log for Smart Factories with Injected Data Quality Issues [Dataset]. http://doi.org/10.5281/zenodo.15487019
    Dataset updated
    May 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joscha Grüger; Alexander Schultheis; Lukas Malburg; Yannis Bertrand
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Modern technologies such as the Internet of Things (IoT) play a key role in Smart Manufacturing and Business Process Management (BPM). In particular, process mining benefits from enriched event logs that incorporate physical sensor data. This dataset presents an IoT-enriched XES event log recorded in a physical smart factory environment. It builds upon the previously published dataset An IoT-Enriched Event Log for Process Mining in Smart Factories (available on Zenodo) and follows the DataStream XES extension. In this modified version, three types of common Data Quality Issues (DQIs) - missing sensor values, missing sensors, and time shifts - have been artificially injected into the sensor data. These issues reflect realistic challenges in industrial IoT data processing and are valuable for developing and testing robust data cleaning and analysis methods.

    By comparing the original (clean) dataset with this modified version, researchers can systematically evaluate DQI detection, handling, and solving techniques under controlled conditions. Further details on each of the three DQI types are provided in a CSV changelog in the corresponding subfolders.

  17. New Security Issues, State and Local Governments

    • catalog.data.gov
    Updated Dec 18, 2024
    Cite
    Board of Governors of the Federal Reserve System (2024). New Security Issues, State and Local Governments [Dataset]. https://catalog.data.gov/dataset/new-security-issues-state-and-local-governments
    Dataset updated
    Dec 18, 2024
    Dataset provided by
    Federal Reserve Board of Governors
    Federal Reserve System (http://www.federalreserve.gov/)
    Description

    The New Security Issues, State and Local Governments tables (1.45) are updated monthly. Data were previously published in the Supplement to the Federal Reserve Bulletin, which ceased publication in December 2008. Data sources have included: Mergent, beginning November 2011; Securities Data Company, from January 1990 to October 2011; and Investment Dealers Digest before then.

  18. Mental Health Dataset

    • kaggle.com
    Updated Mar 18, 2024
    Cite
    Bhavik Jikadara (2024). Mental Health Dataset [Dataset]. https://www.kaggle.com/datasets/bhavikjikadara/mental-health-dataset
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bhavik Jikadara
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset appears to contain a variety of features related to text analysis, sentiment analysis, and psychological indicators, likely derived from posts or text data. Some features include readability indices such as Automated Readability Index (ARI), Coleman Liau Index, and Flesch-Kincaid Grade Level, as well as sentiment analysis scores like sentiment compound, negative, neutral, and positive scores. Additionally, there are features related to psychological aspects such as economic stress, isolation, substance use, and domestic stress. The dataset seems to cover a wide range of linguistic, psychological, and behavioural attributes, potentially suitable for analyzing mental health-related topics in online communities or text data.

    Benefits of using this dataset:

    • Insight into Mental Health: The dataset provides valuable insights into mental health by analyzing linguistic patterns, sentiment, and psychological indicators in text data. Researchers and data scientists can gain a better understanding of how mental health issues manifest in online communication.
    • Predictive Modeling: With a wide range of features, including sentiment analysis scores and psychological indicators, the dataset offers opportunities for developing predictive models to identify or predict mental health outcomes based on textual data. This can be useful for early intervention and support.
    • Community Engagement: Mental health is a topic of increasing importance, and this dataset can foster community engagement on platforms like Kaggle. Data enthusiasts, researchers, and mental health professionals can collaborate to analyze the data and develop solutions to address mental health challenges.
    • Data-driven Insights: By analyzing the dataset, users can uncover correlations and patterns between linguistic features, sentiment, and mental health indicators. These insights can inform interventions, policies, and support systems aimed at promoting mental well-being.
    • Educational Resource: The dataset can serve as a valuable educational resource for teaching and learning about mental health analytics, sentiment analysis, and text mining techniques. It provides a real-world dataset for students and practitioners to apply data science skills in a meaningful context.
  19. USA Housing Data

    • kaggle.com
    Updated Nov 11, 2020
    Cite
    DimaVinn (2020). USA Housing Data [Dataset]. https://www.kaggle.com/dimavinn/usa-housing-data/metadata
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 11, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DimaVinn
    Area covered
    United States
    Description

    This dataset was created by DimaVinn.

  20. Housing Data

    • kaggle.com
    Updated Oct 12, 2023
    Cite
    Yuki S (2023). Housing Data [Dataset]. https://www.kaggle.com/datasets/yukio0/housing-data/suggestions
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yuki S
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset was created by Yuki S.

    Released under Apache 2.0.
