100+ datasets found
  1. Github Dataset - Data Analysis

    • kaggle.com
    zip
    Updated Feb 24, 2023
    Cite
    Abhishek Ranjan (2023). Github Dataset - Data Analysis [Dataset]. https://www.kaggle.com/datasets/abhishekrp1517/github-dataset-data-analysis
    Available download formats: zip (422672 bytes)
    Dataset updated
    Feb 24, 2023
    Authors
    Abhishek Ranjan
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Dataset

    This dataset was created by Abhishek Ranjan

    Released under Database: Open Database, Contents: Database Contents

    Contents

  2. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    + more versions
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely analyze large BigQuery datasets.
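
A minimal sketch of the querying workflow described above, assuming the `google-cloud-bigquery` package is installed and credentials are configured. The helper names, the `languages` table used in the usage note, and the 1 GB byte cap are illustrative assumptions, not part of the dataset documentation. The dry run estimates how many bytes a query would scan before anything is billed.

```python
PROJECT_DATASET = "bigquery-public-data.github_repos"


def table_ref(table_name: str) -> str:
    """Return the fully qualified, backtick-quoted reference for a table."""
    if not table_name.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table_name!r}")
    return f"`{PROJECT_DATASET}.{table_name}`"


def build_sample_query(table_name: str, limit: int = 10) -> str:
    """Build a small exploratory query (hypothetical helper)."""
    return f"SELECT * FROM {table_ref(table_name)} LIMIT {int(limit)}"


def run_with_byte_cap(sql: str, max_bytes: int = 10**9):
    """Dry-run the query first and execute it only if it stays under max_bytes."""
    # Imported lazily so the string helpers above work without the package.
    from google.cloud import bigquery

    client = bigquery.Client()
    dry_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    estimate = client.query(sql, job_config=dry_config).total_bytes_processed
    if estimate > max_bytes:
        raise RuntimeError(f"query would scan {estimate} bytes (cap: {max_bytes})")
    run_config = bigquery.QueryJobConfig(maximum_bytes_billed=max_bytes)
    return list(client.query(sql, job_config=run_config).result())
```

For example, `run_with_byte_cap(build_sample_query("languages"))` would either return rows or refuse to run. Note that a `LIMIT` clause does not reduce the bytes BigQuery scans, which is why the dry-run check matters on a dataset of this size.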

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  3. Data extracted from GitHub repositories (training and test data-sets)

    • data.mendeley.com
    Updated Aug 1, 2019
    + more versions
    Cite
    Youcef Bouziane (2019). Data extracted from GitHub repositories (training and test data-sets) [Dataset]. http://doi.org/10.17632/gt3f4jnbvn.3
    Dataset updated
    Aug 1, 2019
    Authors
    Youcef Bouziane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the SQL tables of the training and test datasets used in our experimentation. These tables contain the preprocessed textual data (in the form of tokens) extracted from each training and test project. Besides the preprocessed textual data, this dataset also contains metadata about the projects, GitHub topics, and GitHub collections. The GitHub projects are identified by the tuple "Owner" and "Name". The descriptions of the table fields are attached to their respective data descriptions.

  4. Project GitHub

    • data.nasa.gov
    • s.cnmilf.com
    • +1 more
    Updated Mar 31, 2025
    + more versions
    Cite
    nasa.gov (2025). Project GitHub [Dataset]. https://data.nasa.gov/dataset/project-github
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The release of CertWare was announced on 23 Mar 2012 at code.nasa.gov. The announcement points to the CertWare project on NASA's GitHub repository at nasa.github.com/CertWare. The project site contains install instructions as an Eclipse feature, various tutorials and resources, and a link to the GitHub source repository. CertWare is released under the NASA Open Source Agreement (NOSA).

  5. GitHub Programming Languages Data

    • kaggle.com
    zip
    Updated Jan 2, 2022
    Cite
    Isaac Wen (2022). GitHub Programming Languages Data [Dataset]. https://www.kaggle.com/datasets/isaacwen/github-programming-languages-data
    Available download formats: zip (41198 bytes)
    Dataset updated
    Jan 2, 2022
    Authors
    Isaac Wen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    A common question for those new to, and those familiar with, computer science and software engineering is which programming language is the best and/or most popular. It is very difficult to give a definitive answer, as there is a seemingly endless number of metrics that can define the 'best' or 'most popular' programming language.

    One such metric that can be used to define a 'popular' programming language is the number of projects and files that are made using that programming language. As GitHub is the most popular public collaboration and file-sharing platform, analyzing the languages that are used for repositories, PRs, and issues on GitHub can be a good indicator of the popularity of a language.

    Content

    This dataset contains statistics about the programming languages used for repositories, PRs, and issues on GitHub. The data is from 2011 to 2021.

    Source

    This data was queried and aggregated from BigQuery's public github_repos and githubarchive datasets.

    Limitations

    Only public GitHub repositories, and their corresponding PRs and issues, have data available publicly. Thus, this dataset is based only on public repositories, which may not be fully representative of all repositories on GitHub.

  6. github-code-clean

    • huggingface.co
    • opendatalab.com
    Updated Aug 1, 2022
    Cite
    CodeParrot (2022). github-code-clean [Dataset]. https://huggingface.co/datasets/codeparrot/github-code-clean
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2022
    Dataset authored and provided by
    CodeParrot
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The GitHub Code clean dataset is a more filtered version of the codeparrot/github-code dataset. It consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling almost 1TB of text data.

  7. Data from: A Dataset for GitHub Repository Deduplication

    • data-staging.niaid.nih.gov
    Updated Feb 9, 2020
    Cite
    Spinellis, Diomidis; Kotti, Zoe; Mockus, Audris (2020). A Dataset for GitHub Repository Deduplication [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3653919
    Dataset updated
    Feb 9, 2020
    Dataset provided by
    University of Tennessee
    Athens University of Economics and Business
    Authors
    Spinellis, Diomidis; Kotti, Zoe; Mockus, Audris
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.
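
As a rough illustration of the connected-components step mentioned above, the sketch below groups projects with a union-find structure. It treats edges as undirected, uses made-up project names, and does not attempt the denoising or the six-metric ranking of ultimate parents; it is not the authors' implementation.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into connected components using union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Hypothetical fork/clone links between projects (login-name/project-name).
example_edges = [
    ("alice/tool", "bob/tool"),
    ("bob/tool", "carol/tool"),
    ("dave/lib", "erin/lib"),
]
components = connected_components(example_edges)
```

Each resulting set is one group of related projects; choosing the ultimate parent within a group would need the ranking metrics described in the text.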

    The dataset is provided as two files identifying GitHub repositories using the login-name/project-name convention. The file deduplicate_names contains 10,649,348 tab-separated records mapping a duplicated source project to a definitive target project.

    The file forks_clones_noise_names is a 50,324,363 member superset of the source projects, containing also projects that were excluded from the mapping as noise.
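
A minimal sketch of how the deduplicate_names mapping could be applied, assuming each record has exactly two tab-separated columns (source project, then target project) and no header row. The column order and the sample records are assumptions for illustration.

```python
import csv
import io


def load_mapping(lines):
    """Read tab-separated 'source<TAB>target' records into a dict."""
    mapping = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        source, target = row
        mapping[source] = target
    return mapping


def canonical(repo, mapping):
    """Return the definitive project for repo, or repo itself if unmapped."""
    return mapping.get(repo, repo)


# Made-up records in the login-name/project-name convention.
sample = io.StringIO("alice/proj\tbob/proj\ncarol/proj\tbob/proj\n")
mapping = load_mapping(sample)
```

With the real file, `load_mapping(open("deduplicate_names", encoding="utf-8"))` would be the equivalent call; with more than ten million records, the dict will need a correspondingly large amount of memory.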

  8. GitHub Issues and Comments

    • figshare.com
    html
    Updated Jun 8, 2022
    Cite
    Alexandra Donisan (2022). GitHub Issues and Comments [Dataset]. http://doi.org/10.6084/m9.figshare.20024303.v1
    Available download formats: html
    Dataset updated
    Jun 8, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alexandra Donisan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset containing GitHub issues (that are labeled using technical debt keywords) together with their comments. Both issues and comments have their GitHub reactions. The dataset is a MongoDB exported JSON.
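
MongoDB's export tool can write either one JSON document per line or a single JSON array, and the description does not say which was used here. The sketch below accepts both shapes; the field names in the sample are hypothetical, since the actual schema is not given.

```python
import json


def load_mongo_export(text):
    """Parse a MongoDB JSON export given as a JSON array or as JSON lines."""
    stripped = text.strip()
    if not stripped:
        return []
    if stripped.startswith("["):
        return json.loads(stripped)
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


# Hypothetical documents; real field names may differ.
as_lines = '{"title": "Refactor parser", "comments": 2}\n{"title": "Remove hack", "comments": 0}\n'
as_array = '[{"title": "Refactor parser", "comments": 2}]'

docs_from_lines = load_mongo_export(as_lines)
docs_from_array = load_mongo_export(as_array)
```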

  9. Dataset of a Study of Computational reproducibility of Jupyter notebooks...

    • zenodo.org
    pdf, zip
    Updated Jul 11, 2024
    Cite
    Sheeba Samuel; Daniel Mietchen (2024). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Dataset]. http://doi.org/10.5281/zenodo.8226725
    Available download formats: zip, pdf
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sheeba Samuel; Daniel Mietchen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This repository contains the dataset for the study of computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.

    Data Collection and Analysis

    We used the code for reproducibility of Jupyter notebooks from the study by Pimentel et al., 2019, and adapted the code from ReproduceMeGit. We provide code for collecting the publication metadata from PubMed Central using NCBI Entrez utilities via Biopython.

    Our approach involves searching PMC using the esearch function for Jupyter notebooks using the query: "(ipynb OR jupyter OR ipython) AND github". We retrieve data in XML format, capturing essential details about journals and articles. By systematically scanning the entire article, encompassing the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and Pipfile. Leveraging the GitHub API, we enrich our data by incorporating repository creation dates, update histories, pushes, and programming languages.
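
The search and link-extraction steps above can be sketched roughly as follows. This is not the study's code: the regular expression, the function names, and the sample text are illustrative assumptions, and `search_pmc` needs Biopython plus network access, so it is only defined here, not called.

```python
import re

PMC_QUERY = "(ipynb OR jupyter OR ipython) AND github"

# Assumed pattern for repository URLs of the form github.com/<owner>/<name>.
GITHUB_REPO_RE = re.compile(
    r"https?://(?:www\.)?github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)",
    re.IGNORECASE,
)


def extract_github_repos(text):
    """Return unique 'owner/name' slugs found in text, in order of appearance."""
    repos = []
    for owner, name in GITHUB_REPO_RE.findall(text):
        name = name.rstrip(".")  # drop sentence-ending punctuation
        if name.endswith(".git"):
            name = name[:-4]
        slug = f"{owner}/{name}"
        if name and slug not in repos:
            repos.append(slug)
    return repos


def search_pmc(email, retmax=20):
    """Query PubMed Central via Biopython's Entrez.esearch (requires network)."""
    from Bio import Entrez  # imported lazily; Biopython must be installed

    Entrez.email = email
    handle = Entrez.esearch(db="pmc", term=PMC_QUERY, retmax=retmax)
    try:
        return Entrez.read(handle)["IdList"]
    finally:
        handle.close()


sample_text = (
    "Code is at https://github.com/fusion-jena/computational-reproducibility-pmc. "
    "See also http://github.com/Foo/bar.git and "
    "https://github.com/fusion-jena/computational-reproducibility-pmc/tree/main."
)
found = extract_github_repos(sample_text)
```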

    All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.

    Our reproducibility pipeline was started on 27 March 2023.

    Repository Structure

    Our repository is organized into two main folders and a workflow overview file:

    • archaeology: This directory hosts scripts designed to download, parse, and extract metadata from PubMed Central publications and associated repositories. There are 24 database tables created which store the information on articles, journals, authors, repositories, notebooks, cells, modules, executions, etc. in the db.sqlite database file.
    • analyses: Here, you will find notebooks instrumental in the in-depth analysis of data related to our study. The db.sqlite file generated by running the archaeology folder is stored in the analyses folder for further analysis. The path can, however, be configured in the config.py file. There are two sets of notebooks: one set (naming pattern N[0-9]*.ipynb) is focused on examining data pertaining to repositories and notebooks, while the other set (PMC[0-9]*.ipynb) is for analyzing data associated with publications in PubMed Central, i.e. for plots involving data about articles, journals, publication dates or research fields. The resultant figures from these notebooks are stored in the 'outputs' folder.
    • MethodsWorkflow: The MethodsWorkflow file provides a conceptual overview of the workflow used in this study.

    Accessing Data and Resources:

    • All the data generated during the initial study can be accessed at https://doi.org/10.5281/zenodo.6802158
    • For the latest results and re-run data, refer to this link.
    • The comprehensive SQLite database that encapsulates all the study's extracted data is stored in the db.sqlite file.
    • The metadata in XML format extracted from PubMed Central, which contains information about the articles and journals, can be accessed in the pmc.xml file.
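
To get oriented in the db.sqlite file mentioned above, one option is to list its tables and row counts with Python's built-in sqlite3 module. The sketch below is an assumption about how one might inspect it; the table names created in the demonstration are placeholders, not the real schema.

```python
import sqlite3


def list_tables(conn):
    """Return the names of user tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def row_counts(conn):
    """Map each table name to its number of rows."""
    return {
        table: conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        for table in list_tables(conn)
    }


# Demonstration on an in-memory database with placeholder tables.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
demo.execute("CREATE TABLE notebooks (id INTEGER PRIMARY KEY, name TEXT)")
demo.execute("INSERT INTO articles (title) VALUES ('example')")
demo.commit()
counts = row_counts(demo)
```

Against the real file, `sqlite3.connect("db.sqlite")` would replace the in-memory connection.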

    System Requirements:

    Running the pipeline:

    • Clone the computational-reproducibility-pmc repository using Git:
      git clone https://github.com/fusion-jena/computational-reproducibility-pmc.git
    • Navigate to the computational-reproducibility-pmc directory:
      cd computational-reproducibility-pmc/computational-reproducibility-pmc
    • Configure environment variables in the config.py file:
      GITHUB_USERNAME = os.environ.get("JUP_GITHUB_USERNAME", "add your github username here")
      GITHUB_TOKEN = os.environ.get("JUP_GITHUB_PASSWORD", "add your github token here")
    • Other environment variables can also be set in the config.py file.
      BASE_DIR = Path(os.environ.get("JUP_BASE_DIR", "./")).expanduser() # Add the path of directory where the GitHub repositories will be saved
      DB_CONNECTION = os.environ.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite") # Add the path where the database is stored.
    • To set up conda environments for each Python version, upgrade pip, install pipenv, and install the archaeology package in each environment, execute:
      source conda-setup.sh
    • Change to the archaeology directory
      cd archaeology
    • Activate conda environment. We used py36 to run the pipeline.
      conda activate py36
    • Execute the main pipeline script (r0_main.py):
      python r0_main.py

    Running the analysis:

    • Navigate to the analyses directory.
      cd analyses
    • Activate conda environment. We use raw38 for the analysis of the metadata collected in the study.
      conda activate raw38
    • Install the required packages using the requirements.txt file.
      pip install -r requirements.txt
    • Launch JupyterLab
      jupyter lab
    • Refer to the Index.ipynb notebook for the execution order and guidance.

    References:

  10. GitHub Statistics 2025: Data That Changes Dev Work

    • sqmagazine.co.uk
    Updated Oct 3, 2025
    Cite
    SQ Magazine (2025). GitHub Statistics 2025: Data That Changes Dev Work [Dataset]. https://sqmagazine.co.uk/github-statistics/
    Dataset updated
    Oct 3, 2025
    Dataset authored and provided by
    SQ Magazine
    License

    https://sqmagazine.co.uk/privacy-policy/

    Time period covered
    Jan 1, 2024 - Dec 31, 2025
    Area covered
    Global
    Description

    GitHub remains the central hub for software collaboration, and its reach, impact, and complexity continue to expand. From small open‑source projects to enterprise deployments, GitHub shapes how code is built, shared, and maintained. For instance, large corporations rely on GitHub actions and CI/CD pipelines to streamline release cycles, while open‑source...

  11. Detailed Epidemiological Data from the COVID-19 Outbreak

    • github.com
    • catalog.midasnetwork.us
    Cite
    Open COVID-19 Data Working Group, Detailed Epidemiological Data from the COVID-19 Outbreak [Dataset]. https://github.com/beoutbreakprepared/nCoV2019
    Dataset provided by
    Open COVID-19 Data Working Group
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Data and code repository for the Open COVID-19 Data Working Group: a global and multi-organizational initiative that aims to enable rapid sharing of trusted and open public health data to advance the response to infectious diseases.

  12. Coronavirus (Covid-19) Data in the United States

    • github.com
    • openicpsr.org
    • +4 more
    csv
    Cite
    New York Times, Coronavirus (Covid-19) Data in the United States [Dataset]. https://github.com/nytimes/covid-19-data
    Available download formats: csv
    Dataset provided by
    New York Times
    License

    https://github.com/nytimes/covid-19-data/blob/master/LICENSE

    Description

    The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

    Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

    We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

    The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
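
Because the files hold cumulative counts, a common first step is to difference them into daily new cases. The sketch below assumes a state-level CSV with `date`, `state`, and `cases` columns; the column names and the sample rows are assumptions for illustration, not values taken from the repository.

```python
import csv
import io
from collections import defaultdict


def daily_new_cases(csv_text):
    """Convert cumulative case counts per state into daily increments."""
    by_state = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_state[row["state"]].append((row["date"], int(row["cases"])))

    result = {}
    for state, series in by_state.items():
        series.sort()  # ISO dates sort chronologically as strings
        previous = 0
        daily = []
        for date, cumulative in series:
            daily.append((date, cumulative - previous))
            previous = cumulative
        result[state] = daily
    return result


# Made-up sample rows in the assumed layout.
sample_csv = """date,state,cases
2020-01-21,Washington,1
2020-01-22,Washington,1
2020-01-23,Washington,3
"""
daily = daily_new_cases(sample_csv)
```

Cumulative series can be revised downward when sources correct their data, so negative daily values are possible and worth checking for before further analysis.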

  13. GitHub Public Repository Metadata

    • kaggle.com
    zip
    Updated Oct 26, 2025
    Cite
    Peter (2025). GitHub Public Repository Metadata [Dataset]. https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars
    Available download formats: zip (606866859 bytes)
    Dataset updated
    Oct 26, 2025
    Authors
    Peter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is obtained from the Github API and contains only public repository-level metadata. It may be useful for anyone interested in studying the Github ecosystem. It contains approximately 3.1 million entries.

    The Github API Terms of Service apply.

    You may not use this dataset for spamming purposes, including for the purposes of selling GitHub users' personal information, such as to recruiters, headhunters, and job boards.

    Please see the sample exploration notebook for some examples of what you can do! The data format is a JSON array of entries, an example of which is given below.

    Example entry

    {
     "owner": "pelmers",
     "name": "text-rewriter",
     "stars": 13,
     "forks": 5,
     "watchers": 4,
     "isFork": false,
     "isArchived": false,
     "languages": [ { "name": "JavaScript", "size": 21769 }, { "name": "HTML", "size": 2096 }, { "name": "CSS", "size": 2081 } ],
     "languageCount": 3,
     "topics": [ { "name": "chrome-extension", "stars": 43211 } ],
     "topicCount": 1,
     "diskUsageKb": 75,
     "pullRequests": 4,
     "issues": 12,
     "description": "Webextension to rewrite phrases in pages",
     "primaryLanguage": "JavaScript",
     "createdAt": "2015-03-14T22:35:11Z",
     "pushedAt": "2022-02-11T14:26:00Z",
     "defaultBranchCommitCount": 54,
     "license": null,
     "assignableUserCount": 1,
     "codeOfConduct": null,
     "forkingAllowed": true,
     "nameWithOwner": "pelmers/text-rewriter",
     "parent": null
    }
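
As a small illustration of working with entries in this format, the sketch below computes each language's share of a repository's code size. It uses a trimmed copy of the example entry above; the function name is hypothetical.

```python
import json

# Trimmed copy of the example entry, keeping only the fields used here.
ENTRY_JSON = """
{
  "nameWithOwner": "pelmers/text-rewriter",
  "stars": 13,
  "languages": [
    {"name": "JavaScript", "size": 21769},
    {"name": "HTML", "size": 2096},
    {"name": "CSS", "size": 2081}
  ]
}
"""


def language_shares(entry):
    """Return each language's fraction of the total size in the entry."""
    total = sum(lang["size"] for lang in entry["languages"])
    if total == 0:
        return {}
    return {lang["name"]: lang["size"] / total for lang in entry["languages"]}


entry = json.loads(ENTRY_JSON)
shares = language_shares(entry)
dominant = max(shares, key=shares.get)
```

For the full file, which is a JSON array of about 3.1 million entries, loading everything with `json.load` may use a lot of memory; a streaming parser could be a better fit.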
    

    The collection script and exploration notebook are also available on Github: https://github.com/pelmers/github-repository-metadata. For more background info, you can read my blog post.

  14. GitSED: GitHub Socially Enhanced Dataset

    • zenodo.org
    xz
    Updated Jul 2, 2021
    Cite
    Gabriel P. Oliveira; Ana Flávia C. Moura; Natércia A. Batista; Michele A. Brandão; Mirella M. Moro (2021). GitSED: GitHub Socially Enhanced Dataset [Dataset]. http://doi.org/10.5281/zenodo.5021329
    Available download formats: xz
    Dataset updated
    Jul 2, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gabriel P. Oliveira; Ana Flávia C. Moura; Natércia A. Batista; Michele A. Brandão; Mirella M. Moro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Software Engineering has evolved as a field to study not only the many ways software is created but also how it evolves, becomes successful, is effective and efficient in its objectives, satisfies its quality attributes, and much more. Nonetheless, there are still many open issues during its conception, development, and maintenance phases. Especially, understanding how developers collaborate may help in all such phases, but it is also challenging. Luckily, we may now explore a novel angle to deal with such a challenge: studying the social aspects of software development over social networks.

    With GitHub becoming the main representative of collaborative software development online tools, there are approaches to assess the follow-network, stargazer-network, and contributors-network. Moreover, having such networks built from real software projects offers support for relevant applications, such as detection of key developers, recommendation of collaboration among developers, detection of developer communities, and analyses of collaboration patterns in agile development.

    GitSED is a dataset based on GitHub that is curated (cleaned and reduced), augmented with external data, and enriched with social information on developers’ interactions. The original data is extracted from GHTorrent (an offline repository of data collected through the GitHub REST API). Our final dataset contains data from up to June 2019. It comprises:

    • 8,556,778 repositories
    • 32,411,674 developers
    • 6 programming languages (Assembly, JavaScript, Pascal, Python, Ruby, Visual Basic)
    • 13 collaboration metrics

    There are two previous versions of GitSED, which were originally built for the following conference papers:

    v2 (May 2017): Gabriel P. Oliveira, Natércia A. Batista, Michele A. Brandão, and Mirella M. Moro. Tie Strength in GitHub Heterogeneous Networks. In Proceedings of the 24th Brazilian Symposium on Multimedia and the Web (WebMedia'18), 2018.

    v1 (Sep 2015): Natércia A. Batista, Michele A. Brandão, Gabriela B. Alves, Ana Paula Couto da Silva, and Mirella M. Moro. Collaboration strength metrics and analyses on GitHub. In Proceedings of the International Conference on Web Intelligence (WI'17), 2017.

  15. github-r-repos

    • huggingface.co
    Updated Jun 6, 2023
    Cite
    Daniel Falbel (2023). github-r-repos [Dataset]. https://huggingface.co/datasets/dfalbel/github-r-repos
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 6, 2023
    Authors
    Daniel Falbel
    License

    https://choosealicense.com/licenses/other/

    Description

    GitHub R repositories dataset

    R source files from GitHub. This dataset has been created using the public GitHub datasets from Google BigQuery. This is the actual query that has been used to export the data: EXPORT DATA OPTIONS ( uri = 'gs://your-bucket/gh-r/*.parquet', format = 'PARQUET') as ( select f.id, f.repo_name, f.path, c.content, c.size from ( SELECT distinct id, repo_name, path FROM bigquery-public-data.github_repos.files where ends_with(path… See the full description on the dataset page: https://huggingface.co/datasets/dfalbel/github-r-repos.

  16. Data from: Data Science Problems

    • github.com
    • opendatalab.com
    Updated Feb 8, 2022
    Cite
    (2022). Data Science Problems [Dataset]. https://github.com/microsoft/DataScienceProblems
    Dataset updated
    Feb 8, 2022
    License

    https://github.com/microsoft/DataScienceProblems/blob/main/LICENSE.txt

    Description

    Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of notebooks in this benchmark also include data dependencies, so this benchmark not only can test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant (https://arxiv.org/abs/2201.12901) for more details about state of the art results and other properties of the dataset.

  17. Open Source Software: GitHub User and Activity Data (2008-2019)

    • openicpsr.org
    delimited
    Updated Jan 6, 2022
    Cite
    Brandon Kramer (2022). Open Source Software: GitHub User and Activity Data (2008-2019) [Dataset]. http://doi.org/10.3886/E158826V1
    Available download formats: delimited
    Dataset updated
    Jan 6, 2022
    Dataset provided by
    Edge & Node
    Authors
    Brandon Kramer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GitHub user data for past and ongoing open-source software projects, with contributors from the National Center for Science and Engineering Statistics, Bureau of Economic Analysis, University of Virginia, Coleridge Initiative, and Edge & Node.

  18. Data from: GHALogs: Large-Scale Dataset of GitHub Actions Runs

    • zenodo.org
    application/gzip, zip
    Updated Dec 5, 2024
    Cite
    Florent Moriconi; Thomas Durieux; Jean-Rémy Falleri; Raphael Troncy; Aurélien Francillon (2024). GHALogs: Large-Scale Dataset of GitHub Actions Runs [Dataset]. http://doi.org/10.5281/zenodo.10154920
    Available download formats: application/gzip, zip
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Florent Moriconi; Thomas Durieux; Jean-Rémy Falleri; Raphael Troncy; Aurélien Francillon
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    Oct 2023
    Description

    In recent years, continuous integration and deployment (CI/CD) has become increasingly popular in both the open-source community and industry. Evaluating CI/CD performance is a critical aspect of software development, as it not only helps minimize execution costs but also ensures faster feedback for developers. Despite its importance, there is limited fine-grained knowledge about the performance of CI/CD processes—knowledge that is essential for identifying bottlenecks and optimization opportunities.
    Moreover, the availability of large-scale, publicly accessible datasets of CI/CD logs remains scarce. The few datasets that do exist are often outdated and lack comprehensive coverage. To address this gap, we introduce a new dataset comprising 116k CI/CD workflows executed using GitHub Actions (GHA) across 25k public code projects spanning 20 different programming languages.
    This dataset includes 513k workflow runs encompassing 2.3 million individual steps. For each workflow run, we provide detailed metadata along with complete run logs. To the best of our knowledge, this is the largest dataset of CI/CD runs that includes full log data. The inclusion of these logs enables more in-depth analysis of CI/CD pipelines, offering insights that cannot be gleaned solely from code repositories.
    We postulate that this dataset will facilitate future CI/CD pipeline behavior research through log-based analysis. Potential applications include performance evaluation (e.g., measuring task execution times) and root cause analysis (e.g., identifying reasons for pipeline failures).

  19. WebAutomation Employee Data | Github Developer Profiles | Global 40M+...

    • datarade.ai
    .json, .csv
    Updated Dec 5, 2022
    Cite
    Webautomation (2022). WebAutomation Employee Data | Github Developer Profiles | Global 40M+ Developer Records | Explore Developer Repositories, Contributions and more [Dataset]. https://datarade.ai/data-products/webautomation-github-developer-profiles-dataset-global-webautomation
    Explore at:
    .json, .csv (available download formats)
    Dataset updated
    Dec 5, 2022
    Dataset authored and provided by
    Webautomation
    Area covered
    Canada, Falkland Islands (Malvinas), Montserrat, Uruguay, Estonia, Paraguay, Greenland, Ukraine, Guadeloupe, Suriname
    Description

    Extensive Developer Coverage: Our employee dataset includes a diverse range of developer profiles from GitHub, spanning various skill levels, industries, and expertise. Access information on developers from all corners of the software development world.

    Developer Profiles: Explore detailed developer profiles, including user bios, locations, company affiliations, and skills. Understand developer backgrounds, experiences, and areas of expertise.

    Repositories and Contributions: Access information about the repositories created by developers and their contributions to open-source projects. Analyze the projects they've worked on, their coding activity, and the impact they've made on the developer community.

    Programming Languages: Gain insights into the programming languages that developers are proficient in. Identify skilled developers in specific programming languages that align with your project needs.

    Customizable Data Delivery: The dataset is available in flexible formats, such as CSV, JSON, or API integration, allowing seamless integration with your existing data infrastructure. Customize the data to meet your specific research and analysis requirements.

  20. A Representative User-centric GitHub Developers Dataset for Malicious...

    • figshare.com
    png
    Updated Dec 29, 2022
    + more versions
    Cite
    Yushan Liu (2022). A Representative User-centric GitHub Developers Dataset for Malicious Account Detection [Dataset]. http://doi.org/10.6084/m9.figshare.21789566.v1
    Explore at:
    png (available download formats)
    Dataset updated
    Dec 29, 2022
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Yushan Liu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Using GitHub APIs, we construct an unbiased dataset of over 10 million GitHub users, collected between Jul. 20 and Aug. 27, 2018. Each data entry is stored in JSON format and represents one GitHub user. It contains the descriptive information from the user’s profile page and information on their commit activities and created or forked public repositories.

    We provide a sample of the dataset in 'Github_dataset_sample.json'. The full dataset is available for research purposes only; to obtain it, please contact chenyang AT fudan.edu.cn.

    Please cite the following paper when using the dataset: Qingyuan Gong, Yushan Liu, Jiayun Zhang, Yang Chen, Qi Li, Yu Xiao, Xin Wang, Pan Hui. Detecting Malicious Accounts in Online Developer Communities Using Deep Learning. To appear: IEEE Transactions on Knowledge and Data Engineering.
