https://choosealicense.com/licenses/other/
The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totalling 1TB of text data. The dataset was created from the GitHub dataset on BigQuery.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The GitHub Code Clean dataset is a more filtered version of the codeparrot/github-code dataset. It consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totalling almost 1TB of text data.
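Given the size of these corpora (close to 1TB of text), streaming is usually more practical than a full download. Below is a minimal sketch for the codeparrot/github-code dataset named above, using the Hugging Face datasets library; the record field names are assumptions to verify against the dataset card, and recent datasets releases may additionally require trust_remote_code=True for script-based datasets.

```python
from datasets import load_dataset

# Stream the dataset instead of materializing ~1TB on disk.
ds = load_dataset("codeparrot/github-code", split="train", streaming=True)

for example in ds:
    # Assumed field names -- inspect one record against the dataset card first.
    print(example["language"], example["path"], len(example["code"]))
    break
```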
GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely manage analyzing large BigQuery datasets.
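For example, a small aggregate query against one of the public tables might look like the sketch below; the licenses table and its license column are assumptions drawn from the public github_repos dataset and should be checked against the current schema before relying on them.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Count repositories per license in the public GitHub dataset.
query = """
    SELECT license, COUNT(*) AS repo_count
    FROM `bigquery-public-data.github_repos.licenses`
    GROUP BY license
    ORDER BY repo_count DESC
"""
for row in client.query(query).result():
    print(row.license, row.repo_count)
```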
This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.
Repair AST parse (syntax) errors in Python code
Dataset Card for "github-code-dataset"
More Information needed
The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes: * Six million methods overall * Two million of which have associated documentation (docstrings, JavaDoc, and more) * Metadata that indicates the original location (repository or line number, for example) where the data was found
angie-chen55/python-github-code dataset hosted on Hugging Face and contributed by the HF Datasets community
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0. Contents:
metadata.zip: The dataset metadata and analysis results as CSV files.
scripts-and-logs.zip: Scripts and logs of the dataset creation.
LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
README.md: This document.
redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

Metadata

The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

repositories.csv:
ID (integer): GitHub repository ID
url (string): GitHub repository URL
downloaded (boolean): Whether cloning the repository succeeded
name (string): Repository name
description (string): Repository description
licenses (string, list of strings): Repository licenses
redistributable (boolean): Whether the repository's licenses permit redistribution
created (string, date & time): Time of the repository's creation
updated (string, date & time): Time of the last update to the repository
pushed (string, date & time): Time of the last push to the repository
fork (boolean): Whether the repository is a fork
forks (integer): Number of forks
archive (boolean): Whether the repository is archived
programs (string, list of strings): Project file path of each IaC program in the repository

programs.csv:
ID (string): Project file path of the IaC program
repository (integer): GitHub repository ID of the repository containing the IaC program
directory (string): Path of the directory containing the IaC program's project file
solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
name (string): IaC program name
description (string): IaC program description
runtime (string): Runtime string of the IaC program
testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
tests (string, list of strings): File paths of IaC program's tests

testing-files.csv:
file (string): Testing file path
language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
program (string): Project file path of the testing file's IaC program

Dataset Creation

scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:
1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

Searching Repositories

The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:
1. GitHub access token.
2. Name of the CSV output file.
3. Filename to search for.
4. File extensions to search for, separated by commas.
5. Min file size for the search (for all files: 0).
6. Max file size for the search or * for unlimited (for all files: *).

Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup

Limitations

The script uses the GitHub code search API and inherits its limitations:
Only forks with more stars than the parent repository are included.
Only the repositories' default branches are considered.
Only files smaller than 384 KB are searchable.
Only repositories with fewer than 500,000 files are considered.
Only repositories that have had activity or have been returned in search results in the last year are considered.

More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

Downloading Repositories

download-repositories.py downloads all repositories in the CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:
1. Names of the repositories CSV files generated through search-repositories.py, separated by commas.
2. Output directory to download the repositories to.
3. Name of the CSV output file.

The script only downloads a shallow recursive copy of the HEAD of each repository, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
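To make the argument order concrete, a hypothetical end-to-end run for Pulumi projects might look like the sketch below. The script names and argument order come from the descriptions above; the access token, file names, and directories are placeholders.

```python
import subprocess

# 1) Search GitHub for Pulumi projects (Pulumi.yml/Pulumi.yaml in the repo root).
subprocess.run(
    [
        "python", "search-repositories.py",
        "<GITHUB_ACCESS_TOKEN>",    # (1) GitHub access token (placeholder)
        "pulumi-repositories.csv",  # (2) CSV output file (placeholder)
        "Pulumi",                   # (3) filename to search for
        "yml,yaml",                 # (4) file extensions, comma-separated
        "0",                        # (5) min file size (all files)
        "*",                        # (6) max file size (unlimited)
    ],
    check=True,
)

# 2) Download a shallow copy of the head of every found repository.
subprocess.run(
    [
        "python", "download-repositories.py",
        "pulumi-repositories.csv",  # repositories CSV file(s), comma-separated
        "downloads/",               # output directory for the repositories
        "downloads-overview.csv",   # CSV overview of the downloads
    ],
    check=True,
)
```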
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The GitHub Java Corpus is a snapshot of all open-source Java code on GitHub in October 2012 that is contained in open-source projects that at the time had at least one fork. It contains code from 14,785 projects amounting to about 352 million lines of code. The dataset has been used to study coding practice in Java at a large scale.
The Public Git Archive is a dataset of 182,014 top-bookmarked Git repositories from GitHub totalling 6 TB. The dataset provides the source code of the projects, the related metadata, and development history.
This is a subset of the Zenodo-ML Dinosaur Dataset [Github] that has been converted to small png files and organized in folders by language, so you can jump right into using machine learning methods that assume image input.
Included are .tar.gz files, each named after a file extension; when extracted, each produces a folder of the same name.
tree -L 1
.
├── c
├── cc
├── cpp
├── cs
├── css
├── csv
├── cxx
├── data
├── f90
├── go
├── html
├── java
├── js
├── json
├── m
├── map
├── md
├── txt
└── xml
And we can peep inside one of the (somewhat smaller) folders of the set to see that the subfolders are Zenodo identifiers. A Zenodo identifier corresponds to a single GitHub repository, so the png files produced are chunks of code of that extension type from a particular repository.
$ tree map -L 1
map
├── 1001104
├── 1001659
├── 1001793
├── 1008839
├── 1009700
├── 1033697
├── 1034342
...
├── 836482
├── 838329
├── 838961
├── 840877
├── 840881
├── 844050
├── 845960
├── 848163
├── 888395
├── 891478
└── 893858
154 directories, 0 files
Within each folder (zenodo id) the files are prefixed by the zenodo id, followed by the index into the original image set array that is provided with the full dinosaur dataset archive.
$ tree m/891531/ -L 1
m/891531/
├── 891531_0.png
├── 891531_10.png
├── 891531_11.png
├── 891531_12.png
├── 891531_13.png
├── 891531_14.png
├── 891531_15.png
├── 891531_16.png
├── 891531_17.png
├── 891531_18.png
├── 891531_19.png
├── 891531_1.png
├── 891531_20.png
├── 891531_21.png
├── 891531_22.png
├── 891531_23.png
├── 891531_24.png
├── 891531_25.png
├── 891531_26.png
├── 891531_27.png
├── 891531_28.png
├── 891531_29.png
├── 891531_2.png
├── 891531_30.png
├── 891531_3.png
├── 891531_4.png
├── 891531_5.png
├── 891531_6.png
├── 891531_7.png
├── 891531_8.png
└── 891531_9.png
0 directories, 31 files
So what's the difference?
The difference is that these files are organized by extension type, and provided as actual png images. The original data is provided as numpy data frames, and is organized by zenodo ID. Both are useful for different things - this particular version is cool because we can actually see what a code image looks like.
How many images total?
We can count the number of total images:
find "." -type f -name *.png | wc -l
3,026,993
The script to create the dataset is provided here. Essentially, we start with the top extensions as identified by this work (excluding actual image files) and then write each 80x80 image to an actual png image, organizing by extension and then zenodo id (as shown above).
I tested a few methods to write the single channel 80x80 data frames as png images, and wound up liking cv2's imwrite function because it would save and then load the exact same content.
import cv2

# image is an 80x80 single-channel uint8 array; image_path is the destination .png path
cv2.imwrite(image_path, image)
Given the above, it's pretty easy to load an image! Here is an example using imageio, followed by the older (deprecated) approach using scipy.
image_path = '/tmp/data1/data/csv/1009185/1009185_0.png'
from imageio import imread
image = imread(image_path)
array([[116, 105, 109, ..., 32, 32, 32],
[ 48, 44, 48, ..., 32, 32, 32],
[ 48, 46, 49, ..., 32, 32, 32],
...,
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
image.shape
(80,80)
# Deprecated
from scipy import misc
misc.imread(image_path)
Image([[116, 105, 109, ..., 32, 32, 32],
[ 48, 44, 48, ..., 32, 32, 32],
[ 48, 46, 49, ..., 32, 32, 32],
...,
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
Remember that the values in the data are characters that have been converted to ordinal. Can you guess what 32 is?
ord(' ')
32
# And thus if you wanted to convert it back...
chr(32)
' '
So how t...
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset description
This dataset consists of sequences of Python code followed by a docstring explaining its function. It was constructed by concatenating code and text pairs from this dataset that were originally code and markdown cells in Jupyter Notebooks. The content of each example is the following: [CODE] """ Explanation: [TEXT] End of explanation """ [CODE] """ Explanation: [TEXT] End of explanation """ ...
How to use it
from datasets import load_dataset
ds =… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/github-jupyter-code-to-text.
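The snippet on the dataset page is truncated above; a minimal loading sketch, assuming only the dataset ID from the URL and the standard datasets API (check the dataset card for the exact column names), might look like this:

```python
from datasets import load_dataset

# Dataset ID taken from the URL above.
ds = load_dataset("codeparrot/github-jupyter-code-to-text", split="train")

# Inspect one concatenated code/explanation example.
print(ds[0])
```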
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A list of different projects selected to analyze class comments (available in the source code) of various languages such as Java, Python, and Pharo. The projects vary in terms of size, contributors, and domain.
## Structure
```
Projects/
Java_projects/
eclipse.zip
guava.zip
guice.zip
hadoop.zip
spark.zip
vaadin.zip
Pharo_projects/
images/
GToolkit.zip
Moose.zip
PetitParser.zip
Pillar.zip
PolyMath.zip
Roassal2.zip
Seaside.zip
vm/
70-x64/Pharo
Scripts/
ClassCommentExtraction.st
SampleSelectionScript.st
Python_projects/
django.zip
ipython.zip
Mailpile.zip
pandas.zip
pipenv.zip
pytorch.zip
requests.zip
```
## Contents of the Replication Package
---
**Projects/** contains the raw projects of each language that are used to analyze class comments.
- **Java_projects/**
- `eclipse.zip` - Eclipse project downloaded from GitHub. More detail about the project is available on GitHub [Eclipse](https://github.com/eclipse).
- `guava.zip` - Guava project downloaded from GitHub. More detail about the project is available on GitHub [Guava](https://github.com/google/guava).
- `guice.zip` - Guice project downloaded from GitHub. More detail about the project is available on GitHub [Guice](https://github.com/google/guice).
- `hadoop.zip` - Apache Hadoop project downloaded from GitHub. More detail about the project is available on GitHub [Apache Hadoop](https://github.com/apache/hadoop).
- `spark.zip` - Apache Spark project downloaded from GitHub. More detail about the project is available on GitHub [Apache Spark](https://github.com/apache/spark).
- `vaadin.zip` - Vaadin project downloaded from GitHub. More detail about the project is available on GitHub [Vaadin](https://github.com/vaadin/framework).
- **Pharo_projects/**
- **images/** -
- `GToolkit.zip` - Gtoolkit project is imported into the Pharo image. We can run this image with the virtual machine given in the `vm/` folder. The script to extract the comments is already provided in the image.
- `Moose.zip` - Moose project is imported into the Pharo image. We can run this image with the virtual machine given in the `vm/` folder. The script to extract the comments is already provided in the image.
- `PetitParser.zip` - Petit Parser project is imported into the Pharo image. We can run this image with the virtual machine given in the `vm/` folder. The script to extract the comments is already provided in the image.
- `Pillar.zip` - Pillar project is imported into the Pharo image. We can run this image with the virtual machine given in the `vm/` folder. The script to extract the comments is already provided in the image.
- `PolyMath.zip` - PolyMath project is imported into the Pharo image. We can run this image with the virtual machine given in the `vm/` folder. The script to extract the comments is already provided in the image.
- `Roassal2.zip` - Roassal2 project is imported into the Pharo image. We can run this image with the virtual machine given in the `vm/` folder. The script to extract the comments is already provided in the image.
- `Seaside.zip` - Seaside project is imported into the Pharo image. We can run this image with the virtual machine given in the `vm/` folder. The script to extract the comments is already provided in the image.
- **vm/** -
- **70-x64/Pharo** - Pharo7 (version 7 of Pharo) virtual machine to instantiate the Pharo images given in the `images/` folder. The user can run the vm on macOS and select any of the Pharo images.
- **Scripts/** - It contains the sample Smalltalk scripts to extract class comments from various projects.
- `ClassCommentExtraction.st` - A Smalltalk script to show how class comments are extracted from various Pharo projects. This script is already provided in the respective project image.
- `SampleSelectionScript.st` - A Smalltalk script to show how sample class comments of Pharo projects are selected. This script can be run in any of the Pharo images given in the `images/` folder.
- **Python_projects/**
- `django.zip` - Django project downloaded from GitHub. More detail about the project is available on GitHub [Django](https://github.com/django).
- `ipython.zip` - IPython project downloaded from GitHub. More detail about the project is available on GitHub [IPython](https://github.com/ipython/ipython).
- `Mailpile.zip` - Mailpile project downloaded from GitHub. More detail about the project is available on GitHub [Mailpile](https://github.com/mailpile/Mailpile).
- `pandas.zip` - pandas project downloaded from GitHub. More detail about the project is available on GitHub [pandas](https://github.com/pandas-dev/pandas).
- `pipenv.zip` - Pipenv project downloaded from GitHub. More detail about the project is available on GitHub [Pipenv](https://github.com/pypa/pipenv).
- `pytorch.zip` - PyTorch project downloaded from GitHub. More detail about the project is available on GitHub [PyTorch](https://github.com/pytorch/pytorch).
- `requests.zip` - Requests project downloaded from GitHub. More detail about the project is available on GitHub [Requests](https://github.com/psf/requests/).
https://github.com/microsoft/DataScienceProblems/blob/main/LICENSE.txt
Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of the notebooks in this benchmark also include data dependencies, so the benchmark can not only test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant (https://arxiv.org/abs/2201.12901) for more details about state-of-the-art results and other properties of the dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains source code reviews of 51 projects mined from Gerrit (14 projects, ~133K pull requests) and GitHub (37 projects, ~159K pull requests). The dataset has been used in the upcoming article: Lipcak, J., Rossi, B. (2018). A Large-Scale Study on Source Code Reviewer Recommendation, in 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA) 2018, IEEE. Included files: 51 mined JSON files (zip), summary project list with descriptive statistics (pdf), readme file (md).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The TerraDS dataset provides a comprehensive collection of Terraform programs written in the HashiCorp Configuration Language (HCL). As Infrastructure as Code (IaC) gains popularity for managing cloud infrastructure, Terraform has become one of the leading tools due to its declarative nature and widespread adoption. However, a lack of publicly available, large-scale datasets has hindered systematic research on Terraform practices. TerraDS addresses this gap by compiling metadata and source code from 62,406 open-source repositories with valid licenses. This dataset aims to foster research on best practices, vulnerabilities, and improvements in IaC methodologies.
The TerraDS dataset is organized into two main components: a SQLite database containing metadata and an archive of source code (~335 MB). The metadata, captured in a structured format, includes information about repositories, modules, and resources:
1. Repository Data:
2. Module Data:
3. Resource Data:
The provided archive contains the source code of the 62,406 repositories to allow further analysis based on the actual source instead of the metadata only. As such, researchers can access the permissively licensed repositories and conduct studies on the executable HCL code.
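Since the metadata ships as a SQLite database, it can be explored directly with Python's standard library. The sketch below is a minimal starting point; the database file name is a placeholder, and the actual table names should be read from sqlite_master rather than assumed.

```python
import sqlite3

# Placeholder path to the TerraDS metadata database.
conn = sqlite3.connect("terrads-metadata.sqlite")

# List the tables that actually exist before querying any of them.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print([name for (name,) in tables])

conn.close()
```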
The "HCL Dataset Tools" file contains a snapshot of the https://github.com/prg-grp/hcl-dataset-tools repository - for long term archival reasons. The tools in this repository can be used to reproduce this dataset.
One of the tools, "RepositorySearcher", can be used to fetch metadata for various other GitHub API queries, not only Terraform code. While RepositorySearcher supports other types of repository search, the other tools provided are focused on Terraform repositories.
The Stack contains over 3TB of permissively-licensed source code files covering 30 programming languages crawled from GitHub. The dataset was created as part of the BigCode Project, an open scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This repository contains the dataset for the study of computational reproducibility of Jupyter notebooks from biomedical publications. Our focus lies in evaluating the extent of reproducibility of Jupyter notebooks derived from GitHub repositories linked to publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.
Data Collection and Analysis
We use the code for assessing the reproducibility of Jupyter notebooks from the study by Pimentel et al. (2019) and adapted the code from ReproduceMeGit. We provide code for collecting the publication metadata from PubMed Central using NCBI Entrez utilities via Biopython.
Our approach involves searching PMC with the esearch function for Jupyter notebooks using the query: "(ipynb OR jupyter OR ipython) AND github". We retrieve data in XML format, capturing essential details about journals and articles. By systematically scanning the entire article, encompassing the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and Pipfile. Leveraging the GitHub API, we enrich our data by incorporating repository creation dates, update histories, pushes, and programming languages.
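A minimal sketch of that search step with Biopython's Entrez utilities follows; the contact email is a placeholder, and retmax and error handling are simplified relative to the actual pipeline.

```python
from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact address

# Search PubMed Central for articles matching the query quoted above.
handle = Entrez.esearch(
    db="pmc",
    term="(ipynb OR jupyter OR ipython) AND github",
    retmax=100,
)
record = Entrez.read(handle)
handle.close()

print(record["Count"], record["IdList"][:5])
```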
All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.
Our reproducibility pipeline was started on 27 March 2023.
Repository Structure
Our repository is organized into two main folders:
Accessing Data and Resources:
System Requirements:
Running the pipeline:
Running the analysis:
References:
Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.
Data Sources:
GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.
StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.
DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.
Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.
With our datasets, you'll receive:
Choose from various output formats, storage options, and delivery frequencies:
Why choose our Datasets?
Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.
Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.
Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.
Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.
Pricing Options:
Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!
https://www.apache.org/licenses/LICENSE-2.0.html
Artifact accompanying MSR 2022 paper titled "An Empirical Evaluation of GitHub Copilot’s Code Suggestions" by Nhan Nguyen and Sarah Nadi