85 datasets found
  1. Influence of Continuous Integration on the Development Activity in GitHub...

    • zenodo.org
    csv
    Updated Jan 24, 2020
    + more versions
    Cite
    Sebastian Baltes; Jascha Knack (2020). Influence of Continuous Integration on the Development Activity in GitHub Projects [Dataset]. http://doi.org/10.5281/zenodo.1140261
    Explore at:
    csv (available download formats)
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sebastian Baltes; Jascha Knack
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 - 2017-07-17.

    We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:

    1. were active for one year before the first build with Travis CI (before_ci),
    2. used Travis CI at least for one year (during_ci),
    3. had commit or merge activity on the default branch in both of these phases, and
    4. used the default branch to trigger builds.

    To derive the time frames, we employed the GHTorrent BigQuery data set. The resulting sample contains 321 projects: 214 Ruby projects and 107 Java projects. The mean time span before_ci was 2.9 years (SD=1.9, Mdn=2.3), and the mean time span during_ci was 3.2 years (SD=1.1, Mdn=3.3). For our analysis, we only consider the activity one year before and one year after the first build.

    We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).
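
    The commit and merge logs described above can be approximated with plain git commands. The following sketch is only an illustration of the idea (the authors' actual tooling is the git-log-parser linked above); the output format string and the local repository path are placeholders.

    ```python
    # Sketch: list non-merge commits on a branch that touch Java or Ruby files,
    # roughly mirroring the extraction described above. Not the authors' parser;
    # see https://github.com/sbaltes/git-log-parser for the real tooling.
    import subprocess

    def list_commits(repo_dir, branch, merges=False):
        fmt = "%H,%an,%ae,%aI,%cn,%ce,%cI"  # hash, author, and committer metadata
        cmd = [
            "git", "-C", repo_dir, "log", branch,
            "--merges" if merges else "--no-merges",
            f"--pretty=format:{fmt}",
            "--", "*.java", "*.rb",  # extension filter for Java/Ruby source files
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout.splitlines()

    # Example with a hypothetical local clone:
    # print(list_commits("/tmp/some-project", "master")[:5])
    ```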

    The dataset contains the following files:

    tr_projects_sample_filtered.csv
    A CSV file with information about the 321 selected projects.

    tr_sample_commits_default_branch_before_ci.csv
    tr_sample_commits_default_branch_during_ci.csv

    One CSV file each with information about all commits to the default branch before and after the first CI build, respectively. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. These CSV files have the following columns:

    project: GitHub project name ("/" replaced by "_").
    branch: The branch to which the commit was made.
    hash_value: The SHA1 hash value of the commit.
    author_name: The author name.
    author_email: The author email address.
    author_date: The authoring timestamp.
    commit_name: The committer name.
    commit_email: The committer email address.
    commit_date: The commit timestamp.
    log_message_length: The length of the git commit messages (in characters).
    file_count: Files changed with this commit.
    lines_added: Lines added to all files changed with this commit.
    lines_deleted: Lines deleted in all files changed with this commit.
    file_extensions: Distinct file extensions of files changed with this commit.

    tr_sample_merges_default_branch_before_ci.csv
    tr_sample_merges_default_branch_during_ci.csv

    One CSV file each with information about all merges into the default branch before and after the first CI build, respectively. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. These CSV files have the following columns (a short loading sketch follows the column list):

    project: GitHub project name ("/" replaced by "_").
    branch: The destination branch of the merge.
    hash_value: The SHA1 hash value of the merge commit.
    merged_commits: Unique hash value prefixes of the commits merged with this commit.
    author_name: The author name.
    author_email: The author email address.
    author_date: The authoring timestamp.
    commit_name: The committer name.
    commit_email: The committer email address.
    commit_date: The commit timestamp.
    log_message_length: The length of the git commit messages (in characters).
    file_count: Files changed with this commit.
    lines_added: Lines added to all files changed with this commit.
    lines_deleted: Lines deleted in all files changed with this commit.
    file_extensions: Distinct file extensions of files changed with this commit.
    pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
    source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
    source_branch: Source branch of the pull request (extracted from log message).
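
    As a minimal sketch of how these column layouts can be consumed (assuming comma-separated files with a header row, which the description does not state explicitly), the merge CSVs can be compared across the two phases with pandas:

    ```python
    # Minimal sketch, assuming comma-separated files with a header row.
    import pandas as pd

    merges_before = pd.read_csv("tr_sample_merges_default_branch_before_ci.csv")
    merges_during = pd.read_csv("tr_sample_merges_default_branch_during_ci.csv")

    # Number of merges and share of merges coming from GitHub pull requests, per phase.
    for phase, df in [("before_ci", merges_before), ("during_ci", merges_during)]:
        pr_share = df["pull_request_id"].notna().mean()
        print(phase, len(df), round(pr_share, 3))
    ```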

  2. GitTables 1M - CSV files

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Jun 6, 2022
    Cite
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth (2022). GitTables 1M - CSV files [Dataset]. http://doi.org/10.5281/zenodo.6515973
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 6, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains >800K CSV files behind the GitTables 1M corpus.

    For more information about the GitTables corpus, visit:

    - our website for GitTables, or

    - the main GitTables download page on Zenodo.

  3. Coronavirus (Covid-19) Data in the United States

    • github.com
    • openicpsr.org
    • +3more
    csv
    Cite
    New York Times, Coronavirus (Covid-19) Data in the United States [Dataset]. https://github.com/nytimes/covid-19-data
    Explore at:
    csv (available download formats)
    Dataset provided by
    New York Times
    License

    https://github.com/nytimes/covid-19-data/blob/master/LICENSE

    Description

    The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

    Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

    We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

    The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
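
    For illustration, the repository's CSV files can be read directly from GitHub. The file name us-states.csv and the date/state/cases/deaths columns used below are assumptions based on common usage of this repository; they are not stated in the listing above.

    ```python
    # Sketch: load the cumulative state-level time series straight from GitHub.
    # File and column names are assumptions, not taken from the listing above.
    import pandas as pd

    url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"
    states = pd.read_csv(url, parse_dates=["date"])

    # Cumulative counts per state on the latest available date.
    latest = states[states["date"] == states["date"].max()]
    print(latest.sort_values("cases", ascending=False).head())
    ```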

  4. Dataset metadata of known Dataverse installations, August 2024

    • dataverse.harvard.edu
    Updated Jan 1, 2025
    + more versions
    Cite
    Julian Gautier (2025). Dataset metadata of known Dataverse installations, August 2024 [Dataset]. http://doi.org/10.7910/DVN/2SA6SN
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 1, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Julian Gautier
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the metadata of the datasets published in 101 Dataverse installations, information about the metadata blocks of 106 installations, and the lists of pre-defined licenses or dataset terms that depositors can apply to datasets in the 88 installations that were running versions of the Dataverse software that include the "multiple-license" feature. The data is useful for improving understandings about how certain Dataverse features and metadata fields are used and for learning about the quality of dataset and file-level metadata within and across Dataverse installations.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation between August 25 and August 30, 2024 using a "get_dataverse_installations_metadata" function in a collection of Python functions at https://github.com/jggautier/dataverse-scripts/blob/main/dataverse_repository_curation_assistant/dataverse_repository_curation_assistant_functions.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL for which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens in order to use certain API endpoints.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author_2024.08.25-2024.08.30.csv
    │   ├── contributor_2024.08.25-2024.08.30.csv
    │   ├── data_source_2024.08.25-2024.08.30.csv
    │   ├── ...
    │   └── topic_classification_2024.08.25-2024.08.30.csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2024.08.26_15.52.42.zip
    │   ├── dataset_pids_Abacus_2024.08.26_15.52.42.csv
    │   ├── Dataverse_JSON_metadata_2024.08.26_15.52.42
    │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
    │   ├── ...
    │   ├── metadatablocks_v5.9
    │   ├── astrophysics_v5.9.json
    │   ├── biomedical_v5.9.json
    │   ├── citation_v5.9.json
    │   ├── ...
    │   ├── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2024.08.26_00.02.51.zip
    │   ├── ...
    │   └── Yale_Dataverse_2024.08.25_03.52.57.zip
    └── dataverse_installations_summary_2024.08.30.csv
    └── dataset_pids_from_most_known_dataverse_installations_2024.08.csv
    └── license_options_for_each_dataverse_installation_2024.08.28_14.42.54.csv
    └── metadatablocks_from_most_known_dataverse_installations_2024.08.30.csv

    This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the "Citation" metadata block and "Geospatial" metadata block of datasets in the 101 Dataverse installations. For example, author_2024.08.25-2024.08.30.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in 101 installations, with a column for each of the four child fields: author name, affiliation, identifier type, and identifier. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 106 zip files, one zip file for each of the 106 Dataverse installations whose sites were functioning when I attempted to collect their metadata.
    Each zip file contains a directory with JSON files that have information about the installation's metadata fields, such as the field names and how they're organized. For installations that had published datasets, and whose dataset metadata I was able to download using Dataverse APIs, the zip file also contains:

    • A CSV file listing information about the datasets published in the installation, including a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset.
    • A directory of JSON files that contain the metadata of the installation's published, non-deaccessioned dataset versions in the Dataverse JSON metadata schema.

    The dataverse_installations_summary_2024.08.30.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata included and not included in this dataset. The dataset_pids_from_most_known_dataverse_installations_2024.08.csv file contains the dataset PIDs of published datasets in 101 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all "dataset_pids_....csv" files in each of the 101 zip files in the dataverse_json_metadata_from_each_known_dataverse_installation directory. The license_options_for_each_dataverse_installation_2024.08.28_14.42.54.csv file contains information about the licenses and...
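
    A minimal sketch of the two-column CSV that the collection script expects is shown below; only the "hostname" and "apikey" column names come from the description above, while the file name and example values are hypothetical.

    ```python
    # Sketch: write the CSV of installation URLs and API tokens expected by the
    # collection script. The file name "installation_api_keys.csv" is hypothetical;
    # the "hostname" and "apikey" column names come from the description above.
    import csv

    rows = [
        {"hostname": "https://dataverse.example.edu", "apikey": "xxxx-xxxx-xxxx"},
    ]

    with open("installation_api_keys.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["hostname", "apikey"])
        writer.writeheader()
        writer.writerows(rows)
    ```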

  5. Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Apr 22, 2022
    + more versions
    Cite
    Ivan Srba; Branislav Pecher; Matus Tomlein; Robert Moro; Elena Stefancova; Jakub Simko; Maria Bielikova (2022). Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" [Dataset]. http://doi.org/10.5281/zenodo.5996864
    Explore at:
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ivan Srba; Branislav Pecher; Matus Tomlein; Robert Moro; Elena Stefancova; Jakub Simko; Maria Bielikova
    Description

    Overview

    This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).

    The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.

    Its novelty and our main contributions lie in (1) the focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.

    The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).

    The accompanying Github repository provides a small static sample of the dataset and the dataset's descriptive analysis in a form of Jupyter notebooks.

    Options to access the dataset

    There are two ways to get access to the dataset:

    1. Static dump of the dataset available in the CSV format
    2. Continuously updated dataset available via REST API

    In order to obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.

    References

    If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:

    @inproceedings{SrbaMonantPlatform,
      author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
      booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
      pages = {1--7},
      title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
      year = {2019}
    }
    @inproceedings{SrbaMonantMedicalDataset,
      author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
      booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
      numpages = {11},
      title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
      year = {2022},
      doi = {10.1145/3477495.3531726},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3477495.3531726},
    }
    


    Dataset creation process

    In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, WordPress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.


    Ethical considerations

    The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.

    The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.

    As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.

    Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.


    Reporting mistakes in the dataset

    The way to report considerable mistakes in the raw collected data or in the manual annotations is by creating a new issue in the accompanying Github repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.


    Dataset structure

    Raw data

    First, the dataset contains so-called raw data (i.e., data extracted by the web monitoring module of the Monant platform and stored in exactly the same form as it appears on the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (numbers of likes, shares, and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.

    Raw data are contained in these CSV files (and corresponding REST API endpoints):

    • sources.csv
    • articles.csv
    • article_media.csv
    • article_authors.csv
    • discussion_posts.csv
    • discussion_post_authors.csv
    • fact_checking_articles.csv
    • fact_checking_article_media.csv
    • claims.csv
    • feedback_facebook.csv

    Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
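
    As an illustration only (the column layouts of these files are not specified here), the raw CSV dumps can be loaded into data frames keyed by file name:

    ```python
    # Sketch: load the raw-data CSV dumps listed above into a dict of data frames.
    # Assumes the static dump has been obtained and the files sit in raw_data/.
    import pandas as pd

    RAW_FILES = [
        "sources.csv", "articles.csv", "article_media.csv", "article_authors.csv",
        "discussion_posts.csv", "discussion_post_authors.csv",
        "fact_checking_articles.csv", "fact_checking_article_media.csv",
        "claims.csv", "feedback_facebook.csv",
    ]

    raw = {name: pd.read_csv(f"raw_data/{name}") for name in RAW_FILES}
    print({name: len(df) for name, df in raw.items()})
    ```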


    Annotations

    Second, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., an article or a source). Relation annotations describe a relation between two such entities.

    Each annotation is described by the following attributes:

    1. Category of annotation (`annotation_category`). Possible values: label (the annotation corresponds to ground truth determined by human experts) and prediction (the annotation was created by means of an AI method).
    2. Type of annotation (`annotation_type_id`). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
    3. Method which created the annotation (`method_id`). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
    4. Its value (`value`). The value is stored in JSON format and its structure differs according to the particular annotation type.


    At the same time, annotations are associated with a particular object identified by:

    1. entity type (parameter entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations). Possible values: sources, articles, fact-checking-articles.
    2. entity id (parameter entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation

  6. Data from: "A guide to using GitHub for developing and versioning data...

    • dataone.org
    • knb.ecoinformatics.org
    • +1more
    Updated Apr 6, 2023
    Cite
    Robert Crystal-Ornelas; Charuleka Varadharajan; Ben Bond-Lamberty; Kristin Boye; Shreyas Cholia; Michael Crow; Ranjeet Devarakonda; Kim S. Ely; Amy Goldman; Susan Heinz; Valerie Hendrix; Joan Damerow; Stephanie Pennington; Madison Burrus; Zarine Kakalia; Emily Robles; Maegen Simmonds; Alistair Rogers; Terri Velliquette; Helen Weierbach; Pamela Weisenhorn; Jessica N. Welch; Deborah A. Agarwal (2023). Data from: "A guide to using GitHub for developing and versioning data standards and reporting formats" [Dataset]. http://doi.org/10.15485/1780565
    Explore at:
    Dataset updated
    Apr 6, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Robert Crystal-Ornelas; Charuleka Varadharajan; Ben Bond-Lamberty; Kristin Boye; Shreyas Cholia; Michael Crow; Ranjeet Devarakonda; Kim S. Ely; Amy Goldman; Susan Heinz; Valerie Hendrix; Joan Damerow; Stephanie Pennington; Madison Burrus; Zarine Kakalia; Emily Robles; Maegen Simmonds; Alistair Rogers; Terri Velliquette; Helen Weierbach; Pamela Weisenhorn; Jessica N. Welch; Deborah A. Agarwal
    Time period covered
    Sep 1, 2020 - Dec 3, 2020
    Description

    These data are the results of a systematic review that investigated how data standards and reporting formats are documented on the version control platform GitHub. Our systematic review identified 32 data standards in earth science, environmental science, and ecology that use GitHub for version control of data standard documents. In our analysis, we characterized the documents and content within each of the 32 GitHub repositories to identify common practices for groups that version control their documents on GitHub. In this data package, there are 8 CSV files that contain data that we characterized from each repository, according to the location within the repository. For example, in 'readme_pages.csv' we characterize the content that appears across the 32 GitHub repositories included in our systematic review. Each of the 8 CSV files has an associated data dictionary file (names appended with '_dd.csv'), in which we describe each content category within the CSV files. There is one file-level metadata file (flmd.csv) that provides a description of each file within the data package.
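
    A small sketch of how the package's file-level metadata and a content file with its data dictionary could be read together is given below; it assumes plain comma-separated files with header rows, and the data dictionary file name is assumed to follow the stated '_dd.csv' convention.

    ```python
    # Sketch: pair a content CSV with its data dictionary ("_dd.csv") file.
    # Assumes comma-separated files with header rows; the dictionary file name
    # is inferred from the stated naming convention, not listed explicitly above.
    import pandas as pd

    flmd = pd.read_csv("flmd.csv")                  # file-level metadata
    readme_pages = pd.read_csv("readme_pages.csv")  # one of the 8 content files
    readme_dd = pd.read_csv("readme_pages_dd.csv")  # its assumed data dictionary

    print(flmd.head())
    print(readme_dd.head())
    ```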

  7. Magic, Memory, and Curiosity (MMC) fMRI Dataset

    • openneuro.org
    Updated May 1, 2023
    + more versions
    Cite
    Stefanie Meliss; Cristina Pascua-Martin; Jeremy Skipper; Kou Murayama (2023). Magic, Memory, and Curiosity (MMC) fMRI Dataset [Dataset]. http://doi.org/10.18112/openneuro.ds004182.v1.0.1
    Explore at:
    Dataset updated
    May 1, 2023
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Stefanie Meliss; Cristina Pascua-Martin; Jeremy Skipper; Kou Murayama
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Overview

    • The Magic, Memory, Curiosity (MMC) dataset contains data from 50 healthy human adults incidentally encoding 36 videos of magic tricks inside the MRI scanner across three runs.
    • Before and after incidental learning, a 10-min resting-state scan was acquired.
    • The MMC dataset includes contextual incentive manipulation, curiosity ratings for the magic tricks, as well as incidental memory performance tested a week later using a surprise cued recall and recognition test.
    • Working memory and constructs potentially relevant in the context of motivated learning (e.g., need for cognition, fear of failure) were additionally assessed.

    Stimuli

    The stimuli used here were short videos of magic tricks taken from a validated stimulus set (MagicCATs, Ozono et al., 2021) specifically created for the usage in fMRI studies. All final stimuli are available upon request. The request procedure is outlined in the Open Science Framework repository associated with the MagicCATs stimulus set (https://osf.io/ad6uc/).

    Participant responses

    Participants’ responses to demographic questions, questionnaires, and performance in the working memory assessment as well as both tasks are available in comma-separated value (CSV) files. Demographic (MMC_demographics.csv), raw questionnaire (MMC_raw_quest_data.csv) and other score data (MMC_scores.csv) as well as other information (MMC_other_information.csv) are structured as one line per participant with questions and/or scores as columns. Explicit wordings and naming of variables can be found in the supplementary information. Participant scan summaries (MMC_scan_subj_sum.csv) contain descriptives of brain coverage, TSNR, and framewise displacement (one row per participant) averaged first within acquisitions and then within participants. Participants’ responses and reaction times in the magic trick watching and memory task (MMC_experimental_data.csv) are stored as one row per trial per participant.
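
    As a quick sketch (assuming comma separators and a shared participant identifier column, whose name is not given in this description), the per-participant and per-trial tables can be loaded and later joined:

    ```python
    # Sketch: load the per-participant and per-trial behavioural tables.
    # "participant_id" below is a placeholder join key, not a documented column name.
    import pandas as pd

    demo = pd.read_csv("MMC_demographics.csv")         # one row per participant
    trials = pd.read_csv("MMC_experimental_data.csv")  # one row per trial per participant

    print(demo.shape, trials.shape)
    # merged = trials.merge(demo, on="participant_id")  # placeholder key
    ```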

    Preprocessing

    Data was preprocessed using the AFNI (version 21.2.03) software suite. As a first step, the EPI timeseries were distortion-corrected along the encoding axis (P>>A) using the phase difference map (‘epi_b0_correct.py’). The resulting distortion-corrected EPIs were then processed separately for each task, but scans from the same task were processed together. The same blocks were applied to both task and resting-state distortion-corrected EPI data using afni_proc.py (see below): despiking, slice-timing and head-motion correction, intrasubject alignment between anatomy and EPI, intersubject registration to MNI, masking, smoothing, scaling, and denoising. For more details, please refer to the data descriptor (LINK) or the Github repository (https://github.com/stefaniemeliss/MMC_dataset).

    afni_proc.py -subj_id "${subjstr}" \
      -blocks despike tshift align tlrc volreg mask blur scale regress \
      -radial_correlate_blocks tcat volreg \
      -copy_anat $derivindir/$anatSS \
      -anat_has_skull no \
      -anat_follower anat_w_skull anat $derivindir/$anatUAC \
      -anat_follower_ROI aaseg anat $sswindir/$fsparc \
      -anat_follower_ROI aeseg epi $sswindir/$fsparc \
      -anat_follower_ROI FSvent epi $sswindir/$fsvent \
      -anat_follower_ROI FSWMe epi $sswindir/$fswm \
      -anat_follower_ROI FSGMe epi $sswindir/$fsgm \
      -anat_follower_erode FSvent FSWMe \
      -dsets $epi_dpattern \
      -outlier_polort $POLORT \
      -tcat_remove_first_trs 0 \
      -tshift_opts_ts -tpattern altplus \
      -align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
      -align_epi_strip_method 3dSkullStrip \
      -tlrc_base MNI152_2009_template_SSW.nii.gz \
      -tlrc_NL_warp \
      -tlrc_NL_warped_dsets $sswindir/$anatQQ $sswindir/$matrix $sswindir/$warp \
      -volreg_base_ind 1 $min_out_first_run \
      -volreg_post_vr_allin yes \
      -volreg_pvra_base_index MIN_OUTLIER \
      -volreg_align_e2a \
      -volreg_tlrc_warp \
      -volreg_no_extent_mask \
      -mask_dilate 8 \
      -mask_epi_anat yes \
      -blur_to_fwhm -blur_size 8 \
      -regress_motion_per_run \
      -regress_ROI_PC FSvent 3 \
      -regress_ROI_PC_per_run FSvent \
      -regress_make_corr_vols aeseg FSvent \
      -regress_anaticor_fast \
      -regress_anaticor_label FSWMe \
      -regress_censor_motion 0.3 \
      -regress_censor_outliers 0.1 \
      -regress_apply_mot_types demean deriv \
      -regress_est_blur_epits \
      -regress_est_blur_errts \
      -regress_run_clustsim no \
      -regress_polort 2 \
      -regress_bandpass 0.01 1 \
      -html_review_style pythonic
    

    Derivatives

    The anat folder contains derivatives associated with the anatomical scan. The skull-stripped image created using @SSwarper is available in original and ICBM 2009c Nonlinear Asymmetric Template space as sub-[group][ID]_space-[space]_desc-skullstripped_T1w.nii.gz together with the corresponding affine matrix (sub-[group][ID]_aff12.1D) and incremental warp (sub-[group][ID]_warp.nii.gz). Output generated using @SUMA_Make_Spec_FS (defaced anatomical image, whole brain and tissue masks, as well as FreeSurfer discrete segmentations based on the Desikan-Killiany cortical atlas and the Destrieux cortical atlas) are also available as sub-[group][ID]_space-orig_desc-surfvol_T1w.nii.gz, sub-[group][ID]_space-orig_label-[label]_mask.nii.gz, and sub-[group][ID]_space-orig_desc-[atlas]_dseg.nii.gz, respectively.

    The func folder contains derivatives associated with the functional scans. To enhance re-usability, the fully preprocessed and denoised files are shared as sub-[group][ID]_task-[task]_desc-fullpreproc_bold.nii.gz. Additionally, partially preprocessed files (distortion corrected, despiked, slice-timing/head-motion corrected, aligned to anatomy and template space) are uploaded as sub-[group][ID]_task-[task]_run-[1-3]_desc-MNIaligned_bold.nii.gz together with slightly dilated brain mask in EPI resolution and template space where white matter and lateral ventricle were removed (sub-[group][ID]_task-[task]_space-MNI152NLin2009cAsym_label-dilatedGM_mask.nii.gz) as well as tissue masks in EPI resolution and template space (sub-[group][ID]_task-[task]_space-MNI152NLin2009cAsym_label-[tissue]_mask.nii.gz).

    The regressors folder contains nuisance regressors stemming from the output of the full afni_proc.py preprocessing pipeline. They are provided as space-delimited text values where each row represents one volume concatenated across all runs for each task separately. Those estimates that are provided per run contain the data for the volumes of one run and zeros for the volumes of other runs. This allows them to be regressed out separately for each run. The motion estimates show rotation (degree counterclockwise) in roll, pitch, and yaw and displacement (mm) in superior, left, and posterior direction. In addition to the motion parameters with respect to the base volume (sub-[group][ID]_task-[task]_label-mot_regressor.1D), motion derivatives (sub-[group][ID]_task-[task]_run[1-3]_label-motderiv_regressor.1D) and demeaned motion parameters (sub-[group][ID]_task-[task]_run[1-3]_label-motdemean_regressor.1D) are also available for each run separately. The sub-[group][ID]_task-[task]_run[1-3]_label-ventriclePC_regressor.1D files contain time course of the first three PCs of the lateral ventricle per run. Additionally, outlier fractions for each volume are provided (sub-[group][ID]_task-[task]_label-outlierfrac_regressor.1D) and sub-[group][ID]_task-[task]_label-censorTRs_regressor.1D shows which volumes were censored because motion or outlier fraction exceeded the limits specified. The voxelwise time course of local WM regressors created using fast ANATICOR is shared as sub-[group][ID]_task-[task]_label-localWM_regressor.nii.gz.

  8. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    zip
    Updated Jul 25, 2023
    Cite
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
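
    A sketch of how the sample indices can be used to replicate one evaluation sample follows; it assumes zero-based indices and index files without a header row, and the extracted data file name is a placeholder, since none of these details are stated above.

    ```python
    # Sketch: rebuild one APP validation sample from its index row and count labels.
    # Assumes zero-based indices and no header row in the index file; the extracted
    # data file name "uci_data.csv" is a placeholder.
    import pandas as pd

    data = pd.read_csv("uci_data.csv")  # produced by extract-oq.jl
    indices = pd.read_csv("app_val_indices.csv", header=None)

    sample = data.iloc[indices.iloc[0].dropna().astype(int)]
    print(sample["class_label"].value_counts(normalize=True))  # label distribution
    ```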

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  9. Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 16, 2024
    Cite
    Mastropaolo, Antonio (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8200098
    Explore at:
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Canfora, Gerardo
    Mastropaolo, Antonio
    Pepe, Federica
    Di Penta, Massimiliano
    Nardone, Vittoria
    BAVOTA, Gabriele
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

    Root directory

    • statistics.r: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
    • modelsInfo.zip: zip file containing all the downloaded model cards (in JSON format)
    • script: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

    Dataset

    • Dataset/Dataset_HF-models-list.csv: list of HF models analyzed
    • Dataset/Dataset_github-prj-list.txt: list of GitHub projects using the transformers library
    • Dataset/Dataset_github-Prj_model-Used.csv: contains usage pairs: project, model
    • Dataset/Dataset_prj-num-models-reused.csv: number of models used by each GitHub project
    • Dataset/Dataset_model-download_num-prj_correlation.csv contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads

    RQ1

    • RQ1/RQ1_dataset-list.txt: list of HF datasets
    • RQ1/RQ1_datasetSample.csv: sample set of models used for the manual analysis of datasets
    • RQ1/RQ1_analyzeDatasetTags.py: Python script to analyze model tags for the presence of datasets. It requires unzipping modelsInfo.zip into a directory with the same name (modelsInfo) at the root of the replication package folder. It produces its output to stdout; redirect it to a file to be analyzed by the RQ2/countDataset.py script
    • RQ1/RQ1_countDataset.py: given the output of RQ2/analyzeDatasetTags.py (passed as argument) produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
    • RQ1/RQ1_datasetTags.csv: output of RQ2/analyzeDatasetTags.py
    • RQ1/RQ1_dataset_usage_count.csv: output of RQ2/countDataset.py

    RQ2

    • RQ2/tableBias.pdf: table detailing the number of occurrences of different types of bias by model Task
    • RQ2/RQ2_bias_classification_sheet.csv: results of the manual labeling
    • RQ2/RQ2_isBiased.csv: file to compute the inter-rater agreement of whether or not a model documents Bias
    • RQ2/RQ2_biasAgrLabels.csv: file to compute the inter-rater agreement related to bias categories
    • RQ2/RQ2_final_bias_categories_with_levels.csv: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

    RQ3

    • RQ3/RQ3_LicenseValidation.csv: manual validation of a sample of licenses
    • RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt: lists of licenses with different permissiveness
    • RQ3/RQ3_prjs_license.csv: for each project linked to models, among other fields it indicates the license tag and name
    • RQ3/RQ3_models_license.csv: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license
    • RQ3/RQ3_model-prj-license_contingency_table.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows)
    • RQ3/RQ3_models_prjs_licenses_with_type.csv: pairs project-model, with their respective licenses and permissiveness level

    scripts

    Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README

  10. Annotated 12-lead ECG dataset

    • zenodo.org
    zip
    Updated Jun 7, 2021
    + more versions
    Cite
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro (2021). Annotated 12-lead ECG dataset [Dataset]. http://doi.org/10.5281/zenodo.3765642
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    # Annotated 12 lead ECG dataset
    
    Contains 827 ECG tracings from different patients, annotated by several cardiologists, residents, and medical students. It is used as the test set in the paper "Automatic diagnosis of the 12-lead ECG using a deep neural network": https://www.nature.com/articles/s41467-020-15432-4.
    
    It contains annotations for 6 different ECG abnormalities:
    - 1st degree AV block (1dAVb);
    - right bundle branch block (RBBB);
    - left bundle branch block (LBBB);
    - sinus bradycardia (SB);
    - atrial fibrillation (AF); and,
    - sinus tachycardia (ST).
    
    Companion python scripts are available in:
    https://github.com/antonior92/automatic-ecg-diagnosis
    
    --------
    
    Citation
    ```
    Ribeiro, A.H., Ribeiro, M.H., Paixão, G.M.M. et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat Commun 11, 1760 (2020). https://doi.org/10.1038/s41467-020-15432-4
    ```
    
    Bibtex:
    ```
    @article{ribeiro_automatic_2020,
     title = {Automatic Diagnosis of the 12-Lead {{ECG}} Using a Deep Neural Network},
     author = {Ribeiro, Ant{\^o}nio H. and Ribeiro, Manoel Horta and Paix{\~a}o, Gabriela M. M. and Oliveira, Derick M. and Gomes, Paulo R. and Canazart, J{\'e}ssica A. and Ferreira, Milton P. S. and Andersson, Carl R. and Macfarlane, Peter W. and Meira Jr., Wagner and Sch{\"o}n, Thomas B. and Ribeiro, Antonio Luiz P.},
     year = {2020},
     volume = {11},
     pages = {1760},
     doi = {https://doi.org/10.1038/s41467-020-15432-4},
     journal = {Nature Communications},
     number = {1}
    }
    ```
    -----
    
    
    ## Folder content:
    
    - `ecg_tracings.hdf5`: this file is not available in the GitHub repository because of its size, but it can be downloaded [here](https://doi.org/10.5281/zenodo.3625006). The HDF5 file contains a single dataset named `tracings`. This dataset is a `(827, 4096, 12)` tensor. The first dimension corresponds to the 827 different exams from different patients; the second dimension corresponds to the 4096 signal samples; the third dimension corresponds to the 12 different leads of the ECG exams, in the following order: `{DI, DII, DIII, AVL, AVF, AVR, V1, V2, V3, V4, V5, V6}`.
    
    The signals are sampled at 400 Hz. Some signals originally have a duration of 10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples). In order to make them all have the same size (4096 samples), we pad them with zeros on both sides. For instance, for a 7-second ECG signal with 2800 samples we include 648 samples at the beginning and 648 samples at the end, yielding 4096 samples that are then saved in the hdf5 dataset. All signals are represented as floating point numbers at the scale 1e-4V, so they should be multiplied by 1000 in order to obtain the signals in V.
    
    In Python, one can read this file using the following snippet:
    ```python
    import h5py
    import numpy as np

    # Load the (827, 4096, 12) tensor of ECG tracings.
    with h5py.File("ecg_tracings.hdf5", "r") as f:
        x = np.array(f['tracings'])
    ```
    
    - The file `attributes.csv` contains basic patient attributes: sex (M or F) and age. It contains 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` corresponds to the i-th line.
    - `annotations/`: folder containing annotations in CSV format. Each CSV file contains 827 lines (plus the header); in all CSV files, the i-th line corresponds to the i-th tracing in `ecg_tracings.hdf5`. The CSV files all have 6 columns, `1dAVb, RBBB, LBBB, SB, AF, ST`, corresponding to whether the annotator has detected the abnormality in the ECG (`=1`) or not (`=0`). A comparison sketch follows the list of annotation files below.
     1. `cardiologist[1,2].csv` contain annotations from two different cardiologists.
     2. `gold_standard.csv` contains the gold standard annotations for this test dataset. When cardiologist 1 and cardiologist 2 agreed, the common diagnosis was considered the gold standard. In cases where there was any disagreement, a third senior specialist, aware of the annotations from the other two, decided the diagnosis.
     3. `dnn.csv` contains the predictions from the deep neural network described in the paper. The threshold is set in such a way that it maximizes the F1 score.
     4. `cardiology_residents.csv` contains annotations from two 4th-year cardiology residents (each annotated half of the dataset).
     5. `emergency_residents.csv` contains annotations from two 3rd-year emergency residents (each annotated half of the dataset).
     6. `medical_students.csv` contains annotations from two 5th-year medical students (each annotated half of the dataset).
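
    For example, the gold standard and the network predictions can be compared per abnormality with a few lines. This is a sketch only; it assumes the annotation CSVs sit in `annotations/` and that scikit-learn is installed, which is not part of this dataset.

    ```python
    # Sketch: per-abnormality F1 of the network predictions vs. the gold standard.
    # Assumes the annotation CSVs are in ./annotations/ and scikit-learn is installed.
    import pandas as pd
    from sklearn.metrics import f1_score

    gold = pd.read_csv("annotations/gold_standard.csv")
    dnn = pd.read_csv("annotations/dnn.csv")

    for label in ["1dAVb", "RBBB", "LBBB", "SB", "AF", "ST"]:
        print(label, f1_score(gold[label], dnn[label]))
    ```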
    
  11. steinbock results of IMC example data

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Nov 27, 2023
    Cite
    Windhager, Jonas (2023). steinbock results of IMC example data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6043599
    Explore at:
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Eling, Nils
    Windhager, Jonas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    If you are working with these files, please cite them as follows: Windhager, J., Zanotelli, V.R.T., Schulz, D. et al. An end-to-end workflow for multiplexed image processing and analysis. Nat Protoc (2023). https://doi.org/10.1038/s41596-023-00881-0

    This repository hosts the results of processing example imaging mass cytometry (IMC) data hosted at zenodo.org/record/5949116 using the steinbock framework available at github.com/BodenmillerGroup/steinbock. Please refer to steinbock.sh for how these data were generated from the raw data. The following files are part of this repository (a short loading sketch follows the file list):

    • panel.csv: contains channel information regarding the used antibodies in steinbock format
    • img.zip: contains hot pixel filtered multi-channel images derived from the IMC raw data; one file per acquisition is generated
    • images.csv: contains metadata per acquisition
    • pixel_classifier.ilp: ilastik pixel classifier (same as the one in zenodo.org/record/6043544)
    • ilastik_crops.zip: image crops on which the ilastik classifier was trained (same as the ones in zenodo.org/record/6043544)
    • ilastik_img.zip: contains multi-channel images (one per acquisition) in .h5 format for ilastik pixel classification
    • ilastik_probabilities.zip: 3-channel images containing the pixel probabilities after pixel classification
    • masks_ilastik.zip: segmentation masks derived from the ilastik pixel probabilities using the cell_segmentation.cppipe pipeline
    • masks_deepcell.zip: segmentation masks derived by deepcell segmentation
    • intensities.zip: contains one .csv file per acquisition; each file contains single-cell measures of the mean pixel intensity per cell and channel based on the files in img.zip and masks_deepcell.zip
    • regionprops.zip: contains one .csv file per acquisition; each file contains single-cell measures of the morphological features and location of cells based on masks_deepcell.zip
    • neighbors.zip: contains one .csv file per acquisition; each file contains an edge list of cell IDs indicating cells in close proximity based on masks_deepcell.zip
    • ome.zip: contains .ome.tiff files derived from img.zip; one file per acquisition
    • histocat.zip: contains single-channel .tiff files with segmentation masks derived from masks_deepcell.zip for upload to histoCAT (bodenmillergroup.github.io/histoCAT)
    • cells.csv: contains intensity and regionprop measurements of all cells
    • cells_csv.zip: contains intensity and regionprop measurements of all cells per acquisition
    • cells.fcs: contains intensity and regionprop measurements of all cells in fcs format
    • cells_fcs.zip: contains intensity and regionprop measurements of all cells per acquisition in fcs format
    • cells.h5ad: contains intensity, regionprop and neighbor measurements of all cells in anndata format
    • cells_h5ad: contains intensity, regionprop and neighbor measurements of all cells per acquisition in anndata format
    • graphs.zip: contains spatial object graphs in .graphml format; one file per acquisition
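
    As a small sketch of downstream use (assuming the Python packages pandas and anndata are installed; they are not part of this repository, and the column layouts are not listed above), the aggregated single-cell tables can be loaded like this:

    ```python
    # Sketch: load the aggregated single-cell measurements shipped with this repository.
    # Assumes pandas and anndata are installed.
    import pandas as pd
    import anndata as ad

    cells = pd.read_csv("cells.csv")    # intensity and regionprop measurements
    adata = ad.read_h5ad("cells.h5ad")  # intensity, regionprop, and neighbor measurements

    print(cells.shape)
    print(adata)
    ```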

  12. Readme files in 16,000,000 public GitHub repositories (October 2016)

    • zenodo.org
    • explore.openaire.eu
    • +1more
    application/gzip, bin
    Updated Jan 24, 2020
    Cite
    Markovtsev Vadim (2020). Readme files in 16,000,000 public GitHub repositories (October 2016) [Dataset]. http://doi.org/10.5281/zenodo.285419
    Explore at:
    bin, application/gzip (available download formats)
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Markovtsev Vadim
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Format

    index.csv.gz - comma-separated CSV file with 3 columns:

    The flag is either "s" (readme found) or "r" (readme does not exist at the root directory level). The readme file name may be any of the following:

    "README.md", "readme.md", "Readme.md", "README.MD", "README.txt", "readme.txt", "Readme.txt", "README.TXT", "README", "readme", "Readme", "README.rst", "readme.rst", "Readme.rst", "README.RST"

    100 part-r-00xxx files are in "new" Hadoop API format with the following settings:

    1. inputFormatClass is org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

    2. keyClass is org.apache.hadoop.io.Text - repository name

    3. valueClass is org.apache.hadoop.io.BytesWritable - gzipped readme file
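
    A sketch of reading the index follows. The column order (repository name, flag, readme file name) is an assumption; the description above only states that the file has 3 columns and documents the flag values.

    ```python
    # Sketch: stream index.csv.gz and count repositories with and without a readme.
    # The column order is an assumption; only the flag semantics are documented above.
    import csv
    import gzip
    from collections import Counter

    counts = Counter()
    with gzip.open("index.csv.gz", "rt", newline="") as f:
        for row in csv.reader(f):
            flag = row[1]      # assumed order: repository, flag, readme file name
            counts[flag] += 1  # "s" = readme found, "r" = no readme at root level

    print(counts)
    ```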

  13. Developer Community and Code Datasets

    • datarade.ai
    Cite
    Oxylabs, Developer Community and Code Datasets [Dataset]. https://datarade.ai/data-products/developer-community-and-code-datasets-oxylabs
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt (available download formats)
    Dataset authored and provided by
    Oxylabs
    Area covered
    Guyana, Saint Pierre and Miquelon, Bahamas, El Salvador, Tuvalu, Djibouti, Marshall Islands, South Sudan, Philippines, United Kingdom
    Description

    Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.

    Data Sources:

    1. GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.

    2. StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.

    3. DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.

    Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.

    With our datasets, you'll receive:

    • Usernames;
    • Companies;
    • Locations;
    • Job Titles;
    • Follower Counts;
    • Contact Details;
    • Employability Statuses;
    • And More.

    Choose from various output formats, storage options, and delivery frequencies:

    • Get datasets in CSV, JSON, or other preferred formats.
    • Opt for data delivery via SFTP or directly to your cloud storage, such as AWS S3.
    • Receive datasets either once or as per your agreed-upon schedule.

    Why choose our Datasets?

    1. Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.

    2. Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.

    3. Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.

    4. Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.

    Pricing Options:

    Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!

  14. S1: One Tree Reef Foraminifera: a relic of the pre-colonial Great Barrier...

    • geolsoc.figshare.com
    zip
    Updated Sep 29, 2022
    Cite
    Yvette Bauder; Briony Mamo; Glenn A. Brock; Matthew A. Kosnik (2022). S1: One Tree Reef Foraminifera: a relic of the pre-colonial Great Barrier Reef [Dataset]. http://doi.org/10.6084/m9.figshare.21229562.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 29, 2022
    Dataset provided by
    Geological Society of London (http://www.geolsoc.org.uk/)
    Authors
    Yvette Bauder; Briony Mamo; Glenn A. Brock; Matthew A. Kosnik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Great Barrier Reef, One Tree Island Reef
    Description

    Foraminifera and sample data (https://github.com/makosnik/whoForams):

    • Mamo_Lagoon_Forams.csv contains Foraminifera abundance data for surface grab samples (Mamo 2016).
    • Mamo_Water_depth.csv contains site location and water depth data for the Mamo collection sites.
    • OTR_Core_Forams.csv contains the Foraminifera abundance data for the OTR core; its columns give the layer depth (in cm) and the size fraction (in um).
    • OTR_Core_pb210_CIC_Ages.csv contains the Pb-210 dating results used for this paper, originally published in Kosnik et al. (2015).
    • All_sed_results.csv contains the sediment grain size analyses used for Figure 2, originally published in Kosnik et al. (2015).
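
    A minimal loading sketch, assuming a local clone of https://github.com/makosnik/whoForams with the CSV files at the top level of that clone (the paths are assumptions):

    import pandas as pd

    # Abundance per layer depth (cm) and size fraction (um)
    core = pd.read_csv("whoForams/OTR_Core_Forams.csv")
    # Pb-210 dating results (Kosnik et al. 2015)
    ages = pd.read_csv("whoForams/OTR_Core_pb210_CIC_Ages.csv")

    print(core.shape)
    print(ages.head())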

  15. Evaluated Artifact for "Quantifying Software Reliability via Model-Counting"...

    • radar.kit.edu
    • radar-service.eu
    tar
    Updated Jun 23, 2023
    + more versions
    Cite
    Alexander Weigl; Samuel Teuber (2023). Evaluated Artifact for "Quantifying Software Reliability via Model-Counting" [Dataset]. http://doi.org/10.35097/1520
    Explore at:
    Available download formats: tar (677317632 bytes)
    Dataset updated
    Jun 23, 2023
    Dataset provided by
    Teuber, Samuel
    Karlsruhe Institute of Technology
    Authors
    Alexander Weigl; Samuel Teuber
    Description

    counterSharp Experiment and Play Environment

    This repository contains the reproducible experimental evaluation of the counterSharp tool. The repository contains a Docker file which configures the counterSharp tool, two model counters (ApproxMC and Ganak), and the tool by Dimovski et al. for our experiments. Furthermore, the repository contains the benchmarks on which we ran our experiments, the logs of our experiments, and scripts for transforming the log files into LaTeX tables.

    Getting Started

    In order to pull and run the Docker container from Docker Hub, you can simply execute docker run. Alternatively, you can load the archived and evaluated artifact into Docker with:

    docker load < countersharp-experiments.tar.gz

    Once the image is loaded, docker run opens a shell allowing the execution of further commands:

    docker run -it -v `pwd`/results:/experiments/results samweb/countersharp-experiments

    By using a volume, the results are written to the host system rather than to the Docker container; this creates a writable folder results in your current folder which will hold any logs from the experiments. You can remove the volume mounting option (-v ...) and create /experiments/results inside the container instead if you do not need to keep the results. If you are using the volume and run into permission problems, you need to grant rights via SELinux:

    chcon -Rt svirt_sandbox_file_t `pwd`/results

    A minimal example can be executed by running (this takes approximately 70 seconds):

    ./showcase.sh

    This creates benchmark log files for the benchmarks for_bounded_loop1.c and overflow.c in the folder results. For example, /experiments/results/for_bounded_loop1.c/0X/ contains five folders for the five repeated runs of the experiments on this file; each of these folders contains one folder per tool, which includes the log and output files. A full run can be executed by running (this takes a little under 2 days):

    ./run-all.sh

    Additionally, single benchmarks can be executed through the following commands:

    run-instance approx program.c "[counterSharp arguments]"  # Runs counterSharp with ApproxMC on program.c
    run-instance ganak program.c "[counterSharp arguments]"  # Runs counterSharp with Ganak on program.c
    Probab.native -single -domain polyhedra program.c  # Runs the tool by Dimovski et al. for deterministic programs
    Probab.native -single -domain polyhedra -nondet program.c  # Runs the tool by Dimovski et al. for nondeterministic programs

    For example, we can execute run-instance approx /experiments/benchmarks/confidence.c "--function testfun --unwind 1" to obtain the outcome of counterSharp and ApproxMC for the benchmark confidence.c. Note that the time information produced by runlim always covers only one part of the entire execution (i.e. counterSharp, one ApproxMC run, or one Ganak run). The script run-instance is straightforward: it calls our tool counterSharp,

    python3 -m counterSharp --amm /tmp/amm.dimacs --amh /tmp/amh.dimacs --asm /tmp/asm.dimacs --ash /tmp/ash.dimacs --con /tmp/con.dimacs -d $3 $2

    which is followed by the call to ApproxMC or Ganak.

    Benchmarks

    The benchmarks are contained in the folder benchmarks, which also includes an overview of the sources of the benchmarks and the modifications made to them. Note that the benchmark versions for the tool by Dimovski et al. are contained in the folder benchmarks-dimovski.

    Benchmark Results

    The results are contained in the folder results, in which all logs from benchmark runs reside. The log files from the evaluation are not available in the Docker image, but only on GitHub. The logs are split up by benchmark instance (first-level folder), run number (second-level folder), and tool (third-level folder).
    For example, the file results/bwd_loop1a.c/01/approxmc/stdout.log contains the stdout and stderr of running ApproxMC on the instance bwd_loop1a.c in run 01.

    Machine Details

    All runs were executed on a Linux machine with an Intel(R) Core(TM) i5-6500 CPU (3.20GHz) and 16GB of memory. Note that for every benchmark, the log 01/counterSharp/init.log contains information on the machine used for benchmark execution as well as on the commits used in the experiments.

    Running benchmarks

    For all cases of automated benchmark execution we assume a CSV file containing the relevant information on the instances to run: the first column is the benchmark's name, the second column holds the parameters passed to counterSharp (see instances.csv) or to the tool by Dimovski (see instances-dimovski.csv). All scripts produce benchmarking results only for "missing" instances, i.e. instances for which no folder can be found in the results folder.

    • Run counterSharp on the benchmarks: run-counterSharp instances.csv
    • Run ApproxMC on the benchmarks (only after counterSharp has been run): run-approxmc instances.csv
    • Run Ganak on the benchmarks (only after counterSharp has been run): run-ganak instances.csv
    • Run Dimovski's tool on the benchmarks: run-dimovski instances-dimovski.csv

    Log summarization

    Summarization is possible through the Python script logParsing/parse.py within the container. The script takes as input a list of benchmarks to process and returns (parts of) a LaTeX table. Note that logs must exist for all benchmarks provided in the CSV file for the call to succeed.

    • To obtain (sorted) results for the deterministic benchmarks: cat logParsing/deterministic-sorted.csv | python3 logParsing/parse.py results aggregate2
    • To obtain (sorted) results for the nondeterministic benchmarks: cat logParsing/nondeterministic-sorted.csv | python3 logParsing/parse.py results nondet

    Building the docker container

    All tools are packaged into a Dockerfile, which makes any further installation unnecessary; only a running Docker installation is required. The Dockerfile build depends on the accessibility of the following GitHub repositories:

    • CryptoMiniSat
    • ApproxMC
    • Ganak
    • Probab_Analyzer
    • counterSharp

    The Docker image is hosted on Docker Hub.

  16. Data from: TDMentions: A Dataset of Technical Debt Mentions in Online Posts

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Anna Wingkvist (2020). TDMentions: A Dataset of Technical Debt Mentions in Online Posts [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2593141
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Anna Wingkvist
    Morgan Ericsson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TDMentions: A Dataset of Technical Debt Mentions in Online Posts (version 1.0)

    TDMentions is a dataset that contains mentions of technical debt from Reddit, Hacker News, and Stack Exchange. It also contains a list of blog posts on Medium that were tagged as technical debt. The dataset currently contains approximately 35,000 items.

    Data collection and processing

    The dataset is mainly collected from existing datasets. We used data from:

    The data set currently contains data from the start of each source/service until 2018-12-31. For GitHub, we currently only include data from 2015-01-01.

    We use the regular expression tech(nical)?[\s\-_]*?debt to find mentions in all sources except for Medium. We decided to limit our matches to variations of technical debt and tech debt. Other shorter forms, such as TD, can result in too many false positives. For Medium, we used the tag technical-debt.
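
    A minimal sketch of this matching in Python (whether matching was case-insensitive is not stated in the description; the sketch assumes it):

    import re

    # The pattern used for all sources except Medium.
    TD_PATTERN = re.compile(r"tech(nical)?[\s\-_]*?debt", re.IGNORECASE)

    samples = [
        "We finally started paying down our technical debt.",
        "tech-debt keeps piling up in this module",
        "TD is deliberately not matched, to avoid false positives",
    ]

    for text in samples:
        print(bool(TD_PATTERN.search(text)), text)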

    Data Format

    The dataset is stored as a compressed (bzip2) JSON file with one JSON object per line. Each mention is represented as a JSON object with the following keys.

    • id: the id used in the original source. We use the URL path to identify Medium posts.
    • body: the text that contains the mention. This is either the comment or the title of the post. For Medium posts this is the title and subtitle (which might not mention technical debt, since posts are identified by the tag).
    • created_utc: the time the item was posted in seconds since epoch in UTC.
    • author: the author of the item. We use the username or userid from the source.
    • source: where the item was posted. Valid sources are:
      • HackerNews Comment
      • HackerNews Job
      • HackerNews Submission
      • Reddit Comment
      • Reddit Submission
      • StackExchange Answer
      • StackExchange Comment
      • StackExchange Question
      • Medium Post
    • meta: Additional information about the item specific to the source. This includes, e.g., the subreddit a Reddit submission or comment was posted to, the score, etc. We try to use the same names, e.g., score and num_comments for keys that have the same meaning/information across multiple sources.

    This is a sample item from Reddit:

    {
     "id": "ab8auf",
     "body": "Technical Debt Explained (x-post r/Eve)",
     "created_utc": 1546271789,
     "author": "totally_100_human",
     "source": "Reddit Submission",
     "meta": {
      "title": "Technical Debt Explained (x-post r/Eve)",
      "score": 1,
      "num_comments": 0,
      "url": "http://jestertrek.com/eve/technical-debt-2.png",
      "subreddit": "RCBRedditBot"
     }
    }
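
    The file can also be processed outside of jq; a minimal Python sketch that reproduces the per-source counts from the first jq example below:

    import bz2
    import json
    from collections import Counter

    # Stream the bzip2-compressed JSON-lines file and count items per source.
    counts = Counter()
    with bz2.open("postscomments.json.bz2", "rt", encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            counts[item["source"]] += 1

    for source, n in counts.most_common():
        print(f"{n:7d} {source}")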
    

    Sample Analyses

    We decided to use JSON to store the data, since it is easy to work with from multiple programming languages. In the following examples, we use jq to process the JSON.

    How many items are there for each source?

    lbzip2 -cd postscomments.json.bz2 | jq '.source' | sort | uniq -c
    

    How many submissions that mentioned technical debt were posted each month?

    lbzip2 -cd postscomments.json.bz2 | jq 'select(.source == "Reddit Submission") | .created_utc | gmtime | strftime("%Y-%m")' | sort | uniq -c
    

    What are the titles of items that link (meta.url) to PDF documents?

    lbzip2 -cd postscomments.json.bz2 | jq '. as $r | select(.meta.url?) | .meta.url | select(endswith(".pdf")) | $r.body'
    

    Please, I want CSV!

    lbzip2 -cd postscomments.json.bz2 | jq -r '[.id, .body, .author] | @csv'
    

    Note that you need to specify the keys you want to include for the CSV, so it is easier to either ignore the meta information or process each source.

    Please see https://github.com/sse-lnu/tdmentions for more analyses

    Limitations and Future updates

    The current version of the dataset lacks GitHub data and Medium comments. GitHub data will be added in the next update. Medium comments (responses) will be added in a future update if we find a good way to represent these.

  17. Covid-19 JHU (Johns Hopkins University)

    • data.europa.eu
    csv
    + more versions
    Cite
    Bruno Adelé, Covid-19 JHU (Johns Hopkins University) [Dataset]. https://data.europa.eu/data/datasets/5eb2f0fec170a3c7c331a101?locale=en
    Explore at:
    Available download formats: csv (2621440)
    Dataset authored and provided by
    Bruno Adelé
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Data from a Covid-19 data extraction from Johns Hopkins University (JHU)

    The data were processed with the project script world-datas-analysis in order to add additional columns, including the ratio of cases relative to the number of inhabitants; the result was then exported in CSV format.

    Initial source: https://github.com/CSSEGISandData/COVID-19. File exported from world-datas-analysis: CSV file.

    The project world-datas-analysis can export filtered data in gnuplot format according to your needs; see the example below.

    Example rendering with gnuplot (figure omitted)

  18. Measurement and Computation of Fire Phenomena (MaCFP) Condensed Phase...

    • data.nist.gov
    • datasets.ai
    • +1more
    Updated Apr 22, 2021
    Cite
    National Institute of Standards and Technology (2021). Measurement and Computation of Fire Phenomena (MaCFP) Condensed Phase Material Database [Dataset]. http://doi.org/10.18434/mds2-2586
    Explore at:
    Dataset updated
    Apr 22, 2021
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    License

    https://www.nist.gov/open/license

    Description

    The MaCFP Condensed Phase Subgroup has been designed to enable the fire research community to make significant progress towards establishing a common framework for the selection of experiments and the methodologies used to analyze these experiments when developing pyrolysis models. Experimental measurements prepared for the MaCFP Condensed Phase Working Group are submitted electronically by participating institutions and are organized and made publicly available in the MaCFP repository, which is hosted on GitHub [https://github.com/MaCFP/matl-db]. This database is version controlled, with each addition to (or edit of) measurement data saved with a unique identifier (i.e., commit tag). The repository was created and is managed by members of the MaCFP Organizing Committee.

    As of October 2021, the MaCFP Condensed Phase Material Database contains measurement data from more than 200 unique experiments (conducted under 35 different test conditions on the same poly(methyl methacrylate), PMMA). All measurement data submitted by each institution is organized in a single folder with the institution's name. A consistent file naming convention is used for all test data (i.e., across all folders). File names indicate the institution name, experimental apparatus, and basic test conditions (e.g., gaseous environment and incident heat flux or heating rate). Measurement data from repeated experiments is saved in separate, ASCII comma-delimited (.csv) files, each numbered sequentially. Written descriptions of sample preparation, test setup, and test procedure (which define the conditions associated with the experiments conducted) are included in each folder as a README.md file; this file is automatically interpreted by GitHub as Markdown (.md) text and provides a brief description of an institution's data.
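
    A minimal sketch for iterating over these submissions in a local clone of the repository (the clone location and the assumption that institution folders sit directly below the PMMA material folder, e.g. Non-charring/PMMA, are mine, not part of the description):

    from pathlib import Path

    import pandas as pd

    # Assumed layout: one folder per institution, each holding a README.md
    # and comma-delimited (.csv) measurement files.
    material_dir = Path("matl-db/Non-charring/PMMA")

    for institution_dir in sorted(p for p in material_dir.iterdir() if p.is_dir()):
        for csv_file in sorted(institution_dir.glob("*.csv")):
            try:
                df = pd.read_csv(csv_file)
                print(institution_dir.name, csv_file.name, df.shape)
            except pd.errors.ParserError:
                print("could not parse", csv_file)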

    How to cite this data

    You may cite the use of this data as follows: Batiot, B., Bruns, M., Hostikka, S., Leventon, I., Nakamura, Y., Reszka, P., Rogaume, T., Stoliarov, S., Measurement and Computation of Fire Phenomena (MaCFP) Condensed Phase Material Database, https://github.com/MaCFP/matl-db, Commit Tag: [give commit; e.g., 7f89fd8], https://doi.org/10.18434/mds2-2586 (Accessed: [give download date]). This data is publicly available according to the NIST statements of copyright, fair use, and licensing; see:

    https://www.nist.gov/director/copyright-fair-use-and-licensing-statements-srd-data-and-software

    Version History

    The MaCFP repository, which is hosted on GitHub [https://github.com/MaCFP/matl-db], is version controlled, with each addition (or edit) saved with a unique identifier (i.e., commit tag). When citing this database, you must include the commit tag that identifies the version of the repository you are working with.
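
    A minimal sketch for retrieving that commit tag from a local clone (the clone path is an assumption):

    import subprocess

    # Read the short commit hash of the checked-out version of the repository.
    commit_tag = subprocess.run(
        ["git", "-C", "matl-db", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    print(f"Commit Tag: {commit_tag}")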

    Experiments Conducted

    1. Milligram-Scale Tests
       1.1 Thermogravimetric Analysis (TGA)
       1.2 Differential Scanning Calorimetry (DSC)
       1.3 Microscale Combustion Calorimetry (MCC)
    2. Gram-Scale Tests
       2.1 Cone Calorimeter
       2.2 Anaerobic Gasification
       2.3 Thermal Conductivity and Diffusivity (Hot Disk and Laser Flash)

    How to interpret and use data in this repository for pyrolysis model calibration and validation

    Further information regarding the use and interpretation of the data in this repository is available online: https://github.com/MaCFP/matl-db/tree/master/Non-charring/PMMA. This information includes:

    • Key factors influencing material response during tests
    • Outlier criteria: identification of clearly incorrect behavior in measurement data

    Methodological Information

    A preliminary summary of the measurement data contained in this repository is available online: https://github.com/MaCFP/matl-db/releases

  19. Data from: Burmese-Microbiology-1K

    • kaggle.com
    • huggingface.co
    Updated Jul 24, 2024
    Cite
    Min Si Thu (2024). Burmese-Microbiology-1K [Dataset]. https://www.kaggle.com/datasets/minsithu/burmese-microbiology-1k/code
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jul 24, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Min Si Thu
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Burmese-Microbiology-1K

    Min Si Thu, min@globalmagicko.com

    Microbiology 1K QA pairs in Burmese Language

    Purpose

    Before this Burmese Clinical Microbiology 1K dataset, open-source resources for training Burmese large language models in medical fields were rare. A high-quality dataset therefore needed to be curated to cover medical knowledge for the development of LLMs in the Burmese language.

    Motivation

    I found an old notebook in my box. The notebook was from 2019 and contained notes on microbiology written when I was a third-year medical student. Because of the need for Burmese-language resources in medical fields, I added more facts and notes and curated a dataset on microbiology in the Burmese language.

    About

    The dataset for microbiology in the Burmese language contains 1262 rows of instruction and output pairs in CSV format. It mainly focuses on foundational clinical microbiology knowledge, covering basic facts on culture media; microbes such as bacteria, viruses, fungi, and parasites; and the diseases caused by these microbes.
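
    A minimal loading sketch (the file name is an assumption; the dataset is described as instruction/output pairs in CSV format):

    import pandas as pd

    df = pd.read_csv("burmese-microbiology-1k.csv")

    print(df.shape)   # the description states 1262 instruction/output rows
    print(df.iloc[0])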

    Examples

    • ငှက်ဖျားရောဂါဆိုတာ ဘာလဲ?,ငှက်ဖျားရောဂါသည် Plasmodium ကပ်ပါးကောင်ကြောင့် ဖြစ်ပွားသော အသက်အန္တရာယ်ရှိနိုင်သည့် သွေးရောဂါတစ်မျိုးဖြစ်သည်။ ၎င်းသည် ငှက်ဖျားခြင်ကိုက်ခြင်းမှတဆင့် ကူးစက်ပျံ့နှံ့သည်။ (English: "What is malaria?" / "Malaria is a potentially life-threatening blood disease caused by the Plasmodium parasite. It spreads through the bite of malaria-carrying mosquitoes.")

    • Influenza virus အကြောင်း အကျဉ်းချုပ် ဖော်ပြပါ။,Influenza virus သည် တုပ်ကွေးရောဂါ ဖြစ်စေသော RNA ဗိုင်းရပ်စ် ဖြစ်သည်။ Orthomyxoviridae မိသားစုဝင် ဖြစ်ပြီး type A၊ B၊ C နှင့် D ဟူ၍ အမျိုးအစား လေးမျိုး ရှိသည်။ (English: "Briefly describe the influenza virus." / "The influenza virus is an RNA virus that causes influenza. It belongs to the family Orthomyxoviridae and has four types: A, B, C, and D.")

    • Clostridium tetani ဆိုတာ ဘာလဲ,Clostridium tetani သည် မေးခိုင်ရောဂါ ဖြစ်စေသော gram-positive၊ anaerobic bacteria တစ်မျိုး ဖြစ်သည်။ မြေဆီလွှာတွင် တွေ့ရလေ့ရှိသည်။ (English: "What is Clostridium tetani?" / "Clostridium tetani is a gram-positive, anaerobic bacterium that causes tetanus. It is commonly found in soil.")

    • Onychomycosis ဆိုတာ ဘာလဲ?,Onychomycosis သည် လက်သည်း သို့မဟုတ် ခြေသည်းများတွင် ဖြစ်ပွားသော မှိုကူးစက်မှုဖြစ်သည်။ ၎င်းသည် လက်သည်း သို့မဟုတ် ခြေသည်းများကို ထူထဲစေပြီး အရောင်ပြောင်းလဲစေသည်။ (English: "What is onychomycosis?" / "Onychomycosis is a fungal infection of the fingernails or toenails. It thickens and discolors the nails.")

    Where to download the dataset

    Applications

    Burmese Microbiology 1K Dataset can be used in building various medical-related NLP applications.

    • The dataset can be used for pretraining or fine-tuning Burmese large language models.
    • The dataset is ready to use in building RAG-based Applications.

    Acknowledgments

    Special thanks to magickospace.org for supporting the curation process of Burmese Microbiology 1K Dataset.

    References for this dataset

    License - CC BY SA 4.0

    How to cite the dataset

    Si Thu, M. (2024). Burmese MicroBiology 1K Dataset (1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.12803638
    
    Si Thu, Min, Burmese-Microbiology-1K (July 24, 2024). Available at SSRN: https://ssrn.com/abstract=4904320
    
  20. Sample of Providers from QHP provider.json files

    • healthdata.demo.socrata.com
    csv, xlsx, xml
    Updated Apr 16, 2016
    Cite
    (2016). Sample of Providers from QHP provider.json files [Dataset]. https://healthdata.demo.socrata.com/CMS-Insurance-Plans/Sample-of-Providers-from-QHP-provider-json-files/axbq-xnwy
    Explore at:
    Available download formats: xlsx, xml, csv
    Dataset updated
    Apr 16, 2016
    Description
Cite
Sebastian Baltes; Sebastian Baltes; Jascha Knack; Jascha Knack (2020). Influence of Continuous Integration on the Development Activity in GitHub Projects [Dataset]. http://doi.org/10.5281/zenodo.1140261

Influence of Continuous Integration on the Development Activity in GitHub Projects

Explore at:
2 scholarly articles cite this dataset
csvAvailable download formats
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sebastian Baltes; Sebastian Baltes; Jascha Knack; Jascha Knack
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 - 2017-07-17.

We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:

  1. were active for one year before the first build with Travis CI (before_ci),
  2. used Travis CI at least for one year (during_ci),
  3. had commit or merge activity on the default branch in both of these phases, and
  4. used the default branch to trigger builds.

To derive the time frames, we employed the GHTorrent Big Query data set. The resulting sample contains 321 projects. Of these projects, 214 are Ruby projects and 107 are Java projects. The mean time span before_ci was 2.9 years (SD=1.9, Mdn=2.3), the mean time span during_ci was 3.2 years (SD=1.1, Mdn=3.3). For our analysis, we only consider the activity one year before and after the first build.

We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).

The dataset contains the following files:

tr_projects_sample_filtered.csv
A CSV file with information about the 321 selected projects.

tr_sample_commits_default_branch_before_ci.csv
tr_sample_commits_default_branch_during_ci.csv

One CSV file with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

project: GitHub project name ("/" replaced by "_").
branch: The branch to which the commit was made.
hash_value: The SHA1 hash value of the commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.

tr_sample_merges_default_branch_before_ci.csv
tr_sample_merges_default_branch_during_ci.csv

One CSV file with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

project: GitHub project name ("/" replaced by "_").
branch: The destination branch of the merge.
hash_value: The SHA1 hash value of the merge commit.
merged_commits: Unique hash value prefixes of the commits merged with this commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
source_branch: Source branch of the pull request (extracted from log message).
