85 datasets found
  1. Influence of Continuous Integration on the Development Activity in GitHub...

    • zenodo.org
    csv
    Updated Jan 24, 2020
    + more versions
    Cite
    Sebastian Baltes; Jascha Knack (2020). Influence of Continuous Integration on the Development Activity in GitHub Projects [Dataset]. http://doi.org/10.5281/zenodo.1140261
    Explore at:
    csv (available download formats)
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sebastian Baltes; Jascha Knack
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 - 2017-07-17.

    We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:

    1. were active for one year before the first build with Travis CI (before_ci),
    2. used Travis CI at least for one year (during_ci),
    3. had commit or merge activity on the default branch in both of these phases, and
    4. used the default branch to trigger builds.

    To derive the time frames, we employed the GHTorrent BigQuery data set. The resulting sample contains 321 projects: 214 Ruby projects and 107 Java projects. The mean time span before_ci was 2.9 years (SD=1.9, Mdn=2.3), and the mean time span during_ci was 3.2 years (SD=1.1, Mdn=3.3). For our analysis, we only consider the activity one year before and one year after the first build.

    We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).
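
    The commit and merge logs described above can be approximated with plain git commands. The following sketch is only an illustration of the idea (the authors' actual tooling is the git-log-parser linked above); the output format string and the local repository path are placeholders.

    ```python
    # Sketch: list non-merge commits on a branch that touch Java or Ruby files,
    # roughly mirroring the extraction described above. Not the authors' parser;
    # see https://github.com/sbaltes/git-log-parser for the real tooling.
    import subprocess

    def list_commits(repo_dir, branch, merges=False):
        fmt = "%H,%an,%ae,%aI,%cn,%ce,%cI"  # hash, author, and committer metadata
        cmd = [
            "git", "-C", repo_dir, "log", branch,
            "--merges" if merges else "--no-merges",
            f"--pretty=format:{fmt}",
            "--", "*.java", "*.rb",  # extension filter for Java/Ruby source files
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout.splitlines()

    # Example with a hypothetical local clone:
    # print(list_commits("/tmp/some-project", "master")[:5])
    ```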

    The dataset contains the following files:

    tr_projects_sample_filtered.csv
    A CSV file with information about the 321 selected projects.

    tr_sample_commits_default_branch_before_ci.csv
    tr_sample_commits_default_branch_during_ci.csv

    One CSV file each with information about all commits to the default branch before and after the first CI build, respectively. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. These CSV files have the following columns:

    project: GitHub project name ("/" replaced by "_").
    branch: The branch to which the commit was made.
    hash_value: The SHA1 hash value of the commit.
    author_name: The author name.
    author_email: The author email address.
    author_date: The authoring timestamp.
    commit_name: The committer name.
    commit_email: The committer email address.
    commit_date: The commit timestamp.
    log_message_length: The length of the git commit messages (in characters).
    file_count: Files changed with this commit.
    lines_added: Lines added to all files changed with this commit.
    lines_deleted: Lines deleted in all files changed with this commit.
    file_extensions: Distinct file extensions of files changed with this commit.

    tr_sample_merges_default_branch_before_ci.csv
    tr_sample_merges_default_branch_during_ci.csv

    One CSV file each with information about all merges into the default branch before and after the first CI build, respectively. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. These CSV files have the following columns (a short loading sketch follows the column list):

    project: GitHub project name ("/" replaced by "_").
    branch: The destination branch of the merge.
    hash_value: The SHA1 hash value of the merge commit.
    merged_commits: Unique hash value prefixes of the commits merged with this commit.
    author_name: The author name.
    author_email: The author email address.
    author_date: The authoring timestamp.
    commit_name: The committer name.
    commit_email: The committer email address.
    commit_date: The commit timestamp.
    log_message_length: The length of the git commit messages (in characters).
    file_count: Files changed with this commit.
    lines_added: Lines added to all files changed with this commit.
    lines_deleted: Lines deleted in all files changed with this commit.
    file_extensions: Distinct file extensions of files changed with this commit.
    pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
    source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
    source_branch: Source branch of the pull request (extracted from log message).
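
    As a minimal sketch of how these column layouts can be consumed (assuming comma-separated files with a header row, which the description does not state explicitly), the merge CSVs can be compared across the two phases with pandas:

    ```python
    # Minimal sketch, assuming comma-separated files with a header row.
    import pandas as pd

    merges_before = pd.read_csv("tr_sample_merges_default_branch_before_ci.csv")
    merges_during = pd.read_csv("tr_sample_merges_default_branch_during_ci.csv")

    # Number of merges and share of merges coming from GitHub pull requests, per phase.
    for phase, df in [("before_ci", merges_before), ("during_ci", merges_during)]:
        pr_share = df["pull_request_id"].notna().mean()
        print(phase, len(df), round(pr_share, 3))
    ```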

  2. GitTables 1M - CSV files

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Jun 6, 2022
    Cite
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth (2022). GitTables 1M - CSV files [Dataset]. http://doi.org/10.5281/zenodo.6515973
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 6, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains >800K CSV files behind the GitTables 1M corpus.

    For more information about the GitTables corpus, visit:

    - our website for GitTables, or

    - the main GitTables download page on Zenodo.

  3. Coronavirus (Covid-19) Data in the United States

    • github.com
    • openicpsr.org
    • +3more
    csv
    Cite
    New York Times, Coronavirus (Covid-19) Data in the United States [Dataset]. https://github.com/nytimes/covid-19-data
    Explore at:
    csv (available download formats)
    Dataset provided by
    New York Times
    License

    https://github.com/nytimes/covid-19-data/blob/master/LICENSE

    Description

    The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

    Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

    We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

    The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
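
    For illustration, the repository's CSV files can be read directly from GitHub. The file name us-states.csv and the date/state/cases/deaths columns used below are assumptions based on common usage of this repository; they are not stated in the listing above.

    ```python
    # Sketch: load the cumulative state-level time series straight from GitHub.
    # File and column names are assumptions, not taken from the listing above.
    import pandas as pd

    url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"
    states = pd.read_csv(url, parse_dates=["date"])

    # Cumulative counts per state on the latest available date.
    latest = states[states["date"] == states["date"].max()]
    print(latest.sort_values("cases", ascending=False).head())
    ```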

  4. Dataset metadata of known Dataverse installations, August 2024

    • dataverse.harvard.edu
    Updated Jan 1, 2025
    + more versions
    Cite
    Julian Gautier (2025). Dataset metadata of known Dataverse installations, August 2024 [Dataset]. http://doi.org/10.7910/DVN/2SA6SN
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 1, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Julian Gautier
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the metadata of the datasets published in 101 Dataverse installations, information about the metadata blocks of 106 installations, and the lists of pre-defined licenses or dataset terms that depositors can apply to datasets in the 88 installations that were running versions of the Dataverse software that include the "multiple-license" feature. The data is useful for improving understandings about how certain Dataverse features and metadata fields are used and for learning about the quality of dataset and file-level metadata within and across Dataverse installations.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation between August 25 and August 30, 2024 using a "get_dataverse_installations_metadata" function in a collection of Python functions at https://github.com/jggautier/dataverse-scripts/blob/main/dataverse_repository_curation_assistant/dataverse_repository_curation_assistant_functions.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL for which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens in order to use certain API endpoints.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author_2024.08.25-2024.08.30.csv
    │   ├── contributor_2024.08.25-2024.08.30.csv
    │   ├── data_source_2024.08.25-2024.08.30.csv
    │   ├── ...
    │   └── topic_classification_2024.08.25-2024.08.30.csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2024.08.26_15.52.42.zip
    │   ├── dataset_pids_Abacus_2024.08.26_15.52.42.csv
    │   ├── Dataverse_JSON_metadata_2024.08.26_15.52.42
    │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
    │   ├── ...
    │   ├── metadatablocks_v5.9
    │   ├── astrophysics_v5.9.json
    │   ├── biomedical_v5.9.json
    │   ├── citation_v5.9.json
    │   ├── ...
    │   ├── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2024.08.26_00.02.51.zip
    │   ├── ...
    │   └── Yale_Dataverse_2024.08.25_03.52.57.zip
    └── dataverse_installations_summary_2024.08.30.csv
    └── dataset_pids_from_most_known_dataverse_installations_2024.08.csv
    └── license_options_for_each_dataverse_installation_2024.08.28_14.42.54.csv
    └── metadatablocks_from_most_known_dataverse_installations_2024.08.30.csv

    This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the "Citation" metadata block and "Geospatial" metadata block of datasets in the 101 Dataverse installations. For example, author_2024.08.25-2024.08.30.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in 101 installations, with a column for each of the four child fields: author name, affiliation, identifier type, and identifier. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 106 zip files, one zip file for each of the 106 Dataverse installations whose sites were functioning when I attempted to collect their metadata.
    Each zip file contains a directory with JSON files that have information about the installation's metadata fields, such as the field names and how they're organized. For installations that had published datasets, and whose dataset metadata I was able to download using Dataverse APIs, the zip file also contains:

    • A CSV file listing information about the datasets published in the installation, including a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset.
    • A directory of JSON files that contain the metadata of the installation's published, non-deaccessioned dataset versions in the Dataverse JSON metadata schema.

    The dataverse_installations_summary_2024.08.30.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata included and not included in this dataset. The dataset_pids_from_most_known_dataverse_installations_2024.08.csv file contains the dataset PIDs of published datasets in 101 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all "dataset_pids_....csv" files in each of the 101 zip files in the dataverse_json_metadata_from_each_known_dataverse_installation directory. The license_options_for_each_dataverse_installation_2024.08.28_14.42.54.csv file contains information about the licenses and...
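
    A minimal sketch of the two-column CSV that the collection script expects is shown below; only the "hostname" and "apikey" column names come from the description above, while the file name and example values are hypothetical.

    ```python
    # Sketch: write the CSV of installation URLs and API tokens expected by the
    # collection script. The file name "installation_api_keys.csv" is hypothetical;
    # the "hostname" and "apikey" column names come from the description above.
    import csv

    rows = [
        {"hostname": "https://dataverse.example.edu", "apikey": "xxxx-xxxx-xxxx"},
    ]

    with open("installation_api_keys.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["hostname", "apikey"])
        writer.writeheader()
        writer.writerows(rows)
    ```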

  5. Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Apr 22, 2022
    + more versions
    Cite
    Ivan Srba; Branislav Pecher; Matus Tomlein; Robert Moro; Elena Stefancova; Jakub Simko; Maria Bielikova (2022). Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" [Dataset]. http://doi.org/10.5281/zenodo.5996864
    Explore at:
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ivan Srba; Branislav Pecher; Matus Tomlein; Robert Moro; Elena Stefancova; Jakub Simko; Maria Bielikova
    Description

    Overview

    This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).

    The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.

    Its novelty and our main contributions lie in (1) the focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.

    The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).

    The accompanying Github repository provides a small static sample of the dataset and the dataset's descriptive analysis in a form of Jupyter notebooks.

    Options to access the dataset

    There are two ways to get access to the dataset:

    1. Static dump of the dataset available in the CSV format
    2. Continuously updated dataset available via REST API

    In order to obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.

    References

    If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:

    @inproceedings{SrbaMonantPlatform,
      author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
      booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
      pages = {1--7},
      title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
      year = {2019}
    }
    @inproceedings{SrbaMonantMedicalDataset,
      author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
      booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
      numpages = {11},
      title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
      year = {2022},
      doi = {10.1145/3477495.3531726},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3477495.3531726},
    }
    


    Dataset creation process

    In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, WordPress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.


    Ethical considerations

    The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.

    The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.

    As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.

    Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.


    Reporting mistakes in the dataset

    The way to report considerable mistakes in the raw collected data or in the manual annotations is by creating a new issue in the accompanying Github repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.


    Dataset structure

    Raw data

    First, the dataset contains so-called raw data (i.e., data extracted by the web monitoring module of the Monant platform and stored in exactly the same form as it appears on the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (numbers of likes, shares, and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.

    Raw data are contained in these CSV files (and corresponding REST API endpoints):

    • sources.csv
    • articles.csv
    • article_media.csv
    • article_authors.csv
    • discussion_posts.csv
    • discussion_post_authors.csv
    • fact_checking_articles.csv
    • fact_checking_article_media.csv
    • claims.csv
    • feedback_facebook.csv

    Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
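
    As an illustration only (the column layouts of these files are not specified here), the raw CSV dumps can be loaded into data frames keyed by file name:

    ```python
    # Sketch: load the raw-data CSV dumps listed above into a dict of data frames.
    # Assumes the static dump has been obtained and the files sit in raw_data/.
    import pandas as pd

    RAW_FILES = [
        "sources.csv", "articles.csv", "article_media.csv", "article_authors.csv",
        "discussion_posts.csv", "discussion_post_authors.csv",
        "fact_checking_articles.csv", "fact_checking_article_media.csv",
        "claims.csv", "feedback_facebook.csv",
    ]

    raw = {name: pd.read_csv(f"raw_data/{name}") for name in RAW_FILES}
    print({name: len(df) for name, df in raw.items()})
    ```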


    Annotations

    Second, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., an article or a source). Relation annotations describe a relation between two such entities.

    Each annotation is described by the following attributes:

    1. Category of annotation (`annotation_category`). Possible values: label (the annotation corresponds to ground truth determined by human experts) and prediction (the annotation was created by means of an AI method).
    2. Type of annotation (`annotation_type_id`). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
    3. Method which created the annotation (`method_id`). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
    4. Its value (`value`). The value is stored in JSON format and its structure differs according to the particular annotation type.


    At the same time, annotations are associated with a particular object identified by:

    1. entity type (parameter entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations). Possible values: sources, articles, fact-checking-articles.
    2. entity id (parameter entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation

  6. Data from: "A guide to using GitHub for developing and versioning data...

    • dataone.org
    • knb.ecoinformatics.org
    • +1more
    Updated Apr 6, 2023
    Cite
    Robert Crystal-Ornelas; Charuleka Varadharajan; Ben Bond-Lamberty; Kristin Boye; Shreyas Cholia; Michael Crow; Ranjeet Devarakonda; Kim S. Ely; Amy Goldman; Susan Heinz; Valerie Hendrix; Joan Damerow; Stephanie Pennington; Madison Burrus; Zarine Kakalia; Emily Robles; Maegen Simmonds; Alistair Rogers; Terri Velliquette; Helen Weierbach; Pamela Weisenhorn; Jessica N. Welch; Deborah A. Agarwal (2023). Data from: "A guide to using GitHub for developing and versioning data standards and reporting formats" [Dataset]. http://doi.org/10.15485/1780565
    Explore at:
    Dataset updated
    Apr 6, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Robert Crystal-Ornelas; Charuleka Varadharajan; Ben Bond-Lamberty; Kristin Boye; Shreyas Cholia; Michael Crow; Ranjeet Devarakonda; Kim S. Ely; Amy Goldman; Susan Heinz; Valerie Hendrix; Joan Damerow; Stephanie Pennington; Madison Burrus; Zarine Kakalia; Emily Robles; Maegen Simmonds; Alistair Rogers; Terri Velliquette; Helen Weierbach; Pamela Weisenhorn; Jessica N. Welch; Deborah A. Agarwal
    Time period covered
    Sep 1, 2020 - Dec 3, 2020
    Description

    These data are the results of a systematic review that investigated how data standards and reporting formats are documented on the version control platform GitHub. Our systematic review identified 32 data standards in earth science, environmental science, and ecology that use GitHub for version control of data standard documents. In our analysis, we characterized the documents and content within each of the 32 GitHub repositories to identify common practices for groups that version control their documents on GitHub. In this data package, there are 8 CSV files that contain data that we characterized from each repository, according to the location within the repository. For example, in 'readme_pages.csv' we characterize the content that appears across the 32 GitHub repositories included in our systematic review. Each of the 8 CSV files has an associated data dictionary file (names appended with '_dd.csv'), in which we describe each content category within the CSV files. There is one file-level metadata file (flmd.csv) that provides a description of each file within the data package.
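
    A small sketch of how the package's file-level metadata and a content file with its data dictionary could be read together is given below; it assumes plain comma-separated files with header rows, and the data dictionary file name is assumed to follow the stated '_dd.csv' convention.

    ```python
    # Sketch: pair a content CSV with its data dictionary ("_dd.csv") file.
    # Assumes comma-separated files with header rows; the dictionary file name
    # is inferred from the stated naming convention, not listed explicitly above.
    import pandas as pd

    flmd = pd.read_csv("flmd.csv")                  # file-level metadata
    readme_pages = pd.read_csv("readme_pages.csv")  # one of the 8 content files
    readme_dd = pd.read_csv("readme_pages_dd.csv")  # its assumed data dictionary

    print(flmd.head())
    print(readme_dd.head())
    ```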

  7. Magic, Memory, and Curiosity (MMC) fMRI Dataset

    • openneuro.org
    Updated May 1, 2023
    + more versions
    Cite
    Stefanie Meliss; Cristina Pascua-Martin; Jeremy Skipper; Kou Murayama (2023). Magic, Memory, and Curiosity (MMC) fMRI Dataset [Dataset]. http://doi.org/10.18112/openneuro.ds004182.v1.0.1
    Explore at:
    Dataset updated
    May 1, 2023
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Stefanie Meliss; Cristina Pascua-Martin; Jeremy Skipper; Kou Murayama
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Overview

    • The Magic, Memory, Curiosity (MMC) dataset contains data from 50 healthy human adults incidentally encoding 36 videos of magic tricks inside the MRI scanner across three runs.
    • Before and after incidental learning, a 10-min resting-state scan was acquired.
    • The MMC dataset includes contextual incentive manipulation, curiosity ratings for the magic tricks, as well as incidental memory performance tested a week later using a surprise cued recall and recognition test.
    • Working memory and constructs potentially relevant in the context of motivated learning (e.g., need for cognition, fear of failure) were additionally assessed.

    Stimuli

    The stimuli used here were short videos of magic tricks taken from a validated stimulus set (MagicCATs, Ozono et al., 2021) specifically created for the usage in fMRI studies. All final stimuli are available upon request. The request procedure is outlined in the Open Science Framework repository associated with the MagicCATs stimulus set (https://osf.io/ad6uc/).

    Participant responses

    Participants’ responses to demographic questions, questionnaires, and performance in the working memory assessment as well as both tasks are available in comma-separated value (CSV) files. Demographic (MMC_demographics.csv), raw questionnaire (MMC_raw_quest_data.csv) and other score data (MMC_scores.csv) as well as other information (MMC_other_information.csv) are structured as one line per participant with questions and/or scores as columns. Explicit wordings and naming of variables can be found in the supplementary information. Participant scan summaries (MMC_scan_subj_sum.csv) contain descriptives of brain coverage, TSNR, and framewise displacement (one row per participant) averaged first within acquisitions and then within participants. Participants’ responses and reaction times in the magic trick watching and memory task (MMC_experimental_data.csv) are stored as one row per trial per participant.
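
    As a quick sketch (assuming comma separators and a shared participant identifier column, whose name is not given in this description), the per-participant and per-trial tables can be loaded and later joined:

    ```python
    # Sketch: load the per-participant and per-trial behavioural tables.
    # "participant_id" below is a placeholder join key, not a documented column name.
    import pandas as pd

    demo = pd.read_csv("MMC_demographics.csv")         # one row per participant
    trials = pd.read_csv("MMC_experimental_data.csv")  # one row per trial per participant

    print(demo.shape, trials.shape)
    # merged = trials.merge(demo, on="participant_id")  # placeholder key
    ```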

    Preprocessing

    Data was preprocessed using the AFNI (version 21.2.03) software suite. As a first step, the EPI timeseries were distortion-corrected along the encoding axis (P>>A) using the phase difference map (‘epi_b0_correct.py’). The resulting distortion-corrected EPIs were then processed separately for each task, but scans from the same task were processed together. The same blocks were applied to both task and resting-state distortion-corrected EPI data using afni_proc.py (see below): despiking, slice-timing and head-motion correction, intrasubject alignment between anatomy and EPI, intersubject registration to MNI, masking, smoothing, scaling, and denoising. For more details, please refer to the data descriptor (LINK) or the Github repository (https://github.com/stefaniemeliss/MMC_dataset).

    afni_proc.py -subj_id "${subjstr}" \
      -blocks despike tshift align tlrc volreg mask blur scale regress \
      -radial_correlate_blocks tcat volreg \
      -copy_anat $derivindir/$anatSS \
      -anat_has_skull no \
      -anat_follower anat_w_skull anat $derivindir/$anatUAC \
      -anat_follower_ROI aaseg anat $sswindir/$fsparc \
      -anat_follower_ROI aeseg epi $sswindir/$fsparc \
      -anat_follower_ROI FSvent epi $sswindir/$fsvent \
      -anat_follower_ROI FSWMe epi $sswindir/$fswm \
      -anat_follower_ROI FSGMe epi $sswindir/$fsgm \
      -anat_follower_erode FSvent FSWMe \
      -dsets $epi_dpattern \
      -outlier_polort $POLORT \
      -tcat_remove_first_trs 0 \
      -tshift_opts_ts -tpattern altplus \
      -align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
      -align_epi_strip_method 3dSkullStrip \
      -tlrc_base MNI152_2009_template_SSW.nii.gz \
      -tlrc_NL_warp \
      -tlrc_NL_warped_dsets $sswindir/$anatQQ $sswindir/$matrix $sswindir/$warp \
      -volreg_base_ind 1 $min_out_first_run \
      -volreg_post_vr_allin yes \
      -volreg_pvra_base_index MIN_OUTLIER \
      -volreg_align_e2a \
      -volreg_tlrc_warp \
      -volreg_no_extent_mask \
      -mask_dilate 8 \
      -mask_epi_anat yes \
      -blur_to_fwhm -blur_size 8 \
      -regress_motion_per_run \
      -regress_ROI_PC FSvent 3 \
      -regress_ROI_PC_per_run FSvent \
      -regress_make_corr_vols aeseg FSvent \
      -regress_anaticor_fast \
      -regress_anaticor_label FSWMe \
      -regress_censor_motion 0.3 \
      -regress_censor_outliers 0.1 \
      -regress_apply_mot_types demean deriv \
      -regress_est_blur_epits \
      -regress_est_blur_errts \
      -regress_run_clustsim no \
      -regress_polort 2 \
      -regress_bandpass 0.01 1 \
      -html_review_style pythonic
    

    Derivatives

    The anat folder contains derivatives associated with the anatomical scan. The skull-stripped image created using @SSwarper is available in original and ICBM 2009c Nonlinear Asymmetric Template space as sub-[group][ID]_space-[space]_desc-skullstripped_T1w.nii.gz together with the corresponding affine matrix (sub-[group][ID]_aff12.1D) and incremental warp (sub-[group][ID]_warp.nii.gz). Output generated using @SUMA_Make_Spec_FS (defaced anatomical image, whole brain and tissue masks, as well as FreeSurfer discrete segmentations based on the Desikan-Killiany cortical atlas and the Destrieux cortical atlas) are also available as sub-[group][ID]_space-orig_desc-surfvol_T1w.nii.gz, sub-[group][ID]_space-orig_label-[label]_mask.nii.gz, and sub-[group][ID]_space-orig_desc-[atlas]_dseg.nii.gz, respectively.

    The func folder contains derivatives associated with the functional scans. To enhance re-usability, the fully preprocessed and denoised files are shared as sub-[group][ID]_task-[task]_desc-fullpreproc_bold.nii.gz. Additionally, partially preprocessed files (distortion corrected, despiked, slice-timing/head-motion corrected, aligned to anatomy and template space) are uploaded as sub-[group][ID]_task-[task]_run-[1-3]_desc-MNIaligned_bold.nii.gz together with slightly dilated brain mask in EPI resolution and template space where white matter and lateral ventricle were removed (sub-[group][ID]_task-[task]_space-MNI152NLin2009cAsym_label-dilatedGM_mask.nii.gz) as well as tissue masks in EPI resolution and template space (sub-[group][ID]_task-[task]_space-MNI152NLin2009cAsym_label-[tissue]_mask.nii.gz).

    The regressors folder contains nuisance regressors stemming from the output of the full afni_proc.py preprocessing pipeline. They are provided as space-delimited text values where each row represents one volume concatenated across all runs for each task separately. Those estimates that are provided per run contain the data for the volumes of one run and zeros for the volumes of other runs. This allows them to be regressed out separately for each run. The motion estimates show rotation (degree counterclockwise) in roll, pitch, and yaw and displacement (mm) in superior, left, and posterior direction. In addition to the motion parameters with respect to the base volume (sub-[group][ID]_task-[task]_label-mot_regressor.1D), motion derivatives (sub-[group][ID]_task-[task]_run[1-3]_label-motderiv_regressor.1D) and demeaned motion parameters (sub-[group][ID]_task-[task]_run[1-3]_label-motdemean_regressor.1D) are also available for each run separately. The sub-[group][ID]_task-[task]_run[1-3]_label-ventriclePC_regressor.1D files contain time course of the first three PCs of the lateral ventricle per run. Additionally, outlier fractions for each volume are provided (sub-[group][ID]_task-[task]_label-outlierfrac_regressor.1D) and sub-[group][ID]_task-[task]_label-censorTRs_regressor.1D shows which volumes were censored because motion or outlier fraction exceeded the limits specified. The voxelwise time course of local WM regressors created using fast ANATICOR is shared as sub-[group][ID]_task-[task]_label-localWM_regressor.nii.gz.

  8. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    zip
    Updated Jul 25, 2023
    Cite
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
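
    A sketch of how the sample indices can be used to replicate one evaluation sample follows; it assumes zero-based indices and index files without a header row, and the extracted data file name is a placeholder, since none of these details are stated above.

    ```python
    # Sketch: rebuild one APP validation sample from its index row and count labels.
    # Assumes zero-based indices and no header row in the index file; the extracted
    # data file name "uci_data.csv" is a placeholder.
    import pandas as pd

    data = pd.read_csv("uci_data.csv")  # produced by extract-oq.jl
    indices = pd.read_csv("app_val_indices.csv", header=None)

    sample = data.iloc[indices.iloc[0].dropna().astype(int)]
    print(sample["class_label"].value_counts(normalize=True))  # label distribution
    ```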

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  9. Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 16, 2024
    Cite
    Mastropaolo, Antonio (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8200098
    Explore at:
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Canfora, Gerardo
    Mastropaolo, Antonio
    Pepe, Federica
    Di Penta, Massimiliano
    Nardone, Vittoria
    BAVOTA, Gabriele
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

    Root directory

    • statistics.r: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
    • modelsInfo.zip: zip file containing all the downloaded model cards (in JSON format)
    • script: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

    Dataset

    • Dataset/Dataset_HF-models-list.csv: list of HF models analyzed
    • Dataset/Dataset_github-prj-list.txt: list of GitHub projects using the transformers library
    • Dataset/Dataset_github-Prj_model-Used.csv: contains usage pairs: project, model
    • Dataset/Dataset_prj-num-models-reused.csv: number of models used by each GitHub project
    • Dataset/Dataset_model-download_num-prj_correlation.csv contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads

    RQ1

    • RQ1/RQ1_dataset-list.txt: list of HF datasets
    • RQ1/RQ1_datasetSample.csv: sample set of models used for the manual analysis of datasets
    • RQ1/RQ1_analyzeDatasetTags.py: Python script to analyze model tags for the presence of datasets. It requires unzipping modelsInfo.zip into a directory with the same name (modelsInfo) at the root of the replication package folder. It produces its output to stdout; redirect it to a file to be analyzed by the RQ2/countDataset.py script
    • RQ1/RQ1_countDataset.py: given the output of RQ2/analyzeDatasetTags.py (passed as argument) produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
    • RQ1/RQ1_datasetTags.csv: output of RQ2/analyzeDatasetTags.py
    • RQ1/RQ1_dataset_usage_count.csv: output of RQ2/countDataset.py

    RQ2

    • RQ2/tableBias.pdf: table detailing the number of occurrences of different types of bias by model Task
    • RQ2/RQ2_bias_classification_sheet.csv: results of the manual labeling
    • RQ2/RQ2_isBiased.csv: file to compute the inter-rater agreement of whether or not a model documents Bias
    • RQ2/RQ2_biasAgrLabels.csv: file to compute the inter-rater agreement related to bias categories
    • RQ2/RQ2_final_bias_categories_with_levels.csv: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

    RQ3

    • RQ3/RQ3_LicenseValidation.csv: manual validation of a sample of licenses
    • RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt: lists of licenses with different permissiveness
    • RQ3/RQ3_prjs_license.csv: for each project linked to models, among other fields it indicates the license tag and name
    • RQ3/RQ3_models_license.csv: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license
    • RQ3/RQ3_model-prj-license_contingency_table.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows)
    • RQ3/RQ3_models_prjs_licenses_with_type.csv: pairs project-model, with their respective licenses and permissiveness level

    scripts

    Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README

  10. Annotated 12-lead ECG dataset

    • zenodo.org
    zip
    Updated Jun 7, 2021
    + more versions
    Cite
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro (2021). Annotated 12-lead ECG dataset [Dataset]. http://doi.org/10.5281/zenodo.3765642
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    # Annotated 12 lead ECG dataset
    
    Contains 827 ECG tracings from different patients, annotated by several cardiologists, residents, and medical students. It is used as the test set in the paper "Automatic diagnosis of the 12-lead ECG using a deep neural network": https://www.nature.com/articles/s41467-020-15432-4.
    
    It contains annotations for 6 different ECG abnormalities:
    - 1st degree AV block (1dAVb);
    - right bundle branch block (RBBB);
    - left bundle branch block (LBBB);
    - sinus bradycardia (SB);
    - atrial fibrillation (AF); and,
    - sinus tachycardia (ST).
    
    Companion python scripts are available in:
    https://github.com/antonior92/automatic-ecg-diagnosis
    
    --------
    
    Citation
    ```
    Ribeiro, A.H., Ribeiro, M.H., Paixão, G.M.M. et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat Commun 11, 1760 (2020). https://doi.org/10.1038/s41467-020-15432-4
    ```
    
    Bibtex:
    ```
    @article{ribeiro_automatic_2020,
     title = {Automatic Diagnosis of the 12-Lead {{ECG}} Using a Deep Neural Network},
     author = {Ribeiro, Ant{\^o}nio H. and Ribeiro, Manoel Horta and Paix{\~a}o, Gabriela M. M. and Oliveira, Derick M. and Gomes, Paulo R. and Canazart, J{\'e}ssica A. and Ferreira, Milton P. S. and Andersson, Carl R. and Macfarlane, Peter W. and Meira Jr., Wagner and Sch{\"o}n, Thomas B. and Ribeiro, Antonio Luiz P.},
     year = {2020},
     volume = {11},
     pages = {1760},
     doi = {https://doi.org/10.1038/s41467-020-15432-4},
     journal = {Nature Communications},
     number = {1}
    }
    ```
    -----
    
    
    ## Folder content:
    
    - `ecg_tracings.hdf5`: this file is not available in the GitHub repository because of its size, but it can be downloaded [here](https://doi.org/10.5281/zenodo.3625006). The HDF5 file contains a single dataset named `tracings`. This dataset is a `(827, 4096, 12)` tensor. The first dimension corresponds to the 827 different exams from different patients; the second dimension corresponds to the 4096 signal samples; the third dimension corresponds to the 12 different leads of the ECG exams, in the following order: `{DI, DII, DIII, AVL, AVF, AVR, V1, V2, V3, V4, V5, V6}`.
    
    The signals are sampled at 400 Hz. Some signals originally have a duration of 10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples). In order to make them all have the same size (4096 samples), we pad them with zeros on both sides. For instance, for a 7-second ECG signal with 2800 samples we include 648 samples at the beginning and 648 samples at the end, yielding 4096 samples that are then saved in the hdf5 dataset. All signals are represented as floating point numbers at the scale 1e-4V, so they should be multiplied by 1000 in order to obtain the signals in V.
    
    In Python, one can read this file using the following snippet:
    ```python
    import h5py
    import numpy as np

    # Load the (827, 4096, 12) tensor of ECG tracings.
    with h5py.File("ecg_tracings.hdf5", "r") as f:
        x = np.array(f['tracings'])
    ```
    
    - The file `attributes.csv` contains basic patient attributes: sex (M or F) and age. It contains 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` corresponds to the i-th line.
    - `annotations/`: folder containing annotations in CSV format. Each CSV file contains 827 lines (plus the header); in all CSV files, the i-th line corresponds to the i-th tracing in `ecg_tracings.hdf5`. The CSV files all have 6 columns, `1dAVb, RBBB, LBBB, SB, AF, ST`, corresponding to whether the annotator has detected the abnormality in the ECG (`=1`) or not (`=0`). A comparison sketch follows the list of annotation files below.
     1. `cardiologist[1,2].csv` contain annotations from two different cardiologists.
     2. `gold_standard.csv` contains the gold standard annotations for this test dataset. When cardiologist 1 and cardiologist 2 agreed, the common diagnosis was considered the gold standard. In cases where there was any disagreement, a third senior specialist, aware of the annotations from the other two, decided the diagnosis.
     3. `dnn.csv` contains the predictions from the deep neural network described in the paper. The threshold is set in such a way that it maximizes the F1 score.
     4. `cardiology_residents.csv` contains annotations from two 4th-year cardiology residents (each annotated half of the dataset).
     5. `emergency_residents.csv` contains annotations from two 3rd-year emergency residents (each annotated half of the dataset).
     6. `medical_students.csv` contains annotations from two 5th-year medical students (each annotated half of the dataset).
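
    For example, the gold standard and the network predictions can be compared per abnormality with a few lines. This is a sketch only; it assumes the annotation CSVs sit in `annotations/` and that scikit-learn is installed, which is not part of this dataset.

    ```python
    # Sketch: per-abnormality F1 of the network predictions vs. the gold standard.
    # Assumes the annotation CSVs are in ./annotations/ and scikit-learn is installed.
    import pandas as pd
    from sklearn.metrics import f1_score

    gold = pd.read_csv("annotations/gold_standard.csv")
    dnn = pd.read_csv("annotations/dnn.csv")

    for label in ["1dAVb", "RBBB", "LBBB", "SB", "AF", "ST"]:
        print(label, f1_score(gold[label], dnn[label]))
    ```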
    
  11. steinbock results of IMC example data

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Nov 27, 2023
    Cite
    Windhager, Jonas (2023). steinbock results of IMC example data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6043599
    Explore at:
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Eling, Nils
    Windhager, Jonas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    If you are working with these files, please cite them as follows: Windhager, J., Zanotelli, V.R.T., Schulz, D. et al. An end-to-end workflow for multiplexed image processing and analysis. Nat Protoc (2023). https://doi.org/10.1038/s41596-023-00881-0

    This repository hosts the results of processing example imaging mass cytometry (IMC) data hosted at zenodo.org/record/5949116 using the steinbock framework available at github.com/BodenmillerGroup/steinbock. Please refer to steinbock.sh for how these data were generated from the raw data. The following files are part of this repository (a short loading sketch follows the file list):

    • panel.csv: contains channel information regarding the used antibodies in steinbock format
    • img.zip: contains hot pixel filtered multi-channel images derived from the IMC raw data; one file per acquisition is generated
    • images.csv: contains metadata per acquisition
    • pixel_classifier.ilp: ilastik pixel classifier (same as the one in zenodo.org/record/6043544)
    • ilastik_crops.zip: image crops on which the ilastik classifier was trained (same as the ones in zenodo.org/record/6043544)
    • ilastik_img.zip: contains multi-channel images (one per acquisition) in .h5 format for ilastik pixel classification
    • ilastik_probabilities.zip: 3-channel images containing the pixel probabilities after pixel classification
    • masks_ilastik.zip: segmentation masks derived from the ilastik pixel probabilities using the cell_segmentation.cppipe pipeline
    • masks_deepcell.zip: segmentation masks derived by deepcell segmentation
    • intensities.zip: contains one .csv file per acquisition; each file contains single-cell measures of the mean pixel intensity per cell and channel based on the files in img.zip and masks_deepcell.zip
    • regionprops.zip: contains one .csv file per acquisition; each file contains single-cell measures of the morphological features and location of cells based on masks_deepcell.zip
    • neighbors.zip: contains one .csv file per acquisition; each file contains an edge list of cell IDs indicating cells in close proximity based on masks_deepcell.zip
    • ome.zip: contains .ome.tiff files derived from img.zip; one file per acquisition
    • histocat.zip: contains single-channel .tiff files with segmentation masks derived from masks_deepcell.zip for upload to histoCAT (bodenmillergroup.github.io/histoCAT)
    • cells.csv: contains intensity and regionprop measurements of all cells
    • cells_csv.zip: contains intensity and regionprop measurements of all cells per acquisition
    • cells.fcs: contains intensity and regionprop measurements of all cells in fcs format
    • cells_fcs.zip: contains intensity and regionprop measurements of all cells per acquisition in fcs format
    • cells.h5ad: contains intensity, regionprop and neighbor measurements of all cells in anndata format
    • cells_h5ad: contains intensity, regionprop and neighbor measurements of all cells per acquisition in anndata format
    • graphs.zip: contains spatial object graphs in .graphml format; one file per acquisition
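
    As a small sketch of downstream use (assuming the Python packages pandas and anndata are installed; they are not part of this repository, and the column layouts are not listed above), the aggregated single-cell tables can be loaded like this:

    ```python
    # Sketch: load the aggregated single-cell measurements shipped with this repository.
    # Assumes pandas and anndata are installed.
    import pandas as pd
    import anndata as ad

    cells = pd.read_csv("cells.csv")    # intensity and regionprop measurements
    adata = ad.read_h5ad("cells.h5ad")  # intensity, regionprop, and neighbor measurements

    print(cells.shape)
    print(adata)
    ```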

  12. Readme files in 16,000,000 public GitHub repositories (October 2016)

    • zenodo.org
    • explore.openaire.eu
    • +1more
    application/gzip, bin
    Updated Jan 24, 2020
    Cite
    Markovtsev Vadim (2020). Readme files in 16,000,000 public GitHub repositories (October 2016) [Dataset]. http://doi.org/10.5281/zenodo.285419
    Explore at:
    bin, application/gzip (available download formats)
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Markovtsev Vadim
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Format

    index.csv.gz - comma-separated CSV file with 3 columns:

    The flag is either "s" (readme found) or "r" (readme does not exist at the root directory level). The readme file name may be any of the following:

    "README.md", "readme.md", "Readme.md", "README.MD", "README.txt", "readme.txt", "Readme.txt", "README.TXT", "README", "readme", "Readme", "README.rst", "readme.rst", "Readme.rst", "README.RST"

    100 part-r-00xxx files are in "new" Hadoop API format with the following settings:

    1. inputFormatClass is org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

    2. keyClass is org.apache.hadoop.io.Text - repository name

    3. valueClass is org.apache.hadoop.io.BytesWritable - gzipped readme file
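
    A sketch of reading the index follows. The column order (repository name, flag, readme file name) is an assumption; the description above only states that the file has 3 columns and documents the flag values.

    ```python
    # Sketch: stream index.csv.gz and count repositories with and without a readme.
    # The column order is an assumption; only the flag semantics are documented above.
    import csv
    import gzip
    from collections import Counter

    counts = Counter()
    with gzip.open("index.csv.gz", "rt", newline="") as f:
        for row in csv.reader(f):
            flag = row[1]      # assumed order: repository, flag, readme file name
            counts[flag] += 1  # "s" = readme found, "r" = no readme at root level

    print(counts)
    ```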

  13. Developer Community and Code Datasets

    • datarade.ai
    Cite
    Oxylabs, Developer Community and Code Datasets [Dataset]. https://datarade.ai/data-products/developer-community-and-code-datasets-oxylabs
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt (available download formats)
    Dataset authored and provided by
    Oxylabs
    Area covered
    Guyana, Saint Pierre and Miquelon, Bahamas, El Salvador, Tuvalu, Djibouti, Marshall Islands, South Sudan, Philippines, United Kingdom
    Description

    Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.

    Data Sources:

    1. GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.

    2. StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.

    3. DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.

    Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.

    With our datasets, you'll receive:

    • Usernames;
    • Companies;
    • Locations;
    • Job Titles;
    • Follower Counts;
    • Contact Details;
    • Employability Statuses;
    • And More.

    Choose from various output formats, storage options, and delivery frequencies:

    • Get datasets in CSV, JSON, or other preferred formats.
    • Opt for data delivery via SFTP or directly to your cloud storage, such as AWS S3.
    • Receive datasets either once or as per your agreed-upon schedule.

    Why choose our Datasets?

    1. Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.

    2. Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.

    3. Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.

    4. Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.

    Pricing Options:

    Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!

  14. S1: One Tree Reef Foraminifera: a relic of the pre-colonial Great Barrier...

    • geolsoc.figshare.com
    zip
    Updated Sep 29, 2022
    Cite
    Yvette Bauder; Briony Mamo; Glenn A. Brock; Matthew A. Kosnik (2022). S1: One Tree Reef Foraminifera: a relic of the pre-colonial Great Barrier Reef [Dataset]. http://doi.org/10.6084/m9.figshare.21229562.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 29, 2022
    Dataset provided by
    Geological Society of London (http://www.geolsoc.org.uk/)
    Authors
    Yvette Bauder; Briony Mamo; Glenn A. Brock; Matthew A. Kosnik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Great Barrier Reef, One Tree Island Reef
    Description

    Foraminifera and sample data (https://github.com/makosnik/whoForams):

    • Mamo_Lagoon_Forams.csv contains Foraminifera abundance data for surface grab samples (Mamo 2016).
    • Mamo_Water_depth.csv contains site location and water depth data for the Mamo collection sites.
    • OTR_Core_Forams.csv contains the Foraminifera abundance data for the OTR core; its columns give the layer depth (in cm) and the size fraction (in um).
    • OTR_Core_pb210_CIC_Ages.csv contains the Pb-210 dating results used for this paper, originally published in Kosnik et al. (2015).
    • All_sed_results.csv contains the sediment grain size analyses used for Figure 2, originally published in Kosnik et al. (2015).
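
    A minimal loading sketch, assuming a local clone of https://github.com/makosnik/whoForams with the CSV files at the top level of that clone (the paths are assumptions):

    import pandas as pd

    # Abundance per layer depth (cm) and size fraction (um)
    core = pd.read_csv("whoForams/OTR_Core_Forams.csv")
    # Pb-210 dating results (Kosnik et al. 2015)
    ages = pd.read_csv("whoForams/OTR_Core_pb210_CIC_Ages.csv")

    print(core.shape)
    print(ages.head())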

  15. Evaluated Artifact for "Quantifying Software Reliability via Model-Counting"...

    • radar.kit.edu
    • radar-service.eu
    tar
    Updated Jun 23, 2023
    + more versions
    Cite
    Alexander Weigl; Samuel Teuber (2023). Evaluated Artifact for "Quantifying Software Reliability via Model-Counting" [Dataset]. http://doi.org/10.35097/1520
    Explore at:
    Available download formats: tar (677317632 bytes)
    Dataset updated
    Jun 23, 2023
    Dataset provided by
    Teuber, Samuel
    Karlsruhe Institute of Technology
    Authors
    Alexander Weigl; Samuel Teuber
    Description

    counterSharp Experiment and Play Environment

    This repository contains the reproducible experimental evaluation of the counterSharp tool. The repository contains a Docker file which configures the counterSharp tool, two model counters (ApproxMC and Ganak), and the tool by Dimovski et al. for our experiments. Furthermore, the repository contains the benchmarks on which we ran our experiments, the logs of our experiments, and scripts for transforming the log files into LaTeX tables.

    Getting Started

    In order to pull and run the Docker container from Docker Hub, you can simply execute docker run. Alternatively, you can load the archived and evaluated artifact into Docker with:

    docker load < countersharp-experiments.tar.gz

    Once the image is loaded, docker run opens a shell allowing the execution of further commands:

    docker run -it -v `pwd`/results:/experiments/results samweb/countersharp-experiments

    By using a volume, the results are written to the host system rather than to the Docker container; this creates a writable folder results in your current folder which will hold any logs from the experiments. You can remove the volume mounting option (-v ...) and create /experiments/results inside the container instead if you do not need to keep the results. If you are using the volume and run into permission problems, you need to grant rights via SELinux:

    chcon -Rt svirt_sandbox_file_t `pwd`/results

    A minimal example can be executed by running (this takes approximately 70 seconds):

    ./showcase.sh

    This creates benchmark log files for the benchmarks for_bounded_loop1.c and overflow.c in the folder results. For example, /experiments/results/for_bounded_loop1.c/0X/ contains five folders for the five repeated runs of the experiments on this file; each of these folders contains one folder per tool, which includes the log and output files. A full run can be executed by running (this takes a little under 2 days):

    ./run-all.sh

    Additionally, single benchmarks can be executed through the following commands:

    run-instance approx program.c "[counterSharp arguments]"  # Runs counterSharp with ApproxMC on program.c
    run-instance ganak program.c "[counterSharp arguments]"  # Runs counterSharp with Ganak on program.c
    Probab.native -single -domain polyhedra program.c  # Runs the tool by Dimovski et al. for deterministic programs
    Probab.native -single -domain polyhedra -nondet program.c  # Runs the tool by Dimovski et al. for nondeterministic programs

    For example, we can execute run-instance approx /experiments/benchmarks/confidence.c "--function testfun --unwind 1" to obtain the outcome of counterSharp and ApproxMC for the benchmark confidence.c. Note that the time information produced by runlim always covers only one part of the entire execution (i.e. counterSharp, one ApproxMC run, or one Ganak run). The script run-instance is straightforward: it calls our tool counterSharp,

    python3 -m counterSharp --amm /tmp/amm.dimacs --amh /tmp/amh.dimacs --asm /tmp/asm.dimacs --ash /tmp/ash.dimacs --con /tmp/con.dimacs -d $3 $2

    which is followed by the call to ApproxMC or Ganak.

    Benchmarks

    The benchmarks are contained in the folder benchmarks, which also includes an overview of the sources of the benchmarks and the modifications made to them. Note that the benchmark versions for the tool by Dimovski et al. are contained in the folder benchmarks-dimovski.

    Benchmark Results

    The results are contained in the folder results, in which all logs from benchmark runs reside. The log files from the evaluation are not available in the Docker image, but only on GitHub. The logs are split up by benchmark instance (first-level folder), run number (second-level folder), and tool (third-level folder).
    For example, the file results/bwd_loop1a.c/01/approxmc/stdout.log contains the stdout and stderr of running ApproxMC on the instance bwd_loop1a.c in run 01.

    Machine Details

    All runs were executed on a Linux machine with an Intel(R) Core(TM) i5-6500 CPU (3.20GHz) and 16GB of memory. Note that for every benchmark, the log 01/counterSharp/init.log contains information on the machine used for benchmark execution as well as on the commits used in the experiments.

    Running benchmarks

    For all cases of automated benchmark execution we assume a CSV file containing the relevant information on the instances to run: the first column is the benchmark's name, the second column holds the parameters passed to counterSharp (see instances.csv) or to the tool by Dimovski (see instances-dimovski.csv). All scripts produce benchmarking results only for "missing" instances, i.e. instances for which no folder can be found in the results folder.

    • Run counterSharp on the benchmarks: run-counterSharp instances.csv
    • Run ApproxMC on the benchmarks (only after counterSharp has been run): run-approxmc instances.csv
    • Run Ganak on the benchmarks (only after counterSharp has been run): run-ganak instances.csv
    • Run Dimovski's tool on the benchmarks: run-dimovski instances-dimovski.csv

    Log summarization

    Summarization is possible through the Python script logParsing/parse.py within the container. The script takes as input a list of benchmarks to process and returns (parts of) a LaTeX table. Note that logs must exist for all benchmarks provided in the CSV file for the call to succeed.

    • To obtain (sorted) results for the deterministic benchmarks: cat logParsing/deterministic-sorted.csv | python3 logParsing/parse.py results aggregate2
    • To obtain (sorted) results for the nondeterministic benchmarks: cat logParsing/nondeterministic-sorted.csv | python3 logParsing/parse.py results nondet

    Building the docker container

    All tools are packaged into a Dockerfile, which makes any further installation unnecessary; only a running Docker installation is required. The Dockerfile build depends on the accessibility of the following GitHub repositories:

    • CryptoMiniSat
    • ApproxMC
    • Ganak
    • Probab_Analyzer
    • counterSharp

    The Docker image is hosted on Docker Hub.

  16. Data from: TDMentions: A Dataset of Technical Debt Mentions in Online Posts

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Anna Wingkvist (2020). TDMentions: A Dataset of Technical Debt Mentions in Online Posts [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2593141
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Anna Wingkvist
    Morgan Ericsson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TDMentions: A Dataset of Technical Debt Mentions in Online Posts (version 1.0)

    TDMentions is a dataset that contains mentions of technical debt from Reddit, Hacker News, and Stack Exchange. It also contains a list of blog posts on Medium that were tagged as technical debt. The dataset currently contains approximately 35,000 items.

    Data collection and processing

    The dataset is mainly collected from existing datasets. We used data from:

    The data set currently contains data from the start of each source/service until 2018-12-31. For GitHub, we currently only include data from 2015-01-01.

    We use the regular expression tech(nical)?[\s\-_]*?debt to find mentions in all sources except for Medium. We decided to limit our matches to variations of technical debt and tech debt. Other shorter forms, such as TD, can result in too many false positives. For Medium, we used the tag technical-debt.
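
    A minimal sketch of this matching in Python (whether matching was case-insensitive is not stated in the description; the sketch assumes it):

    import re

    # The pattern used for all sources except Medium.
    TD_PATTERN = re.compile(r"tech(nical)?[\s\-_]*?debt", re.IGNORECASE)

    samples = [
        "We finally started paying down our technical debt.",
        "tech-debt keeps piling up in this module",
        "TD is deliberately not matched, to avoid false positives",
    ]

    for text in samples:
        print(bool(TD_PATTERN.search(text)), text)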

    Data Format

    The dataset is stored as a compressed (bzip2) JSON file with one JSON object per line. Each mention is represented as a JSON object with the following keys.

    • id: the id used in the original source. We use the URL path to identify Medium posts.
    • body: the text that contains the mention. This is either the comment or the title of the post. For Medium posts this is the title and subtitle (which might not mention technical debt, since posts are identified by the tag).
    • created_utc: the time the item was posted in seconds since epoch in UTC.
    • author: the author of the item. We use the username or userid from the source.
    • source: where the item was posted. Valid sources are:
      • HackerNews Comment
      • HackerNews Job
      • HackerNews Submission
      • Reddit Comment
      • Reddit Submission
      • StackExchange Answer
      • StackExchange Comment
      • StackExchange Question
      • Medium Post
    • meta: Additional information about the item specific to the source. This includes, e.g., the subreddit a Reddit submission or comment was posted to, the score, etc. We try to use the same names, e.g., score and num_comments for keys that have the same meaning/information across multiple sources.

    This is a sample item from Reddit:

    {
     "id": "ab8auf",
     "body": "Technical Debt Explained (x-post r/Eve)",
     "created_utc": 1546271789,
     "author": "totally_100_human",
     "source": "Reddit Submission",
     "meta": {
      "title": "Technical Debt Explained (x-post r/Eve)",
      "score": 1,
      "num_comments": 0,
      "url": "http://jestertrek.com/eve/technical-debt-2.png",
      "subreddit": "RCBRedditBot"
     }
    }
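
    The file can also be processed outside of jq; a minimal Python sketch that reproduces the per-source counts from the first jq example below:

    import bz2
    import json
    from collections import Counter

    # Stream the bzip2-compressed JSON-lines file and count items per source.
    counts = Counter()
    with bz2.open("postscomments.json.bz2", "rt", encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            counts[item["source"]] += 1

    for source, n in counts.most_common():
        print(f"{n:7d} {source}")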
    

    Sample Analyses

    We decided to use JSON to store the data, since it is easy to work with from multiple programming languages. In the following examples, we use jq to process the JSON.

    How many items are there for each source?

    lbzip2 -cd postscomments.json.bz2 | jq '.source' | sort | uniq -c
    

    How many submissions that mentioned technical debt were posted each month?

    lbzip2 -cd postscomments.json.bz2 | jq 'select(.source == "Reddit Submission") | .created_utc | gmtime | strftime("%Y-%m")' | sort | uniq -c
    

    What are the titles of items that link (meta.url) to PDF documents?

    lbzip2 -cd postscomments.json.bz2 | jq '. as $r | select(.meta.url?) | .meta.url | select(endswith(".pdf")) | $r.body'
    

    Please, I want CSV!

    lbzip2 -cd postscomments.json.bz2 | jq -r '[.id, .body, .author] | @csv'
    

    Note that you need to specify the keys you want to include for the CSV, so it is easier to either ignore the meta information or process each source.

    Please see https://github.com/sse-lnu/tdmentions for more analyses

    Limitations and Future updates

    The current version of the dataset lacks GitHub data and Medium comments. GitHub data will be added in the next update. Medium comments (responses) will be added in a future update if we find a good way to represent these.

  17. Covid-19 JHU (Johns Hopkins University)

    • data.europa.eu
    csv
    + more versions
    Cite
    Bruno Adelé, Covid-19 JHU (Johns Hopkins University) [Dataset]. https://data.europa.eu/data/datasets/5eb2f0fec170a3c7c331a101?locale=en
    Explore at:
    Available download formats: csv (2621440)
    Dataset authored and provided by
    Bruno Adelé
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Data from a Covid-19 data extraction from Johns Hopkins University (JHU)

    The data were processed with the project script world-datas-analysis in order to add additional columns, including the ratio of cases relative to the number of inhabitants; the result was then exported in CSV format.

    Initial source: https://github.com/CSSEGISandData/COVID-19. File exported from world-datas-analysis: CSV file.

    The project world-datas-analysis can export filtered data in gnuplot format according to your needs; see the example below.

    Example rendering with gnuplot (figure omitted)

  18. Measurement and Computation of Fire Phenomena (MaCFP) Condensed Phase...

    • data.nist.gov
    • datasets.ai
    • +1more
    Updated Apr 22, 2021
    Cite
    National Institute of Standards and Technology (2021). Measurement and Computation of Fire Phenomena (MaCFP) Condensed Phase Material Database [Dataset]. http://doi.org/10.18434/mds2-2586
    Explore at:
    Dataset updated
    Apr 22, 2021
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    License

    https://www.nist.gov/open/license

    Description

    The MaCFP Condensed Phase Subgroup has been designed to enable the fire research community to make significant progress towards establishing a common framework for the selection of experiments and the methodologies used to analyze these experiments when developing pyrolysis models. Experimental measurements prepared for the MaCFP Condensed Phase Working Group are submitted electronically by participating institutions and are organized and made publicly available in the MaCFP repository, which is hosted on GitHub [https://github.com/MaCFP/matl-db]. This database is version controlled, with each addition to (or edit of) measurement data saved with a unique identifier (i.e., commit tag). The repository was created and is managed by members of the MaCFP Organizing Committee.

    As of October 2021, the MaCFP Condensed Phase Material Database contains measurement data from more than 200 unique experiments (conducted under 35 different test conditions on the same poly(methyl methacrylate), PMMA). All measurement data submitted by each institution is organized in a single folder with the institution's name. A consistent file naming convention is used for all test data (i.e., across all folders). File names indicate the institution name, experimental apparatus, and basic test conditions (e.g., gaseous environment and incident heat flux or heating rate). Measurement data from repeated experiments is saved in separate, ASCII comma-delimited (.csv) files, each numbered sequentially. Written descriptions of sample preparation, test setup, and test procedure (which define the conditions associated with the experiments conducted) are included in each folder as a README.md file; this file is automatically interpreted by GitHub as Markdown (.md) text and provides a brief description of an institution's data.
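
    A minimal sketch for iterating over these submissions in a local clone of the repository (the clone location and the assumption that institution folders sit directly below the PMMA material folder, e.g. Non-charring/PMMA, are mine, not part of the description):

    from pathlib import Path

    import pandas as pd

    # Assumed layout: one folder per institution, each holding a README.md
    # and comma-delimited (.csv) measurement files.
    material_dir = Path("matl-db/Non-charring/PMMA")

    for institution_dir in sorted(p for p in material_dir.iterdir() if p.is_dir()):
        for csv_file in sorted(institution_dir.glob("*.csv")):
            try:
                df = pd.read_csv(csv_file)
                print(institution_dir.name, csv_file.name, df.shape)
            except pd.errors.ParserError:
                print("could not parse", csv_file)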

    How to cite this data

    You may cite the use of this data as follows: Batiot, B., Bruns, M., Hostikka, S., Leventon, I., Nakamura, Y., Reszka, P., Rogaume, T., Stoliarov, S., Measurement and Computation of Fire Phenomena (MaCFP) Condensed Phase Material Database, https://github.com/MaCFP/matl-db, Commit Tag: [give commit; e.g., 7f89fd8], https://doi.org/10.18434/mds2-2586 (Accessed: [give download date]). This data is publicly available according to the NIST statements of copyright, fair use, and licensing; see:

    https://www.nist.gov/director/copyright-fair-use-and-licensing-statements-srd-data-and-software

    Version History

    The MaCFP repository, which is hosted on GitHub [https://github.com/MaCFP/matl-db], is version controlled, with each addition (or edit) saved with a unique identifier (i.e., commit tag). When citing this database, you must include the commit tag that identifies the version of the repository you are working with.
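
    A minimal sketch for retrieving that commit tag from a local clone (the clone path is an assumption):

    import subprocess

    # Read the short commit hash of the checked-out version of the repository.
    commit_tag = subprocess.run(
        ["git", "-C", "matl-db", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    print(f"Commit Tag: {commit_tag}")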

    Experiments Conducted

    1. Milligram-Scale Tests
       1.1 Thermogravimetric Analysis (TGA)
       1.2 Differential Scanning Calorimetry (DSC)
       1.3 Microscale Combustion Calorimetry (MCC)
    2. Gram-Scale Tests
       2.1 Cone Calorimeter
       2.2 Anaerobic Gasification
       2.3 Thermal Conductivity and Diffusivity (Hot Disk and Laser Flash)

    How to interpret and use data in this repository for pyrolysis model calibration and validation

    Further information regarding the use and interpretation of the data in this repository is available online: https://github.com/MaCFP/matl-db/tree/master/Non-charring/PMMA. This information includes:

    • Key factors influencing material response during tests
    • Outlier criteria: identification of clearly incorrect behavior in measurement data

    Methodological Information

    A preliminary summary of the measurement data contained in this repository is available online: https://github.com/MaCFP/matl-db/releases

  19. Data from: Burmese-Microbiology-1K

    • kaggle.com
    • huggingface.co
    Updated Jul 24, 2024
    Cite
    Min Si Thu (2024). Burmese-Microbiology-1K [Dataset]. https://www.kaggle.com/datasets/minsithu/burmese-microbiology-1k/code
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jul 24, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Min Si Thu
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Burmese-Microbiology-1K

    Min Si Thu, min@globalmagicko.com

    Microbiology 1K QA pairs in Burmese Language

    Purpose

    Before this Burmese Clinical Microbiology 1K dataset, open-source resources for training Burmese large language models in medical fields were rare. A high-quality dataset therefore needed to be curated to cover medical knowledge for the development of LLMs in the Burmese language.

    Motivation

    I found an old notebook in my box. The notebook was from 2019 and contained notes on microbiology written when I was a third-year medical student. Because of the need for Burmese-language resources in medical fields, I added more facts and notes and curated a dataset on microbiology in the Burmese language.

    About

    The dataset for microbiology in the Burmese language contains 1262 rows of instruction and output pairs in CSV format. It mainly focuses on foundational clinical microbiology knowledge, covering basic facts on culture media; microbes such as bacteria, viruses, fungi, and parasites; and the diseases caused by these microbes.
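
    A minimal loading sketch (the file name is an assumption; the dataset is described as instruction/output pairs in CSV format):

    import pandas as pd

    df = pd.read_csv("burmese-microbiology-1k.csv")

    print(df.shape)   # the description states 1262 instruction/output rows
    print(df.iloc[0])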

    Examples

    • ငှက်ဖျားရောဂါဆိုတာ ဘာလဲ?,ငှက်ဖျားရောဂါသည် Plasmodium ကပ်ပါးကောင်ကြောင့် ဖြစ်ပွားသော အသက်အန္တရာယ်ရှိနိုင်သည့် သွေးရောဂါတစ်မျိုးဖြစ်သည်။ ၎င်းသည် ငှက်ဖျားခြင်ကိုက်ခြင်းမှတဆင့် ကူးစက်ပျံ့နှံ့သည်။ (English: "What is malaria?" / "Malaria is a potentially life-threatening blood disease caused by the Plasmodium parasite. It spreads through the bite of malaria-carrying mosquitoes.")

    • Influenza virus အကြောင်း အကျဉ်းချုပ် ဖော်ပြပါ။,Influenza virus သည် တုပ်ကွေးရောဂါ ဖြစ်စေသော RNA ဗိုင်းရပ်စ် ဖြစ်သည်။ Orthomyxoviridae မိသားစုဝင် ဖြစ်ပြီး type A၊ B၊ C နှင့် D ဟူ၍ အမျိုးအစား လေးမျိုး ရှိသည်။ (English: "Briefly describe the influenza virus." / "The influenza virus is an RNA virus that causes influenza. It belongs to the family Orthomyxoviridae and has four types: A, B, C, and D.")

    • Clostridium tetani ဆိုတာ ဘာလဲ,Clostridium tetani သည် မေးခိုင်ရောဂါ ဖြစ်စေသော gram-positive၊ anaerobic bacteria တစ်မျိုး ဖြစ်သည်။ မြေဆီလွှာတွင် တွေ့ရလေ့ရှိသည်။ (English: "What is Clostridium tetani?" / "Clostridium tetani is a gram-positive, anaerobic bacterium that causes tetanus. It is commonly found in soil.")

    • Onychomycosis ဆိုတာ ဘာလဲ?,Onychomycosis သည် လက်သည်း သို့မဟုတ် ခြေသည်းများတွင် ဖြစ်ပွားသော မှိုကူးစက်မှုဖြစ်သည်။ ၎င်းသည် လက်သည်း သို့မဟုတ် ခြေသည်းများကို ထူထဲစေပြီး အရောင်ပြောင်းလဲစေသည်။ (English: "What is onychomycosis?" / "Onychomycosis is a fungal infection of the fingernails or toenails. It thickens and discolors the nails.")

    Where to download the dataset

    Applications

    Burmese Microbiology 1K Dataset can be used in building various medical-related NLP applications.

    • The dataset can be used for pretraining or fine-tuning Burmese large language models.
    • The dataset is ready to use in building RAG-based Applications.

    Acknowledgments

    Special thanks to magickospace.org for supporting the curation process of Burmese Microbiology 1K Dataset.

    References for this dataset

    License - CC BY SA 4.0

    How to cite the dataset

    Si Thu, M. (2024). Burmese MicroBiology 1K Dataset (1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.12803638
    
    Si Thu, Min, Burmese-Microbiology-1K (July 24, 2024). Available at SSRN: https://ssrn.com/abstract=4904320
    
  20. Sample of Providers from QHP provider.json files

    • healthdata.demo.socrata.com
    csv, xlsx, xml
    Updated Apr 16, 2016
    Cite
    (2016). Sample of Providers from QHP provider.json files [Dataset]. https://healthdata.demo.socrata.com/CMS-Insurance-Plans/Sample-of-Providers-from-QHP-provider-json-files/axbq-xnwy
    Explore at:
    Available download formats: xlsx, xml, csv
    Dataset updated
    Apr 16, 2016
    Description
Cite
Sebastian Baltes; Sebastian Baltes; Jascha Knack; Jascha Knack (2020). Influence of Continuous Integration on the Development Activity in GitHub Projects [Dataset]. http://doi.org/10.5281/zenodo.1140261

Influence of Continuous Integration on the Development Activity in GitHub Projects

Explore at:
2 scholarly articles cite this dataset
csvAvailable download formats
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sebastian Baltes; Sebastian Baltes; Jascha Knack; Jascha Knack
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 - 2017-07-17.

We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:

  1. were active for one year before the first build with Travis CI (before_ci),
  2. used Travis CI at least for one year (during_ci),
  3. had commit or merge activity on the default branch in both of these phases, and
  4. used the default branch to trigger builds.

To derive the time frames, we employed the GHTorrent Big Query data set. The resulting sample contains 321 projects. Of these projects, 214 are Ruby projects and 107 are Java projects. The mean time span before_ci was 2.9 years (SD=1.9, Mdn=2.3), the mean time span during_ci was 3.2 years (SD=1.1, Mdn=3.3). For our analysis, we only consider the activity one year before and after the first build.

We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).

The dataset contains the following files:

tr_projects_sample_filtered.csv
A CSV file with information about the 321 selected projects.

tr_sample_commits_default_branch_before_ci.csv
tr_sample_commits_default_branch_during_ci.csv

One CSV file with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

project: GitHub project name ("/" replaced by "_").
branch: The branch to which the commit was made.
hash_value: The SHA1 hash value of the commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.

tr_sample_merges_default_branch_before_ci.csv
tr_sample_merges_default_branch_during_ci.csv

One CSV file with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

project: GitHub project name ("/" replaced by "_").
branch: The destination branch of the merge.
hash_value: The SHA1 hash value of the merge commit.
merged_commits: Unique hash value prefixes of the commits merged with this commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
source_branch: Source branch of the pull request (extracted from log message).
