Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 - 2017-07-17.
We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:
To derive the time frames, we employed the GHTorrent Big Query data set. The resulting sample contains 321 projects. Of these projects, 214 are Ruby projects and 107 are Java projects. The mean time span before_ci was 2.9 years (SD=1.9, Mdn=2.3), the mean time span during_ci was 3.2 years (SD=1.1, Mdn=3.3). For our analysis, we only consider the activity one year before and after the first build.
We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).
The dataset contains the following files:
tr_projects_sample_filtered.csv
A CSV file with information about the 321 selected projects.
tr_sample_commits_default_branch_before_ci.csv
tr_sample_commits_default_branch_during_ci.csv
One CSV file each with information about all commits to the default branch before and after the first CI build, respectively. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns (a minimal loading sketch in Python follows the column list):
project: GitHub project name ("/" replaced by "_").
branch: The branch to which the commit was made.
hash_value: The SHA1 hash value of the commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
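The column list above maps directly onto a flat table, so the files can be loaded and aggregated with standard tooling. A minimal sketch in Python, assuming pandas is installed and the files use comma separators (not stated above):

```python
import pandas as pd

# Per-commit metadata for the year before the first CI build.
commits = pd.read_csv("tr_sample_commits_default_branch_before_ci.csv")

# Code churn per project: number of commits, lines added, lines deleted.
churn = commits.groupby("project").agg(
    n_commits=("hash_value", "count"),
    lines_added=("lines_added", "sum"),
    lines_deleted=("lines_deleted", "sum"),
)
print(churn.sort_values("n_commits", ascending=False).head())
```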
tr_sample_merges_default_branch_before_ci.csv
tr_sample_merges_default_branch_during_ci.csv
One CSV file each with information about all merges into the default branch before and after the first CI build, respectively. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns (a sketch for separating pull-request merges follows the column list):
project: GitHub project name ("/" replaced by "_").
branch: The destination branch of the merge.
hash_value: The SHA1 hash value of the merge commit.
merged_commits: Unique hash value prefixes of the commits merged with this commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
source_branch : Source branch of the pull request (extracted from log message).
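Because the merge files additionally carry pull request fields extracted from the log message, pull-request merges can be separated from other merges by checking pull_request_id. A minimal sketch, again assuming pandas and comma-separated files:

```python
import pandas as pd

merges = pd.read_csv("tr_sample_merges_default_branch_during_ci.csv")

# Rows with a pull_request_id were merged via a GitHub pull request;
# rows without one are other merges (e.g., local branch merges).
is_pr_merge = merges["pull_request_id"].notna()
print("pull-request merges:", int(is_pr_merge.sum()))
print("other merges:", int((~is_pr_merge).sum()))

# Pull-request merges per project.
print(merges[is_pr_merge].groupby("project").size().sort_values(ascending=False).head())
```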
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains >800K CSV files behind the GitTables 1M corpus.
For more information about the GitTables corpus, visit the GitTables website.
https://github.com/nytimes/covid-19-data/blob/master/LICENSE
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the metadata of the datasets published in 101 Dataverse installations, information about the metadata blocks of 106 installations, and the lists of pre-defined licenses or dataset terms that depositors can apply to datasets in the 88 installations that were running versions of the Dataverse software that include the "multiple-license" feature. The data is useful for improving understanding of how certain Dataverse features and metadata fields are used and for learning about the quality of dataset and file-level metadata within and across Dataverse installations.
How the metadata was downloaded
The dataset metadata and metadata block JSON files were downloaded from each installation between August 25 and August 30, 2024 using a "get_dataverse_installations_metadata" function in a collection of Python functions at https://github.com/jggautier/dataverse-scripts/blob/main/dataverse_repository_curation_assistant/dataverse_repository_curation_assistant_functions.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL for which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects this CSV file and uses the listed API tokens to get metadata and other information from installations that require API tokens in order to use certain API endpoints.
How the files are organized
├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author_2024.08.25-2024.08.30.csv
│   ├── contributor_2024.08.25-2024.08.30.csv
│   ├── data_source_2024.08.25-2024.08.30.csv
│   ├── ...
│   └── topic_classification_2024.08.25-2024.08.30.csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2024.08.26_15.52.42.zip
│   ├── dataset_pids_Abacus_2024.08.26_15.52.42.csv
│   ├── Dataverse_JSON_metadata_2024.08.26_15.52.42
│   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
│   ├── ...
│   ├── metadatablocks_v5.9
│   ├── astrophysics_v5.9.json
│   ├── biomedical_v5.9.json
│   ├── citation_v5.9.json
│   ├── ...
│   ├── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2024.08.26_00.02.51.zip
│   ├── ...
│   └── Yale_Dataverse_2024.08.25_03.52.57.zip
└── dataverse_installations_summary_2024.08.30.csv
└── dataset_pids_from_most_known_dataverse_installations_2024.08.csv
└── license_options_for_each_dataverse_installation_2024.08.28_14.42.54.csv
└── metadatablocks_from_most_known_dataverse_installations_2024.08.30.csv
This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the "Citation" metadata block and "Geospatial" metadata block of datasets in the 101 Dataverse installations. For example, author_2024.08.25-2024.08.30.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 101 installations, with a column for each of the four child fields: author name, affiliation, identifier type, and identifier. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 106 zip files, one zip file for each of the 106 Dataverse installations whose sites were functioning when I attempted to collect their metadata.
Each zip file contains a directory with JSON files that have information about the installation's metadata fields, such as the field names and how they're organized. For installations that had published datasets and where I was able to use Dataverse APIs to download the dataset metadata, the zip file also contains:
A CSV file listing information about the datasets published in the installation, including a column to indicate whether the Python script was able to download the Dataverse JSON metadata for each dataset.
A directory of JSON files that contain the metadata of the installation's published, non-deaccessioned dataset versions in the Dataverse JSON metadata schema.
The dataverse_installations_summary_2024.08.30.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata included and not included in this dataset. The dataset_pids_from_most_known_dataverse_installations_2024.08.csv file contains the dataset PIDs of published datasets in the 101 Dataverse installations, with a column to indicate whether the Python script was able to download the dataset's metadata. It is a union of all "dataset_pids_....csv" files in each of the 101 zip files in the dataverse_json_metadata_from_each_known_dataverse_installation directory. The license_options_for_each_dataverse_installation_2024.08.28_14.42.54.csv file contains information about the licenses and...
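To get a first overview of the collection, the summary CSV can be loaded directly and the per-installation zip files inspected one by one. A minimal sketch, assuming pandas is installed, the files sit in the working directory, and using one of the zip file names from the tree above (column names are not fixed here):

```python
import zipfile
import pandas as pd

# Per-installation summary: name, URL, Dataverse version, dataset metadata counts.
summary = pd.read_csv("dataverse_installations_summary_2024.08.30.csv")
print(len(summary), "installations")
print(summary.columns.tolist())

# Peek into one installation's archive of Dataverse JSON metadata.
zip_path = ("dataverse_json_metadata_from_each_known_dataverse_installation/"
            "Abacus_2024.08.26_15.52.42.zip")
with zipfile.ZipFile(zip_path) as zf:
    json_files = [name for name in zf.namelist() if name.endswith(".json")]
    print(len(json_files), "JSON files in the Abacus archive")
```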
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) the focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides the full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) the mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying GitHub repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.
Options to access the dataset
There are two ways to get access to the dataset:
1. Static dump of the dataset available in the CSV format
2. Continuously updated dataset available via REST API
In order to obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.
References
If you use this dataset in any publication, project, tool, or in any other form, please cite the following papers:
@inproceedings{SrbaMonantPlatform,
author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
pages = {1--7},
title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
year = {2019}
}
@inproceedings{SrbaMonantMedicalDataset,
author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
numpages = {11},
title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
year = {2022},
doi = {10.1145/3477495.3531726},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531726},
}
Dataset creation process
In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, WordPress sites, the Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.
Reporting mistakes in the dataset
The way to report considerable mistakes in the raw collected data or in the manual annotations is to create a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
First, the dataset contains so-called raw data, i.e., data extracted by the web monitoring module of the Monant platform and stored in exactly the same form as they appear on the original websites. Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (number of likes, shares, comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files (and corresponding REST API endpoints):
Note: Personal information about discussion posts' authors (name, website, gravatar) are anonymised.
Annotations
Second, the dataset contains so-called annotations. Entity annotations describe individual raw data entities (e.g., article, source). Relation annotations describe a relation between two such entities.
Each annotation is described by the following attributes:
At the same time, annotations are associated with a particular object identified by:
entity_type (in the case of entity annotations) or source_entity_type and target_entity_type (in the case of relation annotations). Possible values: sources, articles, fact-checking-articles.
entity_id (in the case of entity annotations) or source_entity_id and target_entity_id (in the case of relation annotations).
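As an illustration of this identification scheme (the file name and export format below are hypothetical; only the identifier fields are taken from the description above), relation annotations can be grouped by the entity types they connect:

```python
import pandas as pd

# Hypothetical CSV export of relation annotations; the identifier columns
# follow the scheme described above.
relations = pd.read_csv("relation_annotations.csv")

# How many relation annotations connect each (source, target) entity-type pair,
# e.g., articles mapped to fact-checking-articles.
pairs = relations.groupby(["source_entity_type", "target_entity_type"]).size()
print(pairs)
```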
These data are the results of a systematic review that investigated how data standards and reporting formats are documented on the version control platform GitHub. Our systematic review identified 32 data standards in earth science, environmental science, and ecology that use GitHub for version control of data standard documents. In our analysis, we characterized the documents and content within each of the 32 GitHub repositories to identify common practices for groups that version control their documents on GitHub. In this data package, there are 8 CSV files that contain data that we characterized from each repository, according to the location within the repository. For example, in 'readme_pages.csv' we characterize the content that appears across the 32 GitHub repositories included in our systematic review. Each of the 8 CSV files has an associated data dictionary file (names appended with '_dd.csv'), in which we describe each content category within the CSV files. There is one file-level metadata file (flmd.csv) that provides a description of each file within the data package.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The stimuli used here were short videos of magic tricks taken from a validated stimulus set (MagicCATs, Ozono et al., 2021) specifically created for the usage in fMRI studies. All final stimuli are available upon request. The request procedure is outlined in the Open Science Framework repository associated with the MagicCATs stimulus set (https://osf.io/ad6uc/).
Participants’ responses to demographic questions, questionnaires, and performance in the working memory assessment as well as both tasks are available in comma-separated value (CSV) files. Demographic (MMC_demographics.csv), raw questionnaire (MMC_raw_quest_data.csv) and other score data (MMC_scores.csv) as well as other information (MMC_other_information.csv) are structured as one line per participant with questions and/or scores as columns. Explicit wordings and naming of variables can be found in the supplementary information. Participant scan summaries (MMC_scan_subj_sum.csv) contain descriptives of brain coverage, TSNR, and framewise displacement (one row per participant) averaged first within acquisitions and then within participants. Participants’ responses and reaction times in the magic trick watching and memory task (MMC_experimental_data.csv) are stored as one row per trial per participant.
Data was preprocessed using the AFNI (version 21.2.03) software suite. As a first step, the EPI timeseries were distortion-corrected along the encoding axis (P>>A) using the phase difference map (‘epi_b0_correct.py’). The resulting distortion-corrected EPIs were then processed separately for each task, but scans from the same task were processed together. The same blocks were applied to both task and resting-state distortion-corrected EPI data using afni_proc.py (see below): despiking, slice-timing and head-motion correction, intrasubject alignment between anatomy and EPI, intersubject registration to MNI, masking, smoothing, scaling, and denoising. For more details, please refer to the data descriptor (LINK) or the Github repository (https://github.com/stefaniemeliss/MMC_dataset).
afni_proc.py -subj_id "${subjstr}" \
-blocks despike tshift align tlrc volreg mask blur scale regress \
-radial_correlate_blocks tcat volreg \
-copy_anat $derivindir/$anatSS \
-anat_has_skull no \
-anat_follower anat_w_skull anat $derivindir/$anatUAC \
-anat_follower_ROI aaseg anat $sswindir/$fsparc \
-anat_follower_ROI aeseg epi $sswindir/$fsparc \
-anat_follower_ROI FSvent epi $sswindir/$fsvent \
-anat_follower_ROI FSWMe epi $sswindir/$fswm \
-anat_follower_ROI FSGMe epi $sswindir/$fsgm \
-anat_follower_erode FSvent FSWMe \
-dsets $epi_dpattern \
-outlier_polort $POLORT \
-tcat_remove_first_trs 0 \
-tshift_opts_ts -tpattern altplus \
-align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
-align_epi_strip_method 3dSkullStrip \
-tlrc_base MNI152_2009_template_SSW.nii.gz \
-tlrc_NL_warp \
-tlrc_NL_warped_dsets $sswindir/$anatQQ $sswindir/$matrix $sswindir/$warp \
-volreg_base_ind 1 $min_out_first_run \
-volreg_post_vr_allin yes \
-volreg_pvra_base_index MIN_OUTLIER \
-volreg_align_e2a \
-volreg_tlrc_warp \
-volreg_no_extent_mask \
-mask_dilate 8 \
-mask_epi_anat yes \
-blur_to_fwhm -blur_size 8 \
-regress_motion_per_run \
-regress_ROI_PC FSvent 3 \
-regress_ROI_PC_per_run FSvent \
-regress_make_corr_vols aeseg FSvent \
-regress_anaticor_fast \
-regress_anaticor_label FSWMe \
-regress_censor_motion 0.3 \
-regress_censor_outliers 0.1 \
-regress_apply_mot_types demean deriv \
-regress_est_blur_epits \
-regress_est_blur_errts \
-regress_run_clustsim no \
-regress_polort 2 \
-regress_bandpass 0.01 1 \
-html_review_style pythonic
The anat folder contains derivatives associated with the anatomical scan. The skull-stripped image created using @SSwarper is available in original and ICBM 2009c Nonlinear Asymmetric Template space as sub-[group][ID]_space-[space]_desc-skullstripped_T1w.nii.gz together with the corresponding affine matrix (sub-[group][ID]_aff12.1D) and incremental warp (sub-[group][ID]_warp.nii.gz). Output generated using @SUMA_Make_Spec_FS (defaced anatomical image, whole brain and tissue masks, as well as FreeSurfer discrete segmentations based on the Desikan-Killiany cortical atlas and the Destrieux cortical atlas) are also available as sub-[group][ID]_space-orig_desc-surfvol_T1w.nii.gz, sub-[group][ID]_space-orig_label-[label]_mask.nii.gz, and sub-[group][ID]_space-orig_desc-[atlas]_dseg.nii.gz, respectively.
The func folder contains derivatives associated with the functional scans. To enhance re-usability, the fully preprocessed and denoised files are shared as sub-[group][ID]_task-[task]_desc-fullpreproc_bold.nii.gz. Additionally, partially preprocessed files (distortion corrected, despiked, slice-timing/head-motion corrected, aligned to anatomy and template space) are uploaded as sub-[group][ID]_task-[task]_run-[1-3]_desc-MNIaligned_bold.nii.gz together with slightly dilated brain mask in EPI resolution and template space where white matter and lateral ventricle were removed (sub-[group][ID]_task-[task]_space-MNI152NLin2009cAsym_label-dilatedGM_mask.nii.gz) as well as tissue masks in EPI resolution and template space (sub-[group][ID]_task-[task]_space-MNI152NLin2009cAsym_label-[tissue]_mask.nii.gz).
The regressors folder contains nuisance regressors stemming from the output of the full afni_proc.py preprocessing pipeline. They are provided as space-delimited text values where each row represents one volume concatenated across all runs for each task separately. Those estimates that are provided per run contain the data for the volumes of one run and zeros for the volumes of other runs. This allows them to be regressed out separately for each run. The motion estimates show rotation (degree counterclockwise) in roll, pitch, and yaw and displacement (mm) in superior, left, and posterior direction. In addition to the motion parameters with respect to the base volume (sub-[group][ID]_task-[task]_label-mot_regressor.1D), motion derivatives (sub-[group][ID]_task-[task]_run[1-3]_label-motderiv_regressor.1D) and demeaned motion parameters (sub-[group][ID]_task-[task]_run[1-3]_label-motdemean_regressor.1D) are also available for each run separately. The sub-[group][ID]_task-[task]_run[1-3]_label-ventriclePC_regressor.1D files contain time course of the first three PCs of the lateral ventricle per run. Additionally, outlier fractions for each volume are provided (sub-[group][ID]_task-[task]_label-outlierfrac_regressor.1D) and sub-[group][ID]_task-[task]_label-censorTRs_regressor.1D shows which volumes were censored because motion or outlier fraction exceeded the limits specified. The voxelwise time course of local WM regressors created using fast ANATICOR is shared as sub-[group][ID]_task-[task]_label-localWM_regressor.nii.gz.
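Since the regressor files are plain space-delimited text with one row per volume, they load directly with numpy. A minimal sketch, assuming numpy is installed; the subject and task labels in the file names are placeholders, and the censoring convention (1 = keep, 0 = censored) is the usual AFNI one and is stated here as an assumption:

```python
import numpy as np

# Motion parameters relative to the base volume: one row per volume,
# six columns (roll, pitch, yaw rotations; displacements in three directions).
motion = np.loadtxt("sub-control01_task-magic_label-mot_regressor.1D")
print("volumes x parameters:", motion.shape)

# Censoring vector: assumed AFNI convention, 0 marks censored volumes.
censor = np.loadtxt("sub-control01_task-magic_label-censorTRs_regressor.1D")
print("censored volumes:", int((censor == 0).sum()))
```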
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
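To replicate one evaluation sample, a row of an index file is used to select data items, and the label distribution is computed from the class_label column. A minimal sketch in Python, assuming pandas is installed, that the extracted CSV is named data.csv here as a placeholder, and that the index files have no header row:

```python
import pandas as pd

data = pd.read_csv("data.csv")  # placeholder name for one extracted CSV file
indices = pd.read_csv("app_val_indices.csv", header=None)

# First APP validation sample: select the listed data items and compute
# the label distribution that a quantifier is supposed to predict.
sample = data.iloc[indices.iloc[0].dropna().astype(int)]
prevalences = sample["class_label"].value_counts(normalize=True).sort_index()
print(prevalences)
```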
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
statistics.r: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
modelsInfo.zip: zip file containing all the downloaded model cards (in JSON format)
script: directory containing all the scripts used to collect and process data. For further details, see the README file inside the script directory.
Dataset/Dataset_HF-models-list.csv: list of HF models analyzed
Dataset/Dataset_github-prj-list.txt: list of GitHub projects using the transformers library
Dataset/Dataset_github-Prj_model-Used.csv: contains usage pairs: project, model
Dataset/Dataset_prj-num-models-reused.csv: number of models used by each GitHub project
Dataset/Dataset_model-download_num-prj_correlation.csv: contains, for each model used by GitHub projects, the name, the task, the number of reusing projects, and the number of downloads
RQ1/RQ1_dataset-list.txt: list of HF datasets
RQ1/RQ1_datasetSample.csv: sample set of models used for the manual analysis of datasets
RQ1/RQ1_analyzeDatasetTags.py: Python script to analyze model tags for the presence of datasets. It requires unzipping modelsInfo.zip into a directory with the same name (modelsInfo) at the root of the replication package folder. It produces its output on stdout; redirect it to a file to be analyzed by the RQ2/countDataset.py script.
RQ1/RQ1_countDataset.py: given the output of RQ2/analyzeDatasetTags.py (passed as argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
RQ1/RQ1_datasetTags.csv: output of RQ2/analyzeDatasetTags.py
RQ1/RQ1_dataset_usage_count.csv: output of RQ2/countDataset.py
RQ2/tableBias.pdf: table detailing the number of occurrences of different types of bias by model task
RQ2/RQ2_bias_classification_sheet.csv: results of the manual labeling
RQ2/RQ2_isBiased.csv: file to compute the inter-rater agreement of whether or not a model documents bias
RQ2/RQ2_biasAgrLabels.csv: file to compute the inter-rater agreement related to bias categories
RQ2/RQ2_final_bias_categories_with_levels.csv: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
RQ3/RQ3_LicenseValidation.csv: manual validation of a sample of licenses
RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt: lists of licenses with different permissiveness
RQ3/RQ3_prjs_license.csv: for each project linked to models, among other fields, indicates the license tag and name
RQ3/RQ3_models_license.csv: for each model, indicates, among other pieces of info, whether the model has a license, and if yes what kind of license
RQ3/RQ3_model-prj-license_contingency_table.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows)
RQ3/RQ3_models_prjs_licenses_with_type.csv: pairs project-model, with their respective licenses and permissiveness level
The replication package also contains the scripts used to mine Hugging Face and GitHub; details are in the enclosed README.
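As an example of how the usage files fit together, the per-project reuse counts can be recomputed from the usage pairs; pandas is assumed, and the column names below are assumptions since they are not spelled out above:

```python
import pandas as pd

# One row per (GitHub project, Hugging Face model) usage pair; column names assumed.
pairs = pd.read_csv("Dataset/Dataset_github-Prj_model-Used.csv",
                    header=0, names=["project", "model"])

# Number of distinct models reused by each project (the figure distributed
# separately as Dataset/Dataset_prj-num-models-reused.csv).
models_per_project = pairs.groupby("project")["model"].nunique()
print(models_per_project.sort_values(ascending=False).head())
```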
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Annotated 12 lead ECG dataset

Contains 827 ECG tracings from different patients, annotated by several cardiologists, residents and medical students. It is used as the test set in the paper: "Automatic diagnosis of the 12-lead ECG using a deep neural network". https://www.nature.com/articles/s41467-020-15432-4. It contains annotations for 6 different ECG abnormalities:
- 1st degree AV block (1dAVb);
- right bundle branch block (RBBB);
- left bundle branch block (LBBB);
- sinus bradycardia (SB);
- atrial fibrillation (AF); and,
- sinus tachycardia (ST).

Companion Python scripts are available at: https://github.com/antonior92/automatic-ecg-diagnosis

--------

Citation
```
Ribeiro, A.H., Ribeiro, M.H., Paixão, G.M.M. et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat Commun 11, 1760 (2020). https://doi.org/10.1038/s41467-020-15432-4
```
Bibtex:
```
@article{ribeiro_automatic_2020,
  title = {Automatic Diagnosis of the 12-Lead {{ECG}} Using a Deep Neural Network},
  author = {Ribeiro, Ant{\^o}nio H. and Ribeiro, Manoel Horta and Paix{\~a}o, Gabriela M. M. and Oliveira, Derick M. and Gomes, Paulo R. and Canazart, J{\'e}ssica A. and Ferreira, Milton P. S. and Andersson, Carl R. and Macfarlane, Peter W. and Meira Jr., Wagner and Sch{\"o}n, Thomas B. and Ribeiro, Antonio Luiz P.},
  year = {2020},
  volume = {11},
  pages = {1760},
  doi = {https://doi.org/10.1038/s41467-020-15432-4},
  journal = {Nature Communications},
  number = {1}
}
```

-----

## Folder content:

- `ecg_tracings.hdf5`: this file is not available in the GitHub repository because of its size, but it can be downloaded [here](https://doi.org/10.5281/zenodo.3625006). The HDF5 file contains a single dataset named `tracings`. This dataset is a `(827, 4096, 12)` tensor. The first dimension corresponds to the 827 different exams from different patients; the second dimension corresponds to the 4096 signal samples; the third dimension to the 12 different leads of the ECG exams in the following order: `{DI, DII, DIII, AVL, AVF, AVR, V1, V2, V3, V4, V5, V6}`. The signals are sampled at 400 Hz. Some signals originally have a duration of 10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples). In order to make them all have the same size (4096 samples), we fill them with zeros on both sides. For instance, for a 7-second ECG signal with 2800 samples we include 648 samples at the beginning and 648 samples at the end, yielding 4096 samples that are then saved in the HDF5 dataset. All signals are represented as floating point numbers at the scale 1e-4V: so it should be multiplied by 1000 in order to obtain the signals in V. In Python, one can read this file using the following sequence:
```python
import h5py
import numpy as np

with h5py.File("ecg_tracings.hdf5", "r") as f:
    x = np.array(f['tracings'])
```
- The file `attributes.csv` contains basic patient attributes: sex (M or F) and age. It contains 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` corresponds to the i-th line.
- `annotations/`: folder containing annotations in CSV format. Each CSV file contains 827 lines (plus the header). The i-th line corresponds to the i-th tracing in `ecg_tracings.hdf5` in all CSV files. The CSV files all have 6 columns `1dAVb, RBBB, LBBB, SB, AF, ST` corresponding to whether the annotator has detected the abnormality in the ECG (`=1`) or not (`=0`).
  1. `cardiologist[1,2].csv` contain annotations from two different cardiologists.
  2. `gold_standard.csv` gold standard annotation for this test dataset. When cardiologist 1 and cardiologist 2 agree, the common diagnosis was considered the gold standard. In cases where there was any disagreement, a third senior specialist, aware of the annotations from the other two, decided the diagnosis.
  3. `dnn.csv` predictions from the deep neural network described in the paper. The threshold is set in such a way that it maximizes the F1 score.
  4. `cardiology_residents.csv` annotations from two 4th year cardiology residents (each annotated half of the dataset).
  5. `emergency_residents.csv` annotations from two 3rd year emergency residents (each annotated half of the dataset).
  6. `medical_students.csv` annotations from two 5th year medical students (each annotated half of the dataset).
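For a quick sanity check of the annotation files, the DNN predictions can be scored against the gold standard per abnormality. A minimal sketch, assuming pandas and scikit-learn are installed and the annotations/ folder and column names are as described above:

```python
import pandas as pd
from sklearn.metrics import f1_score

gold = pd.read_csv("annotations/gold_standard.csv")
dnn = pd.read_csv("annotations/dnn.csv")

# One binary column per abnormality, aligned row by row with ecg_tracings.hdf5.
for label in ["1dAVb", "RBBB", "LBBB", "SB", "AF", "ST"]:
    print(label, round(f1_score(gold[label], dnn[label]), 3))
```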
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
If you are working with these files, please cite them as follows: Windhager, J., Zanotelli, V.R.T., Schulz, D. et al. An end-to-end workflow for multiplexed image processing and analysis. Nat Protoc (2023). https://doi.org/10.1038/s41596-023-00881-0
This repository hosts the results of processing example imaging mass cytometry (IMC) data hosted at zenodo.org/record/5949116 using the steinbock framework available at github.com/BodenmillerGroup/steinbock. Please refer to steinbock.sh for how these data were generated from the raw data. The following files are part of this repository:
panel.csv: contains channel information regarding the used antibodies in steinbock format
img.zip: contains hot pixel filtered multi-channel images derived from the IMC raw data. One file per acquisition is generated.
images.csv: contains metadata per acquisition
pixel_classifier.ilp: ilastik pixel classifier (same as the one in zenodo.org/record/6043544)
ilastik_crops.zip: image crops on which the ilastik classifier was trained (same as the ones in zenodo.org/record/6043544)
ilastik_img.zip: contains multi-channel images (one per acquisition) in .h5 format for ilastik pixel classification
ilastik_probabilities.zip: 3-channel images containing the pixel probabilities after pixel classification
masks_ilastik.zip: segmentation masks derived from the ilastik pixel probabilities using the cell_segmentation.cppipe pipeline
masks_deepcell.zip: segmentation masks derived by deepcell segmentation
intensities.zip: contains one .csv file per acquisition. Each file contains single-cell measures of the mean pixel intensity per cell and channel based on the files in img.zip and masks_deepcell.zip.
regionprops.zip: contains one .csv file per acquisition. Each file contains single-cell measures of the morphological features and location of cells based on masks_deepcell.zip.
neighbors.zip: contains one .csv file per acquisition. Each file contains an edge list of cell IDs indicating cells in close proximity based on masks_deepcell.zip.
ome.zip: contains .ome.tiff files derived from img.zip; one file per acquisition
histocat.zip: contains single-channel .tiff files with segmentation masks derived from masks_deepcell.zip for upload to histoCAT (bodenmillergroup.github.io/histoCAT)
cells.csv: contains intensity and regionprop measurements of all cells
cells_csv.zip: contains intensity and regionprop measurements of all cells per acquisition
cells.fcs: contains intensity and regionprop measurements of all cells in fcs format
cells_fcs.zip: contains intensity and regionprop measurements of all cells per acquisition in fcs format
cells.h5ad: contains intensity, regionprop and neighbor measurements of all cells in anndata format
cells_h5ad: contains intensity, regionprop and neighbor measurements of all cells per acquisition in anndata format
graphs.zip: contains spatial object graphs in .graphml format; one file per acquisition
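For downstream analysis, the combined single-cell tables are the most convenient entry point. A minimal sketch, assuming pandas and anndata are installed and the files have been downloaded to the working directory (column and obs names are not fixed here):

```python
import pandas as pd
import anndata as ad

# Intensity and regionprop measurements of all cells across acquisitions.
cells = pd.read_csv("cells.csv")
print(cells.shape)

# The same cells, plus neighborhood information, in AnnData format.
adata = ad.read_h5ad("cells.h5ad")
print(adata)  # prints a summary of obs/var/obsp contents
```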
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Format
index.csv.gz - CSV comma separated file with 3 columns:
The flag is either "s" (readme found) or "r" (readme does not exist on the root directory level). The readme file name may be any of the following:
"README.md", "readme.md", "Readme.md", "README.MD", "README.txt", "readme.txt", "Readme.txt", "README.TXT", "README", "readme", "Readme", "README.rst", "readme.rst", "Readme.rst", "README.RST"
100 part-r-00xxx files are in "new" Hadoop API format with the following settings:
inputFormatClass is org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
keyClass is org.apache.hadoop.io.Text - repository name
valueClass is org.apache.hadoop.io.BytesWritable - gzipped readme file
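Given these key/value classes, the part files can be read, for example, with PySpark, which maps Text keys to strings and BytesWritable values to byte arrays; decompressing a value then yields the readme text. A hedged sketch (the path is a placeholder, and pyspark must be installed and configured):

```python
import gzip
from pyspark import SparkContext

sc = SparkContext(appName="readme-corpus")

# Keys: repository names (Text); values: gzipped readme files (BytesWritable).
records = sc.sequenceFile("part-r-00000")

def to_text(pair):
    repo, blob = pair
    return repo, gzip.decompress(bytes(blob)).decode("utf-8", errors="replace")

readmes = records.map(to_text)
print(readmes.take(1))
```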
Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.
Data Sources:
GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.
StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.
DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.
Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.
With our datasets, you'll receive:
Choose from various output formats, storage options, and delivery frequencies:
Why choose our Datasets?
Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.
Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.
Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.
Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.
Pricing Options:
Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Foraminifera and sample data. https://github.com/makosnik/whoForams. Mamo_Lagoon_Forams.csv contains Foraminifera abundance data for surface grab samples (Mamo 2016). Mamo_Water_depth.csv contains site location and water depth data for the Mamo collection sites. OTR_Core_Forams.csv contains the Foraminifera abundance data for the OTR core. The columns have the layer depth in cm and the fraction size in um. OTR_Core_pb210_CIC_Ages.csv contains the Pb-210 dating results used for this paper, originally published in Kosnik et al. (2015). All_sed_results.csv contains sediment grain size analyses used for Figure 2, originally published in Kosnik et al. (2015).
counterSharp Experiment and Play Environment
This repository contains the reproducible experimental evaluation of the counterSharp tool. The repository contains a Docker file which configures the counterSharp tool, two model counters (ApproxMC and Ganak), as well as the tool by Dimovski et al. for our experiments. Furthermore, the repository contains the benchmarks on which we ran our experiments as well as the logs of our experiments and scripts for transforming the log files into LaTeX tables.
In order to pull and run the Docker container from Docker Hub, you should execute `docker run`.
Or, you can load the archived and evaluated artifact into docker with
docker load < countersharp-experiments.tar.gz
If the image is loaded, `docker run` opens a shell allowing the execution of further commands:
docker run -it -v `pwd`/results:/experiments/results samweb/countersharp-experiments
By using a volume, the results are written to the host system rather than to the Docker container. You can remove the volume mounting option (`-v ...`) and create /experiments/results inside the container instead if you do not need to keep the results. If you are using the volume and run into permission problems, you may need to grant rights via SELinux: chcon -Rt svirt_sandbox_file_t `pwd`/results.
This will create a writable folder `results` in your current folder which will hold any logs from the experiments.
A minimal example can be executed by running (this takes approximately 70 seconds):
./showcase.sh
This will create benchmark log files for the benchmarks for_bounded_loop1.c and overflow.c in the folder `results`.
For example, /experiments/results/for_bounded_loop1.c/0X/ contains five folders for the five repeated runs of the experiments on this file. Each folder /experiments/results/for_bounded_loop1.c/0X/ contains the folders for the different tools, which include the log and output files.
A full run can be executed by running (this takes a little under 2 days):
./run-all.sh
Additionally, single benchmarks can be executed through the following commands:
run-instance approx program.c "[counterSharp arguments]" # Runs countersharp with ApproxMC on program.c
run-instance ganak program.c "[counterSharp arguments]" # Runs countersharp with Ganak on program.c
Probab.native -single -domain polyhedra program.c # Runs the tool by Dimovski et al. for deterministic programs
Probab.native -single -domain polyhedra -nondet program.c # Runs the tool by Dimovski et al. for nondeterministic programs
For example, we can execute run-instance approx /experiments/benchmarks/confidence.c "--function testfun --unwind 1" to obtain the outcome of counterSharp and ApproxMC for the benchmark confidence.c.
Note that the time information produced by runlim always covers only one part of the entire execution (i.e., counterSharp, one ApproxMC run, or one Ganak run). The script run-instance is straightforward: it contains the call to our tool counterSharp:
python3 -m counterSharp --amm /tmp/amm.dimacs --amh /tmp/amh.dimacs --asm /tmp/asm.dimacs --ash /tmp/ash.dimacs --con /tmp/con.dimacs -d $3 $2
which is followed by the call to ApproxMC or Ganak.
The benchmarks are contained in the folder benchmarks, which also includes an overview of the sources of and modifications to the benchmarks. Note that benchmark versions for the tool by Dimovski et al. are contained in the folder benchmarks-dimovski. The results are contained in the folder results, in which all logs from benchmark runs reside.
The log files from the evaluation are not available in the Docker image, but only on GitHub. The logs are split up by benchmark instance (first-level folder), run number (second-level folder), and tool (third-level folder). For example, the file results/bwd_loop1a.c/01/approxmc/stdout.log contains the stdout and stderr of running ApproxMC on the instance bwd_loop1a.c in run 01.
All runs were executed on a Linux machine housing an Intel(R) Core(TM) i5-6500 CPU (3.20GHz) and 16GB of memory.
Note that for every benchmark the log 01/counterSharp/init.log contains information on the machine used for benchmark execution as well as on the commits used in the experiments. For all cases of automated benchmark execution, we assume a CSV file containing relevant information on the instances to run: the first column is the benchmark's name, the second column contains parameters passed to counterSharp (see instances.csv) or to the tool by Dimovski (see instances-dimovski.csv).
All scripts produce benchmarking results for "missing" instances, i.e., instances for which no folder can be found in the results folder (a small Python sketch of this check follows the command list below).
- Run counterSharp on the benchmarks: run-counterSharp instances.csv
- Run ApproxMC on the benchmarks (only after counterSharp has been run): run-approxmc instances.csv
- Run Ganak on the benchmarks (only after counterSharp has been run): run-ganak instances.csv
- Run Dimovski's tool on the benchmarks: run-dimovski instances-dimovski.csv
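The "missing instances" check mentioned above can be reproduced in a few lines; this is only a sketch of the idea under the CSV layout described above, not one of the scripts shipped in the image:

```python
import csv
import os

# instances.csv: first column benchmark name, second column counterSharp parameters.
with open("instances.csv") as f:
    instances = [row for row in csv.reader(f) if row]

# An instance counts as "missing" if no folder for it exists under results/.
missing = [row[0] for row in instances
           if not os.path.isdir(os.path.join("results", row[0]))]
print("instances without results:", missing)
```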
Summarization is possible through the Python script logParsing/parse.py within the container. The script takes as input a list of benchmarks to process and returns (parts of) a LaTeX table. Note that logs must exist for all benchmarks provided in the CSV file for the call to succeed.
- To obtain (sorted) results for deterministic benchmarks:
cat logParsing/deterministic-sorted.csv | python3 logParsing/parse.py results aggregate2
- To obtain (sorted) results for nondeterministic benchmarks:
cat logParsing/nondeterministic-sorted.csv | python3 logParsing/parse.py results nondet
All tools are packaged into a Dockerfile, which makes any installation unnecessary. There is, however, the need for a running Docker installation. The Dockerfile build depends on the accessibility of the following GitHub repositories:
- CryptoMiniSat
- ApproxMC
- Ganak
- Probab_Analyzer
- counterSharp
The Docker image is hosted at Dockerhub.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TDMentions is a dataset that contains mentions of technical debt from Reddit, Hacker News, and Stack Exchange. It also contains a list of blog posts on Medium that were tagged as technical debt. The dataset currently contains approximately 35,000 items.
The dataset is mainly collected from existing datasets. We used data from:
The data set currently contains data from the start of each source/service until 2018-12-31. For GitHub, we currently only include data from 2015-01-01.
We use the regular expression tech(nical)?[\s\-_]*?debt to find mentions in all sources except for Medium. We decided to limit our matches to variations of technical debt and tech debt. Other shorter forms, such as TD, can result in too many false positives. For Medium, we used the tag technical-debt.
The dataset is stored as a compressed (bzip2) JSON file with one JSON object per line. Each mention is represented as a JSON object with the following keys.
id: the id used in the original source. We use the URL path to identify Medium posts.
body: the text that contains the mention. This is either the comment or the title of the post. For Medium posts this is the title and subtitle (which might not mention technical debt, since posts are identified by the tag).
created_utc: the time the item was posted in seconds since epoch in UTC.
author: the author of the item. We use the username or userid from the source.
source: where the item was posted. Valid sources are:
meta: additional information about the item specific to the source. This includes, e.g., the subreddit a Reddit submission or comment was posted to, the score, etc. We try to use the same names, e.g., score and num_comments, for keys that have the same meaning/information across multiple sources.
This is a sample item from Reddit:
{
"id": "ab8auf",
"body": "Technical Debt Explained (x-post r/Eve)",
"created_utc": 1546271789,
"author": "totally_100_human",
"source": "Reddit Submission",
"meta": {
"title": "Technical Debt Explained (x-post r/Eve)",
"score": 1,
"num_comments": 0,
"url": "http://jestertrek.com/eve/technical-debt-2.png",
"subreddit": "RCBRedditBot"
}
}
We decided to use JSON to store the data, since it is easy to work with from multiple programming languages. In the following examples, we use jq to process the JSON.
lbzip2 -cd postscomments.json.bz2 | jq '.source' | sort | uniq -c
lbzip2 -cd postscomments.json.bz2 | jq 'select(.source == "Reddit Submission") | .created_utc | strftime("%Y-%m")' | sort | uniq -c
Which items link (meta.url) to PDF documents?
lbzip2 -cd postscomments.json.bz2 | jq '. as $r | select(.meta.url?) | .meta.url | select(endswith(".pdf")) | $r.body'
lbzip2 -cd postscomments.json.bz2 | jq -r '[.id, .body, .author] | @csv'
Note that you need to specify the keys you want to include for the CSV, so it is easier to either ignore the meta information or process each source.
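The same export can be done in Python if you prefer to stay in one language; a minimal sketch, assuming only the standard library and the compressed JSON-lines file described above:

```python
import bz2
import csv
import json

# Export id, body, and author for every mention to a CSV file.
with bz2.open("postscomments.json.bz2", "rt", encoding="utf-8") as src, \
     open("mentions.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    writer.writerow(["id", "body", "author"])
    for line in src:
        item = json.loads(line)
        writer.writerow([item["id"], item["body"], item["author"]])
```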
Please see https://github.com/sse-lnu/tdmentions for more analyses.
The current version of the dataset lacks GitHub data and Medium comments. GitHub data will be added in the next update. Medium comments (responses) will be added in a future update if we find a good way to represent these.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The data were processed with the project script world-datas-analysis in order to add additional columns, including the ratio of cases in relation to the number of inhabitants; the result was then exported in CSV format.
Initial source: https://github.com/CSSEGISandData/COVID-19
File exported from world-datas-analysis: CSV file
The project world-datas-analysis can export filtered data in gnuplot format according to your needs; see the example below.
Example rendering with gnuplot
https://www.nist.gov/open/license
The MaCFP Condensed Phase Subgroup has been designed to enable the fire research community to make significant progress towards establishing a common framework for the selection of experiments and the methodologies used to analyze these experiments when developing pyrolysis models. Experimental measurements prepared for the MaCFP Condensed Phase Working Group are submitted electronically by participating institutions and are organized and made publicly available in the MaCFP repository, which is hosted on GitHub [https://github.com/MaCFP/matl-db]. This database is version controlled, with each addition to (or edit of) measurement data saved with a unique identifier (i.e., commit tag). The repository was created and is managed by members of the MaCFP Organizing Committee.
You may cite the use of this data as follows: Batiot, B., Bruns, M., Hostikka, S., Leventon, I., Nakamura, Y., Reszka, P., Rogaume, T., Stoliarov, S., Measurement and Computation of Fire Phenomena (MaCFP) Condensed Phase Material Database, https://github.com/MaCFP/matl-db, Commit Tag: [give commit; e.g., 7f89fd8], https://doi.org/10.18434/mds2-2586 (Accessed: [give download date]).
This data is publicly available according to the NIST statements of copyright, fair use and licensing; see the NIST license linked above.
1. Milligram-Scale Tests:
1.1 Thermogravimetric Analysis (TGA)
1.2 Differential Scanning Calorimetry (DSC)
1.3 Microscale Combustion Calorimetry (MCC)
2. Gram-Scale Tests:
2.1 Cone Calorimeter
2.2 Anaerobic Gasification
Further information regarding the use and interpretation of the data in this repository is available online: https://github.com/MaCFP/matl-db/tree/master/Non-charring/PMMA This information includes: Key factors influencing material response during tests
A preliminary summary of the measurement data contained in this repository is available online: https://github.com/MaCFP/matl-db/releases
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Microbiology 1K QA pairs in Burmese Language
Before this Burmese Clinical Microbiology 1K dataset, open-source resources for training Burmese large language models in the medical field were rare. Thus, a high-quality dataset needed to be curated to cover medical knowledge for the development of LLMs in the Burmese language.
I found an old notebook in my box. The book was from 2019. It contained written notes on microbiology from when I was a third-year medical student. Because of the need for Burmese-language resources in medical fields, I added more facts and notes and curated a dataset on microbiology in the Burmese language.
The dataset for microbiology in the Burmese language contains 1262 rows of instruction and output pairs in CSV format. The dataset mainly focuses on foundational clinical microbiology knowledge, covering basic facts on culture media; microbes such as bacteria, viruses, fungi, and parasites; and diseases caused by these microbes. Example rows:
ငှက်ဖျားရောဂါဆိုတာ ဘာလဲ?,ငှက်ဖျားရောဂါသည် Plasmodium ကပ်ပါးကောင်ကြောင့် ဖြစ်ပွားသော အသက်အန္တရာယ်ရှိနိုင်သည့် သွေးရောဂါတစ်မျိုးဖြစ်သည်။ ၎င်းသည် ငှက်ဖျားခြင်ကိုက်ခြင်းမှတဆင့် ကူးစက်ပျံ့နှံ့သည်။
Influenza virus အကြောင်း အကျဉ်းချုပ် ဖော်ပြပါ။,Influenza virus သည် တုပ်ကွေးရောဂါ ဖြစ်စေသော RNA ဗိုင်းရပ်စ် ဖြစ်သည်။ Orthomyxoviridae မိသားစုဝင် ဖြစ်ပြီး type A၊ B၊ C နှင့် D ဟူ၍ အမျိုးအစား လေးမျိုး ရှိသည်။
Clostridium tetani ဆိုတာ ဘာလဲ,Clostridium tetani သည် မေးခိုင်ရောဂါ ဖြစ်စေသော gram-positive၊ anaerobic bacteria တစ်မျိုး ဖြစ်သည်။ မြေဆီလွှာတွင် တွေ့ရလေ့ရှိသည်။
Onychomycosis ဆိုတာ ဘာလဲ?,Onychomycosis သည် လက်သည်း သို့မဟုတ် ခြေသည်းများတွင် ဖြစ်ပွားသော မှိုကူးစက်မှုဖြစ်သည်။ ၎င်းသည် လက်သည်း သို့မဟုတ် ခြေသည်းများကို ထူထဲစေပြီး အရောင်ပြောင်းလဲစေသည်။
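Since the dataset is a two-column CSV of instruction/output pairs, it loads directly; a minimal sketch, assuming pandas is installed, that the file has been downloaded from one of the distribution links listed below, and that it has no header row (the column names used here are only for convenience):

```python
import pandas as pd

qa = pd.read_csv("Microbiology.csv", header=None, names=["instruction", "output"])
print(len(qa), "instruction/output pairs")
print(qa.iloc[0]["instruction"])
```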
Github - https://github.com/MinSiThu/Burmese-Microbiology-1K/blob/main/data/Microbiology.csv
Zenodo - https://zenodo.org/records/12803638
Hugging Face - https://huggingface.co/datasets/jojo-ai-mst/Burmese-Microbiology-1K
Kaggle - https://www.kaggle.com/datasets/minsithu/burmese-microbiology-1k
Burmese Microbiology 1K Dataset can be used in building various medical-related NLP applications.
Special thanks to magickospace.org for supporting the curation process of Burmese Microbiology 1K Dataset.
https://openstax.org/details/books/microbiology - For medical facts
https://moh.nugmyanmar.org/my/ - For Burmese words for disease names
https://myordbok.com/dictionary/english - English-Myanmar Translation Dictionary
Si Thu, M. (2024). Burmese MicroBiology 1K Dataset (1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.12803638
Si Thu, Min, Burmese-Microbiology-1K (July 24, 2024). Available at SSRN: https://ssrn.com/abstract=4904320
CSV output from https://github.com/marks/health-insurance-marketplace-analytics/blob/master/flattener/flatten_from_index.py