Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
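A quick-start sketch for working with the CSV (assuming pandas; the column names and the label encoding used below are assumptions, so inspect the header first):

# Minimal sketch: load NICHE.csv and split projects by label.
# The column name "label" and its 1/0 encoding are assumptions; check df.columns.
import pandas as pd

df = pd.read_csv("NICHE.csv")
print(df.columns.tolist())
engineered = df[df["label"] == 1]      # expected: 441 engineered projects
non_engineered = df[df["label"] == 0]  # expected: 131 non-engineered projects
print(len(engineered), len(non_engineered))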
Information about popular open source projects related to machine learning.
The goal of this dataset is to better understand how open source machine learning projects evolve. Data collection date: early May 2018. Source: GitHub user interface and API. Contains original research.
name - name of the project.
alignment - either corporate, academia, or indie. Corporate projects are developed by professional engineers, typically have a dedicated development team, and try to solve specific problems. Academic projects usually mention publications; they support research. Independent projects are often a hobby.
company - name of the company if the alignment is corporate.
forecast - expected medium-term evolution of the project: 1 means positive, 0 means negative (stagnation), and -1 means factual death.
year - when the project was created. Defaults to the GitHub repository creation date but can be earlier - this is subject to manual adjustments.
code of conduct - whether the project has a code of conduct.
contributing - whether the project has a contribution guide.
stars - number of stargazers on GitHub.
issues - number of issues on GitHub, either open or closed.
contributors - number of contributors as reported by GitHub.
core - estimation of the core team, aka the "bus factor".
team - number of people who commit to the project regularly.
commits - number of commits in the project.
team / all - ratio of the number of commits by the dedicated development team to the overall number of contributions. Roughly indicates which part of the project is owned by the internal developers.
link - URL of the project.
language - API language. multi means several languages.
implementation - the language mainly used for implementing the project.
license - license of the project.
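A minimal usage sketch based on the field list above (the file name ml_projects.csv and the exact column spellings are assumptions):

# Hypothetical loading sketch; file and column names are assumptions.
import pandas as pd

df = pd.read_csv("ml_projects.csv")
# Projects with a positive medium-term forecast and a core team of at least 2
alive = df[(df["forecast"] == 1) & (df["core"] >= 2)]
print(alive[["name", "alignment", "stars", "team / all"]].head())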
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.
The dataset is provided as two files identifying GitHub repositories using the login-name/project-name convention. The file deduplicate_names contains 10,649,348 tab-separated records mapping a duplicated source project to a definitive target project.
The file forks_clones_noise_names is a 50,324,363 member superset of the source projects, containing also projects that were excluded from the mapping as noise.
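A small sketch of how the mapping file might be used (assuming plain tab-separated text with two columns per record, as described above; note that loading all 10.6 million records builds a large in-memory dictionary):

# Sketch: map a repository to its ultimate parent using deduplicate_names.
# Assumes two tab-separated columns: duplicated source project, definitive target.
dedup = {}
with open("deduplicate_names", encoding="utf-8") as f:
    for line in f:
        source, target = line.rstrip("\n").split("\t")
        dedup[source] = target

def ultimate_parent(name):
    """Return the definitive project for a possibly-duplicated one."""
    return dedup.get(name, name)  # projects not in the map are their own parent

print(ultimate_parent("someuser/someproject"))  # hypothetical repository name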
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This is the readme for the supplemental data for our ICDAR 2019 paper.
You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202
If you found this dataset useful, please consider citing our paper:
@inproceedings{DBLP:conf/icdar/MorrisTE19,
author = {David Morris and
Peichen Tang and
Ralph Ewerth},
title = {A Neural Approach for Text Extraction from Scholarly Figures},
booktitle = {2019 International Conference on Document Analysis and Recognition,
{ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
pages = {1438--1443},
publisher = {{IEEE}},
year = {2019},
url = {https://doi.org/10.1109/ICDAR.2019.00231},
doi = {10.1109/ICDAR.2019.00231},
timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).
We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from it and used that as our validation dataset.
These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2
The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.
We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.
We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.
Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.
We used a Tesseract script to run text extraction on detected text rows. This is inside our code archive code.tar as text_recognition_multipro.py.
We used a Java tool provided by Falk Böschen, adapted to our file structure. We included this as evaluator.jar.
Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Codebase [Github] | Dataset [Zenodo]
Abstract
The advent of powerful neural classifiers has increased interest in problems that require both learning and reasoning. These problems are critical for understanding important properties of models, such as trustworthiness, generalization, interpretability, and compliance with safety and structural constraints. However, recent research has observed that tasks requiring both learning and reasoning on background knowledge often suffer from reasoning shortcuts (RSs): predictors can solve the downstream reasoning task without associating the correct concepts to the high-dimensional data. To address this issue, we introduce rsbench, a comprehensive benchmark suite designed to systematically evaluate the impact of RSs on models by providing easy access to highly customizable tasks affected by RSs. Furthermore, rsbench implements common metrics for evaluating concept quality and introduces novel formal verification procedures for assessing the presence of RSs in learning tasks. Using rsbench, we highlight that obtaining high-quality concepts in both purely neural and neuro-symbolic models is a far-from-solved problem. rsbench is available on GitHub.
Usage
We recommend visiting the official code website for instructions on how to use the dataset and accompanying software code.
License
All ready-made data sets and generated datasets are distributed under the CC-BY-SA 4.0 license, with the exception of Kand-Logic, which is derived from Kandinsky-patterns and as such is distributed under the GPL-3.0 license.
Datasets Overview
The original BDD datasets can be downloaded from the following Google Drive link: [Download BDD Dataset].
References
[1] Xu et al., *Explainable Object-Induced Action Decision for Autonomous Vehicles*, CVPR 2020.
[2] Sawada and Nakamura, *Concept Bottleneck Model With Additional Unsupervised Concepts*, IEEE 2022.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
We utilized a dataset of Machine Design materials, which includes information on their mechanical properties. The dataset was obtained from the Autodesk Material Library and comprises 15 columns, also referred to as features/attributes. This dataset is a real-world dataset, and it does not contain any random values. However, due to missing values, we only utilized seven of these columns for our ML model. You can access the related GitHub Repository here: https://github.com/purushottamnawale/material-selection-using-machine-learning
To develop an ML model, we employed several Python libraries, including NumPy, pandas, scikit-learn, and graphviz, in addition to other technologies such as Weka, MS Excel, VS Code, Kaggle, Jupyter Notebook, and GitHub. We used the Weka software to quickly visualize the data and understand the relationships between the features, without requiring any programming expertise.
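For illustration, a minimal sketch of such a pipeline (the file name materials.csv, the column handling, and the target column "Material" are assumptions; see the GitHub repository for the actual code):

# Illustrative sketch only; file name, columns, and target are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("materials.csv")
df = df.dropna(axis=1)                 # keep only fully populated columns
X = df.select_dtypes("number")         # numeric mechanical-property features
y = df["Material"]                     # hypothetical target column
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = DecisionTreeClassifier().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))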
The problem statement is material selection for an EV chassis. If you have any specific ideas, feel free to implement them and share your code on Kaggle.
A detailed research paper is available at https://iopscience.iop.org/article/10.1088/1742-6596/2601/1/012014
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Each sample consists of a single 3d MCFO image of neurons of the fruit fly.
For each image, we provide a pixel-wise instance segmentation for all separable neurons.
Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification).
The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.
The segmentation mask for each neuron is stored in a separate channel.
The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
pip install zarr
import zarr
raw = zarr.open(<path/to/zarr/file>, mode='r', path="volumes/raw")
seg = zarr.open(<path/to/zarr/file>, mode='r', path="volumes/gt_instances")

# optional:
import numpy as np
raw_np = np.array(raw)
Zarr arrays are read lazily on-demand.
Many functions that expect numpy arrays also work with zarr arrays.
Optionally, the arrays can also explicitly be converted to numpy arrays.
We recommend using napari to view the image data.
pip install "napari[all]"
import zarr, sys, napari
raw = zarr.load(sys.argv[1], mode='r', path="volumes/raw")
gts = zarr.load(sys.argv[1], mode='r', path="volumes/gt_instances")

viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
python view_data.py <path/to/zarr/file>
For more information on our selected metrics and formal definitions please see our paper.
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al..
For detailed information on the methods and the quantitative results please see our paper.
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
title = {FISBe: A real-world benchmark dataset for instance
segmentation of long-range thin filamentous structures},
author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya
Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena
Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
year = 2024,
eprint = {2404.00130},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions.
P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.
This work was co-funded by Helmholtz Imaging.
There have been no changes to the dataset so far.
All future changes will be listed on the changelog page.
If you would like to contribute, have encountered any issues, or have any suggestions, please open an issue for the FISBe dataset in the accompanying GitHub repository.
All contributions are welcome!
This dataset contains Synthea synthetic patient data used in building ML models for lung cancer risk prediction. The ML models are used to simulate ML-enabled LHS. This open dataset is part of the synthetic data repository of the Open LHS project on GitHub: https://github.com/lhs-open/synthetic-data. For data source and methods, see the first ML-LHS simulation paper published in Nature Scientific Reports: https://www.nature.com/articles/s41598-022-23011-4.
Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-code
from torch_geometric.data import DataLoader
from ogb.graphproppred import PygGraphPropPredDataset
dataset = PygGraphPropPredDataset(name = 'ogbg-code', root = '/kaggle/input')
batch_size = 32
split_idx = dataset.get_idx_split()
train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
Graph: The ogbg-code dataset is a collection of Abstract Syntax Trees (ASTs) obtained from approximately 450 thousand Python method definitions. Methods are extracted from a total of 13,587 different repositories across the most popular projects on GitHub. The collection of Python methods originates from GitHub CodeSearchNet, a collection of datasets and benchmarks for machine-learning-based code retrieval. In ogbg-code, the dataset authors contribute an additional feature extraction step, which includes AST edges, AST nodes, and tokenized method names. Altogether, ogbg-code allows you to capture source code with its underlying graph structure, beyond its token sequence representation.
Prediction task: The task is to predict the sub-tokens forming the method name, given the Python method body represented by its AST and node features. This task is often referred to as “code summarization”, because the model is trained to find a succinct and precise description (i.e., the method name chosen by the developer) for a complete logical unit (i.e., the method body). Code summarization is a representative task in the field of machine learning for code, not only for its straightforward adoption in developer tools, but also because it is a proxy measure for assessing how well a model captures code semantics [1]. Following [2,3], the dataset authors use an F1 score to evaluate predicted sub-tokens against ground-truth sub-tokens.
Dataset splitting: The dataset authors adopt a project split [4], where the ASTs for the train set are obtained from GitHub projects that do not appear in the validation and test sets. This split respects the practical scenario of training a model on a large collection of source code (obtained, for instance, from the popular GitHub projects), and then using it to predict method names on a separate code base. The project split stress-tests the model’s ability to capture code’s semantics, and avoids a model that trivially memorizes the idiosyncrasies of training projects (such as the naming conventions and the coding style of a specific developer) to achieve a high test score.
| Package | #Graphs | #Nodes per Graph | #Edges per Graph | Split Type | Task Type | Metric |
|---|---|---|---|---|---|---|
| ogb>=1.2.0 | 452,741 | 125.2 | 124.2 | Project | Sub-token prediction | F1 score |
Website: https://ogb.stanford.edu
The Open Graph Benchmark (OGB) [5] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
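As a hedged evaluation sketch (the seq_ref/seq_pred input format below is my reading of the ogbg-code evaluator's sub-token convention; print evaluator.expected_input_format to confirm):

# Sketch of unified evaluation with the OGB Evaluator for ogbg-code.
from ogb.graphproppred import Evaluator

evaluator = Evaluator(name='ogbg-code')
print(evaluator.expected_input_format)     # describes the required input dict
input_dict = {
    'seq_ref': [['get', 'user', 'name']],  # ground-truth sub-tokens per method
    'seq_pred': [['get', 'name']],         # predicted sub-tokens per method
}
print(evaluator.eval(input_dict))          # F1, with precision and recall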
[1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4):1–37, 2018.
[2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400, 2018.
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
[4] Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153, 2019.
[5] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.
I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content of this dataset. For any questions, problems, or issues, please contact the original authors via their website or their GitHub repo.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study investigates the extent to which data science projects follow code standards. In particular, which standards are followed, which are ignored, and how does this differ from traditional software projects? We compare a corpus of 1048 open-source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity.
results.tar.gz: Extracted data for each project, including raw logs of all detected code violations.
notebooks_out.tar.gz: Tables and figures generated by notebooks.
source_code_anonymized.tar.gz: Anonymized source code (at time of publication) to identify, clone, and analyse the projects. Also includes Jupyter notebooks used to produce figures in the paper.
The latest source code can be found at: https://github.com/a2i2/mining-data-science-repositories
Published in ESEM 2020: https://doi.org/10.1145/3382494.3410680
Preprint: https://arxiv.org/abs/2007.08978
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Using GitHub APIs, we construct an unbiased dataset of over 10 million GitHub users. The data was collected between Jul. 20 and Aug. 27, 2018, covering 10,000 users. Each data entry is stored in JSON format, representing one GitHub user and containing the descriptive information from the user's profile page, information on their commit activities, and their created/forked public repositories.
We provide a sample of the dataset in 'Github_dataset_sample.json'. If you are interested in using the full dataset, please contact chenyang AT fudan.edu.cn to obtain it, for research purposes only.
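A small inspection sketch for the sample file (whether it is a single JSON array or one JSON object per line is an assumption; adjust the parsing accordingly):

# Sketch: inspect the sample of user records.
import json

with open("Github_dataset_sample.json", encoding="utf-8") as f:
    users = json.load(f)  # if the file is one JSON array
    # users = [json.loads(line) for line in f]  # if it is JSON Lines instead

print(len(users))
print(sorted(users[0].keys()))  # profile, commit-activity, and repository fields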
Please cite the following paper when using the dataset: Qingyuan Gong, Yushan Liu, Jiayun Zhang, Yang Chen, Qi Li, Yu Xiao, Xin Wang, Pan Hui. Detecting Malicious Accounts in Online Developer Communities Using Deep Learning. To appear: IEEE Transactions on Knowledge and Data Engineering.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
We strongly recommend that you read this content on the project page on GitHub, by clicking here
This project aims to develop solutions for vehicular perception through inertial sensor signals and Artificial Intelligence models. Vehicular perception comprises exteroception and proprioception. Exteroception aims to understand the environment outside the vehicle, recognizing the road features on which it travels. These features include transient events in the form of anomalies and obstacles, such as potholes, cracks, speed bumps, etc.; and persistent events, such as surface type, conservation condition, and the road surface quality. On the other hand, proprioception aims to understand vehicular movements to identify their own behavior. These identifications can also be transient in the form of driving events, such as lane change, braking, skidding, aquaplaning, turning right or left; and persistent, as a safe or dangerous driving behavior profile. This situational information (perceptions) has wide applicability in Intelligent Transport Systems (ITS) such as Advanced Driver Assistance Systems (ADAS) and autonomous vehicles.
For the development of this project, we collected nine datasets using GPS, camera, inertial sensors (accelerometers and gyroscopes), magnetometer, and temperature sensor. These data were produced with contextual variations, including three different vehicles, driven by three different drivers, traveling through three different environments. To recognize and classify the vehicular perception patterns, we developed several models based on Artificial Intelligence, spanning classical Machine Learning and Deep Learning approaches. Below we describe the datasets produced, the models developed, and the results obtained, together with published scientific papers and source code.
The project is active, and we are currently developing new models for new perception pattern recognition. The research progress is described below, in chronological order of research development. On ResearchGate you can also find the published scientific papers and request a full text for free.
In this paper, we describe the state of the art in vehicle perception produced through inertial sensors and Artificial Intelligence techniques. Through a literature review, we compiled the data extracted from the selected studies and described each paper in detail, in chronological order of publication.
Access here
In this paper, we present a structured literature mapping of the state of the art in vehicle perception produced through inertial sensors and Artificial Intelligence techniques. We describe a structured, approach- and technology-oriented panorama of this field.
Access here
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains metadata for over 37,000 GitHub repositories, created for research on early-stage software project effort estimation.
It includes 37 attributes describing repository size, activity, collaboration, licensing, and language usage.
The dataset has been used with unsupervised machine learning models to analyze project segmentation according to different levels of effort.
No labeled data related to project or product complexity is included.
The data was gathered using both the GitHub GraphQL API and the GitHub REST API.
Repositories were included if they met all of the following criteria:
Most attributes were obtained via the GraphQL API.
The contributors attribute is not accessible via GraphQL, so it was collected using the REST API for each repository and then merged with the rest of the metadata (a sketch of this step follows below).
Some attributes are synthetic features added during data analysis, such as:
language_count
reponame
Below is a description of the 37 attributes included in the JSON schema:
owner/repository. Both datasets share identical variable definitions and schema.
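A hedged sketch of the REST step mentioned above, counting contributors for one repository (the endpoint and pagination parameters follow the public GitHub REST API; the helper name and example repository are illustrative):

# Sketch: count contributors via the paginated GitHub REST API.
import requests

def contributor_count(owner_repo, token=None):
    """Count contributors for owner/repo, including anonymous ones."""
    headers = {"Authorization": f"token {token}"} if token else {}
    count, page = 0, 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{owner_repo}/contributors",
            params={"per_page": 100, "page": page, "anonymous": "true"},
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        count += len(batch)
        if len(batch) < 100:
            return count
        page += 1

print(contributor_count("torvalds/linux"))  # example repository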
This dataset is intended for research and experimentation in:
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset contains the characteristics of 5000 of the most popular GitHub repositories, based on their total number of stars. It provides a comprehensive overview of each repository's essential features, such as name, language, description, URL, and growth pattern. Additionally, it offers insight into how these properties factor into the popularity and success of each repository. This can be especially helpful in understanding how certain languages or patterns are more successful in particular use cases or scenarios compared to others. By better understanding these factors and patterns, developers can create projects that best suit their needs while having a higher chance of achieving success on GitHub.
This dataset provides a comprehensive analysis of the domains of the most popular GitHub repositories, as measured by their total number of stars. It includes many valuable pieces of information that can be used to gain insight into current trends on the platform.
In order to use this dataset to its fullest potential, it's important to understand each piece of data provided and how it can be used.
- Comparing the popularity of various programming languages on GitHub.
- Examining the most common topics and domains represented in top repositories, to better understand how developers use GitHub for their projects.
- Identifying whether certain growth patterns can be associated with higher popularity levels on GitHub, as measured by stars and forks.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: Domains of 5,000 GitHub Repositories - Public - Domains.csv

| Column name | Description |
|:---|:---|
| Name | The name of the repository. (String) |
| Stars | The total number of stars, which serve as a metric to measure popularity. (Integer) |
| Forks | The total number of forks, which indicate how much collaboration there is on a project. (Integer) |
| Language | The programming language used in the repository. (String) |
| Description | A brief overview describing what the repository does and its features. (String) |
| URL | The URL associated with that specific repository. (String) |
| Domain | The domain or area within which this particular project works, for example, artificial intelligence or machine learning. (String) |
| Growth Pattern | Insight into whether the popularity has been increasing steadily or has plateaued, etc. (String) |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.
## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads
## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect it to a file to be analyzed by the `RQ2/countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ2/analyzeDatasetTags.py` (passed as an argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ2/analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ2/countDataset.py`
## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates, among other pieces of info, whether the model has a license, and if so, what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level
## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
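For intuition, a minimal sketch of how APP samples can be drawn (the sample size and the APP-OQ smoothness selection are assumptions here; use the provided index files to replicate the paper's exact samples):

# Sketch of the artificial prevalence protocol (APP): draw class prevalences
# uniformly from the unit simplex, then resample data items accordingly.
import numpy as np

def app_sample(y, n_classes, m, rng):
    """Draw one APP sample of size m from the labeled pool y."""
    p = rng.dirichlet(np.ones(n_classes))   # uniform draw from the simplex
    counts = rng.multinomial(m, p)          # number of items per class
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=k, replace=True)
        for c, k in enumerate(counts) if k > 0
    ])
    return idx, p

rng = np.random.default_rng(0)
y = rng.integers(0, 5, size=1000)           # toy ordinal labels 0..4
sample_idx, prevalence = app_sample(y, n_classes=5, m=100, rng=rng)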
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You need a working Julia installation. We used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
This document dataset was the output of a project aimed at creating a 'gold standard' dataset that could be used to train and validate machine learning approaches to natural language processing (NLP). The project was carried out by Aleph Insights and Committed Software on behalf of the Defence Science and Technology Laboratory (Dstl). The dataset specifically focuses on entity and relationship extraction relevant to somebody operating in the role of a defence and security intelligence analyst. The dataset was therefore constructed using documents and structured schemas relevant to the defence and security analysis domain. A number of data subsets were produced (this is the BBC Online data subset). Further information about this data subset (BBC Online) and the others produced (together with licence conditions, attribution, and schemas) may be found at the main project GitHub repository webpage (https://github.com/dstl/re3d). Note that the 'documents.json' file is to be used together with the 'entities.json' and 'relations.json' files (also found on this data.gov.uk webpage); their structures and relationships are described on the given GitHub webpage.
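A hedged sketch of joining the files (the field names _id, documentId, begin, end, type, and text used below are assumptions; consult the schemas on the GitHub page):

# Sketch: resolve entity spans against their source documents.
import json

def read_json_lines(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

documents = {d["_id"]: d for d in read_json_lines("documents.json")}
entities = read_json_lines("entities.json")

for e in entities[:5]:
    doc = documents[e["documentId"]]
    print(e["type"], "->", doc["text"][e["begin"]:e["end"]])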
Other license: https://choosealicense.com/licenses/other/
Hi, I’m Seniru Epasinghe 👋
I’m an AI undergraduate and an AI enthusiast, working on machine learning projects and open-source contributions. I enjoy exploring AI pipelines, natural language processing, and building tools that make development easier.
🌐 Connect with me
There are 2 versions of this dataset:
git-diff_to_commit_msg - 1.5K rows huggingface link kaggle link
git-diff_to_commit_msg_large - 1.75M rows huggingface link kaggle link… See the full description on the dataset page: https://huggingface.co/datasets/seniruk/git-diff_to_commit_msg.
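A minimal loading sketch with the Hugging Face datasets library (the split name "train" is an assumption; check the dataset page for the available splits and fields):

# Sketch: load the dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("seniruk/git-diff_to_commit_msg", split="train")
print(ds[0])  # expected fields: a git diff and its commit message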
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Impact assessment is an evolving area of research that aims at measuring and predicting the potential effects of projects or programs. Measuring the impact of scientific research is a vibrant subdomain, closely intertwined with impact assessment. A recurring obstacle pertains to the absence of an efficient framework which can facilitate the analysis of lengthy reports and text labeling. To address this issue, we propose a framework for automatically assessing the impact of scientific research projects by identifying pertinent sections in project reports that indicate the potential impacts. We leverage a mixed-method approach, combining manual annotations with supervised machine learning, to extract these passages from project reports. This is a repository to save datasets and codes related to this project. Please read and cite the following paper if you would like to use the data:
Becker M., Han K., Werthmann A., Rezapour R., Lee H., Diesner J., and Witt A. (2024). Detecting Impact Relevant Sections in Scientific Research. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING).
This folder contains the following files:
evaluation_20220927.ods: Annotated German passages (Artificial Intelligence, Linguistics, and Music) - training data
annotated_data.big_set.corrected.txt: Annotated German passages (Mobility) - training data
incl_translation_all.csv: Annotated English passages (Artificial Intelligence, Linguistics, and Music) - training data
incl_translation_mobility.csv: Annotated German passages (Mobility) - training data
ttparagraph_addmob.txt: German corpus (unannotated passages)
model_result_extraction.csv: Extracted impact-relevant passages from the German corpus based on the model we trained
rf_model.joblib: The random forest model we trained to extract impact-relevant passages
Data processing codes can be found at: https://github.com/khan1792/texttransfer