Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Public Policy at Hugging Face
AI Policy at Hugging Face is a multidisciplinary and cross-organizational workstream. Instead of being part of a vertical communications or global affairs organization, our policy work is rooted in the expertise of our many researchers and developers, from Ethics and Society Regulars and legal team to machine learning engineers working on healthcare, art, and evaluations. What we work on is informed by our Hugging Face community needs and experiences… See the full description on the dataset page: https://huggingface.co/datasets/huggingface/policy-docs.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains images used in the documentation of HuggingFace's libraries.
HF Team: Please make sure you optimize the assets before uploading them. My favorite tool for this is https://tinypng.com/.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
statistics.r: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
modelsInfo.zip: zip file containing all the downloaded model cards (in JSON format)
script: directory containing all the scripts used to collect and process data. For further details, see the README file inside the script directory.
Dataset/Dataset_HF-models-list.csv: list of HF models analyzed
Dataset/Dataset_github-prj-list.txt: list of GitHub projects using the transformers library
Dataset/Dataset_github-Prj_model-Used.csv: usage pairs (project, model)
Dataset/Dataset_prj-num-models-reused.csv: number of models used by each GitHub project
Dataset/Dataset_model-download_num-prj_correlation.csv: for each model used by GitHub projects, the name, the task, the number of reusing projects, and the number of downloads
RQ1/RQ1_dataset-list.txt: list of HF datasets
RQ1/RQ1_datasetSample.csv: sample set of models used for the manual analysis of datasets
RQ1/RQ1_analyzeDatasetTags.py: Python script that analyzes model tags for the presence of datasets. It requires modelsInfo.zip to be unzipped into a directory with the same name (modelsInfo) at the root of the replication package folder. It writes its output to stdout; redirect it to a file so that it can be analyzed by the RQ1/RQ1_countDataset.py script.
RQ1/RQ1_countDataset.py: given the output of RQ1/RQ1_analyzeDatasetTags.py (passed as argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
RQ1/RQ1_datasetTags.csv: output of RQ1/RQ1_analyzeDatasetTags.py
RQ1/RQ1_dataset_usage_count.csv: output of RQ1/RQ1_countDataset.py
RQ2/tableBias.pdf: table detailing the number of occurrences of different types of bias by model task
RQ2/RQ2_bias_classification_sheet.csv: results of the manual labeling
RQ2/RQ2_isBiased.csv: file used to compute the inter-rater agreement on whether or not a model documents bias
RQ2/RQ2_biasAgrLabels.csv: file used to compute the inter-rater agreement on bias categories
RQ2/RQ2_final_bias_categories_with_levels.csv: for each model in the sample, lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
RQ3/RQ3_LicenseValidation.csv: manual validation of a sample of licenses
RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt: lists of licenses with different permissiveness
RQ3/RQ3_prjs_license.csv: for each project linked to models, indicates, among other fields, the license tag and name
RQ3/RQ3_models_license.csv: for each model, indicates, among other information, whether the model has a license and, if so, what kind of license
RQ3/RQ3_model-prj-license_contingency_table.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows)
RQ3/RQ3_models_prjs_licenses_with_type.csv: project-model pairs, with their respective licenses and permissiveness levels
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README.
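The four Booleans that the dataset-counting script is described as producing per model can be illustrated with a short sketch. This is not the replication package's code: the function name `classify_model`, the `DatasetFlags` type, and the input shapes (a list of declared dataset names, a set of known HF dataset names, a set of sampled model IDs) are all assumptions made for illustration.

```python
from typing import Iterable, NamedTuple, Set


class DatasetFlags(NamedTuple):
    """Hypothetical container for the four per-model Booleans."""
    only_hf: bool        # (i) the model only declares HF datasets
    only_external: bool  # (ii) the model only declares external datasets
    both: bool           # (iii) the model declares both kinds
    in_sample: bool      # (iv) the model is in the manual-analysis sample


def classify_model(
    model_id: str,
    declared: Iterable[str],
    hf_datasets: Set[str],
    sample_models: Set[str],
) -> DatasetFlags:
    """Classify one model by the datasets it declares (illustrative only)."""
    declared_set = set(declared)
    # A declared dataset counts as "HF" if it appears in the HF dataset list,
    # and as "external" otherwise.
    has_hf = any(name in hf_datasets for name in declared_set)
    has_external = any(name not in hf_datasets for name in declared_set)
    return DatasetFlags(
        only_hf=has_hf and not has_external,
        only_external=has_external and not has_hf,
        both=has_hf and has_external,
        in_sample=model_id in sample_models,
    )


# Made-up example data; none of these names come from the package.
hf_datasets = {"squad", "glue"}
sample_models = {"org/model-a"}
flags = classify_model(
    "org/model-a", ["squad", "my-private-corpus"], hf_datasets, sample_models
)
print(flags)
```

A model that declares no datasets gets `False` for the first three flags under this sketch; how the actual script treats that case is not stated in the description.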
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
m-ric/huggingface_doc dataset hosted on Hugging Face and contributed by the HF Datasets community
This is the latest version of Hugging Face datasets to be used in offline notebooks on Kaggle. It is automatically updated every week.
!pip install datasets --no-index --find-links=file:///kaggle/input/hf-ds -U -q
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
huggingface-course/documentation-images dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains images used in the documentation of HuggingFace's Optimum library.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Hugging Face Transformers documentation as markdown dataset
This dataset was created using Clipper.js. Clipper is a Node.js command line tool that allows you to easily clip content from web pages and convert it to Markdown. It uses Mozilla's Readability library and Turndown under the hood to parse web page content and convert it to Markdown. This dataset can be used to build RAG applications over the transformers documentation. Example document:… See the full description on the dataset page: https://huggingface.co/datasets/philschmid/markdown-documentation-transformers.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card
This dataset is the code documentation dataset used in StarCoder2 pre-training, and it is also part of the-stack-v2-train-extras described in the paper.
Dataset Details
Overview
This dataset comprises a comprehensive collection of crawled documentation and code-related resources sourced from various package manager platforms and programming language documentation sites. It focuses on popular libraries, free programming books, and other relevant… See the full description on the dataset page: https://huggingface.co/datasets/SivilTaram/starcoder2-documentation.
eustlb/documentation-images dataset hosted on Hugging Face and contributed by the HF Datasets community
techtitans232/medical-documentation-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
davidberenstein1957/documentation-images dataset hosted on Hugging Face and contributed by the HF Datasets community
tiiuae/documentation-images dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
glide-the/documentation-images dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
nateraw/documentation-images dataset hosted on Hugging Face and contributed by the HF Datasets community
hf-internal-testing/example-documents dataset hosted on Hugging Face and contributed by the HF Datasets community
Chunte/documentation-images dataset hosted on Hugging Face and contributed by the HF Datasets community
micpst/hf-docs-retrieval dataset hosted on Hugging Face and contributed by the HF Datasets community
jamescalam/langchain-docs-23-06-27 dataset hosted on Hugging Face and contributed by the HF Datasets community
hwaseem04/Documentation-files dataset hosted on Hugging Face and contributed by the HF Datasets community