Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
Preeclampsia, one of the leading causes of maternal and fetal morbidity and mortality, demands accurate predictive models given the lack of effective treatment. Predictive models based on machine learning algorithms show promising potential, yet it remains debated whether machine learning methods should be preferred over traditional statistical models.
Methods
We employed both logistic regression and six machine learning methods as binary predictive models for a dataset containing 733 women diagnosed with preeclampsia. Participants were grouped by four different pregnancy outcomes. After imputation of missing values, statistical description and comparison were conducted to explore the characteristics of the 73 documented variables. Correlation analysis and feature selection were then performed as preprocessing steps to select contributing variables for model development. The models were evaluated against multiple criteria.
Results
First, the influential variables selected by the preprocessing steps did not overlap with those identified through statistical differences. Second, K-Nearest Neighbors was the most accurate imputation method, and the imputation process had little effect on the performance of the developed models. Finally, model performance was compared: the random forest classifier, multi-layer perceptron, and support vector machine demonstrated better discriminative power, as evaluated by the area under the receiver operating characteristic curve, while the decision tree classifier, random forest, and logistic regression showed better calibration, as verified by the calibration curve.
Conclusion
Machine learning algorithms can support prediction modeling and offer superior discrimination, while logistic regression calibrates well. Statistical analysis and machine learning are two scientific domains that share similar themes. The predictive ability of such models varies with the characteristics of the dataset; larger sample sizes and more influential predictors are still needed to accumulate evidence.
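For illustration, here is a minimal scikit-learn sketch of the kind of pipeline described above (KNN imputation, univariate feature selection, two of the classifiers mentioned, and evaluation by ROC AUC and a calibration curve); the synthetic data, number of selected features, and hyperparameters are placeholders rather than the study's actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve
from sklearn.pipeline import Pipeline

# Synthetic stand-in: 733 samples, 73 variables, binary outcome, with missing values injected
X, y = make_classification(n_samples=733, n_features=73, random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
}
for name, clf in models.items():
    pipe = Pipeline([
        ("impute", KNNImputer(n_neighbors=5)),     # KNN imputation of missing values
        ("select", SelectKBest(f_classif, k=20)),  # simple univariate feature selection
        ("model", clf),
    ])
    pipe.fit(X_train, y_train)
    prob = pipe.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, prob)                                 # discrimination
    frac_pos, mean_pred = calibration_curve(y_test, prob, n_bins=5)   # calibration
    print(f"{name}: AUC={auc:.3f}")
```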
**Inventory of pre-trained machine learning models of the Etalab AI Lab**
The publication of this inventory of pre-trained machine learning models is part of the roadmap of the Ministry of Transformation and Public Service (see p. 25 of the document downloadable here). This dataset lists the different algorithms trained to date by the Lab IA as part of the development of its shared tools (more information on the dedicated page of the Lab IA). Details of what the inventory contains, for each algorithm:
- The column “link_model_card” provides a link to a description of the algorithm. We followed the description framework presented in Margaret Mitchell et al.'s paper “Model Cards for Model Reporting” (downloadable here).
- The column “link_depot_github” points to the GitHub repository containing the code that produced the algorithm.
- The column “model_entraine_open” has the value “no” if the trained model is not open and “yes” if it is. In the latter case, the link to the trained model is given in the column “link_modele_entraine_si_pertinent”.
- The column “date_last_mise_a_day” indicates the date of the last update of the model.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Process Modeling Dataset
This dataset contains pairs of natural language process descriptions and their corresponding POWL (Partially Ordered Workflow Language) code implementations. It is designed for fine-tuning language models to translate informal process descriptions into formal process models.
Dataset Structure
The dataset consists of two splits:
train: Training examples for model fine-tuning validation: Validation examples for monitoring training progress
Each… See the full description on the dataset page: https://huggingface.co/datasets/maghwa/process-modeling-dataset.
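A quick way to get oriented is to load both splits with the Hugging Face `datasets` library; since the field names are not described here, the sketch below simply prints the first training record to show the schema.

```python
import datasets

# Load both splits and inspect the first training record.
ds = datasets.load_dataset("maghwa/process-modeling-dataset")
print(ds)              # DatasetDict with "train" and "validation" splits
print(ds["train"][0])  # one description / POWL-code pair
```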
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Model Library DB
Dataset Summary
The Model Library is a project that maps the risks associated with modern machine learning systems. Here, we assess some of the most recent and capable AI systems ever created. This is the database for the Model Library.
Supported Tasks and Leaderboards
This dataset serves as a catalog of machine learning models, all displayed in the Model Library.
Languages
English.
Dataset Structure
Data Instances… See the full description on the dataset page: https://huggingface.co/datasets/nicholasKluge/model-library.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains ecore metamodels from the MAR dataset transformed into tree representations. The original dataset can be found here: http://mar-search.org/experiments/models20/
The data contained in this repository were used to conduct the experiments in the paper: Recommending Metamodel Concepts during Modeling Activities with Pre-Trained Language Models. Link to the paper: https://arxiv.org/abs/2104.01642
The data are organized as follows:
This data repository is linked with the following Github repository containing our code: https://github.com/mweyssow/ecore-bert
https://choosealicense.com/licenses/cc0-1.0/https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for "nepalitext-language-model-dataset"
Dataset Summary
"NepaliText" language modeling dataset is a collection of over 13 million Nepali text sequences (phrases/sentences/paragraphs) extracted by combining the datasets: OSCAR , cc100 and a set of scraped Nepali articles on Wikipedia.
Supported Tasks and Leaderboards
This dataset is intended for pre-training language models and word representations on the Nepali language.
Languages
The data is… See the full description on the dataset page: https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset.
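Because the corpus is large, a first look is easiest in streaming mode with the `datasets` library; this is a hedged sketch, and the field layout is whatever the dataset page defines rather than assumed here.

```python
import itertools
import datasets

# Stream the corpus instead of downloading all 13M+ sequences up front.
ds = datasets.load_dataset(
    "Sakonii/nepalitext-language-model-dataset", split="train", streaming=True
)
for example in itertools.islice(ds, 3):
    print(example)  # inspect a few raw records
```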
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.
## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads
## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. The script writes its output to stdout; redirect it to a file to be analyzed by the `RQ1/RQ1_countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ1/RQ1_analyzeDatasetTags.py` (passed as argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ1/RQ1_analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ1/RQ1_countDataset.py`
## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement on whether or not a model documents bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates, among other information, whether the model has a license and, if so, what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: project-model pairs, with their respective licenses and permissiveness levels
## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the code of the three pre-processing tools pyGRETA, pyPRIMA and pyCLARA, and an exemplary database for the scope of Austria.
To run the code with full functionality, additional data is needed. Check the documentation of the tools for further information.
Sources for data can be found here:
pyGRETA: https://pygreta.readthedocs.io/en/stable/user_manual.html#recommended-input-sources
pyPRIMA: https://pyprima.readthedocs.io/en/stable/user_manual.html#recommended-input-sources
pyCLARA: https://pyclara.readthedocs.io/en/stable/user_manual.html#recommended-input-sources
Dataset Card for TARA
Dataset Summary
TARA is a novel Tool-Augmented Reward modeling datAset that includes comprehensive comparison data of human preferences and detailed tool invocation processes. It was introduced in this paper and was used to train Themis-7b.
Supported Tools
TARA supports multiple tools including Calculator, Code, Translator, Google Search, Calendar, Weather, WikiSearch and Multi-tools.
Dataset Structure
calculator: preference… See the full description on the dataset page: https://huggingface.co/datasets/ernie-research/TARA.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
THIS ARTICLE USES WORDS OR LANGUAGE THAT IS CONSIDERED PROFANE, VULGAR, OR OFFENSIVE BY SOME READERS.
Different types of abusive content, such as offensive language, hate speech, and aggression, have become prevalent in social media, and many efforts have been dedicated to automatically detecting this phenomenon in resource-rich languages such as English. This focus is mainly due to the comparative lack of annotated data related to offensive language in low-resource languages, especially those spoken in Asian countries. To reduce the vulnerability of social media users from these regions, it is crucial to address the problem of offensive language in such low-resource languages. Hence, we present a new corpus of Persian offensive language, consisting of 6,000 posts randomly sampled from 520,000 micro-blog posts from X (Twitter), to support offensive language detection in Persian as a low-resource language in this area. We introduce a method for creating and annotating the corpus, following the annotation practices of recent benchmark datasets in other languages, which categorizes both the offensive language and the target of the offense. We perform extensive experiments with three classifiers at different levels of annotation, using a number of classical Machine Learning (ML) models, Deep Learning (DL) models, and transformer-based neural networks, including monolingual and multilingual pre-trained language models. Furthermore, we propose an ensemble model integrating these models to boost the performance of the offensive language detection task. Initial results on single models indicate that SVMs trained on character or word n-grams are the best-performing models, alongside the monolingual transformer-based pre-trained language model ParsBERT, in identifying offensive vs. non-offensive content, targeted vs. untargeted offense, and offense towards an individual or a group. In addition, the stacking ensemble model outperforms the single models by a substantial margin, obtaining a 5% macro F1-score improvement at each of the three levels of annotation.
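As an illustration of the character n-gram SVM baseline mentioned above, here is a minimal scikit-learn sketch; the example posts and labels are placeholders (the real corpus consists of Persian tweets), and the hyperparameters are not the ones reported in the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Placeholder posts and labels; the real data are Persian micro-blog posts.
texts = ["sample post one", "sample post two"]
labels = [0, 1]  # 0 = non-offensive, 1 = offensive

clf = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),  # character n-grams
    ("svm", LinearSVC()),
])
clf.fit(texts, labels)
print(clf.predict(["another new post"]))
```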
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset contains product reviews along with corresponding product names, prices, review summaries, and sentiment labels. The sentiment labels indicate whether a review expresses a positive, negative, or neutral sentiment towards the product. A possible application of this dataset is sentiment analysis of product reviews: machine learning algorithms could automatically classify reviews as positive, negative, or neutral based on the textual content of the review and associated metadata such as the product name and price. Such a system could be used by businesses to track customer sentiment towards their products and identify areas for improvement, and by consumers to make more informed purchasing decisions based on the experiences of others.
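To illustrate the suggested application, below is a small scikit-learn sketch that combines the review text with the price metadata; the column names (`review`, `price`, `sentiment`) and the toy rows are assumptions, not the dataset's actual schema.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real review table (assumed column names).
df = pd.DataFrame({
    "review": ["works great", "broke after a week", "it is okay"],
    "price": [19.99, 49.99, 9.99],
    "sentiment": ["positive", "negative", "neutral"],
})

features = ColumnTransformer([
    ("text", TfidfVectorizer(), "review"),   # bag-of-words for the review text
    ("num", StandardScaler(), ["price"]),    # scaled numeric metadata
])
model = Pipeline([("features", features), ("clf", LogisticRegression(max_iter=1000))])
model.fit(df[["review", "price"]], df["sentiment"])
print(model.predict(pd.DataFrame({"review": ["stopped working"], "price": [29.99]})))
```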
CC0
Original Data Source: 171k product review with Sentiment Dataset
Point-BERT is a new paradigm for learning point cloud Transformers. It pre-trains standard point cloud Transformers with a Masked Point Modeling (MPM) task.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Post-marketing reports of suspected adverse drug reactions are important for establishing the safety profile of a medicinal product. However, a high influx of reports poses a challenge for regulatory authorities, as a delay in identifying previously unknown adverse drug reactions can potentially harm patients. In this study, we use natural language processing (NLP) to predict whether a report is of a serious nature based solely on the free-text fields and adverse event terms in the report, potentially allowing reports mislabelled at the time of reporting to be detected and prioritized for assessment. We consider four NLP models at various levels of complexity, bootstrap their train-validation data split to eliminate random effects in the performance estimates, and conduct prospective testing to avoid the risk of data leakage. Using a Swedish BERT-based language model, continued language pre-training, and final classification training, we achieve close to human-level performance on this task. Model architectures built on less complex technical foundations, such as bag-of-words approaches and LSTM neural networks trained with random initialization of weights, appear to perform less well, likely because they lack the robustness that a base of general language training provides.
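For orientation, the sketch below shows only the final classification-training stage with the Hugging Face transformers Trainer; `KB/bert-base-swedish-cased` is one publicly available Swedish BERT used here as a stand-in (the study's exact model and its continued language pre-training step are not reproduced), and the report texts and labels are placeholders.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "KB/bert-base-swedish-cased"  # stand-in Swedish BERT, not necessarily the study's model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder free-text reports; 1 = serious, 0 = non-serious
data = Dataset.from_dict({
    "text": ["Patienten fick en allvarlig reaktion efter dos två.",
             "Lätt huvudvärk som gick över samma dag."],
    "label": [1, 0],
})
data = data.map(lambda x: tokenizer(x["text"], truncation=True,
                                    padding="max_length", max_length=128))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adr-seriousness", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```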
This dataset was created by lmyybh
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The replication package of the paper "An Empirical Comparison of Pre-Trained Models of Source Code". For the source code, please refer to https://github.com/NougatCA/FineTuner.
Dataset Card for "LeNER-Br language modeling"
Dataset Summary
The LeNER-Br language modeling dataset is a collection of legal texts in Portuguese from the LeNER-Br dataset (official site). The legal texts were downloaded from this link (93.6MB) and processed to create a DatasetDict with train and validation splits (20% validation). The LeNER-Br language modeling dataset allows fine-tuning of language models such as BERTimbau base and large.
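As a hedged sketch of the fine-tuning just described, the snippet below runs masked-language-model training with BERTimbau base (`neuralmind/bert-base-portuguese-cased`); the `text` column name is an assumption about the dataset's schema.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

dataset = load_dataset("pierreguillou/lener_br_finetuning_language_model")
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = AutoModelForMaskedLM.from_pretrained("neuralmind/bert-base-portuguese-cased")

# Tokenize the legal texts; "text" is an assumed column name.
tokenized = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=128),
    batched=True, remove_columns=dataset["train"].column_names,
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lener-br-mlm", num_train_epochs=1),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
)
trainer.train()
```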
Language
Portuguese from… See the full description on the dataset page: https://huggingface.co/datasets/pierreguillou/lener_br_finetuning_language_model.
https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/
PixMo-AskModelAnything
PixMo-AskModelAnything is an instruction-tuning dataset for vision-language models. It contains human-authored question-answer pairs about diverse images with long-form answers. PixMo-AskModelAnything is part of the PixMo dataset collection and was used to train the Molmo family of models.
Quick links:
📃 Paper 🎥 Blog with Videos
Loading
import datasets
data = datasets.load_dataset("allenai/pixmo-ask-model-anything", split="train")
Data Format… See the full description on the dataset page: https://huggingface.co/datasets/allenai/pixmo-ask-model-anything.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was composed within the NCN OPUS'21 research project "Source-code-representations for machine-learning-based identification of defective code fragments" (2021/41/B/ST6/02510) (https://ml4code.cs.put.poznan.pl/).
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Model Card Dataset Mentions
Dataset Summary
This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
[More Information Needed]
Data Splits
[More Information Needed]… See the full description on the dataset page: https://huggingface.co/datasets/librarian-bots/model_card_dataset_mentions.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
## Overview
FIRE MODELS is a dataset for object detection tasks - it contains FIRE annotations for 201 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [Public Domain license](https://creativecommons.org/licenses/Public Domain).