Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset and code package supports the reproducible evaluation of Structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods.
The materials accompany the study presented in [Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025), DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1], where Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation and machine learning validation. This release provides complete transparency for reproducing the reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts.
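As a concrete starting point, the sketch below shows how the BERT classifier mentioned above could be fine-tuned with TensorFlow/Keras via the Hugging Face transformers library. The file name, column names ("message", "label"), checkpoint, and hyperparameters are illustrative assumptions, not values taken from this release; consult data_dictionary.csv and the scripts in /code/python/ for the actual configuration.

```python
# Minimal BERT fine-tuning sketch (assumed workflow, not the released script).
import pandas as pd
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

df = pd.read_csv("data/tagged/messages.csv")  # hypothetical file name
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=df["label"].nunique()
)

# Tokenize the message text into fixed-length tensors.
enc = tokenizer(
    df["message"].tolist(), padding=True, truncation=True,
    max_length=128, return_tensors="tf",
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(dict(enc), df["label"].values, epochs=3, batch_size=16)
```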
Value of the Data:
* Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers.
* Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis.
* Offers untagged datasets for new annotation or domain adaptation.
* Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows.
* Facilitates extension into other domains (e.g., multilingual LLM messaging validation).
Data Description:
* /data/tagged/*.csv – Human-labeled datasets with schema defined in data_dictionary.csv.
* /data/untagged/*.csv – Clean datasets without labels for inference or annotation.
* /code/python/ – Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting (a minimal usage sketch follows this list).
* /code/r/ – R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables.
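For the tree-based side, here is a minimal baseline sketch pairing TF-IDF features with the XGBoost classifier on the tagged CSVs. It is a plausible reconstruction of the workflow, not the released training script; the column names are the same hypothetical "message" and "label" used above.

```python
# Baseline sketch: TF-IDF features + XGBoost on the tagged CSVs.
# Column names ("message", "label") are assumptions; see data_dictionary.csv.
import glob
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.concat(pd.read_csv(f) for f in glob.glob("data/tagged/*.csv"))
df["y"] = df["label"].astype("category").cat.codes  # XGBoost needs integer classes

X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["y"], test_size=0.2, random_state=42
)

vec = TfidfVectorizer(max_features=20_000)
clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(vec.fit_transform(X_train), y_train)  # XGBoost accepts sparse input
print(classification_report(y_test, clf.predict(vec.transform(X_test))))
```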
File Formats:
* Data: CSV (UTF-8, RFC 4180)
* Code: .py, .R, .Rproj
Ethics & Licensing * All data are de-identified and contain no PII. * Released under CC BY 4.0 (data) and MIT License (code).
Limitations
* Labels reflect annotator interpretations and may encode bias.
* Models were trained on English text; generalization to other languages requires adaptation.
Funding Note
* Funding sources supported the time of the human taggers who annotated the data sets.
Semi-flexible docking was performed using AutoDock Vina 1.2.2 on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF). Two data sets are provided in xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference for machine learning models built with TensorFlow, XGBoost, and SchNetPack to study their ability to predict docking scores. The first data set originally contained 60,411 in-vivo labeled compounds selected for training the ML models. The second data set, denoted in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. Both sets were downloaded on 10 December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were excluded due to the presence of Si atoms. Compounds with no charges assigned in their mol2 files were excluded as well (523 compounds in the in-vivo set and 1,666 in the in-vitro-only set). The molecular docking calculations and machine learning approaches are described in the Computational details section of [1].
Reference
[1] Lukas Bucinsky, Marián Gall, Ján Matúška, Michal Pitoňák, Marek Štekláč. Advances and critical assessment of machine learning techniques for prediction of docking scores. Int. J. Quantum Chem. (2023). DOI: 10.1002/qua.27110.
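A hedged sketch of how these xyz files might feed a docking-score regression model follows. It assumes the Vina score is stored as the last token on the comment (second) line of each xyz block, and it uses crude element-count features as a stand-in for the learned molecular representations (e.g., SchNetPack) used in [1]; the file name is hypothetical.

```python
# Docking-score regression sketch (format assumptions noted above).
import numpy as np
from xgboost import XGBRegressor

def read_xyz_scores(path):
    """Parse concatenated xyz blocks into (element-count features, Vina scores)."""
    features, scores = [], []
    with open(path) as fh:
        lines = fh.read().splitlines()
    i = 0
    while i < len(lines):
        if not lines[i].strip():          # tolerate blank separator lines
            i += 1
            continue
        n_atoms = int(lines[i])
        # Assumption: the Vina score is the last token on the comment line.
        scores.append(float(lines[i + 1].split()[-1]))
        elements = [lines[i + 2 + j].split()[0] for j in range(n_atoms)]
        # Crude composition features standing in for learned descriptors.
        features.append([elements.count(e) for e in ("H", "C", "N", "O", "S")])
        i += 2 + n_atoms
    return np.array(features), np.array(scores)

X, y = read_xyz_scores("in_vivo.xyz")     # hypothetical file name
model = XGBRegressor(n_estimators=500, max_depth=8)
model.fit(X, y)
```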