The Toxicity Reference Database (ToxRefDB) contains approximately 30 years and $2 billion worth of animal studies. ToxRefDB allows scientists and the interested public to search and download thousands of animal toxicity testing results for hundreds of chemicals that were previously found only in paper documents. Currently, there are 474 chemicals in ToxRefDB, primarily the data-rich pesticide active ingredients, but the number will continue to expand.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://tdcommons.ai/single_pred_tasks/tox#acute-toxicity-ld50
Dataset Description: Acute toxicity LD50 measures the most conservative dose that can lead to lethal adverse effects. The lower the LD50 dose, the more lethal the drug. This dataset is kindly provided by the authors of [1].
Task Description: Regression. Given a drug SMILES string, predict its acute toxicity.
Dataset Statistics: 7,385 drugs.
Dataset Split: Random Split, Scaffold Split
from tdc.single_pred import Tox
# Load the LD50_Zhu acute toxicity dataset
data = Tox(name='LD50_Zhu')
# Retrieve the train/validation/test split
split = data.get_split()
References: [1] Zhu, Hao, et al. “Quantitative structure-activity relationship modeling of rat acute toxicity by oral exposure.” Chemical Research in Toxicology 22.12 (2009): 1913-1921.
Dataset License: CC BY 4.0.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
toxric_30_datasets.zip: The expanded predictive toxicology dataset is sourced from TOXRIC, a comprehensive and standardized toxicology database. It contains 30 assay datasets with ~150,000 measurements across five categories of toxicity assessment: genetic toxicity, organic toxicity, clinical toxicity, developmental and reproductive toxicity, and reactive toxicity.
multiple_endpoint_acute_toxicity_dataset.zip & all_descriptors.txt: This 59-endpoint acute toxicity dataset is sourced from TOXRIC. It includes 59 toxicity endpoints with 80,081 unique compounds represented as SMILES strings, and 122,594 usable toxicity measurements given as continuous values in a unified unit, -log(mol/kg). The larger the measurement value, the more toxic the compound is toward the corresponding endpoint. The 59 acute toxicity endpoints cover 15 species (mouse, rat, rabbit, guinea pig, dog, cat, bird wild, quail, duck, chicken, frog, mammal, man, women, and human), 8 administration routes (intraperitoneal, intravenous, oral, skin, subcutaneous, intramuscular, parenteral, and unreported), and 3 measurement indicators: LD50 (lethal dose 50%), LDLo (lethal dose low), and TDLo (toxic dose low). Each compound has measurements for only a small number of endpoints, so the dataset is very sparse, with nearly 97.4% of compound-to-endpoint measurements missing.
The dataset is also extremely imbalanced. Some endpoints have tens of thousands of measurements (e.g., mouse-intraperitoneal-LD50 has 36,295, mouse-oral-LD50 has 23,373, and rat-oral-LD50 has 10,190), while others contain only around 100 (e.g., mouse-intravenous-LDLo, rat-intravenous-LDLo, frog-subcutaneous-LD50, and human-oral-TDLo). This sparsity and imbalance make acute toxicity evaluation challenging. Among the 59 endpoints, 21 endpoints with fewer than 200 measurements were considered small-sized endpoints, and 11 endpoints with more than 1000 measurements were treated as large-sized endpoints. The three endpoints targeting humans, human-oral-TDLo, women-oral-TDLo, and man-oral-TDLo, are typical small-sized endpoints, with only 140, 156, and 163 available toxicity measurements, respectively. The acute toxicity measurement values of the 80,081 compounds for the 59 endpoints, as well as the 5-fold random splits, are provided in multiple_endpoint_acute_toxicity_dataset.zip. The molecular fingerprints or feature descriptors of the 80,081 compounds, such as Avalon, Morgan, and AtomPair, are given in all_descriptors.txt.
115-endpoint_acute_toxiciy_dataset.zip: We collected more acute toxicity data from the PubChem database through web crawling. We unified all toxicity measurement units into -log(mol/kg) and retained the endpoints with no fewer than 30 available samples each. This produced a new acute toxicity dataset containing 115 endpoints. Compared with the previous 59-endpoint acute toxicity dataset from TOXRIC, the number of endpoints has nearly doubled, adding more species (such as goat, monkey, and hamster), administration routes (such as intracerebral and intratracheal), and measurement indicators (such as LD10 and LD20).
The sample imbalance among endpoints and the missing-data rate of this dataset are more severe. Its sparsity rate reaches 98.7%, and it contains 68 small-sample acute toxicity endpoints (i.e., endpoints with fewer than 200 toxicity measurements), among which the endpoint with the fewest samples has only 30 measurements. This dataset is therefore more challenging for all current acute toxicity prediction models.
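As a rough sanity check on the figures above, the sketch below recomputes the missing rate of the 59-endpoint dataset from its stated counts and shows how an LD50 reported in mg/kg could be converted to the unified -log(mol/kg) unit. The function names and the example compound (molecular weight 180.16 g/mol, LD50 of 200 mg/kg) are hypothetical illustrations, not values or code from TOXRIC.

```python
import math


def missing_rate(n_compounds: int, n_endpoints: int, n_measurements: int) -> float:
    """Fraction of compound-to-endpoint cells that have no measurement."""
    total_cells = n_compounds * n_endpoints
    return 1.0 - n_measurements / total_cells


def to_neg_log_mol_per_kg(ld50_mg_per_kg: float, mol_weight_g_per_mol: float) -> float:
    """Convert a dose in mg/kg body weight to -log10(mol/kg)."""
    grams_per_kg = ld50_mg_per_kg / 1000.0
    mol_per_kg = grams_per_kg / mol_weight_g_per_mol
    return -math.log10(mol_per_kg)


# Counts stated for the 59-endpoint dataset.
rate = missing_rate(n_compounds=80_081, n_endpoints=59, n_measurements=122_594)
print(f"missing rate: {rate:.1%}")  # missing rate: 97.4%

# Hypothetical compound: a lower dose maps to a larger (more toxic) value.
value = to_neg_log_mol_per_kg(ld50_mg_per_kg=200.0, mol_weight_g_per_mol=180.16)
print(f"-log(mol/kg): {value:.2f}")
```

Because of the negative logarithm, halving the lethal dose raises the converted value by about 0.30, which is why larger values indicate stronger toxicity.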
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is a curated and feature-enhanced collection of **7,413 chemical compounds for computational toxicology studies**. It's a good resource for a machine learning exercise. Its primary purpose is to enable the prediction of acute toxicity (LD₅₀) using machine learning methods.
Each row represents one unique chemical compound and includes:
Identifiers:
Target Variable:
Molecular Descriptors (25 features)
This dataset supports the study "Saving Mice: ChenseNet121, a New Deep Learning Architecture for LD50 Toxicity Estimation", and was specifically designed for training and evaluating multimodal deep learning models for acute oral toxicity (LD50) prediction in pesticides. It integrates multiple data representations for each compound:
2D images of molecular structures (folder: images/, PNG format), downloaded from PubChem and identified by compound CID.
3D voxelized volumes derived from molecular docking simulations against human acetylcholinesterase (hAChE, PDB: 7E3H), formatted as tensors and stored as .npy files (not shown in screenshot).
Physicochemical descriptors, extracted from SMILES using RDKit, including molecular weight, logP, TPSA, number of rotatable bonds, and docking binding affinities. These are stored in plain text files:
dataset_descriptores_bool.txt
dataset_descriptores_float.txt
dataset_descriptores_2x2x2_bool.txt
dataset_descriptores_2x2x2_float.txt
CSV files containing the integrated dataset (combined_dataset.csv) and a balanced test subset for classification tasks (balanced_test.csv).
The dataset is aligned with EFSA guidelines and enables the training of machine learning models using image-based, structural, and biochemical features. It was used to develop and evaluate the ChenseNet121 architecture, which outperforms ResNet, Inception, and EfficientNet variants in LD50 regression and WHO-aligned toxicity classification.
Suggested Citation: Junquera, E., Remeseiro, B., Febbraio, F., & Díaz, I. (2025). Multimodal Dataset for LD50 Toxicity Prediction of Pesticides Using Deep Learning. Zenodo.
Related Publication: Junquera et al. (2025). Saving Mice: ChenseNet121, a New Deep Learning Architecture for LD50 Toxicity Estimation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Excel file (DOI: https://doi.org/10.5281/zenodo.3755675) provides the raw data used to develop the first integrative Quantitative Structure-Activity Relationship (QSAR) model, built on EFSA's OpenFoodTox, US-EPA ECOTOX and the Pesticide Properties DataBase, i) to predict acute contact toxicity (LD50) and ii) to profile the Mode of Action (MoA) of pesticide active substances in honey bees (Apis mellifera). Chemical identifiers (e.g. SMILES, CAS no., InChI) and acute contact toxicity data (LD50) on honey bees were used to develop and validate i) a two-category QSAR model (toxic/non-toxic; n = 411; sensitivity = 0.93, specificity = 0.85, balanced accuracy = 0.90, Matthews correlation coefficient (MCC) = 0.78), and ii) a regression-based model (n = 113; R² = 0.74; MAE = 0.52). The study also proposes the first MoA profiling for 113 pesticide active substances and the first harmonised MoA classification scheme for acute contact toxicity in honey bees, including LD50 data points from three databases: EFSA's OpenFoodTox, US-EPA ECOTOX and the Pesticide Properties DataBase. Such classification allows further definition of the MoAs and target sites of Plant Protection Product (PPP) active substances, thus enabling regulators and scientists to refine chemical grouping and toxicity extrapolations for single chemicals and component-based mixture risk assessment of multiple chemicals.
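For readers unfamiliar with the classification metrics quoted above, the following minimal sketch shows how sensitivity, specificity, balanced accuracy and MCC are computed from a binary confusion matrix. The function name and the counts in the example are hypothetical; they are not the confusion matrix of the honey bee QSAR model.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard metrics for a two-category (toxic/non-toxic) classifier."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    balanced_accuracy = (sensitivity + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
metrics = binary_metrics(tp=90, fn=10, tn=80, fp=20)
for name, score in metrics.items():
    print(f"{name}: {score:.2f}")
```

Balanced accuracy is the mean of sensitivity and specificity, so it is less affected than plain accuracy when toxic and non-toxic compounds are unevenly represented.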
The full data collection and analysis of QSAR models, toxicity data (LD50) and Mode of Action (MoA) data are described in Carnesecchi et al., 2020 (DOI: doi.org/10.1016/j.scitotenv.2020.139243).
This work was supported by the European Food Safety Authority (EFSA) [contract number: OC/EFSA/SCER/2018/01 and NP/EFSA/AFSCO/2016/02 (Edoardo Carnesecchi)].
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data: Variation in pesticide toxicity in the western honey bee (Apis mellifera) associated with consuming phytochemically different monofloral honeys
Includes:
Identification and quantification of phenolic components of honeys: Raw_data_JOCE.xlsx – sheet: “HoneyPhytochemicals”
Effects of honey phytochemicals on acute pesticide toxicity: Raw_data_JOCE.xlsx – sheet: “raw_LD50”; Raw_data_JOCE.xlsx – sheet: “raw_LD50_hive_based”
g a.s./ha: mass of active substance per hectare, expressed in grams.
µg a.s./bee: mass of active substance per bee, expressed in micrograms.
ng a.s./bee: mass of active substance per bee, expressed in nanograms.
E.D.: Experimental Data.
LD50: Median Lethal Dose.
HQ: Hazard Quotient (field rate (g/ha)/LD50 (µg/bee)).
Revisited HQ: exposure (ng/bee)/LD50 (µg/bee).
N.C.: Not Calculated because of the low toxicity of the active substance.
DAR EFSA: Draft Assessment Report of the European Food Safety Authority.
PED US EPA: Pesticides Ecotoxicity Database of the United States Environmental Protection Agency.
a: Time at which the LD50 was determined. The LD50 values resulting from the experimental data were calculated at the time corresponding to a stabilized mortality.
b: For each active substance, 2 scenarios of exposure are presented: the lowest and the highest homologated application rate.
c: For each active substance, the highest and the lowest known LD50 values were compared to the lowest and highest homologated application rates, respectively.
d: HQ is the ratio between the application rate (g a.s./ha) and the LD50 (µg a.s./bee).
e: The exposure was calculated from the application rate (ng a.s./cm2) and the mean exposure surface area determined with the 20 active substances (1.05 cm2/bee).
f: The LD50 values from the experimental data were calculated with the BMD software from the US EPA.
g: The revisited HQ is the ratio between the exposure (ng a.s./bee) and the LD50 (ng a.s./bee).
h: For each active substance, the dose-mortality relationship was modeled at the time corresponding to a stabilized mortality.
Determination and comparison of the HQ and the revisited HQ.
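The two quotients defined in these table notes can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the revisited HQ follows the note that expresses both exposure and LD50 in ng a.s./bee, the g/ha to ng/cm2 factor is a plain unit conversion (1 g = 1e9 ng, 1 ha = 1e8 cm2), and the example rate and LD50 are invented rather than taken from the table.

```python
NG_PER_CM2_PER_G_PER_HA = 10.0         # 1 g/ha = 1e9 ng / 1e8 cm2 = 10 ng/cm2
MEAN_EXPOSURE_AREA_CM2_PER_BEE = 1.05  # mean exposure surface area stated in the notes


def hazard_quotient(rate_g_per_ha: float, ld50_ug_per_bee: float) -> float:
    """HQ: application rate (g a.s./ha) divided by LD50 (ug a.s./bee)."""
    return rate_g_per_ha / ld50_ug_per_bee


def exposure_ng_per_bee(rate_g_per_ha: float) -> float:
    """Exposure: application rate in ng/cm2 times the exposed area per bee."""
    rate_ng_per_cm2 = rate_g_per_ha * NG_PER_CM2_PER_G_PER_HA
    return rate_ng_per_cm2 * MEAN_EXPOSURE_AREA_CM2_PER_BEE


def revisited_hazard_quotient(rate_g_per_ha: float, ld50_ug_per_bee: float) -> float:
    """Revisited HQ: exposure (ng/bee) divided by LD50 converted to ng/bee."""
    ld50_ng_per_bee = ld50_ug_per_bee * 1000.0
    return exposure_ng_per_bee(rate_g_per_ha) / ld50_ng_per_bee


# Hypothetical active substance: 50 g a.s./ha with an LD50 of 0.1 ug a.s./bee.
hq = hazard_quotient(50.0, 0.1)
revisited = revisited_hazard_quotient(50.0, 0.1)
print(f"HQ = {hq:.0f}, revisited HQ = {revisited:.2f}")
```

Unlike the HQ, the revisited HQ is dimensionless, because the numerator and denominator are both doses per bee.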
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of larval honey bee (Apis mellifera) acute (LAO; single dose; OECD TG 237) and chronic (LCO; repeat dose; OECD GD 239) LD50 values (N = 15) expressed in terms of μg a.i. larva⁻¹ day⁻¹.
"PhD obj3 LD50 data submitted April 21 2025": applied to Table 1 of the manuscript; used to calculate the lethal doses at which 25%, 50%, and 90% of the organisms (Aedes aegypti) die (LD25, LD50, LD90) using Probit software (PoloPlus, LeOra Software). These data were generated by RLA under the supervision of KJL in July-August 2021 at the USDA-ARS Center for Medical, Agricultural and Veterinary Entomology (USDA-ARS-CMAVE) facility in Gainesville, FL, using dilutions of technical malathion, technical permethrin, and rapeseed methyl ester. The values were generated to identify an LD50 dose to apply to 17-d-old female Ae. aegypti (7 d old prior to the blood meal), 10 days after feeding on a blood meal containing dengue-1 virus. The Ae. aegypti used to determine the LD50 were blood fed at 7 d old and aged 10 d after blood feeding, then exposed to the dilutions of technical malathion, technical permethrin, and rapeseed methyl ester.
"PhD Obj3 dengue mortality check submitted April 21 2025": used to populate the results for Table 2 and Table 3, describing the differences in mortality associated with exposure to a dengue-1 treated blood meal and LD50 pesticide/control treatment exposure in Aedes aegypti mosquitoes. Analysis was conducted using R. These data were generated by RLA under the supervision of BWA at the University of Florida - Florida Medical Entomology Laboratory (UF-FMEL) facility in Vero Beach, FL, between September and November 2021. Ae. aegypti females were aged to 7 d, provided a dengue-1 tainted blood meal or sham blood meal, and allowed to age for 10 d after blood feeding. A topical LD50 dose of technical permethrin or technical malathion diluted in rapeseed methyl ester, or neat rapeseed methyl ester as a control, was then applied, and mortality was recorded at 24 h and 48 h. Following mortality checks, Ae. aegypti were frozen (-80 deg C) and checked for the presence of virus using RT-qPCR.
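To illustrate how LD25, LD50 and LD90 values relate to dose-mortality data, here is a minimal sketch of the classical probit line: mortality fractions are converted to probits, regressed on log10(dose) by ordinary least squares, and the line is inverted at the desired mortality level. This is a simplified, unweighted stand-in and not the maximum-likelihood analysis performed by PoloPlus; the function names and the dose-mortality values are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def fit_probit_line(doses, mortality_fractions):
    """Least-squares fit of probit(mortality) = intercept + slope * log10(dose).

    Mortality fractions must lie strictly between 0 and 1.
    """
    xs = [math.log10(d) for d in doses]
    ys = [STD_NORMAL.inv_cdf(p) for p in mortality_fractions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def lethal_dose(intercept, slope, fraction):
    """Dose at which the fitted line predicts the given mortality fraction."""
    z = STD_NORMAL.inv_cdf(fraction)
    return 10 ** ((z - intercept) / slope)


# Hypothetical dilution series (dose units arbitrary) and observed mortality.
doses = [0.5, 1.0, 2.0, 4.0, 8.0]
mortality = [0.05, 0.20, 0.50, 0.80, 0.95]

intercept, slope = fit_probit_line(doses, mortality)
ld25 = lethal_dose(intercept, slope, 0.25)
ld50 = lethal_dose(intercept, slope, 0.50)
ld90 = lethal_dose(intercept, slope, 0.90)
print(f"LD25={ld25:.2f}  LD50={ld50:.2f}  LD90={ld90:.2f}")
```

Dedicated probit software additionally weights each dose group by its sample size and reports confidence limits, which this sketch omits.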
Computational methods to predict molecular properties regarding safety and toxicology represent alternative approaches to expedite drug development, screen environmental chemicals, and thus significantly reduce associated time and costs. There is a strong need and interest in the development of computational methods that yield reliable predictions of toxicity, and many approaches, including the recently introduced deep neural networks, have been leveraged towards this goal. Herein, we report on the collection, curation, and integration of data from the public data sets that were the source of the ChemIDplus database for systemic acute toxicity. These efforts generated the largest publicly available such data set comprising > 80,000 compounds measured against a total of 59 acute systemic toxicity end points. This data was used for developing multiple single- and multitask models utilizing random forest, deep neural networks, convolutional, and graph convolutional neural network approaches. For the first time, we also reported the consensus models based on different multitask approaches. To the best of our knowledge, prediction models for 36 of the 59 end points have never been published before. Furthermore, our results demonstrated a significantly better performance of the consensus model obtained from three multitask learning approaches that particularly predicted the 29 smaller tasks (fewer than 300 compounds) better than other models developed in the study. The curated data set and the developed models have been made publicly available at https://github.com/ncats/ld50-multitask, https://predictor.ncats.io/, and https://cactus.nci.nih.gov/download/acute-toxicity-db (data set only) to support regulatory and research applications.
RTECS is a compendium of data extracted from the open scientific literature. The data are recorded in the format developed by the RTECS staff and arranged in alphabetical order by prime chemical name. Six types of toxicity data are included in the file: (1) primary irritation; (2) mutagenic effects; (3) reproductive effects; (4) tumorigenic effects; (5) acute toxicity; and (6) other multiple dose toxicity. Specific numeric toxicity values such as LD50, LC50, TDLo, and TCLo are noted as well as species studied and route of administration used. For each citation, the bibliographic source is listed thereby enabling the user to access the actual studies cited. No attempt has been made to evaluate the studies cited in RTECS. The user has the responsibility of making such assessments.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predicting acute oral toxicity LD50 of chemicals in rats is a challenge since many factors affect toxicity data. In this paper, 40 descriptors were successfully used to develop a quantitative structure–activity relationship model for 8448 rat acute oral toxicity logLD50 values by applying the random forest (RF) algorithm. To develop the optimal RF model, a training set (5914 chemicals) was used to establish models, a validation set (1267 chemicals) was used to tune RF parameters, and a test set (1267 chemicals) was used to assess the performance of the RF models. The model yielded a correlation coefficient R of 0.9695 and an rms error (log unit) of 0.3171 for the training set, R = 0.8322 and rms = 0.2889 for the validation set, and R = 0.8335 and rms = 0.3060 for the test set. More than 99% of rat acute oral toxicity logLD50 values in the dataset can be accurately predicted, although the dataset is large.
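The two statistics reported above, the correlation coefficient R and the rms error, can be computed as in the sketch below, which also checks that the three stated subsets add up to the full dataset. The helper names and the observed/predicted logLD50 values are hypothetical and serve only to show the calculation.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def rms_error(observed, predicted):
    """Root-mean-square error, in the same (log) units as the inputs."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Subset sizes stated in the abstract.
train_n, valid_n, test_n = 5914, 1267, 1267
total_n = train_n + valid_n + test_n
print(f"total chemicals: {total_n}")  # total chemicals: 8448

# Hypothetical logLD50 values for five chemicals.
observed = [2.0, 2.5, 3.0, 3.5, 4.0]
predicted = [2.1, 2.4, 3.2, 3.4, 4.1]
r_value = pearson_r(observed, predicted)
rms_value = rms_error(observed, predicted)
print(f"R = {r_value:.4f}, rms = {rms_value:.4f}")
```

The stated sizes correspond to a split of roughly 70% training, 15% validation and 15% test.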
The goal of this study was to develop a management bait and examine whether it can be used for selective control of grass carp. Our objectives were to 1) quantify the water-based 24-h LC50 of Antimycin-A for grass carp and rainbow trout, 2) quantify the 96-h LD50 of orally administered Antimycin-A laden bait for grass carp and rainbow trout, 3) quantify the leaching rate of Antimycin-A from the bait in water, and 4) determine if a management bait laden with Antimycin-A will be consumed by grass carp and cause lethality in the laboratory. To meet our objectives, Antimycin-A was encapsulated in a wax microparticle similar to Poole et al. (2018) and incorporated into a rapeseed bait for oral gavage feeding and consumption trials to demonstrate whether Antimycin-A can be orally delivered, protected from degradation, and readily consumed by grass carp. The dataset includes raw files of toxicity survival information, water quality, bait leaching, and Antimycin-A analytical measurements.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Given the widespread presence of emerging contaminants in the environment, assessing and ensuring their biosafety is urgent. Under the Globally Harmonized System (GHS), the LD50 parameter of acute oral toxicity (AOT) is crucial for chemical safety classification. Animal testing limitations have highlighted the need for alternative methods, and machine learning offers a new approach to predict LD50 through quantitative structure-activity relationship (QSAR) models. This study developed and optimized a machine learning model for LD50 classification of emerging contaminants based on data from more than 6000 known AOT. Using molecular descriptors and fingerprints, the model achieves an accuracy above 0.86 and a recall score over 0.84, outperforming previous models. The model’s robustness was confirmed across various types of emerging contaminants. Shapley additive explanations (SHAP) identified key descriptors like BCUTp_1h, ATSC1pe, and SLogP_VSA4, while the information gain (IG) method highlighted alert substructures [P-O, P-S]. These findings suggest that compounds with high polarizability, mean electronegativity and significant surface area may adversely affect rats. This model enhances understanding of acute toxicity mechanisms and serves as a tool for early screening of safer compounds, promoting the design of greener chemicals.
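Since the classification target here is the GHS acute oral toxicity category, the sketch below shows how an oral LD50 maps to a category. The cut-off values (5, 50, 300, 2000 and 5000 mg/kg body weight) come from the GHS itself and are not stated in this abstract; the function name is hypothetical, and the exact labelling scheme used by the study's model is not specified here.

```python
from typing import Optional

# Upper bounds (inclusive, mg/kg body weight) for GHS acute oral categories 1 to 5.
GHS_ORAL_UPPER_BOUNDS = [(1, 5.0), (2, 50.0), (3, 300.0), (4, 2000.0), (5, 5000.0)]


def ghs_oral_category(ld50_mg_per_kg: float) -> Optional[int]:
    """Return the GHS acute oral toxicity category, or None if above 5000 mg/kg."""
    if ld50_mg_per_kg <= 0:
        raise ValueError("LD50 must be a positive dose")
    for category, upper_bound in GHS_ORAL_UPPER_BOUNDS:
        if ld50_mg_per_kg <= upper_bound:
            return category
    return None  # not classified for acute oral toxicity


# Hypothetical LD50 values (mg/kg) to show the mapping.
for dose in (3.0, 25.0, 300.0, 1500.0, 4000.0, 6000.0):
    print(dose, ghs_oral_category(dose))
```

Lower category numbers indicate higher acute toxicity, so a classifier trained on these labels is effectively predicting which dose band the LD50 falls into.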
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Guidelines used in selecting LD50 values from multiple sources of data.
Our tabular data consists of a very comprehensive collection of LC50 and LD50 values for imidacloprid across insect taxa. It is limited to adult insects and contains a range of variables such as body mass, pesticide formulation, exposure duration, geographic origin, and insect strain. The file can be opened by any program that reads spreadsheets (e.g. Excel, R, MATLAB).
This dataset contains data on permethrin residues on adult mosquitoes and adult butterflies following their exposure to ultra-low volume (ULV) sprays containing permethrin. The dataset also contains toxicity information for permethrin: first for adult mosquitoes and adult butterflies following their exposure to the ULV sprays, and second for adult mosquitoes exposed during toxicity tests to determine median lethal dose levels (LD50).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 4. The list of descriptors used for model derivation is included.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AAO: Adult Acute Oral; LAO: Larval Acute Oral; LCO: Larval Chronic Oral.