100+ datasets found

ChEMBL EBI Small Molecules Database
kaggle.com
zip
Updated Feb 12, 2019
Cite
Google BigQuery (2019). ChEMBL EBI Small Molecules Database [Dataset]. https://www.kaggle.com/bigquery/ebi-chembl
Explore at:
Available download formats
zip (0 bytes)
Dataset updated
Feb 12, 2019
Dataset authored and provided by
Google BigQuery
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

ChEMBL is maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL), based at the Wellcome Trust Genome Campus, Hinxton, UK.

Content

ChEMBL is a manually curated database of bioactive molecules with drug-like properties used in drug discovery, including information about existing patented drugs.

Schema: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/chembl_23_schema.png

Documentation: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/schema_documentation.html

Fork this notebook to get started on accessing data in the BigQuery dataset using the BQhelper package to write SQL queries.

Acknowledgements

“ChEMBL” by the European Bioinformatics Institute (EMBL-EBI), used under CC BY-SA 3.0. Modifications have been made to add normalized publication numbers.

Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:ebi_chembl

Banner photo by rawpixel on Unsplash
ChEMBL database of bioactive drug-like small molecules - Cell lines section
bioregistry.io
Updated Dec 11, 2022
Cite
(2022). ChEMBL database of bioactive drug-like small molecules - Cell lines section [Dataset]. https://bioregistry.io/chembl.cell
Explore at:
Dataset updated
Dec 11, 2022
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)
https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Chemistry resources
ChEMBL database
resodate.org
Updated Dec 2, 2024
Cite
Jannis Born; Matteo Manica; Ali Oskooei; Joris Cadow; Karsten Borgwardt; María Rodríguez Martínez (2024). ChEMBL database [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvY2hlbWJsLWRhdGFiYXNl
Explore at:
Dataset updated
Dec 2, 2024
Dataset provided by
Leibniz Data Manager
Authors
Jannis Born; Matteo Manica; Ali Oskooei; Joris Cadow; Karsten Borgwardt; María Rodríguez Martínez
Description
The dataset used in this paper is the ChEMBL database, which contains drugs/molecules and their binding information for proteins Lyn, Lck, and Src.
ChEMBL
rrid.site
Updated Dec 23, 2025
+ more versions
Cite
(2025). ChEMBL [Dataset]. http://identifiers.org/RRID:SCR_014042
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_014042
Dataset updated
Dec 23, 2025
Description
Collection of bioactive drug-like small molecules that contains 2D structures, calculated properties and abstracted bioactivities. Used for drug discovery and chemical biology research. Clinical progress of new compounds is continuously integrated into the database.
The ChEMBL database in 2017 - Dataset - LDM
service.tib.eu
Updated Dec 3, 2024
Cite
(2024). The ChEMBL database in 2017 - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/the-chembl-database-in-2017
Explore at:
Dataset updated
Dec 3, 2024
Description
The ChEMBL database is a large collection of bioactive compounds and their biological activities.
Drug Targets and Drug Lists Data Package
johnsnowlabs.com
csv
Updated Jan 20, 2021
Cite
John Snow Labs (2021). Drug Targets and Drug Lists Data Package [Dataset]. https://www.johnsnowlabs.com/marketplace/drug-targets-and-drug-lists-data-package/
Explore at:
Available download formats
csv
Dataset updated
Jan 20, 2021
Dataset authored and provided by
John Snow Labs
Description
This data package contains information on approved, researched and proven drug targets and drug lists.
ChEMBL
bioregistry.io
Updated Apr 24, 2021
+ more versions
Cite
(2021). ChEMBL [Dataset]. http://identifiers.org/re3data:r3d100010539
Explore at:
Unique identifier
https://identifiers.org/re3data:r3d100010539
Dataset updated
Apr 24, 2021
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)
https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
ChEMBL is a database of bioactive compounds, their quantitative properties and bioactivities (binding constants, pharmacology and ADMET, etc). The data is abstracted and curated from the primary scientific literature.
MultiTarget Bioactivity ChEMBL
kaggle.com
Updated Nov 30, 2025
Cite
J. (2025). MultiTarget Bioactivity ChEMBL [Dataset]. https://www.kaggle.com/datasets/xjoannax88/multitarget-bioactivity-chembl/code
Explore at:
Dataset updated
Nov 30, 2025
Authors
J.
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)
https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Bioactivity Dataset for EGFR, DRD2, BACE1, and HDAC1

This dataset contains curated molecular data for compounds tested against four pharmacologically important protein targets. All data are derived from the ChEMBL database.

Targets Included

Target Name    ChEMBL Target ID    Target Class
EGFR           CHEMBL203           Kinase (Receptor TK)
DRD2           CHEMBL217           GPCR (Dopamine D2 receptor)
BACE1          CHEMBL1987          Enzyme (Aspartyl protease)
HDAC1          CHEMBL325           Enzyme (Histone deacetylase)

Each entry in the dataset includes:
- ChEMBL compound ID
- Canonical SMILES
- Molecular properties (molecular weight, HBA, HBD, logP, and TPSA)
- Target label

Format

The dataset is provided in CSV or Parquet format with the following columns:

ChEMBL ID: ChEMBL compound identifier

SMILES: Canonical SMILES string

Molecular weight: Molecular mass (Da)

LogP: Octanol-water partition coefficient

HBA: Number of hydrogen bond acceptors

HBD: Number of hydrogen bond donors

TPSA: Topological polar surface area

Protein: Protein target name
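
As a rough illustration of working with the column layout listed above, the sketch below parses a few rows with Python's standard csv module and counts compounds per target. The exact header spellings and all sample rows (IDs, SMILES, property values) are assumptions made up for the example; they are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the column layout described above.
# IDs, SMILES, and property values are placeholders, not real dataset rows.
SAMPLE_CSV = """ChEMBL ID,SMILES,Molecular weight,LogP,HBA,HBD,TPSA,Protein
CHEMBL_EXAMPLE_1,CCO,46.07,-0.03,1,1,20.23,EGFR
CHEMBL_EXAMPLE_2,c1ccccc1,78.11,1.69,0,0,0.0,DRD2
CHEMBL_EXAMPLE_3,CC(=O)O,60.05,0.09,2,1,37.3,EGFR
"""

# Columns assumed to hold numeric values.
NUMERIC_COLUMNS = ("Molecular weight", "LogP", "HBA", "HBD", "TPSA")


def load_rows(handle):
    """Read CSV rows into dicts and convert the numeric columns to float."""
    rows = []
    for row in csv.DictReader(handle):
        for name in NUMERIC_COLUMNS:
            row[name] = float(row[name])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))

# Count how many compounds carry each target label.
per_target = Counter(row["Protein"] for row in rows)
print(per_target)
```

For the real file, the same function can be pointed at an opened CSV file instead of the in-memory sample, provided the header names match.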

License

This dataset is a derivative of ChEMBL and is distributed under the same license:

Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
https://creativecommons.org/licenses/by-sa/3.0/

Source

ChEMBL Database: https://www.ebi.ac.uk/chembl/

Data accessed via the ChEMBL WebResource Client (chembl_webresource_client)

Citation

If you use this dataset, please cite the following:

ChEMBL Database:
Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. (2017).
The ChEMBL database in 2017. Nucleic Acids Res., 45(D1): D945–D954.
https://doi.org/10.1093/nar/gkw1074

ChEMBL Web Services:
Davies M, Nowotka M, Papadatos G, Dedman N, Gaulton A, Atkinson F, Bellis LJ, Overington JP. (2015).
ChEMBL web services: streamlining access to drug discovery data and utilities. Nucleic Acids Res., 43(W1): W612–W620.
https://doi.org/10.1093/nar/gkv352
Data from: hERG Me Out
acs.figshare.com
text/x-python
Updated Jun 5, 2023
Cite
Paul Czodrowski (2023). hERG Me Out [Dataset]. http://doi.org/10.1021/ci400308z.s001
Explore at:
Available download formats
text/x-python
Unique identifier
https://doi.org/10.1021/ci400308z.s001
Dataset updated
Jun 5, 2023
Dataset provided by
ACS Publications
Authors
Paul Czodrowski
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)
https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
A detailed analysis of the hERG content inside the ChEMBL database is performed. The correlation between the outcome from binding assays and functional assays is probed. On the basis of descriptor distributions, design paradigms with respect to structural and physicochemical properties of hERG active and hERG inactive compounds are challenged. Finally, classification models with different data sets are trained. All source code is provided, which is based on the Python open source packages RDKit and scikit-learn to enable the community to rerun the experiments. The code is stored on GitHub (https://github.com/pzc/herg_chembl_jcim).

Data from: A consensus compound/bioactivity dataset for data-driven drug...

zenodo.org

zip

Updated May 13, 2022

Cite

Laura Isigkeit; Apirat Chaikuad; Daniel Merk (2022). A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics [Dataset]. http://doi.org/10.5281/zenodo.6320761

Explore at:

Available download formats

zip

Unique identifier

https://doi.org/10.5281/zenodo.6320761

Dataset updated

May 13, 2022

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Laura Isigkeit; Apirat Chaikuad; Daniel Merk

License

Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Information

The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144803 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell-lines and phenotypic readouts). It also provides simplified information on assay types underlying the bioactivity data and on bioactivity confidence by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, which makes the dataset easy to use in multiple applications such as chemogenomics and data-driven drug design.

The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks with flags for divergent data from different sources may help data selection and further accurate curation.

Structure and content of the dataset

**Dataset structure**
ChEMBL ID	PubChem ID	IUPHAR ID	Target	Activity type	Assay type	Unit	Mean C (0)	...	Mean PC (0)	...	Mean B (0)	...	Mean I (0)	...	Mean PD (0)	...	Activity check annotation	Ligand names	Canonical SMILES C	...	Structure check	Source

The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.

Except for the canonical SMILES columns, all columns are filled with the datatype ‘string’. The datatype for the canonical SMILES columns is the smiles-format. We recommend the File Reader node for using the dataset in KNIME. With the help of this node the data types of the columns can be adjusted exactly. In addition, only this node can read the compressed format.

Column content:

ChEMBL ID, PubChem ID, IUPHAR ID: chemical identifier of the databases
Target: biological target of the molecule expressed as the HGNC gene symbol
Activity type: for example, pIC₅₀
Assay type: Simplification/Classification of the assay into cell-free, cellular, functional and unspecified
Unit: unit of bioactivity measurement
Mean columns of the databases: mean of bioactivity values or activity comments denoted with the frequency of their occurrence in the database, e.g. Mean C = 7.5 *(15) -> the value for this compound-target pair occurs 15 times in ChEMBL database
Activity check annotation: a bioactivity check was performed by comparing values from the different sources and adding an activity check annotation to provide automated activity validation for additional confidence
- no comment: bioactivity values are within one log unit;
- check activity data: bioactivity values are not within one log unit;
- only one data point: only one value was available, no comparison and no range calculated;
- no activity value: no precise numeric activity value was available;
- no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration
Ligand names: all unique names contained in the five source databases are listed
Canonical SMILES columns: Molecular structure of the compound from each database
Structure check: To denote matching or differing compound structures in different source databases
- match: molecule structures are the same between different sources;
- no match: the structures differ;
- 1 source: no structure comparison is possible, because the molecule comes from only one source database.
Source: The databases from which the data come
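
The column conventions above can be sketched in a few lines of Python. This is only an illustration under assumptions: the cell format is inferred from the single example "7.5 *(15)", and the check compares whatever log-scale values are passed in, which may not match exactly how the authors compared values across sources.

```python
import re

# Assumed cell format, based on the example "7.5 *(15)" given above:
# a numeric mean followed by the number of occurrences in parentheses.
MEAN_PATTERN = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*\*?\s*\((\d+)\)\s*$")


def parse_mean_cell(cell):
    """Return (mean, count) for a cell like '7.5 *(15)', or None if it does not match."""
    match = MEAN_PATTERN.match(cell)
    if match is None:
        return None  # e.g. an activity comment instead of a number
    return float(match.group(1)), int(match.group(2))


def activity_check(values):
    """Apply the annotation rules listed above to a list of log-scale values."""
    if not values:
        return "no activity value"
    if len(values) == 1:
        return "only one data point"
    # "Within one log unit" is read here as max - min <= 1.
    if max(values) - min(values) <= 1.0:
        return "no comment"
    return "check activity data"


print(parse_mean_cell("7.5 *(15)"))
print(activity_check([7.5, 8.2]))
print(activity_check([6.0, 7.5]))
```

The "no log-value could be calculated" case is left out because it depends on unit information that this sketch does not model.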

Data from: PDEStrIAn: A Phosphodiesterase Structure and Ligand Interaction...
acs.figshare.com
zip
Updated Jun 1, 2023
+ more versions
Cite
Chimed Jansen; Albert J. Kooistra; Georgi K. Kanev; Rob Leurs; Iwan J. P. de Esch; Chris de Graaf (2023). PDEStrIAn: A Phosphodiesterase Structure and Ligand Interaction Annotated Database As a Tool for Structure-Based Drug Design [Dataset]. http://doi.org/10.1021/acs.jmedchem.5b01813.s001
Explore at:
Available download formats
zip
Unique identifier
https://doi.org/10.1021/acs.jmedchem.5b01813.s001
Dataset updated
Jun 1, 2023
Dataset provided by
ACS Publications
Authors
Chimed Jansen; Albert J. Kooistra; Georgi K. Kanev; Rob Leurs; Iwan J. P. de Esch; Chris de Graaf
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)
https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
A systematic analysis is presented of the 220 phosphodiesterase (PDE) catalytic domain crystal structures present in the Protein Data Bank (PDB) with a focus on PDE–ligand interactions. The consistent structural alignment of 57 PDE ligand binding site residues enables the systematic analysis of PDE–ligand interaction fingerprints (IFPs), the identification of subtype-specific PDE–ligand interaction features, and the classification of ligands according to their binding modes. We illustrate how systematic mining of this phosphodiesterase structure and ligand interaction annotated (PDEStrIAn) database provides new insights into how conserved and selective PDE interaction hot spots can accommodate the large diversity of chemical scaffolds in PDE ligands. A substructure analysis of the cocrystallized PDE ligands in combination with those in the ChEMBL database provides a toolbox for scaffold hopping and ligand design. These analyses lead to an improved understanding of the structural requirements of PDE binding that will be useful in future drug discovery studies.
Table_1_How to Achieve Better Results Using PASS-Based Virtual Screening:...
frontiersin.figshare.com
xlsx
Updated Jun 3, 2023
+ more versions
Cite
Pavel V. Pogodin; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry A. Filimonov; Dmitry S. Druzhilovskiy; Mark C. Nicklaus; Vladimir V. Poroikov (2023). Table_1_How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors.XLSX [Dataset]. http://doi.org/10.3389/fchem.2018.00133.s003
Explore at:
Available download formats
xlsx
Unique identifier
https://doi.org/10.3389/fchem.2018.00133.s003
Dataset updated
Jun 3, 2023
Dataset provided by
Frontiers
Authors
Pavel V. Pogodin; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry A. Filimonov; Dmitry S. Druzhilovskiy; Mark C. Nicklaus; Vladimir V. Poroikov
License
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Discovery of new pharmaceutical substances is currently boosted by the possibility of using the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed synthetic one-step route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting the “active” and “inactive” compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database we performed the PASS training, and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained if only the experimentally confirmed active and inactive compounds for distinct kinases were used in the training procedure. However, for some kinases, reasonable results were obtained even if we used merged training sets, in which we designated as inactives the compounds not tested against the particular kinase. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.
Data from: Drug Safety Data Curation and Modeling in ChEMBL: Boxed Warnings...
datasetcatalog.nlm.nih.gov
Updated Jan 28, 2021
Cite
Bosc, Nicolas; Bento, A. Patrícia; Hersey, Anne; Hunter, Fiona M. I.; Gaulton, Anna; Leach, Andrew R. (2021). Drug Safety Data Curation and Modeling in ChEMBL: Boxed Warnings and Withdrawn Drugs [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000801021
Explore at:
Dataset updated
Jan 28, 2021
Authors
Bosc, Nicolas; Bento, A. Patrícia; Hersey, Anne; Hunter, Fiona M. I.; Gaulton, Anna; Leach, Andrew R.
Description
The safety of marketed drugs is an ongoing concern, with some of the more frequently prescribed medicines resulting in serious or life-threatening adverse effects in some patients. Safety-related information for approved drugs has been curated to include the assignment of toxicity class(es) based on their withdrawn status and/or black box warning information described on medicinal product labels. The ChEMBL resource contains a wide range of bioactivity data types, from early “Discovery” stage preclinical data for individual compounds through to postclinical data on marketed drugs; the inclusion of the curated drug safety data set within this framework can support a wide range of safety-related drug discovery questions. The curated drug safety data set will be made freely available through ChEMBL and updated in future database releases.
Analog series-based scaffolds from ChEMBL with associated activity...
data.niaid.nih.gov
Updated Jan 24, 2020
Cite
Dimova, Dilyana; Stumpfe, Dagmar; Hu, Ye; Bajorath, Jürgen (2020). Analog series-based scaffolds from ChEMBL with associated activity information [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_155302
Explore at:
Dataset updated
Jan 24, 2020
Dataset provided by
Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstrasse 2, D-53113 Bonn, Germany
Authors
Dimova, Dilyana; Stumpfe, Dagmar; Hu, Ye; Bajorath, Jürgen
License
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reported is the activity information for the 12,294 analog series-based (ASB) scaffolds extracted from the ChEMBL database. For each ASB scaffold, structural and activity information for all analogs comprising the analog series is provided.
Tyrosine Kinases ligands with bioactivity data
kaggle.com
zip
Updated Jun 6, 2024
Cite
Ricardo Romero Ochoa (2024). Tyrosine Kinases ligands with bioactivity data [Dataset]. https://www.kaggle.com/dsv/8626441
Explore at:
Available download formats
zip (2087515 bytes)
Dataset updated
Jun 6, 2024
Authors
Ricardo Romero Ochoa
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset contains bioactivity, toxicity, and druglikeness information for 28314 small compounds targeting tyrosine kinase proteins, obtained from the ChEMBL database and subsequently curated. It comprises the following elements:

ABL (Abelson tyrosine-protein kinase 1) 1970

EGFR (Epidermal growth factor receptor) 9508

PDGFR (Platelet-derived growth factor receptor) 1750

FGFR (Fibroblast growth factor receptor) 3156

MET (Hepatocyte growth factor receptor) 3950

VEGFR (Vascular endothelial growth factor receptor) 1211

KIT (Stem cell factor receptor) 1692

RET (Rearranged during transfection) 896

JAK (Janus kinase) 4203

ALK (Anaplastic lymphoma kinase) 2087

SRC 3846

The data can be used to train models that predict or classify bioactivity and druglike properties of small compounds targeting a tyrosine kinase protein.
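
Before training on these data it can help to tally the per-target counts listed above, for instance to gauge class imbalance. The short sketch below does only that arithmetic. Note that the counts sum to more than the stated 28314 compounds; the description does not explain the difference, so any reading of it (for example, compounds listed under more than one kinase) would be a guess.

```python
# Per-target compound counts as listed in the description above.
counts = {
    "ABL": 1970,
    "EGFR": 9508,
    "PDGFR": 1750,
    "FGFR": 3156,
    "MET": 3950,
    "VEGFR": 1211,
    "KIT": 1692,
    "RET": 896,
    "JAK": 4203,
    "ALK": 2087,
    "SRC": 3846,
}

stated_total = 28314  # total number of compounds quoted in the description

total_listed = sum(counts.values())
difference = total_listed - stated_total

# Share of each target among the listed entries, largest first.
shares = sorted(
    ((name, n / total_listed) for name, n in counts.items()),
    key=lambda item: item[1],
    reverse=True,
)

print(total_listed, difference)
for name, share in shares:
    print(f"{name}: {share:.1%}")
```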
Drug Design with Small Molecule SMILES
kaggle.com
zip
Updated Feb 26, 2022
Cite
Sgt. Peppers (2022). Drug Design with Small Molecule SMILES [Dataset]. https://www.kaggle.com/art3mis/chembl22
Explore at:
Available download formats
zip (26097905 bytes)
Dataset updated
Feb 26, 2022
Authors
Sgt. Peppers
License
Attribution 3.0 (CC BY 3.0)
https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
Context

SMILES (simplified molecular-input line-entry system) strings are ASCII sequences that describe the structure of a compound. Due to their simplicity, SMILES representations have been widely used in drug design, mining, and repurposing using machine learning and natural language processing techniques.

See Cheminformania's seq2seq model for an excellent tutorial to get started.

Content

A single text file

Each row contains two columns for description of an individual molecule. The first column is the SMILES string. The second is a reference to the full ChEMBL entry for that particular molecule.
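
A minimal sketch of reading such a two-column file is shown below. The description does not state the delimiter, so whitespace separation is assumed here; the sample lines and their identifiers are placeholders, not rows from the file.

```python
import io


def read_smiles_file(handle):
    """Yield (smiles, chembl_ref) pairs from a two-column text file.

    Assumes whitespace-separated columns with the SMILES string first;
    adjust the split if the real file uses another delimiter.
    """
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip rows that do not have both columns
        yield parts[0], parts[1]


# Placeholder content standing in for the real text file.
sample = io.StringIO("CCO CHEMBL_EXAMPLE_1\n\nc1ccccc1\tCHEMBL_EXAMPLE_2\n")
pairs = list(read_smiles_file(sample))
print(pairs)
```

Because the function is a generator, the real file can be streamed line by line instead of being loaded into memory at once.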

Acknowledgements

ChEMBL Database: Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. (2017) 'The ChEMBL database in 2017.' Nucleic Acids Res., 45(D1) D945-D954.

ChEMBL Web Services: Davies M, Nowotka M, Papadatos G, Dedman N, Gaulton A, Atkinson F, Bellis L, Overington JP. (2015) 'ChEMBL web services: streamlining access to drug discovery data and utilities.' Nucleic Acids Res., 43(W1) W612-W620.

ChEMBL RDF: S. Jupp, J. Malone, J. Bolleman, M. Brandizi, M. Davies, L. Garcia, A. Gaulton, S. Gehant, C. Laibe, N. Redaschi, S.M Wimalaratne, M. Martin, N. Le Novère, H. Parkinson, E. Birney and A.M Jenkinson (2014) The EBI RDF Platform: Linked Open Data for the Life Sciences Bioinformatics 30 1338-1339
Neo4j open measurment
kaggle.com
zip
Updated Feb 15, 2023
Cite
Tom Nijhof-Verhees (2023). Neo4j open measurment [Dataset]. https://www.kaggle.com/datasets/wagenrace/neo4j-open-measurment
Explore at:
Available download formats
zip (29854808766 bytes)
Dataset updated
Feb 15, 2023
Authors
Tom Nijhof-Verhees
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Kickstart a chemical graph database

I have spent some time scraping and shaping PubChem data into a Neo4j graph database. The process took a lot of time, mainly downloading, and loading it into Neo4j. The whole process took weeks. If you want to build your own I will show you how to download mine and set it up in less than an hour (most of the time you’ll just have to wait). The process of how this dataset is created is described in the following blogs:
- https://medium.com/@nijhof.dns/exploring-neodash-for-197m-chemical-full-text-graph-e3baed9615b8
- https://medium.com/neo4j/combining-3-biochemical-datasets-in-a-graph-database-8e9aafbb5788
- https://medium.com/p/d9ee9779dfbe

What do you get?

The full database is a merge of 3 datasets, PubChem (compounds + synonyms), NCI60 (GI50), and ChEMBL (cell lines). It contains 6 nodes of interest:
● Compound: This is related to a compound of PubChem. It has 1 property.
  ○ pubChemCompId: The id within pubchem. So “compound:cid162366967” links to https://pubchem.ncbi.nlm.nih.gov/compound/162366967. This number can be used with both PubChem RDF and PUG.
● Synonym: A name found in the literature. This name can refer to zero, one, or more compounds. This helps find relations between natural language names and absolute compounds they are related to.
  ○ Name: Natural language name. Can contain letters, spaces, numbers, and any other Unicode character.
  ○ pubChemSynId: PubChem synonym id as used within the RDF
● CellLine: These are the ChEMBL cell lines. They hold a lot of information.
  ○ Name: The name of the cell line.
  ○ Uri: A unique URI for every element within the ChEMBL RDF.
  ○ cellosaurusId: The id to connect it to the Cellosaurus dataset. This is one of the most extensive cell line datasets out there.
● Measurement: A measurement you can do within a biomedical experiment. Currently, only GI50 (the concentration needed for Growth Inhibition of 50%) is added.
  ○ Name: Name of the measurement.
● Condition: A single condition of an experiment. A condition is part of an experiment. Examples are: an individual of the control group, a sample with drug A, or a sample with more CO2
● Experiment: A collection of multiple conditions all done at the same time with the same bias. Meaning we assume all uncontrolled variables are the same.
  ○ Name: Name of experiment.
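
The mapping from a pubChemCompId value to its PubChem page, as in the example given above, could be sketched like this. The function name and the strict "compound:cid<digits>" pattern are assumptions for illustration; the database may contain identifiers in other shapes.

```python
import re

# Pattern assumed from the single example "compound:cid162366967".
_COMPOUND_ID = re.compile(r"^compound:cid(\d+)$")


def pubchem_url(pubchem_comp_id):
    """Turn a pubChemCompId such as 'compound:cid162366967' into a PubChem URL."""
    match = _COMPOUND_ID.match(pubchem_comp_id)
    if match is None:
        raise ValueError(f"unexpected pubChemCompId format: {pubchem_comp_id!r}")
    return f"https://pubchem.ncbi.nlm.nih.gov/compound/{match.group(1)}"


print(pubchem_url("compound:cid162366967"))
```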

Overview of the graph design

How to download it

Warning: you need 120 GB of free disk space. The compressed file you download is already 30 GB, and the uncompressed file is another 30 GB. The database afterward is 60 GB. So 60 GB is only for temporary files; the other 60 GB is for the database. If you do this on an HDD it will be slow.
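
A quick way to check the space requirement before starting is Python's standard shutil.disk_usage. The sketch below assumes the 120 GB figure from the warning above refers to disk space (the mention of an HDD suggests so) and treats 1 GB as 10^9 bytes; the function name is made up for the example.

```python
import shutil

REQUIRED_GB = 120  # figure quoted in the warning above


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return (ok, free_gb) for the filesystem that contains `path`."""
    free_bytes = shutil.disk_usage(path).free
    free_gb = free_bytes / 10**9  # decimal gigabytes, an assumption
    return free_gb >= required_gb, free_gb


ok, free_gb = has_enough_space(".")
print(f"free: {free_gb:.1f} GB, enough for the import: {ok}")
```

Pass the folder where Neo4j stores its databases as `path` if it lives on a different drive than the current directory.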

If you load this into Neo4j desktop as a local database (like I do) it will scream and yell at you, just ignore this. We are pushing it far further than it is designed for, but it will still work.

Download the file

Go to this Kaggle dataset and download the dump file. Unzip the file, then delete the zipped file. This part needs 60 GB but only takes 30 by the end of it.

Create a database

Open the Neo4j desktop app, and click “Reveal files in File Explorer”. Move the .dump you downloaded into this folder.

Click on the ... behind the .dump file and click Create new DBMS from dump. This database is a dump from Neo4j V4, so your database also needs to be V4.x.x!

It will now create the database. This will take a long time, it might even say it has timed out. Do not believe this lie! In the background, it is still running. Every time you start it, it will time out. Just let it run and press start later again. The second time it will be started up directly.

Every time I start it up I get the timed-out error. After waiting 10 minutes and clicking start again the database, and with it, more than 200 million nodes, is ready. And you are done! Good luck and let me know what you build with it
KinFragLib: Combinatorial library
data.niaid.nih.gov
Updated Mar 21, 2024
Cite
Sydow, Dominique; Schmiel, Paula; Mortier, Jérémie; Buchthal, Katharina; Kramer, Paula Linh; Volkamer, Andrea (2024). KinFragLib: Combinatorial library [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3954931
Explore at:
Dataset updated
Mar 21, 2024
Dataset provided by
Saarland University
Charité
Bayer AG
Authors
Sydow, Dominique; Schmiel, Paula; Mortier, Jérémie; Buchthal, Katharina; Kramer, Paula Linh; Volkamer, Andrea
License
MIT License
https://opensource.org/licenses/MIT
License information was derived automatically
Description
KinFragLib: Exploring the Kinase Inhibitor Space Using Subpocket-Focused Fragmentation and Recombination.

Project description.

Protein kinases play a crucial role in many cell signaling processes, making them one of the most important families of drug targets. In this context, fragment-based drug design strategies have been successfully applied to develop novel kinase inhibitors, usually following a knowledge-driven approach to optimize a focused set of fragments to a potent kinase inhibitor.

Alternatively, KinFragLib is a new method for exploring and extending the chemical space of kinase inhibitors using data-driven fragmentation and recombination, built on available structural kinome data from the KLIFS database for over 3,200 kinase DFG-in complexes. The computational fragmentation method splits the co-crystallized non-covalent kinase inhibitors into fragments with respect to their 3D proximity to six predefined functionally relevant subpocket centers. The resulting fragment library consists of six subpocket pools with over 9,000 fragments, available at https://github.com/volkamerlab/KinFragLib.

KinFragLib offers two main applications: (i) In-depth analyses of the chemical space of known kinase inhibitors, subpocket characteristics and connections, as well as (ii) subpocket-informed recombination of fragments to generate potential novel inhibitors. The latter showed that recombining only a subset of 727 representative fragments generated a combinatorial library of 11.3 million molecules, containing, besides some known kinase inhibitors, more than 99% novel chemical matter compared to ChEMBL and 55% molecules compliant with Lipinski's rule of five.

Combinatorial library dataset.

The dataset offered here is part of the KinFragLib GitHub repository (https://github.com/volkamerlab/KinFragLib) and contains the metadata and properties of the KinFragLib combinatorial library.

Raw data

combinatorial_library.json: Full combinatorial library. Please refer to notebooks/4_1_combinatorial_library_data_preparation.ipynb at https://github.com/volkamerlab/KinFragLib for detailed information about this data format.

combinatorial_library_deduplicated.json: Deduplicated combinatorial library (based on InChIs).

chembl_standardized_inchi.csv: Standardized ChEMBL 33 molecules in the form of InChI strings.
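The InChI-based deduplication mentioned for combinatorial_library_deduplicated.json can be sketched as follows. This is only an illustration of the idea (keep the first record seen for each InChI string); the record layout and the `"inchi"` key are hypothetical and not taken from the actual file format, which is documented in the notebook referenced above.

```python
from typing import Dict, Iterable, List


def deduplicate_by_inchi(
    records: Iterable[Dict[str, object]], key: str = "inchi"
) -> List[Dict[str, object]]:
    """Keep only the first record for each distinct InChI string.

    The `key` name is a placeholder; adapt it to the real data format.
    """
    seen = set()
    unique = []
    for record in records:
        inchi = record[key]
        if inchi not in seen:
            seen.add(inchi)
            unique.append(record)
    return unique
```

Using a set for the InChIs already seen keeps the pass linear in the number of records, which matters for a library with millions of entries.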

Processed data

Data extracted from combinatorial_library_deduplicated.json; the extraction is performed in notebooks/4_1_combinatorial_library_data_preparation.ipynb at https://github.com/volkamerlab/KinFragLib.

n_atoms.csv: Number of atoms for each recombined ligand.

ro5.csv: Number of ligands that fulfill Lipinski's rule of five (Ro5) and its individual criteria; number of ligands in total.

subpockets.csv: Number of ligands per subpocket combination.

original_exact.json: Ligands with exact matches in original ligands, i.e. KLIFS ligands that were used for the fragmentation.

original_substructure.json: Ligands with substructure matches in original ligands, i.e. KLIFS ligands that were used for the fragmentation.

chembl_exact.json: Ligands with exact matches in ChEMBL.

chembl_most_similar.json: Most similar ligand in ChEMBL for each recombined ligand.

chembl_highly_similar.json: Most similar ligand in ChEMBL for each recombined ligand, restricted to cases where the similarity is greater than 0.9.
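The relationship between the "most similar" and "highly similar" files can be pictured as a simple threshold filter. The sketch below assumes each entry is a dictionary with a hypothetical `"similarity"` field holding the similarity to the closest ChEMBL ligand; neither the field name nor the similarity measure is specified in this description.

```python
from typing import Dict, List


def filter_highly_similar(
    most_similar: List[Dict[str, object]], threshold: float = 0.9
) -> List[Dict[str, object]]:
    """Keep entries whose similarity is strictly greater than the threshold.

    The "similarity" key is a placeholder for illustration only.
    """
    return [
        entry for entry in most_similar
        if float(entry["similarity"]) > threshold
    ]
```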

Usage.

This dataset can be used to run the notebooks available at https://github.com/volkamerlab/KinFragLib.

Clone the KinFragLib repository.

Download the tar.bz2 file provided here.

Extract the archive content to the combinatorial library folder in your local KinFragLib folder and run the notebooks.

tar -xvf combinatorial_library.tar.bz2 -C /path_to_kinfraglib/data/combinatorial_library/
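For users who prefer to script the extraction step, the same operation can be sketched with Python's standard `tarfile` module. The function name `extract_archive` is made up for this example, and the paths in the usage comment are the placeholders from the command above rather than real locations.

```python
import tarfile
from pathlib import Path
from typing import List, Union


def extract_archive(
    archive_path: Union[str, Path], target_dir: Union[str, Path]
) -> List[str]:
    """Extract a bzip2-compressed tar archive and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python version has it.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(path=target, **extra)
    return names


# Example call (placeholder paths, adjust to your local checkout):
# extract_archive(
#     "combinatorial_library.tar.bz2",
#     "/path_to_kinfraglib/data/combinatorial_library/",
# )
```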

Citation.

This dataset is part of the KinFragLib publication:

Sydow, D., Schmiel, P., Mortier, J., and Volkamer, A. KinFragLib: Exploring the Kinase Inhibitor Space Using Subpocket-Focused Fragmentation and Recombination. J. Chem. Inf. Model. 2020. https://pubs.acs.org/doi/abs/10.1021/acs.jcim.0c00839
ChEMBL Data
console.cloud.google.com
Updated Apr 9, 2023
Cite
Google Patents Public Datasets (2023). ChEMBL Data [Dataset]. https://console.cloud.google.com/marketplace/product/google_patents_public_datasets/chembl?hl=es
Dataset updated
Apr 9, 2023
Dataset provided by
Google (http://google.com/)
License
Description
ChEMBL Data is a manually curated database of small molecules used in drug discovery, including information about existing patented drugs.
List of highly ranked unknown drug-protein pairs in ChEMBL.
datasetcatalog.nlm.nih.gov
Updated Nov 21, 2013
Cite
Jin, Daeyong; Kim, Shinhyuk; Lee, Hyunju (2013). List of highly ranked unknown drug-protein pairs in ChEMBL. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001625959
Dataset updated
Nov 21, 2013
Authors
Jin, Daeyong; Kim, Shinhyuk; Lee, Hyunju
Description
Among all 270,540 drug-protein pairs from the ChEMBL data set, the top 50 unknown pairs determined by the KL1LR method using data sets were checked, and the unknown pair was listed if it was found in the STITCH [4], DrugBank [1], KEGG [2], BindingDB [36], and CTD [35] data sets. Drugs in the second column and proteins in the third column are likely to interact, based on the probabilities shown in the fourth column. If interactions are found in more than two data sets, only one source is listed. Similarly, the results obtained using chemical structure similarities are shown.
