11 datasets found

Learn Data Science Series Part 1
kaggle.com
Updated Dec 30, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rupesh Kumar (2022). Learn Data Science Series Part 1 [Dataset]. https://www.kaggle.com/datasets/hunter0007/learn-data-science-part-1
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 30, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Rupesh Kumar
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Please feel free to share it with others and consider supporting me if you find it helpful ⭐️.

Overview:

Chapter 1: Getting started with pandas

Chapter 2: Analysis: Bringing it all together and making decisions

Chapter 3: Appending to DataFrame

Chapter 4: Boolean indexing of dataframes

Chapter 5: Categorical data

Chapter 6: Computational Tools

Chapter 7: Creating DataFrames

Chapter 8: Cross sections of different axes with MultiIndex

Chapter 9: Data Types

Chapter 10: Dealing with categorical variables

Chapter 11: Duplicated data

Chapter 12: Getting information about DataFrames

Chapter 13: Gotchas of pandas

Chapter 14: Graphs and Visualizations

Chapter 15: Grouping Data

Chapter 16: Grouping Time Series Data

Chapter 17: Holiday Calendars

Chapter 18: Indexing and selecting data

Chapter 19: IO for Google BigQuery

Chapter 20: JSON

Chapter 21: Making Pandas Play Nice With Native Python Datatypes

Chapter 22: Map Values

Chapter 23: Merge, join, and concatenate

Chapter 24: Meta: Documentation Guidelines

Chapter 25: Missing Data

Chapter 26: MultiIndex

Chapter 27: Pandas Datareader

Chapter 28: Pandas IO tools (reading and saving data sets)

Chapter 29: pd.DataFrame.apply

Chapter 30: Read MySQL to DataFrame

Chapter 31: Read SQL Server to Dataframe

Chapter 32: Reading files into pandas DataFrame

Chapter 33: Resampling

Chapter 34: Reshaping and pivoting

Chapter 35: Save pandas dataframe to a csv file

Chapter 36: Series

Chapter 37: Shifting and Lagging Data

Chapter 38: Simple manipulation of DataFrames

Chapter 39: String manipulation

Chapter 40: Using .ix, .iloc, .loc, .at and .iat to access a DataFrame

Chapter 41: Working with Time Series
A
‘Titanic: cleaned data’ analyzed by Analyst-2
analyst-2.ai
Updated Sep 30, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Titanic: cleaned data’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-titanic-cleaned-data-cbf4/dc9cd7ff/?iid=055-046&v=presentation
Explore at:
Dataset updated
Sep 30, 2021
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Titanic: cleaned data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/jamesleslie/titanic-cleaned-data on 30 September 2021.

--- Dataset description provided by original source is as follows ---

Introduction

This dataset was created in this notebook as part of a three-part series. The data is in machine-learning-ready format, with all missing values for the Age, Fare and Embarked columns having been imputed.

Data imputation

Age: this column was imputed by using the median age for the passenger's title (Mr, Mrs, Dr etc).

Fare: the single missing value in this column was imputed using the median value for that passenger's class.

Embarked: the two missing values here were imputed using the Pandas backfill method.

Usage

This data is used in both the second and third parts of the series.

--- Original source retains full ownership of the source dataset ---
TMDB MOVIES DATASET
kaggle.com
Updated Sep 22, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
MAYUR DESAI88 (2022). TMDB MOVIES DATASET [Dataset]. https://www.kaggle.com/datasets/mayurdesai88/tmdb-movies-dataset/data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 22, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
MAYUR DESAI88
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains the data of the over 10000 TMDB movies including the id, title, release date,avg vote, vote count, overview and popularity, etc. This data was collected by using the TMDB API, requests, json and converted into a dataframe using pandas. This data set contains some null values as there are missing fields in the tmdb database. Thought it's good for a young analyst to deal with missing value and you can also use this data to make movies recommendation systems
Klib library python
kaggle.com
Updated Jan 11, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 11, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sripaad Srinivasan
Description
klib library enables us to quickly visualize missing data, perform data cleaning, visualize data distribution plot, visualize correlation plot and visualize categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations on key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).

Original Github repo

https://raw.githubusercontent.com/akanz1/klib/main/examples/images/header.png" alt="klib Header">

Usage

!pip install klib

import klib import pandas as pd df = pd.DataFrame(data) # klib.describe functions for visualizing datasets - klib.cat_plot(df) # returns a visualization of the number and frequency of categorical features - klib.corr_mat(df) # returns a color-encoded correlation matrix - klib.corr_plot(df) # returns a color-encoded heatmap, ideal for correlations - klib.dist_plot(df) # returns a distribution plot for every numeric feature - klib.missingval_plot(df) # returns a figure containing information about missing values

Examples

Take a look at this starter notebook.

Further examples, as well as applications of the functions can be found here.

Contributing

Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.

License

MIT
Covid19 Cleaned Data
kaggle.com
Updated Apr 10, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Prashant Patel (2020). Covid19 Cleaned Data [Dataset]. https://www.kaggle.com/prashant268/covid-clean/tasks
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 10, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Prashant Patel
Description
This is the cleaned data for covid19 forecasting with some important variables e.g. average temperature, the median age of the country. I have used the following data for information about the country and filled any missing value using Wikipedia and pandas. https://www.kaggle.com/koryto/countryinfo Feel free to use this data and upvote if it is useful.
n
Extirpated species in Berlin, dates of last detections, habitats, and number...
data.niaid.nih.gov
datadryad.org
zip
Updated Jul 9, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Silvia Keinath (2024). Extirpated species in Berlin, dates of last detections, habitats, and number of Berlin’s inhabitants [Dataset]. http://doi.org/10.5061/dryad.n5tb2rc4k
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.n5tb2rc4k
Dataset updated
Jul 9, 2024
Dataset provided by
Museum für Naturkunde
Authors
Silvia Keinath
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Area covered
Berlin
Description
Species loss is highly scale-dependent, following the species-area relationship. We analysed spatio-temporal patterns of species’ extirpation on a multitaxonomic level using Berlin, the capital city of Germany. Berlin is one of the largest cities in Europe and has experienced a strong urbanisation trend since the late 19th century. We expected species’ extirpation to be exceptionally high due to the long history of urbanisation. Analysing regional Red Lists of Threatened Plants, Animals, and Fungi of Berlin (covering 9498 species), we found that 16 % of species were extirpated, a rate 5.9 times higher than at the German scale, and 47.1 times higher than at the European scale. Species’ extirpation in Berlin is comparable to that of another German city with a similarly broad taxonomic coverage, but much higher than in regional areas with less human impact. The documentation of species’ extirpation started in the 18th century and is well documented for the 19th and 20th centuries. We found an average annual extirpation of 3.6 species in the 19th century, 9.6 species in the 20th century, and the same number of extirpated species as in the 19th century were documented in the 21th century, despite the much shorter time period. Our results showed that species’ extirpation is higher at small than on large spatial scales, and might be negatively influenced by urbanisation, with different effects on different taxonomic groups and habitats. Over time, we found that species’ extirpation is highest during periods of high human alterations and is negatively affected by the number of people living in the city. But, there is still a lack of data to decouple the size of the area and the human impact of urbanisation. However, cities might be suitable systems for studying species’ extirpation processes due to their small scale and human impact. Methods Data extraction: To determine the proportion of extirpated species for Germany, we manually summarised the numbers of species classified in category 0 ‘extinct or extirpated’ and calculated the percentage in relation to the total number of species listed in the Red Lists of Threatened Species for Germany, taken from the website of the Red List Centre of Germany (Rote Liste Zentrum, 2024a). For Berlin, we used the 37 current Red Lists of Threatened Plants, Animals, and Fungi from the city-state of Berlin, covering the years from 2004 to 2023, taken from the official capital city portal of the Berlin Senate Department for Mobility, Transport, Climate Protection and Environment (SenMVKU, 2024a; see overview of Berlin Red Lists used in Table 1). We extracted all species that are listed as extinct/extirpated, i.e. classified in category 0, and additionally, if available, the date of the last record of the species in Berlin. The Red List of macrofungi of the order Boletales by Schmidt (2017) was not included in our study, as this Red List has only been compiled once in the frame of a pilot project and therefore lacks the category 0 ‘extinct or extirpated’. We used Python, version 3.7.9 (Van Rossum and Drake, 2009), the Python libraries Pandas (McKinney et al., 2010), and Camelot-py, version 0.11.0 (Vinayak Meta, 2023) in Jupyter Lab, version 4.0.6 (Project Jupyter, 2016) notebooks. In the first step, we created a metadata table of the Red Lists of Berlin to keep track of the extraction process, maintain the source reference links, and store summarised data from each Red List pdf file. At the extraction of each file, a data row was added to the metadata table which was updated throughout the rest of the process. In the second step, we identified the page range for extraction for each extracted Red List file. The extraction mechanism for each Red List file depended on the printed table layout. We extracted tables with lined rows with the Lattice parsing method (Camelot-py, 2024a), and tables with alternating-coloured rows with the Stream method (Camelot-py, 2024b). For proofing the consistency of extraction, we used the Camelot-py accuracy report along with the Pandas data frame shape property (Pandas, 2024). After initial data cleaning for consistent column counts and missing data, we filtered the data for species in category 0 only. We collated data frames together and exported them as a CSV file. In a further step, we proofread whether the filtered data was tallied with the summary tables, given in each Red List. Finally, we cleaned each Red List table to contain the species, the current hazard level (category 0), the date of the species’ last detection in Berlin, and the reference (codes and data available at: Github, 2023). When no date of last detection was given for a species, we contacted the authors of the respective Red Lists and/or used former Red Lists to find information on species’ last detections (Burger et al., 1998; Saure et al., 1998; 1999; Braasch et al., 2000; Saure, 2000). Determination of the recording time windows of the Berlin Red Lists We determined the time windows, the Berlin Red Lists look back on, from their methodologies. If the information was missing in the current Red Lists, we consulted the previous version (see all detailed time windows of the earliest assessments with references in Table B2 in Appendix B). Data classification: For the analyses of the percentage of species in the different hazard levels, we used the German Red List categories as described in detail by Saure and Schwarz (2005) and Ludwig et al. (2009). These are: Prewarning list, endangered (category 3), highly endangered (category 2), threatened by extinction or extirpation (category 1), and extinct or extirpated (category 0). To determine the number of indigenous unthreatened species in each Red List, we subtracted the number of species in the five categories and the number of non-indigenous species (neobiota) from the total number of species in each Red List. For further analyses, we pooled the taxonomic groups of the 37 Red Lists into more broadly defined taxonomic groups: Plants, lichens, fungi, algae, mammals, birds, amphibians, reptiles, fish and lampreys, molluscs, and arthropods (see categorisation in Table 1). We categorised slime fungi (Myxomycetes including Ceratiomyxomycetes) as ‘fungi’, even though they are more closely related to animals because slime fungi are traditionally studied by mycologists (Schmidt and Täglich, 2023). We classified ‘lichens’ in a separate category, rather than in ‘fungi’, as they are a symbiotic community of fungi and algae (Krause et al., 2017). For analyses of the percentage of extirpated species of each pooled taxonomic group, we set the number of extirpated species in relation to the sum of the number of unthreatened species, species in the prewarning list, and species in the categories one to three. We further categorised the extirpated species according to the habitats in which they occurred. We therefore categorised terrestrial species as ‘terrestrial’ and aquatic species as ‘aquatic’. Amphibians and dragonflies have life stages in both, terrestrial and aquatic habitats, and were categorised as ‘terrestrial/aquatic’. We also categorised plants and mosses as ‘terrestrial/aquatic’ if they depend on wetlands (see all habitat categories for each species in Table C1 in Appendix C). The available data considering the species’ last detection in Berlin ranked from a specific year, over a period of time up to a century. If a year of last detection was given with the auxiliary ‘around’ or ‘circa’, we used for further analyses the given year for temporal classification. If a year of last detection was given with the auxiliary ‘before’ or ‘after’, we assumed that the nearest year of last detection was given and categorised the species in the respective century. In this case, we used the species for temporal analyses by centuries only, not across years. If only a timeframe was given as the date of last detection, we used the respective species for temporal analyses between centuries, only. We further classified all of the extirpated species in centuries, in which species were lastly detected: 17th century (1601-1700); 18th century (1701-1800); 19th century (1801-1900); 20th century (1901-2000); 21th century (2001-now) (see all data on species’ last detection in Table C1 in Appendix C). For analyses of the effects of the number of inhabitants on species’ extirpation in Berlin, we used species that went extirpated between the years 1920 and 2012, because of Berlin’s was expanded to ‘Groß-Berlin’ in 1920 (Buesch and Haus, 1987), roughly corresponding to the cities’ current area. Therefore, we included the number of Berlin’s inhabitants for every year a species was last detected (Statistische Jahrbücher der Stadt Berlin, 1920, 1924-1998, 2000; see all data on the number of inhabitants for each year of species’ last detection in Table C1 in Appendix C). Materials and Methods from Keinath et al. (2024): 'High levels of species’ extirpation in an urban environment – A case study from Berlin, Germany, covering 1700-2023'.
f
CpG Signature Profiling and Heatmap Visualization of SARS-CoV Genomes:...
figshare.com
txt
Updated Apr 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tahir Bhatti (2025). CpG Signature Profiling and Heatmap Visualization of SARS-CoV Genomes: Tracing the Genomic Divergence From SARS-CoV (2003) to SARS-CoV-2 (2019) [Dataset]. http://doi.org/10.6084/m9.figshare.28736501.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.28736501.v1
Dataset updated
Apr 5, 2025
Dataset provided by
figshare
Authors
Tahir Bhatti
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ObjectiveThe primary objective of this study was to analyze CpG dinucleotide dynamics in coronaviruses by comparing Wuhan-Hu-1 with its closest and most distant relatives. Heatmaps were generated to visualize CpG counts and O/E ratios across intergenic regions, providing a clear depiction of conserved and divergent CpG patterns.Methods1. Data CollectionSource : The dataset includes CpG counts and O/E ratios for various coronaviruses, extracted from publicly available genomic sequences.Format : Data was compiled into a CSV file containing columns for intergenic regions, CpG counts, and O/E ratios for each virus.2. PreprocessingData Cleaning :Missing values (NaN), infinite values (inf, -inf), and blank entries were handled using Python's pandas library.Missing values were replaced with column means, and infinite values were capped at a large finite value (1e9).Reshaping :The data was reshaped into matrices for CpG counts and O/E ratios using meltpandas[] and pivot[] functions.3. Distance CalculationEuclidean Distance :Pairwise Euclidean distances were calculated between Wuhan-Hu-1 and other viruses using the scipy.spatial.distance.euclidean function.Distances were computed separately for CpG counts and O/E ratios, and the total distance was derived as the sum of both metrics.4. Identification of Closest and Distant RelativesThe virus with the smallest total distance was identified as the closest relative .The virus with the largest total distance was identified as the most distant relative .5. Heatmap GenerationTools :Heatmaps were generated using Python's seaborn library (sns.heatmap) and matplotlib for visualization.Parameters :Heatmaps were annotated with numerical values for clarity.A color gradient (coolwarm) was used to represent varying CpG counts and O/E ratios.Titles and axis labels were added to describe the comparison between Wuhan-Hu-1 and its relatives.ResultsClosest Relative :The closest relative to Wuhan-Hu-1 was identified based on the smallest Euclidean distance.Heatmaps for CpG counts and O/E ratios show high similarity in specific intergenic regions.Most Distant Relative :The most distant relative was identified based on the largest Euclidean distance.Heatmaps reveal significant differences in CpG dynamics compared to Wuhan-Hu-1 .Tools and LibrariesThe following tools and libraries were used in this analysis:Programming Language :Python 3.13Libraries :pandas: For data manipulation and cleaning.numpy: For numerical operations and handling missing/infinite values.scipy.spatial.distance: For calculating Euclidean distances.seaborn: For generating heatmaps.matplotlib: For additional visualization enhancements.File Formats :Input: CSV files containing CpG counts and O/E ratios.Output: PNG images of heatmaps.Files IncludedCSV File :Contains the raw data of CpG counts and O/E ratios for all viruses.Heatmap Images :Heatmaps for CpG counts and O/E ratios comparing Wuhan-Hu-1 with its closest and most distant relatives.Python Script :Full Python code used for data processing, distance calculation, and heatmap generation.Usage NotesResearchers can use this dataset to further explore the evolutionary dynamics of CpG dinucleotides in coronaviruses.The Python script can be adapted to analyze other viral genomes or datasets.Heatmaps provide a visual summary of CpG dynamics, aiding in hypothesis generation and experimental design.AcknowledgmentsSpecial thanks to the open-source community for developing tools like pandas, numpy, seaborn, and matplotlib.This work was conducted as part of an independent research project in molecular biology and bioinformatics.LicenseThis dataset is shared under the CC BY 4.0 License , allowing others to share and adapt the material as long as proper attribution is given.DOI: 10.6084/m9.figshare.28736501
Multimodal Vision-Audio-Language Dataset
zenodo.org
data.niaid.nih.gov
Updated Jul 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Timothy Schaumlöffel; Timothy Schaumlöffel; Gemma Roig; Gemma Roig; Bhavin Choksi; Bhavin Choksi (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.10060785
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.10060785
Dataset updated
Jul 11, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Timothy Schaumlöffel; Timothy Schaumlöffel; Gemma Roig; Gemma Roig; Bhavin Choksi; Bhavin Choksi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities.
Details can be found in the attached report.
Annotation
The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow library.
The split into train, validation and test set follows the split of the original datasets.
Installation
pip install pandas pyarrow
Example
import pandas as pd
df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
dataset AudioSet
filename train/---2_BBVHAA.mp3
captions_visual [a man in a black hat and glasses.]
captions_auditory [a man speaks and dishes clank.]
tags [Speech]
Description
The annotation file consists of the following fields:

filename: Name of the corresponding file (video or audio file)
dataset: Source dataset associated with the data point
captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
captions_auditory: A list of captions related to the auditory content of the video
tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided
Data files
The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
Z
Data from: Redocking the PDB
data.niaid.nih.gov
Updated Dec 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Flachsenberg, Florian (2023). Redocking the PDB [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7579501
Explore at:
Dataset updated
Dec 6, 2023
Dataset provided by
Ehrt, Christiane
Rarey, Matthias
Gutermuth, Torben
Flachsenberg, Florian
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains supplementary data to the journal article 'Redocking the PDB' by Flachsenberg et al. (https://doi.org/10.1021/acs.jcim.3c01573)[1]. In this paper, we described two datasets: The PDBScan22 dataset with a large set of 322,051 macromolecule–ligand binding sites generally suitable for redocking and the PDBScan22-HQ dataset with 21,355 binding sites passing different structure quality filters. These datasets were further characterized by calculating properties of the ligand (e.g., molecular weight), properties of the binding site (e.g., volume), and structure quality descriptors (e.g., crystal structure resolution). Additionally, we performed redocking experiments with our novel JAMDA structure preparation and docking workflow[1] and with AutoDock Vina[2,3]. Details for all these experiments and the dataset composition can be found in the journal article[1]. Here, we provide all the datasets, i.e., the PDBScan22 and PDBScan22-HQ datasets as well as the docking results and the additionally calculated properties (for the ligand, the binding sites, and structure quality descriptors). Furthermore, we give a detailed description of their content (i.e., the data types and a description of the column values). All datasets consist of CSV files with the actual data and associated metadata JSON files describing their content. The CSV/JSON files are compliant with the CSV on the web standard (https://csvw.org/). General hints

All docking experiment results consist of two CSV files, one with general information about the docking run (e.g., was it successful?) and one with individual pose results (i.e., score and RMSD to the crystal structure). All files (except for the docking pose tables) can be indexed uniquely by the column tuple '(pdb, name)' containing the PDB code of the complex (e.g., 1gm8) and the name ligand (in the format '_', e.g., 'SOX_B_1559'). All files (except for the docking pose tables) have exactly the same number of rows as the dataset they were calculated on (e.g., PDBScan22 or PDBScan22-HQ). However, some CSV files may have missing values (see also the JSON metadata files) in some or even all columns (except for 'pdb' and 'name'). The docking pose tables also contain the 'pdb' and 'name' columns. However, these alone are not unique but only together with the 'rank' column (i.e., there might be multiple poses for each docking run or none). Example usage Using the pandas library (https://pandas.pydata.org/) in Python, we can calculate the number of protein-ligand complexes in the PDBScan22-HQ dataset with a top-ranked pose RMSD to the crystal structure ≤ 2.0 Å in the JAMDA redocking experiment and a molecular weight between 100 Da and 200 Da:

import pandas as pd df = pd.read_csv('PDBScan22-HQ.csv') df_poses = pd.read_csv('PDBScan22-HQ_JAMDA_NL_NR_poses.csv') df_properties = pd.read_csv('PDBScan22_ligand_properties.csv') merged = df.merge(df_properties, how='left', on=['pdb', 'name']) merged = merged[(merged['MW'] >= 100) & (merged['MW'] <= 200)].merge(df_poses[df_poses['rank'] == 1], how='left', on=['pdb', 'name']) nof_successful_top_ranked = (merged['rmsd_ai'] <= 2.0).sum() nof_no_top_ranked = merged['rmsd_ai'].isna().sum() Datasets

PDBScan22.csv: This is the PDBScan22 dataset[1]. This dataset was derived from the PDB4. It contains macromolecule–ligand binding sites (defined by PDB code and ligand identifier) that can be read by the NAOMI library[5,6] and pass basic consistency filters. PDBScan22-HQ.csv: This is the PDBScan22-HQ dataset[1]. It contains macromolecule–ligand binding sites from the PDBScan22 dataset that pass certain structure quality filters described in our publication[1]. PDBScan22-HQ-ADV-Success.csv: This is a subset of the PDBScan22-HQ dataset without 336 binding sites where AutoDock Vina[2,3] fails. PDBScan22-HQ-Macrocycles.csv: This is a subset of the PDBScan22-HQ dataset without 336 binding sites where AutoDock Vina[2,3] fails and only contains molecules with macrocycles with at least ten atoms. Properties for PDBScan22

PDBScan22_ligand_properties.csv: Conformation-independent properties of all ligand molecules in the PDBScan22 dataset. Properties were calculated using an in-house tool developed with the NAOMI library[5,6]. PDBScan22_StructureProfiler_quality_descriptors.csv: Structure quality descriptors for the binding sites in the PDBScan22 dataset calculated using the StructureProfiler tool[7]. PDBScan22_basic_complex_properties.csv: Simple properties of the binding sites in the PDBScan22 dataset. Properties were calculated using an in-house tool developed with the NAOMI library[5,6]. Properties for PDBScan22-HQ

PDBScan22-HQ_DoGSite3_pocket_descriptors.csv: Binding site descriptors calculated for the binding sites in the PDBScan22-HQ dataset using the DoGSite3 tool[8]. PDBScan22-HQ_molecule_types.csv: Assignment of ligands in the PDBScan22-HQ dataset (without 336 binding sites where AutoDock Vina fails) to different molecular classes (i.e., drug-like, fragment-like oligosaccharide, oligopeptide, cofactor, macrocyclic). A detailed description of the assignment can be found in our publication[1]. Docking results on PDBScan22

PDBScan22_JAMDA_NL_NR.csv: Docking results of JAMDA[1] on the PDBScan22 dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22_JAMDA_NL_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled. PDBScan22_JAMDA_NL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22 dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled. Docking results on PDBScan22-HQ

PDBScan22-HQ_JAMDA_NL_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NL_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled. PDBScan22-HQ_JAMDA_NL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled. PDBScan22-HQ_JAMDA_NL_WR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NL_WR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled. PDBScan22-HQ_JAMDA_NL_WR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled. PDBScan22-HQ_JAMDA_NW_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NW_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled. PDBScan22-HQ_JAMDA_NW_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled. PDBScan22-HQ_JAMDA_NW_WR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NW_WR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled. PDBScan22-HQ_JAMDA_NW_WR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled. PDBScan22-HQ_JAMDA_WL_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_WL_NR_poses.csv'. For this experiment, the ligand was considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled. PDBScan22-HQ_JAMDA_WL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand
Z
DIPS-Plus: The Enhanced Database of Interacting Protein Structures for...
data.niaid.nih.gov
Updated Oct 6, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jianlin Cheng (2021). DIPS-Plus: The Enhanced Database of Interacting Protein Structures for Interface Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4815266
Explore at:
Dataset updated
Oct 6, 2021
Dataset provided by
Alex Morehead
Ada Sedova
Chen Chen
Jianlin Cheng
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains replication data for the paper titled "DIPS-Plus: The Enhanced Database of Interacting Protein Structures for Interface Prediction". The dataset consists of pickled Pandas DataFrame files that can be used to train and validate protein interface prediction models. This dataset also contains the externally generated residue-level PSAIA and HH-suite3 features for users' convenience (e.g. raw MSAs and profile HMMs for each protein complex). Our GitHub repository linked in the "Additional notes" metadata section below provides more details on how we parsed through these files to create training and validation datasets. The GitHub repository for DIPS-Plus also includes scripts that can be used to impute missing feature values and convert the final "raw" complexes into DGL-compatible graph objects.
f
PANDA is able to recover information lost via adding noise to simulated...
plos.figshare.com
xls
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kimberly Glass; Curtis Huttenhower; John Quackenbush; Guo-Cheng Yuan (2023). PANDA is able to recover information lost via adding noise to simulated networks. [Dataset]. http://doi.org/10.1371/journal.pone.0064832.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0064832.t001
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Kimberly Glass; Curtis Huttenhower; John Quackenbush; Guo-Cheng Yuan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PANDA is able to recover information lost via adding noise to simulated networks.
Not seeing a result you expected?
Learn how you can add new datasets to our index.