11 datasets found
  1. Learn Data Science Series Part 1

    • kaggle.com
    Updated Dec 30, 2022
    Cite
    Rupesh Kumar (2022). Learn Data Science Series Part 1 [Dataset]. https://www.kaggle.com/datasets/hunter0007/learn-data-science-part-1
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 30, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rupesh Kumar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This module contains learning material to master pandas. Please feel free to share it with others and consider supporting me if you find it helpful ⭐️.

    Overview:

    • Chapter 1: Getting started with pandas
    • Chapter 2: Analysis: Bringing it all together and making decisions
    • Chapter 3: Appending to DataFrame
    • Chapter 4: Boolean indexing of dataframes
    • Chapter 5: Categorical data
    • Chapter 6: Computational Tools
    • Chapter 7: Creating DataFrames
    • Chapter 8: Cross sections of different axes with MultiIndex
    • Chapter 9: Data Types
    • Chapter 10: Dealing with categorical variables
    • Chapter 11: Duplicated data
    • Chapter 12: Getting information about DataFrames
    • Chapter 13: Gotchas of pandas
    • Chapter 14: Graphs and Visualizations
    • Chapter 15: Grouping Data
    • Chapter 16: Grouping Time Series Data
    • Chapter 17: Holiday Calendars
    • Chapter 18: Indexing and selecting data
    • Chapter 19: IO for Google BigQuery
    • Chapter 20: JSON
    • Chapter 21: Making Pandas Play Nice With Native Python Datatypes
    • Chapter 22: Map Values
    • Chapter 23: Merge, join, and concatenate
    • Chapter 24: Meta: Documentation Guidelines
    • Chapter 25: Missing Data
    • Chapter 26: MultiIndex
    • Chapter 27: Pandas Datareader
    • Chapter 28: Pandas IO tools (reading and saving data sets)
    • Chapter 29: pd.DataFrame.apply
    • Chapter 30: Read MySQL to DataFrame
    • Chapter 31: Read SQL Server to Dataframe
    • Chapter 32: Reading files into pandas DataFrame
    • Chapter 33: Resampling
    • Chapter 34: Reshaping and pivoting
    • Chapter 35: Save pandas dataframe to a csv file
    • Chapter 36: Series
    • Chapter 37: Shifting and Lagging Data
    • Chapter 38: Simple manipulation of DataFrames
    • Chapter 39: String manipulation
    • Chapter 40: Using .ix, .iloc, .loc, .at and .iat to access a DataFrame
    • Chapter 41: Working with Time Series
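
    To give a flavour of the techniques the chapters cover, here is a minimal pandas sketch (the toy data and column names are illustrative only, not part of the dataset):

    import numpy as np
    import pandas as pd

    # Toy data (illustrative only)
    df = pd.DataFrame({
        "city": ["Berlin", "Paris", "Berlin", "Rome"],
        "temp_c": [21.0, np.nan, 19.5, 27.3],
    })

    warm = df[df["temp_c"] > 20]                              # Chapter 4: boolean indexing
    df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())   # Chapter 25: missing data
    print(df.groupby("city")["temp_c"].mean())                # Chapter 15: grouping data
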
  2. ‘Titanic: cleaned data’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Sep 30, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Titanic: cleaned data’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-titanic-cleaned-data-cbf4/dc9cd7ff/?iid=055-046&v=presentation
    Explore at:
    Dataset updated
    Sep 30, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Titanic: cleaned data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/jamesleslie/titanic-cleaned-data on 30 September 2021.

    --- Dataset description provided by original source is as follows ---

    Introduction

    This dataset was created in this notebook as part of a three-part series. The data is in machine-learning-ready format, with all missing values for the Age, Fare and Embarked columns having been imputed.

    Data imputation

    • Age: this column was imputed by using the median age for the passenger's title (Mr, Mrs, Dr etc).
    • Fare: the single missing value in this column was imputed using the median value for that passenger's class.
    • Embarked: the two missing values here were imputed using the Pandas backfill method.
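
    A minimal pandas sketch of the imputation described above (the file name is hypothetical, and the Title extraction is an assumption about how the original notebook derived titles from the Name column):

    import pandas as pd

    df = pd.read_csv("train.csv")  # hypothetical path to the raw Titanic data

    # Assumed helper: derive the title (Mr, Mrs, Dr, ...) from the Name column
    df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)

    # Age: median age for the passenger's title
    df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("median"))

    # Fare: median fare for the passenger's class
    df["Fare"] = df["Fare"].fillna(df.groupby("Pclass")["Fare"].transform("median"))

    # Embarked: pandas backfill
    df["Embarked"] = df["Embarked"].bfill()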

    Usage

    This data is used in both the second and third parts of the series.

    --- Original source retains full ownership of the source dataset ---

  3. TMDB MOVIES DATASET

    • kaggle.com
    Updated Sep 22, 2022
    Cite
    MAYUR DESAI88 (2022). TMDB MOVIES DATASET [Dataset]. https://www.kaggle.com/datasets/mayurdesai88/tmdb-movies-dataset/data
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 22, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    MAYUR DESAI88
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains data on over 10,000 TMDB movies, including the id, title, release date, average vote, vote count, overview, popularity, etc. The data was collected using the TMDB API with the requests and json libraries and converted into a DataFrame using pandas. The dataset contains some null values because some fields are missing in the TMDB database, which makes it a good exercise for a young analyst in dealing with missing values; you can also use this data to build movie recommendation systems.
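
    As an illustration of how such a table can be assembled, here is a sketch using the TMDB v3 API with requests and pandas (the endpoint, pagination, and field selection are assumptions about how the data was pulled; you need your own API key):

    import pandas as pd
    import requests

    API_KEY = "YOUR_TMDB_API_KEY"  # assumption: a personal TMDB v3 API key
    url = "https://api.themoviedb.org/3/movie/popular"

    rows = []
    for page in range(1, 6):  # fetch a few pages for illustration
        resp = requests.get(url, params={"api_key": API_KEY, "page": page})
        resp.raise_for_status()
        rows.extend(resp.json()["results"])

    cols = ["id", "title", "release_date", "vote_average", "vote_count", "overview", "popularity"]
    df = pd.DataFrame(rows)[cols]
    print(df.isna().sum())  # inspect the missing values mentioned above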

  4. Klib library python

    • kaggle.com
    Updated Jan 11, 2021
    Cite
    Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python/discussion
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sripaad Srinivasan
    Description

    The klib library enables us to quickly visualize missing data, perform data cleaning, and visualize data distributions, correlations, and categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).

    Original GitHub repo

    Header image: https://raw.githubusercontent.com/akanz1/klib/main/examples/images/header.png

    Usage

    !pip install klib
    
    import klib
    import pandas as pd
    
    df = pd.DataFrame(data)  # 'data' stands for your own dataset, e.g. loaded via pd.read_csv(...)
    
    # klib.describe - functions for visualizing datasets
    klib.cat_plot(df)         # returns a visualization of the number and frequency of categorical features
    klib.corr_mat(df)         # returns a color-encoded correlation matrix
    klib.corr_plot(df)        # returns a color-encoded heatmap, ideal for correlations
    klib.dist_plot(df)        # returns a distribution plot for every numeric feature
    klib.missingval_plot(df)  # returns a figure containing information about missing values
    

    Examples

    Take a look at this starter notebook.

    Further examples, as well as applications of the functions can be found here.

    Contributing

    Pull requests and ideas, especially for further functions, are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this GitHub repo.

    License

    MIT

  5. Covid19 Cleaned Data

    • kaggle.com
    Updated Apr 10, 2020
    Cite
    Prashant Patel (2020). Covid19 Cleaned Data [Dataset]. https://www.kaggle.com/prashant268/covid-clean/tasks
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 10, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Prashant Patel
    Description

    This is the cleaned data for COVID-19 forecasting with some important variables, e.g. average temperature and the median age of the country. I have used the following dataset for information about each country and filled any missing values using Wikipedia and pandas: https://www.kaggle.com/koryto/countryinfo. Feel free to use this data and upvote if it is useful.
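
    A small sketch of the kind of gap-filling described (the file and column names are hypothetical; the linked countryinfo dataset defines the actual schema):

    import pandas as pd

    country = pd.read_csv("countryinfo.csv")  # hypothetical export of the linked dataset

    # Fill gaps left after manual research (e.g. Wikipedia) with a simple column-level fallback
    country["medianage"] = country["medianage"].fillna(country["medianage"].median())
    country["avgtemp"] = country["avgtemp"].fillna(country["avgtemp"].mean())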

  6. Extirpated species in Berlin, dates of last detections, habitats, and number...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jul 9, 2024
    + more versions
    Cite
    Silvia Keinath (2024). Extirpated species in Berlin, dates of last detections, habitats, and number of Berlin’s inhabitants [Dataset]. http://doi.org/10.5061/dryad.n5tb2rc4k
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Museum für Naturkunde
    Authors
    Silvia Keinath
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Berlin
    Description

    Species loss is highly scale-dependent, following the species-area relationship. We analysed spatio-temporal patterns of species’ extirpation on a multitaxonomic level using Berlin, the capital city of Germany. Berlin is one of the largest cities in Europe and has experienced a strong urbanisation trend since the late 19th century. We expected species’ extirpation to be exceptionally high due to the long history of urbanisation. Analysing regional Red Lists of Threatened Plants, Animals, and Fungi of Berlin (covering 9498 species), we found that 16 % of species were extirpated, a rate 5.9 times higher than at the German scale, and 47.1 times higher than at the European scale. Species’ extirpation in Berlin is comparable to that of another German city with a similarly broad taxonomic coverage, but much higher than in regional areas with less human impact. The documentation of species’ extirpation started in the 18th century and is well documented for the 19th and 20th centuries. We found an average annual extirpation of 3.6 species in the 19th century, 9.6 species in the 20th century, and the same number of extirpated species as in the 19th century were documented in the 21st century, despite the much shorter time period. Our results showed that species’ extirpation is higher at small than at large spatial scales, and might be negatively influenced by urbanisation, with different effects on different taxonomic groups and habitats. Over time, we found that species’ extirpation is highest during periods of high human alterations and is negatively affected by the number of people living in the city. But there is still a lack of data to decouple the size of the area and the human impact of urbanisation. However, cities might be suitable systems for studying species’ extirpation processes due to their small scale and human impact.

    Methods

    Data extraction: To determine the proportion of extirpated species for Germany, we manually summarised the numbers of species classified in category 0 ‘extinct or extirpated’ and calculated the percentage in relation to the total number of species listed in the Red Lists of Threatened Species for Germany, taken from the website of the Red List Centre of Germany (Rote Liste Zentrum, 2024a). For Berlin, we used the 37 current Red Lists of Threatened Plants, Animals, and Fungi from the city-state of Berlin, covering the years from 2004 to 2023, taken from the official capital city portal of the Berlin Senate Department for Mobility, Transport, Climate Protection and Environment (SenMVKU, 2024a; see overview of Berlin Red Lists used in Table 1). We extracted all species that are listed as extinct/extirpated, i.e. classified in category 0, and additionally, if available, the date of the last record of the species in Berlin. The Red List of macrofungi of the order Boletales by Schmidt (2017) was not included in our study, as this Red List has only been compiled once in the frame of a pilot project and therefore lacks the category 0 ‘extinct or extirpated’. We used Python, version 3.7.9 (Van Rossum and Drake, 2009), the Python libraries Pandas (McKinney et al., 2010), and Camelot-py, version 0.11.0 (Vinayak Meta, 2023) in Jupyter Lab, version 4.0.6 (Project Jupyter, 2016) notebooks. In the first step, we created a metadata table of the Red Lists of Berlin to keep track of the extraction process, maintain the source reference links, and store summarised data from each Red List pdf file.
    During the extraction of each file, a data row was added to the metadata table, which was updated throughout the rest of the process. In the second step, we identified the page range for extraction for each extracted Red List file. The extraction mechanism for each Red List file depended on the printed table layout. We extracted tables with lined rows with the Lattice parsing method (Camelot-py, 2024a), and tables with alternating-coloured rows with the Stream method (Camelot-py, 2024b). To check the consistency of the extraction, we used the Camelot-py accuracy report along with the Pandas data frame shape property (Pandas, 2024). After initial data cleaning for consistent column counts and missing data, we filtered the data for species in category 0 only. We collated the data frames together and exported them as a CSV file. In a further step, we checked whether the filtered data tallied with the summary tables given in each Red List. Finally, we cleaned each Red List table to contain the species, the current hazard level (category 0), the date of the species’ last detection in Berlin, and the reference (codes and data available at: Github, 2023). When no date of last detection was given for a species, we contacted the authors of the respective Red Lists and/or used former Red Lists to find information on species’ last detections (Burger et al., 1998; Saure et al., 1998; 1999; Braasch et al., 2000; Saure, 2000).

    Determination of the recording time windows of the Berlin Red Lists: We determined the time windows that the Berlin Red Lists look back on from their methodologies. If the information was missing in the current Red Lists, we consulted the previous version (see all detailed time windows of the earliest assessments with references in Table B2 in Appendix B).

    Data classification: For the analyses of the percentage of species in the different hazard levels, we used the German Red List categories as described in detail by Saure and Schwarz (2005) and Ludwig et al. (2009). These are: prewarning list, endangered (category 3), highly endangered (category 2), threatened by extinction or extirpation (category 1), and extinct or extirpated (category 0). To determine the number of indigenous unthreatened species in each Red List, we subtracted the number of species in the five categories and the number of non-indigenous species (neobiota) from the total number of species in each Red List. For further analyses, we pooled the taxonomic groups of the 37 Red Lists into more broadly defined taxonomic groups: plants, lichens, fungi, algae, mammals, birds, amphibians, reptiles, fish and lampreys, molluscs, and arthropods (see categorisation in Table 1). We categorised slime fungi (Myxomycetes including Ceratiomyxomycetes) as ‘fungi’, even though they are more closely related to animals, because slime fungi are traditionally studied by mycologists (Schmidt and Täglich, 2023). We classified ‘lichens’ in a separate category, rather than in ‘fungi’, as they are a symbiotic community of fungi and algae (Krause et al., 2017). For analyses of the percentage of extirpated species of each pooled taxonomic group, we set the number of extirpated species in relation to the sum of the number of unthreatened species, species in the prewarning list, and species in the categories one to three. We further categorised the extirpated species according to the habitats in which they occurred. We therefore categorised terrestrial species as ‘terrestrial’ and aquatic species as ‘aquatic’.
    Amphibians and dragonflies have life stages in both terrestrial and aquatic habitats and were categorised as ‘terrestrial/aquatic’. We also categorised plants and mosses as ‘terrestrial/aquatic’ if they depend on wetlands (see all habitat categories for each species in Table C1 in Appendix C). The available data on the species’ last detection in Berlin ranged from a specific year to a period of time spanning up to a century. If a year of last detection was given with the auxiliary ‘around’ or ‘circa’, we used the given year for temporal classification in further analyses. If a year of last detection was given with the auxiliary ‘before’ or ‘after’, we assumed that the nearest year of last detection was given and categorised the species in the respective century. In this case, we used the species for temporal analyses by centuries only, not across years. If only a timeframe was given as the date of last detection, we used the respective species for temporal analyses between centuries only. We further classified all of the extirpated species by the century in which they were last detected: 17th century (1601-1700); 18th century (1701-1800); 19th century (1801-1900); 20th century (1901-2000); 21st century (2001-now) (see all data on species’ last detection in Table C1 in Appendix C). For analyses of the effects of the number of inhabitants on species’ extirpation in Berlin, we used species that were extirpated between the years 1920 and 2012, because Berlin was expanded to ‘Groß-Berlin’ in 1920 (Buesch and Haus, 1987), roughly corresponding to the city’s current area. Therefore, we included the number of Berlin’s inhabitants for every year a species was last detected (Statistische Jahrbücher der Stadt Berlin, 1920, 1924-1998, 2000; see all data on the number of inhabitants for each year of species’ last detection in Table C1 in Appendix C). Materials and Methods from Keinath et al. (2024): 'High levels of species’ extirpation in an urban environment – A case study from Berlin, Germany, covering 1700-2023'.
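
    A minimal sketch of the Camelot-py/pandas extraction step described in the methods (the file name, page range, and column layout are hypothetical; each Red List PDF required its own settings):

    import camelot
    import pandas as pd

    # Lattice flavor for tables with lined rows; use flavor="stream" for alternating-coloured rows
    tables = camelot.read_pdf("red_list_example.pdf", pages="10-25", flavor="lattice")
    print(tables[0].parsing_report)  # accuracy report used to check extraction consistency

    df = pd.concat([t.df for t in tables], ignore_index=True)
    print(df.shape)  # DataFrame shape property as a second consistency check

    # Keep only species in category 0 ('extinct or extirpated'); column names are hypothetical
    df.columns = ["species", "category", "last_record"]
    extirpated = df[df["category"] == "0"]
    extirpated.to_csv("extirpated_species.csv", index=False)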

  7. CpG Signature Profiling and Heatmap Visualization of SARS-CoV Genomes:...

    • figshare.com
    txt
    Updated Apr 5, 2025
    Cite
    Tahir Bhatti (2025). CpG Signature Profiling and Heatmap Visualization of SARS-CoV Genomes: Tracing the Genomic Divergence From SARS-CoV (2003) to SARS-CoV-2 (2019) [Dataset]. http://doi.org/10.6084/m9.figshare.28736501.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Apr 5, 2025
    Dataset provided by
    figshare
    Authors
    Tahir Bhatti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective

    The primary objective of this study was to analyze CpG dinucleotide dynamics in coronaviruses by comparing Wuhan-Hu-1 with its closest and most distant relatives. Heatmaps were generated to visualize CpG counts and O/E ratios across intergenic regions, providing a clear depiction of conserved and divergent CpG patterns.

    Methods

    1. Data collection
    • Source: The dataset includes CpG counts and O/E ratios for various coronaviruses, extracted from publicly available genomic sequences.
    • Format: Data was compiled into a CSV file containing columns for intergenic regions, CpG counts, and O/E ratios for each virus.

    2. Preprocessing
    • Data cleaning: Missing values (NaN), infinite values (inf, -inf), and blank entries were handled using Python's pandas library. Missing values were replaced with column means, and infinite values were capped at a large finite value (1e9).
    • Reshaping: The data was reshaped into matrices for CpG counts and O/E ratios using the pandas melt() and pivot() functions.

    3. Distance calculation
    • Euclidean distance: Pairwise Euclidean distances were calculated between Wuhan-Hu-1 and other viruses using the scipy.spatial.distance.euclidean function. Distances were computed separately for CpG counts and O/E ratios, and the total distance was derived as the sum of both metrics.

    4. Identification of closest and distant relatives
    • The virus with the smallest total distance was identified as the closest relative.
    • The virus with the largest total distance was identified as the most distant relative.

    5. Heatmap generation
    • Tools: Heatmaps were generated using Python's seaborn library (sns.heatmap) and matplotlib for visualization.
    • Parameters: Heatmaps were annotated with numerical values for clarity. A color gradient (coolwarm) was used to represent varying CpG counts and O/E ratios. Titles and axis labels were added to describe the comparison between Wuhan-Hu-1 and its relatives.

    Results
    • Closest relative: The closest relative to Wuhan-Hu-1 was identified based on the smallest Euclidean distance. Heatmaps for CpG counts and O/E ratios show high similarity in specific intergenic regions.
    • Most distant relative: The most distant relative was identified based on the largest Euclidean distance. Heatmaps reveal significant differences in CpG dynamics compared to Wuhan-Hu-1.

    Tools and libraries
    • Programming language: Python 3.13
    • Libraries: pandas (data manipulation and cleaning), numpy (numerical operations and handling missing/infinite values), scipy.spatial.distance (Euclidean distances), seaborn (heatmaps), matplotlib (additional visualization enhancements)
    • File formats: input, CSV files containing CpG counts and O/E ratios; output, PNG images of heatmaps

    Files included
    • CSV file: Contains the raw data of CpG counts and O/E ratios for all viruses.
    • Heatmap images: Heatmaps for CpG counts and O/E ratios comparing Wuhan-Hu-1 with its closest and most distant relatives.
    • Python script: Full Python code used for data processing, distance calculation, and heatmap generation.

    Usage notes

    Researchers can use this dataset to further explore the evolutionary dynamics of CpG dinucleotides in coronaviruses. The Python script can be adapted to analyze other viral genomes or datasets. Heatmaps provide a visual summary of CpG dynamics, aiding in hypothesis generation and experimental design.

    Acknowledgments

    Special thanks to the open-source community for developing tools like pandas, numpy, seaborn, and matplotlib. This work was conducted as part of an independent research project in molecular biology and bioinformatics.

    License

    This dataset is shared under the CC BY 4.0 License, allowing others to share and adapt the material as long as proper attribution is given. DOI: 10.6084/m9.figshare.28736501
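
    To make the pipeline concrete, here is a minimal sketch of the preprocessing, distance, and heatmap steps (the file and column names are assumptions; the released CSV and Python script define the real ones):

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from scipy.spatial.distance import euclidean

    df = pd.read_csv("cpg_counts_oe.csv")  # hypothetical file name

    # Cleaning: cap infinite values at 1e9, replace NaN with column means
    df = df.replace([np.inf, -np.inf], 1e9)
    df = df.fillna(df.mean(numeric_only=True))

    # Assumed long format (one row per region/virus pair); pivot to a region x virus matrix
    counts = df.pivot(index="region", columns="virus", values="cpg_count")

    # Euclidean distance of every virus to Wuhan-Hu-1
    ref = counts["Wuhan-Hu-1"]
    dist = {v: euclidean(ref, counts[v]) for v in counts.columns if v != "Wuhan-Hu-1"}
    print("closest:", min(dist, key=dist.get), "| most distant:", max(dist, key=dist.get))

    # Annotated heatmap with the coolwarm gradient
    sns.heatmap(counts, annot=True, cmap="coolwarm")
    plt.title("CpG counts across intergenic regions")
    plt.show()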

  8. Multimodal Vision-Audio-Language Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jul 11, 2024
    Cite
    Timothy Schaumlöffel; Timothy Schaumlöffel; Gemma Roig; Gemma Roig; Bhavin Choksi; Bhavin Choksi (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.10060785
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Timothy Schaumlöffel; Timothy Schaumlöffel; Gemma Roig; Gemma Roig; Bhavin Choksi; Bhavin Choksi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities.

    Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries.

    The split into train, validation and test sets follows the splits of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset                                        AudioSet
    filename                          train/---2_BBVHAA.mp3
    captions_visual     [a man in a black hat and glasses.]
    captions_auditory      [a man speaks and dishes clank.]
    tags                                           [Speech]

    Description

    The annotation file consists of the following fields:

    • filename: Name of the corresponding file (video or audio file)
    • dataset: Source dataset associated with the data point
    • captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    • captions_auditory: A list of captions related to the auditory content of the video
    • tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided
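
    For example, the NaN cases mentioned above can be filtered out directly; a small sketch using the fields listed (the AudioSet filter is just an illustration):

    import pandas as pd

    df = pd.read_parquet("annotation_train.parquet", engine="pyarrow")

    # Keep only AudioSet rows that have both visual and auditory captions
    subset = df[(df["dataset"] == "AudioSet")
                & df["captions_visual"].notna()
                & df["captions_auditory"].notna()]
    print(len(subset))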

    Data files

    The raw data files for most datasets are not released due to licensing issues and must be downloaded from the source. However, because some files are missing at the source, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  9. Data from: Redocking the PDB

    • data.niaid.nih.gov
    Updated Dec 6, 2023
    Cite
    Flachsenberg, Florian (2023). Redocking the PDB [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7579501
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Ehrt, Christiane
    Rarey, Matthias
    Gutermuth, Torben
    Flachsenberg, Florian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains supplementary data to the journal article 'Redocking the PDB' by Flachsenberg et al. (https://doi.org/10.1021/acs.jcim.3c01573)[1]. In this paper, we described two datasets: The PDBScan22 dataset with a large set of 322,051 macromolecule–ligand binding sites generally suitable for redocking and the PDBScan22-HQ dataset with 21,355 binding sites passing different structure quality filters. These datasets were further characterized by calculating properties of the ligand (e.g., molecular weight), properties of the binding site (e.g., volume), and structure quality descriptors (e.g., crystal structure resolution). Additionally, we performed redocking experiments with our novel JAMDA structure preparation and docking workflow[1] and with AutoDock Vina[2,3]. Details for all these experiments and the dataset composition can be found in the journal article[1]. Here, we provide all the datasets, i.e., the PDBScan22 and PDBScan22-HQ datasets as well as the docking results and the additionally calculated properties (for the ligand, the binding sites, and structure quality descriptors). Furthermore, we give a detailed description of their content (i.e., the data types and a description of the column values). All datasets consist of CSV files with the actual data and associated metadata JSON files describing their content. The CSV/JSON files are compliant with the CSV on the web standard (https://csvw.org/).

    General hints

    • All docking experiment results consist of two CSV files: one with general information about the docking run (e.g., was it successful?) and one with individual pose results (i.e., score and RMSD to the crystal structure).
    • All files (except for the docking pose tables) can be indexed uniquely by the column tuple '(pdb, name)' containing the PDB code of the complex (e.g., 1gm8) and the name of the ligand (in the format '_', e.g., 'SOX_B_1559').
    • All files (except for the docking pose tables) have exactly the same number of rows as the dataset they were calculated on (e.g., PDBScan22 or PDBScan22-HQ). However, some CSV files may have missing values (see also the JSON metadata files) in some or even all columns (except for 'pdb' and 'name').
    • The docking pose tables also contain the 'pdb' and 'name' columns. However, these alone are not unique, but only together with the 'rank' column (i.e., there might be multiple poses for each docking run, or none).

    Example usage

    Using the pandas library (https://pandas.pydata.org/) in Python, we can calculate the number of protein-ligand complexes in the PDBScan22-HQ dataset with a top-ranked pose RMSD to the crystal structure ≤ 2.0 Å in the JAMDA redocking experiment and a molecular weight between 100 Da and 200 Da:

    import pandas as pd

    df = pd.read_csv('PDBScan22-HQ.csv')
    df_poses = pd.read_csv('PDBScan22-HQ_JAMDA_NL_NR_poses.csv')
    df_properties = pd.read_csv('PDBScan22_ligand_properties.csv')

    merged = df.merge(df_properties, how='left', on=['pdb', 'name'])
    merged = merged[(merged['MW'] >= 100) & (merged['MW'] <= 200)].merge(
        df_poses[df_poses['rank'] == 1], how='left', on=['pdb', 'name'])

    nof_successful_top_ranked = (merged['rmsd_ai'] <= 2.0).sum()
    nof_no_top_ranked = merged['rmsd_ai'].isna().sum()

    Datasets

    • PDBScan22.csv: This is the PDBScan22 dataset[1]. This dataset was derived from the PDB[4]. It contains macromolecule–ligand binding sites (defined by PDB code and ligand identifier) that can be read by the NAOMI library[5,6] and pass basic consistency filters.
    • PDBScan22-HQ.csv: This is the PDBScan22-HQ dataset[1]. It contains macromolecule–ligand binding sites from the PDBScan22 dataset that pass certain structure quality filters described in our publication[1].
    • PDBScan22-HQ-ADV-Success.csv: This is a subset of the PDBScan22-HQ dataset without 336 binding sites where AutoDock Vina[2,3] fails.
    • PDBScan22-HQ-Macrocycles.csv: This is a subset of the PDBScan22-HQ dataset without 336 binding sites where AutoDock Vina[2,3] fails and only contains molecules with macrocycles with at least ten atoms.

    Properties for PDBScan22

    • PDBScan22_ligand_properties.csv: Conformation-independent properties of all ligand molecules in the PDBScan22 dataset. Properties were calculated using an in-house tool developed with the NAOMI library[5,6].
    • PDBScan22_StructureProfiler_quality_descriptors.csv: Structure quality descriptors for the binding sites in the PDBScan22 dataset calculated using the StructureProfiler tool[7].
    • PDBScan22_basic_complex_properties.csv: Simple properties of the binding sites in the PDBScan22 dataset. Properties were calculated using an in-house tool developed with the NAOMI library[5,6].

    Properties for PDBScan22-HQ

    • PDBScan22-HQ_DoGSite3_pocket_descriptors.csv: Binding site descriptors calculated for the binding sites in the PDBScan22-HQ dataset using the DoGSite3 tool[8].
    • PDBScan22-HQ_molecule_types.csv: Assignment of ligands in the PDBScan22-HQ dataset (without 336 binding sites where AutoDock Vina fails) to different molecular classes (i.e., drug-like, fragment-like, oligosaccharide, oligopeptide, cofactor, macrocyclic). A detailed description of the assignment can be found in our publication[1].

    Docking results on PDBScan22

    • PDBScan22_JAMDA_NL_NR.csv: Docking results of JAMDA[1] on the PDBScan22 dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22_JAMDA_NL_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
    • PDBScan22_JAMDA_NL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22 dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.

    Docking results on PDBScan22-HQ

    • PDBScan22-HQ_JAMDA_NL_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NL_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
    • PDBScan22-HQ_JAMDA_NL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
    • PDBScan22-HQ_JAMDA_NL_WR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NL_WR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
    • PDBScan22-HQ_JAMDA_NL_WR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
    • PDBScan22-HQ_JAMDA_NW_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NW_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
    • PDBScan22-HQ_JAMDA_NW_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
    • PDBScan22-HQ_JAMDA_NW_WR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NW_WR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
    • PDBScan22-HQ_JAMDA_NW_WR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
    • PDBScan22-HQ_JAMDA_WL_NR.csv: Docking results of JAMDA[1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_WL_NR_poses.csv'. For this experiment, the ligand was considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
    • PDBScan22-HQ_JAMDA_WL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA[1] on the PDBScan22-HQ dataset. For this experiment, the ligand was considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.

  10. DIPS-Plus: The Enhanced Database of Interacting Protein Structures for...

    • data.niaid.nih.gov
    Updated Oct 6, 2021
    + more versions
    Cite
    Jianlin Cheng (2021). DIPS-Plus: The Enhanced Database of Interacting Protein Structures for Interface Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4815266
    Explore at:
    Dataset updated
    Oct 6, 2021
    Dataset provided by
    Alex Morehead
    Ada Sedova
    Chen Chen
    Jianlin Cheng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains replication data for the paper titled "DIPS-Plus: The Enhanced Database of Interacting Protein Structures for Interface Prediction". The dataset consists of pickled Pandas DataFrame files that can be used to train and validate protein interface prediction models. This dataset also contains the externally generated residue-level PSAIA and HH-suite3 features for users' convenience (e.g. raw MSAs and profile HMMs for each protein complex). Our GitHub repository linked in the "Additional notes" metadata section below provides more details on how we parsed through these files to create training and validation datasets. The GitHub repository for DIPS-Plus also includes scripts that can be used to impute missing feature values and convert the final "raw" complexes into DGL-compatible graph objects.
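
    A minimal sketch of loading one of the pickled DataFrame files with pandas (the file name is hypothetical; see the linked GitHub repository for the real layout and the imputation and graph-conversion scripts):

    import pandas as pd

    # Hypothetical file name; each pickled file deserializes back into a pandas DataFrame
    df = pd.read_pickle("complex_example.pkl")
    print(df.shape)
    print(df.columns)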

  11. PANDA is able to recover information lost via adding noise to simulated...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Kimberly Glass; Curtis Huttenhower; John Quackenbush; Guo-Cheng Yuan (2023). PANDA is able to recover information lost via adding noise to simulated networks. [Dataset]. http://doi.org/10.1371/journal.pone.0064832.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Kimberly Glass; Curtis Huttenhower; John Quackenbush; Guo-Cheng Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PANDA is able to recover information lost via adding noise to simulated networks.
