Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Description
Water Quality Parameters: Ammonia, BOD, DO, Orthophosphate, pH, Temperature, Nitrogen, Nitrate.
Countries/Regions: United States, Canada, Ireland, England, China.
Years Covered: 1940-2023.
Data Records: 2.82 million.

Definition of Columns
Country: Name of the water-body region.
Area: Name of the area in the region.
Waterbody Type: Type of the water-body source.
Date: Date of the sample collection (dd-mm-yyyy).
Ammonia (mg/l): Ammonia concentration.
Biochemical Oxygen Demand (BOD) (mg/l): Oxygen demand measurement.
Dissolved Oxygen (DO) (mg/l): Concentration of dissolved oxygen.
Orthophosphate (mg/l): Orthophosphate concentration.
pH (pH units): pH level of water.
Temperature (°C): Temperature in Celsius.
Nitrogen (mg/l): Total nitrogen concentration.
Nitrate (mg/l): Nitrate concentration.
CCME_Values: Calculated water quality index values using the CCME WQI model.
CCME_WQI: Water Quality Index classification based on CCME_Values.

Data Directory Description:

Category 1: Dataset
Combined Data: This folder contains two files: Combined_dataset.csv and Summary.xlsx. The Combined_dataset.csv file includes all eight water quality parameter readings across five countries, with additional data for initial preprocessing steps like missing value handling, outlier detection, and other operations. It also contains the CCME Water Quality Index calculation for empirical analysis and ML-based research. The Summary.xlsx file provides a brief description of the datasets, including data distributions (e.g., maximum, minimum, mean, standard deviation).
Combined_dataset.csv
Summary.xlsx
Country-wise Data: This folder contains separate country-based datasets in CSV files. Each file includes the eight water quality parameters for regional analysis. The Summary_country.xlsx file presents country-wise dataset descriptions with data distributions (e.g., maximum, minimum, mean, standard deviation).
England_dataset.csv
Canada_dataset.csv
USA_dataset.csv
Ireland_dataset.csv
China_dataset.csv
Summary_country.xlsx

Category 2: Code
Data processing and harmonization code (e.g., Language Conversion, Date Conversion, Parameter Naming and Unit Conversion, Missing Value Handling, WQI Measurement and Classification).
Data_Processing_Harmonnization.ipynb
The code used for Technical Validation (e.g., assessing the Data Distribution, Outlier Detection, Water Quality Trend Analysis, and Verifying the Application of the Dataset for the ML Models).
Technical_Validation.ipynb

Category 3: Data Collection Sources
This category contains links to the data collection sources that were used to create the dataset; they are provided for further reconstruction or data formation.
DataCollectionSources.xlsx

Original Paper Title: A Comprehensive Dataset of Surface Water Quality Spanning 1940-2023 for Empirical and ML Adopted Research

Abstract
Assessment and monitoring of surface water quality are essential for food security, public health, and ecosystem protection. Although water quality monitoring is a known phenomenon, little effort has been made to offer a comprehensive and harmonized dataset for surface water at the global scale. This study presents a comprehensive surface water quality dataset that preserves spatio-temporal variability, integrity, consistency, and depth of the data to facilitate empirical and data-driven evaluation, prediction, and forecasting. The dataset is assembled from a range of sources, including regional and global water quality databases, water management organizations, and individual research projects from five prominent countries in the world, namely the USA, Canada, Ireland, England, and China. The resulting dataset consists of 2.82 million measurements of eight water quality parameters that span 1940-2023. This dataset can support meta-analysis of water quality models and can facilitate Machine Learning (ML) based data and model-driven investigation of the spatial and temporal drivers and patterns of surface water quality at a cross-regional to global scale.

Note: Cite this repository and the original paper when using this dataset.
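As a rough illustration of the column definitions above, the sketch below parses a small in-memory excerpt with the dd-mm-yyyy date format and treats empty cells as missing values. The exact header strings in Combined_dataset.csv are assumed to match the column names listed in the description, and the two sample rows (including the area name) are invented purely for illustration; they are not taken from the dataset.

```python
import csv
import io
from datetime import datetime

# Hypothetical excerpt: header names are assumed to match the column
# definitions in the description; the values are invented, not real data.
SAMPLE = """Country,Area,Waterbody Type,Date,Ammonia (mg/l),pH (pH units)
Ireland,ExampleArea,River,05-03-1998,0.04,7.4
Canada,ExampleArea,Lake,17-11-2010,,8.1
"""

NUMERIC_COLUMNS = ("Ammonia (mg/l)", "pH (pH units)")


def parse_rows(text):
    """Parse CSV text into dicts with typed Date and numeric fields."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = dict(raw)
        # The description states that dates are stored as dd-mm-yyyy.
        row["Date"] = datetime.strptime(raw["Date"], "%d-%m-%Y").date()
        for key in NUMERIC_COLUMNS:
            # Empty cells are kept as None rather than guessed.
            row[key] = float(raw[key]) if raw[key] else None
        rows.append(row)
    return rows


rows = parse_rows(SAMPLE)
```

Parsing the date explicitly avoids the day/month ambiguity that automatic date inference can introduce for values such as 05-03-1998.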
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
00_Caver.tar.gz - Contains CAVER (https://caver.cz/fil/download/manual/caver_userguide.pdf) input and output files used for identification of transport pathways in LinB86 (PDB ID: 5LKA).
final_clustering
├── tunnel_custers # contains CAVER output files for individual tunnel clusters
│   ├── ...
├── analysis # contains .csv output files for bottleneck and tunnel characteristics of individual tunnel clusters
│   ├── ...
01_CaverDock_Tunnels_Profile.tar.gz - Contains tunnel clusters consisting of the top 100 tunnels and the resulting CaverDock analysis files.
# Each folder (tun_cluster_p1a, tun_cluster_p1b, tun_cluster_p2, tun_cluster_p3) contains the raw input files used for CaverDock calculations for individual snapshots of the respective tunnel clusters, named stripped_system*. The details of those files are:
- calculations/*/caverdock.conf : The config file input for the CaverDock calculation.
- calculations/*/DBE.pdbqt : Input file for the substrate DBE.
- calculations/*/stripped_system*.pdbqt : Input file for the protein/receptor.
- calculations/*/stripped_system*.dsd : Tunnel discretization file.
Notes:
- calculations/*/stripped_system*.pdb : PDB file for the tunnel.
02_Minimization_and_Equilibration.tar.gz - Contains input and output files used for AMBER minimization and equilibration of the molecular systems and seed conformations used for adaptive sampling simulations.
03_HTMD_Bulk - separate Zenodo repository, see below for the link. Contains input, output and restart files used for HTMD (High-throughput molecular dynamics) adaptive sampling simulations at 310K for Bulk schemes.
04_HTMD_Cavity - separate Zenodo repository, see below for the link. Contains input, output and restart files used for HTMD (High-throughput molecular dynamics) adaptive sampling simulations at 310K for Cavity schemes.
05_HTMD_Cavity_Bulk - separate Zenodo repository, see below for the link. Contains input, output and restart files used for HTMD (High-throughput molecular dynamics) adaptive sampling simulations at 310K for Cavity&Bulk schemes.
06_HTMD_Tunnels - separate Zenodo repository, see below for the link. Contains input, output and restart files used for HTMD (High-throughput molecular dynamics) adaptive sampling simulations at 310K for Tunnels schemes.
07_MD-Analysis.tar.gz - Contains MD analysis files obtained from 45 microseconds of adaptive sampling simulations at 310K.
# Each folder contains the raw input files used to calculate epoch convergence, distances, RMSD, RMSF, kinetics, percentages, sample proportions and tunnel lengths. The details of those files are:
- Epoch_Convergence/epochs_dist_counts.csv : Contains the counts of DBE distances from the active-site (0-5 Å), tunnel (5-19 Å), and bulk (>19 Å) for the 30 epochs.
- Distances/*/dist_s_r*.csv : Contains frame-wise distance .csv files for all studied schemes. The analyses were performed using the cpptraj program (https://amber-md.github.io/cpptraj/CPPTRAJ.xhtml). The following columns are present:
D107_OD1_DBE_C1 D107_OD2_DBE_C1 D107_OD1_DBE_C2 D107_OD2_DBE_C2 N37_ND2_DBE_Br1 N37_ND2_DBE_Br2 W108_NE1_DBE_Br1 W108_NE1_DBE_Br2 D107_COM_DBE_COM W108_COM_DBE_COM N37_COM_DBE_COM catal_COM_p1aCOM catal_COM_p1bCOM catal_COM_p2COM catal_COM_p3COM p1aCOM_DBE_COM p1bCOM_DBE_COM p2COM_DBE_COM p3COM_DBE_COM catal_COM_DBE_COM p1aCOM_p1bCOM p1aCOM_p2COM p1aCOM_p3COM p1bCOM_p2COM p1bCOM_p3COM p2COM_p3COM
- RMSD_RMSF/*/*.csv : Contains .csv files with the RMSD and RMSF of the protein residues. For RMSF, the 1st row holds the residue numbers (1-295) and the 2nd row holds the RMSF values. For RMSD, the 1st column holds the frame no. (0.1 ns) and the 2nd column holds the RMSD values. The calculations were performed using the pytraj program (https://amber-md.github.io/pytraj/latest/index.html).
Example input: pytraj.rmsd(traj, mask='1-295@CA')
Example input: pytraj.rmsf(traj, mask=':1-295', options='byres')
- COM_RMSF/.xlsx : Contains the center of mass (COM) distances calculated using the bottleneck residues for catalytic residues (N38, D108, W109), p1a (D147, F151, and V173), p1b (D147, W177, and L248), p2 (L211 and L248), and p3 (L143, F151, and I213).
- Kinetics/kinetics.txt : Contains kinetic information in .csv format for the studied schemes (Cavity, Cavity&Bulk and Tunnels) for all three replicates, together with the calculated average kon, koff and koff/kon rates.
- Percentages/.csv : Contains .csv files with the percentages of DBE localization and distances from the active-site (0-5 Å), tunnel (5-19 Å), and bulk (>19 Å).
- Tunnels_lengths/.csv : Contains tunnel_lengths.csv and Summary of tunnel lengths.xlsx. The tunnel_lengths.csv file holds the lengths of the top 100 tunnel snapshots for tunnel clusters p1a, p1b, p2 and p3. The Summary of tunnel lengths.xlsx file holds the summary of the respective tunnel clusters generated from the Caver output (for more details, see https://www.caver.cz/fil/download/manual/caver_userguide.pdf and search for the keyword "summary.txt").
- Sample_proportions/.csv : Contains the sample_proportions.csv file with the proportions (fractions) of individual metastable states obtained while performing the transition pathway analysis. For more details, see https://software.acellera.com/htmd/htmd.kinetics.html or http://www.emma-project.org/v1.2.1/api/generated/pyemma.msm.flux.pathways.html?highlight=transition%20path
- *.py : Python scripts used to build and analyse Markov state models, with a use case, and to calculate ligand distances.
- Generated_models/ : Contains models_rep[].dat files representing the metric data used to build the MSM for the respective schemes and replicates. The directories are arranged as below:
├── Cavity
│   ├── model_rep1.dat
│   ├── model_rep2.dat
│   └── model_rep3.dat
├── Cavity_Bulk
│   ├── model_rep1.dat
│   ├── model_rep2.dat
│   └── model_rep3.dat
└── Tunnels
    ├── model_rep1.dat
    ├── model_rep2.dat
    └── model_rep3.dat
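The distance ranges quoted above for the active-site (0-5 Å), tunnel (5-19 Å), and bulk (>19 Å) can be turned into a simple binning rule. The following is only a minimal sketch under stated assumptions: the function names are hypothetical, the treatment of the exact boundary values (5 Å and 19 Å) is a guess because the description does not say which side they fall on, and which distance column the authors binned is not stated.

```python
from collections import Counter


def classify_distance(distance_angstrom, active_max=5.0, tunnel_max=19.0):
    """Assign a DBE distance (in Å) to one of the three described regions.

    Assumption: the boundary values belong to the inner region; the
    source description does not specify this.
    """
    if distance_angstrom < 0:
        raise ValueError("distance must be non-negative")
    if distance_angstrom <= active_max:
        return "active-site"
    if distance_angstrom <= tunnel_max:
        return "tunnel"
    return "bulk"


def region_counts(distances):
    """Count how many distances fall into each region."""
    return Counter(classify_distance(d) for d in distances)


# Invented distances, for illustration only.
counts = region_counts([1.0, 6.5, 7.2, 30.0])
```

Applied per epoch, counts of this kind would have the same shape as the values described for epochs_dist_counts.csv, although the authors' actual procedure may differ.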
08_TransportTools.tar.gz - Contains TransportTools (TT) analysis output, log and summary files for Cavity, Cavity&Bulk and Tunnels schemes. For more details on the workflow of TT, please visit https://github.com/labbit-eu/transport_tools
results-*_rep0 # results for replicate 1 of the given scheme (for example cavity, cavity&bulk or tunnels)
├── data
│   ├── super_clusters
├── _internal
│   ├── ...
├── statistics
│   ├── ...
results-*_rep1 # results for replicate 2 of the given scheme (for example cavity, cavity&bulk or tunnels)
├── data
│   ├── super_clusters
├── _internal
│   ├── ...
├── statistics
│   ├── ...
results-*_rep2 # results for replicate 3 of the given scheme (for example cavity, cavity&bulk or tunnels)
├── data
│   ├── super_clusters
├── _internal
│   ├── ...
├── statistics
│   ├── ...
- event.csv : Contains the aggregated summary of events inferred from the 4-filtered_events_statistics.txt files in each results directory of the respective schemes.
09_MSM_states.tar.gz - Contains the Markov state model (MSM) output files for Cavity, Cavity&Bulk and Tunnels schemes and three replicates. The MSMs were generated using the pyEMMA program and the HTMD framework; for further details, see https://software.acellera.com/htmd/documentation.html. The directories look as below:
├── Bulk
│   ├── rep1 # MSM states for replicate 1
│   ├── rep2 # MSM states for replicate 2
│   ├── rep3 # MSM states for replicate 3
├── Cavity
│   ├── rep1
│   ├── rep2
│   ├── rep3
├── Cavity&Bulk
│   ├── rep1
│   ├── rep2
│   ├── rep3
├── Tunnels
│   ├── rep1
│   ├── rep2
│   ├── rep3
10_MSM_fingerprints.tar.gz - Contains the Markov state model (MSM) distances generated from the model*.pdb files in the repository directory 09_MSM_states. The distances were calculated using the cpptraj program of the AMBER18 package.
- MSM_Distances/*/rep*/*.csv : Contains .csv files for the generated MSM models (0,1,2..). The following columns (calculated distances) are present in the .csv files:
D107_OD1_DBE_C1 D107_OD2_DBE_C1 D107_OD1_DBE_C2 D107_OD2_DBE_C2 N37_ND2_DBE_Br1 N37_ND2_DBE_Br2 W108_NE1_DBE_Br1 W108_NE1_DBE_Br2 D107_COM_DBE_COM W108_COM_DBE_COM N37_COM_DBE_COM catal_COM_p1aCOM catal_COM_p1bCOM catal_COM_p2COM catal_COM_p3COM p1aCOM_DBE_COM p1bCOM_DBE_COM p2COM_DBE_COM p3COM_DBE_COM catal_COM_DBE_COM p1aCOM_p1bCOM p1aCOM_p2COM p1aCOM_p3COM p1bCOM_p2COM p1bCOM_p3COM p2COM_p3COM
11_ULS_clustering_and_transition_assignments.tar.gz - Contains files for analysis of tunnel utilization by the substrate DBE. Each folder contains two types of .csv files:
1. Transition detection of DBE, classified as Bulk (out_), Bottleneck (bt_), Unknown bottleneck (bt_unknown) or Inside (in_).
2. Characterization of tunnel utilization: Tunnel (p1a, p1b, p2, and p3), Mixed and Unknown.
# The files are arranged as below for the studied schemes Bulk, Cavity, Cavity&Bulk and Tunnels:
├── average_tunnel_utilization_per_scheme.png
├── average_tunnel_utilization.png
├── Bulk
│   ├── Bulk_run_htmd_0_combined_df.csv
│   ├── Bulk_run_htmd_0_transitions_counts.csv
│   ├── Bulk_run_htmd_1_combined_df.csv
│   ├── Bulk_run_htmd_1_transitions_counts.csv
│   ├── Bulk_run_htmd_2_combined_df.csv
│   └── Bulk_run_htmd_2_transitions_counts.csv
├── Bulk&Cavity
│   ├── Cavity&Bulk_run_htmd_0_combined_df.csv
│   ├── Cavity&Bulk_run_htmd_0_transitions_counts.csv
│   ├── Cavity&Bulk_run_htmd_1_combined_df.csv
│   ├── Cavity&Bulk_run_htmd_1_transitions_counts.csv
│   ├── Cavity&Bulk_run_htmd_2_combined_df.csv
│   └── Cavity&Bulk_run_htmd_2_transitions_counts.csv
├── Cavity
│   ├── Cavity_run_htmd_0_combined_df.csv
│
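One way the distance columns listed above might be used is to find which tunnel's center of mass lies closest to the DBE center of mass in a given row. The sketch below is an illustration under assumptions, not the authors' analysis: the helper name is hypothetical, the mapping of tunnel labels to the p1aCOM_DBE_COM, p1bCOM_DBE_COM, p2COM_DBE_COM and p3COM_DBE_COM columns is inferred from the column names, and the example row contains invented values.

```python
# Column names are copied from the list in the description; their
# interpretation as tunnel-COM to DBE-COM distances is an assumption.
TUNNEL_COLUMNS = {
    "p1a": "p1aCOM_DBE_COM",
    "p1b": "p1bCOM_DBE_COM",
    "p2": "p2COM_DBE_COM",
    "p3": "p3COM_DBE_COM",
}


def nearest_tunnel(row):
    """Return (tunnel_name, distance) for the smallest tunnel-DBE distance."""
    distances = {name: float(row[col]) for name, col in TUNNEL_COLUMNS.items()}
    name = min(distances, key=distances.get)
    return name, distances[name]


# Invented values for a single row, stored as strings as csv.DictReader would.
example_row = {
    "p1aCOM_DBE_COM": "8.1",
    "p1bCOM_DBE_COM": "10.6",
    "p2COM_DBE_COM": "3.4",
    "p3COM_DBE_COM": "12.0",
}

closest = nearest_tunnel(example_row)
```

The same helper could be applied to each row yielded by `csv.DictReader`, provided the files carry a header row with these column names, which the description suggests but does not guarantee.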
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains behavioural, psychophysiological, and demographic data collected from 193 preschool children aged 3-5 years to predict the risk of anxiety disorders. The data was originally collected for the research paper "Quantifying Risk for Anxiety Disorders in Preschool Children: A Machine Learning Approach" published in the Journal of Child Psychology and Psychiatry. This dataset can be used to develop machine learning models to quantify anxiety risk in young children.
Training Data.xlsx: Includes 130 randomly selected participants, which is about 70% of the full dataset. Testing Data.xlsx: Has the remaining 30% of participants, which is 63 children.
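For readers who want to draw a fresh split with the same sizes, a random 130/63 partition of 193 participants can be sketched as follows. This is not the procedure used to create the published files: the participant IDs, the seed, and the function name are hypothetical, and the published Training Data.xlsx and Testing Data.xlsx should be used when comparing against the original paper.

```python
import random


def split_participants(ids, n_train, seed=0):
    """Randomly partition ids into a training list and a testing list."""
    rng = random.Random(seed)  # fixed seed (hypothetical) for reproducibility
    train = set(rng.sample(sorted(ids), n_train))
    test = [i for i in ids if i not in train]
    return sorted(train), test


# Hypothetical participant IDs 1..193; the real identifiers may differ.
participant_ids = list(range(1, 194))
train_ids, test_ids = split_participants(participant_ids, n_train=130)
```

With these sizes the split is roughly 67/33, which matches the "about 70%" wording in the description.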
If you use this dataset in your research, please credit the original source: the data has been made publicly available through the Harvard Dataverse repository (Carpenter, 2016). Details on the study methodology and cohort are published in Carpenter et al., 2019 in the Journal of Child Psychology and Psychiatry. This dataset can be accessed at https://doi.org/10.7910/DVN/N42LWG and used under a CC0 1.0 Universal license that supports open scientific collaboration.