Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Disclaimer
This is the first release of the Global Ensemble Digital Terrain Model (GEDTM30). Use for testing purposes only. A publication describing the methods used has been submitted to PeerJ and is currently under review. This work was funded by the European Union. However, the views and opinions expressed are solely those of the author(s) and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the granting authority can be held responsible for them. The data is provided "as is." The Open-Earth-Monitor project consortium, along with its suppliers and licensors, hereby disclaims all warranties of any kind, express or implied, including, without limitation, warranties of merchantability, fitness for a particular purpose, and non-infringement. Neither the Open-Earth-Monitor project consortium nor its suppliers and licensors make any warranty that the website will be error-free or that access to it will be continuous or uninterrupted. You understand that you download or otherwise obtain content or services from the website at your own discretion and risk.

Description
GEDTM30 is a 1-arc-second (~30 m) global Digital Terrain Model (DTM) generated using machine-learning-based data fusion. It was trained using a global-to-local Random Forest model with ICESat-2 and GEDI data, incorporating almost 30 billion high-quality points. For documentation, please visit the GEDTM30 GitHub repository (https://github.com/openlandmap/GEDTM30). This dataset covers the entire world and can be used for applications such as topography, hydrology, and geomorphometry analysis.

Dataset Contents
This dataset includes:
- GEDTM30: the predicted terrain height.
- Uncertainty of the GEDTM30 prediction: an uncertainty map of the terrain prediction, derived from the standard deviation of individual tree predictions in the Random Forest model.

Due to Zenodo's storage limitations, the original GEDTM30 dataset and its standard deviation map are provided via external links:
- GEDTM30 30m
- Uncertainty of GEDTM30 prediction 30m

Related Identifiers
- Landform: Slope in Degree, Geomorphons
- Light and Shadow: Positive Openness, Negative Openness, Hillshade
- Curvature: Minimal Curvature, Maximal Curvature, Profile Curvature, Tangential Curvature, Ring Curvature, Shape Index
- Local Topographic Position: Difference from Mean Elevation, Spherical Standard Deviation of the Normals
- Hydrology: Specific Catchment Area, LS Factor, Topographic Wetness Index

Data Details
- Time period: static
- Type of data: Digital Terrain Model
- How the data was collected or derived: machine learning models
- Statistical methods used: Random Forest
- Limitations or exclusions in the data: the dataset does not include data for Antarctica
- Coordinate reference system: EPSG:4326
- Bounding box (Xmin, Ymin, Xmax, Ymax): (-180, -65, 180, 85)
- Spatial resolution: 120 m
- Image size: 360,000 pixels x 178,219 lines
- File format: Cloud Optimized GeoTIFF (COG)

Layer information:
Layer | Scale | Data Type | No Data
Ensemble Digital Terrain Model | 10 | Int32 | -2,147,483,647
Standard Deviation EDTM | 100 | UInt16 | 65,535

Code Availability
The primary development of GEDTM30 is documented in the GEDTM30 GitHub repository (https://github.com/openlandmap/GEDTM30). The current version (v1) of the code is compressed and uploaded as GEDTM30-main.zip. For up-to-date development, please visit our GitHub page.
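Both layers are stored as scaled integers (see the layer table above), so raw pixel values must be divided by the layer's scale factor to recover physical units. A minimal sketch using rasterio, assuming a locally downloaded COG; the file name is illustrative and follows the naming convention described below:

import numpy as np
import rasterio  # assumes rasterio is installed

# Illustrative file name following the naming convention described below
path = "edtm_rf_m_120m_s_20000101_20231231_go_epsg.4326_v20250130.tif"

with rasterio.open(path) as src:
    raw = src.read(1)      # Int32 values: scale 10, nodata -2,147,483,647
    nodata = src.nodata

# Mask nodata, then divide by the scale factor to get elevation in metres
elevation_m = np.where(raw == nodata, np.nan, raw / 10.0)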
Support
If you discover a bug, artifact, or inconsistency, or if you have a question, please raise a GitHub issue here.

Naming convention
To ensure consistency and ease of use across and within projects, we follow the standard Ai4SoilHealth and Open-Earth-Monitor file-naming convention. The convention uses 10 fields that describe important properties of the data, so users can search files and prepare data analyses without needing to open them. For example, for edtm_rf_m_120m_s_20000101_20231231_go_epsg.4326_v20250130.tif, the fields are:
- generic variable name: edtm = ensemble digital terrain model
- variable procedure combination: rf = random forest
- position in the probability distribution / variable type: m = mean | sd = standard deviation
- spatial support: 120m
- depth reference: s = surface
- time reference begin time: 20000101 = 2000-01-01
- time reference end time: 20231231 = 2023-12-31
- bounding box: go = global
- EPSG code: EPSG:4326
- version code: v20250130 = version from 2025-01-30
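Because the fields are underscore-delimited, a file name can be parsed mechanically. A minimal sketch (the field keys below are paraphrased from the list above):

# Parse a GEDTM30-style file name into its 10 naming-convention fields
name = "edtm_rf_m_120m_s_20000101_20231231_go_epsg.4326_v20250130.tif"
keys = ["variable", "procedure", "variable_type", "spatial_support",
        "depth_reference", "time_begin", "time_end", "bounding_box",
        "epsg_code", "version"]
meta = dict(zip(keys, name.removesuffix(".tif").split("_")))
print(meta["variable"], meta["epsg_code"], meta["version"])
# -> edtm epsg.4326 v20250130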
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
Machine learning (ML) interpretability has become increasingly crucial for identifying accurate and relevant structural relationships between spatial events and the factors that explain them. Methodologically aspatial ML algorithms with apparently high predictive power ignore non-stationary domain relationships in spatio-temporal data (e.g., dependence, heterogeneity), leading to incorrect interpretations and poor management decisions. This study addresses this critical methodological issue of 'interpretability' in ML-based modeling of structural relationships, using the example of heterogeneous drivers of wildfires across the United States. Specifically, we present and evaluate a spatio-temporally interpretable random forest (iST-RF) that uses spatio-temporal sampling-based training and weighted prediction. Although the ultimate scientific objective is to derive interpretation in space-time, experiments show that iST-RF can improve predictive accuracy (76%) compared to the aspatial RF approach (70%) while enhancing interpretations of the trained model's spatio-temporal relevance for its ensemble prediction. This novel approach can help balance prediction and interpretation with fidelity in a spatial data science life cycle. However, challenges remain for predictive modeling when the dataset is very small, because in such cases a locally optimized sub-model's prediction performance can be suboptimal. With that caveat, our proposed approach is an ideal choice for identifying drivers of spatio-temporal events in country- or regional-scale studies.
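The abstract describes spatio-temporal sampling-based training with weighted ensemble prediction. As a rough illustration only (the paper's actual partitioning and weighting scheme is not reproduced here), one can train a Random Forest per spatio-temporal block and weight each sub-model's prediction by proximity to the query point; the toy data, blocking scheme, and weighting below are all hypothetical:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 600
# Toy data: columns are x coordinate, y coordinate, year, one covariate
X = np.column_stack([rng.uniform(0, 10, n), rng.uniform(0, 10, n),
                     rng.integers(2000, 2020, n), rng.normal(size=n)])
y = X[:, 3] * (1 + 0.1 * X[:, 0]) + rng.normal(scale=0.1, size=n)

# Train one sub-model per spatial block (2 x 2 tiles of the 10 x 10 domain)
models, centers = [], []
for i in range(2):
    for j in range(2):
        mask = (X[:, 0] // 5 == i) & (X[:, 1] // 5 == j)
        models.append(RandomForestRegressor(n_estimators=100, random_state=0).fit(X[mask], y[mask]))
        centers.append((i * 5 + 2.5, j * 5 + 2.5))
centers = np.array(centers)

def ist_predict(x_new):
    # Weight each block model's prediction by inverse distance from the
    # query point to the block centre (spatial only, for brevity)
    d = np.linalg.norm(centers - x_new[:2], axis=1) + 1e-9
    w = (1.0 / d) / (1.0 / d).sum()
    preds = np.array([m.predict(x_new[None, :])[0] for m in models])
    return float(w @ preds)

print(ist_predict(np.array([3.0, 7.0, 2015.0, 0.5])))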
Author contributions
A.M. conceived and designed the study, coded and performed the data processing, modeling, and interpretations, and wrote the manuscript. M.Y., P.M., D.P., and A.T. contributed to the refinement of the proposed methodology, experiments, and write-up. All authors reviewed the manuscript.
We find shark catch risk hotspots in all ocean basins, with notable high-risk areas off Southwest Africa and in the Eastern Tropical Pacific. These patterns are mostly driven by more common species such as blue sharks, though risk areas for less common, Endangered and Critically Endangered species are also identified. The clear spatial patterns of shark fishing risk identified here can be leveraged to develop spatial management strategies for threatened populations.

Sharks are susceptible to industrial longline fishing due to their slow life histories and association with targeted tuna stocks. Identifying fished areas with high shark interaction risk is vital to protect threatened species. We harmonize shark catch records from global tuna Regional Fisheries Management Organizations (tRFMOs) from 2012–2020 and use machine learning to identify where sharks are most threatened by longline fishing. Most spatial patterns are driven by more common species such as blue sharks, though risk areas for less common, Endangered and Critically Endangered species are also identified.

We built Random Forest (RF) machine learning models to estimate spatially explicit shark catch risk by longlines globally, using a suite of catch and effort data from tRFMOs, additional fishing-effort datasets (Global Fishing Watch), environmental datasets (sea surface temperature, sea surface height, chlorophyll-A), and economic datasets (ex-vessel price). More information on the exact datasets used can be found in the associated software works. For each tRFMO, we tested various spatial resolutions and shark catch units to determine the most appropriate dataset for future model runs, identified by the highest R2 for each tRFMO. Once a resolution and unit were selected for a tRFMO, the same resolution was used in subsequent model runs. We then conducted a second phase of parameter testing for combinations of the following variables: sea surface temperature (mean, or mean and coefficient of variation), chlorophyll-A (mean, or mean and coefficient of variation), sea surface height (m…).

Please refer to the associated software works for instructions on how to download the input dataset and set up your folder structure. The files saved here are the outputs of machine learning models run using publicly available tRFMO datasets. Please refer to the README files for metadata.
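As a hedged sketch of the selection step described above (not the authors' code: the function, data structure, and hyperparameters are illustrative), one can fit an RF per candidate (resolution, catch unit) combination for a tRFMO and keep the configuration with the highest cross-validated R2:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def select_best_config(datasets):
    """datasets: {(resolution, catch_unit): (X, y)} candidates for one tRFMO."""
    best, best_r2 = None, float("-inf")
    for (res, unit), (X, y) in datasets.items():
        rf = RandomForestRegressor(n_estimators=500, random_state=0)
        r2 = cross_val_score(rf, X, y, cv=5, scoring="r2").mean()
        if r2 > best_r2:
            best, best_r2 = (res, unit), r2
    return best, best_r2  # winning (resolution, unit) and its mean R^2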
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description: Welcome to the "Loan Applicant Data for Credit Risk Analysis" dataset on Kaggle! This dataset provides essential information about loan applicants and their characteristics. Your task is to develop predictive models to determine the likelihood of loan default based on these simplified features.
In today's financial landscape, assessing credit risk is crucial for lenders and financial institutions. This dataset offers a simplified view of the factors that contribute to credit risk, making it an excellent opportunity for data scientists to apply their skills in machine learning and predictive modeling.
Column Descriptions:
Explore this dataset, preprocess the data as needed, and develop machine learning models, especially using Random Forest, to predict loan default. Your insights and solutions could contribute to better credit risk assessment methods and potentially help lenders make more informed decisions.
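As a starting point, a minimal Random Forest baseline might look like the following; the file name, target column, and feature handling are hypothetical placeholders, since the column list is not reproduced above:

# Minimal Random Forest baseline for loan-default prediction
# (file name and column names are hypothetical placeholders)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("credit_risk.csv")               # hypothetical file name
y = df["default"]                                  # hypothetical target column
X = pd.get_dummies(df.drop(columns=["default"]))   # one-hot encode categoricals

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))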
Remember to respect data privacy and ethics guidelines while working with this data. Good luck, and happy analyzing!
Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD; Ik Hee Ryu, MD, MS; Tae Keun Yoo, MD; Jung Sub Kim, MD; In Sik Lee, MD, PhD; Jin Kook Kim, MD; Wakako Ando, CO; Nobuyuki Shoji, MD, PhD; Tomofusa Yamauchi, MD, PhD; Hitoshi Tabuchi, MD, PhD.
We hypothesize that machine learning of preoperative biometric data obtained by AS-OCT (anterior segment optical coherence tomography) may be clinically beneficial for predicting the actual ICL vault. Therefore, we built a machine learning model using Random Forest to predict the ICL vault after surgery.
This multicenter study comprised 1,745 eyes of 1,745 consecutive patients (656 men and 1,089 women) who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan) or at B&VIIT Eye Center (Seoul, Korea).
This data file (RFR_model(feature=12).mat) is the final trained Random Forest model, saved for MATLAB 2020a.
Python version:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Authenticate and mount Google Drive (Colab-specific)
from google.colab import auth, drive
auth.authenticate_user()
drive.mount('/content/gdrive')

# Load the dataset and inspect the first rows
dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
dataset.head()

# Target: vault at 1 month after surgery; features: all remaining columns
y = dataset['Vault_1M']
X = dataset.drop(['Vault_1M'], axis=1)

# 80/20 train/test split
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

# Random Forest hyperparameters
parameters = {'bootstrap': True,
              'min_samples_leaf': 3,
              'n_estimators': 500,
              'criterion': 'mae',
              'min_samples_split': 10,
              'max_features': 'sqrt',
              'max_depth': 6,
              'max_leaf_nodes': None}

# Train, predict on the held-out set, and extract feature importances
RF_model = RandomForestRegressor(**parameters)
RF_model.fit(train_X, train_y)
RF_predictions = RF_model.predict(test_X)
importance = RF_model.feature_importances_
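Note that scikit-learn 1.0 deprecated criterion='mae' for RandomForestRegressor in favor of criterion='absolute_error', so depending on your installed version that string may need updating.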
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data from a cluster-randomized controlled trial evaluating the effectiveness of a community-based intervention (the Konga model) for viral load suppression among children living with HIV in Simiyu, Tanzania. Children aged 2‒14 years with a viral load >1,000 copies/mL were randomly assigned to 15 treatment and 30 control clusters based on their area of residence. The intervention included adherence counseling, psychosocial support, and screening for comorbidities. Viral load was measured at baseline and 6 months later. We compared the mean viral loads of participants before and after the intervention. The 82 participants had a mean age of 9 years and a baseline median viral load of 13,150 copies/mL. After the study, the intervention group had significantly higher adherence (92%) than the control group (80%). After adjusting for baseline viral load, the intervention explained 4% of the viral load variation. This trial showed significant benefits of the Konga model. We recommend conducting similar trials elsewhere to confirm the generalizability of the intervention, so that it can be implemented more widely. Further, we believe that these data will be of interest to the readership of this repository because they increase our current understanding of the social dimensions of HIV in an African context and provide recommendations for improving HIV care, particularly for children.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.
This work focuses on surface-water-quality data from the Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.
To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).
IDW outperformed the others, achieving very good performance (Nash–Sutcliffe efficiency, NSE, greater than 0.8) in most cases.
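As a rough illustration of the winning approach (not the authors' code), IDW imputation fills a missing value at one station with a distance-weighted average of the values observed at the other stations, and NSE compares imputed values against observations; the station coordinates, power parameter, and toy values below are hypothetical:

import numpy as np

def idw_impute(values, coords, target_idx, power=2):
    # Impute the missing value at station target_idx from stations with data;
    # values: observations per station (NaN = missing); coords: (n, 2) array
    known = ~np.isnan(values)
    d = np.linalg.norm(coords[known] - coords[target_idx], axis=1)
    w = 1.0 / d**power
    return np.sum(w * values[known]) / np.sum(w)

def nse(obs, sim):
    # Nash-Sutcliffe efficiency: 1 is perfect, 0 is no better than the mean
    obs, sim = np.asarray(obs), np.asarray(sim)
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

# Toy example: impute dissolved oxygen at station 2 from stations 0, 1, 3
coords = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0], [3.0, 0.0]])  # hypothetical
do = np.array([8.1, 7.9, np.nan, 7.4])  # mg/L, NaN = missing
print(idw_impute(do, coords, target_idx=2))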
In this dataset, we include the original and imputed values for the following variables:
Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)
Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.
If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318