Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AWC to 60cm is one of 18 soil attributes chosen to underpin the land suitability assessment of the Roper River Water Resource Assessment (ROWRA) through the digital soil mapping (DSM) process. AWC (available water capacity) indicates the ability of a soil to retain and supply water for plant growth. This raster data represents a modelled dataset of AWC to 60cm (mm of water to 60cm of soil depth) and is derived from analysed site data, spline calculations and environmental covariates. AWC is a parameter used in land suitability assessments for rainfed cropping and for water use efficiency in irrigated land uses. This raster data provides improved soil information used to underpin and identify opportunities, and to promote detailed investigation, for a range of sustainable regional development options; it was created within the ‘Land Suitability’ activity of the CSIRO ROWRA. A companion dataset and statistics reflecting the reliability of this data are also provided and are described in the lineage section of this metadata record. Processing information is supplied in ranger R scripts, and attributes were modelled using a Random Forest approach. The DSM process is described in the CSIRO ROWRA published report ‘Soils and land suitability for the Roper catchment, Northern Territory’, a technical report from the CSIRO Roper River Water Resource Assessment to the Government of Australia. The Roper River Water Resource Assessment provides a comprehensive overview and integrated evaluation of the feasibility of aquaculture and agriculture development in the Roper catchment (NT), as well as of the ecological, social and cultural (Indigenous water values, rights and aspirations) impacts of development.
Lineage: This AWC to 60cm dataset has been generated from a range of inputs and processing steps; an overview follows. For more information refer to the CSIRO ROWRA published reports, in particular ‘Soils and land suitability for the Roper catchment, Northern Territory’.
1. Collated existing data (relating to soils, climate, topography, natural resources and remote sensing, in various formats: reports, spatial vector, spatial raster, etc.).
2. Selected additional soil and land attribute site data locations by a conditioned Latin hypercube statistical sampling method applied across the covariate data space.
3. Carried out fieldwork to collect new attribute data and soil samples for analysis, and to build an understanding of geomorphology and landscape processes.
4. Performed database analysis to extract the data meeting the specific selection criteria required for the attribute to be modelled.
5. Used the R statistical programming environment for the attribute computing. Models were built from selected input data and covariate data using predictive learning from a Random Forest approach implemented in the ranger R package.
6. Created the AWC to 60cm Digital Soil Mapping (DSM) attribute raster dataset. DSM data is a geo-referenced dataset generated from field observations and laboratory data, coupled with environmental covariate data through quantitative relationships. It applies pedometrics: the use of mathematical and statistical models that combine information from soil observations with information contained in correlated environmental variables, remote sensing images and some geophysical measurements.
7. Produced companion predicted reliability data from the 500 individual Random Forest attribute models created.
8. Conducted quality assessment (QA) of this DSM attribute data by three methods. Method 1: statistical (quantitative) assessment of the model and input data. The quality of the DSM models was tested using data withheld from model computations and expressed as out-of-bag (OOB) and R-squared results, giving an estimate of the reliability of the model predictions. These results are supplied. Method 2: statistical (quantitative) assessment of the spatial attribute output data, presented as a raster of the attribute’s “reliability”. The 500 individual trees of the attribute’s RF model were used to generate 500 datasets of the attribute, from which model reliability was estimated; for continuous attributes the reliability measure is the coefficient of variation. This data is supplied. Method 3: independent external validation site data combined with on-ground expert (qualitative) examination of outputs during validation field trips. Across each of the study areas a two-week validation field trip was conducted using a new validation site set produced by a random sampling design based on conditioned Latin hypercube sampling of the attribute’s reliability data. The modelled DSM attribute value was assessed against the actual on-ground value. These results are published in the report cited in this metadata record.
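As a rough illustration of steps 5 to 8 for a continuous attribute such as AWC, the sketch below fits a 500-tree Random Forest with ranger and derives a coefficient-of-variation reliability layer from the per-tree predictions. It is a minimal sketch only, not the CSIRO processing script: the input file names and the column name awc60 are hypothetical placeholders.

library(ranger)

# Hypothetical site table: one row per sampled location, with the target
# attribute (awc60, mm of water to 60cm) and the environmental covariates.
sites <- read.csv("sites_with_covariates.csv")

# Step 5: 500-tree Random Forest; OOB error and R-squared support Method 1 QA.
rf <- ranger(awc60 ~ ., data = sites, num.trees = 500)
rf$prediction.error   # OOB mean squared error
rf$r.squared          # OOB R-squared

# Steps 6-7: predict with every tree at new locations (a covariate raster
# stack unstacked to a data frame) and summarise the 500 predictions.
grid <- read.csv("covariate_grid.csv")   # placeholder for the raster stack
per_tree <- predict(rf, data = grid, predict.all = TRUE)$predictions
attribute_map <- rowMeans(per_tree)                    # DSM attribute value
reliability <- apply(per_tree, 1, sd) / attribute_map  # coefficient of variation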
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
# Analysis and Figure Generation Scripts for the Paper "Differences between Neurodivergent and Neurotypical Software Engineers: Analyzing the 2022 Stack Overflow Survey"
This repository contains the necessary R scripts and session data to re-run all tests reported in the paper, as well as generate the figures we use (and a few others we did not use).
## Files
- SO_analysis_EASE25.R: The main R script executing all other parts. Running this script will calculate the p-values of all tests, sorted by condition (lines 10 to 602); these are the same as reported in Tables 1 to 7 of the paper. Additionally, figures will be generated (lines 603 onwards).
- SO_data_filtered_sampled.RData: A RData file containing the filtered and sampled data and functions necessary to run the tests/generate the figures.
- 01_SO_preprocessing.R and 02_SO_random_sampling.R: Scripts containing the filtering and sampling logic. If these files are executed within SO_analysis_EASE25.R instead of loading the RData file (lines 5 to 8), the filtering and sampling are re-run (see the sketch after this list). Note that this results in different random samples, and therefore different results than in our paper. The effect size calculations then have to be adjusted to the results that are significant (lines 596 to 601).
- generatedGraphs.zip: The graph files generated by the main R script.
- paperFigures.zip: The graphs we used in the paper. These are the generated graphs, but edited for readability.
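A minimal usage sketch (run from the repository root, with the files above in place):

```r
# Reproduce the paper's numbers: the main script loads the shipped RData file
# (lines 5 to 8), runs the tests (lines 10 to 602), and generates the figures
# (lines 603 onwards).
source("SO_analysis_EASE25.R")

# To re-run filtering and sampling instead, edit lines 5 to 8 of the main
# script so that it sources the two scripts below rather than loading the
# RData file. Note: this draws a new random sample, so results will differ.
# source("01_SO_preprocessing.R")
# source("02_SO_random_sampling.R")
```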
## License
The Public 2022 Stack Overflow Developer Survey Results is made available under the Open Database License (ODbL): http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/.
We hereby attribute Stack Overflow as the source of these results. All data we make available (i.e., the RData file) is a derivative work and, as such, is also shared under the ODbL.
File descriptions:
Village_level_calculations.R — calculates village-level metrics.
Parcels_sample.txt — a random sample of 37,295 parcels from 25,000 unique locations, drawn from a complete data set of 65,201 unique locations.
Shikoku_Voronoi_map.R — code to generate the Voronoi map “Figure_11_Interactive_map_of_Iyo.html”.
Shikoku_Voronoi_data.txt — data for Shikoku_Voronoi_map.R.
gadm40_JPN_shp — folder of shapefiles for Shikoku_Voronoi_map.R.
Domain_Simpson_complete.txt — complete domain-level data for logit calculations, based on all 65,201 locations and 97,553 parcels.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Soil rockiness is one of 18 soil attributes chosen to underpin the land suitability assessment of the Roper River Water Resource Assessment (ROWRA) through the digital soil mapping (DSM) process. Soil rockiness represents areas that are excluded from agricultural production due to the abundance and size of rock outcrop, surface coarse fragments, profile coarse fragments and hard segregations. This raster data represents a modelled dataset of a set of rules applied to the above features for the top 0.10m of soil and is derived from field-measured site data and environmental covariates. Data values are: 0 Not rocky, 1 Rocky. Descriptions of the rules defining rockiness are supplied with this data. Rockiness is a parameter used in land suitability assessments because it determines the intensity of rock picking required in land preparation, hinders root crop harvesting, reduces crop growth, and impedes the use of agricultural machinery, particularly in the plough zone. This raster data provides improved soil information used to underpin and identify opportunities, and to promote detailed investigation, for a range of sustainable regional development options; it was created within the ‘Land Suitability’ activity of the CSIRO ROWRA. A companion dataset and statistics reflecting the reliability of this data are also provided and are described in the lineage section of this metadata record. Processing information is supplied in ranger R scripts, and attributes were modelled using a Random Forest approach. The DSM process is described in the CSIRO ROWRA published report ‘Soils and land suitability for the Roper catchment, Northern Territory’, a technical report from the CSIRO Roper River Water Resource Assessment to the Government of Australia. The Roper River Water Resource Assessment provides a comprehensive overview and integrated evaluation of the feasibility of aquaculture and agriculture development in the Roper catchment (NT), as well as of the ecological, social and cultural (Indigenous water values, rights and aspirations) impacts of development.
Lineage: This soil rockiness dataset has been generated from a range of inputs and processing steps; an overview follows. For more information refer to the CSIRO ROWRA published reports, in particular ‘Soils and land suitability for the Roper catchment, Northern Territory’.
1. Collated existing data (relating to soils, climate, topography, natural resources and remote sensing, in various formats: reports, spatial vector, spatial raster, etc.).
2. Selected additional soil and land attribute site data locations by a conditioned Latin hypercube statistical sampling method applied across the covariate data space.
3. Carried out fieldwork to collect new attribute data and soil samples for analysis, and to build an understanding of geomorphology and landscape processes.
4. Performed database analysis to extract the data meeting the specific selection criteria required for the attribute to be modelled.
5. Used the R statistical programming environment for the attribute computing. Models were built from selected input data and covariate data using predictive learning from a Random Forest approach implemented in the ranger R package.
6. Created the soil rockiness Digital Soil Mapping (DSM) attribute raster dataset. DSM data is a geo-referenced dataset generated from field observations and laboratory data, coupled with environmental covariate data through quantitative relationships. It applies pedometrics: the use of mathematical and statistical models that combine information from soil observations with information contained in correlated environmental variables, remote sensing images and some geophysical measurements.
7. Produced companion predicted reliability data from the 500 individual Random Forest attribute models created.
8. Conducted quality assessment (QA) of this DSM attribute data by three methods. Method 1: statistical (quantitative) assessment of the model and input data. The quality of the DSM models was tested using data withheld from model computations and expressed as out-of-bag (OOB) and confusion matrix results, giving an estimate of the reliability of the model predictions. These results are supplied. Method 2: statistical (quantitative) assessment of the spatial attribute output data, presented as a raster of the attribute’s “reliability”. The 500 individual trees of the attribute’s RF model were used to generate 500 datasets of the attribute, from which model reliability was estimated; for categorical attributes the reliability measure is the confusion index. This data is supplied. Method 3: independent external validation site data combined with on-ground expert (qualitative) examination of outputs during validation field trips. Across each of the study areas a two-week validation field trip was conducted using a new validation site set produced by a random sampling design based on conditioned Latin hypercube sampling of the attribute’s reliability data. The modelled DSM attribute value was assessed against the actual on-ground value. These results are published in the report cited in this metadata record.
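For a categorical attribute such as rockiness, Method 2 reports a confusion index rather than a coefficient of variation. Below is a minimal sketch of one common two-class formulation (one minus the margin between the class vote proportions across the 500 trees); the file and column names are placeholders, and the exact index used in the ROWRA processing scripts may differ.

library(ranger)

sites <- read.csv("rockiness_sites.csv")   # placeholder input
sites$rocky <- factor(sites$rocky)         # 0 = Not rocky, 1 = Rocky

rf <- ranger(rocky ~ ., data = sites, num.trees = 500)
rf$prediction.error    # OOB misclassification rate (Method 1)
rf$confusion.matrix    # OOB confusion matrix (Method 1)

grid  <- read.csv("covariate_grid.csv")
votes <- predict(rf, data = grid, predict.all = TRUE)$predictions  # factor-level indices
p1 <- rowMeans(votes == 1)   # proportion of trees voting for the first level
# Two-class confusion index: 0 = all trees agree, 1 = an even split.
confusion_index <- 1 - abs(p1 - (1 - p1))
predicted_class <- levels(sites$rocky)[ifelse(p1 >= 0.5, 1, 2)]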
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains supplementary datasets generated during the machine learning–assisted bibliometric workflow for metabolomics and phytochemical research. The datasets represent sequential outputs derived from the integration and harmonisation of bibliographic metadata from Scopus, Web of Science (WoS), and Dimensions, processed via R and Python environments. The datasets were produced through distinct workflow stages:
Dataset 1A (merged_dataset2.xlsx): Consolidated metadata produced in R from the merged raw bibliographic exports of Scopus, WoS, and Dimensions.
Dataset 1B (sampled_data.xlsx): A stratified random sample generated in Python for pretraining and manual annotation.
Dataset 1C (sample_data_pretrained.xlsx): Annotated sample dataset manually screened according to inclusion and exclusion criteria.
Dataset 1D (highlighted_full_data_with_predictions.xlsx): The complete harmonised dataset automatically classified using the trained XGBoost model.
Dataset 1E (absolute_metabolomics_data.xlsx): Final curated dataset of relevant records extracted from the ML-filtered corpus.
Importantly, the file names of each dataset presented here were renamed from their original Google Drive file paths (referenced in the Python Google Colab scripts) to ensure sequential, descriptive, and logically ordered naming. This adjustment enhances clarity, reproducibility, and cross-reference consistency across all linked repositories.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Soil surface salinity is one of 18 soil attributes chosen to underpin the land suitability assessment of the Victoria River Water Resource Assessment (VIWRA) through the digital soil mapping (DSM) process. Soil salinity represents the salt content of the soil. This raster data represents a modelled dataset of salinity at the soil surface and is derived from field-measured and laboratory-analysed site data and environmental covariates. Data values are: 1 Surface salinity absent, 2 Surface salinity present. Soil surface salinity is a parameter used in land suitability assessments as it hinders seed establishment and retards plant growth. This raster data provides improved soil information used to underpin and identify opportunities, and to promote detailed investigation, for a range of sustainable regional development options; it was created within the ‘Land Suitability’ activity of the CSIRO VIWRA. A companion dataset and statistics reflecting the reliability of this data are also provided and are described in the lineage section of this metadata record. Processing information is supplied in ranger R scripts, and attributes were modelled using a Random Forest approach. The DSM process is described in the CSIRO VIWRA published report ‘Soils and land suitability for the Victoria catchment, Northern Territory’, a technical report from the CSIRO Victoria River Water Resource Assessment to the Government of Australia. The Victoria River Water Resource Assessment provides a comprehensive overview and integrated evaluation of the feasibility of aquaculture and agriculture development in the Victoria catchment (NT), as well as of the ecological, social and cultural (Indigenous water values, rights and aspirations) impacts of development.
Lineage: The soil surface salinity dataset has been generated from a range of inputs and processing steps; an overview follows. For more information refer to the CSIRO VIWRA published reports, in particular ‘Soils and land suitability for the Victoria catchment, Northern Territory’.
1. Collated existing data (relating to soils, climate, topography, natural resources and remote sensing, in various formats: reports, spatial vector, spatial raster, etc.).
2. Selected additional soil and land attribute site data locations by a conditioned Latin hypercube statistical sampling method applied across the covariate data space.
3. Carried out fieldwork to collect new attribute data and soil samples for analysis, and to build an understanding of geomorphology and landscape processes.
4. Performed database analysis to extract the data meeting the specific selection criteria required for the attribute to be modelled.
5. Used the R statistical programming environment for the attribute computing. Models were built from selected input data and covariate data using predictive learning from a Random Forest approach implemented in the ranger R package.
6. Created the soil surface salinity Digital Soil Mapping (DSM) attribute raster dataset. DSM data is a geo-referenced dataset generated from field observations and laboratory data, coupled with environmental covariate data through quantitative relationships. It applies pedometrics: the use of mathematical and statistical models that combine information from soil observations with information contained in correlated environmental variables, remote sensing images and some geophysical measurements.
7. Produced companion predicted reliability data from the 500 individual Random Forest attribute models created.
8. Conducted quality assessment (QA) of this DSM attribute data by three methods. Method 1: statistical (quantitative) assessment of the model and input data. The quality of the DSM models was tested using data withheld from model computations and expressed as out-of-bag (OOB) and confusion matrix results, giving an estimate of the reliability of the model predictions. These results are supplied. Method 2: statistical (quantitative) assessment of the spatial attribute output data, presented as a raster of the attribute’s “reliability”. The 500 individual trees of the attribute’s RF model were used to generate 500 datasets of the attribute, from which model reliability was estimated; for categorical attributes the reliability measure is the confusion index. This data is supplied. Method 3: independent external validation site data combined with on-ground expert (qualitative) examination of outputs during validation field trips. Across each of the study areas a two-week validation field trip was conducted using a new validation site set produced by a random sampling design based on conditioned Latin hypercube sampling of the attribute’s reliability data. The modelled DSM attribute value was assessed against the actual on-ground value. These results are published in the report cited in this metadata record.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This artifact bundles the five dataset archives used in our private federated clustering evaluation, corresponding to the real-world benchmarks, scaling experiments, ablation studies, and timing performance tests described in the paper. The real_datasets.tar.xz archive includes ten established clustering benchmarks drawn from UCI and the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7); scale_datasets.tar.xz contains the SynthNew family, generated via the R clusterGeneration package to assess scalability; ablate_datasets.tar.xz holds the AblateSynth sets, which vary cluster separation for ablation analysis and are also generated with clusterGeneration; g2_datasets.tar.xz packages the G2 sets (Gaussian clusters of size 2048 across dimensions 2-1024, with two clusters each), collected from the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7); and timing_datasets.tar.xz includes the real s1 and lsun datasets alongside TimeSynth files (balanced synthetic clusters for timing), as per Mohassel et al.'s experimental framework.
Contains ten real-world benchmark datasets, formatted as one sample per line with space-separated features:
iris.txt: 150 samples, 4 features, 3 classes; classic UCI Iris dataset for petal/sepal measurements.
lsun.txt: 400 samples, 2 features, 3 clusters; two-dimensional variant of the LSUN dataset for clustering experiments.
s1.txt: 5,000 samples, 2 features, 15 clusters; synthetic benchmark from Fränti’s S1 series.
house.txt: 1,837 samples, 3 features, 3 clusters; housing data transformed for clustering tasks.
adult.txt: 48,842 samples, 6 features, 3 clusters; UCI Census Income (“Adult”) dataset for income bracket prediction.
wine.txt: 178 samples, 13 features, 3 cultivars; UCI Wine dataset with chemical analysis features.
breast.txt: 569 samples, 9 features, 2 classes; Wisconsin Diagnostic Breast Cancer dataset.
yeast.txt: 1,484 samples, 8 features, 10 localization sites; yeast protein localization data.
mnist.txt: 10,000 samples, 784 features (28×28 pixels), 10 digit classes; MNIST handwritten digits.
birch2.txt: a random 25,000-sample subset of the 100,000 samples, 2 features, 100 clusters; synthetic BIRCH2 dataset for high-cluster-count evaluation.
Holds the SynthNew_{k}_{d}_{s}.txt files for scaling experiments, where:
$k \in \{2,4,8,16,32\}$ is the number of clusters,
$d \in \{2,4,8,16,32,64,128,256,512\}$ is the dimensionality,
$s \in \{1,2,3\}$ are different random seeds.
These are generated with the R clusterGeneration package, with cluster sizes following a $1:2:\dots:k$ ratio. We incorporate a random number (in $[0, 100]$) of randomly sampled outliers and set the cluster separation degrees randomly in $[0.16, 0.26]$, spanning partially overlapping to separated clusters. A generation sketch follows.
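For illustration, a hedged sketch of how one such file could be generated with clusterGeneration; the exact parameterisation used for SynthNew lives in the paper's scripts, and the base cluster size of 100 below is an assumption.

library(clusterGeneration)

k <- 8; d <- 32; s <- 1   # clusters, dimensionality, seed
set.seed(s)
genRandomClust(
  numClust     = k,
  numNonNoisy  = d,                      # number of (non-noise) dimensions
  sepVal       = runif(1, 0.16, 0.26),   # partially overlapping .. separated
  numOutlier   = sample(0:100, 1),       # random number of outliers
  clustszind   = 3,                      # user-specified cluster sizes
  clustSizes   = 100 * seq_len(k),       # 1:2:...:k size ratio
  numReplicate = 1,
  fileName     = sprintf("SynthNew_%d_%d_%d", k, d, s)
)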
Contains the AblateSynth_{k}_{d}_{sep}.txt files for ablation studies, with:
$k \in \{2,4,8,16\}$ clusters,
$d \in \{2,4,8,16\}$ dimensions,
$sep \in \{0.25, 0.5, 0.75\}$ controlling cluster separation degrees.
Also generated via clusterGeneration.
Packages the G2 synthetic sets (g2-{dim}-{var}.txt) from the clustering-data benchmarks:
$N=2048$ samples, $k=2$ Gaussian clusters,
Dimensions $d \in \{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024\}$
Includes:
s1.txt, lsun.txt: two real datasets for baseline timing.
timesynth_{k}_{d}_{n}.txt: synthetic timing datasets with balanced cluster sizes ($C_{avg} = N/k$), varying:
$k \in \{2,5\}$
$d \in \{2,5\}$
$N \in \{10000, 100000\}$
Generated similarly to the scaling sets, following Mohassel et al.'s timing experiment protocol.
Usage:
Unpack any archive with tar -xJf, e.g. tar -xJf real_datasets.tar.xz
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Soil erodibility is one of 18 soil attributes chosen to underpin the land suitability assessment of the Roper River Water Resource Assessment (ROWRA) through the digital soil mapping (DSM) process. Soil erodibility is used to indicate the potential susceptibility of soil to erosion. This soil erodibility raster data represents a modelled dataset of the k-factor (rate of runoff not included), calculated on a scale between 0.0 and 0.1, and is derived from measured and analysed site data, calculations and environmental covariates. Soil erodibility is a parameter used in land suitability assessments to identify areas where water erosion could be a risk causing soil loss (land degradation) and productivity decline; it is applied in combination with slope categories. This raster data provides improved soil information used to underpin and identify opportunities, and to promote detailed investigation, for a range of sustainable regional development options; it was created within the ‘Land Suitability’ activity of the CSIRO ROWRA. A companion dataset and statistics reflecting the reliability of this data are also provided and are described in the lineage section of this metadata record. Processing information is supplied in ranger R scripts, and attributes were modelled using a Random Forest approach. The DSM process is described in the CSIRO ROWRA published report ‘Soils and land suitability for the Roper catchment, Northern Territory’, a technical report from the CSIRO Roper River Water Resource Assessment to the Government of Australia. The Roper River Water Resource Assessment provides a comprehensive overview and integrated evaluation of the feasibility of aquaculture and agriculture development in the Roper catchment (NT), as well as of the ecological, social and cultural (Indigenous water values, rights and aspirations) impacts of development.
Lineage: This soil erodibility dataset has been generated from a range of inputs and processing steps; an overview follows. For more information refer to the CSIRO ROWRA published reports, in particular ‘Soils and land suitability for the Roper catchment, Northern Territory’.
1. Collated existing data (relating to soils, climate, topography, natural resources and remote sensing, in various formats: reports, spatial vector, spatial raster, etc.).
2. Selected additional soil and land attribute site data locations by a conditioned Latin hypercube statistical sampling method applied across the covariate data space.
3. Carried out fieldwork to collect new attribute data and soil samples for analysis, and to build an understanding of geomorphology and landscape processes.
4. Performed database analysis to extract the data meeting the specific selection criteria required for the attribute to be modelled.
5. Used the R statistical programming environment for the attribute computing. Models were built from selected input data and covariate data using predictive learning from a Random Forest approach implemented in the ranger R package.
6. Created the soil erodibility Digital Soil Mapping (DSM) attribute raster dataset. DSM data is a geo-referenced dataset generated from field observations and laboratory data, coupled with environmental covariate data through quantitative relationships. It applies pedometrics: the use of mathematical and statistical models that combine information from soil observations with information contained in correlated environmental variables, remote sensing images and some geophysical measurements.
7. Produced companion predicted reliability data from the 500 individual Random Forest attribute models created.
8. Conducted quality assessment (QA) of this DSM attribute data by three methods. Method 1: statistical (quantitative) assessment of the model and input data. The quality of the DSM models was tested using data withheld from model computations and expressed as out-of-bag (OOB) and R-squared results, giving an estimate of the reliability of the model predictions. These results are supplied. Method 2: statistical (quantitative) assessment of the spatial attribute output data, presented as a raster of the attribute’s “reliability”. The 500 individual trees of the attribute’s RF model were used to generate 500 datasets of the attribute, from which model reliability was estimated; for continuous attributes the reliability measure is the coefficient of variation. This data is supplied. Method 3: independent external validation site data combined with on-ground expert (qualitative) examination of outputs during validation field trips. Across each of the study areas a two-week validation field trip was conducted using a new validation site set produced by a random sampling design based on conditioned Latin hypercube sampling of the attribute’s reliability data. The modelled DSM attribute value was assessed against the actual on-ground value. These results are published in the report cited in this metadata record.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AWC to 150cm is one of 18 soil attributes chosen to underpin the land suitability assessment of the Roper River Water Resource Assessment (ROWRA) through the digital soil mapping (DSM) process. AWC (available water capacity) indicates the ability of a soil to retain and supply water for plant growth. This raster data represents a modelled dataset of AWC to 150cm (mm of water to 150cm of soil depth) and is derived from analysed site data, spline calculations and environmental covariates. AWC is a parameter used in land suitability assessments for rainfed cropping and for water use efficiency in irrigated land uses. This raster data provides improved soil information used to underpin and identify opportunities, and to promote detailed investigation, for a range of sustainable regional development options; it was created within the ‘Land Suitability’ activity of the CSIRO ROWRA. A companion dataset and statistics reflecting the reliability of this data are also provided and are described in the lineage section of this metadata record. Processing information is supplied in ranger R scripts, and attributes were modelled using a Random Forest approach. The DSM process is described in the CSIRO ROWRA published report ‘Soils and land suitability for the Roper catchment, Northern Territory’, a technical report from the CSIRO Roper River Water Resource Assessment to the Government of Australia. The Roper River Water Resource Assessment provides a comprehensive overview and integrated evaluation of the feasibility of aquaculture and agriculture development in the Roper catchment (NT), as well as of the ecological, social and cultural (Indigenous water values, rights and aspirations) impacts of development.
Lineage: This AWC to 150cm dataset has been generated from a range of inputs and processing steps; an overview follows. For more information refer to the CSIRO ROWRA published reports, in particular ‘Soils and land suitability for the Roper catchment, Northern Territory’.
1. Collated existing data (relating to soils, climate, topography, natural resources and remote sensing, in various formats: reports, spatial vector, spatial raster, etc.).
2. Selected additional soil and land attribute site data locations by a conditioned Latin hypercube statistical sampling method applied across the covariate data space.
3. Carried out fieldwork to collect new attribute data and soil samples for analysis, and to build an understanding of geomorphology and landscape processes.
4. Performed database analysis to extract the data meeting the specific selection criteria required for the attribute to be modelled.
5. Used the R statistical programming environment for the attribute computing. Models were built from selected input data and covariate data using predictive learning from a Random Forest approach implemented in the ranger R package.
6. Created the AWC to 150cm Digital Soil Mapping (DSM) attribute raster dataset. DSM data is a geo-referenced dataset generated from field observations and laboratory data, coupled with environmental covariate data through quantitative relationships. It applies pedometrics: the use of mathematical and statistical models that combine information from soil observations with information contained in correlated environmental variables, remote sensing images and some geophysical measurements.
7. Produced companion predicted reliability data from the 500 individual Random Forest attribute models created.
8. Conducted quality assessment (QA) of this DSM attribute data by three methods. Method 1: statistical (quantitative) assessment of the model and input data. The quality of the DSM models was tested using data withheld from model computations and expressed as out-of-bag (OOB) and R-squared results, giving an estimate of the reliability of the model predictions. These results are supplied. Method 2: statistical (quantitative) assessment of the spatial attribute output data, presented as a raster of the attribute’s “reliability”. The 500 individual trees of the attribute’s RF model were used to generate 500 datasets of the attribute, from which model reliability was estimated; for continuous attributes the reliability measure is the coefficient of variation. This data is supplied. Method 3: independent external validation site data combined with on-ground expert (qualitative) examination of outputs during validation field trips. Across each of the study areas a two-week validation field trip was conducted using a new validation site set produced by a random sampling design based on conditioned Latin hypercube sampling of the attribute’s reliability data. The modelled DSM attribute value was assessed against the actual on-ground value. These results are published in the report cited in this metadata record.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
File List
glmmeg.R: R code demonstrating how to fit a logistic regression model, with a random intercept term, to randomly generated overdispersed binomial data.
boot.glmm.R: R code for estimating P-values by applying the bootstrap to a GLMM likelihood ratio statistic.
Description
glmmeg.R is example R code which shows how to fit a logistic regression model (with or without a random effects term) and use diagnostic plots to check the fit. The code is run on some randomly generated data, which are generated in such a way that overdispersion is evident. This code could be directly applied to your own analyses if you read into R a data.frame called “dataset” with columns labelled “success” and “failure” (for the number of binomial successes and failures) and “species” (a label for the different rows in the dataset), where we want to test for the effect of some predictor variable called “location”. In other cases, just change the labels and formula as appropriate. boot.glmm.R extends glmmeg.R by using bootstrapping to calculate P-values in a way that provides better control of Type I error in small samples. It accepts data in the same form as that generated in glmmeg.R.
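A minimal sketch of the model described above, assuming the lme4 package (the archived scripts may structure the call differently):

library(lme4)

dataset <- read.csv("your_data.csv")   # columns: success, failure, species, location

# Logistic regression with an observation-level random intercept ("species"
# labels the rows), which absorbs the overdispersion.
fit <- glmer(cbind(success, failure) ~ location + (1 | species),
             family = binomial, data = dataset)
summary(fit)
plot(fit)   # fitted values vs. residuals as a quick diagnostic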
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the data used to generate the results of the simulation studies conducted in Fouodo et al. 2023. Each dataset is an R object in RDS format containing 100 lists. Each list element records the parameters used to generate the data, followed by the simulated data itself. The R code used to conduct the simulations in the manuscript is available in the accompanying git repository; the files 01-data-only.R and 02-data-only.R contain the functions and further details on how the data were simulated for studies 1 and 2.
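A minimal sketch for inspecting one of the files (the file name here is a hypothetical placeholder):

sims <- readRDS("study1-data.rds")   # hypothetical file name
length(sims)     # 100 lists
str(sims[[1]])   # generation parameters, followed by the simulated data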
This file contains the data set used to develop a random forest model that predicts background specific conductivity for stream segments in the contiguous United States. This Excel-readable file contains 56 columns of parameters evaluated during development. The data dictionary provides the definitions of the abbreviations and the measurement units. Each row is a unique sample described as R**, which indicates the NHD Hydrologic Unit (underscore), up to a 7-digit COMID (underscore), and the sequential sample month.
To develop models that make stream-specific predictions across the contiguous United States, we used the StreamCat data set and process (Hill et al. 2016; https://github.com/USEPA/StreamCat). The StreamCat data set is based on a network of stream segments from NHD+ (McKay et al. 2012). These stream segments drain an average area of 3.1 km2 and thus define the spatial grain size of this data set. The data set consists of minimally disturbed sites representing the natural variation in environmental conditions that occur in the contiguous 48 United States. More than 2.4 million SC observations were obtained from STORET (USEPA 2016b), state natural resource agencies, the U.S. Geological Survey (USGS) National Water Information System (NWIS) (USGS 2016), and data used in Olson and Hawkins (2012) (Table S1). Data include observations made between 1 January 2001 and 31 December 2015, coincident with Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data (https://modis.gsfc.nasa.gov/data/). Each observation was related to the nearest stream segment in the NHD+. Data were limited to one observation per stream segment per month; SC observations with ambiguous locations and repeat measurements along a stream segment in the same month were discarded.
Using estimates of anthropogenic stress derived from the StreamCat database (Hill et al. 2016), segments were selected with minimal amounts of human activity (Stoddard et al. 2006) using criteria developed for each Level II Ecoregion (Omernik and Griffith 2014). Segments were considered potentially minimally stressed where watersheds had 0-0.5% impervious surface, 0-5% urban, 0-10% agriculture, and population densities from 0.8-30 people/km2 (Table S3). Watersheds whose observations had large residuals in initial models were identified and inspected for evidence of other human activities not represented in StreamCat (e.g., mining, logging, grazing, or oil/gas extraction). Observations were removed from disturbed watersheds and from watersheds with a tidal influence or unusual geologic conditions such as hot springs. About 5% of SC observations in each National Rivers and Stream Assessment (NRSA) region were then randomly selected as independent validation data; the remaining observations became the large training data set for model calibration. This dataset is associated with the following publication: Olson, J., and S. Cormier. Modeling spatial and temporal variation in natural background specific conductivity. Environmental Science & Technology 53(8): 4316-4325 (2019).
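A rough sketch of the validation split and model fit described above; sc_obs and its column names are illustrative placeholders, not the published code.

library(randomForest)

sc_obs <- read.csv("sc_observations.csv")   # one row per segment-month sample

# Hold out ~5% of observations within each NRSA region as validation data.
set.seed(42)
by_region <- split(seq_len(nrow(sc_obs)), sc_obs$nrsa_region)
val_idx   <- unlist(lapply(by_region, function(i) sample(i, ceiling(0.05 * length(i)))))
validation <- sc_obs[val_idx, ]
training   <- sc_obs[-val_idx, ]

# Random forest predicting background specific conductivity (sc) from the
# natural watershed predictors (StreamCat, MODIS, etc.).
predictors <- setdiff(names(training), c("sc", "nrsa_region"))
rf <- randomForest(reformulate(predictors, response = "sc"), data = training)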
The primary article (cited below under "Related works") introduces social work researchers to discrete choice experiments (DCEs) for studying stakeholder preferences. The article includes an online supplement with a worked example demonstrating DCE design and analysis with realistic simulated data. The worked example focuses on caregivers' priorities in choosing treatment for children with attention deficit hyperactivity disorder. This dataset includes the scripts (and, in some cases, Excel files) that we used to identify appropriate experimental designs, simulate population and sample data, estimate sample size requirements for the multinomial logit (MNL, also known as conditional logit) and random parameter logit (RPL) models, estimate parameters using the MNL and RPL models, and analyze attribute importance, willingness to pay, and predicted uptake. It also includes the associated data files (experimental designs, data generation parameters, simulated population data and parameters, ...).
In the worked example, we used simulated data to examine caregiver preferences for 7 treatment attributes (medication administration, therapy location, school accommodation, caregiver behavior training, provider communication, provider specialty, and monthly out-of-pocket costs) identified by dosReis and colleagues in a previous DCE. We employed an orthogonal design with 1 continuous variable (cost) and 12 dummy-coded variables (representing the levels of the remaining attributes, which were categorical). Using the parameter estimates published by dosReis et al., with slight adaptations, we simulated utility values for a population of 100,000 people, then selected a sample of 500 for analysis. Relying on random utility theory, we used the mlogit package in R to estimate the MNL and RPL models, using 5,000 Halton draws for simulated maximum likelihood estimation of the RPL model. In addition to estimating the utility parameters, we measured the relative importance of each attribute, esti...
# Data from: How to Use Discrete Choice Experiments to Capture Stakeholder Preferences in Social Work Research
This dataset supports the worked example in:
Ellis, A. R., Cryer-Coupet, Q. R., Weller, B. E., Howard, K., Raghunandan, R., & Thomas, K. C. (2024). How to use discrete choice experiments to capture stakeholder preferences in social work research. Journal of the Society for Social Work and Research. Advance online publication. https://doi.org/10.1086/731310
The referenced article introduces social work researchers to discrete choice experiments (DCEs) for studying stakeholder preferences. In a DCE, researchers ask participants to complete a series of choice tasks: hypothetical situations in which each participant is presented with alternative scenarios and selects one or more. For example, social work researchers may want to know how parents and other caregivers pr...
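A minimal sketch of the estimation step described in this record, assuming the mlogit package as stated above; the data frame dce and its variable names are illustrative placeholders (the actual scripts ship with the dataset):

library(mlogit)

# dce: long-format choice data, one row per alternative within a choice task,
# with a logical 'choice' column, dummy-coded attribute columns, and cost.
dce <- read.csv("dce_long.csv")   # placeholder input
dce_ml <- dfidx(dce, idx = c("task_id", "alternative"), choice = "choice")

# Multinomial (conditional) logit.
mnl <- mlogit(choice ~ cost + med_admin + therapy_loc | 0, data = dce_ml)

# Random parameter logit with normally distributed coefficients, estimated by
# simulated maximum likelihood with 5,000 Halton draws.
rpl <- mlogit(choice ~ cost + med_admin + therapy_loc | 0, data = dce_ml,
              rpar = c(med_admin = "n", therapy_loc = "n"),
              R = 5000, halton = NA)
summary(rpl)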
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository includes two datasets, a Document-Term Matrix and associated metadata, for 17,493 New York Times articles covering protest events, both saved as single R objects.
These datasets are based on the original Dynamics of Collective Action (DoCA) dataset (Wang and Soule 2012; Earl, Soule, and McCarthy). The original DoCA dataset contains variables for protest events referenced in roughly 19,676 New York Times articles reporting on collective action events occurring in the US between 1960 and 1995. Data were collected as part of the Dynamics of Collective Action Project at Stanford University. Research assistants read every page of all daily issues of the New York Times to find descriptions of 23,624 distinct protest events. The text of the news articles was not included in the original DoCA data.
We attempted to recollect the raw text in a semi-supervised fashion by matching article titles to create the Dynamics of Collective Action Corpus. In addition to hand-checking random samples and hand-collecting some articles (specifically, in the case of false positives), we also used some automated matching processes to ensure the recollected article titles matched their respective titles in the DoCA dataset. The final number of recollected and matched articles is 17,493.
We then subset the original DoCA dataset to include only rows that match a recollected article. The file "20231006_dca_metadata_subset.Rdata" contains all of the metadata variables from the original DoCA dataset (see Codebook), with the addition of "pdf_file" (used to link to the original article PDFs) and "pub_title" (the title of the recollected article, which may differ from the "title" variable in the original dataset), for a total of 106 variables and 21,126 rows (noting that a row is a distinct protest event, and one article may cover more than one protest event).
Once collected, we prepared these texts using typical preprocessing procedures (and some less typical procedures, which were necessary given that these were OCRed texts). We followed these steps in this order: we removed headers and footers that were consistent across all digitized stories and any web links or HTML; added a single space before an uppercase letter when it was flush against a lowercase letter to its right (e.g., turning "JohnKennedy" into "John Kennedy"); removed excess whitespace; converted all characters to the broadest range of Latin characters and then transliterated to "Basic Latin" ASCII characters; replaced curly quotes with their ASCII counterparts; replaced contractions (e.g., turned "it's" into "it is"); removed punctuation; removed capitalization; removed numbers; fixed word kerning; and applied a final extra round of whitespace removal.
We then tokenized the texts, following the rule that each word is a character string surrounded by a single space; at this step, each document is a list of tokens. We counted each unique token to create a document-term matrix (DTM), where each row is an article, each column is a unique token (occurring at least once in the corpus as a whole), and each cell is the number of times each token occurred in each article. Finally, we removed words (i.e., columns in the DTM) that occurred fewer than four times in the corpus as a whole or were only a single character in length (likely orphaned characters from the OCR process). The final DTM has 66,552 unique words, 10,134,304 total tokens and 17,493 documents. The "20231006_dca_dtm.Rdata" file is a sparse matrix class object from the Matrix R package.
In R, use the load() function to load the objects `dca_dtm` and `dca_meta`. To associate `dca_meta` with `dca_dtm`, match the "pdf_file" variable in `dca_meta` to the rownames of `dca_dtm`.
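A minimal loading sketch following the instructions above (the dtm_row column added here is just an illustrative name):

library(Matrix)   # dca_dtm is a sparse matrix class from the Matrix package

load("20231006_dca_dtm.Rdata")               # loads `dca_dtm`
load("20231006_dca_metadata_subset.Rdata")   # loads `dca_meta`

# A row of dca_meta is a protest event and one article can cover several
# events, so match events (metadata rows) to articles (DTM rows) by pdf_file.
dca_meta$dtm_row <- match(dca_meta$pdf_file, rownames(dca_dtm))
dim(dca_dtm)   # 17,493 articles x 66,552 unique words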
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AWC to 150cm is one of 18 soil attributes chosen to underpin the land suitability assessment of the Victoria River Water Resource Assessment (VIWRA) through the digital soil mapping (DSM) process. AWC (available water capacity) indicates the ability of a soil to retain and supply water for plant growth. This raster data represents a modelled dataset of AWC to 150cm (mm of water to 150cm of soil depth) and is derived from analysed site data, spline calculations and environmental covariates. AWC is a parameter used in land suitability assessments for rainfed cropping and for water use efficiency in irrigated land uses. This raster data provides improved soil information used to underpin and identify opportunities, and to promote detailed investigation, for a range of sustainable regional development options; it was created within the ‘Land Suitability’ activity of the CSIRO VIWRA. A companion dataset and statistics reflecting the reliability of this data are also provided and are described in the lineage section of this metadata record. Processing information is supplied in ranger R scripts, and attributes were modelled using a Random Forest approach. The DSM process is described in the CSIRO VIWRA published report ‘Soils and land suitability for the Victoria catchment, Northern Territory’, a technical report from the CSIRO Victoria River Water Resource Assessment to the Government of Australia. The Victoria River Water Resource Assessment provides a comprehensive overview and integrated evaluation of the feasibility of aquaculture and agriculture development in the Victoria catchment (NT), as well as of the ecological, social and cultural (Indigenous water values, rights and aspirations) impacts of development.
Lineage: This AWC to 150cm dataset has been generated from a range of inputs and processing steps; an overview follows. For more information refer to the CSIRO VIWRA published reports, in particular ‘Soils and land suitability for the Victoria catchment, Northern Territory’.
1. Collated existing data (relating to soils, climate, topography, natural resources and remote sensing, in various formats: reports, spatial vector, spatial raster, etc.).
2. Selected additional soil and land attribute site data locations by a conditioned Latin hypercube statistical sampling method applied across the covariate data space.
3. Carried out fieldwork to collect new attribute data and soil samples for analysis, and to build an understanding of geomorphology and landscape processes.
4. Performed database analysis to extract the data meeting the specific selection criteria required for the attribute to be modelled.
5. Used the R statistical programming environment for the attribute computing. Models were built from selected input data and covariate data using predictive learning from a Random Forest approach implemented in the ranger R package.
6. Created the AWC to 150cm Digital Soil Mapping (DSM) attribute raster dataset. DSM data is a geo-referenced dataset generated from field observations and laboratory data, coupled with environmental covariate data through quantitative relationships. It applies pedometrics: the use of mathematical and statistical models that combine information from soil observations with information contained in correlated environmental variables, remote sensing images and some geophysical measurements.
7. Produced companion predicted reliability data from the 500 individual Random Forest attribute models created.
8. Conducted quality assessment (QA) of this DSM attribute data by three methods. Method 1: statistical (quantitative) assessment of the model and input data. The quality of the DSM models was tested using data withheld from model computations and expressed as out-of-bag (OOB) and R-squared results, giving an estimate of the reliability of the model predictions. These results are supplied. Method 2: statistical (quantitative) assessment of the spatial attribute output data, presented as a raster of the attribute’s “reliability”. The 500 individual trees of the attribute’s RF model were used to generate 500 datasets of the attribute, from which model reliability was estimated; for continuous attributes the reliability measure is the coefficient of variation. This data is supplied. Method 3: independent external validation site data combined with on-ground expert (qualitative) examination of outputs during validation field trips. Across each of the study areas a two-week validation field trip was conducted using a new validation site set produced by a random sampling design based on conditioned Latin hypercube sampling of the attribute’s reliability data. The modelled DSM attribute value was assessed against the actual on-ground value. These results are published in the report cited in this metadata record.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ascertaining the precise and accurate spatial distribution of population is essential in conducting effective urban planning, resource allocation, and emergency rescue planning. The random forest (RF) model is widely used in population spatialization studies. However, the complexity of population distribution characteristics and the limitations of the RF model in processing unbalanced datasets affect population prediction accuracy. To address these issues, a population spatialization model that integrates feature selection with an improved random forest is proposed herein. Firstly, recursive feature elimination with cross validation (RFECV), maximum information coefficient (MIC), and mean decrease accuracy (MDA) methods were utilized to select population distribution feature factors. Random forests were constructed using the feature subsets selected via the different feature selection methods, namely MIC-RF, RFECV-RF and MDA-RF. Subsequently, the feature factors corresponding to the model with the highest accuracy were selected as the optimal feature subset and used as input data in the model construction. Additionally, considering the imbalance in population spatial distribution, we used the K-means++ clustering algorithm to cluster the optimal feature subset, and we used the bootstrap sampling method to extract the same amount of data from each cluster and fuse it with the training subset to build an improved random forest model. Based on this model, a spatial population distribution dataset of the Southern Sichuan Economic Zone at a 500m resolution was generated. Finally, the population dataset generated in this study was compared and validated with the WorldPop dataset. The results showed that utilizing feature selection methods improves model accuracy to varying degrees compared with RF based on all factors; MDA-RF had the lowest MAPE of 0.174 and the highest R2 of 0.913 among them. Therefore, the feature subset selected using the MDA method was considered the optimal feature subset. Compared with MDA-RF, the prediction accuracy of the improved RF built on the same subset increased by 1.7%, indicating that improving the bootstrap sampling of random forest by using the K-means++ clustering algorithm can enhance model accuracy to some extent. Compared with the WorldPop dataset, the accuracy of the results predicted using the proposed method was enhanced: the MRE and RMSE of the WorldPop dataset were 57.24 and 23174.98, respectively, while the MRE and RMSE of the proposed method were 25.00 and 15776.50, respectively. This implies that the method proposed in this paper can simulate population spatial distribution more accurately.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistical approaches for inferring the spatial distribution of taxa (Species Distribution Models, SDMs) commonly rely on available occurrence data, which is often clumped and geographically restricted. Although available SDM methods address some of these factors, they could be more directly and accurately modelled using a spatially-explicit approach. Software to fit models with spatial autocorrelation parameters in SDMs is now widely available, but whether such approaches for inferring SDMs aid predictions compared to other methodologies is unknown. Here, within a simulated environment using 1000 generated species’ ranges, we compared the performance of two commonly used non-spatial SDM methods (Maximum Entropy Modelling, MAXENT, and boosted regression trees, BRT) to a spatial Bayesian SDM method (fitted using R-INLA), when the underlying data exhibit varying combinations of clumping and geographic restriction. Finally, we tested how recommended methodological settings designed to account for spatially non-random patterns in the data impact inference. The spatial Bayesian SDM method was the most consistently accurate, ranking among the top two methods in 7 of 8 data sampling scenarios. Within high-coverage sample datasets, all methods performed fairly similarly. When sampling points were randomly spread, BRT was 1-3% more accurate than the other methods, and when samples were clumped, the spatial Bayesian SDM method had a 4-8% better AUC score. In contrast, when sampling points were restricted to a small section of the true range, all methods were on average 10-12% less accurate, with greater variation among the methods. Model inference under the recommended settings to account for autocorrelation was not impacted by clumping or restriction of data, except for the complexity of the spatial regression term in the spatial Bayesian model. Methods such as those made available by R-INLA can be successfully used to account for spatial autocorrelation in an SDM context and, by taking account of random effects, produce outputs that can better elucidate the role of covariates in predicting species occurrence. Given that it is often unclear what the drivers are behind data clumping in an empirical occurrence dataset, or indeed how geographically restricted these data are, spatially-explicit Bayesian SDMs may be the better choice when modelling the spatial distribution of target species.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: Rosstat and the Federal Tax Service.
📅 Covers 2011–2023 initially and will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end users to query only the variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in an R or Python environment. Please consult the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
from datasets import load_dataset
import polars as pl

# Load the entire dataset via the Hugging Face Datasets library
RFSD = load_dataset('irlspbru/RFSD')

# Or read a single year of data directly from the hub with polars
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
Please note that the data is not shuffled within years, so streaming the first n rows will not yield a random sample.
Local File Import
Importing in Python requires the pyarrow package to be installed.
import pyarrow.dataset as ds
import polars as pl

# Point at the local copy of the dataset (partitioned by year)
RFSD = ds.dataset("local/path/to/RFSD")

# Inspect the schema
print(RFSD.schema)

# Load the entire dataset into a polars DataFrame
RFSD_full = pl.from_arrow(RFSD.to_table())

# Load the 2019 filings only
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

# Load only revenue (line 2110) for 2019, keeping the firm identifier (inn)
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)
# Apply human-readable column names from the supplied dictionary
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename(
    {item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])}
)
R
Local File Import
Importing in R requires the arrow package to be installed.
library(arrow)
library(data.table)

# Point at the local copy of the dataset (partitioned by year)
RFSD <- open_dataset("local/path/to/RFSD")

# Inspect the schema
schema(RFSD)

# Load the entire dataset into a data.table
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Load the 2019 filings only
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# Load only revenue (line 2110) for 2019, keeping the firm identifier (inn)
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Apply human-readable column names from the supplied dictionary
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free to use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms for 2011–2023. We will add the data for 2024 by July 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible by law to submit financial statements to Rosstat or the Federal Tax Service: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may exercise the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022, and we had to impute its 2022 data from its 2023 filings. Sibur filed only in 2023, and Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove such filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode the structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding a firm to a particular house. Gazprom, for instance, is geocoded to house level in 2014 and 2021–2023, but only to street level in 2015–2020 because Nominatim mishandled the house number; in those years we fell back to street-level geocoding. Additionally, streets in different districts of one city may share identical names; we have ignored this problem in our geocoding and invite your submissions. Finally, the address of incorporation may not correspond to plant locations. Rosneft, for instance, has 62 field offices in addition to its central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. Although we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting corrections; we will capture them in future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider the relatively unknown LLC Banknota: it reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (the outlier variable), keeping the source data intact.
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to Russian accounting standards, so it would be wrong to infer the financials of corporate groups from this data. Gazprom, for instance, has over 800 affiliated entities, and considering only the parent company's financials is not enough to study this corporate group in its entirety.
Why is the data not in CSV?
The data is provided in the Apache Parquet format: a structured, column-oriented, compressed binary format that allows conditional subsetting of columns and rows. In other words, you can easily query the financials of companies of interest while keeping only the variables of interest in memory, greatly reducing the data footprint.
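To illustrate this point (the path and the line_2110 column follow the import examples above; adapt them to your local layout), a lazy polars query can push the column selection down to the Parquet files, so only the needed bytes are read:

import polars as pl

# Read only the firm identifier and revenue for 2023, leaving all other
# columns and years on disk
revenue_2023 = (
    pl.scan_parquet("local/path/to/RFSD/year=2023/*.parquet")
    .select(["inn", "line_2110"])
    .collect()
)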
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words, when most firms have filed their statements with the Federal Tax Service. The official deadline for filing the previous year's statements is April 1. However, every year a portion of firms either misses the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. There is thus a trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average 96.7% of statements are filed by the end of May, including 86.4% of all correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title = {{R}ussian {F}inancial {S}tatements {D}atabase},
  author = {Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note = {arXiv preprint arXiv:2501.05841},
  doi = {https://doi.org/10.48550/arXiv.2501.05841},
  year = {2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prediction accuracy of different random forest models.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AWC to 60cm is one of 18 attributes of soils chosen to underpin the land suitability assessment of the Roper River Water Resource Assessment (ROWRA) through the digital soil mapping process (DSM). AWC (available water capacity) indicates the ability of a soil to retain and supply water for plant growth. This AWC raster data represents a modelled dataset of AWC to 60cm (mm of water to 60cm of soil depth) and is derived from analysed site data, spline calculations and environmental covariates. AWC is a parameter used in land suitability assessments for rainfed cropping and for water use efficiency in irrigated land uses. This raster data provides improved soil information used to underpin and identify opportunities and promote detailed investigation for a range of sustainable regional development options and was created within the ‘Land Suitability’ activity of the CSIRO ROWRA. A companion dataset and statistics reflecting reliability of this data are also provided and can be found described in the lineage section of this metadata record. Processing information is supplied in ranger R scripts and attributes were modelled using a Random Forest approach. The DSM process is described in the CSIRO ROWRA published report ‘Soils and land suitability for the Roper catchment, Northern Territory’. A technical report from the CSIRO Roper River Water Resource Assessment to the Government of Australia. The Roper River Water Resource Assessment provides a comprehensive overview and integrated evaluation of the feasibility of aquaculture and agriculture development in the Roper catchment NT as well as the ecological, social and cultural (indigenous water values, rights and aspirations) impacts of development. Lineage: This AWC to 60cm dataset has been generated from a range of inputs and processing steps. Following is an overview. For more information refer to the CSIRO ROWRA published reports and in particular ' Soils and land suitability for the Roper catchment, Northern Territory’. A technical report from the CSIRO Roper River Water Resource Assessment to the Government of Australia. 1. Collated existing data (relating to: soils, climate, topography, natural resources, remotely sensed, of various formats: reports, spatial vector, spatial raster etc). 2. Selection of additional soil and land attribute site data locations by a conditioned Latin hypercube statistical sampling method applied across the covariate data space. 3. Fieldwork was carried out to collect new attribute data, soil samples for analysis and build an understanding of geomorphology and landscape processes. 4. Database analysis was performed to extract the data to specific selection criteria required for the attribute to be modelled. 5. The R statistical programming environment was used for the attribute computing. Models were built from selected input data and covariate data using predictive learning from a Random Forest approach implemented in the ranger R package. 6. Create AWC to 60cm Digital Soil Mapping (DSM) attribute raster dataset. DSM data is a geo-referenced dataset, generated from field observations and laboratory data, coupled with environmental covariate data through quantitative relationships. It applies pedometrics - the use of mathematical and statistical models that combine information from soil observations with information contained in correlated environmental variables, remote sensing images and some geophysical measurements. 7. 
Companion predicted reliability data was produced from the 500 individual Random Forest attribute models created. 8. Quality assessment (QA) of this DSM attribute data was conducted by three methods. Method 1: a statistical (quantitative) assessment of the model and input data. The quality of the DSM models was tested using data withheld from model computations and expressed as out-of-bag (OOB) and R squared results, giving an estimate of the reliability of the model predictions. These results are supplied. Method 2: a statistical (quantitative) assessment of the spatial attribute output data, presented as a raster of the attribute's "reliability". This used the 500 individual trees of the attribute's RF model to generate 500 datasets of the attribute, from which model reliability was estimated. For continuous attributes the reliability estimate is the Coefficient of Variation. This data is supplied. Method 3: independent external validation site data combined with on-ground expert (qualitative) examination of outputs during validation field trips. Across each of the study areas a two-week validation field trip was conducted using a new validation site set produced by a random sampling design based on conditioned Latin hypercube sampling of the attribute's reliability data. The modelled DSM attribute value was assessed against the actual on-ground value. These results are published in the report cited in this metadata record.
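As a rough sketch of the Method 2 reliability calculation, assuming a scikit-learn forest in place of the 500-tree ranger models used in the study (the synthetic regression data is purely illustrative), per-location reliability can be taken as the Coefficient of Variation of the individual tree predictions:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=6, noise=5.0, random_state=0)
y = y - y.min() + 1.0   # shift to positive values, as for an attribute like AWC in mm

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0).fit(X, y)

# One prediction per tree per location: shape (n_trees, n_locations)
per_tree = np.stack([tree.predict(X) for tree in rf.estimators_])

# Coefficient of Variation across trees as the reliability estimate
cv = per_tree.std(axis=0) / per_tree.mean(axis=0)
print(f"OOB R squared: {rf.oob_score_:.3f}; mean CV: {cv.mean():.3f}")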