8 datasets found

Input Files and Code for: Machine learning can accurately assign geologic...
s.cnmilf.com
data.usgs.gov
Updated Jul 6, 2024
U.S. Geological Survey (2024). Input Files and Code for: Machine learning can accurately assign geologic basin to produced water samples using major geochemical parameters [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/input-files-and-code-for-machine-learning-can-accurately-assign-geologic-basin-to-produced
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
As more hydrocarbon production from hydraulic fracturing and other methods produces large volumes of water, innovative methods must be explored for treatment and reuse of these waters. However, understanding the general water chemistry of these fluids is essential to providing the best treatment options optimized for each producing area. Machine learning algorithms can often be applied to datasets to solve complex problems. In this study, we used the U.S. Geological Survey’s National Produced Waters Geochemical Database (USGS PWGD) in an exploratory exercise to determine whether systematic variations exist between produced waters and geologic environment that could be used to accurately assign a water sample to a given geologic province. Two datasets were used, one with fewer attributes (n = 7) but more samples (n = 58,541) named PWGD7, and another with more attributes (n = 9) but fewer samples (n = 33,271) named PWGD9. The attributes of interest were specific gravity, pH, HCO3, Na, Mg, Ca, Cl, SO4, and total dissolved solids. The two datasets, PWGD7 and PWGD9, contained samples from 20 and 19 geologic provinces, respectively. Outliers across all attributes for each province were removed at a 99% confidence interval. Both datasets were divided into a training and test set using an 80/20 split and a 90/10 split, respectively. Random forest, Naïve Bayes, and k-Nearest Neighbors algorithms were applied to the two different training datasets and used to predict on three different testing datasets. Overall model accuracies across the two datasets and three applied models ranged from 23.5% to 73.5%. A random forest algorithm (split rule = extratrees, mtry = 5) performed best on both datasets, producing an accuracy of 67.1% for a training set based on the PWGD7 dataset, and 73.5% for a training set based on the PWGD9 dataset.
Overall, the three algorithms predicted more accurately on the PWGD7 dataset than on the PWGD9 dataset, suggesting that a larger sample size, fewer attributes, or both lead to a more successful predicting algorithm. Individual balanced accuracies for each producing province ranged from 50.6% (Anadarko) to 100% (Raton) for PWGD7, and from 44.5% (Gulf Coast) to 99.8% (Sedgwick) for PWGD9. Results from testing the model on recently published data outside of the USGS PWGD suggest that some provinces may be lacking information about their true geochemical diversity while others included in this dataset are well described. Expanding on this effort could lead to predictive tools that provide ranges of contaminants or other chemicals of concern within each province to design future treatment facilities to reclaim wastewater. We anticipate that this classification model will be improved over time as more diverse data are added to the USGS PWGD.
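The per-province balanced accuracies quoted above can be illustrated with a short sketch. It assumes the usual one-vs-rest definition, the mean of sensitivity and specificity for each class; the description does not state which definition the authors used. The sample labels are made up for illustration ("Permian" in particular does not appear in the description) and are not drawn from the PWGD.

```python
def per_class_balanced_accuracy(y_true, y_pred):
    """Return {label: (sensitivity + specificity) / 2} from one-vs-rest counts."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    result = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        tn = n - tp - fn - fp
        # Guard against classes that never occur (or always occur) in y_true.
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        result[label] = (sensitivity + specificity) / 2
    return result


# Hypothetical province labels, invented for this example only.
y_true = ["Permian", "Permian", "Raton", "Raton", "Anadarko", "Anadarko"]
y_pred = ["Permian", "Anadarko", "Raton", "Raton", "Permian", "Anadarko"]
scores = per_class_balanced_accuracy(y_true, y_pred)
for province, score in scores.items():
    print(f"{province}: {score:.3f}")
```

A class that is always predicted correctly and never confused with others scores 1.0, while a class predicted at chance level scores about 0.5, which matches the range of values reported in the description.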
Table_3_Biological signatures and prediction of an immunosuppressive...
frontiersin.figshare.com
bin
Updated Jun 13, 2023
Mingxing Lei; Zhencan Han; Shengjie Wang; Chunxue Guo; Xianlong Zhang; Ya Song; Feng Lin; Tianlong Huang (2023). Table_3_Biological signatures and prediction of an immunosuppressive status—persistent critical illness—among orthopedic trauma patients using machine learning techniques.docx [Dataset]. http://doi.org/10.3389/fimmu.2022.979877.s015
Available download formats: bin
Unique identifier
https://doi.org/10.3389/fimmu.2022.979877.s015
Dataset updated
Jun 13, 2023
Dataset provided by
Frontiers
Authors
Mingxing Lei; Zhencan Han; Shengjie Wang; Chunxue Guo; Xianlong Zhang; Ya Song; Feng Lin; Tianlong Huang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Persistent critical illness (PerCI) is an immunosuppressive status. The underlying pathophysiology driving PerCI remains incompletely understood. The objectives of the study were to identify the biological signature of PerCI development and to construct a reliable prediction model for patients who had suffered orthopedic trauma using machine learning techniques.
Methods: This study enrolled 1257 patients from the Medical Information Mart for Intensive Care III (MIMIC-III) database. Lymphocytes were tracked from ICU admission to more than 20 days following admission to examine the dynamic changes over time. Over 40 possible variables were gathered for investigation. Patients were split 80:20 at random into a training cohort (n=1035) and an internal validation cohort (n=222). Four machine learning algorithms (random forest, gradient boosting machine, decision tree, and support vector machine) and a logistic regression technique were used to train and optimize models using data from the training cohort. Patients in the internal validation cohort were used to validate models, and the optimal one was chosen. Patients from two large teaching hospitals were used for external validation (n=113). The key metrics used to assess the models' prediction performance were discrimination, calibration, and clinical usefulness. To encourage clinical application based on the optimal machine learning-based model, a web-based calculator was developed.
Results: 16.0% (201/1257) of all patients had PerCI in the MIMIC-III database. Mean lymphocyte percentages were consistently below the normal reference range over time among PerCI patients (around 10.0%), whereas in patients without PerCI the lymphocyte level continued to increase and entered the normal range by day 10 following ICU admission. Subgroup analysis demonstrated that patients with PerCI were in a more serious health condition at admission, since those patients had worse nutritional status, more electrolyte imbalance and infection-related comorbidities, and more severe illness scores. Eight variables, including albumin, serum calcium, red cell distribution width (RDW), blood pH, heart rate, respiratory failure, pneumonia, and the Sepsis-related Organ Failure Assessment (SOFA) score, were significantly associated with PerCI according to the least absolute shrinkage and selection operator (LASSO) logistic regression model combined with 10-fold cross-validation. These variables were all included in the modelling. In comparison to other algorithms, the random forest had the best prediction ability, with the highest area under the receiver operating characteristic curve (AUROC) (0.823, 95% CI: 0.757-0.889), highest Youden index (1.571), and lowest Brier score (0.107). The AUROC in the external validation cohort reached 0.800 (95% CI: 0.688-0.912). Based on the risk stratification system, patients in the high-risk group had a 10.0-fold greater chance of developing PerCI than those in the low-risk group. A web-based calculator was available at https://starxueshu-perci-prediction-main-9k8eof.streamlitapp.com/.
Conclusions: Patients with PerCI typically remain in an immunosuppressive status, but those without PerCI gradually regain normal immunity. The dynamic changes of lymphocytes can be a reliable biomarker for PerCI. This work developed a reliable model that may be helpful in improving early diagnosis and targeted intervention of PerCI. Beneficial interventions, such as improving nutritional status and immunity, maintaining electrolyte and acid-base balance, curbing infection, and promoting respiratory recovery, are warranted early to prevent the onset of PerCI, especially among patients in the high-risk group and those with a continuously low level of lymphocytes.
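Two of the metrics named above, the Brier score and the Youden index, are simple enough to sketch directly. The example below uses the textbook definitions (mean squared error of predicted probabilities, and sensitivity + specificity - 1 at a chosen threshold); the outcomes, probabilities, and the 0.5 threshold are invented for illustration and are not taken from the study. Under this definition the Youden index cannot exceed 1, so the reported 1.571 may be the plain sum of sensitivity and specificity, though that reading is only a guess.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def youden_index(y_true, y_prob, threshold=0.5):
    """Sensitivity + specificity - 1 when predicting positive at p >= threshold."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(y == 1 and p >= threshold for y, p in pairs)
    fn = sum(y == 1 and p < threshold for y, p in pairs)
    tn = sum(y == 0 and p < threshold for y, p in pairs)
    fp = sum(y == 0 and p >= threshold for y, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity - 1


# Invented outcomes (1 = PerCI) and predicted risks for five patients.
outcomes = [1, 0, 0, 1, 0]
risks = [0.8, 0.2, 0.4, 0.6, 0.1]
print("Brier score:", round(brier_score(outcomes, risks), 3))
print("Youden index:", youden_index(outcomes, risks))
```

A lower Brier score means better calibrated and sharper probabilities, which is why the description reports the lowest value as the best.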
Code and data for random forests model in mapping function parameter...
figshare.com
zip
Updated Nov 9, 2023
Haopeng Fan; Xinxing Li; Chunlin Shi (2023). Code and data for random forests model in mapping function parameter calculation [Dataset]. http://doi.org/10.6084/m9.figshare.24534937.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.24534937.v1
Dataset updated
Nov 9, 2023
Dataset provided by
figshare
Authors
Haopeng Fan; Xinxing Li; Chunlin Shi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The tropospheric mapping function (MF) plays an important role in estimating the delay of electromagnetic waves as they travel through the troposphere, especially for space observing technologies such as GNSS and VLBI, where electromagnetic waves serve as the main means of measurement. At present, empirical models that use no meteorological parameters are widely used to calculate MF parameters. They are convenient but less accurate for areas with drastic climate changes, which may lead to errors as large as the meter level in slant path delays and propagate errors to other data products. Although authoritative institutions regularly release the meteorological parameters essential to MF calculation, or ready-made MF products, these usually arrive with a delay of several days or require special authorization. In view of this, we proposed a method based on the random forest (RF) model to obtain key MF parameters rapidly from surface observations. Experiments have shown that, compared with traditional models, the RF method could significantly raise the accuracy of MF parameters, with an improvement of over 60% for the hydrostatic component and over 20% for the wet component. Seasonal systematic deviations were almost eliminated. Finally, a set of RF models was trained on different sample spaces, and their performance was evaluated under different combinations of feature dimensions and time spans. The results suggest that an optimal compromise may exist when the difficulty of obtaining sample data, training time, model size, computational efficiency, accuracy, and other factors are taken into consideration.
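To show what "MF parameters" typically feed into, here is a minimal sketch of the continued-fraction mapping function form widely used in GNSS and VLBI processing. It is an assumption that the parameters in this dataset are the coefficients a, b, and c of that form; the description does not say so. The coefficient values below are illustrative magnitudes only, not values from the dataset.

```python
import math


def mapping_function(elevation_rad, a, b, c):
    """Continued-fraction mapping function, normalized to 1 at zenith."""
    s = math.sin(elevation_rad)
    numerator = 1 + a / (1 + b / (1 + c))
    denominator = s + a / (s + b / (s + c))
    return numerator / denominator


# Illustrative coefficients (assumed, roughly hydrostatic-sized magnitudes).
A, B, C = 1.2e-3, 2.9e-3, 62.6e-3

for elevation_deg in (90, 30, 10, 5):
    factor = mapping_function(math.radians(elevation_deg), A, B, C)
    print(f"elevation {elevation_deg:2d} deg -> mapping factor {factor:.3f}")
```

The slant delay is the zenith delay multiplied by this factor, so at low elevation angles a small error in the coefficients is amplified roughly tenfold, which is consistent with the meter-level slant errors mentioned in the description.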
Data from: Large, climate-sensitive soil carbon stocks mapped with...
data.niaid.nih.gov
data.subak.org
zip
Updated Oct 10, 2024
Gavin McNicol; Chuck Bulmer; David D'Amore; Paul Sanborn; Sari Saunders; Ian Giesbrecht; Santiago Gonzalez Arriola; Allison Bidlack; David Butman; Brian Buma (2024). Large, climate-sensitive soil carbon stocks mapped with pedology-informed machine learning in the North Pacific coastal temperate rainforest [Dataset]. http://doi.org/10.5061/dryad.5jf6j1r
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.5jf6j1r
Dataset updated
Oct 10, 2024
Dataset provided by
Hakai Institute
US Forest Service
University of Alaska Fairbanks
University of Illinois Chicago
University of Washington
Ministry of Forests
University of Colorado Denver
University of Northern British Columbia
Authors
Gavin McNicol; Chuck Bulmer; David D'Amore; Paul Sanborn; Sari Saunders; Ian Giesbrecht; Santiago Gonzalez Arriola; Allison Bidlack; David Butman; Brian Buma
License
CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Area covered
Alaska, Southeast Alaska, North Pacific coastal temperate rainforest, British Columbia
Description
Accurate soil organic carbon (SOC) maps are needed to predict the terrestrial SOC feedback to climate change, one of the largest remaining uncertainties in Earth system modeling. Over the last decade, global scale models have produced varied predictions of the size and distribution of SOC stocks, ranging from 1,000 to > 3,000 Pg of C within the top 1 m. Regional assessments may help validate or improve global maps because they can examine landscape controls on SOC stocks and offer a tractable means to retain regionally-specific information, such as soil taxonomy, during database creation and modeling. We compile a new transboundary SOC stock database for coastal watersheds of the North Pacific coastal temperate rainforest, using soil classification data to guide gap-filling and machine learning approaches used to explore spatial controls on SOC and predict regional stocks. Precipitation and topographic attributes controlling soil wetness were found to be the dominant controls of SOC, underscoring the dependence of C accumulation on high soil moisture. The random forest model predicted stocks of 4.5 Pg C (to 1 m) for the study region, 22% of which was stored in organic soil layers. Calculated stocks of 228 ± 111 Mg C ha-1 fell within ranges of several past regional studies and indicate 11-33 Pg C may be stored across temperate rainforest soils globally. Predictions compared very favorably with regionalized estimates from two spatially explicit global products (Pearson's correlation: ρ = 0.73 vs. 0.34). Notably, SoilGrids250m was an outlier for estimates of total SOC, predicting 4-fold higher stocks (18 Pg C) and indicating bias in this global product for the soils of the temperate rainforest. In sum, our study demonstrates that coastal temperate rainforest (CTR) ecosystems represent a moisture-dependent hotspot for SOC storage at mid-latitudes.
Methods
Transboundary SOC Database
We compiled a transboundary database of > 1300 soil profile descriptions (pedons) across SEAK and BC from published and archive data sources. For each pedon, we calculated SOC stocks for the top 1 m of mineral soil plus surface organic horizons using data harmonization and gap-filling procedures that are detailed in the supplementary information (supplementary tables 1–5). In brief, US soil classification was converted to Canadian where necessary, and gaps were filled with published values or modeled estimates grouped by soil class, horizon, and lithology. In contrast to some other regional and global C assessments, this approach avoided the use of generalized empirical relationships between soil properties and missing variables, such as between soil C and soil bulk density, or soil C and depth.
Environmental covariates
Environmental covariates were selected (supplementary table 6) to predict SOC stock due to their relationship with soil-forming factors (climate, organisms, relief, parent material, and time; Jenny 1994). Covariate data were extracted from the rasters at the pedon coordinates and appended to the final SOC stocks (in supplementary material) to use in all further analyses. Further details of the 12 selected environmental covariates along with justification for inclusion and pre-processing steps are listed in supplementary table 6. Briefly, only high-quality and spatially continuous data products were used. Curating covariates based on knowledge of regional soil development facilitates clearer interpretation and reduces the risk of autocorrelation between variables.
Random forest model
A random forest model was trained to predict stocks of SOC across the NPCTR in R (v.3.4; R Core Team 2018 (www.R-project.org)) using the R-package randomForest (4.6; Liaw and Wiener 2002).
Random forests grow a large number of regression trees (Breiman et al 1984) from different random subsets of training data and predictor variables, thereby reducing variance relative to single trees, and greatly reducing the risk of over-fitting model predictions and non-optimal solutions—though at the cost of interpretability (Breiman 2001). The transboundary database SOC stocks and associated covariates were first split into training (80%) and testing (20%) data and the model was parameterized to grow 5000 trees. For each tree, a subsample equivalent to ¼ of the total sample size was utilized (with replacement). Node size was set at 4 to minimize the out-of-bag error based on preliminary testing. Model performance was measured from goodness-of-fit, distributions of residuals, and predictions of test SOC stocks. Confidence intervals were computed using an infinitesimal jack-knife procedure (Wager et al 2013). Predictions were made across the NPCTR study extent using an R-package raster (v2.6; Hijmans 2017) which produced a SOC map at 90.5 m resolution. All lakes >10 ha were clipped from the final map (HydroLakes, Messager et al 2016), and the glacier area was clipped using the Randolph Glacier Inventory 5.0 (GLIMS, Raup et al 2007) database. Final SOC stocks were adjusted for topography by scaling the SOC map with actual land surface area calculated from cell slope values. The random forest model was re-run for the three gap-filling sensitivity analyses. Soil organic carbon maps were exported as .tif files.
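The topographic adjustment and the unit conversion from per-hectare densities to a regional total can be sketched as follows. This assumes that "actual land surface area calculated from cell slope values" means dividing each cell's planimetric area by the cosine of its slope, which is a common convention but is not spelled out in the description. The cell values are invented; only the 90.5 m resolution comes from the text.

```python
import math

MG_PER_PG = 1e9  # 1 Pg = 1e15 g and 1 Mg = 1e6 g


def surface_area_ha(planimetric_area_ha, slope_deg):
    """Approximate true land surface area of a sloped cell (assumed 1/cos rule)."""
    return planimetric_area_ha / math.cos(math.radians(slope_deg))


def total_stock_pg(cells, cell_size_m=90.5):
    """Sum SOC over cells given (density in Mg C per ha, slope in degrees)."""
    cell_area_ha = cell_size_m ** 2 / 10_000  # square meters to hectares
    total_mg = sum(
        density * surface_area_ha(cell_area_ha, slope) for density, slope in cells
    )
    return total_mg / MG_PER_PG


# Invented example cells: (SOC density in Mg C per ha, slope in degrees).
example_cells = [(228.0, 0.0), (228.0, 30.0), (150.0, 45.0)]
print(f"Total stock: {total_stock_pg(example_cells):.3e} Pg C")
```

Steeper cells contribute more carbon than their map footprint suggests, so skipping this correction would bias totals low in mountainous terrain.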
Random forests for predicting species identity of forensically important...
data.niaid.nih.gov
datadryad.org
zip
Updated Jan 21, 2024
Tsung Fei Khang; Nur Ayuni Dayana Mohd Puaad; Ser Huy Teh; Zulqarnain Mohamed (2024). Random forests for predicting species identity of forensically important blow flies (Diptera: Calliphoridae) and flesh flies (Sarcophagidae) using geometric morphometric data: proof of concept [Dataset]. http://doi.org/10.5061/dryad.95x69p8hf
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.95x69p8hf
Dataset updated
Jan 21, 2024
Dataset provided by
Gene Express Sdn. Bhd.
University of Malaya
Authors
Tsung Fei Khang; Nur Ayuni Dayana Mohd Puaad; Ser Huy Teh; Zulqarnain Mohamed
License
CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Description
Wing shape variation has been shown to be useful for delineating forensically important fly species in two Diptera families: Calliphoridae and Sarcophagidae. Compared to DNA-based identification, the cost of geometric morphometric data acquisition and analysis is much lower because the tools required are basic and stable software is available. However, to date, an explicit demonstration of using wing geometric morphometric data for species identity prediction in these two families remains lacking. Here, geometric morphometric data from 19 homologous landmarks on the left wing of males from seven species of Calliphoridae (n=55) and eight species of Sarcophagidae (n=40) were obtained and processed using Generalized Procrustes Analysis. The allometric effect was removed by regressing the Procrustes coordinates against centroid size (in log10). Subsequently, principal component analysis of the allometry-adjusted Procrustes variables was done, with the first 15 principal components used to train a random forests model for species prediction. Using a real test sample consisting of 33 male fly specimens collected around a human corpse at a crime scene, the estimated percentage of concordance between species identities predicted using the random forests model and those inferred using DNA-based identification was about 80.6% (approximate 95% confidence interval = [68.9%, 92.2%]). In contrast, baseline concordance using naive majority class prediction was 36.4%. The results provide proof of concept that geometric morphometric data have good potential to complement morphological and DNA-based identification of blow flies and flesh flies in forensic work.
Methods
The fly training set samples were taken from an archived fly collection at the Molecular Genetics Laboratory MP2, University of Malaya. The fly test samples were collected from a murder crime scene in 2016 in the state of Selangor, Malaysia, where the victim was estimated to have been dead for more than 24 hours. The collected flies were anesthetised in ethyl acetate in a covered bottle. The left wing of each fly specimen was detached after overnight relaxation, mounted onto a glass slide using euparal as the mounting medium, and covered with a coverslip. The slides were left overnight at 56°C to clear out bubbles. Wing images were captured using a digital camera (20X magnification) attached to a binocular microscope (Motic Microscope 2.0, China). For each specimen, coordinate data from 19 homologous landmarks on the left wing image were recorded by a single person (NADMP), using the tpsDig 2.0 (Version 2.17) software and saved in tps file format. Sequence data from the COII gene were obtained from eight Sarcophagidae test samples and five of the 26 Calliphoridae test samples. DNA was extracted from two legs of a specimen using QIAamp® DNA and Blood Mini Kit (Qiagen, USA). For detailed molecular protocols, see the associated publication. For processing and analysing geometric morphometric data, we used R Version 3.2.1. Generalized Procrustes Analysis (GPA) (geomorph R package, Version 3.0.3) was applied to two separate data sets: one containing only the training samples (for inspection of patterns of shape variation within and among species), and another containing both the training and the test samples (for prediction using random forests). Linear regression of the resulting Procrustes coordinate variables against the logarithm (base 10) of the centroid size was done to remove potential effects of wing shape allometry. The data were subsequently transformed to uncorrelated principal component scores using principal component analysis (PCA).
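The allometry-removal step described above amounts to regressing each shape variable on log10 centroid size and keeping the residuals. The sketch below does this with ordinary least squares in plain Python; the authors worked in R with the geomorph package, so this is an illustration of the idea rather than their code, and the specimen values are invented.

```python
import math


def ols_residuals(x, y):
    """Residuals of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def remove_allometry(centroid_sizes, shape_rows):
    """Regress every shape variable on log10(centroid size); return residuals."""
    log_size = [math.log10(size) for size in centroid_sizes]
    columns = [list(col) for col in zip(*shape_rows)]
    adjusted = [ols_residuals(log_size, col) for col in columns]
    # Transpose back so each row is one specimen again.
    return [list(row) for row in zip(*adjusted)]


# Invented data: three specimens, two shape variables each.
sizes = [10.0, 100.0, 1000.0]
shapes = [[1.0, 0.2], [2.0, 0.1], [3.0, 0.4]]
adjusted_shapes = remove_allometry(sizes, shapes)
print(adjusted_shapes)
```

With a single predictor, fitting each coordinate separately gives the same residuals as a multivariate regression, so the size-related component is removed from every variable before PCA.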
SCANFI: the Spatialized CAnadian National Forest Inventory data product
datasets.ai
catalogue.arctic-sdi.org
48, 74
Updated Mar 10, 2024
+ more versions
Natural Resources Canada | Ressources naturelles Canada (2024). SCANFI: the Spatialized CAnadian National Forest Inventory data product [Dataset]. https://datasets.ai/datasets/18e6a919-53fd-41ce-b4e2-44a9707c52dc
Available download formats: 48, 74
Dataset updated
Mar 10, 2024
Dataset provided by
Ministry of Natural Resources of Canada (https://www.nrcan.gc.ca/)
Authors
Natural Resources Canada | Ressources naturelles Canada
Area covered
Canada
Description
This data publication contains a set of 30m resolution raster files representing 2020 Canadian wall-to-wall maps of broad land cover type, forest canopy height, degree of crown closure and aboveground tree biomass, along with species composition of several major tree species. The Spatialized CAnadian National Forest Inventory data product (SCANFI) was developed using the newly updated National Forest Inventory photo-plot dataset, which consists of a regular sample grid of photo-interpreted high-resolution imagery covering all of Canada’s non-arctic landmass. SCANFI was produced using temporally harmonized summer and winter Landsat spectral imagery along with hundreds of tile-level regional models based on a novel k-nearest neighbours and random forest imputation method.

A full description of all methods and validation analyses can be found in Guindon et al. (2024). As the Arctic ecozones are outside NFI’s covered areas, the vegetation attributes in these regions were predicted using a single random forest model. The vegetation attributes in these arctic areas could not be rigorously validated. The raster file « SCANFI_aux_arcticExtrapolationArea.tif » identifies these zones.

SCANFI is not meant to replace nor ignore provincial inventories which could include better and more regularly updated inputs, training data and local knowledge. Instead, SCANFI was developed to provide a current, spatially-explicit estimate of forest attributes, using a consistent data source and methodology across all provincial boundaries and territories. SCANFI is the first coherent 30m Canadian wall-to-wall map of tree structure and species composition and opens novel opportunities for a plethora of studies in a number of areas, such as forest economics, fire science and ecology.

# Limitations

1- The spectral disturbances of some areas disturbed by pests are not comprehensively represented in the training set, thus making it impossible to predict all defoliation cases. One such area, severely impacted by the recent eastern spruce budworm outbreak, is located on the North Shore of the St-Lawrence River. These forests are poorly represented in our training data, so our estimates for them are imprecise.

2- Attributes of open stand classes, namely shrub, herbs, rock and bryoid, are more difficult to estimate through the photointerpretation of aerial images. Therefore, these estimates could be less reliable than the forest attribute estimates.

3- As reported in the manuscript, the uncertainty of tree species cover predictions is relatively high. This is particularly true for less abundant tree species, such as ponderosa pine and tamarack. The tree species layers are therefore suitable for regional and coarser scale studies. Also, the broadleaf proportion is slightly underestimated in this product version.

4- Our validation indicates that the areas in Yukon exhibit a notably lower R2 value. Consequently, estimates within these regions are less dependable.

5- Urban areas and roads are classified as rock, according to the 2020 Agriculture and Agri-Food Canada land-use classification map. Even though those areas contain mostly buildings and infrastructure, they may also contain trees. Forested urban parks are usually classified as forested areas. Vegetation attributes are also predicted for forested areas in agricultural regions.

Updates of this dataset will eventually be available on this metadata page.

# Details on the product development and validation can be found in the following publication:

Guindon, L., Manka, F., Correia, D.L.P., Villemaire, P., Smiley, B., Bernier, P., Gauthier, S., Beaudoin, A., Boucher, J., and Boulanger, Y. 2024. A new approach for Spatializing the Canadian National Forest Inventory (SCANFI) using Landsat dense time series. Can. J. For. Res. https://doi.org/10.1139/cjfr-2023-0118

# Please cite this dataset as:

Guindon L., Villemaire P., Correia D.L.P., Manka F., Lacarte S., Smiley B. 2023. SCANFI: Spatialized CAnadian National Forest Inventory data product. Natural Resources Canada, Canadian Forest Service, Laurentian Forestry Centre, Quebec, Canada. https://doi.org/10.23687/18e6a919-53fd-41ce-b4e2-44a9707c52dc

# The following raster layers are available:

• NFI land cover class values: Land cover classes include Water, Rock, Bryoid, Herbs, Shrub, Treed broadleaf, Treed mixed and Treed conifer

• Aboveground tree biomass (tons/ha): biomass was derived from total merchantable volume estimates produced by provincial agencies

• Height (meters): vegetation height

• Crown closure (%): percentage of pixel covered by the tree canopy

• Tree species cover (%): estimated as the proportion of the canopy covered by each tree species:

o Balsam fir tree cover in percentage (Abies balsamea)
o Black spruce tree cover in percentage (Picea mariana)
o Douglas fir tree cover in percentage (Pseudotsuga menziesii)
o Jack pine tree cover in percentage (Pinus banksiana)
o Lodgepole pine tree cover in percentage (Pinus contorta)
o Ponderosa pine tree cover in percentage (Pinus ponderosa)
o Tamarack tree cover in percentage (Larix laricina)
o White and red pine tree cover in percentage (Pinus strobus and Pinus resinosa)
o Broadleaf tree cover in percentage (PrcB)
o Other coniferous tree cover in percentage (PrcC)
Gradients2-MGL1704-IFCB-Abundance_2020-04-01_v1.0
zenodo.org
bin
Updated Nov 11, 2020
Angelicque White (2020). Gradients2-MGL1704-IFCB-Abundance_2020-04-01_v1.0 [Dataset]. http://doi.org/10.5281/zenodo.4267140
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.4267140
Dataset updated
Nov 11, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Angelicque White
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cruise: Gradients 2, MGL1704

Project Name: Simons Foundation, Gradient NPSG

Dataset Description:

The Imaging FlowCytoBot (IFCB) is an in situ automated imaging flow cytometer that generates images of particles suspended in seawater, in this case from the underway uncontaminated seawater system aboard the R/V Langseth (intake 5m). The IFCB uses a recycled sheath fluid (0.2 µm filtered seawater) to align and drive particles individually towards a light source (red laser, 4.5 mW) in order to detect and identify single or colonial cells using a combination of optical properties (red fluorescence and light scattering intensities) and high resolution images (3.2 pixels per micron) by a mounted camera. Both optical properties are used to trigger targeted image acquisition of suspended particles in the size range <4 to 100 μm. The instrument continuously samples (few seconds) from ~5 ml aliquots from the intake, and processes all particles contained in that volume for the next 20 mins. Images corresponding to the "Sample" variable are available on the cruise's dashboard (http://ifcb-data.soest.hawaii.edu/IFCB_NPTZ). This dataset is for the abundance of imaged cells by genus. For each sample, the total number of cells classified to the genus-level by a random forest algorithm (Sosik and Olson, 2007 doi:10.4319/lom.2007.5.204) is counted and divided by the corresponding volume analyzed (~5 mL). Note that we used all the images collected during Gradients 2.0 to train the random forest algorithm and that classification is therefore highly accurate for this dataset. Using 7 µm calibration beads, we estimated that the error of cell concentration due to cell detection during sample acquisition averages 11 ± 10 %, independently of concentrations in the range 1-10000 cell/mL.
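The abundance calculation described above (cells classified to a genus, counted, then divided by the analyzed volume) can be sketched in a few lines. The genus names and counts below are placeholders chosen for illustration and are not taken from the dataset; only the approximate 5 mL sample volume comes from the description.

```python
from collections import Counter


def abundance_per_ml(genus_labels, volume_ml):
    """Return {genus: cells per mL} for one sample's classified images."""
    if volume_ml <= 0:
        raise ValueError("analyzed volume must be positive")
    counts = Counter(genus_labels)
    return {genus: count / volume_ml for genus, count in counts.items()}


# Placeholder classifier output for one sample (hypothetical labels).
sample_labels = ["Chaetoceros"] * 10 + ["Thalassiosira"] * 5
sample_abundance = abundance_per_ml(sample_labels, volume_ml=5.0)
for genus, cells_per_ml in sorted(sample_abundance.items()):
    print(f"{genus}: {cells_per_ml:.1f} cells/mL")
```

Because the analyzed volume varies slightly between samples, dividing by the per-sample volume rather than a fixed 5 mL keeps concentrations comparable across the cruise track.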
Global urban area 60 meter resolution land cover data (1978-1985 )
data.tpdc.ac.cn
tpdc.ac.cn
zip
Updated Jan 30, 2024
Zemin FENG; Hong WEI; Jun YANG (2024). Global urban area 60 meter resolution land cover data (1978-1985 ) [Dataset]. http://doi.org/10.11888/Terre.tpdc.300988
Available download formats: zip
Unique identifier
https://doi.org/10.11888/Terre.tpdc.300988
Dataset updated
Jan 30, 2024
Dataset provided by
Tanzania Petroleum Development Corporation (http://tpdc.co.tz/)
Authors
Zemin FENG; Hong WEI; Jun YANG
Description
The data include raster (.tif) files and shapefiles (.shp) of urban land cover based on MSS imagery from 1978 to 1985. The classification system includes 10 types of surface cover: cultivated land (10), forest (20), grassland (30), shrubland (40), wetland (50), water body (60), tundra (70), artificial surface (80), bare land (90), and glacier and permanent snow (100); the corresponding raster values are given in parentheses. Data production has four steps: satellite image preprocessing, sample collection, classification, and post-processing. The first three steps are performed on Google Earth Engine, and the fourth is performed offline. Preprocessing consists of identifying the T1-level images available in MSS Collection 2 within the classification extent, removing cloud-contaminated and saturated pixels, and generating annual composites from the median band values of the remaining pixels. Training samples are prepared mainly through visual interpretation and through sample migration of the 1985 training samples, based on the Euclidean distance (ED) and spectral angular distance (SAD) methods. For classification, sample feature values are extracted first; the inputs include remote sensing indices such as NDVI, NDWI, and MBSI, together with the median band values. A random forest model is trained for each 3-degree grid cell, and the trained model is used to classify the images. The classification results are then post-processed offline to improve quality, and an independent validation sample set is used to test the classification accuracy. Because of severe data gaps, classification results from the Landsat Global Land Survey 1975 dataset (compiled mainly from 1972-1983 MSS data, with a small amount from 1982-1987) were used to supplement the years 1978-1983, and Landsat TM classification results were used to fill in 1984 and 1985. To maintain consistency, the TM classification results were resampled to 60 m.
Afterwards, the results were merged at the second administrative level of the Global Administrative Region Dataset (GADM 4.1), which corresponds to prefecture-level administrative regions in China. Where no second-level administrative division exists, the results were merged at the national level. Urban boundaries were then generated from impervious surfaces, and the classification results were cropped to these boundaries to produce urban land cover within each administrative region. The classification accuracy of the final result was tested against an independent validation sample set. Data format: raster (.tif). Data projection: WGS1984.
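The sample migration step above relies on two spectral similarity measures. The following is a minimal sketch of how ED and SAD can be computed for a pair of pixel spectra, not the authors' implementation. The function names, the thresholds `ed_max` and `sad_max`, and the rule that both measures must fall below their thresholds are assumptions made for illustration; the description does not state how the two measures are combined.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance (ED) between two spectra of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def spectral_angle(a, b):
    """Spectral angular distance (SAD) in radians between two spectra."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("spectral angle is undefined for an all-zero spectrum")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def is_stable_sample(reference, target, ed_max, sad_max):
    """Hypothetical migration rule: keep a reference-year sample for the
    target year only if both distances stay below the given thresholds."""
    return (euclidean_distance(reference, target) <= ed_max
            and spectral_angle(reference, target) <= sad_max)
```

ED is sensitive to overall brightness changes, while SAD depends only on the shape of the spectrum, which is one plausible reason for using the two together.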


Input Files and Code for: Machine learning can accurately assign geologic basin to produced water samples using major geochemical parameters


Description

As more hydrocarbon production from hydraulic fracturing and other methods produces large volumes of water, innovative methods must be explored for treatment and reuse of these waters. However, understanding the general water chemistry of these fluids is essential to providing the best treatment options optimized for each producing area. Machine learning algorithms can often be applied to datasets to solve complex problems. In this study, we used the U.S. Geological Survey’s National Produced Waters Geochemical Database (USGS PWGD) in an exploratory exercise to determine if systematic variations exist between produced waters and geologic environment that could be used to accurately classify a water sample to a given geologic province. Two datasets were used, one with fewer attributes (n = 7) but more samples (n = 58,541) named PWGD7, and another with more attributes (n = 9) but fewer samples (n = 33,271) named PWGD9. The attributes of interest were specific gravity, pH, HCO3, Na, Mg, Ca, Cl, SO4, and total dissolved solids. The two datasets, PWGD7 and PWGD9, contained samples from 20 and 19 geologic provinces, respectively. Outliers across all attributes for each province were removed at a 99% confidence interval. Both datasets were divided into a training and test set using an 80/20 split and a 90/10 split, respectively. Random forest, Naïve Bayes, and k-Nearest Neighbors algorithms were applied to the two different training datasets and used to predict on three different testing datasets. Overall model accuracies across the two datasets and three applied models ranged from 23.5% to 73.5%. A random forest algorithm (split rule = extratrees, mtry = 5) performed best on both datasets, producing an accuracy of 67.1% for a training set based on the PWGD7 dataset, and 73.5% for a training set based on the PWGD9 dataset.
Overall, the three algorithms predicted more accurately on the PWGD7 dataset than on the PWGD9 dataset, suggesting that a larger sample size, fewer attributes, or both lead to a more successful predicting algorithm. Individual balanced accuracies for each producing province ranged from 50.6% (Anadarko) to 100% (Raton) for PWGD7, and from 44.5% (Gulf Coast) to 99.8% (Sedgwick) for PWGD9. Results from testing the model on recently published data outside of the USGS PWGD suggest that some provinces may be lacking information about their true geochemical diversity while others included in this dataset are well described. Expanding on this effort could lead to predictive tools that provide ranges of contaminants or other chemicals of concern within each province to inform the design of future treatment facilities to reclaim wastewater. We anticipate that this classification model will be improved over time as more diverse data are added to the USGS PWGD.
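The per-province balanced accuracies quoted above can be read with the help of a small worked example. The sketch below assumes the common one-vs-rest definition, balanced accuracy = (sensitivity + specificity) / 2 for each class; the description does not say which definition or software the study used, and the function name and the labels "A", "B", "C" are hypothetical.

```python
def balanced_accuracy_by_class(y_true, y_pred):
    """Return {label: (sensitivity + specificity) / 2}, treating each
    label in turn as the positive class (one-vs-rest)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    scores = {}
    for label in sorted(set(y_true)):
        tp = fn = fp = tn = 0
        for truth, pred in zip(y_true, y_pred):
            if truth == label and pred == label:
                tp += 1
            elif truth == label:
                fn += 1
            elif pred == label:
                fp += 1
            else:
                tn += 1
        sensitivity = tp / (tp + fn)  # label occurs in y_true, so tp + fn > 0
        # Specificity is undefined when every sample belongs to this label.
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        scores[label] = (sensitivity + specificity) / 2
    return scores


# Toy labels standing in for provinces (illustrative only).
truth = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "B", "B", "B", "C", "A"]
print(balanced_accuracy_by_class(truth, predicted))
# {'A': 0.625, 'B': 0.875, 'C': 0.75}
```

Under this definition, a class whose samples are almost never recovered still scores close to 50% when the classifier rarely assigns other samples to it, so values near 50% indicate weak discrimination for that class.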

