Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description: - This dataset includes all 22 built-in datasets from the Seaborn library, a widely used Python data visualization tool. Seaborn's built-in datasets are essential resources for anyone interested in practicing data analysis, visualization, and machine learning. They span a wide range of topics, from classic datasets like the Iris flower classification to real-world data such as Titanic survival records and diamond characteristics.
This complete collection serves as an excellent starting point for anyone looking to improve their data science skills, offering a wide array of datasets suitable for both beginners and advanced users.
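As a quick orientation, here is a minimal sketch showing how the same built-in datasets can be listed and loaded through Seaborn itself (Seaborn fetches them from its online data repository; the CSV files bundled in this collection can equally be read with pandas):

```python
import seaborn as sns

# List the names of all built-in Seaborn datasets.
print(sns.get_dataset_names())

# Load one of them as a pandas DataFrame and take a first look.
iris = sns.load_dataset("iris")
print(iris.head())
print(iris["species"].value_counts())
```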
This is a list of Median Annual Household Incomes for the Overburdened Communities page on the EGLE website.
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
This repository contains a comprehensive and clean dataset for predicting e-commerce sales, tailored for data scientists, machine learning enthusiasts, and researchers. The dataset is crafted to analyze sales trends, optimize pricing strategies, and develop predictive models for sales forecasting.
The dataset includes 1,000 records across the following features:
| Column Name | Description |
|---|---|
| Date | The date of the sale (01-01-2023 onward). |
| Product_Category | Category of the product (e.g., Electronics, Sports, Other). |
| Price | Price of the product (numerical). |
| Discount | Discount applied to the product (numerical). |
| Customer_Segment | Buyer segment (e.g., Regular, Occasional, Other). |
| Marketing_Spend | Marketing budget allocated for sales (numerical). |
| Units_Sold | Number of units sold per transaction (numerical). |
Date: - Range: 01-01-2023 to 12-31-2023. - Contains 1,000 unique values without missing data.
Product_Category: - Categories: Electronics (21%), Sports (21%), Other (58%). - Most common category: Electronics (21%).
Price: - Range: From 244 to 999. - Mean: 505, Standard Deviation: 290. - Most common price range: 14.59 - 113.07.
Discount: - Range: From 0.01% to 49.92%. - Mean: 24.9%, Standard Deviation: 14.4%. - Most common discount range: 0.01 - 5.00%.
Customer_Segment: - Segments: Regular (35%), Occasional (34%), Other (31%). - Most common segment: Regular.
Marketing_Spend: - Range: From 2.41k to 10k. - Mean: 4.91k, Standard Deviation: 2.84k.
Units_Sold: - Range: From 5 to 57. - Mean: 29.6, Standard Deviation: 7.26. - Most common range: 24 - 34 units sold.
The dataset is suitable for creating the following visualizations:
1. Price Distribution: Histogram to show the spread of prices.
2. Discount Distribution: Histogram to analyze promotional offers.
3. Marketing Spend Distribution: Histogram to understand marketing investment patterns.
4. Customer Segment Distribution: Bar plot of customer segments.
5. Price vs Units Sold: Scatter plot to show pricing effects on sales.
6. Discount vs Units Sold: Scatter plot to explore the impact of discounts.
7. Marketing Spend vs Units Sold: Scatter plot for marketing effectiveness.
8. Correlation Heatmap: Identify relationships between features.
9. Pairplot: Visualize pairwise feature interactions.
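A minimal sketch of a few of these plots, using the same ecommerce_sales.csv file name as the modeling example further below:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('ecommerce_sales.csv')

# 1. Price distribution
df['Price'].plot.hist(bins=30, title='Price Distribution')
plt.show()

# 6. Discount vs Units Sold
df.plot.scatter(x='Discount', y='Units_Sold', title='Discount vs Units Sold')
plt.show()

# 8. Correlation heatmap of the numerical features
numeric_cols = ['Price', 'Discount', 'Marketing_Spend', 'Units_Sold']
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm')
plt.show()
```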
The dataset is synthetically generated to mimic realistic e-commerce sales trends. Generation involved three steps: feature engineering, data simulation, and validation.
Note: The dataset is synthetic and not sourced from any real-world e-commerce platform.
Here’s an example of building a predictive model using Linear Regression:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the dataset
df = pd.read_csv('ecommerce_sales.csv')
# Feature selection
X = df[['Price', 'Discount', 'Marketing_Spend']]
y = df['Units_Sold']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model training
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data created include isotopes, half-lives, reactions, and threshold reaction energies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Resize the window as needed and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This database studies performance inconsistency among biomass higher heating value (HHV) models based on ultimate analysis. The research null hypothesis is that the rank of a biomass HHV model is consistent. Fifteen biomass models are trained and tested on four datasets; within each dataset, rank invariability across these 15 models indicates performance consistency.
The database includes the datasets and source code used to analyze the performance consistency of the biomass HHV models. The datasets are stored as tables in an Excel workbook. The source code implements the biomass HHV machine learning models using MATLAB object-oriented programming (OOP). These models comprise eight regression models, four supervised learning models, and three neural networks.
The Excel workbook "BiomassDataSetUltimate.xlsx" collects the research datasets in six worksheets. The first worksheet, "Ultimate," contains 908 HHV records from 20 pieces of literature. The column names indicate the elements of the ultimate analysis on a % dry basis, and the HHV column gives the higher heating value in MJ/kg. The next worksheet, "Full Residuals," stores the residuals from model testing based on 20-fold cross-validation; the article (Kijkarncharoensin & Innet, 2021) verifies performance consistency through these residuals. The remaining worksheets contain the literature datasets used to train and test model performance.
The file "SourceCodeUltimate.rar" collects the MATLAB machine learning models implemented in the article. The folders in this archive reflect the class structure of the machine learning models. These classes extend MATLAB's Statistics and Machine Learning Toolbox to support, e.g., k-fold cross-validation. The MATLAB script "runStudyUltimate.m" is the article's main program for analyzing the performance consistency of the biomass HHV models based on ultimate analysis. The script loads the datasets from the Excel workbook and fits the biomass models through the OOP classes.
The first section of the MATLAB script generates the most accurate model by optimizing each model's hyperparameters; the first run takes a few hours to train the machine learning models through this trial-and-error process. The trained models can be saved as a MATLAB .mat file and loaded back into the MATLAB workspace. The remaining script, separated by a section break, performs the residual analysis to inspect performance consistency. It also produces a 3D scatter plot of the biomass data and box plots of the prediction residuals. The interpretation of these results is given in the authors' article.
Reference : Kijkarncharoensin, A., & Innet, S. (2022). Performance inconsistency of the Biomass Higher Heating Value (HHV) Models derived from Ultimate Analysis [Manuscript in preparation]. University of the Thai Chamber of Commerce.
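The rank-consistency idea can be illustrated with a small sketch. Note that this is written in Python for illustration only and is not part of the MATLAB code shipped in "SourceCodeUltimate.rar"; the error values are made up:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical RMSE of each of 15 models on each of 4 datasets (made-up values).
rng = np.random.default_rng(0)
rmse = rng.uniform(0.5, 2.0, size=(15, 4))

# Rank the models within each dataset (rank 1 = lowest RMSE = best model).
ranks = rmse.argsort(axis=0).argsort(axis=0) + 1

# Performance consistency: how similar are the model rankings across datasets?
for i in range(ranks.shape[1]):
    for j in range(i + 1, ranks.shape[1]):
        rho, _ = spearmanr(ranks[:, i], ranks[:, j])
        print(f"Spearman rank correlation, dataset {i + 1} vs {j + 1}: {rho:.2f}")
```

Under the null hypothesis of consistency, the rankings would be nearly identical across datasets and these correlations would be close to one.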
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
Data Visualization
a. Scatter plot
i. The webapp should allow the user to select genes from datasets and plot 2D scatter plots between two variables (expression/copy_number/chronos) for any pair of genes.
ii. The user should be able to filter and color data points using metadata information available in the file “metadata.csv”.
iii. The visualization could be interactive - it would be great if the user can hover over the data points on the plot and get the relevant information (hint: visit https://plotly.com/r/, https://plotly.com/python).
iv. Here is a quick reference for you: a scatter plot of the chronos score for the TTBK2 gene against the expression of the MORC2 gene, with coloring defined by the Gender/Sex column from the metadata file.
b. Boxplot/violin plot
i. The user should be able to select a gene and a variable (expression / chronos / copy_number) and generate a boxplot to display its distribution across multiple categories as defined by a user-selected variable (a column from the metadata file).
ii. Here is an example for your reference, where a violin plot of the CHRONOS score for the gene CCL22 is plotted and grouped by ‘Lineage’ (see the sketch below).
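A rough Plotly Express sketch of the two requested plot types is given below. It assumes a wide-format table in which each selected gene/measure combination is a column and the metadata columns have already been merged in; the file and column names used here (merged_data.csv, TTBK2_chronos, MORC2_expression, CCL22_chronos, Sex, Lineage) are placeholders, not names taken from the assignment files:

```python
import pandas as pd
import plotly.express as px

# Hypothetical wide-format table: one row per cell line, selected gene/measure
# combinations as columns, metadata columns merged in.
df = pd.read_csv("merged_data.csv")  # placeholder file name

# a. Interactive scatter plot: chronos score of TTBK2 vs expression of MORC2,
#    colored by the Sex column from the metadata file.
fig = px.scatter(
    df,
    x="TTBK2_chronos",            # placeholder column name
    y="MORC2_expression",         # placeholder column name
    color="Sex",                  # metadata column used for coloring
    hover_data=list(df.columns),  # show all fields on hover
)
fig.show()

# b. Violin plot: distribution of the CHRONOS score of CCL22 grouped by Lineage.
fig = px.violin(df, x="Lineage", y="CCL22_chronos", box=True, points="all")
fig.show()
```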
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
results_DMDtoolkit. This folder includes all the results produced by DMDtoolkit, such as “prediction results of DMD patients.xlsx”, which contains the original prediction results for DMD patients from TREAT-NMD, Flanigan’s, and GHCPAPF, and “case7-1 (combination of multiple mutations).pdf”, a vector illustration of a combination of multiple mutations. “supplement1.docx” gives examples of basic statistics, including calculation of summary statistics, correlation coefficients, regression coefficients, and t tests. “supplement2.docx” gives examples of basic graphs, including a pedigree, histogram, scatter plot with trend line, stem-and-leaf plot, and cluster dendrogram. (RAR 5435 kb)
Figure 6 data. Includes: Panel A, data from multiple biological replicates in BMK-DKO cells expressing either mCerulean3-BCL-XL, mCerulean3-BCL-XL-ActA, or mCerulean3-BCL-XL-Cb5. The '.pzfx' file can be opened in GraphPad Prism. Alternatively, see "Fig6A_GetAverageForHeatmaps" for the raw data used to generate the heatmaps. The same data were also displayed in a scatter plot (included as well), but this was not included in the paper. SFigure 6 data: images and example data from a single replicate that were used to generate Figure S6. This figure was made to demonstrate the 3-channel colocalization method. Median or mode colocalization values shown in Figures 6, 7, and SFig 8A were determined as shown in SFig 6.
Additional file 1. Example of chromoMap interactive plot constructed using various features of chromoMap including polyploidy (used as multi-track), feature-associated data visualization (scatter and bar plots), chromosome heatmaps, data filters (color-coded scatter and bars). Differential gene expression in a cohort of patients positive for COVID19 and healthy individuals (NCBI Gene Expression Omnibus id: GSE162835) [12]. Each set of five tracks labeled with the same chromosome ID (e.g. 1-22, X & Y) contains the following information: From top to bottom: (1) number of differentially expressed genes (DEGs) (FDR < 0.05) (bars over the chromosome depictions) per genomic window (green boxes within the chromosome). Windows containing ≥ 5 DEGs are shown in yellow. (2) DEGs (FDR < 0.05) between healthy individuals and patients positive for COVID19 visualized as a scatterplot above the chromosome depiction (genes with logFC ≥ 2 or logFC ≤ −2 are highlighted in orange). Dots above the grey dashed line represent upregulated genes in COVID19 positive patients. Heatmap within chromosome depictions indicates the average LogFC value per window. (3–4) Normalized expression of differentially expressed genes (scatterplot) and of each genomic window containing DEG (green scale heatmap) in (3) patients with severe/critical outcomes and (4) asymptomatic/mild outcome patients. (5) logFC of DEGs between healthy individuals and patients positive for COVID19 visualized as scatter plot color-coded based on the metabolic pathway each DEG belongs to.
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub
UltraFeedback is a large-scale, fine-grained, and diverse preference dataset built to train reward and critic models for aligning language models. With thousands of prompts drawn from distinct sources such as UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, and more, UltraFeedback contains 256k samples suitable for a wide array of AI-driven projects. Correct and incorrect answers for each prompt are included in the same data file.
The first step is to understand the content of the dataset, including the source, models, correct answers, and incorrect answers. Knowing which language models (LMs) were used to generate the completions helps you interpret the data in this dataset.
Once you are familiar with the column titles and their meanings, it's time to begin exploring. To maximize your insight into this data set, use a variety of visualization techniques such as scatter plots or bar charts to view sample distributions across different LMs or answer types. Analyzing trends between incorrect and correct answers through data manipulation techniques such as merging sets can also provide valuable insights into preferences across different prompts and sources.
Finally, you may want to try running LR or other machine learning models on this dataset to create simple models for predicting preferences for real-world inputs related to specific tasks.
The possibilities for further exploration of this dataset are endless - now let’s get started!
- Training sentence completion models on the dataset to generate responses with high accuracy and diversity.
- Creating natural language understanding (NLU) tasks such as question-answering and sentiment analysis using the aligned dataset as training/testing sets.
- Developing strongly supervised learning algorithms that are able to use techniques like reward optimization with potential translation applications in developing machine translation systems from scratch or upstream text-generation tasks like summarization, dialog generation, etc
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:------------|
| source | The source of the data. (String) |
| instruction | The instruction given to the language models. (String) |
| models | The language models used to generate the completions. (String) |
| correct_answers | The correct answers to the instruction. (String) |
| incorrect_answers | The incorrect answers to the instruction. (String) |
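A small pandas sketch of the exploratory steps described above, using the column names from the table (the specific plots and summaries are only examples of possible views):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")

# Basic orientation: size, columns, and which language models appear.
print(df.shape)
print(df.columns.tolist())
print(df["models"].value_counts().head(10))

# Distribution of samples across sources, as a simple bar chart.
df["source"].value_counts().plot.bar(title="Samples per source")
plt.tight_layout()
plt.show()

# Compare lengths of correct vs incorrect answers as a rough first signal.
df["correct_len"] = df["correct_answers"].astype(str).str.len()
df["incorrect_len"] = df["incorrect_answers"].astype(str).str.len()
print(df[["correct_len", "incorrect_len"]].describe())
```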
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT. Several mathematical models have been developed for applications in the hydraulics of irrigation systems, and several performance indicators for these models are used and suggested in the literature. The objective of this work was to investigate the performance of statistical indicators for the evaluation of models in irrigation hydraulics. Three case studies representing typical irrigation hydraulics modeling were used to assess the indicators. The following indicators were analyzed: a) difference-based: mean absolute error, mean square error, root mean square error, scaled root mean square error, and percent mean absolute error; b) efficiency-based: Nash-Sutcliffe and Legates-McCabe; c) correlation coefficient (r); d) coefficient of determination (R2); e) index of agreement (d); f) Camargo and Sentelhas index (c); and g) graphical methods: regression error characteristic curve based on relative absolute error and 1:1 scatter plot. For the evaluated cases, which are physical phenomena, the difference-based indicators give similar measures and it is appropriate to report one or more of them. The assessment of models must also be supported by graphical analysis, which shows the real pattern of errors in the model evaluation process. Efficiency-based indicators, r, R2, c, and d are not recommended and should be avoided in the modeling of irrigation hydraulics.
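For readers who want to reproduce the difference-based indicators and the 1:1 scatter plot on their own model outputs, here is a minimal sketch; the observed and simulated values are made up, and the Nash-Sutcliffe efficiency follows its standard definition rather than code supplied with the article:

```python
import numpy as np
import matplotlib.pyplot as plt

def mean_absolute_error(obs, sim):
    return np.mean(np.abs(sim - obs))

def root_mean_square_error(obs, sim):
    return np.sqrt(np.mean((sim - obs) ** 2))

def nash_sutcliffe(obs, sim):
    # Standard Nash-Sutcliffe efficiency: 1 - SSE / variance of the observations.
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

# Made-up observed vs simulated values for illustration.
obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sim = np.array([1.1, 1.9, 3.2, 3.8, 5.3])

print(mean_absolute_error(obs, sim), root_mean_square_error(obs, sim), nash_sutcliffe(obs, sim))

# 1:1 scatter plot: points on the dashed line are perfect predictions.
plt.scatter(obs, sim)
lims = [min(obs.min(), sim.min()), max(obs.max(), sim.max())]
plt.plot(lims, lims, 'k--')
plt.xlabel('Observed')
plt.ylabel('Simulated')
plt.show()
```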
This figure shows illustrative examples of scatter plots for parameter values derived from fits to simulated noisy data (sampled from the distributions of protein data measurements; see Figure 3 for means and standard deviations of spatial expression profiles). Parameter values for Kr are shown in green (left column, A, D, G), for kni in red (centre column, B, E, H), and for gt in blue (right column, C, F, I). The parameters are the production rate, decay rate, diffusion rate, and production delay (see equation 1). Black triangles indicate the original parameter estimate obtained with unperturbed data. The dashed ellipse around parameter values for Kr (in D) indicates parameters selected for further analysis. The arrow in E indicates a striped interference pattern in the distribution of kni parameter values. See text for details.
Jupyter notebook files containing the Python script used for analyzing the interacting effects of water chemistry features on zinc anode passivation. Includes code to evaluate the Master Dataset with histograms, correlation matrices, scatter plots, Dunn's tests, and logistic regression models.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The Iris dataset is a classic dataset in the field of machine learning and statistics. It's often used for demonstrating various data analysis, machine learning, and statistical techniques. Here are some key details about it:
Background - Origin: The dataset was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper titled "The use of multiple measurements in taxonomic problems." - Purpose: Fisher developed the dataset as an example of linear discriminant analysis.
Data Composition - Data Points: The dataset consists of 150 samples from three species of Iris flowers: Iris Setosa, Iris Versicolour, and Iris Virginica. - Features: There are four features measured in centimeters for each sample: 1. Sepal Length 2. Sepal Width 3. Petal Length 4. Petal Width - Classes: The dataset contains three classes, corresponding to the three species of Iris. Each class has 50 samples.
Usage - Classification: The Iris dataset is widely used for classification tasks, especially to illustrate the principles of supervised machine learning algorithms. - Testing Algorithms: It's often used to test out algorithms for linear regression, classification, and clustering due to its simplicity and small size. - Educational Purpose: Because of its clarity and simplicity, it's frequently used in teaching data science and machine learning.
Characteristics - Simple and Clean: The dataset is straightforward, with minimal preprocessing required, making it ideal for beginners. - Well-Behaved Classes: The species are relatively well separated, though there's some overlap between Versicolor and Virginica. - Multivariate Data: It involves understanding the relationship between multiple variables (the four features).
Applications - Benchmarking: The Iris dataset serves as a benchmark for evaluating the performance of different algorithms. - Visualization: It's great for practicing data visualization, especially for exploring techniques like scatter plots, box plots, and pair plots to understand feature relationships.
Despite its simplicity, the Iris dataset remains one of the most famous datasets in the world of data science and machine learning. It serves as an excellent starting point for anyone new to the field and remains a baseline for testing algorithms and teaching concepts.
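A short sketch of the kind of first look described above, loading Iris from scikit-learn (one common source of the dataset) and drawing the classic scatter and pair plots:

```python
from sklearn.datasets import load_iris
import seaborn as sns
import matplotlib.pyplot as plt

# Load the 150-sample Iris dataset into a pandas DataFrame.
iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Petal length vs petal width, colored by species.
sns.scatterplot(data=df, x="petal length (cm)", y="petal width (cm)", hue="species")
plt.show()

# Pair plot of all four features to see pairwise relationships at once.
sns.pairplot(df.drop(columns="target"), hue="species")
plt.show()
```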
The data set comprises more than 7000 time series of ocean currents from moored instruments. The records contain horizontal current speed and direction and often concurrent temperature data. They may also contain vertical velocities, pressure and conductivity data. The majority of data originate from the continental shelf seas around the British Isles (for example, the North Sea, Irish Sea, Celtic Sea) and the North Atlantic. Measurements are also available for the South Atlantic, Indian, Arctic and Southern Oceans and the Mediterranean Sea. Data collection commenced in 1967 and is currently ongoing. Sampling intervals normally vary between 5 and 60 minutes. Current meter deployments are typically 2-8 weeks duration in shelf areas but up to 6-12 months in the open ocean. About 25 per cent of the data come from water depths of greater than 200m. The data are processed and stored by the British Oceanographic Data Centre (BODC) and a computerised inventory is available online. Data are quality controlled prior to loading to the databank. Data cycles are visually inspected by means of a sophisticated screening software package. Data from current meters on the same mooring or adjacent moorings can be overplotted and the data can also be displayed as time series or scatter plots. Series header information accompanying the data is checked and documentation compiled detailing data collection and processing methods.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file includes examples of interactive 2D and 3D scatter plots of cells from Nagano et al. (ZIP)
The data here were originally posted to facilitate timely and transparent peer review. The final public data release with formal metadata is now available at the following location:
Nauman, T.W., and Duniway, M.C., 2020, Predictive soil property maps with prediction uncertainty at 30 meter resolution for the Colorado River Basin above Lake Mead: U.S. Geological Survey data release, https://doi.org/10.5066/P9SK0DO2.
Associated publication:
Nauman, T. W., and Duniway, M. C., 2020, A hybrid approach for predictive soil property mapping using conventional soil survey data: Soil Science Society of America Journal, v. 84, no. 4, p. 1170-1194. https://doi.org/10.1002/saj2.20080.
UPDATE: We found a rendering error in many areas of the 5 cm map. We have recreated the map and included it in this version of the repository.
Repository includes maps of organic matter content (% wt) as defined by United States soil survey program.
These data are preliminary or provisional and are subject to revision. They are being provided to meet the need for timely best science. The data have not received final approval by the U.S. Geological Survey (USGS) and are provided on the condition that neither the USGS nor the U.S. Government shall be held liable for any damages resulting from the authorized or unauthorized use of the data.
These data should be used in combination with a soil depth or depth-to-restriction-layer map (both layers will be released soon as part of this project) to mask out areas mapped at depths deeper than the soil actually extends. This is a limitation of the current data that we hope to address in future updates.
The creation and interpretation of these data are documented in the following article. Please note that this article has not yet been peer reviewed; this citation will be updated as the review process proceeds.
Nauman, T. W., Duniway, M. C., In Preparation. Predictive reconstruction of soil survey property maps for field scale adaptive land management. Soil Science Society of America Journal.
File Name Details:
ACCURACY!! Please see the manuscript and GitHub repository (https://github.com/naumi421/SoilReconProps) for full details on accuracy. Cross-validation (CV) accuracy plots for the overall sample are provided in this repository (_CV_plots.tif). These plots compare CV predictions with observed values relative to a 1:1 line; values plotted near the 1:1 line are more accurate. Note that values are plotted as hex-bin density scatter plots because of the large number of observations (most are >3000). Predictions are also evaluated against soil organic carbon (SOC) data from the U.S. soil survey laboratory database. The SOC measurements were converted to OM values using the common 1.724 conversion factor, and the converted OM values are compared to predicted OM values in an accuracy plot (OM_SOC_plots.tif).
Elements are separated by underscore (_) in the following sequence:
property_r_depth_cm_geometry_model_additional_elements.extension
Example: om_r_0_cm_2D_QRF_bt.tif
Indicates soil organic matter content (om) at 0 cm depth using a 2D model (a separate model for each depth) employing a quantile regression forest (QRF). This file is the raster prediction map for this model. There may be additional GIS files associated with this file (e.g., pyramids) that have the same file name but different extensions. The _bt suffix indicates that the map has been back-transformed from the ln or sqrt transformation used in modeling.
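As an illustration of this naming scheme, here is a small, hypothetical parser that splits a file name into the elements listed above (the element labels come from the sequence shown, not from metadata shipped with the repository):

```python
import os

def parse_soil_map_name(filename):
    """Split a raster name such as 'om_r_0_cm_2D_QRF_bt.tif' into its elements."""
    stem, _ext = os.path.splitext(filename)
    parts = stem.split("_")
    return {
        "property": parts[0],   # e.g. 'om' = organic matter content
        "r": parts[1],          # the 'r' element of the naming sequence
        "depth_cm": parts[2],   # prediction depth in cm
        "geometry": parts[4],   # e.g. '2D' = separate model for each depth
        "model": parts[5],      # e.g. 'QRF' = quantile regression forest
        "extras": parts[6:],    # e.g. ['bt'] = back-transformed, or 95PI suffixes
    }

print(parse_soil_map_name("om_r_0_cm_2D_QRF_bt.tif"))
```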
The following elements may also exist on the end of filenames indicating other spatial files that characterize a given model's uncertainty (see below).
_95PI_h: Indicates the layer is the upper 95% prediction interval value.
_95PI_l: Indicates the layer is the lower 95% prediction interval value.
_95PI_relwidth: Indicates the layer is the 95% relative prediction interval (RPI). The RPI is a standardization of the prediction interval width that indicates whether the model is constraining uncertainty relative to the original sample. RPI values less than one indicate that the model reduces uncertainty relative to the original sample, and values less than 0.5 indicate low uncertainty in predictions. See the paper listed above and also Nauman and Duniway (In Revision) for more details on the RPI.
References
Nauman, T. W., and Duniway, M. C., In Revision, Relative prediction intervals reveal larger uncertainty in 3D approaches to predictive digital soil mapping of soil properties with legacy data: Geoderma
Hi Folks,
Let's understand the importance of Data Visualization.
Below, we have four different data sets, each consisting of paired x and y values.
![The four data sets](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12425689%2F4f6c696e3ad5e2c887b01a0bdd14b355%2Fdata_set.png?generation=1685190700223447&alt=media)
Next, let's calculate some descriptive statistics such as the mean, standard deviation, and correlation for each data set.
![Descriptive statistics for the four data sets](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12425689%2F14765ba12bdc18b8ff67cb6a9f2d7c7a%2Fstatistics.png?generation=1685192394142325&alt=media)
Examining these statistics shows that the four data sets have nearly identical simple descriptive statistics.
However, when we plot the datasets as scatter plots, we can see that the four datasets look very different.
![Scatter plots of the four data sets](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12425689%2Fdbccf9dc638d3de28930b9f660e5f5a4%2Fgarph.png?generation=1685191588780934&alt=media)
Data 1 has a clear linear relationship; Data 2 has a curved, non-linear relationship; Data 3 has a tight linear relationship with one outlier; and Data 4 has nearly constant x values, with one extreme point that single-handedly produces the apparent linear fit.
Such datasets are known as Anscombe's quartet.
Anscombe's quartet is a classic example of the importance of data visualization.
Anscombe's quartet is a set of four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphically represented. Each dataset consists of eleven (x,y) points.
![Anscombe's quartet](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12425689%2F2b964d437afe17db949c57988b5fba05%2Fanscombes_quartet.png?generation=1685192626504792&alt=media)
Anscombe's quartet illustrates the importance of plotting data before we analyze it. Descriptive statistics can be misleading, and they can't tell us everything we need to know about a dataset. Plotting the data on charts can help us to understand the shape of the distribution and to identify any outliers.
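Anscombe's quartet ships with Seaborn, so the whole demonstration above can be reproduced in a few lines (a minimal sketch):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load the four (x, y) data sets; columns are 'dataset', 'x', 'y'.
df = sns.load_dataset("anscombe")

# Nearly identical descriptive statistics for all four data sets...
print(df.groupby("dataset").agg(["mean", "std"]))
print(df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])))

# ...but very different shapes once plotted.
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None, height=3)
plt.show()
```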