60 datasets found
  1. Data from: Valid Inference Corrected for Outlier Removal

    • tandf.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Shuxiao Chen; Jacob Bien (2023). Valid Inference Corrected for Outlier Removal [Dataset]. http://doi.org/10.6084/m9.figshare.9762731.v4
    Explore at:
Available download formats: pdf
    Dataset updated
    Jun 4, 2023
    Dataset provided by
Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Shuxiao Chen; Jacob Bien
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Ordinary least squares (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this article we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R. Supplementary materials for this article are available online.
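For context, a minimal base-R sketch of the naive detect-and-forget workflow that the abstract argues against; the simulated data, the studentized-residual cutoff, and the variable names are illustrative assumptions, not the authors' procedure or the outference interface:

# simulate a simple regression data set (illustrative only)
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# step 1: fit OLS and flag "outliers" with an ad hoc studentized-residual rule
fit_all <- lm(y ~ x)
keep <- abs(rstudent(fit_all)) <= 2        # assumed cutoff

# step 2 ("forget"): refit on the retained points and report intervals as if
# no data-driven removal had happened; this is the practice the article
# shows can produce invalid confidence intervals and p-values
fit_trim <- lm(y ~ x, subset = keep)
confint(fit_trim)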

  2. Data from: Methodology to filter out outliers in high spatial density data...

    • scielo.figshare.com
    jpeg
    Updated Jun 4, 2023
    Cite
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken (2023). Methodology to filter out outliers in high spatial density data to improve maps reliability [Dataset]. http://doi.org/10.6084/m9.figshare.14305658.v1
    Explore at:
Available download formats: jpeg
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    SciELO journals
    Authors
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

ABSTRACT The considerable volume of data generated by sensors in the field presents systematic errors; thus, it is extremely important to exclude these errors to ensure mapping quality. The objective of this research was to develop and test a methodology to identify and exclude outliers in high-density spatial data sets, and to determine whether the developed filter process could help decrease the nugget effect and improve the spatial variability characterization of high sampling data. We created a filter composed of a global, an anisotropic, and an anisotropic local analysis of data, which considered the respective neighborhood values. For that purpose, we used the median to classify a given spatial point in the data set as the main statistical parameter and took into account its neighbors within a radius. The filter was tested using raw data sets of corn yield, soil electrical conductivity (ECa), and the sensor vegetation index (SVI) in sugarcane. The results showed an improvement in the accuracy of spatial variability within the data sets. The methodology reduced the RMSE by 85%, 97%, and 79% for corn yield, soil ECa, and SVI, respectively, compared to the interpolation errors of the raw data sets. The filter excluded the local outliers, which considerably reduced the nugget effects, reducing the estimation error of the interpolated data. The methodology proposed in this work had a better performance in removing outlier data when compared to two other methodologies from the literature.
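A rough R sketch of the core idea described above: compare each observation with the median of its neighbors within a fixed radius and drop points that deviate too strongly. The column names, the radius, and the tolerance are placeholders for illustration, not the published filter:

# df is assumed to have columns x, y (coordinates) and z (the measured value)
local_median_filter <- function(df, radius = 10, tol = 0.25) {
  keep <- logical(nrow(df))
  for (i in seq_len(nrow(df))) {
    d <- sqrt((df$x - df$x[i])^2 + (df$y - df$y[i])^2)
    nb <- df$z[d > 0 & d <= radius]                  # neighboring values within the radius
    med <- if (length(nb) > 0) median(nb) else df$z[i]
    keep[i] <- abs(df$z[i] - med) <= tol * abs(med)  # keep points close to the local median
  }
  df[keep, ]                                         # filtered data set
}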

  3. Outlier classification using autoencoders: application for fluctuation...

    • osti.gov
    • dataverse.harvard.edu
    Updated Jun 2, 2021
    Cite
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. http://doi.org/10.7910/DVN/SKEHRJ
    Explore at:
    Dataset updated
    Jun 2, 2021
    Dataset provided by
Office of Science (http://www.er.doe.gov/)
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
    Description

Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.

  4. Data from: A Multi-Objective Genetic Algorithm for Outlier Removal

    • acs.figshare.com
    xlsx
    Updated Jun 11, 2023
    Cite
    Oren E. Nahum; Abraham Yosipof; Hanoch Senderowitz (2023). A Multi-Objective Genetic Algorithm for Outlier Removal [Dataset]. http://doi.org/10.1021/acs.jcim.5b00515.s001
    Explore at:
Available download formats: xlsx
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    ACS Publications
    Authors
    Oren E. Nahum; Abraham Yosipof; Hanoch Senderowitz
    License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

Quantitative structure activity relationship (QSAR) or quantitative structure property relationship (QSPR) models are developed to correlate activities for sets of compounds with their structure-derived descriptors by means of mathematical models. The presence of outliers, namely, compounds that differ in some respect from the rest of the data set, compromises the ability of statistical methods to derive QSAR models with good prediction statistics. Hence, outliers should be removed from data sets prior to model derivation. Here we present a new multi-objective genetic algorithm for the identification and removal of outliers based on the k nearest neighbors (kNN) method. The algorithm was used to remove outliers from three different data sets of pharmaceutical interest (logBBB, factor 7 inhibitors, and dihydrofolate reductase inhibitors), and its performances were compared with those of five other methods for outlier removal. The results suggest that the new algorithm provides filtered data sets that (1) better maintain the internal diversity of the parent data sets and (2) give rise to QSAR models with much better prediction statistics. Equally good filtered data sets in terms of these metrics were obtained when another objective function was added to the algorithm (termed “preservation”), forcing it to remove certain compounds with low probability only. This option is highly useful when specific compounds should be preferably kept in the final data set either because they have favorable activities or because they represent interesting molecular scaffolds. We expect this new algorithm to be useful in future QSAR applications.
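For intuition about the kNN basis mentioned above, a small base-R sketch that scores each compound by its mean distance to its k nearest neighbors in descriptor space; this is a generic distance-based outlier score, not the multi-objective genetic algorithm itself, and the matrix X and the choice of k are assumptions:

# X: numeric matrix of descriptors, rows = compounds
knn_outlier_score <- function(X, k = 5) {
  D <- as.matrix(dist(X))                                    # pairwise Euclidean distances
  apply(D, 1, function(d) mean(sort(d[d > 0])[seq_len(k)]))  # mean distance to k nearest neighbors
}

# compounds with the largest scores are candidate outliers, e.g.:
# scores <- knn_outlier_score(scale(X), k = 5)
# head(order(scores, decreasing = TRUE))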

  5. COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Oct 26, 2023
    + more versions
    Cite
    World Bank (2023). COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/4061
    Explore at:
    Dataset updated
    Oct 26, 2023
    Dataset provided by
World Bank Group (http://www.worldbank.org/)
    Authors
    World Bank
    Time period covered
    2020
    Area covered
    Vietnam
    Description

    Geographic coverage

    National, regional

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of total communes in Vietnam). In each commune, one EA is randomly selected and then 15 households are randomly selected in each EA for interview. We use the large module of the 2018 baseline survey to select the households for the official VHFPS interviews and the small-module households as a reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

The questionnaire for Round 2 consisted of the following sections:

Section 2. Behavior
Section 3. Health
Section 5. Employment (main respondent)
Section 6. Coping
Section 7. Safety Nets
Section 8. FIES

    Cleaning operations

Data cleaning began during the data collection process. Inputs for the cleaning process include the interviewers’ notes following each question item, the interviewers’ notes at the end of the tablet form, and the supervisors’ notes made during monitoring. The data cleaning process was conducted in the following steps:
• Append households interviewed in ethnic minority languages to the main dataset of interviews conducted in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO.
• Remove household duplicates in the dataset where the same form was submitted more than once.
• Remove observations of households which were not supposed to be interviewed according to the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.).
• Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers find it difficult to choose a correct code, they are advised to choose the most appropriate one and write down the respondent’s answer in detail so that the survey management team can decide which code best suits that answer.
• Correct data based on supervisors’ notes where enumerators entered a wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write text specifying the answer. The data cleaning team checked this type of answer thoroughly to decide whether each answer needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer could be assigned a completely new code if it appeared many times in the survey dataset.
• Examine the accuracy of outlier values, defined as values below the 5th percentile or above the 95th percentile, by listening to interview recordings (see the sketch after this list).
• Final check on matching the main dataset with the different sections, where information asked at the individual level is kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
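A minimal R sketch of the percentile screening rule quoted in the outlier bullet (values below the 5th or above the 95th percentile are flagged for manual review against the recordings); the data frame and column names are placeholders:

# 'survey' and 'expenditure' are placeholder names for this illustration
q <- quantile(survey$expenditure, probs = c(0.05, 0.95), na.rm = TRUE)
flagged <- survey[!is.na(survey$expenditure) &
                  (survey$expenditure < q[1] | survey$expenditure > q[2]), ]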

  6. paper mill response data

    • kaggle.com
    zip
    Updated Apr 6, 2023
    Cite
    Yerkin Mudebayev (2023). paper mill response data [Dataset]. https://www.kaggle.com/datasets/yerkinmudebayev/paper-mill-response-data/discussion
    Explore at:
Available download formats: zip (689,197 bytes)
    Dataset updated
    Apr 6, 2023
    Authors
    Yerkin Mudebayev
    Description

The project is to conduct a principal components analysis of the paper mill response data (paper_mill_data_response.txt; Aldrin, M., "Moderate projection pursuit regression for multivariate response data", Computational Statistics and Data Analysis, 21 (1996), p. 501-531).
(a) Label the variables r1, ..., r13. Carry out an initial investigation. Do not remove outliers or transform the data. Indicate if you had to process the data file in any way. Explain any conclusions drawn from the evidence and back up your conclusions.
(b) Display the sample correlation matrix R.
(c) Perform a principal component analysis using R.
i. List the eigenvalues and describe the percent contributions to the variance.
ii. Determine the number of principal components to retain and justify your answer by considering at least three methods. Note and comment if there is any disagreement between the methods.
iii. Give the eigenvectors for the first two principal components and write out the principal components.
iv. Considering the coefficients of the principal components, describe dependencies of the principal components on the variables.
v. Display a scatter plot of the first two principal components. Make observations about the plots.
(d) Include your code.
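A compact R sketch of the requested analysis, assuming the file reads into a data frame with the 13 response columns; the read.table call and its options are illustrative:

# read the response data and label the variables r1, ..., r13
dat <- read.table("paper_mill_data_response.txt")
names(dat) <- paste0("r", 1:13)

# (b) sample correlation matrix R
R <- cor(dat)

# (c) principal component analysis using R (i.e., on standardized variables)
pca <- prcomp(dat, scale. = TRUE)
eig <- pca$sdev^2                  # eigenvalues of R
round(100 * eig / sum(eig), 1)     # percent contributions to the variance
pca$rotation[, 1:2]                # eigenvectors of the first two principal components

# (c.v) scatter plot of the first two principal components
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")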

  7. Data from: Pacman profiling: a simple procedure to identify stratigraphic...

    • data-staging.niaid.nih.gov
    • search.dataone.org
• +1 more
    zip
    Updated Jul 8, 2011
    Cite
    David Lazarus; Manuel Weinkauf; Patrick Diver (2011). Pacman profiling: a simple procedure to identify stratigraphic outliers in high-density deep-sea microfossil data [Dataset]. http://doi.org/10.5061/dryad.2m7b0
    Explore at:
Available download formats: zip
    Dataset updated
    Jul 8, 2011
    Authors
    David Lazarus; Manuel Weinkauf; Patrick Diver
    License

CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Marine, Global
    Description

    The deep-sea microfossil record is characterized by an extraordinarily high density and abundance of fossil specimens, and by a very high degree of spatial and temporal continuity of sedimentation. This record provides a unique opportunity to study evolution at the species level for entire clades of organisms. Compilations of deep-sea microfossil species occurrences are, however, affected by reworking of material, age model errors, and taxonomic uncertainties, all of which combine to displace a small fraction of the recorded occurrence data both forward and backwards in time, extending total stratigraphic ranges for taxa. These data outliers introduce substantial errors into both biostratigraphic and evolutionary analyses of species occurrences over time. We propose a simple method—Pacman—to identify and remove outliers from such data, and to identify problematic samples or sections from which the outlier data have derived. The method consists of, for a large group of species, compiling species occurrences by time and marking as outliers calibrated fractions of the youngest and oldest occurrence data for each species. A subset of biostratigraphic marker species whose ranges have been previously documented is used to calibrate the fraction of occurrences to mark as outliers. These outlier occurrences are compiled for samples, and profiles of outlier frequency are made from the sections used to compile the data; the profiles can then identify samples and sections with problematic data caused, for example, by taxonomic errors, incorrect age models, or reworking of sediment. These samples/sections can then be targeted for re-study.
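A simplified R sketch of the Pacman idea: for each species, mark a calibrated fraction of its oldest and youngest occurrences as outliers, so that outlier frequencies can then be profiled by sample or section. The data-frame layout and the 5% default fraction are assumptions for illustration:

# occ is assumed to have columns species and age (occurrence age)
pacman_flag <- function(occ, trim_frac = 0.05) {
  do.call(rbind, lapply(split(occ, occ$species), function(s) {
    lims <- quantile(s$age, probs = c(trim_frac, 1 - trim_frac))
    s$outlier <- s$age < lims[1] | s$age > lims[2]   # youngest and oldest tails
    s
  }))
}

# outlier frequency per sample/section, assuming a sample_id column:
# aggregate(outlier ~ sample_id, data = pacman_flag(occ), FUN = mean)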

  8. Data from: Error and anomaly detection for intra-participant time-series...

    • tandf.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    David R. Mullineaux; Gareth Irwin (2023). Error and anomaly detection for intra-participant time-series data [Dataset]. http://doi.org/10.6084/m9.figshare.5189002
    Explore at:
Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    David R. Mullineaux; Gareth Irwin
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Identification of errors or anomalous values, collectively considered outliers, assists in exploring data, and removing outliers improves statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of entire cycles, although exploring fewer points using a ‘moving window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected in two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving-window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time-series data.
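To illustrate the first stage, a short R sketch that flags, at each time point, cycles deviating from the cross-cycle median by more than a chosen multiple of the median absolute deviation. The matrix layout and the fixed cutoff are assumptions; the authors' Matlab code scales by significance levels of the t-statistic instead:

# cycles: numeric matrix, rows = cycles (strides), columns = time points
mad_flag <- function(cycles, cutoff = 3) {
  apply(cycles, 2, function(v) {
    abs(v - median(v)) > cutoff * mad(v)   # mad() includes the 1.4826 consistency factor
  })                                       # logical matrix, same shape as cycles
}

# a cycle could then be removed if it is flagged at any time point:
# bad_cycles <- which(apply(mad_flag(cycles), 1, any))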

  9. Data Analysis for the Systematic Literature Review of DL4SE

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    College of William and Mary
    Washington and Lee University
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts allow attaining arguments that would influence the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organize the data into 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.

    Transformation. In this stage, we omitted to use any data transformation method except for the clustering analysis. We performed a Principal Component Analysis to reduce 35 features into 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.

Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships in the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

Interpretation/Evaluation. We used the Knowledge Discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

Support = number of occurrences in which the statement is true, divided by the total number of statements.
Confidence = support of the statement divided by the number of occurrences of the premise.
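A tiny R example of the two definitions on a made-up premise/conclusion pair, purely to show the arithmetic:

# toy data: one logical column per feature, one row per paper
papers <- data.frame(
  supervised     = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  irreproducible = c(TRUE, TRUE, FALSE, FALSE, TRUE)
)
premise    <- papers$supervised
conclusion <- papers$supervised & papers$irreproducible
support    <- sum(conclusion) / nrow(papers)   # 3 / 5 = 0.6
confidence <- sum(conclusion) / sum(premise)   # 3 / 4 = 0.75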

  10. Sales data for bulls

    • kaggle.com
    zip
    Updated Apr 6, 2023
    Cite
    Yerkin Mudebayev (2023). Sales data for bulls [Dataset]. https://www.kaggle.com/datasets/yerkinmudebayev/sales-data-for-bulls
    Explore at:
Available download formats: zip (413,774 bytes)
    Dataset updated
    Apr 6, 2023
    Authors
    Yerkin Mudebayev
    Description

Preliminary investigation.
(a) Carry out a shortened initial investigation (steps 1, 2 and 3) based on the matrix scatter plot and box plot. Do not remove outliers or transform the data. Indicate if you had to process the data file in any way. Explain any conclusions drawn from the evidence and back up your conclusions.
(b) Explain why using the correlation matrix for the factor analysis is indicated.
(c) Display the sample correlation matrix R. Does the matrix R suggest the number of factors to use?
(d) Perform a preliminary simplified principal component analysis using R.
i. List the eigenvalues and describe the percent contributions to the variance.
ii. Determine the number of principal components to retain and justify your answer by considering at least three methods. Note and comment if there is any disagreement between the methods.
(e) Include your code.

  11. Data from: Classification of Heart Failure Using Machine Learning: A...

    • data.mendeley.com
    Updated Oct 29, 2024
    Cite
    Bryan Chulde (2024). Classification of Heart Failure Using Machine Learning: A Comparative Study [Dataset]. http://doi.org/10.17632/959dxmgj8d.1
    Explore at:
    Dataset updated
    Oct 29, 2024
    Authors
    Bryan Chulde
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Our research demonstrates that machine learning algorithms can effectively predict heart failure, highlighting high-accuracy models that improve detection and treatment. The Kaggle “Heart Failure” dataset, with 918 instances and 12 key features, was preprocessed to remove outliers and features a distribution of cases with and without heart disease (508 and 410). Five models were evaluated: the random forest achieved the highest accuracy (92%) and was consolidated as the most effective model for classifying cases. Logistic regression and multilayer perceptron were also quite accurate (89%), while decision tree and k-nearest neighbors performed less well, showing that k-nearest neighbors is less suitable for this data. F1 scores confirmed the random forest as the optimal model, benefiting from preprocessing and hyperparameter tuning. The data analysis revealed that age, blood pressure and cholesterol correlate with disease risk, suggesting that these models may help prioritize patients at risk and improve their preventive management. The research underscores the potential of these models in clinical practice to improve diagnostic accuracy and reduce costs, supporting informed medical decisions and improving health outcomes.

  12. Heidelberg Tributary Loading Program (HTLP) Dataset

    • zenodo.org
    • data.niaid.nih.gov
• +1 more
    bin, png
    Updated Jul 16, 2024
    Cite
    NCWQR; NCWQR (2024). Heidelberg Tributary Loading Program (HTLP) Dataset [Dataset]. http://doi.org/10.5281/zenodo.6606950
    Explore at:
Available download formats: bin, png
    Dataset updated
    Jul 16, 2024
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    NCWQR; NCWQR
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is updated more frequently and can be visualized on NCWQR's data portal.

    If you have any questions, please contact Dr. Laura Johnson or Dr. Nathan Manning.

    The National Center for Water Quality Research (NCWQR) is a research laboratory at Heidelberg University in Tiffin, Ohio, USA. Our primary research program is the Heidelberg Tributary Loading Program (HTLP), where we currently monitor water quality at 22 river locations throughout Ohio and Michigan, effectively covering ~half of the land area of Ohio. The goal of the program is to accurately measure the total amounts (loads) of pollutants exported from watersheds by rivers and streams. Thus these data are used to assess different sources (nonpoint vs point), forms, and timing of pollutant export from watersheds. The HTLP officially began with high-frequency monitoring for sediment and nutrients from the Sandusky and Maumee rivers in 1974, and has continually expanded since then.

    Each station where samples are collected for water quality is paired with a US Geological Survey gage for quantifying discharge (http://waterdata.usgs.gov/usa/nwis/rt). Our stations cover a wide range of watershed areas upstream of the sampling point from 11.0 km2 for the unnamed tributary to Lost Creek to 19,215 km2 for the Muskingum River. These rivers also drain a variety of land uses, though a majority of the stations drain over 50% row-crop agriculture.

    At most sampling stations, submersible pumps located on the stream bottom continuously pump water into sampling wells inside heated buildings where automatic samplers collect discrete samples (4 unrefrigerated samples/d at 6-h intervals, 1974–1987; 3 refrigerated samples/d at 8-h intervals, 1988-current). At weekly intervals the samples are returned to the NCWQR laboratories for analysis. When samples either have high turbidity from suspended solids or are collected during high flow conditions, all samples for each day are analyzed. As stream flows and/or turbidity decreases, analysis frequency shifts to one sample per day. At the River Raisin and Muskingum River, a cooperator collects a grab sample from a bridge at or near the USGS station approximately daily and all samples are analyzed. Each sample bottle contains sufficient volume to support analyses of total phosphorus (TP), dissolved reactive phosphorus (DRP), suspended solids (SS), total Kjeldahl nitrogen (TKN), ammonium-N (NH4), nitrate-N and nitrite-N (NO2+3), chloride, fluoride, and sulfate. Nitrate and nitrite are commonly added together when presented; henceforth we refer to the sum as nitrate.

    Upon return to the laboratory, all water samples are analyzed within 72h for the nutrients listed below using standard EPA methods. For dissolved nutrients, samples are filtered through a 0.45 um membrane filter prior to analysis. We currently use a Seal AutoAnalyzer 3 for DRP, silica, NH4, TP, and TKN colorimetry, and a DIONEX Ion Chromatograph with AG18 and AS18 columns for anions. Prior to 2014, we used a Seal TRAACs for all colorimetry.

    2017 Ohio EPA Project Study Plan and Quality Assurance Plan

    Project Study Plan

    Quality Assurance Plan

    Data quality control and data screening

    The data provided in the River Data files have all been screened by NCWQR staff. The purpose of the screening is to remove outliers that staff deem likely to reflect sampling or analytical errors rather than outliers that reflect the real variability in stream chemistry. Often, in the screening process, the causes of the outlier values can be determined and appropriate corrective actions taken. These may involve correction of sample concentrations or deletion of those data points.

    This micro-site contains data for approximately 126,000 water samples collected beginning in 1974. We cannot guarantee that each data point is free from sampling bias/error, analytical errors, or transcription errors. However, since its beginnings, the NCWQR has operated a substantial internal quality control program and has participated in numerous external quality control reviews and sample exchange programs. These programs have consistently demonstrated that data produced by the NCWQR is of high quality.

    A note on detection limits and zero and negative concentrations

    It is routine practice in analytical chemistry to determine method detection limits and/or limits of quantitation, below which analytical results are considered less reliable or unreliable. This is something that we also do as part of our standard procedures. Many laboratories, especially those associated with agencies such as the U.S. EPA, do not report individual values that are less than the detection limit, even if the analytical equipment returns such values. This is in part because as individual measurements they may not be considered valid under litigation.

The measured concentration consists of the true but unknown concentration plus random instrument error, which is usually small compared to the range of expected environmental values. In a sample for which the true concentration is very small, perhaps even essentially zero, it is possible to obtain an analytical result of 0 or even a small negative concentration. Results of this sort are often “censored” and replaced with a statement such as “less than the detection limit”.

    Censoring these low values creates a number of problems for data analysis. How do you take an average? If you leave out these numbers, you get a biased result because you did not toss out any other (higher) values. Even if you replace negative concentrations with 0, a bias ensues, because you’ve chopped off some portion of the lower end of the distribution of random instrument error.

    For these reasons, we do not censor our data. Values of -9 and -1 are used as missing value codes, but all other negative and zero concentrations are actual, valid results. Negative concentrations make no physical sense, but they make analytical and statistical sense. Users should be aware of this, and if necessary make their own decisions about how to use these values. Particularly if log transformations are to be used, some decision on the part of the user will be required.
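In practice this means the documented missing-value codes should be converted before analysis while all other zero and negative concentrations are kept; a minimal R sketch with placeholder names:

# 'river' and 'TP' are placeholder names for this illustration
conc <- river$TP
conc[conc %in% c(-9, -1)] <- NA        # -9 and -1 are the documented missing-value codes
mean(conc, na.rm = TRUE)               # uncensored mean, small negatives included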

    Analyte Detection Limits

    https://ncwqr.files.wordpress.com/2021/12/mdl-june-2019-epa-methods.jpg?w=1024

    For more information, please visit https://ncwqr.org/

  13. Cdd Dataset

    • universe.roboflow.com
    zip
    Updated Sep 5, 2023
    Cite
    hakuna matata (2023). Cdd Dataset [Dataset]. https://universe.roboflow.com/hakuna-matata/cdd-g8a6g/model/3
    Explore at:
Available download formats: zip
    Dataset updated
    Sep 5, 2023
    Dataset authored and provided by
    hakuna matata
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
Cucumber Disease Detection Bounding Boxes
    Description

    Project Documentation: Cucumber Disease Detection

    1. Title and Introduction Title: Cucumber Disease Detection

    Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.

2. Problem Statement Problem Definition: The research uses image analysis methods to address the issue of automating the identification of diseases, including Downy Mildew, in cucumber plants. Effective disease management in agriculture depends on early illness identification.

    Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.

    Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.

3. Data Collection and Preprocessing Data Sources: The dataset comprises pictures of cucumber plants from various sources, including both healthy and damaged specimens.

    Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.

    Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.

4. Exploratory Data Analysis (EDA) The dataset was examined using visuals like scatter plots and histograms. The data was examined for patterns, trends, and correlations. Understanding the distribution of photos of healthy and ill plants was made easier by EDA.

5. Methodology Machine Learning Algorithms:

    Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:

    The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.

6. Model Development The CNN model's architecture consists of layers, units, and activation operations. On the basis of experimentation, hyperparameters including learning rate, batch size, and optimizer were chosen. To avoid overfitting, regularization methods like dropout and L2 regularization were used.

7. Model Training During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.

8. Model Evaluation Evaluation Metrics:

    Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets. Performance Discussion:

    The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.

9. Results and Discussion Key project findings include model performance and disease detection precision; a comparison of the models employed, showing the benefits and drawbacks of each; and the challenges faced throughout the project and the methods used to solve them.

10. Conclusion Recap of the project's key learnings; the project's importance to early disease detection in agriculture is highlighted; future enhancements and potential research directions are suggested.

11. References Libraries: Pillow, Roboflow, YOLO, Sklearn, matplotlib. Dataset: https://data.mendeley.com/datasets/y6d3z6f8z9/1

12. Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g

    Rafiur Rahman Rafit EWU 2018-3-60-111

  14. Intermediate data for TE calculation

    • zenodo.org
    bin, csv
    Updated May 9, 2025
    Cite
    Yue Liu; Yue Liu (2025). Intermediate data for TE calculation [Dataset]. http://doi.org/10.5281/zenodo.10373032
    Explore at:
Available download formats: csv, bin
    Dataset updated
    May 9, 2025
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Yue Liu; Yue Liu
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes intermediate data from RiboBase that generates translation efficiency (TE). The code to generate the files can be found at https://github.com/CenikLab/TE_model.

    We uploaded demo HeLa .ribo files, but due to the large storage requirements of the full dataset, I recommend contacting Dr. Can Cenik directly to request access to the complete version of RiboBase if you need the original data.

    The detailed explanation for each file:

    human_flatten_ribo_clr.rda: ribosome profiling clr normalized data with GEO GSM ids in columns and genes in rows in human.

    human_flatten_rna_clr.rda: matched RNA-seq clr normalized data with GEO GSM ids in columns and genes in rows in human.

    human_flatten_te_clr.rda: TE clr data with GEO GSM ids in columns and genes in rows in human.

human_TE_cellline_all_plain.csv: TE clr data with genes in rows and cell lines in columns in human.

    human_RNA_rho_new.rda: matched RNA-seq proportional similarity data as genes by genes matrix in human.

    human_TE_rho.rda: TE proportional similarity data as genes by genes matrix in human.

    mouse_flatten_ribo_clr.rda: ribosome profiling clr normalized data with GEO GSM ids in columns and genes in rows in mouse.

    mouse_flatten_rna_clr.rda: matched RNA-seq clr normalized data with GEO GSM ids in columns and genes in rows in mouse.

    mouse_flatten_te_clr.rda: TE clr data with GEO GSM ids in columns and genes in rows in mouse.

mouse_TE_cellline_all_plain.csv: TE clr data with genes in rows and cell lines in columns in mouse.

    mouse_RNA_rho_new.rda: matched RNA-seq proportional similarity data as genes by genes matrix in mouse.

    mouse_TE_rho.rda: TE proportional similarity data as genes by genes matrix in mouse.

All the data passed quality control. There are 1,054 human samples and 835 mouse samples:
    * coverage > 0.1 X
    * CDS percentage > 70%
    * R2 between RNA and RIBO >= 0.188 (remove outliers)

All ribosome profiling data here are non-deduplicated, winsorized data paired with deduplicated, non-winsorized RNA-seq data (although the file names say "flatten", that only reflects the naming format).

Code
To read .rda data in R, use load("rdaname.rda").

To calculate proportional similarity from clr data:
library(propr)                                             # proportionality analysis package
human_TE_homo_rho <- propr:::lr2rho(as.matrix(clr_data))   # proportionality (rho) from the clr-transformed matrix
rownames(human_TE_homo_rho) <- colnames(human_TE_homo_rho) <- rownames(clr_data)   # carry identifiers onto both dimensions

  15. Data from: PCP-SAFT Parameters of Pure Substances Using Large Experimental...

    • acs.figshare.com
    zip
    Updated Sep 6, 2023
    Cite
    Timm Esper; Gernot Bauer; Philipp Rehner; Joachim Gross (2023). PCP-SAFT Parameters of Pure Substances Using Large Experimental Databases [Dataset]. http://doi.org/10.1021/acs.iecr.3c02255.s001
    Explore at:
Available download formats: zip
    Dataset updated
    Sep 6, 2023
    Dataset provided by
    ACS Publications
    Authors
    Timm Esper; Gernot Bauer; Philipp Rehner; Joachim Gross
    License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

This work reports pure component parameters for the PCP-SAFT equation of state for 1842 substances using a total of approximately 551,172 experimental data points for vapor pressure and liquid density. We utilize data from commercial and public databases in combination with an automated workflow to assign chemical identifiers to all substances, remove duplicate data sets, and filter unsuited data. The use of raw experimental data, as opposed to pseudoexperimental data from empirical correlations, requires means to identify and remove outliers, especially for vapor pressure data. We apply robust regression using a Huber loss function. For identifying and removing outliers, the empirical Wagner equation for vapor pressure is adjusted to experimental data, because the Wagner equation is mathematically rather flexible and is thus not subject to a systematic model bias. For adjusting model parameters of the PCP-SAFT model, nonpolar, dipolar and associating substances are distinguished. The resulting substance-specific parameters of the PCP-SAFT equation of state yield a mean absolute relative deviation of 2.73% for vapor pressure and 0.52% for liquid densities (2.56% and 0.47% for nonpolar substances, 2.67% and 0.61% for dipolar substances, and 3.24% and 0.54% for associating substances) when evaluated against outlier-removed data. All parameters are provided as JSON and CSV files.
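For reference, a short R sketch of two ingredients named above: the Huber loss used for the robust regression, and one common (3,6) form of the Wagner vapor-pressure equation; the particular Wagner exponents and the delta value used by the authors are assumptions here:

# Huber loss: quadratic for small residuals, linear beyond the threshold delta
huber_loss <- function(r, delta = 1.35) {
  ifelse(abs(r) <= delta, 0.5 * r^2, delta * (abs(r) - 0.5 * delta))
}

# Wagner equation, (3,6) form: ln(p/pc) = (Tc/T) * (a*tau + b*tau^1.5 + c*tau^3 + d*tau^6)
# with tau = 1 - T/Tc and substance-specific parameters a, b, c, d
wagner_lnp <- function(temp, Tc, pc, a, b, c, d) {
  tau <- 1 - temp / Tc
  log(pc) + (Tc / temp) * (a * tau + b * tau^1.5 + c * tau^3 + d * tau^6)
}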

  16. Dataset for the paper "Observation of Acceleration and Deceleration Periods...

    • zenodo.org
    Updated Mar 26, 2025
    Cite
    Yide Qian; Yide Qian (2025). Dataset for the paper "Observation of Acceleration and Deceleration Periods at Pine Island Ice Shelf from 1997–2023 " [Dataset]. http://doi.org/10.5281/zenodo.15022854
    Explore at:
    Dataset updated
    Mar 26, 2025
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Yide Qian; Yide Qian
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Pine Island Glacier
    Description

    Dataset and codes for "Observation of Acceleration and Deceleration Periods at Pine Island Ice Shelf from 1997–2023 "

    • Description of the data and file structure

    The MATLAB codes and related datasets are used for generating the figures for the paper "Observation of Acceleration and Deceleration Periods at Pine Island Ice Shelf from 1997–2023".

    Files and variables

    File 1: Data_and_Code.zip

    Directory: Main_function

Description: Includes MATLAB scripts and functions. Each script includes descriptions that guide the user on how to use it and where to find the dataset used for processing.

MATLAB Main Scripts: Include all the steps to process the data, output figures, and output videos.

    Script_1_Ice_velocity_process_flow.m

    Script_2_strain_rate_process_flow.m

    Script_3_DROT_grounding_line_extraction.m

    Script_4_Read_ICESat2_h5_files.m

    Script_5_Extraction_results.m

MATLAB functions: Files that include MATLAB functions supporting the main scripts:

1_Ice_velocity_code: Includes MATLAB functions related to ice velocity post-processing, including outlier removal, filtering, correction for atmospheric and tidal effects, inverse-weighted averaging, and error estimation.

2_strain_rate: Includes MATLAB functions related to strain rate calculation.

3_DROT_extract_grounding_line_code: Includes MATLAB functions related to converting the range offset results output from GAMMA to differential vertical displacement and using the result to extract the grounding line.

4_Extract_data_from_2D_result: Includes MATLAB functions used to extract profiles from 2D data.

5_NeRD_Damage_detection: Modified code from Izeboud et al. 2023. When applying this code, please also cite Izeboud et al. 2023 (https://www.sciencedirect.com/science/article/pii/S0034425722004655).

6_Figure_plotting_code: Includes MATLAB functions related to the figures in the paper and the supporting information.

Directory: data_and_result

Description: Includes directories that store the results output from MATLAB. Users only need to modify the paths in the MATLAB scripts to their own paths.

1_origin: Sample data ("PS-20180323-20180329", “PS-20180329-20180404”, “PS-20180404-20180410”) output from the GAMMA software in GeoTIFF format that can be used to calculate DROT and velocity. Includes displacement, theta, phi, and ccp.

2_maskccpN: Remove outliers by ccp < 0.05 and convert displacement to velocity (m/day).

    3_rockpoint: Extract velocities at non-moving region

    4_constant_detrend: removed orbit error

    5_Tidal_correction: remove atmospheric and tidal induced error

    6_rockpoint: Extract non-aggregated velocities at non-moving region

6_vx_vy_v: transform velocities from va/vr to vx/vy

    7_rockpoint: Extract aggregated velocities at non-moving region

    7_vx_vy_v_aggregate_and_error_estimate: inverse weighted average of three ice velocity maps and calculate the error maps

    8_strain_rate: calculated strain rate from aggregate ice velocity

    9_compare: store the results before and after tidal correction and aggregation.

10_Block_result: time series results extracted from 2D data.

11_MALAB_output_png_result: Stores .png files and time series results.

12_DROT: Differential Range Offset Tracking results

13_ICESat_2: ICESat-2 .h5 and .mat files can be put here (this directory only includes the samples from tracks 0965 and 1094)

    14_MODIS_images: you can store MODIS images here

    shp: grounding line, rock region, ice front, and other shape files.

File 2: PIG_front_1947_2023.zip

Includes ice front position shapefiles from 1947 to 2023, which are used for plotting Figure 1 in the paper.

File 3: PIG_DROT_GL_2016_2021.zip

Includes grounding line position shapefiles from 2016 to 2021, which are used for plotting Figure 1 in the paper.

Data was derived from the following sources:
These links can be found in the MATLAB scripts or in the "Open Research" section of the paper.

  17. Euler number calculation with spots

    • scidb.cn
    Updated May 19, 2025
    Cite
    Yu Zhang (2025). Euler number calculation with spots [Dataset]. http://doi.org/10.57760/sciencedb.25091
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 19, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Yu Zhang
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Since the small spots in the slices were not completely removed, the calculation of the Euler number was incorrect. Therefore, taking Sr30 as an example, we provide the original liquid phase, the liquid phase after noise removal, and the three-phase data of the noise. After recalculating the Euler number, we confirmed that the calculation error was caused by the noise. The noise removal operation can be performed in ImageJ as follows: Process > Noise > Remove Outliers, with parameters set to Radius = 5 and Threshold = 0.50.

  18. High frequency dataset for event-scale concentration-discharge analysis in a...

    • hydroshare.org
    • search.dataone.org
    zip
    Updated Sep 19, 2024
    Cite
    Andreas Musolff (2024). High frequency dataset for event-scale concentration-discharge analysis in a forested headwater 01/2018-08/2023 [Dataset]. http://doi.org/10.4211/hs.9be43573ba754ec1b3650ce233fc99de
    Explore at:
Available download formats: zip (17.1 MB)
    Dataset updated
    Sep 19, 2024
    Dataset provided by
    HydroShare
    Authors
    Andreas Musolff
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2018 - Aug 23, 2023
    Area covered
    Description

This composite repository contains high-frequency data of discharge, electrical conductivity, nitrate-N, DOC and water temperature obtained in the Rappbode headwater catchment in the Harz mountains, Germany. This catchment was affected by a bark-beetle infestation and forest dieback from 2018 onwards. The data extend previous observations from the same catchment (RB) published as part of Musolff (2020). Details on the catchment can be found here: Werner et al. (2019, 2021), Musolff et al. (2021). The file RB_HF_data_2018_2023.txt contains measurements for each timestep using the following columns: "index" (number of observation), "Date.Time" (timestamp in YYYY-MM-DD HH:MM:SS), "WT" (water temperature in degrees Celsius), "Q.smooth" (discharge in mm/d smoothed using a moving average), "NO3.smooth" (nitrate concentrations in mg N/L smoothed using a moving average), "DOC.smooth" (dissolved organic carbon concentrations in mg/L, smoothed using a moving average), "EC.smooth" (electrical conductivity in µS/cm smoothed using a moving average); NA - no data.

    Water quality data and discharge was measured at a high-frequency interval of 15 min in the time period between January 2018 and August 2023. Both, NO3-N and DOC were measured using an in-situ UV-VIS probe (s::can spectrolyser, scan Austria). EC was measured using an in-situ probe (CTD Diver, Van Essen Canada). Discharge measurements relied on an established stage-discharge relationship based on water level observations (CTD Diver, Van Essen Canada, see Werner et al. [2019]). Data loggers were maintained every two weeks, including manual cleaning of the UV-VIS probes and grab sampling for subsequent lab analysis, calibration and validation.

Data preparation included five steps: drift correction, outlier detection, gap filling, calibration and moving averaging:
- Drift was corrected by distributing the offset between the mean values one hour before and after cleaning equally over the two-week maintenance interval as an exponential growth.
- Outliers were detected with a two-step procedure. First, values outside a physically unlikely range were removed. Second, the Grubbs test, to detect and remove outliers, was applied to a moving window of 100 values (see the sketch below).
- Data gaps smaller than two hours were filled using cubic spline interpolation.
- The resulting time series were globally calibrated against the lab-measured concentrations of NO3-N and DOC. EC was calibrated against field values obtained with a handheld WTW probe (WTW Multi 430, Xylem Analytics Germany).
- Noise in the signal of both discharge and water quality was reduced by a moving average with a window length of 2.5 hours.
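A rough R sketch of the second screening step (a Grubbs test over windows of 100 values); it computes the Grubbs statistic directly, uses consecutive rather than moving windows for brevity, and is an illustration rather than the processing code used for this dataset:

# Grubbs statistic and two-sided critical value at level alpha
grubbs_flag <- function(x, alpha = 0.05) {
  n <- length(x)
  g <- abs(x - mean(x)) / sd(x)
  t <- qt(1 - alpha / (2 * n), df = n - 2)
  gcrit <- (n - 1) / sqrt(n) * sqrt(t^2 / (n - 2 + t^2))
  g > gcrit                               # TRUE where a value exceeds the Grubbs threshold
}

# apply to consecutive windows of 100 values of a series (e.g. nitrate-N)
flag_windows <- function(v, width = 100) {
  idx <- split(seq_along(v), ceiling(seq_along(v) / width))
  unlist(lapply(idx, function(i) grubbs_flag(v[i])))
}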

References:
Musolff, A. (2020). High frequency dataset for event-scale concentration-discharge analysis. http://www.hydroshare.org/resource/27c93a3f4ee2467691a1671442e047b8
Musolff, A., Zhan, Q., Dupas, R., Minaudo, C., Fleckenstein, J. H., Rode, M., Dehaspe, J., & Rinke, K. (2021). Spatial and Temporal Variability in Concentration-Discharge Relationships at the Event Scale. Water Resources Research, 57(10).
Werner, B. J., A. Musolff, O. J. Lechtenfeld, G. H. de Rooij, M. R. Oosterwoud, and J. H. Fleckenstein (2019), High-frequency measurements explain quantity and quality of dissolved organic carbon mobilization in a headwater catchment, Biogeosciences, 16(22), 4497-4516.
Werner, B. J., Lechtenfeld, O. J., Musolff, A., de Rooij, G. H., Yang, J., Grundling, R., Werban, U., & Fleckenstein, J. H. (2021). Small-scale topography explains patterns and dynamics of dissolved organic carbon exports from the riparian zone of a temperate, forested catchment. Hydrology and Earth System Sciences, 25(12), 6067-6086.

  19. flea beetle data

    • kaggle.com
    zip
    Updated Apr 6, 2023
    Cite
    Yerkin Mudebayev (2023). flea beetle data [Dataset]. https://www.kaggle.com/datasets/yerkinmudebayev/flea-beetle-data
    Explore at:
Available download formats: zip (950,663 bytes)
    Dataset updated
    Apr 6, 2023
    Authors
    Yerkin Mudebayev
    Description

The project is to conduct a principal components analysis of the flea beetle data (fleabeetledata.xlsx; Lubischew, A., On the use of discriminant functions in taxonomy, Biometrics 18 (1962), 455-477). The data has two groups. You will conduct three principal component analyses, one for each individual group and one for the entire data set ignoring groups. You will use S for the PCA.
(a) Carry out an initial investigation. Do not remove outliers or transform the data. Indicate if you had to process the data file in any way. Explain any conclusions drawn from the evidence and back up your conclusions. Hint: Pay attention to potential differences between the groups.
(b) For the Haltica oleracea group, i. Display the relevant sample covariance matrix S. ii. List the eigenvalues and describe the percent contributions to the variance. iii. Determine the number of principal components to retain and justify your answer by considering at least three methods. iv. Give the eigenvectors for the principal components you retain. v. Considering the coefficients of the principal components, describe dependencies of the principal components on the variables. vi. Using at least the first two principal components, display scatter plots of pairs of principal components. Make observations about the plots.
(c) For the Haltica carduorum group, repeat items i.-vi. as in (b).
(d) For the entire data set (ignoring groups), repeat items i.-vi. as in (b).
(e) Compare the results for the three principal component analyses. Do you have any conclusions?
Key for Flea Beetle Data: x1 = distance of the transverse groove from the posterior border of the prothorax; x2 = length of the elytra; x3 = length of the second antennal joint; x4 = length of the third antennal joint.

  20. housing

    • kaggle.com
    zip
    Updated Sep 22, 2023
    Cite
    HappyRautela (2023). housing [Dataset]. https://www.kaggle.com/datasets/happyrautela/housing
    Explore at:
Available download formats: zip (809,785 bytes)
    Dataset updated
    Sep 22, 2023
    Authors
    HappyRautela
    Description

    The exercise after this contains questions that are based on the housing dataset.

    1. How many houses have a waterfront? a. 21000 b. 21450 c. 163 d. 173

    2. How many houses have 2 floors? a. 2692 b. 8241 c. 10680 d. 161

    3. How many houses built before 1960 have a waterfront? a. 80 b. 7309 c. 90 d. 92

    4. What is the price of the most expensive house having more than 4 bathrooms? a. 7700000 b. 187000 c. 290000 d. 399000

    5. For instance, if the ‘price’ column consists of outliers, how can you make the data clean and remove the redundancies? a. Calculate the IQR range and drop the values outside the range. b. Calculate the p-value and remove the values less than 0.05. c. Calculate the correlation coefficient of the price column and remove the values less than the correlation coefficient. d. Calculate the Z-score of the price column and remove the values less than the z-score.

    6. What are the various parameters that can be used to determine the dependent variables in the housing data to determine the price of the house? a. Correlation coefficients b. Z-score c. IQR Range d. Range of the Features

    7. If we get the r2 score as 0.38, what inferences can we make about the model and its efficiency? a. The model is 38% accurate, and shows poor efficiency. b. The model is showing 0.38% discrepancies in the outcomes. c. Low difference between observed and fitted values. d. High difference between observed and fitted values.

    8. If the metrics show that the p-value for the grade column is 0.092, what all inferences can we make about the grade column? a. Significant in presence of other variables. b. Highly significant in presence of other variables c. insignificance in presence of other variables d. None of the above

    9. If the Variance Inflation Factor value for a feature is considerably higher than the other features, what can we say about that column/feature? a. High multicollinearity b. Low multicollinearity c. Both A and B d. None of the above
