Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The HDoutliers algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it suffers from some limitations that significantly hinder its performance under certain circumstances. In this article, we propose an algorithm that addresses these limitations. We define an anomaly as an observation whose k-nearest-neighbor distance with the maximum gap is significantly different from what we would expect if the distribution of k-nearest-neighbor distances with the maximum gap is in the maximum domain of attraction of the Gumbel distribution. An approach based on extreme value theory is used to calculate the anomalous threshold. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithm, which we call the stray algorithm. We also demonstrate how this algorithm can assist in detecting anomalies present in other data structures using feature engineering. We show the situations where the stray algorithm outperforms the HDoutliers algorithm in both accuracy and computational time. This framework is implemented in the open-source R package stray. Supplementary materials for this article are available online.
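As orientation for users of the package, here is a minimal sketch of running stray on toy data. It assumes the CRAN release of stray, whose main entry point is find_HDoutliers(); argument names and defaults may differ across package versions.

```r
# Minimal sketch, assuming the CRAN 'stray' package; argument names and
# defaults may differ across versions.
# install.packages("stray")
library(stray)

set.seed(1)
X <- rbind(matrix(rnorm(200), ncol = 2),  # 100 two-dimensional "typical" points
           c(10, 10))                     # one obvious anomaly

res <- find_HDoutliers(X, alpha = 0.01, k = 10)
res$outliers  # indices of observations flagged as anomalies
```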
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The objects are numbered. The Y-variable is the boiling point. The other features are structural features of the molecules. In the outlier column, outliers are assigned a value of 1.
The data are derived from a published chemical dataset on boiling point measurements [1] and from public data [2]. Features were generated by means of the RDKit Python library [3]. The dataset was infused with known outliers (~5%) based on significant structural differences, i.e., mixing polar and non-polar molecules.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "Amsterdam Library of Object Images" is a collection of 110250 images of 1000 small objects, taken under various light conditions and rotation angles. All objects were placed on a black background. Thus the images are taken under rather uniform conditions, which means there is little uncontrolled bias in the data set (unless mixed with other sources). They do however not resemble a "typical" image collection. The data set has a rather unique property for its size: there are around 100 different images of each object, so it is well suited for clustering. By downsampling some objects it can also be used for outlier detection. For multi-view research, we offer a number of different feature vector sets for evaluating this data set.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This section presents a discussion of the research data. The data were received as secondary data; however, they were originally collected using time study techniques. Data validation is a crucial step in the data analysis process to ensure that the data are accurate, complete, and reliable. Descriptive statistics were used to validate the data. The mean, mode, standard deviation, variance, and range provide a summary of the data distribution and assist in identifying outliers or unusual patterns.

The dataset reports measures of central tendency: the mean, median, and mode. The mean is the average value of each of the factors presented in the tables and represents the balance point of the dataset, its typical value and behaviour. The median is the middle value of the dataset for each factor: half of the values lie below it and half lie above it, which is particularly informative for skewed distributions. The mode is the most common value in the dataset and was used to describe the most typical observation. Together, these values describe the central value around which the data are distributed. Because the mean, mode, and median are neither similar nor close to one another, they indicate a skewed distribution.

The dataset also presents the results and a discussion of the results. This section focuses on the customisation of the DMAIC (Define, Measure, Analyse, Improve, Control) framework to address the specific concerns outlined in the problem statement. To gain a comprehensive understanding of the current process, value stream mapping was employed, enhanced by measuring the factors that contribute to inefficiencies. These factors were then analysed and ranked by impact using factor analysis. To mitigate the impact of the most influential factor on project inefficiencies, a solution is proposed using the EOQ (Economic Order Quantity) model (a worked sketch follows below). The implementation of the 'CiteOps' software facilitates improved scheduling, monitoring, and task delegation in the construction project through digitalisation; furthermore, project progress and efficiency are monitored remotely and in real time. In summary, the DMAIC framework was tailored to the requirements of the specific project, incorporating techniques from inventory management, project management, and statistics to effectively minimise inefficiencies within the construction project.
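Since the EOQ model is the named mitigation tool, a minimal worked sketch of the standard EOQ formula may help; the demand, ordering-cost, and holding-cost values below are hypothetical and for illustration only.

```r
# Standard EOQ formula: EOQ = sqrt(2 * D * S / H), where D is annual demand,
# S is the fixed cost per order, and H is the annual holding cost per unit.
# The numbers below are hypothetical.
eoq <- function(D, S, H) sqrt(2 * D * S / H)
eoq(D = 12000, S = 150, H = 2.4)  # ~1225 units per order
```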
National, regional
Households
Sample survey data [ssd]
The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of all communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and 15 households are then randomly selected in each EA for interview. We use the large module of the 2018 baseline survey to select households for the official VHFPS interviews, and the small-module households as a reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.
Computer Assisted Telephone Interview [cati]
The questionnaire for Round 2 consisted of the following sections:
Section 2. Behavior
Section 3. Health
Section 5. Employment (main respondent)
Section 6. Coping
Section 7. Safety Nets
Section 8. FIES
Data cleaning began during the data collection process. Inputs for the cleaning process include interviewers’ notes following each question item, interviewers’ notes at the end of the tablet form, and supervisors’ notes made during monitoring. The data cleaning process was conducted in the following steps:
• Append households interviewed in ethnic minority languages with the main dataset interviewed in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO
• Remove household duplicates in the dataset where the same form is submitted more than once.
• Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
• Format variables according to their object type (string, integer, decimal, etc.).
• Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers find it difficult to choose a correct code, they are advised to choose the most appropriate one and write down the respondent’s answer in detail so that the survey management team can later decide which code best suits the answer.
• Correct data based on supervisors’ notes where enumerators entered a wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write text specifying the answer. The data cleaning team checked these answers thoroughly to decide whether each needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer was assigned a completely new code if it appeared many times in the survey dataset.
• Examine the accuracy of outlier values, defined as values below the 5th percentile or above the 95th percentile, by listening to interview recordings (see the sketch after this list).
• Final check on matching the main dataset with the section datasets; sections where information is asked at the individual level are kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
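The outlier screen in the steps above can be expressed compactly. A minimal sketch follows; the variable name is hypothetical and the data simulated.

```r
# Flag values below the 5th or above the 95th percentile (the outlier
# definition used in the cleaning step above).
flag_outliers <- function(x) {
  q <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE)
  x < q[1] | x > q[2]
}

income <- c(rnorm(100, mean = 50, sd = 10), 500)  # toy variable, one extreme value
which(flag_outliers(income))                      # indices to review against recordings
```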
A structured, self-report questionnaire designed by our research team was used to develop a customized dataset. The questionnaire was administered online and comprised 4 main sections:
• Demographics: age, gender, and education.
• Technology and social media use: daily hours of screen time, time spent on social media, main platforms used, and preference for technology usage (work or leisure).
• Psychological and cognitive indicators: self-rated concentration during the study (1–5), number of interruptions, change in mood following technology use, and perceived difficulty concentrating while using social media.
• Self-awareness and coping: perception of overuse, concerns about technology use, use of apps to reduce mental fatigue, and use of strategies to reduce usage duration.
The responses were numerical. Respondents with missing or invalid responses were removed during the pre-processing stage. A new binary response was defined: Brain Rot (Yes/No). A participant was deemed to have brain rot if they exhibited 3 or more of the 6 brain rot patterns:
• Social media use ≥ 3 hours per day
• Screen time ≥ 4 hours per day
• Focus level ≤ 2 out of 5
• Reports frequent distraction
• Notices mood shift when using technology
• Thinks social media is bad for mental health
This was the target variable and the outcome label for classification (see the sketch following this description). Before analysis, the dataset was cleaned and pre-processed as follows:
• Elimination of incomplete or contradictory records
• Conversion of categorical variables into numerical form (yes = 1, no = 0)
• Normalization of numerical features, where necessary
• Treatment of outliers and testing for normality
The final dataset was balanced, well formatted for statistical and machine learning analyses, and presented with well-defined input features and a binary classification output.
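A minimal sketch of deriving the binary target from the 6 criteria above; the column names are hypothetical stand-ins for the survey items, and the toy data are invented.

```r
# Toy data frame; column names are hypothetical stand-ins for the survey items.
df <- data.frame(
  social_media_hours   = c(4, 1),
  screen_time_hours    = c(6, 2),
  focus_level          = c(2, 4),
  frequent_distraction = c(1, 0),
  mood_shift           = c(1, 0),
  social_media_bad     = c(1, 0)
)

# Count how many of the 6 brain rot patterns each respondent exhibits.
n_criteria <- with(df,
  (social_media_hours >= 3) + (screen_time_hours >= 4) +
  (focus_level <= 2) + (frequent_distraction == 1) +
  (mood_shift == 1) + (social_media_bad == 1))

df$brain_rot <- as.integer(n_criteria >= 3)  # 1 = "brain rot", 0 = not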
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
The USDA-Agricultural Research Service Central Plains Experimental Range (CPER) is a Long-Term Agroecosystem Research (LTAR) network site located ~20 km northeast of Nunn, in north-central Colorado, USA. In 1939, scientists established the Long-term Grazing Intensity study (LTGI) with four replications of light, moderate, and heavy grazing. Each replication had three 129.5 ha pastures with the grazing intensity treatment randomly assigned. Today, one replication remains. Light grazing occurs in pasture 23W (9.3 Animal Unit Days (AUD)/ha, targeted for 20% utilization of peak growing-season biomass), moderate grazing in pasture 15E (12.5 AUD/ha, 40% utilization), and heavy grazing in pasture 23E (18.6 AUD/ha, 60% utilization). British- and continental-breed yearling cattle graze the pastures season-long from mid-May to October, except when forage limitations shorten the grazing season. Individual raw data on cattle entry and exit weights, as well as weights taken every 28 days during the grazing season, are available from 2000 to 2019. Cattle entry and exit weights are included in this dataset. Weight outliers (± 2 SD) are flagged for calculating summary statistics or performing statistical analysis (see the sketch after the resource list).
Resources in this dataset:
Resource Title: Data Dictionary for LTGI Cattle weights on CPER (2000-2019). File Name: LTGI_2000-2019_data_dictionary.csv. Resource Description: Data dictionary for data from the USDA ARS Central Plains Experimental Range (CPER) near Nunn, CO: cattle weight gains managed with light, moderate, and heavy grazing intensities.
Resource Title: LTGI Cattle weights on CPER (2000-2019). File Name: LTGI_2000-2019_all_weights_published.csv. Resource Description: Data from the USDA ARS Central Plains Experimental Range (CPER) near Nunn, CO: cattle weight gains managed with light, moderate, and heavy grazing intensities.
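A minimal sketch of the ± 2 SD flagging rule described above; the weights are simulated, and the rule as implemented in the published dataset may differ in detail.

```r
# Flag weights more than 2 standard deviations from the mean (simulated data).
set.seed(42)
weights <- c(rnorm(50, mean = 300, sd = 15), 400)  # 50 typical weights + one extreme
z <- (weights - mean(weights)) / sd(weights)
which(abs(z) > 2)                                  # indices of flagged weights
```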
Our target was to predict gender, age, and emotion from audio. We found labeled audio datasets from Mozilla and RAVDESS. Using the R programming language, 20 statistical features were extracted, and after adding the labels, these datasets were formed. Audio files were collected from "Mozilla Common Voice" and the “Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)”.
Datasets contain 20 feature columns and 1 column denoting the label. The 20 statistical features were extracted through frequency spectrum analysis using the R programming language. They are:
1) meanfreq - The mean frequency (in kHz) is a pitch measure that assesses the center of the distribution of power across frequencies.
2) sd - The standard deviation of frequency describes the dispersion of the frequencies relative to the mean and is calculated as the square root of the variance.
3) median - The median frequency (in kHz) is the middle number in the sorted list of frequencies.
4) Q25 - The first quartile (in kHz), referred to as Q1, is the median of the lower half of the data set; about 25 percent of the values lie below Q1 and about 75 percent above it.
5) Q75 - The third quartile (in kHz), referred to as Q3, is the median of the upper half of the data set.
6) IQR - The interquartile range (in kHz) is a measure of statistical dispersion, equal to the difference between the 75th and 25th percentiles (the upper and lower quartiles).
7) skew - The skewness is the degree of distortion from the normal distribution; it measures the lack of symmetry in the data distribution.
8) kurt - The kurtosis is a statistical measure that determines how much the tails of a distribution differ from the tails of a normal distribution; it reflects the presence of outliers in the distribution.
9) sp.ent - The spectral entropy is a measure of signal irregularity computed from the normalized spectral power.
10) sfm - The spectral flatness, or tonality coefficient, also known as Wiener entropy, is a measure used in digital signal processing to characterize an audio spectrum; usually measured in decibels, it quantifies how tone-like a sound is, as opposed to noise-like.
11) mode - The mode frequency is the most frequently observed frequency value.
12) centroid - The spectral centroid is a metric used in digital signal processing to describe where the center of mass of the spectrum is located.
13) meanfun - The average fundamental frequency measured across the acoustic signal.
14) minfun - The minimum fundamental frequency measured across the acoustic signal.
15) maxfun - The maximum fundamental frequency measured across the acoustic signal.
16) meandom - The average dominant frequency measured across the acoustic signal.
17) mindom - The minimum dominant frequency measured across the acoustic signal.
18) maxdom - The maximum dominant frequency measured across the acoustic signal.
19) dfrange - The range of dominant frequency measured across the acoustic signal.
20) modindx - The modulation index, which measures the degree of frequency modulation, expressed numerically as the ratio of the frequency deviation to the frequency of the modulating signal for a pure-tone modulation.
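The original extraction script is not part of this description, but spectral features of this kind are commonly computed in R with the seewave and tuneR packages. A hedged sketch follows; the file name is hypothetical, and function arguments may differ across package versions.

```r
# Hedged sketch: spectral properties of the kind listed above, via 'seewave'.
library(tuneR)
library(seewave)

wav   <- readWave("sample.wav")                      # hypothetical input file
spec  <- meanspec(wav, f = wav@samp.rate, plot = FALSE)
props <- specprop(spec, f = wav@samp.rate)
props$mean; props$sd; props$median; props$Q25; props$Q75  # cf. meanfreq, sd, ...

f0 <- fund(wav, f = wav@samp.rate, plot = FALSE)     # fundamental frequency track
mean(f0[, 2], na.rm = TRUE)                          # cf. meanfun
```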
Gender and Age Audio Data Source: https://commonvoice.mozilla.org/en
Emotion Audio Data Source: https://smartlaboratory.org/ravdess/
The U.S. Geological Survey (USGS), in cooperation with the Missouri Department of Natural Resources (MDNR), collects data pertaining to the surface-water resources of Missouri. These data are collected as part of the Missouri Ambient Water-Quality Monitoring Network (AWQMN) and are stored and maintained by the USGS National Water Information System (NWIS) database. These data constitute a valuable source of reliable, impartial, and timely information for developing an improved understanding of the water resources of the State. Water-quality data collected between water years 1993 and 2017 were analyzed for long-term trends, and the network was investigated to identify data gaps or redundant data to assist MDNR in optimizing the network in the future. This is a companion data release to the Scientific Investigations Report: Richards, J.M., and Barr, M.N., 2021, General water-quality conditions, long-term trends, and network analysis at selected sites within the Ambient Water-Quality Monitoring Network in Missouri, water years 1993–2017: U.S. Geological Survey Scientific Investigations Report 2021–5079, 75 p., https://doi.org/10.3133/sir20215079.
The following selected tables are included in this data release in compressed (.zip) format:
AWQMN_EGRET_data.xlsx -- Data retrieved from the USGS National Water Information System database that were quality assured and conditioned for network analysis of the Missouri Ambient Water-Quality Monitoring Network.
AWQMN_R-QWTREND_data.xlsx -- Data retrieved from the USGS National Water Information System database that were quality assured and conditioned for analysis of flow-weighted trends for selected sites in the network.
AWQMN_R-QWTREND_outliers.xlsx -- Data flagged as outliers during analysis of flow-weighted trends for selected sites in the network.
AWQMN_R-QWTREND_outliers_quarterly.xlsx -- Data flagged as outliers during analysis of flow-weighted trends using a simulated quarterly sampling frequency dataset for selected sites in the network.
AWQMN_descriptive_statistics_WY1993-2017.xlsx -- Descriptive statistics for selected water-quality parameters at selected sites in the network.
The following selected graphics are included in this data release in .pdf format; each is also provided as an accessible web page in compressed .zip format presenting the same information:
Annual and seasonal discharge trends.pdf / Annual_and_seasonal_discharge_trends_htm.zip -- Graphics of discharge trends produced from the EGRET software for selected sites in the network.
Graphics of simulated quarterly sampling frequency trends.pdf / Graphics_of_simulated_quarterly_sampling_frequency_trends_htm.zip -- Graphics of results of simulated quarterly sampling frequency trends produced by the R-QWTREND software at selected sites in the network.
Graphics of median parameter values.pdf / Graphics_of_median_parameter_values_htm.zip -- Graphics of median values for selected parameters at selected sites in the network.
Parameter value versus time.pdf / Parameter_value_versus_time_htm.zip -- Scatter plots of the value of selected parameters versus time at selected sites in the network.
Parameter value versus discharge.pdf / Parameter_value_versus_discharge_htm.zip -- Scatter plots of the value of selected parameters versus discharge at selected sites in the network.
Boxplot of parameter value distribution by season.pdf / Boxplot_of_parameter_value_distribution_by_season_htm.zip -- Seasonal boxplots of selected parameters from selected sites in the network. Seasons are defined as winter (December, January, February), spring (March, April, May), summer (June, July, August), and fall (September, October, November).
Boxplot of sampled discharge compared with mean daily discharge.pdf / Boxplot_of_sampled_discharge_compared_with_mean_daily_discharge_htm.zip -- Boxplots of the distribution of discharge collected at the time of sampling of selected parameters compared with the period-of-record discharge distribution for selected sites in the network.
Boxplot of parameter value distribution by month.pdf / Boxplot_of_parameter_value_distribution_by_month_htm.zip -- Monthly boxplots of selected parameters from selected sites in the network.
All graphics are provided to support the interpretations in the Scientific Investigations Report.
This dataset is a summary of the CTD profiles measured with the RV Belgica. It provides general meta-information such as the campaign code, the date of measurement, and the geographical location. An important field is the profile quality flag, which describes the validity of the data: a quality flag of 2 means the data are generally good, although some outliers may still be present; a quality flag of 4 means the data should not be trusted. One-meter binned data can be downloaded from the SeaDataNet CDI portal (enter the cruise_id in the search bar) ONLY for the good quality profiles. Full acquisition frequency datasets are available on request from BMDC.
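For example, keeping only good-quality profiles before requesting binned data might look like the following sketch; the column names are hypothetical, as the summary table's actual field names may differ.

```r
# Keep only profiles whose quality flag indicates generally good data (flag 2).
profiles <- data.frame(cruise_id    = c("BG2021_01", "BG2021_02"),  # toy values
                       quality_flag = c(2, 4))
good <- subset(profiles, quality_flag == 2)  # flag-4 profiles should not be trusted
```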
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Description
Water Quality Parameters: Ammonia, BOD, DO, Orthophosphate, pH, Temperature, Nitrogen, Nitrate.
Countries/Regions: United States, Canada, Ireland, England, China.
Years Covered: 1940-2023.
Data Records: 2.82 million.
Definition of Columns
Country: Name of the water-body region.
Area: Name of the area in the region.
Waterbody Type: Type of the water-body source.
Date: Date of the sample collection (dd-mm-yyyy).
Ammonia (mg/l): Ammonia concentration.
Biochemical Oxygen Demand (BOD) (mg/l): Oxygen demand measurement.
Dissolved Oxygen (DO) (mg/l): Concentration of dissolved oxygen.
Orthophosphate (mg/l): Orthophosphate concentration.
pH (pH units): pH level of water.
Temperature (°C): Temperature in Celsius.
Nitrogen (mg/l): Total nitrogen concentration.
Nitrate (mg/l): Nitrate concentration.
CCME_Values: Calculated water quality index values using the CCME WQI model.
CCME_WQI: Water Quality Index classification based on CCME_Values.
Data Directory Description:
Category 1: Dataset
Combined Data: This folder contains two files: Combined_dataset.csv and Summary.xlsx. The Combined_dataset.csv file includes all eight water quality parameter readings across five countries, with additional data for initial preprocessing steps like missing value handling, outlier detection, and other operations. It also contains the CCME Water Quality Index calculation for empirical analysis and ML-based research. The Summary.xlsx provides a brief description of the datasets, including data distributions (e.g., maximum, minimum, mean, standard deviation).
- Combined_dataset.csv
- Summary.xlsx
Country-wise Data: This folder contains separate country-based datasets in CSV files. Each file includes the eight water quality parameters for regional analysis. The Summary_country.xlsx file presents country-wise dataset descriptions with data distributions (e.g., maximum, minimum, mean, standard deviation).
- England_dataset.csv
- Canada_dataset.csv
- USA_dataset.csv
- Ireland_dataset.csv
- China_dataset.csv
- Summary_country.xlsx
Category 2: Code
Data processing and harmonization code (e.g., language conversion, date conversion, parameter naming and unit conversion, missing value handling, WQI measurement and classification):
- Data_Processing_Harmonnization.ipynb
Code used for technical validation (e.g., assessing the data distribution, outlier detection, water quality trend analysis, and verifying the application of the dataset for the ML models):
- Technical_Validation.ipynb
Category 3: Data Collection Sources
This category includes links to the selected dataset sources, which were used to create the dataset and are provided for further reconstruction or data formation:
- DataCollectionSources.xlsx
Original Paper Title: A Comprehensive Dataset of Surface Water Quality Spanning 1940-2023 for Empirical and ML Adopted Research
Abstract
Assessment and monitoring of surface water quality are essential for food security, public health, and ecosystem protection. Although water quality monitoring is a known phenomenon, little effort has been made to offer a comprehensive and harmonized dataset for surface water at the global scale. This study presents a comprehensive surface water quality dataset that preserves the spatio-temporal variability, integrity, consistency, and depth of the data to facilitate empirical and data-driven evaluation, prediction, and forecasting.
The dataset is assembled from a range of sources, including regional and global water quality databases, water management organizations, and individual research projects from five prominent countries: the USA, Canada, Ireland, England, and China. The resulting dataset consists of 2.82 million measurements of eight water quality parameters spanning 1940-2023. This dataset can support meta-analysis of water quality models and can facilitate Machine Learning (ML) based, data- and model-driven investigation of the spatial and temporal drivers and patterns of surface water quality at cross-regional to global scales.
Note: Cite this repository and the original paper when using this dataset.
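The dataset's own WQI computation lives in Data_Processing_Harmonnization.ipynb; as orientation, here is a minimal sketch of the published CCME WQI formulation, assuming upper-limit objectives only (the notebook's implementation may differ in detail, and the toy values are invented).

```r
# Hedged sketch of the CCME WQI: WQI = 100 - sqrt(F1^2 + F2^2 + F3^2) / 1.732,
# where F1 = scope (% of variables failing at least once), F2 = frequency
# (% of individual tests failing), and F3 = amplitude (from the normalized
# sum of excursions, nse). Assumes all objectives are upper limits.
ccme_wqi <- function(values, objectives) {
  # values: matrix (tests x variables); objectives: upper limit per variable
  failed <- sweep(values, 2, objectives, ">")
  F1 <- 100 * mean(apply(failed, 2, any))          # scope
  F2 <- 100 * mean(failed)                         # frequency
  exc <- pmax(sweep(values, 2, objectives, "/") - 1, 0)
  nse <- sum(exc) / length(values)                 # normalized sum of excursions
  F3  <- nse / (0.01 * nse + 0.01)                 # amplitude
  100 - sqrt(F1^2 + F2^2 + F3^2) / 1.732
}

# Toy example: two parameters, three samples each (hypothetical objectives).
vals <- cbind(ammonia = c(0.2, 1.5, 0.3), nitrate = c(9, 14, 11))
ccme_wqi(vals, objectives = c(ammonia = 1.0, nitrate = 13))
```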