20 datasets found
  1. Data from: Methodology to filter out outliers in high spatial density data...

    • scielo.figshare.com
    jpeg
    Updated Jun 4, 2023
    Cite
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken (2023). Methodology to filter out outliers in high spatial density data to improve maps reliability [Dataset]. http://doi.org/10.6084/m9.figshare.14305658.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    SciELO journals
    Authors
    Leonardo Felipe Maldaner; José Paulo Molin; Mark Spekken
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT The considerable volume of data generated by sensors in the field presents systematic errors; thus, it is extremely important to exclude these errors to ensure mapping quality. The objective of this research was to develop and test a methodology to identify and exclude outliers in high-density spatial data sets, and to determine whether the developed filter process could help decrease the nugget effect and improve the spatial variability characterization of high sampling data. We created a filter composed of a global, anisotropic, and an anisotropic local analysis of data, which considered the respective neighborhood values. For that purpose, we used the median as the main statistical parameter to classify a given spatial point in the data set, taking into account its neighbors within a radius. The filter was tested using raw data sets of corn yield, soil electrical conductivity (ECa), and the sensor vegetation index (SVI) in sugarcane. The results showed an improvement in the accuracy of spatial variability within the data sets. The methodology reduced RMSE by 85%, 97%, and 79% in corn yield, soil ECa, and SVI, respectively, compared to interpolation errors of the raw data sets. The filter excluded the local outliers, which considerably reduced the nugget effect and the estimation error of the interpolated data. The methodology proposed in this work performed better at removing outlier data than two other methodologies from the literature.
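As a rough illustration of this kind of neighborhood-based filtering (not the paper's exact global/local anisotropic procedure), the sketch below flags a point as a local outlier when its value deviates strongly from the median of its neighbors within a radius; the radius and threshold are illustrative assumptions.

```python
# Hedged sketch: flag local outliers by comparing each value against the median of
# its neighbours within a given radius. Radius and deviation threshold are illustrative.
import numpy as np
from scipy.spatial import cKDTree

def local_median_filter(x, y, value, radius=10.0, max_dev=3.0):
    """Return a boolean mask that is True for points kept after the neighbourhood test."""
    value = np.asarray(value, dtype=float)
    points = np.column_stack([x, y])
    tree = cKDTree(points)
    keep = np.ones(len(value), dtype=bool)
    for i, p in enumerate(points):
        idx = [j for j in tree.query_ball_point(p, r=radius) if j != i]
        if not idx:
            continue  # isolated point: leave it untouched
        med = np.median(value[idx])
        mad = np.median(np.abs(value[idx] - med)) or 1e-9  # robust spread estimate
        if abs(value[i] - med) > max_dev * 1.4826 * mad:
            keep[i] = False  # local outlier
    return keep
```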

  2. Data from: Outlier classification using autoencoders: application for...

    • osti.gov
    Updated Jun 2, 2021
    + more versions
    Cite
    Bianchi, F. M.; Brunner, D.; Kube, R.; LaBombard, B. (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/1882649-outlier-classification-using-autoencoders-application-fluctuation-driven-flows-fusion-plasmas
    Explore at:
    Dataset updated
    Jun 2, 2021
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    Office of Science (http://www.er.doe.gov/)
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
    Authors
    Bianchi, F. M.; Brunner, D.; Kube, R.; LaBombard, B.
    Description

    Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
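The general pattern described here (learn a latent representation from valid samples, then classify ambiguous samples in that latent space) can be sketched as follows; layer sizes, the latent dimension, the classifier, and the variable names are assumptions, not the authors' implementation.

```python
# Hedged sketch: train an autoencoder on samples already known to be valid, then
# classify ambiguous samples in the learned latent space with a standard classifier.
import tensorflow as tf
from sklearn.linear_model import LogisticRegression

def build_autoencoder(n_features, latent_dim=8):
    inp = tf.keras.Input(shape=(n_features,))
    z = tf.keras.layers.Dense(64, activation="relu")(inp)
    z = tf.keras.layers.Dense(latent_dim, activation="relu", name="latent")(z)
    out = tf.keras.layers.Dense(64, activation="relu")(z)
    out = tf.keras.layers.Dense(n_features)(out)
    autoencoder = tf.keras.Model(inp, out)
    autoencoder.compile(optimizer="adam", loss="mse")
    encoder = tf.keras.Model(inp, z)
    return autoencoder, encoder

# X_valid (valid samples), X_labelled/y_labelled (a labelled subset), and X_ambiguous
# are assumed to be prepared upstream:
# ae, enc = build_autoencoder(X_valid.shape[1])
# ae.fit(X_valid, X_valid, epochs=50, batch_size=256, verbose=0)
# clf = LogisticRegression(max_iter=1000).fit(enc.predict(X_labelled), y_labelled)
# labels = clf.predict(enc.predict(X_ambiguous))
```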

  3. Cdd Dataset

    • universe.roboflow.com
    zip
    Updated Sep 5, 2023
    Cite
    hakuna matata (2023). Cdd Dataset [Dataset]. https://universe.roboflow.com/hakuna-matata/cdd-g8a6g/3
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 5, 2023
    Dataset authored and provided by
    hakuna matata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cucumber Disease Detection Bounding Boxes
    Description

    Project Documentation: Cucumber Disease Detection

    1. Title and Introduction Title: Cucumber Disease Detection

    Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.

    2. Problem Statement Problem Definition: The project uses image analysis methods to automate the identification of diseases, including Downy Mildew, in cucumber plants. Effective disease management in agriculture depends on early identification of illness.

    Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.

    Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.

    3. Data Collection and Preprocessing Data Sources: The dataset comprises images of cucumber plants from various sources, including both healthy and damaged specimens.

    Data Collection: Images were gathered from agricultural areas using cameras and smartphones.

    Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.

    4. Exploratory Data Analysis (EDA) The dataset was examined using visualizations such as scatter plots and histograms, and checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of images of healthy and diseased plants.

    5. Methodology Machine Learning Algorithms:

    Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:

    The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.

    6. Model Development The CNN architecture consists of convolutional layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
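A minimal sketch of a CNN of the kind described, with dropout and L2 regularization; the project does not publish its exact model here, so the architecture and all hyperparameter values below are assumptions.

```python
# Hedged sketch: a small CNN for healthy-vs-diseased classification with dropout and
# L2 regularization to limit overfitting (assumed architecture and hyperparameters).
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_model(input_shape=(224, 224, 3)):
    model = tf.keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu",
                      kernel_regularizer=regularizers.l2(1e-4)),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu",
                      kernel_regularizer=regularizers.l2(1e-4)),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                    # dropout regularization
        layers.Dense(1, activation="sigmoid"),  # healthy vs. diseased
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```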

    7. Model Training During training, the model was fed the prepared dataset over a number of epochs, and the loss function was minimized using an optimization method. Early stopping and model checkpoints were used to ensure convergence.

    8. Model Evaluation Evaluation Metrics:

    Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets. Performance Discussion:

    The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.

    9. Results and Discussion Key project findings include model performance and disease detection precision; a comparison of the models employed, showing the benefits and drawbacks of each; and the challenges faced throughout the project along with the methods used to address them.

    10. Conclusion A recap of the project's key learnings, highlighting the project's importance to early disease detection in agriculture. Future enhancements and potential research directions are suggested.

    11. References Libraries: Pillow, Roboflow, YOLO, Sklearn, matplotlib. Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1

    12. Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g

    Rafiur Rahman Rafit EWU 2018-3-60-111

  4. Heidelberg Tributary Loading Program (HTLP) Dataset

    • zenodo.org
    bin, png
    Updated Jul 16, 2024
    Cite
    NCWQR; NCWQR (2024). Heidelberg Tributary Loading Program (HTLP) Dataset [Dataset]. http://doi.org/10.5281/zenodo.6606950
    Explore at:
    Available download formats: bin, png
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    NCWQR; NCWQR
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is updated more frequently and can be visualized on NCWQR's data portal.

    If you have any questions, please contact Dr. Laura Johnson or Dr. Nathan Manning.

    The National Center for Water Quality Research (NCWQR) is a research laboratory at Heidelberg University in Tiffin, Ohio, USA. Our primary research program is the Heidelberg Tributary Loading Program (HTLP), where we currently monitor water quality at 22 river locations throughout Ohio and Michigan, effectively covering ~half of the land area of Ohio. The goal of the program is to accurately measure the total amounts (loads) of pollutants exported from watersheds by rivers and streams. Thus these data are used to assess different sources (nonpoint vs point), forms, and timing of pollutant export from watersheds. The HTLP officially began with high-frequency monitoring for sediment and nutrients from the Sandusky and Maumee rivers in 1974, and has continually expanded since then.

    Each station where samples are collected for water quality is paired with a US Geological Survey gage for quantifying discharge (http://waterdata.usgs.gov/usa/nwis/rt). Our stations cover a wide range of watershed areas upstream of the sampling point from 11.0 km2 for the unnamed tributary to Lost Creek to 19,215 km2 for the Muskingum River. These rivers also drain a variety of land uses, though a majority of the stations drain over 50% row-crop agriculture.

    At most sampling stations, submersible pumps located on the stream bottom continuously pump water into sampling wells inside heated buildings where automatic samplers collect discrete samples (4 unrefrigerated samples/d at 6-h intervals, 1974–1987; 3 refrigerated samples/d at 8-h intervals, 1988-current). At weekly intervals the samples are returned to the NCWQR laboratories for analysis. When samples either have high turbidity from suspended solids or are collected during high flow conditions, all samples for each day are analyzed. As stream flows and/or turbidity decreases, analysis frequency shifts to one sample per day. At the River Raisin and Muskingum River, a cooperator collects a grab sample from a bridge at or near the USGS station approximately daily and all samples are analyzed. Each sample bottle contains sufficient volume to support analyses of total phosphorus (TP), dissolved reactive phosphorus (DRP), suspended solids (SS), total Kjeldahl nitrogen (TKN), ammonium-N (NH4), nitrate-N and nitrite-N (NO2+3), chloride, fluoride, and sulfate. Nitrate and nitrite are commonly added together when presented; henceforth we refer to the sum as nitrate.

    Upon return to the laboratory, all water samples are analyzed within 72h for the nutrients listed below using standard EPA methods. For dissolved nutrients, samples are filtered through a 0.45 um membrane filter prior to analysis. We currently use a Seal AutoAnalyzer 3 for DRP, silica, NH4, TP, and TKN colorimetry, and a DIONEX Ion Chromatograph with AG18 and AS18 columns for anions. Prior to 2014, we used a Seal TRAACs for all colorimetry.

    2017 Ohio EPA Project Study Plan and Quality Assurance Plan

    Project Study Plan

    Quality Assurance Plan

    Data quality control and data screening

    The data provided in the River Data files have all been screened by NCWQR staff. The purpose of the screening is to remove outliers that staff deem likely to reflect sampling or analytical errors rather than outliers that reflect the real variability in stream chemistry. Often, in the screening process, the causes of the outlier values can be determined and appropriate corrective actions taken. These may involve correction of sample concentrations or deletion of those data points.

    This micro-site contains data for approximately 126,000 water samples collected beginning in 1974. We cannot guarantee that each data point is free from sampling bias/error, analytical errors, or transcription errors. However, since its beginnings, the NCWQR has operated a substantial internal quality control program and has participated in numerous external quality control reviews and sample exchange programs. These programs have consistently demonstrated that data produced by the NCWQR is of high quality.

    A note on detection limits and zero and negative concentrations

    It is routine practice in analytical chemistry to determine method detection limits and/or limits of quantitation, below which analytical results are considered less reliable or unreliable. This is something that we also do as part of our standard procedures. Many laboratories, especially those associated with agencies such as the U.S. EPA, do not report individual values that are less than the detection limit, even if the analytical equipment returns such values. This is in part because as individual measurements they may not be considered valid under litigation.

    The measured concentration consists of the true but unknown concentration plus random instrument error, which is usually small compared to the range of expected environmental values. In a sample for which the true concentration is very small, perhaps even essentially zero, it is possible to obtain an analytical result of 0 or even a small negative concentration. Results of this sort are often “censored” and replaced with a statement such as “below the detection limit.”

    Censoring these low values creates a number of problems for data analysis. How do you take an average? If you leave out these numbers, you get a biased result because you did not toss out any other (higher) values. Even if you replace negative concentrations with 0, a bias ensues, because you’ve chopped off some portion of the lower end of the distribution of random instrument error.

    For these reasons, we do not censor our data. Values of -9 and -1 are used as missing value codes, but all other negative and zero concentrations are actual, valid results. Negative concentrations make no physical sense, but they make analytical and statistical sense. Users should be aware of this, and if necessary make their own decisions about how to use these values. Particularly if log transformations are to be used, some decision on the part of the user will be required.
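A minimal sketch of how a user might apply this convention in practice, treating -9 and -1 as missing-value codes while keeping all other zero and negative concentrations; the filename and column names below are placeholders, not the actual HTLP file layout.

```python
# Hedged sketch: convert the -9 / -1 missing-value codes to NaN and keep all other
# zero and negative concentrations as valid results, as described above.
import numpy as np
import pandas as pd

df = pd.read_csv("htlp_river_data.csv")        # placeholder filename
analyte_cols = ["TP", "DRP", "NO23", "NH4"]     # placeholder analyte columns

df[analyte_cols] = df[analyte_cols].replace([-9, -1], np.nan)

# Remaining zero/negative values are real instrument readings; include them in means
# rather than censoring, so the low tail of the error distribution is preserved.
means = df[analyte_cols].mean(skipna=True)
print(means)
```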

    Analyte Detection Limits

    https://ncwqr.files.wordpress.com/2021/12/mdl-june-2019-epa-methods.jpg?w=1024

    For more information, please visit https://ncwqr.org/

  5. Credit Approval (Mixed Attributes)

    • kaggle.com
    Updated Dec 14, 2022
    Cite
    The Devastator (2022). Credit Approval (Mixed Attributes) [Dataset]. https://www.kaggle.com/datasets/thedevastator/improving-credit-approval-with-mixed-attributes/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 14, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Description

    Credit Approval (Mixed Attributes)

    Continuous and Categorical Features

    By UCI [source]

    About this dataset

    This dataset explores the phenomenon of credit card application acceptance or rejection. It includes a range of both continuous and categorical attributes, such as the applicant's gender, credit score, and income, as well as details about recent credit card activity including balance transfers and delinquency. This data presents a unique opportunity to investigate how these different attributes interact in determining application status. With careful analysis of this dataset, we can gain valuable insights into what factors help ensure a successful application outcome. This could lead to more effective strategies for predicting and improving financial credit access for everyone.


    How to use the dataset

    This dataset is an excellent resource for researching the credit approval process, as it provides a variety of attributes from both continuous and categorical sources. The aim of this guide is to provide tips on how to make the most of this dataset.
    • Understand the data: Before working with this dataset, understand what kind of information it contains. Since there is a mix of continuous and categorical attributes, familiarise yourself with all the columns before proceeding further.
    • Exploratory Analysis: Conduct some exploratory analysis to gain an overall understanding of the data's characteristics and distributions. Investigating missing values and correlations between independent variables (IVs) or dependent variables (DVs) prepares you for meaningful analyses or predictions in later steps.
    • Data Cleaning: Once you are familiar with the data, clean up discrepancies such as missing values or outliers by replacing them appropriately or removing them from the dataset if necessary (a hedged sketch follows after this list).
    • Feature Selection/Engineering: After cleansing the data, feature selection or engineering may be necessary if certain columns are redundant or not useful for the intended models or analyses (usually observed after exploratory analysis). Be mindful when deciding which features to remove so that no information about potentially important relationships is lost.
    • Model Building/Analysis: With the data pre-processed appropriately, move forward with developing the desired models and analyses over the newly transformed dataset.
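A minimal sketch of the cleaning steps suggested above, using pandas. The column name mirrors the raw headers shown in the Columns section; the filename, the '?' missing-value marker, and the IQR rule are illustrative assumptions.

```python
# Hedged sketch: basic exploratory checks, missing-value handling, and an IQR-based
# outlier screen on one continuous column of the credit approval data.
import pandas as pd

df = pd.read_csv("crx.data.csv", na_values="?")   # '?' as missing marker (assumption)

# Exploratory checks: missing values and basic distributions
print(df.isna().sum())
print(df.describe(include="all"))

# Drop rows with missing values (or impute, depending on the analysis)
df = df.dropna()

# Simple IQR rule to flag outliers in one continuous column (header as listed above)
q1, q3 = df["30.83"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["30.83"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```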

    Research Ideas

    • Developing predictive models to identify customers who are likely to default on their credit card payments.
    • Creating a risk analysis system that can identify customers who pose a higher risk for fraud or misuse of their credit cards.
    • Developing an automated lending decision system that can use the data points provided in the dataset (i.e., gender, average monthly balance, etc.) to decide whether or not to approve applications for new credit lines and loans

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    See the dataset description for more information.

    Columns

    File: crx.data.csv

    | Column name | Description |
    |:------------|:------------|
    | b | Gender (Categorical) |
    | 30.83 | Average Monthly Balance (Continuous) |
    | 0 | Number of Months Since Applicant's Last Delinquency (Continuous) |
    | w | Number of Months Since Applicant's Last Credit Card Approval (Continuous) |
    | 1.25 | Number of Months Since the Applicant's Last Balance Increase (Continuous) |
    | ... | ... |

  6. COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam

    • microdata.worldbank.org
    • catalog.ihsn.org
    • +1more
    Updated Oct 26, 2023
    + more versions
    Cite
    World Bank (2023). COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/4061
    Explore at:
    Dataset updated
    Oct 26, 2023
    Dataset authored and provided by
    World Bank (http://topics.nytimes.com/top/reference/timestopics/organizations/w/world_bank/index.html)
    Time period covered
    2020
    Area covered
    Vietnam
    Description

    Geographic coverage

    National, regional

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of the total communes in Vietnam). In each commune, one EA is randomly selected and then 15 households are randomly selected in each EA for interview. We use the large module to select the households for the official VHFPS interview, and the small-module households as a reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    The questionnaire for Round 2 consisted of the following sections:

    Section 2. Behavior
    Section 3. Health
    Section 5. Employment (main respondent)
    Section 6. Coping
    Section 7. Safety Nets
    Section 8. FIES

    Cleaning operations

    Data cleaning began during the data collection process. Inputs for the cleaning process include the interviewers' notes following each question item, the interviewers' notes at the end of the tablet form, and the supervisors' notes taken during monitoring. The data cleaning process was conducted in the following steps:
    • Append households interviewed in ethnic minority languages to the main dataset interviewed in Vietnamese.
    • Remove unnecessary variables which were automatically calculated by SurveyCTO.
    • Remove household duplicates in the dataset where the same form was submitted more than once.
    • Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
    • Format variables according to their object type (string, integer, decimal, etc.).
    • Read through the interviewers' notes and make adjustments accordingly. During interviews, whenever interviewers found it difficult to choose a correct code, they were advised to choose the most appropriate one and write down the respondent's answer in detail so that the survey management team could decide which code best suited that answer.
    • Correct data based on the supervisors' notes where enumerators entered a wrong code.
    • Recode the answer option "Other, please specify". This option is usually followed by a blank line allowing enumerators to type or write text to specify the answer. The data cleaning team checked this type of answer thoroughly to decide whether each answer needed recoding into one of the available categories or should be kept as originally recorded. In some cases, an answer could be assigned a completely new code if it appeared many times in the survey dataset.
    • Examine the accuracy of outlier values, defined as values that lie outside both the 5th and 95th percentiles, by listening to interview recordings (a small flagging sketch follows after this list).
    • Final check on matching the main dataset with the different sections, where information asked at the individual level is kept in separate data files and in long form.
    • Label variables using the full question text.
    • Label variable values where necessary.
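A minimal sketch of the percentile-based flagging rule mentioned above; it only reproduces the flagging step (the survey team then verified flagged cases against interview recordings), and the column name is a placeholder.

```python
# Hedged sketch: mark values lying outside the 5th and 95th percentiles for manual review.
import pandas as pd

def flag_percentile_outliers(df: pd.DataFrame, col: str) -> pd.Series:
    lo, hi = df[col].quantile([0.05, 0.95])
    return ~df[col].between(lo, hi)

# Example (placeholder column name):
# df["review_flag"] = flag_percentile_outliers(df, "food_expenditure")
```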

  7. Anomaly Detection in High-Dimensional Data

    • tandf.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Priyanga Dilini Talagala; Rob J. Hyndman; Kate Smith-Miles (2023). Anomaly Detection in High-Dimensional Data [Dataset]. http://doi.org/10.6084/m9.figshare.12844508.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Priyanga Dilini Talagala; Rob J. Hyndman; Kate Smith-Miles
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The HDoutliers algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it suffers from some limitations that significantly hinder its performance level, under certain circumstances. In this article, we propose an algorithm that addresses these limitations. We define an anomaly as an observation where its k-nearest neighbor distance with the maximum gap is significantly different from what we would expect if the distribution of k-nearest neighbors with the maximum gap is in the maximum domain of attraction of the Gumbel distribution. An approach based on extreme value theory is used for the anomalous threshold calculation. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithm, which we call the stray algorithm. We also demonstrate how this algorithm can assist in detecting anomalies present in other data structures using feature engineering. We show the situations where the stray algorithm outperforms the HDoutliers algorithm both in accuracy and computational time. This framework is implemented in the open source R package stray. Supplementary materials for this article are available online.
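A rough Python illustration of a k-nearest-neighbour distance score for ranking candidate anomalies; the actual stray algorithm is an R package that additionally uses the nearest-neighbour distance with the maximum gap and an extreme-value-theory threshold, which this sketch does not reproduce.

```python
# Hedged sketch: rank observations by their distance to the k-th nearest neighbour;
# large distances suggest candidate anomalies.
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_score(X, k=10):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)     # first column is the point itself (distance 0)
    return dist[:, -1]             # distance to the k-th other neighbour

# scores = knn_anomaly_score(X); the highest scores mark candidate anomalies.
```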

  8. Controlled Anomalies Time Series (CATS) Dataset

    • zenodo.org
    bin
    Updated Jul 12, 2024
    + more versions
    Cite
    Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646897
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 12, 2024
    Authors
    Patrick Fleith; Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables) including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are at a 1 Hz sampling frequency (a loading and splitting sketch follows after this list).
      • 1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.
      • 4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies to understand what anomaly types can be detected by different approaches.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" to detect for human eyes (i.e., there are very large spikes or oscillations), and hence detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable of detecting those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
    • Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage since users of the dataset can decide to add on top of the provided series any type of noise and choose an amplitude. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”
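A minimal sketch of the split described in the list above: the first 1 million nominal observations for learning "normal" behaviour and the remaining 4 million for evaluation. The filename and column layout are assumptions, not the published file specification.

```python
# Hedged sketch: split the 5 million 1 Hz timestamps into a nominal-only training
# segment and an evaluation segment containing the 200 anomalous segments.
import pandas as pd

df = pd.read_csv("cats_dataset.csv")       # placeholder filename

train = df.iloc[:1_000_000]                # nominal-only segment (novelty detection)
evaluation = df.iloc[1_000_000:]           # contains nominal and anomalous segments

# The signals ship noise-free; users may add their own noise for robustness studies,
# e.g. evaluation_noisy = evaluation + np.random.normal(0, sigma, evaluation.shape)
```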

    About Solenix

    Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.

  9. Thermistor chain temperature data from the EMSO-Azores observatory,...

    • b2find.eudat.eu
    Updated Oct 22, 2023
    + more versions
    Cite
    (2023). Thermistor chain temperature data from the EMSO-Azores observatory, 2020-2021 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3628b9ef-4eee-5e4c-982d-f0f32b9ae00c
    Explore at:
    Dataset updated
    Oct 22, 2023
    Description

    This dataset contains temperature data acquired with a 70 m long thermistor chain (50 m equipped with sensors, 20 m of cable to the monitoring station), deployed on the EMSO-Azores observatory from September 2020 to May 2021. Thermometers are spaced 50 cm apart, for a total of 100 temperature points. The data consist of 70,963 measurements acquired every 15 minutes. The chain is connected to the SEAMON East environmental monitoring node and measures temperatures on different faunal assemblages and substrata on the active Tour Eiffel edifice at 1695 m depth. The raw file contains all data acquired by the sensors. The corrected file was processed to remove outliers, which included values below background seawater temperature (4°C) and values above 30°C that corresponded to simultaneous periods of high/homogeneous temperature values recorded on all sensors. Isolated outliers were then removed after screening of the graph. If extreme values lasted at least three consecutive time steps, they were kept. Removed values were replaced by NAs. Location of the Tour Eiffel edifice: N 37°17.33, W 32°16.53.
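A minimal sketch approximating part of the screening described above: values below 4 °C or above 30 °C are treated as outliers unless they persist for at least three consecutive time steps, and removed values become NA. Thresholds and column names are taken from the description; the implementation itself is an assumption.

```python
# Hedged sketch: remove short runs of extreme temperatures, keep runs of >= 3 steps,
# and replace removed values with NaN.
import pandas as pd

def screen_temperatures(s: pd.Series, lo=4.0, hi=30.0, min_run=3) -> pd.Series:
    extreme = (s < lo) | (s > hi)
    run_id = (extreme != extreme.shift()).cumsum()        # id of each consecutive run
    run_len = extreme.groupby(run_id).transform("size")   # length of that run
    drop = extreme & (run_len < min_run)
    return s.mask(drop)                                    # removed values become NaN

# df["T_corrected"] = screen_temperatures(df["T_raw"])     # placeholder column names
```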

  10. N01 SUNA - Nitrate Observations Corrected

    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    • data.ioos.us
    • +3more
    Updated Jul 29, 2025
    + more versions
    Cite
    University of Maine School of Marine Sciences (Point of Contact) (2025). N01 SUNA - Nitrate Observations Corrected [Dataset]. https://res1catalogd-o-tdatad-o-tgov.vcapture.xyz/dataset/n01-suna-nitrate-observations-corrected1
    Explore at:
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    University of Maine School of Marine Sciences (Point of Contact)
    Description

    One SUNA V2 and two ISUS (Sea-Bird Scientific) were deployed on NERACOOS Buoy N (Latitude: 42 deg 19.48'N and Longitude: 65 deg 54.55'W) at 50m, 100m and 180m, respectively, from July 26, 2016 to July 15, 2017. This corresponds to University of Maine buoy deployment N0120 (http://gyre.umeoce.maine.edu/data/gomoos/buoy/html/N01.html). The sensors were programmed to collect nitrate measurements three times per day (0045, 0845 and 1645 GMT) over a fixed sampling window of 30 seconds at approximately 1.4 readings per second. Five measurements recorded in the middle of the 30 second data stream were transmitted back to the University of Maine Physical Oceanography Group via cell phone. The mean concentration and standard deviation were calculated for each specific sampling time. If the standard deviation for a mean was greater than 2uM, the values were eliminated. Surviving data were then quality controlled (QCed) by removing outliers assessed as values less than -0.5uM and greater than 30uM. At the time of buoy deployment, a standard CTD cast was conducted with a carrousel water sampler equipped with Niskin bottles. Ground truth water samples were collected at the same depth as the nitrate sensors. These samples were analyzed for nitrate concentrations at the University of Maine using a Bran Luebbe AA3 Autoanalyzer and standard techniques. Ground truth corrections were added or subtracted as an offset only if the difference between the nitrate concentration of the water sample collected and the mean nitrate sensor reading at approximately the same time was greater than 2 uM. Upon recovery of the instruments, the data was processed using the software provided by Sea-Bird (UCI and ISUS.com) and compared with the real-time measurements as a check for accuracy. In the event that the sensor was not able to transmit in real-time, the processed data was QCed. Please contact David Townsend, University of Maine, School of Marine Sciences if you have any questions.
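A minimal sketch of the QC rules described above: drop sampling windows whose standard deviation exceeds 2 uM, remove mean values outside the -0.5 to 30 uM range, and apply a ground-truth offset only when the bottle-sensor difference exceeds 2 uM. Column names and the bottle-matching step are placeholders and simplifications.

```python
# Hedged sketch of the nitrate QC and ground-truth correction steps.
import pandas as pd

def qc_nitrate(df: pd.DataFrame, bottle_uM: float, sensor_at_cast_uM: float) -> pd.DataFrame:
    out = df[(df["std_uM"] <= 2.0) & (df["mean_uM"].between(-0.5, 30.0))].copy()
    offset = bottle_uM - sensor_at_cast_uM
    if abs(offset) > 2.0:            # correct only if the bottle-sensor mismatch is large
        out["mean_uM"] = out["mean_uM"] + offset
    return out
```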

  11. High frequency dataset for event-scale concentration-discharge analysis in a...

    • search.dataone.org
    • hydroshare.org
    Updated Sep 21, 2024
    Cite
    Andreas Musolff (2024). High frequency dataset for event-scale concentration-discharge analysis in a forested headwater 01/2018-08/2023 [Dataset]. http://doi.org/10.4211/hs.9be43573ba754ec1b3650ce233fc99de
    Explore at:
    Dataset updated
    Sep 21, 2024
    Dataset provided by
    Hydroshare
    Authors
    Andreas Musolff
    Time period covered
    Jan 1, 2018 - Aug 23, 2023
    Area covered
    Description

    This composite repository contains high-frequency data of discharge, electrical conductivity, nitrate-N, DOC and water temperature obtained in the Rappbode headwater catchment in the Harz mountains, Germany. This catchment was affected by a bark-beetle infestation and forest dieback from 2018 onwards. The data extend previous observations from the same catchment (RB) published as part of Musolff (2020). Details on the catchment can be found in Werner et al. (2019, 2021) and Musolff et al. (2021). The file RB_HF_data_2018_2023.txt states measurements for each timestep using the following columns: "index" (number of observation), "Date.Time" (timestamp in YYYY-MM-DD HH:MM:SS), "WT" (water temperature in degree Celsius), "Q.smooth" (discharge in mm/d smoothed using a moving average), "NO3.smooth" (nitrate concentration in mg N/L smoothed using a moving average), "DOC.smooth" (dissolved organic carbon concentration in mg/L smoothed using a moving average), "EC.smooth" (electrical conductivity in µS/cm smoothed using a moving average); NA - no data.

    Water quality data and discharge was measured at a high-frequency interval of 15 min in the time period between January 2018 and August 2023. Both, NO3-N and DOC were measured using an in-situ UV-VIS probe (s::can spectrolyser, scan Austria). EC was measured using an in-situ probe (CTD Diver, Van Essen Canada). Discharge measurements relied on an established stage-discharge relationship based on water level observations (CTD Diver, Van Essen Canada, see Werner et al. [2019]). Data loggers were maintained every two weeks, including manual cleaning of the UV-VIS probes and grab sampling for subsequent lab analysis, calibration and validation.

    Data preparation included five steps: drift correction, outlier detection, gap filling, calibration and moving averaging:
    - Drift was corrected by distributing the offset between the mean values one hour before and after cleaning equally across the two-week maintenance interval as an exponential growth.
    - Outliers were detected with a two-step procedure. First, values outside a physically plausible range were removed. Second, the Grubbs test was applied to a moving window of 100 values to detect and remove outliers.
    - Data gaps smaller than two hours were filled using cubic spline interpolation.
    - The resulting time series were globally calibrated against the lab-measured concentrations of NO3-N and DOC. EC was calibrated against field values obtained with a handheld WTW probe (WTW Multi 430, Xylem Analytics Germany).
    - Noise in the signal of both discharge and water quality was reduced by a moving average with a window length of 2.5 hours (see the sketch below).
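A minimal sketch of two of these steps on a raw 15-minute series: spline interpolation of gaps shorter than two hours, then a 2.5-hour moving average. The filename and the raw column name "NO3_raw" are placeholders, not the published file layout.

```python
# Hedged sketch: gap filling (< 2 h) and 2.5 h moving-average smoothing of 15-min data.
import pandas as pd

df = pd.read_csv("RB_HF_raw.txt", sep="\t", parse_dates=["Date.Time"])

# gaps < 2 h: at 15-min resolution that is at most 8 consecutive missing samples
df["NO3_filled"] = df["NO3_raw"].interpolate(method="cubic", limit=8)

# 2.5 h moving average = 10 samples at 15-min resolution
df["NO3_smooth"] = df["NO3_filled"].rolling(window=10, center=True, min_periods=1).mean()
```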

    References:
    Musolff, A. (2020). High frequency dataset for event-scale concentration-discharge analysis. http://www.hydroshare.org/resource/27c93a3f4ee2467691a1671442e047b8
    Musolff, A., Zhan, Q., Dupas, R., Minaudo, C., Fleckenstein, J. H., Rode, M., Dehaspe, J., & Rinke, K. (2021). Spatial and Temporal Variability in Concentration-Discharge Relationships at the Event Scale. Water Resources Research, 57(10).
    Werner, B. J., Musolff, A., Lechtenfeld, O. J., de Rooij, G. H., Oosterwoud, M. R., & Fleckenstein, J. H. (2019). High-frequency measurements explain quantity and quality of dissolved organic carbon mobilization in a headwater catchment. Biogeosciences, 16(22), 4497-4516.
    Werner, B. J., Lechtenfeld, O. J., Musolff, A., de Rooij, G. H., Yang, J., Grundling, R., Werban, U., & Fleckenstein, J. H. (2021). Small-scale topography explains patterns and dynamics of dissolved organic carbon exports from the riparian zone of a temperate, forested catchment. Hydrology and Earth System Sciences, 25(12), 6067-6086.

  12. M01 SUNA - Nitrate Observations Corrected

    • data.ioos.us
    • s.cnmilf.com
    • +3more
    erddap +2
    Updated Sep 5, 2025
    + more versions
    Cite
    NERACOOS (2025). M01 SUNA - Nitrate Observations Corrected [Dataset]. https://data.ioos.us/dataset/m01-suna-nitrate-observations-corrected
    Explore at:
    Available download formats: opendap, erddap, erddap-tabledap
    Dataset updated
    Sep 5, 2025
    Dataset authored and provided by
    NERACOOS
    Description

    Various nitrate sensors including the SUNA V1, SUNA V2 and ISUS (Sea-Bird Scientific) were deployed on NERACOOS Buoy M (Latitude: 43° 29.44'N and Longitude: 67° 52.79'W) from June 23, 2013 to March 23, 2021. This time period corresponds to University of Maine buoy deployments M0124, M0125, M0126, M0127, M0129, and M130 (http://gyre.umeoce.maine.edu/data/gomoos/buoy/html/M01.html). These sensors were placed at the surface, 1m, 50m, 100m, 150m and 250m and programmed to collect nitrate measurements six times per day (0045, 0445, 0845, 1245, 1645 and 2045 GMT) over a fixed sampling window of 30 seconds at approximately 1.4 readings per second. The number of sensors and their depths varied for each deployment depending on the number of available sensors. Five measurements recorded in the middle of the 30 second data stream were transmitted back to the University of Maine Physical Oceanography Group via cell phone. The mean concentration and standard deviation were calculated for each specific sampling time. If the standard deviation for a mean was greater than 2µM, the values were eliminated. Surviving data were then quality controlled (QCed) by removing outliers assessed as values less than -0.5µM and greater than 30µM. Whenever possible at the time of buoy deployment and recovery, a standard CTD cast was conducted with a carousel water sampler equipped with Niskin bottles. Ground truth water samples were collected at the same depth as the nitrate sensors. These samples were analyzed for nitrate concentrations at the University of Maine using a Bran Luebbe AA3 Autoanalyzer and standard techniques. Ground truth corrections were added or subtracted as an offset only if the difference between the nitrate concentration of the water sample collected and the mean nitrate sensor reading at approximately the same time was greater than 2 µM. Upon recovery of the instruments, the data was processed using the software provided by Sea-Bird (UCI.com) and compared with the real-time measurements as a check for accuracy. In the event that the sensor was not able to transmit in real-time, the processed data was QCed. Please contact David Townsend, davidt@maine.edu, University of Maine, School of Marine Sciences if you have any questions.

  13. Malaria disease and grading system dataset from public hospitals reflecting...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Nov 10, 2023
    Cite
    Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie (2023). Malaria disease and grading system dataset from public hospitals reflecting complicated and uncomplicated conditions [Dataset]. http://doi.org/10.5061/dryad.4xgxd25gn
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 10, 2023
    Dataset provided by
    Nasarawa State University
    Authors
    Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector, making it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB) have been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation between attributes with the ability to predict new situations. The model (MNB) has 97% accuracy. It is concluded that this model outperforms the GNB classifier, which has 100% accuracy, and the RF, which also has 100% accuracy.

    Methods. Prior to data collection, the researcher was guided by ethical training certification on data collection and the rights to confidentiality and privacy overseen by the Institutional Review Board (IRB). Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed into electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked for the laboratory confirmation result from diagnosis. The data were divided into two tables: the first, data1, contains data for use in phase 1 of the classification, while the second, data2, contains data for use in phase 2 of the classification.

    Data Source Collection. The malaria incidence dataset was obtained from public hospitals from 2017 to 2021. These are the data used for modeling and analysis, taking into account the geographical location and socio-economic factors of the patients inhabiting those areas. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.

    Data Preprocessing. Preprocessing was done to remove noise and outliers.

    Transformation. The data were transformed from analog to electronic records.

    Data Partitioning. The collected data were divided into two portions: one portion was extracted as a training set, while the other was used for testing. The training portion taken from one database table is called training set 1, while the training portion taken from another table is called training set 2. For the purposes of this research, the dataset was split into a sample containing 70% of the data for training and 30% for testing. Then, using MNB classification algorithms implemented in Python, the models were trained on the training sample. The resulting models were tested on the remaining 30% of the data, and the results were compared with the other machine learning models using standard metrics.

    Classification and Prediction. Based on the nature of the variables in the dataset, this study uses Naïve Bayes (Multinomial) classification in two phases: classification phase 1 and classification phase 2. The operation of the framework is as follows: (i) data collection and preprocessing are carried out; (ii) preprocessed data are stored in training set 1 and training set 2, which are used during classification; (iii) the test dataset is stored in a test database; (iv) part of the test dataset is classified using classifier 1 and the remaining part is classified with classifier 2, as follows.
    Classifier phase 1: classifies records into positive or negative classes. If the patient has malaria, the patient is classified as positive (P); if the patient does not have malaria, the patient is classified as negative (N).
    Classifier phase 2: classifies only the records that classifier 1 labelled positive, further classifying them into complicated and uncomplicated class labels. The classifier also captures data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system is designed so that the core parameters, as determining factors, supply their values.
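A minimal sketch of the two-phase scheme described above, using scikit-learn's MultinomialNB with a 70/30 split. X1/y1 (malaria vs. no malaria) and X2/y2 (complicated vs. uncomplicated, positives only) are assumed to be prepared upstream as non-negative feature matrices; this is not the study's own code.

```python
# Hedged sketch: train and evaluate one Multinomial Naive Bayes classifier per phase.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

def train_phase(X, y, test_size=0.30, random_state=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    clf = MultinomialNB().fit(X_tr, y_tr)      # MNB expects non-negative features
    return clf, accuracy_score(y_te, clf.predict(X_te))

# clf1, acc1 = train_phase(X1, y1)   # phase 1: positive (P) vs. negative (N)
# clf2, acc2 = train_phase(X2, y2)   # phase 2: complicated vs. uncomplicated
```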

  14. 🌽 Crop Price Prediction in Senegal

    • kaggle.com
    Updated Jul 29, 2024
    Cite
    mexwell (2024). 🌽 Crop Price Prediction in Senegal [Dataset]. https://www.kaggle.com/datasets/mexwell/crop-price-prediction-in-senegal
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    Kaggle
    Authors
    mexwell
    Area covered
    Senegal
    Description

    Dataset Overview:

    This dataset contains information about crop prices in Senegal.

    Dataset Details:

    Here is a brief description of each header in the dataset:

    • date: This header represents the date on which the price of the crop was recorded.
    • cmname: This header represents the name of the crop whose price was recorded.
    • unit: This header represents the unit in which the crop was sold (e.g. kg, liter).
    • category: This header represents the category of the crop (e.g. cereals and tubers, vegetables, fruits).
    • price: This header represents the price of the crop in the local currency.
    • currency: This header represents the currency in which the price of the crop was recorded.
    • country: This header represents the country where the crop was sold.
    • admname: This header represents the name of the administrative region where the crop was sold.
    • adm1id: This header represents the administrative region's unique identifier.
    • mktname: This header represents the name of the market where the crop was sold.
    • mktid: This header represents the market's unique identifier.
    • cmid: This header represents the crop's unique identifier.
    • ptid: This header represents the price type's unique identifier.
    • umid: This header represents the unit of measure's unique identifier.
    • catid: This header represents the crop category's unique identifier.
    • sn: This header represents a unique identifier for the record.
    • default: This header represents the default price for the crop.

    Data Cleaning:

    Before using this dataset for analysis, you might need to clean the data by removing any duplicates, missing values, or outliers. You can also convert the date column to a datetime format for ease of analysis.
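A minimal sketch of these cleaning steps using the column names listed above; the filename and the 3-sigma outlier threshold are assumptions.

```python
# Hedged sketch: deduplicate, parse dates, drop missing values, and screen per-crop
# price outliers.
import pandas as pd

df = pd.read_csv("senegal_crop_prices.csv")      # placeholder filename

df = df.drop_duplicates()
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df = df.dropna(subset=["date", "price"])

# Simple per-crop outlier screen: drop prices more than 3 standard deviations away
# from that crop's mean price (crops with a single record are kept as-is).
grp = df.groupby("cmname")["price"]
z = (df["price"] - grp.transform("mean")) / grp.transform("std")
df = df[z.abs().fillna(0) <= 3]
```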

    Original Data

    Acknowledgement

    Photo by Dietmar Reichle on Unsplash

  15. Thunderstorm outflows in the Mediterranean Sea area

    • zenodo.org
    txt, zip
    Updated Apr 30, 2024
    Cite
    Federico Canepa; Federico Canepa; Massimiliano Burlando; Massimiliano Burlando; Maria Pia Repetto; Maria Pia Repetto (2024). Thunderstorm outflows in the Mediterranean Sea area [Dataset]. http://doi.org/10.5281/zenodo.10688746
    Explore at:
    Available download formats: txt, zip
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Federico Canepa; Federico Canepa; Massimiliano Burlando; Massimiliano Burlando; Maria Pia Repetto; Maria Pia Repetto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the context of the European projects “Wind and Ports” (grant No. B87E09000000007) and “Wind, Ports and Sea” (grant No. B82F13000100005), an extensive in-situ wind monitoring network was installed in the main ports of the Northern Mediterranean Sea. An unprecedented number of wind records has been acquired and systematically analyzed. Among these, a considerable number of records presented non-stationary and non-Gaussian characteristics that are completely different from those of synoptic extra-tropical cyclones, which are widely known in the atmospheric science and wind engineering communities. Cross-checking with meteorological information makes it possible to identify which of these events can be defined as thunderstorm winds, i.e., downbursts and gust fronts.

    The scientific literature of the last few decades has demonstrated that downbursts, and especially micro-bursts, are extremely dangerous for the natural and built environment. Furthermore, recent trends in climate change point to drastic future scenarios in terms of intensification and increased frequency of this type of extreme event. However, the limited spatial and temporal structure of thunderstorm outflows makes them still difficult to measure in nature and, consequently, to build physically reliable and easily applicable models for, as has been done for extra-tropical cyclones. For these reasons, the collection and publication of events of this type represents a unique opportunity for the scientific community.

    The dataset presented here was built in the context of the activities of the project THUNDERR “Detection, simulation, modelling and loading of thunderstorm outflows to design wind-safer and cost-efficient structures”, financed by the European Research Council (ERC), Advanced Grant 2016 (grant No. 741273, P.I. Prof. Giovanni Solari, University of Genoa). It collects 29 thunderstorm downbursts that occurred between 2010 and 2015 in the Italian ports of Genoa (GE) (4), Livorno (LI) (14), and La Spezia (SP) (11), and were recorded by means of ultrasonic anemometers (Gill WindObserver II in Genoa and La Spezia, Gill WindMaster Pro in Livorno). All thunderstorm events included in the database were verified by means of meteorological information, such as radar (the CIMA Research Foundation is gratefully acknowledged for providing most of the radar images), satellite, and lightning data. In fact, (i) high and localized clouds typical of thunderstorm cumulonimbus, (ii) precipitation, and (iii) lightning represent reliable indicators of the occurrence of a thunderstorm event.

    Some events were recorded by multiple anemometers in the same port area – the total number of signals included in the database is 99. Despite the limited number of points (anemometers), this will allow the user to perform cross-correlation analysis in time and space to eventually retrieve size, position, trajectory of the storm, etc.

    The ASCII tab-delimited file ‘Anemometers_location.txt’ reports specifications of the anemometers used in this monitoring study: port code (Port code – Genoa-GE, Livorno-LI, La Spezia-SP); anemometer code (Anemometer code); latitude (Lat.) and longitude (Lon.) in decimal degrees WGS84; height above ground level (h a.g.l.) in meters; Instrument type. Bi-axial anemometers were used in the ports of Genoa and La Spezia, recording the two horizontal wind speed components (u, v). Three-axial ultrasonic anemometers were used in the port of Livorno, also providing the vertical wind speed component w (except bi-axial anemometers LI06 and LI07). All anemometers acquired velocity data at a sampling frequency of 10 Hz and a sensitivity of 0.01 m s-1 (except anemometers LI06 and LI07, with a sensitivity of 0.1 m s-1) and were installed at various heights ranging from 13.0 to 75.0 m, as reported in the file ‘Anemometers_location.txt’.

    The ASCII tab-delimited file ‘List_DBevents.txt’ lists all downburst records included in the database, in terms of: event and record number (Event | record no.); port code (Port code); date of event occurrence (Date) in the format yyyy-mm-dd; approximate time of occurrence of the velocity peak (Time [UTC]) in the format HH:MM; anemometer code (Anemometer code).

    The database is presented as a zip file (‘DB-records.zip’). The events are organized by port of occurrence (three folders GE, LI, and SP). Within each folder, the downburst events recorded in that specific port are reported as subfolders (name format ‘[port code]_yyyy-mm-dd’), which contain the individual anemometer signals as TAB-delimited text files (name format ‘[port and anemometer code]_yyyy-mm-dd.txt’). Each sub-dataset (file) contains 3 (or 4) columns and 360,000 rows. The first column shows the 10-h time vector (t, ISO format) in UTC, while the remaining 2 (or 3) columns report the 10-h time series of 10-Hz instantaneous horizontal (zonal west-to-east u, meridional south-to-north v) and, where available, vertical (positive upward w) wind speed components, centred around the time of maximum horizontal wind speed (vectorial sum of u and v). Representing the wind speed over a long time interval (10 hours) allows the user to perform a more comprehensive and detailed analysis of the event by also taking into account the wind conditions before and after the onset of the downburst. 'Not-a-Number' (‘NaN’) values are reported in the wind velocity signals when the instrument did not record valid data. Some wind speed records show noise in discrete intervals of the signal, which is reflected in an increase of the wind speed standard deviation. A modified Hampel filter was employed to remove measurement outliers. For each wind speed signal, each data sample was considered in sequence, along with its ten adjacent samples (five on each side). This technique calculated the median and standard deviation within the sampling window using the median absolute deviation. Elements deviating from the median by more than six standard deviations were identified and replaced with 'NaN'. The tuning of the filter parameters involved finding a balance between overly aggressive and insufficient removal of outliers. Residual outliers were subsequently removed manually through meticulous qualitative inspection. The complexity and subjectivity of this operation provide users with the opportunity to explore alternative approaches. Consequently, the published dataset includes two versions: an initial version (v1) comprising the original raw data with no filtering applied, and a second "cleaned" version (v2).
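    The outlier-removal step described above can be approximated in a few lines of code. The following Python sketch loads one bi-axial record and applies a Hampel-type rule with an 11-sample window and a six-sigma threshold; the file path is hypothetical, three-axial records have an additional w column, and the rolling-MAD shortcut shown here is an illustration of the technique, not the exact implementation used to produce version v2 of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical path following the naming convention '[port and anemometer code]_yyyy-mm-dd.txt'.
rec = pd.read_csv("GE/GE_2011-10-25/GE01_2011-10-25.txt", sep="\t",
                  names=["t", "u", "v"], parse_dates=["t"])

def hampel_nan(x, half_window=5, n_sigmas=6.0):
    """Replace outliers with NaN using a moving-median (Hampel-type) rule.

    A sample is flagged when it deviates from the centred rolling median by
    more than n_sigmas times a MAD-based standard deviation estimate.
    """
    x = pd.Series(x, dtype=float)
    window = 2 * half_window + 1
    med = x.rolling(window, center=True, min_periods=1).median()
    # Rolling median of absolute deviations: an approximation of the window MAD.
    mad = (x - med).abs().rolling(window, center=True, min_periods=1).median()
    sigma = 1.4826 * mad  # MAD-to-standard-deviation scale factor for Gaussian data
    out = x.copy()
    out[(x - med).abs() > n_sigmas * sigma] = np.nan
    return out

for comp in ["u", "v"]:          # add "w" for three-axial records
    rec[comp] = hampel_nan(rec[comp])
```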

    The presented database can be further used by researchers to validate and calibrate experimental and numerical simulations, as well as analytical models, of downburst winds. It will also be an important resource for the scientific community working in wind engineering, meteorology, and atmospheric sciences, as well as for those involved in risk management and the reduction of losses related to thunderstorm events (e.g., insurance companies).

  16. Superstore Sales Analysis

    • kaggle.com
    Updated Oct 21, 2023
    Cite
    Ali Reda Elblgihy (2023). Superstore Sales Analysis [Dataset]. https://www.kaggle.com/datasets/aliredaelblgihy/superstore-sales-analysis
    Explore at:
    Croissant – a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 21, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ali Reda Elblgihy
    Description

    Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:

    1- Data Import and Transformation:

    • Gather and import relevant sales data from various sources into Excel.
    • Utilize Power Query to clean, transform, and structure the data for analysis.
    • Merge and link different data sheets to create a cohesive dataset, ensuring that all data fields are connected logically.

    2- Data Quality Assessment:

    • Perform data quality checks to identify and address issues like missing values, duplicates, outliers, and data inconsistencies.
    • Standardize data formats and ensure that all data is in a consistent, usable state.

    3- Calculating COGS:

    • Determine the Cost of Goods Sold (COGS) for each product sold by considering factors like purchase price, shipping costs, and any additional expenses.
    • Apply appropriate formulas and calculations to determine COGS accurately (a brief worked sketch of steps 3–5 follows this list).

    4- Discount Analysis:

    • Analyze the discount values offered on products to understand their impact on sales and profitability.
    • Calculate the average discount percentage, identify trends, and visualize the data using charts or graphs.

    5- Sales Metrics:

    • Calculate and analyze various sales metrics, such as total revenue, profit margins, and sales growth.
    • Utilize Excel functions to compute these metrics and create visuals for better insights.

    6- Visualization:

    • Create visualizations, such as charts, graphs, and pivot tables, to present the data in an understandable and actionable format.
    • Visual representations can help identify trends, outliers, and patterns in the data.

    7- Report Generation:

    • Compile the findings and insights into a well-structured report or dashboard, making it easy for stakeholders to understand and make informed decisions.
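    To make steps 3–5 concrete, here is a minimal pandas sketch of the same calculations outside Excel; all column names (Quantity, Unit Cost, Shipping Cost per Unit, Sales, Discount) are hypothetical placeholders, and the COGS and profit formulas are illustrative rather than the project's exact definitions.

```python
import pandas as pd

orders = pd.read_csv("orders.csv")  # hypothetical export of the orders sheet

# Step 3: a simple COGS estimate (purchase price plus per-unit shipping).
orders["COGS"] = orders["Quantity"] * (orders["Unit Cost"] + orders["Shipping Cost per Unit"])

# Step 4: discount value per order (Discount assumed to be a fraction of Sales)
# and the average discount rate.
orders["Discount Value"] = orders["Sales"] * orders["Discount"]
avg_discount_pct = 100 * orders["Discount"].mean()

# Step 5: headline sales metrics.
total_revenue = orders["Sales"].sum()
total_profit = (orders["Sales"] - orders["COGS"] - orders["Discount Value"]).sum()
profit_margin_pct = 100 * total_profit / total_revenue

print(f"Average discount: {avg_discount_pct:.1f}%, "
      f"revenue: {total_revenue:,.0f}, profit margin: {profit_margin_pct:.1f}%")
```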

    Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.

  17. [X/Fe] scatter derived for spectral lines - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Oct 23, 2023
    Cite
    (2023). [X/Fe] scatter derived for spectral lines - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/f306f5e5-7919-5e54-9deb-76b71d72b54e
    Explore at:
    Dataset updated
    Oct 23, 2023
    Description

    The main goal of this work is to explore which elements carry the most information about the birth origin of stars and, as such, which are best suited for chemical tagging. We explored different techniques to minimize the effect of outlier lines on the derived abundances, using Ni abundances derived for 1111 FGK-type stars. We evaluate how a limited number of spectral lines can affect the final chemical abundance. We then make an even-footing comparison of the [X/Fe] scatter between elements that have different numbers of observable spectral lines in the studied spectra. When several spectral lines are available, we find that the most efficient way of calculating the average abundance of an element is to use a weighted mean (WM), whereby we consider the distance from the median abundance as a weight. This method can be used effectively without removing suspected outlier lines. When the same number of lines is used to determine chemical abundances, we show that the [X/Fe] star-to-star scatter for iron-group and alpha-capture elements is almost the same. The largest scatter among the studied elements was observed for Al and the smallest for Cr and Ni. We recommend caution when comparing [X/Fe] scatter among elements for which different numbers of spectral lines are available. A meaningful comparison is necessary to identify the elements that show the largest intrinsic scatter, which can then be used for chemical tagging.
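    The weighted-mean idea can be sketched as follows; the inverse-distance weighting and the eps smoothing term are one plausible reading of the description above, not necessarily the exact scheme used in the paper.

```python
import numpy as np

def weighted_mean_abundance(line_abundances, eps=0.01):
    """Weighted mean of line-by-line abundances.

    Weights are taken inversely proportional to each line's distance from the
    median abundance, so discrepant lines are down-weighted without being
    removed; eps avoids division by zero for lines at the median and controls
    how sharply outliers are suppressed.
    """
    x = np.asarray(line_abundances, dtype=float)
    d = np.abs(x - np.median(x))
    w = 1.0 / (d + eps)
    return np.sum(w * x) / np.sum(w)

# Example: five hypothetical Ni line abundances, one of them discrepant.
print(weighted_mean_abundance([6.21, 6.23, 6.22, 6.45, 6.20]))
```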

  18. DataSheet1_Use ggbreak to Effectively Utilize Plotting Space to Deal With Large Datasets and Outliers

    • frontiersin.figshare.com
    pdf
    Updated Jun 6, 2023
    Cite
    Shuangbin Xu; Meijun Chen; Tingze Feng; Li Zhan; Lang Zhou; Guangchuang Yu (2023). DataSheet1_Use ggbreak to Effectively Utilize Plotting Space to Deal With Large Datasets and Outliers.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.774846.s001
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Shuangbin Xu; Meijun Chen; Tingze Feng; Li Zhan; Lang Zhou; Guangchuang Yu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the rapid increase of large-scale datasets, biomedical data visualization is facing challenges. The data may be large, span different orders of magnitude, contain extreme values, or have an unclear distribution. Here we present an R package, ggbreak, that allows users to create broken axes using ggplot2 syntax. It can effectively use the plotting area to deal with large datasets (especially long sequential data), data with different magnitudes, and data that contain outliers. The ggbreak package increases the available visual space for a better presentation of the data and detailed annotation, thus improving our ability to interpret the data. The ggbreak package is fully compatible with ggplot2, and it is easy to superpose additional layers and to apply scales and themes to adjust the plot using ggplot2 syntax. The ggbreak package is open-source software released under the Artistic-2.0 license, and it is freely available on CRAN (https://CRAN.R-project.org/package=ggbreak) and GitHub (https://github.com/YuLab-SMU/ggbreak).

  19. Data from: Interplay of physical and social drivers of movement in male African savanna elephants

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Nov 26, 2024
    Cite
    Maggie Wisniewska; Caitlin E. O'Connell-Rodwell; J. Werner Kilian; Simon Garnier; Gareth J. Russell (2024). Interplay of physical and social drivers of movement in male African savanna elephants [Dataset]. http://doi.org/10.5061/dryad.4qrfj6qm3
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 26, 2024
    Dataset provided by
    New Jersey Institute of Technology
    Harvard University
    Etosha National Park
    Authors
    Maggie Wisniewska; Caitlin E. O'Connell-Rodwell; J. Werner Kilian; Simon Garnier; Gareth J. Russell
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Africa
    Description

    Despite extensive research into the behavioral ecology of free-ranging animal groups, questions remain about how group members integrate information about their physical and social surroundings. This is because a) tracking of multiple group members is limited to a few easily manageable species; and b) the tools to simultaneously quantify physical and social influences on an individual’s movement remain challenging to develop, especially across large geographic scales. A relevant example of a widely ranging species with a complex social structure and of conservation concern is the African savanna elephant. We evaluate highly synchronized GPS tracks from five male elephants in Etosha National Park in Namibia by incorporating their dynamic social landscape into an established resource selection model. The fitted model predicts movement patterns based simultaneously on the physical landscape (e.g., repeated visitation of waterholes) and the social landscape (e.g., avoidance of a dominant male). Combining the fitted models for multiple focal individuals produces landscape-dependent social networks that vary over space (e.g., with distance from a waterhole) and time (e.g., as the seasons change). The networks, especially around waterholes, are consistent with dominance patterns determined from previous behavioral studies. Models that combine physical landscape and social effects based on remote tracking can augment traditional methods for determining social structure from intensive behavioral observations. More broadly, these models will be essential to effective, in-situ conservation and management of wide-ranging social species in the face of anthropogenic disruptions to their physical surroundings and social connections.

    Methods

    Study subjects and the social landscape: The five individuals considered in this study belong to a large elephant subpopulation residing in the northeastern region of Etosha National Park, Namibia. As part of a different research effort, these individuals were classified into several age, dominance, social, and reproductive categories (O’Connell-Rodwell et al. 2011; O’Connell et al. 2024a). The age structure in this population was determined on the basis of several morphological features and can be found in the original publication (O’Connell-Rodwell et al. 2011; O’Connell et al. 2022). The dominance categories are reported from a population-level, ordinal dominance hierarchy based on the frequency of agonistic dyadic interactions (i.e., displacement) observed during all-occurrence sampling over multiple field seasons (O’Connell-Rodwell et al. 2024a). The social categories were approximated using social network analysis (i.e., eigenvector centrality—an index expressing how influential an individual is based on the frequency of associating with other influential conspecifics) (O’Connell-Rodwell et al. 2024a; O’Connell-Rodwell et al. 2024b). The reproductive category expresses whether an elephant was in musth at the time of behavioral data collection.

    Tracking data: In September 2009, ENP personnel fitted five elephants with Global Positioning System (GPS) and satellite Global System for Mobile Communication (GSM) tracking devices. The trackers recorded positional data (i.e., longitude, latitude) every 15 minutes over approximately 24 months. Prior to analysis, we converted the tracking data to Cartesian units (i.e., meters) using the Universal Transverse Mercator (UTM) projection.
    We also filtered the data to remove outlier movements as follows: we kept only movements (pairs of GPS fixes) in which 1) the interval was 15 minutes, 2) the focal individual moved ≤ 300 m in that time, and 3) all four of the other tracked elephants were within 20 km. Criterion 1 eliminates missed fixes; criterion 2 eliminates a small number of unusually fast movements, which could represent startle responses to rare stimuli; and criterion 3 ensures that there is at least the potential for social interactions between all five elephants. The resulting datasets (one for each focal individual) had between 27,397 and 30,584 movements.

    The physical landscape: To evaluate the tracking data in the context of the physical landscape, we constructed a map of vegetation productivity using data from the 16-day 250 m Normalized Difference Vegetation Index (NDVI) MODIS imagery. We also created a map of the perennial waterholes by extracting coordinate information from existing geospatial records generated by ENP personnel. Finally, we compiled a map of ‘frequently visited areas’ (FVAs) as the centroids of the top 20 clusters of large turning angles (>90 degrees) in the movement data. These locations broadly correlate with the presence of shade and proximity to fruiting trees (Kilian, W., personal communication), which in other populations affect elephant movement.

    The Social Resource Selection Function (SRSF) model: Our approach extends the existing Resource Selection Function (RSF) framework, in which an individual’s location, when fixed (by a GPS device or other tracking methodology), is considered a choice made from a set of possible locations. This set of locations is bounded in space by how far from its previously known location the individual could reasonably be expected to move in the time between the two fixes. The relative probability of ending up at different destinations, relative to one’s current location, is modeled using conditional logistic regression (CLR) as a function of various environmental parameters that differ between locations (e.g., ‘vegetation density’, ‘distance to water’, distance to the previous location). The SRSF model adds to the RSF framework by considering the locations of other individuals in a moving group as time-varying point features of the landscape. One individual (the focal individual) is modeled, and the locations of the others (nonfocal individuals) are incorporated as ‘distance to neighbor’ values that can be calculated for all the possible locations in the CLR. Assuming that each elephant responds differently to different conspecifics, we calculate a set of social predictors by determining the distance to each neighbor separately. For any given movement m, the ‘choice’ is a binary response, where a potential location i is either the endpoint at which the individual was recorded (yi = 1) or one of the alternatives (yi = 0). For convenience, we have labeled the chosen location with the subscript j (j ∈ i). The probability of a movement is modeled by

    pm = s/c + (1 − s) exp(Xj β) / Σ_{i=1}^{c} exp(Xi β),

    where X is a matrix of k predictors derived from the landscape data (Xi denotes its row for location i); β is a k by 1 matrix of parameters to be estimated; c is the total number of locations considered (1 being the actual endpoint and c – 1 being randomly sampled within a circle of fixed radius); and s is the probability of a stochastic, ‘non-choice-type movement’ for which the endpoint is independent of any of the included predictors. One example might be a sudden scare that causes a flight response.
In this case, we assign all possible endpoints the same probability 1/c. Including the possibility of non-choice movements is a novel addition to the standard CLR model; we found that for these data it stabilized the parameter estimates (meaning that we obtained similar results with different random subsets of the data when it was included, and disparate estimates when it was not). Overall, pm is the predicted ‘preference value’ for the chosen location divided by (and therefore conditional on) the sum of the preference values for the random sample of possible locations. In practice, depending on the resolution of the landscape and the boundary of possible distances reached, the denominator could include hundreds or even thousands of random locations. This can make computation of the expression, which is repeated for every movement in a dataset, time-consuming—a challenge that then translates into the model fitting. It is thus standard practice to randomly select a fixed number of non-chosen alternative locations on the assumption that they will comprise a representative sample of the landscape variation available to the individual. Given that our landscape features — various distance measures plus an interpolated array of NDVI values — vary smoothly and continuously within our sampling radius, we used 30 random locations (so c = 31). We fit the CLR by maximizing L, the log-likelihood of the entire set of n movements, using quasi-Newton nonlinear maximization. We performed variable selection by first fitting models with all possible subsets of physical and social landscape variables in their quadratic forms, except for distance to the previous location, which was always included as a linear function as an established proxy for the effort required to move to a new location. We ranked the models using Akaike’s Information Criterion (AIC) and calculated importance scores for each variable as the cumulative Akaike weight of the models in which it appeared. Interpretation of the SRSF model outputs depends on the functional form of each variable over the range of its values and its importance score. Because a linear cost-of-movement function is in every model by design, we exclude it from further reporting and discussion. The functional forms of the remaining variables can be divided into five categories: 1) monotonically increasing or 2) decreasing (indicating a preference for large or smaller values of the variable in question); 3) convex with the maximum within the data range (a preference for intermediate values); 4) concave with the minimum within the data range (a preference for large and small values indicating a back-and-forth movement between the locations containing the variable in question); or 5) constant over the data range (lack of preference for a specific value) (Mashintonio et al. 2014). The SRSF model outputs are expressed as the relative preference for movement towards locations defined by the
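    As an illustration of the movement-filtering criteria described above (15-minute interval, step length ≤ 300 m, all other elephants within 20 km), here is a hedged pandas sketch; the input layout, with UTM coordinates x and y and one 'dist_to_*' column per neighbour in metres, is an assumption rather than the authors' actual file format.

```python
import pandas as pd

# Hypothetical table of GPS fixes: one row per fix, columns elephant_id, t, x, y,
# plus precomputed neighbour distances in metres (dist_to_*).
fixes = pd.read_csv("tracks_utm.csv", parse_dates=["t"])
fixes = fixes.sort_values(["elephant_id", "t"])

def filter_movements(df, max_step_m=300.0, max_neighbour_km=20.0):
    """Keep only 15-minute steps shorter than max_step_m with all neighbours within range."""
    g = df.groupby("elephant_id", group_keys=False)
    dt = g["t"].diff().dt.total_seconds()                         # criterion 1: 15-min interval
    step = (g["x"].diff() ** 2 + g["y"].diff() ** 2) ** 0.5       # criterion 2: step length
    neigh_cols = [c for c in df.columns if c.startswith("dist_to_")]
    neighbours_ok = (df[neigh_cols] <= max_neighbour_km * 1000).all(axis=1)  # criterion 3
    keep = (dt == 15 * 60) & (step <= max_step_m) & neighbours_ok
    return df[keep]

movements = filter_movements(fixes)
```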

  20. Historical Operation Data of 256 Reservoirs in Contiguous United States

    • hydroshare.org
    • dataone.org
    zip
    Updated Jan 29, 2025
    Cite
    Yanan Chen; Ximing Cai; Donghui Li (2025). Historical Operation Data of 256 Reservoirs in Contiguous United States [Dataset]. https://www.hydroshare.org/resource/092720588e2e4524bf2674235ff69d81
    Explore at:
    zip(218.0 MB)Available download formats
    Dataset updated
    Jan 29, 2025
    Dataset provided by
    HydroShare
    Authors
    Yanan Chen; Ximing Cai; Donghui Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reservoir operations face persistent challenges due to increasing water demand, more frequent extreme events, and stricter environmental requirements. Historical operation records are crucial to investigating real-world reservoir operations, which integrate prescribed operation rules, empirical knowledge of operators, and regulatory response to extreme events. This dataset offers processed daily operation records—including inflow, outflow, and storage—for 256 major reservoirs across the Contiguous United States (CONUS) from 1990 to 2019. The reservoirs were selected from the dataset of Li et al. (2023), which includes 452 reservoirs, based on two criteria: (1) a minimum of 25 years of records (starting in 1990 and ending in 2014 or later), and (2) less than 10% missing data during the study period. To enhance data quality, we removed storage outliers (values showing abnormal, sudden storage changes even when inflows remained stable) and used linear interpolation to fill missing values, resulting in continuous daily records. Additionally, daily water surface elevation data are included for 217 of the 256 reservoirs. Related findings on changes in reservoir storage and operations are published in Chen and Cai (2025, Water Resources Research).
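    A rough sketch of the described cleaning rule is shown below; the column names, the 30-day rolling window, and the five-sigma jump threshold are assumptions made for illustration, since the exact criteria used by the authors are not specified here.

```python
import numpy as np
import pandas as pd

# Hypothetical daily record for one reservoir with columns date, inflow, outflow, storage.
df = pd.read_csv("reservoir_daily.csv", parse_dates=["date"]).set_index("date")

storage_jump = df["storage"].diff().abs()
# Treat inflow as "stable" when its day-to-day change stays within one 30-day rolling std.
inflow_stable = df["inflow"].diff().abs() <= df["inflow"].rolling(30).std()

# Flag days with an abnormal storage change despite stable inflow and blank them out.
suspect = (storage_jump > 5 * storage_jump.rolling(30).std()) & inflow_stable
df.loc[suspect, "storage"] = np.nan

# Fill the blanked values (and any native gaps) by linear interpolation in time.
df["storage"] = df["storage"].interpolate(method="time")
```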

