9 datasets found

A Comparison of Four Methods for the Analysis of N-of-1 Trials
figshare.com
doc
Updated Jun 2, 2023
Cite
Xinlin Chen; Pingyan Chen (2023). A Comparison of Four Methods for the Analysis of N-of-1 Trials [Dataset]. http://doi.org/10.1371/journal.pone.0087752
Explore at:
Available download formats: doc
Unique identifier
https://doi.org/10.1371/journal.pone.0087752
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS ONE
Authors
Xinlin Chen; Pingyan Chen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Objective: To provide practical guidance for the analysis of N-of-1 trials by comparing four commonly used models.

Methods: The four models (paired t-test, mixed effects model of difference, mixed effects model, and meta-analysis of summary data) were compared using a simulation study. The assumed 3-cycle and 4-cycle N-of-1 trials were set with sample sizes of 1, 3, 5, 10, 20 and 30 respectively, under the assumption of normally distributed data. The data were generated based on a variance-covariance matrix under the assumption of (i) a compound symmetry structure or a first-order autoregressive structure, and (ii) no carryover effect or a 20% carryover effect. Type I error, power, bias (mean error), and mean square error (MSE) of effect differences between two groups were used to evaluate the performance of the four models.

Results: The results from the 3-cycle and 4-cycle N-of-1 trials were comparable with respect to type I error, power, bias and MSE. The paired t-test yielded type I error near the nominal level, higher power, comparable bias and small MSE, whether or not there was a carryover effect. Compared with the paired t-test, the mixed effects model produced a similar type I error and smaller bias, but lower power and larger MSE. The mixed effects model of difference and the meta-analysis of summary data yielded type I error far from the nominal level, low power, and large bias and MSE, irrespective of the presence or absence of a carryover effect.

Conclusion: We recommend the paired t-test for normally distributed data from N-of-1 trials because of its optimal statistical performance. In the presence of carryover effects, the mixed effects model could be used as an alternative.
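As a rough illustration of the paired t-test that the description recommends, the sketch below computes the test statistic from cycle-wise treatment and control measurements. The function name and the data values are hypothetical and are not taken from the dataset; it assumes one treatment and one control measurement per cycle.

```python
import math
import statistics

def paired_t_statistic(treatment, control):
    """Return (t, degrees_of_freedom) for paired observations."""
    if len(treatment) != len(control):
        raise ValueError("paired samples must have equal length")
    diffs = [t - c for t, c in zip(treatment, control)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("differences have zero variance")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1

# Hypothetical data: one patient observed over three cycles.
treatment = [6.0, 7.0, 8.0]
control = [5.0, 5.0, 5.0]
t_stat, df = paired_t_statistic(treatment, control)
```

The statistic would then be compared against a t distribution with `df` degrees of freedom; that lookup is left out to keep the sketch dependency-free.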
Questionnaire data on land use change of Industrial Heritage: Insights from...
data.mendeley.com
Updated Jul 20, 2023
Cite
Arsalan Karimi (2023). Questionnaire data on land use change of Industrial Heritage: Insights from Decision-Makers in Shiraz, Iran [Dataset]. http://doi.org/10.17632/gk3z8gp7cp.2
Unique identifier
https://doi.org/10.17632/gk3z8gp7cp.2
Dataset updated
Jul 20, 2023
Authors
Arsalan Karimi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Shiraz, Iran
Description
The survey dataset for identifying a new use for the old silo in Shiraz includes four components:

1. The survey instrument used to collect the data, “SurveyInstrument_table.pdf”. The survey instrument contains 18 main closed-ended questions in a table format. Two of these, which follow a short introduction to the questionnaire, concern information on the silo’s decision-makers and the proposed new use. The other 16 (each of which can identify 3 variables) relate to the level of opinion on the ideal intervention in the façade, openings, materials and floor heights of the building, against four values: Feasibility, Reversibility, Compatibility and Social Benefits.

2. The raw survey data, “SurveyData.rar”. This file contains an Excel .xlsx file and an SPSS .sav file. The survey data file contains 50 variables (12 for each of the four values, separated by colour) and data from each of the 632 respondents. Answering each question in the survey was mandatory, therefore there are no blanks or non-responses in the dataset. In the .sav file, all variables were assigned a numeric type and a nominal measurement level. More details about each variable can be found in the Variable View tab of this file. Additional variables were created by grouping or consolidating categories within each survey question for simpler analysis. These variables are listed in the last columns of the .xlsx file.

3. The analysed survey data, “AnalysedData.rar”. This file contains 6 “SPSS Statistics Output Documents”, which show statistical tests and analyses such as mean, correlation, automatic linear regression, reliability, frequencies, and descriptives.

4. The codebook, “Codebook.rar”. The detailed SPSS “Codebook.pdf”, alongside the simplified codebook “VariableInformation_table.pdf”, provides a comprehensive guide to all 50 variables in the survey data, including numerical codes for survey questions and response options. They serve as valuable resources for understanding the dataset, presenting dictionary information, and providing descriptive statistics, such as counts and percentages for categorical variables.
Data from: Examining Chi-Square Test Statistics Under Conditions of Large...
tandf.figshare.com
xlsx
Updated May 30, 2023
Cite
Dexin Shi; Christine DiStefano; Heather L. McDaniel; Zhehan Jiang (2023). Examining Chi-Square Test Statistics Under Conditions of Large Model Size and Ordinal Data [Dataset]. http://doi.org/10.6084/m9.figshare.6070703.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.6070703.v1
Dataset updated
May 30, 2023
Dataset provided by
Taylor & Francis
Authors
Dexin Shi; Christine DiStefano; Heather L. McDaniel; Zhehan Jiang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This study examined the effect of model size on the chi-square test statistics obtained from ordinal factor analysis models. The performance of six robust chi-square test statistics was compared across various conditions, including number of observed variables (p), number of factors, sample size, model (mis)specification, number of categories, and threshold distribution. Results showed that the unweighted least squares (ULS) robust chi-square statistics generally outperform the diagonally weighted least squares (DWLS) robust chi-square statistics. The ULSM estimator performed the best overall. However, when fitting ordinal factor analysis models with a large number of observed variables and a small sample size, the ULSM-based chi-square tests may yield empirical variances that are noticeably larger than the theoretical values and inflated Type I error rates. On the other hand, when the number of observed variables is very large, the mean- and variance-corrected chi-square test statistics (e.g., based on ULSMV and WLSMV) could produce empirical variances conspicuously smaller than the theoretical values and Type I error rates lower than the nominal level, and demonstrate lower power to reject misspecified models. Recommendations for applied researchers and future empirical studies involving large models are provided.
Controlled Anomalies Time Series (CATS) Dataset
zenodo.org
bin
Updated Jul 12, 2024
Cite
Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646897
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.7646897
Dataset updated
Jul 12, 2024
Authors
Patrick Fleith; Patrick Fleith
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system including:

4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.

3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.

10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.

5 million timestamps. Sensor readings are sampled at 1 Hz.

1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.

4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).

200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.

Different types of anomalies to understand what anomaly types can be detected by different approaches.

Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.

Obvious anomalies. The simulated anomalies have been designed to be easy for the human eye to detect (i.e., there are very large spikes or oscillations), and hence also detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable of detecting those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.

Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.

Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage since users of the dataset can decide to add on top of the provided series any type of noise and choose an amplitude. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise.

No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.

[1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”
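The nominal/mixed split and the noise-robustness idea described in the list above can be sketched as follows. This is only an illustration on a tiny made-up series: the function names, the stand-in channel, and the noise amplitude are assumptions, and the real dataset's file layout is not used here.

```python
import random

def split_nominal(rows, n_nominal):
    """Return (training, evaluation): the first n_nominal rows, then the rest."""
    if not 0 <= n_nominal <= len(rows):
        raise ValueError("n_nominal must lie between 0 and len(rows)")
    return rows[:n_nominal], rows[n_nominal:]

def add_gaussian_noise(values, amplitude, seed=0):
    """Return a copy of values with zero-mean Gaussian noise (std = amplitude)."""
    rng = random.Random(seed)  # seeded so that experiments are repeatable
    return [v + rng.gauss(0.0, amplitude) for v in values]

# Hypothetical stand-in for one telemetry channel: 10 samples instead of
# 5 million, keeping the 1:4 ratio of nominal-only rows to mixed rows.
channel = [float(i % 3) for i in range(10)]
train, evaluation = split_nominal(channel, n_nominal=2)
noisy_evaluation = add_gaussian_noise(evaluation, amplitude=0.1, seed=42)
```

With the full dataset, `n_nominal` would be 1,000,000, and the same detector could be scored on `evaluation` and on noisy copies at several amplitudes.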

About Solenix

Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
Statistical Area 2 2025 Clipped
datafinder.stats.govt.nz
csv, dwg, geodatabase +6
Updated Aug 8, 2025
Cite
Stats NZ (2025). Statistical Area 2 2025 Clipped [Dataset]. https://datafinder.stats.govt.nz/layer/120969-statistical-area-2-2025-clipped/
Explore at:
Available download formats: pdf, csv, geopackage / sqlite, kml, geodatabase, mapinfo tab, dwg, mapinfo mif, shapefile
Dataset updated
Aug 8, 2025
Dataset provided by
Statistics New Zealand (http://www.stats.govt.nz/)
Authors
Stats NZ
License
https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Description
Refer to the 'Current Geographic Boundaries Table' layer for a list of all current geographies and recent updates.

This dataset is the definitive version of the annually released statistical area 2 (SA2) boundaries as at 1 January 2025 as defined by Stats NZ, clipped to the coastline. This clipped version has been created for cartographic purposes and so does not fully represent the official full extent boundaries. This clipped version contains 2,311 SA2 areas.

SA2 is an output geography that provides higher aggregations of population data than can be provided at the statistical area 1 (SA1) level. The SA2 geography aims to reflect communities that interact together socially and economically. In populated areas, SA2s generally contain similar sized populations.

The SA2 should:

form a contiguous cluster of one or more SA1s,

excluding exceptions below, allow the release of multivariate statistics with minimal data suppression,

capture a similar type of area, such as a high-density urban area, farmland, wilderness area, and water area,

be socially homogeneous and capture a community of interest. It may have, for example:

a shared road network,

shared community facilities,

shared historical or social links, or

socio-economic similarity,

form a nested hierarchy with statistical output geographies and administrative boundaries. It must:

be built from SA1s,

either define or aggregate to define SA3s, urban areas, territorial authorities, and regional councils.

SA2s in city council areas generally have a population of 2,000–4,000 residents while SA2s in district council areas generally have a population of 1,000–3,000 residents.

In major urban areas, an SA2 or a group of SA2s often approximates a single suburb. In rural areas, rural settlements are included in their respective SA2 with the surrounding rural area.

SA2s in urban areas where there is significant business and industrial activity, for example ports, airports, industrial, commercial, and retail areas, often have fewer than 1,000 residents. These SA2s are useful for analysing business demographics, labour markets, and commuting patterns.

In rural areas, some SA2s have fewer than 1,000 residents because they are in conservation areas or contain sparse populations that cover a large area.

To minimise suppression of population data, small islands with zero or low populations close to the mainland, and marinas are generally included in their adjacent land-based SA2.

Zero or nominal population SA2s

To ensure that the SA2 geography covers all of New Zealand and aligns with New Zealand’s topography and local government boundaries, some SA2s have zero or nominal populations. These include:

SA2s where territorial authority boundaries straddle regional council boundaries. These SA2s each have fewer than 200 residents and are: Arahiwi, Tiroa, Rangataiki, Kaimanawa, Taharua, Te More, Ngamatea, Whangamomona, and Mara.

SA2s created for single islands or groups of islands that are some distance from the mainland or to separate large unpopulated islands from urban areas

SA2s that represent inland water, inlets or oceanic areas including: inland lakes larger than 50 square kilometres, harbours larger than 40 square kilometres, major ports, other non-contiguous inlets and harbours defined by territorial authority, and contiguous oceanic areas defined by regional council.

SA2s for non-digitised oceanic areas, offshore oil rigs, islands, and the Ross Dependency. Each SA2 is represented by a single meshblock. The following 16 SA2s are held in non-digitised form (SA2 code; SA2 name):

400001; New Zealand Economic Zone, 400002; Oceanic Kermadec Islands, 400003; Kermadec Islands, 400004; Oceanic Oil Rig Taranaki, 400005; Oceanic Campbell Island, 400006; Campbell Island, 400007; Oceanic Oil Rig Southland, 400008; Oceanic Auckland Islands, 400009; Auckland Islands, 400010; Oceanic Bounty Islands, 400011; Bounty Islands, 400012; Oceanic Snares Islands, 400013; Snares Islands, 400014; Oceanic Antipodes Islands, 400015; Antipodes Islands, 400016; Ross Dependency.

SA2 numbering and naming

Each SA2 is a single geographic entity with a name and a numeric code. The name refers to a geographic feature or a recognised place name or suburb. In some instances where place names are the same or very similar, the SA2s are differentiated by their territorial authority name, for example, Gladstone (Carterton District) and Gladstone (Invercargill City).

SA2 codes have six digits. North Island SA2 codes start with a 1 or 2, South Island SA2 codes start with a 3 and non-digitised SA2 codes start with a 4. They are numbered approximately north to south within their respective territorial authorities. To ensure the north–south code pattern is maintained, the SA2 codes were given 00 for the last two digits when the geography was created in 2018. When SA2 names or boundaries change only the last two digits of the code will change.
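The leading-digit rule above lends itself to a small lookup. The sketch below is an illustration only: the function name is hypothetical, the codes `100100` and `300100` are made-up examples, and only `400016` (Ross Dependency) comes from the list above.

```python
def classify_sa2_code(code: str) -> str:
    """Classify a six-digit SA2 code by its leading digit."""
    if len(code) != 6 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"SA2 codes have six digits, got {code!r}")
    first = code[0]
    if first in "12":
        return "North Island"
    if first == "3":
        return "South Island"
    if first == "4":
        return "non-digitised"
    # The description only covers leading digits 1 to 4.
    raise ValueError(f"unexpected leading digit in {code!r}")

print(classify_sa2_code("400016"))  # Ross Dependency, from the list above
print(classify_sa2_code("100100"))  # hypothetical code
print(classify_sa2_code("300100"))  # hypothetical code
```

Codes are handled as strings rather than integers so that the six-digit check stays explicit.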

Clipped Version

This clipped version has been created for cartographic purposes and so does not fully represent the official full extent boundaries.

High-definition version

This high definition (HD) version is the most detailed geometry, suitable for use in GIS for geometric analysis operations and for the computation of areas, centroids and other metrics. The HD version is aligned to the LINZ cadastre.

Macrons

Names are provided with and without tohutō/macrons. The column name for those without macrons is suffixed ‘ascii’.

Digital data

Digital boundary data became freely available on 1 July 2007.

Further information

To download geographic classifications in table formats such as CSV please use Ariā

For more information please refer to the Statistical standard for geographic areas 2023.

Contact: geography@stats.govt.nz
Case for omitting tied observations in the two-sample t-test and the...
plos.figshare.com
tiff
Updated Jun 4, 2023
Cite
Monnie McGee (2023). Case for omitting tied observations in the two-sample t-test and the Wilcoxon-Mann-Whitney Test [Dataset]. http://doi.org/10.1371/journal.pone.0200837
Explore at:
Available download formats: tiff
Unique identifier
https://doi.org/10.1371/journal.pone.0200837
Dataset updated
Jun 4, 2023
Dataset provided by
PLOS ONE
Authors
Monnie McGee
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
When the distributional assumptions for a t-test are not met, the default position of many analysts is to resort to a rank-based test, such as the Wilcoxon-Mann-Whitney Test, to compare the difference in means between two samples. The Wilcoxon-Mann-Whitney Test presents no danger of tied observations when the observations in the data are continuous. However, in practice, observations are discretized for various logical reasons, or the data are ordinal in nature. When ranks are tied, most textbooks recommend using mid-ranks to replace the tied ranks, a practice that affects the distribution of the Wilcoxon-Mann-Whitney Test under the null hypothesis. Other methods for breaking ties have also been proposed. In this study, we examine four tie-breaking methods (average-scores, mid-ranks, jittering, and omission) for their effects on Type I and Type II error of the Wilcoxon-Mann-Whitney Test and the two-sample t-test for various combinations of sample sizes, underlying population distributions, and percentages of tied observations. We use the results to determine the maximum percentage of ties for which the power and size are seriously affected, and which method of tie-breaking results in the best Type I and Type II error properties. Not surprisingly, the underlying population distribution of the data has less of an effect on the Wilcoxon-Mann-Whitney Test than on the t-test. Surprisingly, we find that the jittering and omission methods tend to hold Type I error at the nominal level, even for small sample sizes, with no substantial sacrifice in terms of Type II error. Furthermore, the t-test and the Wilcoxon-Mann-Whitney Test are equally affected by ties in terms of Type I and Type II error; therefore, we recommend omitting tied observations when they occur for both the two-sample t-test and the Wilcoxon-Mann-Whitney Test, due to the bias in Type I error that is created when tied observations are left in the data, in the case of the t-test, or adjusted using mid-ranks or average-scores, in the case of the Wilcoxon-Mann-Whitney Test.
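To make two of the tie-handling options concrete, the sketch below assigns mid-ranks to pooled data and, separately, drops tied values before ranking. It is not the author's code: the function names and sample values are hypothetical, and "omission" is assumed here to mean removing every value that occurs more than once in the pooled data, which may differ from the study's exact definition.

```python
from collections import Counter

def mid_ranks(values):
    """Assign 1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum(x, y):
    """Rank-sum statistic for sample x, using mid-ranks on the pooled data."""
    ranks = mid_ranks(list(x) + list(y))
    return sum(ranks[: len(x)])

def omit_ties(x, y):
    """Drop every value occurring more than once in the pooled data (assumed definition)."""
    counts = Counter(list(x) + list(y))
    return [v for v in x if counts[v] == 1], [v for v in y if counts[v] == 1]

# Hypothetical samples with a three-way tie at the value 2.
x = [1, 2, 2, 5]
y = [2, 3, 4, 6]
w_mid = rank_sum(x, y)              # statistic with mid-ranks
x_clean, y_clean = omit_ties(x, y)  # samples after omission
w_omit = rank_sum(x_clean, y_clean)
```

Note that omission shrinks both samples, so the null distribution of the statistic has to be taken for the reduced sample sizes.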
Statistical Area 2 2025
datafinder.stats.govt.nz
csv, dwg, geodatabase +6
Updated Aug 8, 2025
Cite
Stats NZ (2025). Statistical Area 2 2025 [Dataset]. https://datafinder.stats.govt.nz/layer/120978-statistical-area-2-2025/
Explore at:
Available download formats: pdf, csv, kml, mapinfo tab, shapefile, geopackage / sqlite, geodatabase, dwg, mapinfo mif
Dataset updated
Aug 8, 2025
Dataset provided by
Statistics New Zealand (http://www.stats.govt.nz/)
Authors
Stats NZ
License
https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Description
Refer to the 'Current Geographic Boundaries Table' layer for a list of all current geographies and recent updates.

This dataset is the definitive version of the annually released statistical area 2 (SA2) boundaries as at 1 January 2025 as defined by Stats NZ. This version contains 2,395 SA2s (2,379 digitised and 16 with empty or null geometries (non-digitised)).

SA2 is an output geography that provides higher aggregations of population data than can be provided at the statistical area 1 (SA1) level. The SA2 geography aims to reflect communities that interact together socially and economically. In populated areas, SA2s generally contain similar sized populations.

The SA2 should:

form a contiguous cluster of one or more SA1s,

excluding exceptions below, allow the release of multivariate statistics with minimal data suppression,

capture a similar type of area, such as a high-density urban area, farmland, wilderness area, and water area,

be socially homogeneous and capture a community of interest. It may have, for example:

a shared road network,

shared community facilities,

shared historical or social links, or

socio-economic similarity,

form a nested hierarchy with statistical output geographies and administrative boundaries. It must:

be built from SA1s,

either define or aggregate to define SA3s, urban areas, territorial authorities, and regional councils.

SA2s in city council areas generally have a population of 2,000–4,000 residents while SA2s in district council areas generally have a population of 1,000–3,000 residents.

In major urban areas, an SA2 or a group of SA2s often approximates a single suburb. In rural areas, rural settlements are included in their respective SA2 with the surrounding rural area.

SA2s in urban areas where there is significant business and industrial activity, for example ports, airports, industrial, commercial, and retail areas, often have fewer than 1,000 residents. These SA2s are useful for analysing business demographics, labour markets, and commuting patterns.

In rural areas, some SA2s have fewer than 1,000 residents because they are in conservation areas or contain sparse populations that cover a large area.

To minimise suppression of population data, small islands with zero or low populations close to the mainland, and marinas are generally included in their adjacent land-based SA2.

Zero or nominal population SA2s

To ensure that the SA2 geography covers all of New Zealand and aligns with New Zealand’s topography and local government boundaries, some SA2s have zero or nominal populations. These include:

SA2s where territorial authority boundaries straddle regional council boundaries. These SA2s each have fewer than 200 residents and are: Arahiwi, Tiroa, Rangataiki, Kaimanawa, Taharua, Te More, Ngamatea, Whangamomona, and Mara.

SA2s created for single islands or groups of islands that are some distance from the mainland or to separate large unpopulated islands from urban areas

SA2s that represent inland water, inlets or oceanic areas including: inland lakes larger than 50 square kilometres, harbours larger than 40 square kilometres, major ports, other non-contiguous inlets and harbours defined by territorial authority, and contiguous oceanic areas defined by regional council.

SA2s for non-digitised oceanic areas, offshore oil rigs, islands, and the Ross Dependency. Each SA2 is represented by a single meshblock. The following 16 SA2s are held in non-digitised form (SA2 code; SA2 name):

400001; New Zealand Economic Zone, 400002; Oceanic Kermadec Islands, 400003; Kermadec Islands, 400004; Oceanic Oil Rig Taranaki, 400005; Oceanic Campbell Island, 400006; Campbell Island, 400007; Oceanic Oil Rig Southland, 400008; Oceanic Auckland Islands, 400009; Auckland Islands, 400010; Oceanic Bounty Islands, 400011; Bounty Islands, 400012; Oceanic Snares Islands, 400013; Snares Islands, 400014; Oceanic Antipodes Islands, 400015; Antipodes Islands, 400016; Ross Dependency.

SA2 numbering and naming

Each SA2 is a single geographic entity with a name and a numeric code. The name refers to a geographic feature or a recognised place name or suburb. In some instances where place names are the same or very similar, the SA2s are differentiated by their territorial authority name, for example, Gladstone (Carterton District) and Gladstone (Invercargill City).

SA2 codes have six digits. North Island SA2 codes start with a 1 or 2, South Island SA2 codes start with a 3 and non-digitised SA2 codes start with a 4. They are numbered approximately north to south within their respective territorial authorities. To ensure the north–south code pattern is maintained, the SA2 codes were given 00 for the last two digits when the geography was created in 2018. When SA2 names or boundaries change only the last two digits of the code will change.

High-definition version

This high definition (HD) version is the most detailed geometry, suitable for use in GIS for geometric analysis operations and for the computation of areas, centroids and other metrics. The HD version is aligned to the LINZ cadastre.

Macrons

Names are provided with and without tohutō/macrons. The column name for those without macrons is suffixed ‘ascii’.

Digital data

Digital boundary data became freely available on 1 July 2007.

Further information

To download geographic classifications in table formats such as CSV please use Ariā

For more information please refer to the Statistical standard for geographic areas 2023.

Contact: geography@stats.govt.nz
Controlled Anomalies Time Series (CATS) Dataset
kaggle.com
Updated Sep 14, 2023
Cite
astro_pat (2023). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. https://www.kaggle.com/datasets/patrickfleith/controlled-anomalies-time-series-dataset/discussion
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 14, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
astro_pat
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system including:

4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.

3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.

10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.

5 million timestamps. Sensor readings are sampled at 1 Hz.

1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.

4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).

200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.

Contamination level of 0.038. This means about 3.8% of the observations (rows) are anomalous.

Different types of anomalies to understand what anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.

Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.

Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed is recorded and made available as part of the metadata. This can be useful for evaluating how well an algorithm traces anomalies back to the right root cause channel.

Affected channels. In addition to the root cause channel in which the anomaly first developed, we provide information on channels possibly affected by the anomaly. This can also be useful for evaluating the explainability of anomaly detection systems, which may point to the anomalous channels (root cause and affected).

Obvious anomalies. The simulated anomalies have been designed to be easy for the human eye to detect (i.e., there are very large spikes or oscillations), and hence also detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable of detecting those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.

Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.

Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise. While this may seem unrealistic at first, it is an advantage: users of the dataset can add any type of noise, at an amplitude of their choosing, on top of the provided series. This makes the dataset well suited to testing how sensitive and robust detection algorithms are to various levels of noise.
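One way to run such a robustness test is to add seeded Gaussian noise of a chosen amplitude to a clean series. The sketch below uses only the Python standard library; the function name, the Gaussian noise model, and the toy signal are assumptions for illustration, not something the dataset prescribes.

```python
import random


def add_gaussian_noise(signal, amplitude, seed=0):
    """Return a copy of `signal` with zero-mean Gaussian noise added.

    amplitude -- standard deviation of the noise (0 leaves values unchanged)
    seed      -- fixed seed so that experiments are repeatable
    """
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, amplitude) for x in signal]


clean = [0.0, 1.0, 2.0, 3.0]  # stand-in for one noise-free channel
noisy_low = add_gaussian_noise(clean, amplitude=0.1)
noisy_high = add_gaussian_noise(clean, amplitude=1.0)
```

Running the same detector on `clean`, `noisy_low`, and `noisy_high` then shows how its performance degrades as the noise level grows.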

No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.
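A missing-value experiment could be set up along these lines. The helper name, the use of `None` as the missing marker, and the uniformly random drop pattern are all assumptions for this sketch; other patterns (for example contiguous gaps) may be more relevant for a given detector.

```python
import random


def drop_values(series, fraction, seed=0):
    """Replace a random `fraction` of the values in `series` with None."""
    rng = random.Random(seed)
    n_drop = round(len(series) * fraction)
    # Choose distinct positions to blank out.
    dropped = set(rng.sample(range(len(series)), n_drop))
    return [None if i in dropped else x for i, x in enumerate(series)]


complete = [float(i) for i in range(10)]  # stand-in for one complete channel
with_gaps = drop_values(complete, fraction=0.3)
```

Comparing detector output on `complete` and `with_gaps` gives the impact of missing values relative to the clean baseline.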

[1] Example benchmark of anomaly detection in time series: Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779-1797, 2022. doi:10.14778/3538598.3538602

About Solenix

The dataset provider, Solenix, is an international company providing software e...
SAMS/Nimbus-7 Level 3 Zonal Means Composition Data V001 (SAMSN7L3ZMTG) at...
data.nasa.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
Updated Apr 1, 2025
nasa.gov (2025). SAMS/Nimbus-7 Level 3 Zonal Means Composition Data V001 (SAMSN7L3ZMTG) at GES DISC - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/sams-nimbus-7-level-3-zonal-means-composition-data-v001-samsn7l3zmtg-at-ges-disc-42559
Dataset updated
Apr 1, 2025
Dataset provided by
NASA (http://nasa.gov/)
Description
SAMSN7L3ZMTG is the Nimbus-7 Stratospheric and Mesospheric Sounder (SAMS) Level 3 Zonal Means Composition Data Product. The Earth's surface is divided into 2.5-deg latitudinal zones that extend from 50 deg South to 67.5 deg North. Retrieved mixing ratios of nitrous oxide (N2O) and methane (CH4) are averaged over day and night, along with errors, at 31 pressure levels between 50 and 0.125 mbar. Because the N2O and CH4 channels cannot function simultaneously, only one type of measurement is made for any nominal day. The data were recovered from the original magnetic tapes, and are now stored online as one file in its original proprietary binary format. The data for this product are available from 1 January 1979 through 30 December 1981. The principal investigators for the SAMS experiment were Prof. John T. Houghton and Dr. Fredric W. Taylor from Oxford University. This product was previously available from the NSSDC with the identifier ESAD-00180 (old ID 78-098A-02C).
A Comparison of Four Methods for the Analysis of N-of-1 Trials

25 scholarly articles cite this dataset (View in Google Scholar)
