9 datasets found
  1. f

    A Comparison of Four Methods for the Analysis of N-of-1 Trials

    • figshare.com
    doc
    Updated Jun 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Xinlin Chen; Pingyan Chen (2023). A Comparison of Four Methods for the Analysis of N-of-1 Trials [Dataset]. http://doi.org/10.1371/journal.pone.0087752
    Explore at:
    docAvailable download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Xinlin Chen; Pingyan Chen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ObjectiveTo provide a practical guidance for the analysis of N-of-1 trials by comparing four commonly used models.MethodsThe four models, paired t-test, mixed effects model of difference, mixed effects model and meta-analysis of summary data were compared using a simulation study. The assumed 3-cycles and 4-cycles N-of-1 trials were set with sample sizes of 1, 3, 5, 10, 20 and 30 respectively under normally distributed assumption. The data were generated based on variance-covariance matrix under the assumption of (i) compound symmetry structure or first-order autoregressive structure, and (ii) no carryover effect or 20% carryover effect. Type I error, power, bias (mean error), and mean square error (MSE) of effect differences between two groups were used to evaluate the performance of the four models.ResultsThe results from the 3-cycles and 4-cycles N-of-1 trials were comparable with respect to type I error, power, bias and MSE. Paired t-test yielded type I error near to the nominal level, higher power, comparable bias and small MSE, whether there was carryover effect or not. Compared with paired t-test, mixed effects model produced similar size of type I error, smaller bias, but lower power and bigger MSE. Mixed effects model of difference and meta-analysis of summary data yielded type I error far from the nominal level, low power, and large bias and MSE irrespective of the presence or absence of carryover effect.ConclusionWe recommended paired t-test to be used for normally distributed data of N-of-1 trials because of its optimal statistical performance. In the presence of carryover effects, mixed effects model could be used as an alternative.

  2. m

    Questionnaire data on land use change of Industrial Heritage: Insights from...

    • data.mendeley.com
    Updated Jul 20, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Arsalan Karimi (2023). Questionnaire data on land use change of Industrial Heritage: Insights from Decision-Makers in Shiraz, Iran [Dataset]. http://doi.org/10.17632/gk3z8gp7cp.2
    Explore at:
    Dataset updated
    Jul 20, 2023
    Authors
    Arsalan Karimi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Shiraz, Iran
    Description

    The survey dataset for identifying Shiraz old silo’s new use which includes four components: 1. The survey instrument used to collect the data “SurveyInstrument_table.pdf”. The survey instrument contains 18 main closed-ended questions in a table format. Two of these, concern information on Silo’s decision-makers and proposed new use followed up after a short introduction of the questionnaire, and others 16 (each can identify 3 variables) are related to the level of appropriate opinions for ideal intervention in Façade, Openings, Materials and Floor heights of the building in four values: Feasibility, Reversibility, Compatibility and Social Benefits. 2. The raw survey data “SurveyData.rar”. This file contains an Excel.xlsx and a SPSS.sav file. The survey data file contains 50 variables (12 for each of the four values separated by colour) and data from each of the 632 respondents. Answering each question in the survey was mandatory, therefor there are no blanks or non-responses in the dataset. In the .sav file, all variables were assigned with numeric type and nominal measurement level. More details about each variable can be found in the Variable View tab of this file. Additional variables were created by grouping or consolidating categories within each survey question for simpler analysis. These variables are listed in the last columns of the .xlsx file. 3. The analysed survey data “AnalysedData.rar”. This file contains 6 “SPSS Statistics Output Documents” which demonstrate statistical tests and analysis such as mean, correlation, automatic linear regression, reliability, frequencies, and descriptives. 4. The codebook “Codebook.rar”. The detailed SPSS “Codebook.pdf” alongside the simplified codebook as “VariableInformation_table.pdf” provides a comprehensive guide to all 50 variables in the survey data, including numerical codes for survey questions and response options. They serve as valuable resources for understanding the dataset, presenting dictionary information, and providing descriptive statistics, such as counts and percentages for categorical variables.

  3. f

    Data from: Examining Chi-Square Test Statistics Under Conditions of Large...

    • tandf.figshare.com
    xlsx
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dexin Shi; Christine DiStefano; Heather L. McDaniel; Zhehan Jiang (2023). Examining Chi-Square Test Statistics Under Conditions of Large Model Size and Ordinal Data [Dataset]. http://doi.org/10.6084/m9.figshare.6070703.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Dexin Shi; Christine DiStefano; Heather L. McDaniel; Zhehan Jiang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study examined the effect of model size on the chi-square test statistics obtained from ordinal factor analysis models. The performance of six robust chi-square test statistics were compared across various conditions, including number of observed variables (p), number of factors, sample size, model (mis)specification, number of categories, and threshold distribution. Results showed that the unweighted least squares (ULS) robust chi-square statistics generally outperform the diagonally weighted least squares (DWLS) robust chi-square statistics. The ULSM estimator performed the best overall. However, when fitting ordinal factor analysis models with a large number of observed variables and small sample size, the ULSM-based chi-square tests may yield empirical variances that are noticeably larger than the theoretical values and inflated Type I error rates. On the other hand, when the number of observed variables is very large, the mean- and variance-corrected chi-square test statistics (e.g., based on ULSMV and WLSMV) could produce empirical variances conspicuously smaller than the theoretical values and Type I error rates lower than the nominal level, and demonstrate lower power rates to reject misspecified models. Recommendations for applied researchers and future empirical studies involving large models are provided.

  4. Controlled Anomalies Time Series (CATS) Dataset

    • zenodo.org
    bin
    Updated Jul 12, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646897
    Explore at:
    binAvailable download formats
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Authors
    Patrick Fleith; Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables) including sensors reading and control signals. It simulates the operational behaviour of an arbitrary complex system including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensors readings are at 1Hz sampling frequency.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.
      • 4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies to understand what anomaly types can be detected by different approaches.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" to be detected for human eyes (i.e., there are very large spikes or oscillations), hence also detectable for most algorithms. It makes this synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable to detect those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
    • Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage since users of the dataset can decide to add on top of the provided series any type of noise and choose an amplitude. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

    About Solenix

    Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.

  5. Statistical Area 2 2025 Clipped

    • datafinder.stats.govt.nz
    csv, dwg, geodatabase +6
    Updated Aug 8, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stats NZ (2025). Statistical Area 2 2025 Clipped [Dataset]. https://datafinder.stats.govt.nz/layer/120969-statistical-area-2-2025-clipped/
    Explore at:
    pdf, csv, geopackage / sqlite, kml, geodatabase, mapinfo tab, dwg, mapinfo mif, shapefileAvailable download formats
    Dataset updated
    Aug 8, 2025
    Dataset provided by
    Statistics New Zealandhttp://www.stats.govt.nz/
    Authors
    Stats NZ
    License

    https://datafinder.stats.govt.nz/license/attribution-4-0-international/https://datafinder.stats.govt.nz/license/attribution-4-0-international/

    Area covered
    Description

    Refer to the 'Current Geographic Boundaries Table' layer for a list of all current geographies and recent updates.

    This dataset is the definitive version of the annually released statistical area 2 (SA2) boundaries as at 1 January 2025 as defined by Stats NZ, clipped to the coastline. This clipped version has been created for cartographic purposes and so does not fully represent the official full extent boundaries. This clipped version contains 2,311 SA2 areas.

    SA2 is an output geography that provides higher aggregations of population data than can be provided at the statistical area 1 (SA1) level. The SA2 geography aims to reflect communities that interact together socially and economically. In populated areas, SA2s generally contain similar sized populations.

    The SA2 should:

    form a contiguous cluster of one or more SA1s,

    excluding exceptions below, allow the release of multivariate statistics with minimal data suppression,

    capture a similar type of area, such as a high-density urban area, farmland, wilderness area, and water area,

    be socially homogeneous and capture a community of interest. It may have, for example:

    • a shared road network,
    • shared community facilities,
    • shared historical or social links, or
    • socio-economic similarity,

    form a nested hierarchy with statistical output geographies and administrative boundaries. It must:

    • be built from SA1s,
    • either define or aggregate to define SA3s, urban areas, territorial authorities, and regional councils.

    SA2s in city council areas generally have a population of 2,000–4,000 residents while SA2s in district council areas generally have a population of 1,000–3,000 residents.

    In major urban areas, an SA2 or a group of SA2s often approximates a single suburb. In rural areas, rural settlements are included in their respective SA2 with the surrounding rural area.

    SA2s in urban areas where there is significant business and industrial activity, for example ports, airports, industrial, commercial, and retail areas, often have fewer than 1,000 residents. These SA2s are useful for analysing business demographics, labour markets, and commuting patterns.

    In rural areas, some SA2s have fewer than 1,000 residents because they are in conservation areas or contain sparse populations that cover a large area.

    To minimise suppression of population data, small islands with zero or low populations close to the mainland, and marinas are generally included in their adjacent land-based SA2.

    Zero or nominal population SA2s

    To ensure that the SA2 geography covers all of New Zealand and aligns with New Zealand’s topography and local government boundaries, some SA2s have zero or nominal populations. These include:

    • SA2s where territorial authority boundaries straddle regional council boundaries. These SA2s each have fewer than 200 residents and are: Arahiwi, Tiroa, Rangataiki, Kaimanawa, Taharua, Te More, Ngamatea, Whangamomona, and Mara.
    • SA2s created for single islands or groups of islands that are some distance from the mainland or to separate large unpopulated islands from urban areas
    • SA2s that represent inland water, inlets or oceanic areas including: inland lakes larger than 50 square kilometres, harbours larger than 40 square kilometres, major ports, other non-contiguous inlets and harbours defined by territorial authority, and contiguous oceanic areas defined by regional council.
    • SA2s for non-digitised oceanic areas, offshore oil rigs, islands, and the Ross Dependency. Each SA2 is represented by a single meshblock. The following 16 SA2s are held in non-digitised form (SA2 code; SA2 name):

    400001; New Zealand Economic Zone, 400002; Oceanic Kermadec Islands, 400003; Kermadec Islands, 400004; Oceanic Oil Rig Taranaki, 400005; Oceanic Campbell Island, 400006; Campbell Island, 400007; Oceanic Oil Rig Southland, 400008; Oceanic Auckland Islands, 400009; Auckland Islands, 400010 ; Oceanic Bounty Islands, 400011; Bounty Islands, 400012; Oceanic Snares Islands, 400013; Snares Islands, 400014; Oceanic Antipodes Islands, 400015; Antipodes Islands, 400016; Ross Dependency.

    SA2 numbering and naming

    Each SA2 is a single geographic entity with a name and a numeric code. The name refers to a geographic feature or a recognised place name or suburb. In some instances where place names are the same or very similar, the SA2s are differentiated by their territorial authority name, for example, Gladstone (Carterton District) and Gladstone (Invercargill City).

    SA2 codes have six digits. North Island SA2 codes start with a 1 or 2, South Island SA2 codes start with a 3 and non-digitised SA2 codes start with a 4. They are numbered approximately north to south within their respective territorial authorities. To ensure the north–south code pattern is maintained, the SA2 codes were given 00 for the last two digits when the geography was created in 2018. When SA2 names or boundaries change only the last two digits of the code will change.

    Clipped Version

    This clipped version has been created for cartographic purposes and so does not fully represent the official full extent boundaries.

    High-definition version

    This high definition (HD) version is the most detailed geometry, suitable for use in GIS for geometric analysis operations and for the computation of areas, centroids and other metrics. The HD version is aligned to the LINZ cadastre.

    Macrons

    Names are provided with and without tohutō/macrons. The column name for those without macrons is suffixed ‘ascii’.

    Digital data

    Digital boundary data became freely available on 1 July 2007.

    Further information

    To download geographic classifications in table formats such as CSV please use Ariā

    For more information please refer to the Statistical standard for geographic areas 2023.

    Contact: geography@stats.govt.nz

  6. f

    Case for omitting tied observations in the two-sample t-test and the...

    • plos.figshare.com
    tiff
    Updated Jun 4, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Monnie McGee (2023). Case for omitting tied observations in the two-sample t-test and the Wilcoxon-Mann-Whitney Test [Dataset]. http://doi.org/10.1371/journal.pone.0200837
    Explore at:
    tiffAvailable download formats
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Monnie McGee
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    When the distributional assumptions for a t-test are not met, the default position of many analysts is to resort to a rank-based test, such as the Wilcoxon-Mann-Whitney Test to compare the difference in means between two samples. The Wilcoxon-Mann-Whitney Test presents no danger of tied observations when the observations in the data are continuous. However, in practice, observations are discretized due various logical reasons, or the data are ordinal in nature. When ranks are tied, most textbooks recommend using mid-ranks to replace the tied ranks, a practice that affects the distribution of the Wilcoxon-Mann-Whitney Test under the null hypothesis. Other methods for breaking ties have also been proposed. In this study, we examine four tie-breaking methods—average-scores, mid-ranks, jittering, and omission—for their effects on Type I and Type II error of the Wilcoxon-Mann-Whitney Test and the two-sample t-test for various combinations of sample sizes, underlying population distributions, and percentages of tied observations. We use the results to determine the maximum percentage of ties for which the power and size are seriously affected, and for which method of tie-breaking results in the best Type I and Type II error properties. Not surprisingly, the underlying population distribution of the data has less of an effect on the Wilcoxon-Mann-Whitney Test than on the t-test. Surprisingly, we find that the jittering and omission methods tend to hold Type I error at the nominal level, even for small sample sizes, with no substantial sacrifice in terms of Type II error. Furthermore, the t-test and the Wilcoxon-Mann-Whitney Test are equally effected by ties in terms of Type I and Type II error; therefore, we recommend omitting tied observations when they occur for both the two-sample t-test and the Wilcoxon-Mann-Whitney due to the bias in Type I error that is created when tied observations are left in the data, in the case of the t-test, or adjusted using mid-ranks or average-scores, in the case of the Wilcoxon-Mann-Whitney.

  7. Statistical Area 2 2025

    • datafinder.stats.govt.nz
    csv, dwg, geodatabase +6
    Updated Aug 8, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stats NZ (2025). Statistical Area 2 2025 [Dataset]. https://datafinder.stats.govt.nz/layer/120978-statistical-area-2-2025/
    Explore at:
    pdf, csv, kml, mapinfo tab, shapefile, geopackage / sqlite, geodatabase, dwg, mapinfo mifAvailable download formats
    Dataset updated
    Aug 8, 2025
    Dataset provided by
    Statistics New Zealandhttp://www.stats.govt.nz/
    Authors
    Stats NZ
    License

    https://datafinder.stats.govt.nz/license/attribution-4-0-international/https://datafinder.stats.govt.nz/license/attribution-4-0-international/

    Area covered
    Description

    Refer to the 'Current Geographic Boundaries Table' layer for a list of all current geographies and recent updates.

    This dataset is the definitive version of the annually released statistical area 2 (SA2) boundaries as at 1 January 2025 as defined by Stats NZ. This version contains 2,395 SA2s (2,379 digitised and 16 with empty or null geometries (non-digitised)).

    SA2 is an output geography that provides higher aggregations of population data than can be provided at the statistical area 1 (SA1) level. The SA2 geography aims to reflect communities that interact together socially and economically. In populated areas, SA2s generally contain similar sized populations.

    The SA2 should:

    form a contiguous cluster of one or more SA1s,

    excluding exceptions below, allow the release of multivariate statistics with minimal data suppression,

    capture a similar type of area, such as a high-density urban area, farmland, wilderness area, and water area,

    be socially homogeneous and capture a community of interest. It may have, for example:

    • a shared road network,
    • shared community facilities,
    • shared historical or social links, or
    • socio-economic similarity,

    form a nested hierarchy with statistical output geographies and administrative boundaries. It must:

    • be built from SA1s,
    • either define or aggregate to define SA3s, urban areas, territorial authorities, and regional councils.

    SA2s in city council areas generally have a population of 2,000–4,000 residents while SA2s in district council areas generally have a population of 1,000–3,000 residents.

    In major urban areas, an SA2 or a group of SA2s often approximates a single suburb. In rural areas, rural settlements are included in their respective SA2 with the surrounding rural area.

    SA2s in urban areas where there is significant business and industrial activity, for example ports, airports, industrial, commercial, and retail areas, often have fewer than 1,000 residents. These SA2s are useful for analysing business demographics, labour markets, and commuting patterns.

    In rural areas, some SA2s have fewer than 1,000 residents because they are in conservation areas or contain sparse populations that cover a large area.

    To minimise suppression of population data, small islands with zero or low populations close to the mainland, and marinas are generally included in their adjacent land-based SA2.

    Zero or nominal population SA2s

    To ensure that the SA2 geography covers all of New Zealand and aligns with New Zealand’s topography and local government boundaries, some SA2s have zero or nominal populations. These include:

    • SA2s where territorial authority boundaries straddle regional council boundaries. These SA2s each have fewer than 200 residents and are: Arahiwi, Tiroa, Rangataiki, Kaimanawa, Taharua, Te More, Ngamatea, Whangamomona, and Mara.
    • SA2s created for single islands or groups of islands that are some distance from the mainland or to separate large unpopulated islands from urban areas
    • SA2s that represent inland water, inlets or oceanic areas including: inland lakes larger than 50 square kilometres, harbours larger than 40 square kilometres, major ports, other non-contiguous inlets and harbours defined by territorial authority, and contiguous oceanic areas defined by regional council.
    • SA2s for non-digitised oceanic areas, offshore oil rigs, islands, and the Ross Dependency. Each SA2 is represented by a single meshblock. The following 16 SA2s are held in non-digitised form (SA2 code; SA2 name):

    400001; New Zealand Economic Zone, 400002; Oceanic Kermadec Islands, 400003; Kermadec Islands, 400004; Oceanic Oil Rig Taranaki, 400005; Oceanic Campbell Island, 400006; Campbell Island, 400007; Oceanic Oil Rig Southland, 400008; Oceanic Auckland Islands, 400009; Auckland Islands, 400010 ; Oceanic Bounty Islands, 400011; Bounty Islands, 400012; Oceanic Snares Islands, 400013; Snares Islands, 400014; Oceanic Antipodes Islands, 400015; Antipodes Islands, 400016; Ross Dependency.

    SA2 numbering and naming

    Each SA2 is a single geographic entity with a name and a numeric code. The name refers to a geographic feature or a recognised place name or suburb. In some instances where place names are the same or very similar, the SA2s are differentiated by their territorial authority name, for example, Gladstone (Carterton District) and Gladstone (Invercargill City).

    SA2 codes have six digits. North Island SA2 codes start with a 1 or 2, South Island SA2 codes start with a 3 and non-digitised SA2 codes start with a 4. They are numbered approximately north to south within their respective territorial authorities. To ensure the north–south code pattern is maintained, the SA2 codes were given 00 for the last two digits when the geography was created in 2018. When SA2 names or boundaries change only the last two digits of the code will change.

    High-definition version

    This high definition (HD) version is the most detailed geometry, suitable for use in GIS for geometric analysis operations and for the computation of areas, centroids and other metrics. The HD version is aligned to the LINZ cadastre.

    Macrons

    Names are provided with and without tohutō/macrons. The column name for those without macrons is suffixed ‘ascii’.

    Digital data

    Digital boundary data became freely available on 1 July 2007.

    Further information

    To download geographic classifications in table formats such as CSV please use Ariā

    For more information please refer to the Statistical standard for geographic areas 2023.

    Contact: geography@stats.govt.nz

  8. Controlled Anomalies Time Series (CATS) Dataset

    • kaggle.com
    Updated Sep 14, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    astro_pat (2023). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. https://www.kaggle.com/datasets/patrickfleith/controlled-anomalies-time-series-dataset/discussion
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 14, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    astro_pat
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables)including sensors reading and control signals. It simulates the operational behaviour of an arbitrary complex system including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensors readings are at 1Hz sampling frequency.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.
      • 4 million observations that include** both nominal and anomalous segments.** This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
      • Contamination level of 0.038. This means about 3.8% of the observations (rows) are anomalous.
    • Different types of anomalies to understand what anomaly types can be detected by different approaches. The categories are available in the dataset and in the metadata.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.
    • Suitable for root cause analysis. In addition to the anomaly category, the time series channel in which the anomaly first developed itself is recorded and made available as part of the metadata. This can be useful to evaluate the performance of algorithm to trace back anomalies to the right root cause channel.
    • Affected channels. In addition to the knowledge of the root cause channel in which the anomaly first developed itself, we provide information of channels possibly affected by the anomaly. This can also be useful to evaluate the explainability of anomaly detection systems which may point out to the anomalous channels (root cause and affected).
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" to be detected for human eyes (i.e., there are very large spikes or oscillations), hence also detectable for most algorithms. It makes this synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable to detect those obvious anomalies). However, during**** our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
    • Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage since users of the dataset can decide to add on top of the provided series any type of noise and choose an amplitude. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

    About Solenix

    The dataset provider, Solenix, is an international company providing software e...

  9. SAMS/Nimbus-7 Level 3 Zonal Means Composition Data V001 (SAMSN7L3ZMTG) at...

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    nasa.gov (2025). SAMS/Nimbus-7 Level 3 Zonal Means Composition Data V001 (SAMSN7L3ZMTG) at GES DISC - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/sams-nimbus-7-level-3-zonal-means-composition-data-v001-samsn7l3zmtg-at-ges-disc-42559
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASAhttp://nasa.gov/
    Description

    SAMSN7L3ZMTG is the Nimbus-7 Stratospheric and Mesospheric Sounder (SAMS) Level 3 Zonal Means Composition Data Product. The Earth's surface is divided into 2.5-deg latitudinal zones that extend from 50 deg South to 67.5 deg North. Retrieved mixing ratios of nitrous oxide (N2O) and methane (CH4) are averaged over day and night, along with errors, at 31 pressure levels between 50 and 0.125 mbar. Because the N2O and CH4 channels cannot function simultaneously, only one type of measurement is made for any nominal day. The data were recovered from the original magnetic tapes, and are now stored online as one file in its original proprietary binary format.The data for this product are available from 1 January 1979 through 30 December 1981. The principal investigators for the SAMS experiment were Prof. John T. Houghton and Dr. Fredric W. Taylor from Oxford University.This product was previously available from the NSSDC with the identifier ESAD-00180 (old ID 78-098A-02C).

  10. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Xinlin Chen; Pingyan Chen (2023). A Comparison of Four Methods for the Analysis of N-of-1 Trials [Dataset]. http://doi.org/10.1371/journal.pone.0087752

A Comparison of Four Methods for the Analysis of N-of-1 Trials

Explore at:
25 scholarly articles cite this dataset (View in Google Scholar)
docAvailable download formats
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS ONE
Authors
Xinlin Chen; Pingyan Chen
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

ObjectiveTo provide a practical guidance for the analysis of N-of-1 trials by comparing four commonly used models.MethodsThe four models, paired t-test, mixed effects model of difference, mixed effects model and meta-analysis of summary data were compared using a simulation study. The assumed 3-cycles and 4-cycles N-of-1 trials were set with sample sizes of 1, 3, 5, 10, 20 and 30 respectively under normally distributed assumption. The data were generated based on variance-covariance matrix under the assumption of (i) compound symmetry structure or first-order autoregressive structure, and (ii) no carryover effect or 20% carryover effect. Type I error, power, bias (mean error), and mean square error (MSE) of effect differences between two groups were used to evaluate the performance of the four models.ResultsThe results from the 3-cycles and 4-cycles N-of-1 trials were comparable with respect to type I error, power, bias and MSE. Paired t-test yielded type I error near to the nominal level, higher power, comparable bias and small MSE, whether there was carryover effect or not. Compared with paired t-test, mixed effects model produced similar size of type I error, smaller bias, but lower power and bigger MSE. Mixed effects model of difference and meta-analysis of summary data yielded type I error far from the nominal level, low power, and large bias and MSE irrespective of the presence or absence of carryover effect.ConclusionWe recommended paired t-test to be used for normally distributed data of N-of-1 trials because of its optimal statistical performance. In the presence of carryover effects, mixed effects model could be used as an alternative.

Search
Clear search
Close search
Google apps
Main menu