100+ datasets found
  1. Confidence Interval Examples

    • figshare.com
    application/cdfv2
    Updated Jun 28, 2016
    Cite
    Emily Rollinson (2016). Confidence Interval Examples [Dataset]. http://doi.org/10.6084/m9.figshare.3466364.v2
    Explore at:
    Available download formats: application/cdfv2
    Dataset updated
    Jun 28, 2016
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Emily Rollinson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Examples demonstrating how confidence intervals change depending on the level of confidence (90% versus 95% versus 99%) and on the size of the sample (CI for n=20 versus n=10 versus n=2). Developed for BIO211 (Statistics and Data Analysis: A Conceptual Approach) at Stony Brook University in Fall 2015.
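
    To make the comparison concrete, the following minimal Python sketch (not part of the dataset) computes t-based confidence intervals for a sample mean at the three confidence levels and sample sizes mentioned above; the simulated population and its parameters are assumptions chosen purely for illustration.

    # Hypothetical illustration: CI width grows with the confidence level and shrinks as n grows.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    population = rng.normal(loc=50, scale=10, size=100_000)  # assumed population

    for n in (20, 10, 2):
        sample = rng.choice(population, size=n, replace=False)
        mean, sem = sample.mean(), stats.sem(sample)
        for conf in (0.90, 0.95, 0.99):
            lo, hi = stats.t.interval(conf, n - 1, loc=mean, scale=sem)
            print(f"n={n:2d}, {conf:.0%} CI: ({lo:8.2f}, {hi:8.2f})")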

  2. Data from: A Statistical Inference Course Based on p-Values

    • figshare.com
    • tandf.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Ryan Martin (2023). A Statistical Inference Course Based on p-Values [Dataset]. http://doi.org/10.6084/m9.figshare.3494549.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Ryan Martin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introductory statistical inference texts and courses treat the point estimation, hypothesis testing, and interval estimation problems separately, with primary emphasis on large-sample approximations. Here, I present an alternative approach to teaching this course, built around p-values, emphasizing provably valid inference for all sample sizes. Details about computation and marginalization are also provided, with several illustrative examples, along with a course outline. Supplementary materials for this article are available online.

  3. The banksia plot: a method for visually comparing point estimates and...

    • researchdata.edu.au
    • datasetcatalog.nlm.nih.gov
    • +1more
    Updated Apr 16, 2024
    Cite
    Simon Turner; Joanne McKenzie; Emily Karahalios; Elizabeth Korevaar (2024). The banksia plot: a method for visually comparing point estimates and confidence intervals across datasets [Dataset]. http://doi.org/10.26180/25286407.V2
    Explore at:
    Dataset updated
    Apr 16, 2024
    Dataset provided by
    Monash University
    Authors
    Simon Turner; Joanne McKenzie; Emily Karahalios; Elizabeth Korevaar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Companion data for the creation of a banksia plot:

    Background:

    In research evaluating statistical analysis methods, a common aim is to compare point estimates and confidence intervals (CIs) calculated from different analyses. This can be challenging when the outcomes (and their scale ranges) differ across datasets. We therefore developed a plot to facilitate pairwise comparisons of point estimates and confidence intervals from different statistical analyses both within and across datasets.

    Methods:

    The plot was developed and refined over the course of an empirical study. To compare results from a variety of different studies, a system of centring and scaling is used. Firstly, the point estimates from reference analyses are centred to zero, followed by scaling confidence intervals to span a range of one. The point estimates and confidence intervals from matching comparator analyses are then adjusted by the same amounts. This enables the relative positions of the point estimates and CI widths to be quickly assessed while maintaining the relative magnitudes of the difference in point estimates and confidence interval widths between the two analyses. Banksia plots can be graphed in a matrix, showing all pairwise comparisons of multiple analyses. In this paper, we show how to create a banksia plot and present two examples: the first relates to an empirical evaluation assessing the difference between various statistical methods across 190 interrupted time series (ITS) data sets with widely varying characteristics, while the second example assesses data extraction accuracy comparing results obtained from analysing original study data (43 ITS studies) with those obtained by four researchers from datasets digitally extracted from graphs from the accompanying manuscripts.
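
    As a rough illustration of this centring-and-scaling step, here is a short Python sketch; it is an assumption-laden illustration only, not the companion Stata/R code referenced in this entry.

    def centre_and_scale(ref_est, ref_lo, ref_hi, cmp_est, cmp_lo, cmp_hi):
        """Centre the reference point estimate at 0, scale its CI to width 1,
        and apply the same shift and scale to the comparator analysis."""
        shift = ref_est
        scale = ref_hi - ref_lo              # reference CI width becomes 1 after division
        def t(x):
            return (x - shift) / scale
        return (t(ref_est), t(ref_lo), t(ref_hi)), (t(cmp_est), t(cmp_lo), t(cmp_hi))

    # Hypothetical numbers: reference 0.20 (0.10, 0.30) vs comparator 0.25 (0.05, 0.45)
    print(centre_and_scale(0.20, 0.10, 0.30, 0.25, 0.05, 0.45))
    # -> ((0.0, -0.5, 0.5), (0.25, -0.75, 1.25)), up to floating-point rounding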

    Results:

    In the banksia plot of statistical method comparison, it was clear that there was no difference, on average, in point estimates and it was straightforward to ascertain which methods resulted in smaller, similar or larger confidence intervals than others. In the banksia plot comparing analyses from digitally extracted data to those from the original data it was clear that both the point estimates and confidence intervals were all very similar among data extractors and original data.

    Conclusions:

    The banksia plot, a graphical representation of centred and scaled confidence intervals, provides a concise summary of comparisons between multiple point estimates and associated CIs in a single graph. Through this visualisation, patterns and trends in the point estimates and confidence intervals can be easily identified.

    This collection of files allows the user to create the images used in the companion paper and to amend the code to create their own banksia plots using either Stata version 17 or R version 4.3.1.

  4. Winkler Interval score metric

    • kaggle.com
    Updated Dec 7, 2023
    Cite
    Carl McBride Ellis (2023). Winkler Interval score metric [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/winkler-interval-score-metric
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carl McBride Ellis
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Model performance evaluation: The Mean Winkler Interval score (MWIS)

    We can assess the overall performance of a regression model that produces prediction intervals by using the mean Winkler Interval score [1,2,3] which, for an individual interval, is given by:

    \[ W_\alpha(l, u; y) = (u - l) + \frac{2}{\alpha}\,(l - y)\,\mathbf{1}\{y < l\} + \frac{2}{\alpha}\,(y - u)\,\mathbf{1}\{y > u\} \]

    where \(y\) is the true value, \(u\) is the upper bound of the prediction interval, \(l\) is the lower bound, and \(\alpha\) is (1 - coverage). For example, for 90% coverage, \(\alpha = 0.1\). Note that the Winkler Interval score constitutes a proper scoring rule [2,3].
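
    For reference, the interval score above can be written directly in NumPy. This is only an illustrative sketch of the formula, not the MWIS_metric module that ships with this dataset (see the usage example below).

    import numpy as np

    def mean_winkler_interval_score(y_true, lower, upper, alpha):
        """Mean Winkler interval score and empirical coverage, per the formula above."""
        y, l, u = (np.asarray(a, dtype=float) for a in (y_true, lower, upper))
        score = ((u - l)
                 + (2.0 / alpha) * (l - y) * (y < l)    # penalty when y falls below the interval
                 + (2.0 / alpha) * (y - u) * (y > u))   # penalty when y falls above the interval
        coverage = np.mean((y >= l) & (y <= u))
        return score.mean(), coverage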

    Python code: Usage example

    Attach this dataset to a notebook, then:

    import sys
    sys.path.append('/kaggle/input/winkler-interval-score-metric/')
    import MWIS_metric
    help(MWIS_metric.score)

    # "predictions" is assumed to be a DataFrame with columns "y_true", "lower" and "upper";
    # alpha = 1 - coverage, e.g. alpha = 0.1 for 90% prediction intervals
    MWIS, coverage = MWIS_metric.score(predictions["y_true"], predictions["lower"], predictions["upper"], alpha)
    print(f"Local MWI score       {round(MWIS, 3)}")
    print(f"Predictions coverage  {round(coverage * 100, 1)}%")
    
  5. Estimating Confidence Intervals for 2020 Census Statistics Using Approximate...

    • registry.opendata.aws
    Updated Aug 5, 2024
    + more versions
    Cite
    United States Census Bureau (2024). Estimating Confidence Intervals for 2020 Census Statistics Using Approximate Monte Carlo Simulation (2010 Census Proof of Concept) [Dataset]. https://registry.opendata.aws/census-2010-amc-mdf-replicates/
    Explore at:
    Dataset updated
    Aug 5, 2024
    Dataset provided by
    United States Census Bureau (http://census.gov/)
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The 2010 Census Production Settings Demographic and Housing Characteristics (DHC) Approximate Monte Carlo (AMC) method seed Privacy Protected Microdata File (PPMF0) and PPMF replicates (PPMF1, PPMF2, ..., PPMF25) are a set of microdata files intended for use in estimating the magnitude of error(s) introduced by the 2020 Decennial Census Disclosure Avoidance System (DAS) into the Redistricting and DHC products. The PPMF0 was created by executing the 2020 DAS TopDown Algorithm (TDA) using the confidential 2010 Census Edited File (CEF) as the initial input; the replicates were then created by executing the 2020 DAS TDA repeatedly with the PPMF0 as its initial input. Inspired by analogy to the use of bootstrap methods in non-private contexts, U.S. Census Bureau (USCB) researchers explored whether simple calculations based on comparing each PPMFi to the PPMF0 could be used to reliably estimate the scale of errors introduced by the 2020 DAS, and generally found this approach worked well.
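
    As a hedged sketch of that replicate idea, one could tabulate a statistic on the seed file and on each replicate and use the spread of the differences as an error-scale estimate. The file paths, the column name and the simple estimator below are assumptions for illustration; the methodology actually used is described in Ashmead et al. (2024), cited below.

    import pandas as pd

    def block_population(path):
        ppmf = pd.read_csv(path)               # hypothetical per-record microdata file
        return ppmf.groupby("TABBLK").size()   # persons per block; the column name is an assumption

    seed = block_population("ppmf0.csv")
    diffs = [block_population(f"ppmf{i}.csv").sub(seed, fill_value=0) for i in range(1, 26)]
    approx_se = pd.concat(diffs, axis=1).std(axis=1, ddof=1)   # rough per-block error scale
    print(approx_se.describe())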

    The PPMF0 and PPMFi files contained here are provided so that external researchers can estimate properties of DAS-introduced error without privileged access to internal USCB-curated data sets; further information on the estimation methodology can be found in Ashmead et al. (2024).

    The 2010 DHC AMC seed PPMF0 and PPMF replicates have been cleared for public dissemination by the USCB Disclosure Review Board (CBDRB-FY24-DSEP-0002). The 2010 PPMF0 included in these files was produced using the same parameters and settings as were used to produce the 2010 Demonstration Data Product Suite (2023-04-03) PPMF, but represents an independent execution of the TopDown Algorithm. The PPMF0 and PPMF replicates contain all Person and Units attributes necessary to produce the Redistricting and DHC publications for both the United States and Puerto Rico, and include geographic detail down to the Census Block level. They do not include attributes specific to either the Detailed DHC-A or Detailed DHC-B products; in particular, data on Major Race (e.g., White Alone) is included, but data on Detailed Race (e.g., Cambodian) is not included in the PPMF0 and replicates.

    The 2020 AMC replicate files for estimating confidence intervals for the official 2020 Census statistics are available.

  6. DEMANDE Dataset

    • zenodo.org
    • researchdiscovery.drexel.edu
    zip
    Updated Apr 13, 2023
    Cite
    Joseph A. Gallego-Mejia; Joseph A. Gallego-Mejia; Fabio A Gonzalez; Fabio A Gonzalez (2023). DEMANDE Dataset [Dataset]. http://doi.org/10.5281/zenodo.7822851
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 13, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joseph A. Gallego-Mejia; Joseph A. Gallego-Mejia; Fabio A Gonzalez; Fabio A Gonzalez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the features and probabilities of ten different functions. Each dataset is saved using numpy arrays.

    - The data set Arc corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)=\mathcal{N}(x_2|0,4)\,\mathcal{N}(x_1|0.25x_2^2,1)$$, where $$\mathcal{N}(u|\mu,\sigma^2)$$ denotes the density function of a normal distribution with mean $$\mu$$ and variance $$\sigma^2$$. Papamakarios (2017) used this data set to evaluate his neural density estimation methods.
    - The data set Potential 1 corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)=\frac{1}{2}\left(\frac{||x||-2}{0.4}\right)^2 - \ln\left(\exp\left\{-\frac{1}{2}\left[\frac{x_1-2}{0.6}\right]^2\right\}+\exp\left\{-\frac{1}{2}\left[\frac{x_1+2}{0.6}\right]^2\right\}\right)$$, with a normalizing constant of approximately 6.52 calculated by Monte Carlo integration.
    - The data set Potential 2 corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)=\frac{1}{2}\left[\frac{x_2-w_1(x)}{0.4}\right]^2$$, where $$w_1(x)=\sin\left(\frac{2\pi x_1}{4}\right)$$, with a normalizing constant of approximately 8 calculated by Monte Carlo integration.
    - The data set Potential 3 corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)=-\ln\left(\exp\left\{-\frac{1}{2}\left[\frac{x_2-w_1(x)}{0.35}\right]^2\right\}+\exp\left\{-\frac{1}{2}\left[\frac{x_2-w_1(x)+w_2(x)}{0.35}\right]^2\right\}\right)$$, where $$w_1(x)=\sin\left(\frac{2\pi x_1}{4}\right)$$ and $$w_2(x)=3\exp\left\{-\frac{1}{2}\left[\frac{x_1-1}{0.6}\right]^2\right\}$$, with a normalizing constant of approximately 13.9 calculated by Monte Carlo integration.
    - The data set Potential 4 corresponds to a two-dimensional random sample drawn from a random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x_1,x_2)=-\ln\left(\exp\left\{-\frac{1}{2}\left[\frac{x_2-w_1(x)}{0.4}\right]^2\right\}+\exp\left\{-\frac{1}{2}\left[\frac{x_2-w_1(x)+w_3(x)}{0.35}\right]^2\right\}\right)$$, where $$w_1(x)=\sin\left(\frac{2\pi x_1}{4}\right)$$, $$w_3(x)=3\,\sigma\!\left(\left[\frac{x_1-1}{0.3}\right]^2\right)$$, and $$\sigma(x)=\frac{1}{1+\exp(x)}$$, with a normalizing constant of approximately 13.9 calculated by Monte Carlo integration.
    - The data set 2D mixture corresponds to a two-dimensional random sample drawn from the random vector $$X=(X_1,X_2)$$ with probability density function given by $$f(x)=\frac{1}{2}\mathcal{N}(x|\mu_1,\Sigma_1)+\frac{1}{2}\mathcal{N}(x|\mu_2,\Sigma_2)$$, with means and covariance matrices $$\mu_1=[1,-1]^T$$, $$\mu_2=[-2,2]^T$$, $$\Sigma_1=\left[\begin{array}{cc} 1 & 0 \\ 0 & 2 \end{array}\right]$$, and $$\Sigma_2=\left[\begin{array}{cc} 2 & 0 \\ 0 & 1 \end{array}\right]$$.
    - The data set 10D-mixture corresponds to a 10-dimensional random sample drawn from the random vector $$X=(X_1,\cdots,X_{10})$$ with a mixture of four diagonal normal probability density functions $$\mathcal{N}(X_i|\mu_i,\sigma_i)$$, where each $$\mu_i$$ is drawn uniformly in the interval $$[-0.5,0.5]$$ and each $$\sigma_i$$ is drawn uniformly in the interval $$[-0.01,0.5]$$. Each diagonal normal probability density has the same probability, $$1/4$$, of being drawn.
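
    As an illustration of the first entry above, the following Python sketch evaluates and samples from the Arc density exactly as defined; it is not part of the dataset's files, whose array layout is described by the authors.

    import numpy as np
    from scipy import stats

    def arc_density(x1, x2):
        # f(x1, x2) = N(x2 | 0, 4) * N(x1 | 0.25*x2^2, 1); variance 4 means scale (std) 2
        return stats.norm.pdf(x2, loc=0.0, scale=2.0) * stats.norm.pdf(x1, loc=0.25 * x2**2, scale=1.0)

    def sample_arc(n, seed=0):
        rng = np.random.default_rng(seed)
        x2 = rng.normal(0.0, 2.0, size=n)          # X2 ~ N(0, 4)
        x1 = rng.normal(0.25 * x2**2, 1.0)         # X1 | X2 ~ N(0.25 * X2^2, 1)
        return np.column_stack([x1, x2])

    samples = sample_arc(1000)
    print(arc_density(samples[:5, 0], samples[:5, 1]))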

  7. Traffic Signal Change and Clearance Interval Pooled Fund Study Utah BSM...

    • catalog.data.gov
    • data.virginia.gov
    Updated Nov 4, 2025
    Cite
    Federal Highway Administration (2025). Traffic Signal Change and Clearance Interval Pooled Fund Study Utah BSM Trajectories Sample [Dataset]. https://catalog.data.gov/dataset/traffic-signal-change-and-clearance-interval-pooled-fund-study-utah-bsm-trajectories-sampl
    Explore at:
    Dataset updated
    Nov 4, 2025
    Dataset provided by
    Federal Highway Administration (https://highways.dot.gov/)
    Description

    This dataset contains timestamped Basic Safety Messages (BSMs) collected from connected vehicles operating in Utah from vendor PANASONIC as part of the ITS JPO's Traffic Signal Change and Clearance Interval Pooled Fund Study. The data includes GPS location, speed, heading, accelerations, and brake status at 10 Hz frequency. These BSMs were transmitted from vehicles equipped with aftermarket onboard units (OBUs) and have been anonymized. The dataset supports research related to vehicle kinematics during signal change intervals and interactions with traffic signal states. To request the full dataset please email data.itsjpo@dot.gov.

  8. Condition Data with Random Recording Time

    • kaggle.com
    zip
    Updated Jun 10, 2022
    Cite
    Prognostics @ HSE (2022). Condition Data with Random Recording Time [Dataset]. https://www.kaggle.com/datasets/prognosticshse/condition-data-with-random-recording-time/data
    Explore at:
    Available download formats: zip (1,167,682 bytes)
    Dataset updated
    Jun 10, 2022
    Authors
    Prognostics @ HSE
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context: This data set originates from a practice-relevant degradation process, which is representative of Prognostics and Health Management (PHM) applications. The observed degradation process is the clogging of filters when separating solid particles from gas. A test bench is used for this purpose, which performs automated life testing of filter media by loading them. For testing, dust complying with ISO standard 12103-1 and with a known particle size distribution is employed. The employed filter media is made of randomly oriented non-woven fibre material. Further data sets are generated for various practice-relevant data situations which do not correspond to the ideal conditions of full data coverage. These data sets are uploaded to Kaggle by the user "Prognostics @ HSE" in a continuous process. In order to avoid carryover between data sets, a different configuration of the filter tests is used for each uploaded practice-relevant data situation, for example by selecting a different filter media.

    Detailed specification: For more information about the general operation and the components used, see the provided description file Random Recording Condition Data Data Set.pdf

    Given data situation: In order to implement a predictive maintenance policy, knowledge about the time of failure, or equivalently the remaining useful life (RUL), of the technical system is necessary. The time of failure or the RUL can be predicted on the basis of condition data that indicate the damage progression of a technical system over time. However, the collection of condition data in typical industrial PHM applications is often only possible in an incomplete manner. An example is the collection of data during defined test cycles with specific loads, carried out at intervals. For instance, this approach is often used with machining centers, where test cycles are only carried out between finished machining jobs or work shifts. Due to different work pieces, the machining time varies and the test cycle with the recording of condition data is not performed equidistantly. This results in a data characteristic that is comparable to a random sample of continuously recorded condition data. Another example that may result in such a data characteristic comes from the effort to reduce data volumes when recording condition data. Attempts can be made to keep the amount of data as small as possible while leaving the damage information unchanged. One possible measure is not to transmit and store the continuous sensor readings in full, but rather only sections of them, which also leads to gaps in the data available for prognosis.

    In the present data set, the life cycle of filters, or rather their condition data represented by the differential pressure, is considered. Failure of the filter occurs when the differential pressure across the filter exceeds 600 Pa. The time until a filter failure occurs depends especially on the amount of dust supplied per unit time, which is constant within a run-to-failure cycle. The previously explained data characteristics are addressed by means of corresponding training and test data.

    The training data is structured as follows: a run-to-failure cycle contains n batches of data. The number n varies between the cycles and depends on the duration of the batches and the time interval between the individual batches. The duration and time interval of the batches are random variables. A data batch includes the sensor readings of differential pressure and flow rate for the filter, the start and end time of the batch, and RUL information related to the end time of the batch. The sensor readings of the differential pressure and flow rate are recorded at a constant sampling rate. Figure 6 shows an illustrative run-to-failure cycle with multiple batches.

    The test data are randomly right-censored. They also consist of batches with a random duration and time interval between the batches. For each batch contained, the start and end time are given, as well as the sensor readings within the batch. The RUL is not given for each batch but only for the last data point of the right-censored run-to-failure cycle.

    Task: The aim is to predict the RUL of the censored filter test cycles given in the test data. In order to predict the RUL, training and test data are given, consisting of 60 and 40 run-to-failure cycles, respectively. The test data contains randomly right-censored run-to-failure cycles and the respective RUL for the prediction task. The main challenge is to make the best use of the incompletely recorded training and test data to provide the most accurate prediction possible. Due to the detailed description of the setup and the various physical filter models described in the literature, it is possible to support the actual data-driven models by integrating physical knowledge or models in the sense of theory-guided data science or informed machine learning.
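
    As a hedged illustration of how the batch structure described above might be consumed, a loading sketch could look like the following; the file layout and column names here are pure assumptions, and the authoritative description is in the provided PDF.

    import pandas as pd

    FAILURE_THRESHOLD_PA = 600   # stated failure criterion for the differential pressure

    # Assumed layout: one CSV per run-to-failure cycle with columns
    # batch_id, time_s, differential_pressure_Pa, flow_rate, rul_at_batch_end
    cycle = pd.read_csv("train_cycle_01.csv")
    for batch_id, batch in cycle.groupby("batch_id"):
        end = batch.iloc[-1]
        print(f"batch {batch_id}: ends at t={end['time_s']} s, "
              f"dp={end['differential_pressure_Pa']} Pa, RUL={end['rul_at_batch_end']}")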

  9. Data from: HRV-ACC: a dataset with R-R intervals and accelerometer data for...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 9, 2023
    Cite
    Kamil Książek; Wilhelm Masarczyk; Przemysław Głomb; Michał Romaszewski; Iga Stokłosa; Piotr Ścisło; Paweł Dębski; Robert Pudlo; Piotr Gorczyca; Magdalena Piegza (2023). HRV-ACC: a dataset with R-R intervals and accelerometer data for the diagnosis of psychotic disorders using a Polar H10 wearable sensor [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8171265
    Explore at:
    Dataset updated
    Aug 9, 2023
    Dataset provided by
    Psychiatric Department of the Multidisciplinary Hospital in Tarnowskie Góry
    Institute of Psychology, Humanitas University in Sosnowiec
    Department of Psychiatry, Faculty of Medical Sciences in Zabrze, Medical University of Silesia
    Institute of Theoretical and Applied Informatics, Polish Academy of Sciences
    Department of Psychoprophylaxis, Faculty of Medical Sciences in Zabrze, Medical University of Silesia
    Authors
    Kamil Książek; Wilhelm Masarczyk; Przemysław Głomb; Michał Romaszewski; Iga Stokłosa; Piotr Ścisło; Paweł Dębski; Robert Pudlo; Piotr Gorczyca; Magdalena Piegza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT

    The issue of diagnosing psychotic diseases, including schizophrenia and bipolar disorder, in particular, the objectification of symptom severity assessment, is still a problem requiring the attention of researchers. Two measures that can be helpful in patient diagnosis are heart rate variability calculated based on electrocardiographic signal and accelerometer mobility data. The following dataset contains data from 30 psychiatric ward patients having schizophrenia or bipolar disorder and 30 healthy persons. The duration of the measurements for individuals was usually between 1.5 and 2 hours. R-R intervals necessary for heart rate variability calculation were collected simultaneously with accelerometer data using a wearable Polar H10 device. The Positive and Negative Syndrome Scale (PANSS) test was performed for each patient participating in the experiment, and its results were attached to the dataset. Furthermore, the code for loading and preprocessing data, as well as for statistical analysis, was included on the corresponding GitHub repository.

    BACKGROUND

    Heart rate variability (HRV), calculated based on electrocardiographic (ECG) recordings of R-R intervals stemming from the heart's electrical activity, may be used as a biomarker of mental illnesses, including schizophrenia and bipolar disorder (BD) [Benjamin et al]. The variations of R-R interval values correspond to the heart's autonomic regulation changes [Berntson et al, Stogios et al]. Moreover, the HRV measure reflects the activity of the sympathetic and parasympathetic parts of the autonomous nervous system (ANS) [Task Force of the European Society of Cardiology the North American Society of Pacing Electrophysiology, Matusik et al]. Patients with psychotic mental disorders show a tendency for a change in the centrally regulated ANS balance in the direction of less dynamic changes in the ANS activity in response to different environmental conditions [Stogios et al]. Larger sympathetic activity relative to the parasympathetic one leads to lower HRV, while, on the other hand, higher parasympathetic activity translates to higher HRV. This loss of dynamic response may be an indicator of mental health. Additional benefits may come from measuring the daily activity of patients using accelerometry. This may be used to register periods of physical activity and inactivity or withdrawal for further correlation with HRV values recorded at the same time.

    EXPERIMENTS

    In our experiment, the participants were 30 psychiatric ward patients with schizophrenia or BD and 30 healthy people. All measurements were performed using a Polar H10 wearable device. The sensor collects ECG recordings and accelerometer data and, additionally, detects R wave peaks. Participants had to wear the sensor for a given time, usually between 1.5 and 2 hours; the shortest recording was 70 minutes. During this time, starting a few minutes after the beginning of the measurement, the evaluated persons could perform any activity. Participants were encouraged to undertake physical activity and, more specifically, to take a walk. Since the patients were in the medical ward, they were instructed at the beginning of the experiment to take a walk in the corridors, and to repeat the walk 30 minutes and 1 hour after the first walk. The subsequent walks were to be slightly longer (about 3, 5 and 7 minutes, respectively). We did not remind participants of this instruction or supervise its execution during the experiment, in either the treatment or the control group. Seven persons from the control group did not receive this instruction; their measurements correspond to freely selected activities with rest periods, although at least three of them performed physical activities during this time. Nevertheless, at the start of the experiment, all participants were requested to rest in a sitting position for 5 minutes. Moreover, for each patient, the disease severity was assessed using the PANSS test and its scores are attached to the dataset.

    The data from the sensors were collected using the Polar Sensor Logger application [Happonen]. The extracted measurements were then preprocessed and analyzed using the code prepared by the authors of the experiment, which is publicly available on the GitHub repository [Książek et al].

    Firstly, we performed a manual artifact detection to remove abnormal heartbeats due to non-sinus beats and technical issues of the device (e.g. temporary disconnections and inappropriate electrode readings). We also performed anomaly detection using Daubechies wavelet transform. Nevertheless, the dataset includes raw data, while a full code necessary to reproduce our anomaly detection approach is available in the repository. Optionally, it is also possible to perform cubic spline data interpolation. After that step, rolling windows of a particular size and time intervals between them are created. Then, a statistical analysis is prepared, e.g. mean HRV calculation using the RMSSD (Root Mean Square of Successive Differences) approach, measuring a relationship between mean HRV and PANSS scores, mobility coefficient calculation based on accelerometer data and verification of dependencies between HRV and mobility scores.

    DATA DESCRIPTION

    The structure of the dataset is as follows. One folder, called HRV_anonymized_data, contains values of R-R intervals together with timestamps for each experiment participant. The data was properly anonymized, i.e. the day of the measurement was removed to prevent person identification. Files concerned with patients have the name treatment_X.csv, where X is the number of the person, while files related to the healthy controls are named control_Y.csv, where Y is the identification number of the person. Furthermore, for visualization purposes, an image of the raw RR intervals for each participant is presented. Its name is raw_RR_{control,treatment}_N.png, where N is the number of the person from the control/treatment group. The collected data are raw, i.e. before the anomaly removal. The code for reproducing the anomaly detection stage and removing suspicious heartbeats is publicly available in the repository [Książek et al]. The structure of consecutive files collecting R-R intervals is as follows:

        Phone timestamp    RR-interval [ms]
        12:43:26.538000    651
        12:43:27.189000    632
        12:43:27.821000    618
        12:43:28.439000    621
        12:43:29.060000    661
        ...                ...

    The first column contains the timestamp for which the distance between two consecutive R peaks was registered. The corresponding R-R interval is presented in the second column of the file and is expressed in milliseconds.
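
    A minimal sketch of an RMSSD calculation from one such file is shown below, assuming the CSV headers match the two columns shown above; the authors' full pipeline, including artifact removal and windowing, lives in main.py in their repository.

    import numpy as np
    import pandas as pd

    rr = pd.read_csv("HRV_anonymized_data/treatment_1.csv")
    intervals_ms = rr["RR-interval [ms]"].to_numpy(dtype=float)
    rmssd = np.sqrt(np.mean(np.diff(intervals_ms) ** 2))   # root mean square of successive differences
    print(f"RMSSD: {rmssd:.1f} ms")
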
    The second folder, called accelerometer_anonymized_data contains values of accelerometer data collected at the same time as R-R intervals. The naming convention is similar to that of the R-R interval data: treatment_X.csv and control_X.csv represent the data coming from the persons from the treatment and control group, respectively, while X is the identification number of the selected participant. The numbers are exactly the same as for R-R intervals. The structure of the files with accelerometer recordings is as follows:

        Phone timestamp    X [mg]  Y [mg]  Z [mg]
        13:00:17.196000    -961    -23     182
        13:00:17.205000    -965    -21     181
        13:00:17.215000    -966    -22     187
        13:00:17.225000    -967    -26     193
        13:00:17.235000    -965    -27     191
        ...                ...     ...     ...

    The first column contains a timestamp, while the next three columns correspond to the currently registered acceleration in the three axes X, Y and Z, in milli-g units.

    We also attached a file with the PANSS test scores (PANSS.csv) for all patients participating in the measurement. The structure of this file is as follows:

        no_of_person  PANSS_P  PANSS_N  PANSS_G  PANSS_total
        1             8        13       22       43
        2             11       7        18       36
        3             14       30       44       88
        4             18       13       27       58
        ...           ...      ...      ...      ...

    The first column contains the identification number of the patient, the next three columns refer to the PANSS scores related to positive, negative and general symptoms, respectively, and the last column gives the total PANSS score.

    USAGE NOTES

    All the files necessary to run the HRV and/or accelerometer data analysis are available on the GitHub repository [Książek et al]. HRV data loading, preprocessing (i.e. anomaly detection and removal), as well as the calculation of mean HRV values in terms of the RMSSD, is performed in the main.py file. Also, Pearson's correlation coefficients between HRV values and PANSS scores and the statistical tests (Levene's and Mann-Whitney U tests) comparing the treatment and control groups are computed. By default, a sensitivity analysis is made, i.e. running the full pipeline for different settings of the window size for which the HRV is calculated and various time intervals between consecutive windows. Preparing the heatmaps of correlation coefficients and corresponding p-values can be done by running the utils_advanced_plots.py file after performing the sensitivity analysis. Furthermore, a detailed analysis for the one selected set of hyperparameters may be prepared (by setting sensitivity_analysis = False), i.e. for 15-minute window sizes, 1-minute time intervals between consecutive windows and without data interpolation method. Also, patients taking quetiapine may be excluded from further calculations by setting exclude_quetiapine = True because this medicine can have a strong impact on HRV [Hattori et al].

    The accelerometer data processing may be performed using the utils_accelerometer.py file. In this case, accelerometer recordings are downsampled to ensure the same timestamps as for R-R intervals and, for each participant, the mobility coefficient is calculated. Then, a correlation between the mobility coefficients and the corresponding HRV values is computed.

  10. Data from: Real data example.

    • plos.figshare.com
    xlsx
    Updated Dec 13, 2024
    Cite
    Jia Wang; Lili Tian; Li Yan (2024). Real data example. [Dataset]. http://doi.org/10.1371/journal.pone.0314705.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jia Wang; Lili Tian; Li Yan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In genomic studies, log transformation is a common preprocessing step to adjust for skewness in data. This standard approach often assumes that log-transformed data are normally distributed, and the two-sample t-test (or its modifications) is used for detecting differences between two experimental conditions. However, it was recently shown that the two-sample t-test can lead to exaggerated false positives, and the Wilcoxon-Mann-Whitney (WMW) test was proposed as an alternative for studies with larger sample sizes. In addition, studies have demonstrated that the specific distribution used in modeling genomic data has a profound impact on the interpretation and validity of results. The aim of this paper is three-fold: 1) to present the Exp-gamma distribution (exponential-gamma distribution, i.e. the distribution of log-transformed gamma data) as a proper biological and statistical model for the analysis of log-transformed protein abundance data from single-cell experiments; 2) to demonstrate the inappropriateness of the two-sample t-test and the WMW test in analyzing log-transformed protein abundance data; 3) to propose and evaluate statistical inference methods for hypothesis testing and confidence interval estimation when comparing two independent samples under Exp-gamma distributions. The proposed methods are applied to analyze protein abundance data from a single-cell dataset.
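
    To illustrate the distributional point only, the sketch below simulates gamma-distributed abundances under assumed shape/scale parameters and applies the standard tests to the log-transformed values; it is not the inference procedure proposed in the paper.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    group1 = np.log(rng.gamma(shape=2.0, scale=1.0, size=50))   # log of gamma data = "Exp-gamma"
    group2 = np.log(rng.gamma(shape=2.0, scale=1.5, size=50))   # assumed parameter values

    print("Shapiro-Wilk p-value (normality of log data):", stats.shapiro(group1).pvalue)
    print("Two-sample t-test p-value:", stats.ttest_ind(group1, group2).pvalue)
    print("Wilcoxon-Mann-Whitney p-value:", stats.mannwhitneyu(group1, group2).pvalue)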

  11. Datasets for Multivariate Time Series Forecasting

    • kaggle.com
    zip
    Updated Aug 29, 2024
    Cite
    Hanwen Hu (2024). Datasets for Multivariate Time Series Forecasting [Dataset]. https://www.kaggle.com/datasets/limpidcloud/datasets-for-multivariate-time-series-forecasting/data
    Explore at:
    Available download formats: zip (348,708,965 bytes)
    Dataset updated
    Aug 29, 2024
    Authors
    Hanwen Hu
    License

    Community Data License Agreement – Sharing 1.0: https://cdla.io/sharing-1-0/

    Description

    Electricity

    This dataset contains the electricity consumption of 370 clients from 2011 to 2014. Each column represents one client. Values are recorded in kW at 15-minute intervals.

    ETT

    The electricity transformer temperature dataset records the loads and oil temperature of 2 electricity transformers at 2 stations from July 2016 to July 2018. Data are collected at 15-minute intervals.

    Exchange

    This dataset is a collection of the daily exchange rates of eight countries, namely Australia, Britain, Canada, Switzerland, China, Japan, New Zealand and Singapore, from 1990 to 2016.

    QPS

    The QPS dataset records the number of queries of ten applications from May to June in 2022. The queries are measured every minute.

    Solar

    This dataset consists of 1 year (2006) of 5-minute solar power (mW) for 137 photovoltaic power plants in Alabama State.

    Traffic

    This dataset describes the hourly road occupancy rates (ranges from 0 to 1) measured by 862 sensors on San Francisco Bay Area freeways from 2015 to 2016.

    Weather

    This dataset contains meteorological observations of WS Beutenberg in Germany from 2004 to 2023. Data points are recorded at 10-minute intervals, comprising a total of 20 variables such as temperature, air pressure, humidity etc.

    Information

    Dataset      Len       Dim       Time  Interval
    Electricity  140256    370       4Y    15 min
    ETT          69680     14 (2*7)  2Y    15 min
    Exchange     7588      8         27Y   1 day
    QPS          30240     10        3W    1 min
    Solar        105120    137       1Y    5 min
    Traffic      17544     862       2Y    1 h
    Weather      1051920   20        20Y   10 min

    Datasets with Unified Intervals

    We also generate datasets with a unified sampling interval of 1 hour based on the original datasets above. The generation methods are listed as follows.

    Dataset      Len     Interval  Method
    Electricity  35064   1 h       Average
    ETT          17420   1 h       from ETTh
    QPS          504     1 h       Summation
    Solar        8760    1 h       Average
    Traffic      17544   1 h       -
    Weather      175320  1 h       Sample

    Thus, the Electricity, ETT, QPS, Solar, Traffic and Weather datasets can be compared more fairly. Prediction lengths of 24, 48, 84 and 180 are recommended for evaluating the performance of time series forecasting models on these datasets.
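
    A hedged pandas sketch of the three unification methods named in the table above (Average, Summation, Sample) is given below; the file and column layout is an assumption, since the archive's internal file names are not listed in this description.

    import pandas as pd

    # Assumed: a datetime-indexed frame with one column per series at the original sampling rate.
    df = pd.read_csv("electricity.csv", index_col=0, parse_dates=True)

    hourly_average = df.resample("60min").mean()    # "Average"   (e.g. Electricity, Solar)
    hourly_sum = df.resample("60min").sum()         # "Summation" (e.g. QPS)
    hourly_sample = df.resample("60min").first()    # "Sample"    (e.g. Weather: keep one reading per hour)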

  12. Counts of Salmonella infection reported in UNITED STATES OF AMERICA:...

    • tycho.pitt.edu
    • data.niaid.nih.gov
    Updated Apr 1, 2018
    Cite
    Willem G Van Panhuis; Anne L Cross; Donald S Burke (2018). Counts of Salmonella infection reported in UNITED STATES OF AMERICA: 1999-2017 [Dataset]. https://www.tycho.pitt.edu/dataset/US.302231008
    Explore at:
    Dataset updated
    Apr 1, 2018
    Dataset provided by
    Project Tycho, University of Pittsburgh
    Authors
    Willem G Van Panhuis; Anne L Cross; Donald S Burke
    Time period covered
    1999 - 2017
    Area covered
    United States
    Description

    Project Tycho datasets contain case counts for reported disease conditions for countries around the world. The Project Tycho data curation team extracts these case counts from various reputable sources, typically from national or international health authorities, such as the US Centers for Disease Control or the World Health Organization. These original data sources include both open- and restricted-access sources. For restricted-access sources, the Project Tycho team has obtained permission for redistribution from data contributors. All datasets contain case count data that are identical to counts published in the original source and no counts have been modified in any way by the Project Tycho team. The Project Tycho team has pre-processed datasets by adding new variables, such as standard disease and location identifiers, that improve data interpretability. We also formatted the data into a standard data format.

    Each Project Tycho dataset contains case counts for a specific condition (e.g. measles) and for a specific country (e.g. The United States). Case counts are reported per time interval. In addition to case counts, datasets include information about these counts (attributes), such as the location, age group, subpopulation, diagnostic certainty, place of acquisition, and the source from which we extracted case counts. One dataset can include many series of case count time intervals, such as "US measles cases as reported by CDC", or "US measles cases reported by WHO", or "US measles cases that originated abroad", etc.

    Depending on the intended use of a dataset, we recommend a few data processing steps before analysis:

    - Analyze missing data: Project Tycho datasets do not include time intervals for which no case count was reported (for many datasets, time series of case counts are incomplete, due to incompleteness of source documents) and users will need to add time intervals for which no count value is available. Project Tycho datasets do include time intervals for which a case count value of zero was reported.
    - Separate cumulative from non-cumulative time interval series: Case count time series in Project Tycho datasets can be "cumulative" or "fixed-intervals". Cumulative case count time series consist of overlapping case count intervals starting on the same date, but ending on different dates. For example, each interval in a cumulative count time series can start on January 1st, but end on January 7th, 14th, 21st, etc. It is common practice among public health agencies to report cases for cumulative time intervals. Case count series with fixed time intervals consist of mutually exclusive time intervals that all start and end on different dates and all have identical length (day, week, month, year). Given the different nature of these two types of case count data, we indicated this with an attribute for each count value, named "PartOfCumulativeCountSeries".
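
    A hedged pandas sketch of these two steps follows; the "PartOfCumulativeCountSeries" attribute is named above, while the other column names and the file name are assumptions based on the standard Project Tycho pre-compiled format.

    import pandas as pd

    df = pd.read_csv("US.302231008.csv", parse_dates=["PeriodStartDate", "PeriodEndDate"])

    # 1) Separate cumulative from fixed-interval count series.
    fixed = df[df["PartOfCumulativeCountSeries"] == 0]
    cumulative = df[df["PartOfCumulativeCountSeries"] == 1]

    # 2) Expose missing weeks in a fixed-interval series: intervals with no report are
    #    absent from the file (not zero), so re-index onto a complete weekly range.
    weekly = (fixed.set_index("PeriodStartDate")["CountValue"]
                   .resample("W").sum(min_count=1))      # NaN marks weeks with no report
    print(weekly.isna().sum(), "weeks without a reported count")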

  13. Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jan 6, 2023
    Cite
    Miha Mohorčič; Aleš Simončič; Mihael Mohorčič; Andrej Hrovat (2023). Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7509279
    Explore at:
    Dataset updated
    Jan 6, 2023
    Authors
    Miha Mohorčič; Aleš Simončič; Mihael Mohorčič; Andrej Hrovat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    The 802.11 standard includes several management features and corresponding frame types. One of them is the Probe Request (PR), which is sent by mobile devices in an unassociated state to scan the nearby area for existing wireless networks. The frame body of a PR consists of variable-length fields, called Information Elements (IEs), which represent the capabilities of a mobile device, such as supported data rates.

    This dataset contains PRs collected over a seven-day period by four gateway devices in an uncontrolled urban environment in the city of Catania.

    It can be used for various use cases, e.g., analyzing MAC randomization, determining the number of people in a given location at a given time or in different time periods, analyzing trends in population movement (streets, shopping malls, etc.) in different time periods, etc.

    Related dataset

    The same authors also produced the Labeled dataset of IEEE 802.11 probe requests, which has the same data layout and recording equipment.

    Measurement setup

    The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle to capture WiFi signal traffic in monitoring mode (gateway device). Passive PR monitoring is performed by listening to 802.11 traffic and filtering out PR packets on a single WiFi channel.

    The following information about each received PR is collected:

    - MAC address
    - supported data rates
    - extended supported rates
    - HT capabilities
    - extended capabilities
    - data under extended tag and vendor specific tag
    - interworking
    - VHT capabilities
    - RSSI
    - SSID
    - timestamp when the PR was received

    The collected data was forwarded to a remote database via a secure VPN connection. A Python script was written using the Pyshark package to collect, preprocess, and transmit the data.

    Data preprocessing

    The gateway collects PRs for each successive predefined scan interval (10 seconds). During this interval, the data is preprocessed before being transmitted to the database. For each detected PR in the scan interval, the IE fields are saved in the following JSON structure:

    PR_IE_data = {
        'DATA_RTS': {'SUPP': DATA_supp, 'EXT': DATA_ext},
        'HT_CAP': DATA_htcap,
        'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
        'VHT_CAP': DATA_vhtcap,
        'INTERWORKING': DATA_inter,
        'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext, ...},
        'VENDOR_SPEC': {
            VENDOR_1: {'ID_1': DATA_1_vendor1, 'ID_2': DATA_2_vendor1, ...},
            VENDOR_2: {'ID_1': DATA_1_vendor2, 'ID_2': DATA_2_vendor2, ...},
            ...
        }
    }

    Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IEs data is represented in hexadecimal format. Vendor Specific Tag is structured differently than the other IEs. This field can contain multiple vendor IDs with multiple data IDs with corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
    Missing IE fields in the captured PR are not included in PR_IE_data.

    When a new MAC address is detected in the current scan time interval, the data from PR is stored in the following structure:

    {'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },

    where PR_data is structured as follows:

    { 'TIME': [ DATA_time ], 'RSSI': [ DATA_rssi ], 'DATA': PR_IE_data }.

    This data structure makes it possible to store only the 'TIME' (time of arrival) and 'RSSI' values for all PRs originating from the same MAC address and containing the same 'PR_IE_data'. All SSIDs from the same MAC address are also stored. The data of a newly detected PR is compared with the already stored data for the same MAC in the current scan time interval. If identical PR IE data from the same MAC address is already stored, only data for the keys 'TIME' and 'RSSI' are appended. If identical PR IE data from the same MAC address has not yet been received, then the PR_data structure of the new PR for that MAC address is appended to the 'PROBE_REQs' key. The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png

    At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, such as the serial number of the wireless gateway and the timestamps for the start and end of the scan. For an example of a single PR capture, see the Single_PR_capture_example.json file.

    Folder structure

    For ease of processing, the dataset is divided into 7 folders, each covering a 24-hour period. Each folder contains four files, one per gateway device, each containing the samples recorded by that device.

    The folders are named after the start and end time (in UTC). For example, the folder 2022-09-22T22-00-00_2022-09-23T22-00-00 contains samples collected from 23 September 2022 00:00 local time until 24 September 2022 00:00 local time.

    The file names map to gateway locations as follows:

    - 1.json -> location 1
    - 2.json -> location 2
    - 3.json -> location 3
    - 4.json -> location 4
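
    As a hedged example of reading one of these files, the sketch below follows the per-MAC JSON layout described above ('MAC', 'SSIDs', 'PROBE_REQs'); whether the top level is a plain list of such records is an assumption, so check Single_PR_capture_example.json for the exact layout.

    import json

    with open("2022-09-22T22-00-00_2022-09-23T22-00-00/1.json") as f:
        records = json.load(f)   # assumed: a list of {'MAC', 'SSIDs', 'PROBE_REQs'} entries

    macs = {rec["MAC"] for rec in records}
    n_payloads = sum(len(rec["PROBE_REQs"]) for rec in records)
    print(f"{len(macs)} distinct (possibly randomized) MAC addresses, {n_payloads} stored PR variants")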

    Environments description

    The measurements were carried out in the city of Catania, in Piazza Università and Piazza del Duomo. The gateway devices (RPis with WiFi dongles) were set up and gathering data before the start time of this dataset. As of September 23, 2022, the devices were placed in their final configuration and personally checked for correct installation and for the data status of the entire data collection system. Devices were connected either to a nearby Ethernet outlet or via WiFi to the access point provided.

    Four Raspberry Pis were used:

    - location 1 -> Piazza del Duomo - Chierici building (balcony near Fontana dell’Amenano)
    - location 2 -> southernmost window in the building of Via Etnea near Piazza del Duomo
    - location 3 -> northernmost window in the building of Via Etnea near Piazza Università
    - location 4 -> first window to the right of the entrance of the University of Catania

    Locations were suggested by the authors and adjusted during deployment based on physical constraints (locations of electrical outlets or internet access). Under ideal circumstances, the locations of the devices and their coverage areas would cover both squares and the part of Via Etnea between them, with a partial overlap of signal detection. The locations of the gateways are shown in Figure ./Figures/catania.png.

    Known dataset shortcomings

    Due to technical and physical limitations, the dataset contains some identified deficiencies.

    PRs are collected and transmitted in 10-second chunks. Due to the limited capabilities of the recording devices, some time (in the range of seconds) may not be accounted for between chunks if the transmission of the previous packet took too long or an unexpected error occurred.

    Every 20 minutes the service is restarted on the recording device. This is a workaround for undefined behaviour of the USB WiFi dongle, which occasionally stops responding. For this reason, up to 20 seconds of data are not recorded in each 20-minute period.

    The devices had a scheduled reboot at 4:00 each day which is shown as missing data of up to a few minutes.

    Location 1 - Piazza del Duomo - Chierici

    The gateway device (RPi) is located on a second-floor balcony and is hardwired to the Ethernet port. This device appears to have functioned stably throughout the data collection period. Its location was constant and undisturbed, and the dataset appears to have complete coverage for this location.

    Location 2 - Via Etnea - Piazza del Duomo

    The device is located inside the building. During working hours (approximately 9:00-17:00), the device was placed on the windowsill. However, the movement of the device cannot be confirmed. As the device was moved back and forth, power outages and internet connection issues occurred. The last three days in the record contain no PRs from this location.

    Location 3 - Via Etnea - Piazza Università

    Similar to Location 2, the device is placed on the windowsill and moved around by people working in the building. Similar behavior is also observed, e.g., it is placed on the windowsill and moved inside a thick wall when no people are present. This device appears to have been collecting data throughout the whole dataset period.

    Location 4 - Piazza Università

    This location is wirelessly connected to the access point. The device was placed statically on a windowsill overlooking the square. Due to physical limitations, the device had lost power several times during the deployment. The internet connection was also interrupted sporadically.

    Recognitions

    The data was collected within the scope of the Resiloc project, with the help of the City of Catania and project partners.

  14. Dataset for: Semiparametric regression on cumulative incidence function with...

    • datasetcatalog.nlm.nih.gov
    • wiley.figshare.com
    Updated Jul 17, 2017
    Cite
    Yu, Menggang; Yiannoutsos, Constantin T; Bakoyannis, Giorgos (2017). Dataset for: Semiparametric regression on cumulative incidence function with interval-censored competing risks data [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001744765
    Explore at:
    Dataset updated
    Jul 17, 2017
    Authors
    Yu, Menggang; Yiannoutsos, Constantin T; Bakoyannis, Giorgos
    Description

    Many biomedical and clinical studies with time-to-event outcomes involve competing risks data. These data are frequently subject to interval censoring, meaning that the failure time is not precisely observed but is only known to lie between two observation times, such as clinical visits in a cohort study. Not taking the interval censoring into account may result in biased estimation of the cause-specific cumulative incidence function, an important quantity in the competing risks framework used for evaluating interventions in populations, for studying the prognosis of various diseases, and for prediction and implementation science purposes. In this work we consider the class of semiparametric generalized odds-rate transformation models in the context of sieve maximum likelihood estimation based on B-splines. This large class of models includes both the proportional odds and the proportional subdistribution hazard models (i.e., the Fine-Gray model) as special cases. The estimator for the regression parameter is shown to be semiparametrically efficient and asymptotically normal. Simulation studies suggest that the method performs well even with small sample sizes. As an illustration, we use the proposed method to analyze data from HIV-infected individuals obtained from a large cohort study in sub-Saharan Africa. We also provide the R function ciregic that implements the proposed method and present an illustrative example.

  15. Emodata_v2

    • kaggle.com
    zip
    Updated Aug 2, 2024
    Cite
    Las HTN (2024). Emodata_v2 [Dataset]. https://www.kaggle.com/datasets/lashtn/emodata-v2
    Explore at:
    Available download formats: zip (138,625,038 bytes)
    Dataset updated
    Aug 2, 2024
    Authors
    Las HTN
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description: emodata

    The emodata dataset is designed to analyze and predict emotions based on numerical labels and pixel data. It is structured to include information about emotion labels, pixel values, and their usage in training and testing. Below is a detailed description of the dataset:

    1. General Information

    • Purpose: Emotion analysis and prediction based on numerical scales and pixel data.
    • Total Samples: 49,400
    • Emotion Labels: Represented as numerical intervals, each corresponding to a specific emotional intensity or category.
    • Pixel Data: Images are represented as pixel intensity values.
    • Data Split:
      • Training set: 82% of the data
      • Testing set: 18% of the data

    2. Emotion Labels

    • The labels are grouped into numerical intervals to categorize emotional intensity or types. Each interval corresponds to the count of samples:
      • 0.00 - 0.30: 6,221 samples
      • 0.90 - 1.20: 6,319 samples
      • 1.80 - 2.10: 6,420 samples
      • 3.00 - 3.30: 8,789 samples
      • 3.90 - 4.20: 7,498 samples
      • 4.80 - 5.10: 7,377 samples
      • 5.70 - 6.00: 6,763 samples
    • Statistical Summary:
      • Mean: 3.1
      • Standard Deviation: 1.94
      • Quantiles:
      • Minimum: 0
      • 25%: 1
      • Median: 3
      • 75%: 5
      • Maximum: 6

    3. Pixel Data

    • Unique Values:
      • Total Unique Values: 34,000
    • Most Common Pixel Intensities: Common pixel intensity values for various samples are listed, indicating grayscale or color representation.
    • Pixel Usage:
      • Training: 82%
      • Testing: 18%

    4. Data Quality

    • Valid Samples: 100% (49.4k samples)
    • Mismatched Samples: 0%
    • Missing Samples: 0%

    5. Usage

    This dataset is particularly suited for:

    • Emotion Classification Tasks: Training machine learning models to classify emotions based on numerical and image data.
    • Deep Learning Tasks: Utilizing pixel intensity data for convolutional neural networks (CNNs) to predict emotional states.
    • Statistical Analysis: Exploring the distribution of emotional intensities and their relationship with image features.

    Potential Applications

    • Sentiment Analysis
    • Emotion Detection in Images
    • Human-Computer Interaction Systems
    • AI-based Feedback Systems

    This dataset provides a comprehensive structure for emotion analysis through a combination of numerical and image data, making it versatile for both machine learning and deep learning applications.
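
    As a quick orientation for the structure described above, the hedged sketch below bins a numeric label column into the documented intervals and reproduces the 82% / 18% split with pandas and scikit-learn. The file name emodata.csv and the column name label are assumptions; the actual schema is not spelled out here.

    ```python
    # Hedged sketch: bin numeric emotion labels into the documented intervals and
    # split 82% / 18% for training and testing. The file name "emodata.csv" and
    # the column name "label" are assumptions, not documented facts.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("emodata.csv")  # hypothetical file name

    # The (non-contiguous) label intervals listed in the description above.
    intervals = pd.IntervalIndex.from_tuples(
        [(0.00, 0.30), (0.90, 1.20), (1.80, 2.10), (3.00, 3.30),
         (3.90, 4.20), (4.80, 5.10), (5.70, 6.00)], closed="both")
    df["label_bin"] = pd.cut(df["label"], bins=intervals)

    print(df["label"].describe())          # should roughly match mean 3.1, std 1.94
    print(df["label_bin"].value_counts())  # sample counts per documented interval

    # 82% / 18% train/test split, as stated in the description.
    train_df, test_df = train_test_split(df, test_size=0.18, random_state=0)
    ```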

  16. IoT_Health_Fitness_Tracking_System

    • kaggle.com
    zip
    Updated Nov 27, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ziya (2024). IoT_Health_Fitness_Tracking_System [Dataset]. https://www.kaggle.com/datasets/ziya07/iot-health-fitness-tracking-system/data
    Explore at:
    zip(34535 bytes)Available download formats
    Dataset updated
    Nov 27, 2024
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset simulates data collected from wearable IoT devices used by college students to monitor and track their health and fitness activities. The dataset contains 1000 rows, with each row representing a unique entry of health and fitness data recorded at random intervals for various activities.

    Columns in the Dataset (a minimal classification sketch follows this list):

    • Device_ID (String): Unique identifier for each wearable device. The devices are randomly assigned from a set of 10 devices (e.g., Device_1, Device_2, etc.). Example: Device_1, Device_5.
    • Timestamp (String, datetime): The exact timestamp when the data was recorded, generated at random intervals (between 10 and 30 minutes apart) starting from November 1, 2024. Example: 2024-11-01 00:00:00, 2024-11-01 00:30:00.
    • Steps (Integer): The number of steps taken by the student during a given time interval, representing physical activity level. Values are randomly generated in the range of 0 to 5000 steps. Example: 200, 3000.
    • Heart_Rate (Integer): The heart rate (in beats per minute) of the student, recorded by the wearable device. Values are randomly generated in the range of 60 to 180 bpm. Example: 75, 145.
    • Calories_Burned (Integer): The estimated number of calories burned by the student during a specific activity period. Values are randomly generated in the range of 50 to 500 calories. Example: 120, 350.
    • Exercise_Duration (Integer): The duration of exercise performed by the student, measured in minutes. Values range from 10 to 60 minutes. Example: 25, 45.
    • Activity_Label (Categorical String; target column): The type of activity performed during the data collection period; it serves as the target column for classification. Example: Walking, Running. Possible values:
      • Sedentary: Periods of inactivity or low movement.
      • Walking: Moderate physical activity like walking.
      • Running: Vigorous physical activity like running.
      • Cycling: Cycling as a form of exercise.
    • Activity_Confidence (Float): The confidence score (between 0.85 and 1.0) assigned to the classification of the activity. This value reflects the accuracy of the activity recognition algorithm. Example: 0.95, 0.88.
    • Temperature (Float): The ambient temperature in degrees Celsius during data collection, randomly generated between 20°C and 30°C. Example: 22.5, 27.3.
    • Location (String): The location where the activity was recorded. Example: Park, Gym. Possible values:
      • Track: A running or cycling track.
      • Classroom: Activity performed indoors, possibly during class breaks.
      • Gym: Fitness activities performed in a gym.
      • Park: Physical activities like walking or running in a park.
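
    The column list above maps directly onto a small supervised-learning setup: numeric sensor readings as features and Activity_Label as the target. The sketch below is a hedged, minimal example of that setup with scikit-learn; the file name iot_health_fitness.csv is an assumption, while the column names are taken from the list above.

    ```python
    # Hedged sketch: baseline activity classification on the documented columns.
    # The file name "iot_health_fitness.csv" is an assumption; the column names
    # are taken from the description above.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    df = pd.read_csv("iot_health_fitness.csv")  # hypothetical file name

    features = ["Steps", "Heart_Rate", "Calories_Burned",
                "Exercise_Duration", "Temperature"]
    X = df[features]
    y = df["Activity_Label"]  # Sedentary / Walking / Running / Cycling

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y)

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))
    ```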

  17. Counts of Influenza reported in UNITED STATES OF AMERICA: 1919-1951

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    json, xml, zip
    Updated Jun 3, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Willem Van Panhuis; Willem Van Panhuis; Anne Cross; Anne Cross; Donald Burke; Donald Burke (2024). Counts of Influenza reported in UNITED STATES OF AMERICA: 1919-1951 [Dataset]. http://doi.org/10.25337/t7/ptycho.v2.0/us.6142004
    Explore at:
    json, xml, zipAvailable download formats
    Dataset updated
    Jun 3, 2024
    Dataset provided by
    Project Tycho
    Authors
    Willem Van Panhuis; Willem Van Panhuis; Anne Cross; Anne Cross; Donald Burke; Donald Burke
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 26, 1919 - Dec 8, 1951
    Area covered
    United States
    Description

    Project Tycho datasets contain case counts for reported disease conditions for countries around the world. The Project Tycho data curation team extracts these case counts from various reputable sources, typically from national or international health authorities, such as the US Centers for Disease Control or the World Health Organization. These original data sources include both open- and restricted-access sources. For restricted-access sources, the Project Tycho team has obtained permission for redistribution from data contributors. All datasets contain case count data that are identical to counts published in the original source and no counts have been modified in any way by the Project Tycho team. The Project Tycho team has pre-processed datasets by adding new variables, such as standard disease and location identifiers, that improve data interpretability. We also formatted the data into a standard data format.

    Each Project Tycho dataset contains case counts for a specific condition (e.g. measles) and for a specific country (e.g. The United States). Case counts are reported per time interval. In addition to case counts, datasets include information about these counts (attributes), such as the location, age group, subpopulation, diagnostic certainty, place of acquisition, and the source from which we extracted case counts. One dataset can include many series of case count time intervals, such as "US measles cases as reported by CDC", or "US measles cases reported by WHO", or "US measles cases that originated abroad", etc.

    Depending on the intended use of a dataset, we recommend a few data processing steps before analysis:

    • Analyze missing data: Project Tycho datasets do not include time intervals for which no case count was reported (for many datasets, time series of case counts are incomplete, due to incompleteness of source documents) and users will need to add time intervals for which no count value is available. Project Tycho datasets do include time intervals for which a case count value of zero was reported.
    • Separate cumulative from non-cumulative time interval series. Case count time series in Project Tycho datasets can be "cumulative" or "fixed-intervals". Cumulative case count time series consist of overlapping case count intervals starting on the same date, but ending on different dates. For example, each interval in a cumulative count time series can start on January 1st, but end on January 7th, 14th, 21st, etc. It is common practice among public health agencies to report cases for cumulative time intervals. Case count series with fixed time intervals consist of mutually exclusive time intervals that all start and end on different dates and all have identical length (day, week, month, year). Given the different nature of these two types of case count data, we indicated this with an attribute for each count value, named "PartOfCumulativeCountSeries". A minimal sketch of both steps follows this list.
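
    The hedged sketch below illustrates both recommendations: it keeps only non-cumulative series using the PartOfCumulativeCountSeries attribute, then reindexes onto a complete weekly grid so that unreported weeks appear as missing rather than being silently dropped. Only PartOfCumulativeCountSeries is documented above; the file name and the PeriodStartDate and CountValue column names are assumptions about the CSV layout.

    ```python
    # Hedged sketch of the two recommended pre-processing steps. Only the
    # "PartOfCumulativeCountSeries" attribute is documented above; the file name
    # and the "PeriodStartDate" / "CountValue" column names are assumptions.
    import pandas as pd

    df = pd.read_csv("US.6142004.csv", parse_dates=["PeriodStartDate"])  # hypothetical file name

    # Step 2: keep only fixed-interval (non-cumulative) count series.
    fixed = df[df["PartOfCumulativeCountSeries"] == 0].copy()

    # Step 1: make unreported weeks explicit. Aggregating over all sub-series for
    # simplicity, reindex onto a complete weekly grid so that weeks with no report
    # become NaN (distinct from reported zero counts).
    weekly = (fixed.set_index("PeriodStartDate")["CountValue"]
                   .resample("W").sum(min_count=1))
    print(weekly.isna().sum(), "weeks with no reported count")
    ```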

  18. Counts of Listeriosis reported in UNITED STATES OF AMERICA: 2000-2005

    • tycho.pitt.edu
    • data.niaid.nih.gov
    Updated Apr 1, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Willem G Van Panhuis; Anne L Cross; Donald S Burke (2018). Counts of Listeriosis reported in UNITED STATES OF AMERICA: 2000-2005 [Dataset]. https://www.tycho.pitt.edu/dataset/US.4241002
    Explore at:
    Dataset updated
    Apr 1, 2018
    Dataset provided by
    Project Tycho, University of Pittsburgh
    Authors
    Willem G Van Panhuis; Anne L Cross; Donald S Burke
    Time period covered
    2000 - 2005
    Area covered
    United States
    Description

    Project Tycho datasets contain case counts for reported disease conditions for countries around the world. The Project Tycho data curation team extracts these case counts from various reputable sources, typically from national or international health authorities, such as the US Centers for Disease Control or the World Health Organization. These original data sources include both open- and restricted-access sources. For restricted-access sources, the Project Tycho team has obtained permission for redistribution from data contributors. All datasets contain case count data that are identical to counts published in the original source and no counts have been modified in any way by the Project Tycho team. The Project Tycho team has pre-processed datasets by adding new variables, such as standard disease and location identifiers, that improve data interpretability. We also formatted the data into a standard data format.

    Each Project Tycho dataset contains case counts for a specific condition (e.g. measles) and for a specific country (e.g. The United States). Case counts are reported per time interval. In addition to case counts, datasets include information about these counts (attributes), such as the location, age group, subpopulation, diagnostic certainty, place of acquisition, and the source from which we extracted case counts. One dataset can include many series of case count time intervals, such as "US measles cases as reported by CDC", or "US measles cases reported by WHO", or "US measles cases that originated abroad", etc.

    Depending on the intended use of a dataset, we recommend a few data processing steps before analysis:

    • Analyze missing data: Project Tycho datasets do not include time intervals for which no case count was reported (for many datasets, time series of case counts are incomplete, due to incompleteness of source documents) and users will need to add time intervals for which no count value is available. Project Tycho datasets do include time intervals for which a case count value of zero was reported.
    • Separate cumulative from non-cumulative time interval series. Case count time series in Project Tycho datasets can be "cumulative" or "fixed-intervals". Cumulative case count time series consist of overlapping case count intervals starting on the same date, but ending on different dates. For example, each interval in a cumulative count time series can start on January 1st, but end on January 7th, 14th, 21st, etc. It is common practice among public health agencies to report cases for cumulative time intervals. Case count series with fixed time intervals consist of mutually exclusive time intervals that all start and end on different dates and all have identical length (day, week, month, year). Given the different nature of these two types of case count data, we indicated this with an attribute for each count value, named "PartOfCumulativeCountSeries".

  19. intraday trading information for IBM stock

    • kaggle.com
    zip
    Updated Nov 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dhruvil Patel (2024). intraday trading information for IBM stock [Dataset]. https://www.kaggle.com/datasets/dhruvil633/intraday-trading-information-for-ibm-stock
    Explore at:
    zip(1207 bytes)Available download formats
    Dataset updated
    Nov 12, 2024
    Authors
    Dhruvil Patel
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    IBM Intraday Stock Price Data (5-Minute Intervals)

    This dataset provides comprehensive intraday trading data for IBM stock at 5-minute intervals, capturing essential price and volume metrics for each trading session. It is ideal for short-term trading analysis, pattern recognition, and intraday trend forecasting.

    Dataset Overview

    Each row in the dataset represents IBM's stock information for a specific 5-minute interval, including:

    • Timestamp: The exact time (Eastern Time) for each data entry.
    • Open: The stock price at the beginning of the interval.
    • High: The highest price within the interval.
    • Low: The lowest price within the interval.
    • Close: The stock price at the end of the interval.
    • Volume: The number of shares traded within the interval.

    Potential Uses

    This dataset is well-suited for various financial and quantitative analysis projects, such as:

    • Volume and Price Movement Analysis: Identify periods with unusually high trading volume and investigate whether they correspond with significant price changes or market events.
    • Intraday Trend Analysis: Observe trends by plotting the closing prices over time to spot patterns in stock performance during a single trading day or across multiple days.
    • Volatility Detection: Track intervals with a large difference between the high and low prices to detect periods of increased price volatility.
    • Time-Series Forecasting: Use machine learning models to predict price movements based on historical intraday data and patterns.

    Example Analysis Ideas

    • Visualize Price Movements: Plot open, high, low, and close prices over time to get a clear view of price trends and fluctuations.
    • Analyze Volume Spikes: Find and investigate timestamps with high trading volume, which might indicate significant market activity.
    • Apply Machine Learning: Use techniques such as LSTM, ARIMA, or other time-series forecasting models to predict short-term price movements.

    This dataset is especially valuable for traders, quantitative analysts, and developers building financial models or applications that require real-time market insights. A minimal pandas sketch of the volatility and volume-spike ideas follows.
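
    The sketch below is a hedged example of those two ideas: it computes a per-interval range (High minus Low) as a volatility proxy and flags volume spikes with pandas. The file name ibm_intraday.csv is an assumption, while the column names come from the Dataset Overview above.

    ```python
    # Hedged sketch: per-interval range (a simple volatility proxy) and volume-spike
    # detection on the documented OHLCV columns. The file name "ibm_intraday.csv"
    # is an assumption; column names follow the Dataset Overview above.
    import pandas as pd

    df = pd.read_csv("ibm_intraday.csv", parse_dates=["Timestamp"])  # hypothetical file name
    df = df.sort_values("Timestamp").set_index("Timestamp")

    # Volatility proxy: high-low range within each 5-minute interval.
    df["range"] = df["High"] - df["Low"]

    # Volume spikes: intervals whose volume exceeds the mean by 2+ standard deviations.
    threshold = df["Volume"].mean() + 2 * df["Volume"].std()
    spikes = df[df["Volume"] > threshold]

    print(df["range"].describe())
    print(spikes[["Open", "Close", "Volume"]])
    ```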

    About the Data

    This dataset was obtained via the Alpha Vantage API, using their TIME_SERIES_INTRADAY function. The data here represents IBM's intraday stock price movements on November 11, 2024, at 5-minute intervals.

  20. Wetland Paleoecological Study of Coastal Louisiana: Sediment Cores and...

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Oct 22, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2025). Wetland Paleoecological Study of Coastal Louisiana: Sediment Cores and Diatom Samples Dataset [Dataset]. https://catalog.data.gov/dataset/wetland-paleoecological-study-of-coastal-louisiana-sediment-cores-and-diatom-samples-datas
    Explore at:
    Dataset updated
    Oct 22, 2025
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Area covered
    Louisiana
    Description

    Wetland sediment data was collected from coastal Louisiana as part of a pilot study to develop a diatom-based proxy for past wetland water chemistry and the identification of sediment deposits from tropical storms. The complete dataset includes forty-six surface sediment samples and nine sediment cores. The surface sediment samples were collected in fresh to brackish marsh throughout the southwest Louisiana Chenier Plain and are located coincident with the Coastwide Reference Monitoring System (CRMS). Sediment cores were collected at Rockefeller Wildlife Refuge. The data described here include sedimentary properties, radioisotopes, x-radiographs, and diatom species counts for depth-interval samples of sediment cores.
