Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The p-value is a likelihood ratio p-value and thus identical for both comparison measures. The numbers needed to treat (NNT) were based on the estimated risk difference.
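As a hedged illustration of the last sentence: the number needed to treat is conventionally the reciprocal of the absolute risk difference. The sketch below assumes two hypothetical event risks; the function name and the example values are illustrative and not taken from the source.

```python
def number_needed_to_treat(risk_control: float, risk_treated: float) -> float:
    """Return NNT = 1 / |risk difference| (the conventional definition)."""
    risk_difference = risk_control - risk_treated
    if risk_difference == 0:
        # With no risk difference the NNT is unbounded.
        return float("inf")
    return 1.0 / abs(risk_difference)


# Hypothetical risks: 30% of controls and 20% of treated patients have the event.
nnt = number_needed_to_treat(0.30, 0.20)
print(round(nnt, 1))
```

In practice the NNT is usually rounded up to the next whole patient when reported.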
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is a small dataset of various global indicators developed for use in a course teaching research methods at the Croft Institute for International Studies at the University of Mississippi. The data is ready to be directly imported into SPSS, Stata, or other statistical packages. A brief codebook includes descriptions of each variable, the indicator's reference year(s), and links to the original sources. The data is cross-sectional, country-level data centered on 2015 as the primary reference year. Some data come from the most recent election or averages from a handful of years. The dataset includes socioeconomic and political data drawn from sources and indicators from the World Bank, the UNDP, and International IDEA. It also includes popular indexes (and some key components) from Freedom House, Polity IV, the Economist's Democracy Index, the Heritage Foundation's Index of Economic Freedom, and the Fund for Peace's Fragile States Index. The dataset also includes various types of data (nominal, ordinal, interval, and ratio), useful for pedagogical examples of how to handle statistical data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Developing a confidence interval for the ratio of two quantities is an important task in statistics because of its omnipresence in real world applications. For such a problem, the MOVER-R (method of variance recovery for the ratio) technique, which is based on the recovery of variance estimates from confidence limits of the numerator and the denominator separately, was proposed as a useful and efficient approach. However, this method implicitly assumes that the confidence interval for the denominator never includes zero, which might be violated in practice. In this article, we first use a new framework to derive the MOVER-R confidence interval, which does not require the above assumption and covers the whole parameter space. We find that MOVER-R can produce an unbounded confidence interval, just like the well-known Fieller method. To overcome this issue, we further propose the penalized MOVER-R. We prove that the new method differs from MOVER-R only at the second order. It, however, always gives a bounded and analytic confidence interval. Through simulation studies and a real data application, we show that the penalized MOVER-R generally provides a better confidence interval than MOVER-R in terms of controlling the coverage probability and the median width.
This is the data set behind the Wind Generation Interactive Query Tool created by the CEC. The visualization tool interactively displays wind generation over different time intervals in three-dimensional space. The viewer can look across the state to understand generation patterns of regions with concentrations of wind power plants. The tool aids in understanding high and low periods of generation. Operation of the electric grid requires that generation and demand are balanced in each period. The height and color of columns at wind generation areas are scaled and shaded to represent capacity factors (CFs) of the areas in a specific time interval. Capacity factor is the ratio of the energy produced to the amount of energy that could ideally have been produced in the same period using the rated nameplate capacity. Due to natural variations in wind speeds, higher factors tend to be seen over short time periods, with lower factors over longer periods. The capacity used is the reported nameplate capacity from the Quarterly Fuel and Energy Report, CEC-1304A. CFs are based on wind plants in service in the wind generation areas.
Renewable energy resources like wind facilities vary in size and geographic distribution within each state. Resource planning, land use constraints, climate zones, and weather patterns limit availability of these resources and where they can be developed. National, state, and local policies also set limits on energy generation and use. An example of resource planning in California is the Desert Renewable Energy Conservation Plan. By exploring the visualization, a viewer can gain a three-dimensional understanding of temporal variation in generation CFs, along with how the wind generation areas compare to one another. The viewer can observe that areas peak in generation in different periods. The large range in CFs is also visible.
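The capacity factor definition given above can be sketched as a small calculation. This is a minimal illustration, assuming energy is measured in MWh and nameplate capacity in MW; the function name, units, and example numbers are assumptions and do not come from the CEC data set.

```python
def capacity_factor(energy_mwh: float, nameplate_mw: float, hours: float) -> float:
    """Energy produced divided by the energy the rated nameplate capacity
    could ideally have produced over the same period."""
    if nameplate_mw <= 0 or hours <= 0:
        raise ValueError("nameplate capacity and period length must be positive")
    return energy_mwh / (nameplate_mw * hours)


# Hypothetical example: a 100 MW wind area producing 360 MWh over 24 hours.
cf = capacity_factor(energy_mwh=360.0, nameplate_mw=100.0, hours=24.0)
print(f"{cf:.2f}")
```

Because the denominator grows with the length of the period, a short windy interval can show a much higher factor than a monthly or annual figure for the same area.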
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Growing evidence suggests that intervention for smoking cessation enhances alcohol abstinence in treatment settings for alcohol dependence. However, research in this field is rare in Asians.
Method: We prospectively investigated the association of smoking status with drinking status using 9 surveys mailed during a 12-month period in 198 Japanese alcohol-dependent men (70 never/ex-smokers and 128 smokers) who were admitted for the first time and completed a 3-month inpatient program for simultaneous alcohol abstinence and smoking cessation.
Results: Nonsmoking during the first month after discharge and at the end of follow-up was reported in 28.9% and 25.0% of the baseline smokers, respectively. Kaplan-Meier estimates showed that 12-month alcohol abstinence and heavy-drinking-free status were more frequent among never/ex-smokers (45.1% and 59.8%, respectively) and baseline smokers who quit smoking during the first month after discharge (59.0% and 60.8%, respectively), compared with sustained smokers (30.0% and 41.2%, respectively). Among the baseline smokers, the multivariate odds ratios (95% confidence interval) for smoking cessation during the first month were 2.77 (1.01–7.61) for alcohol abstinence during the period and 2.50 (1.00–6.25) for use of varenicline, a smoking cessation agent, during the inpatient program. After adjusting for age, drinking profile, lifestyle, family history of heavy or problem drinking, lifetime episodes of other major psychiatric disorders, and medications at discharge, the multivariate hazard ratios (HRs) for drinking lapse were 0.57 (0.37–0.89) for the never/ex-smoking and 0.41 (0.23–0.75) for new smoking cessation groups, respectively, compared with sustained smoking, while the corresponding HRs for heavy-drinking lapse were 0.55 (0.33–0.90) and 0.47 (0.25–0.88), respectively. The HR for drinking lapse was 0.63 (0.42–0.95) for the nonsmoking group (vs. smoking) during the observation period, while the HR for heavy-drinking lapse was 0.58 (0.37–0.91) for the nonsmoking group (vs. smoking) during the observation period. Other significant variables that worsened drinking outcomes were higher daily alcohol intake prior to hospitalization, family history of heavy or problem drinking, and psychiatric medications at discharge.
Conclusion: Nonsmoking was associated with better outcomes on the drinking status of Japanese alcohol-dependent men, and a smoking cessation program may be recommended to be integrated into alcohol abstinence programs.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a supporting dataset for "Intermodel comparison of the atmospheric composition changes due to emissions from a potential future supersonic aircraft fleet" (https://doi.org/10.5194/acp-25-2515-2025). This dataset contains all data necessary to reproduce the results of the publication.
In this work, four state-of-the-science atmospheric chemistry transport models (EMAC, GEOS-Chem, LMDZ-INCA, MOZART-3) are used to evaluate the effects of three supersonic emission scenarios on a 2050 atmosphere. The future atmosphere and emissions are based on the SSP3-7.0 scenario.
For each of the models, this dataset includes the volume mixing ratio (vmr) and mass distribution of several key species through the atmosphere during the last years of model integration. The LMDZ-INCA and GEOS-Chem data span a 3-year interval, whereas 6 years of data are provided for EMAC, and the MOZART-3 data are already annually averaged. We refer to the associated publication for more information regarding these timescales.
The dataset contains atmospheric histories for 4 different emission scenarios, denoted as A0 to A3. The A0 scenario is a baseline scenario with no supersonic aviation (only subsonic). In the A1 scenario, part of the subsonic civil aviation is replaced by supersonic aircraft operating at Mach 2.0. Scenario A2 is a variant of A1 with triple the emission of nitrogen oxides from the supersonic aircraft, and scenario A3 instead considers the partial replacement of subsonic civil aviation with Mach 1.6 supersonic aircraft at a lower cruise altitude. For detailed descriptions of the scenarios we refer to the associated publication.
The files are named using the convention (MODEL)_(SCENARIO)_(VARIABLE).nc4. For example, the file GEOS-Chem_A1_H2O_mass.nc4 contains H2O mass distributions for the A1 scenario from the GEOS-Chem model. The following variables are included in this dataset:
For the MOZART-3 model some data is provided as differences instead, where the SCENARIO entry of (A1-A0) indicates a file with data of the difference between the respective fields of the A1 and A0 scenarios.
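The naming convention above can be unpacked with a few lines of Python. This is a minimal sketch, not part of the dataset's tooling: the `parse_filename` helper and the `FileParts` type are hypothetical, it assumes model names never contain an underscore (true for the four models listed), and it assumes a difference file may write its SCENARIO entry either as A1-A0 or as (A1-A0). The second file name in the usage lines is invented for illustration.

```python
from typing import NamedTuple


class FileParts(NamedTuple):
    model: str
    scenario: str
    variable: str
    is_difference: bool


def parse_filename(name: str) -> FileParts:
    """Split (MODEL)_(SCENARIO)_(VARIABLE).nc4 into its three parts."""
    suffix = ".nc4"
    if not name.endswith(suffix):
        raise ValueError(f"not an .nc4 file: {name}")
    stem = name[: -len(suffix)]
    # Split on the first two underscores only: the variable part
    # (e.g. H2O_mass) may itself contain underscores.
    parts = stem.split("_", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected file name: {name}")
    model, scenario, variable = parts
    scenario = scenario.strip("()")  # assumption: parentheses are optional
    return FileParts(model, scenario, variable, "-" in scenario)


print(parse_filename("GEOS-Chem_A1_H2O_mass.nc4"))
print(parse_filename("MOZART-3_(A1-A0)_H2O_mass.nc4"))  # hypothetical name
```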
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SDCC Traffic Congestion Saturation Flow Data for January to June 2023. Traffic volumes, traffic saturation, and congestion data for sites across South Dublin County. Used by traffic management to control stage timings on junctions. It is recommended that this dataset is read in conjunction with the ‘Traffic Data Site Names SDCC’ dataset.
A detailed description of each column heading is given below:
scn: Site serial number.
region: A group of nodes that are operated under SCOOT control at the same common cycle time. Normally these will be nodes between which co-ordination is desirable. Some of the nodes may be double cycling at half of the region cycle time.
system: SCOOT STC UTC (UTC-MX).
locn: Location.
ssite: Site number.
sday: Days of the week, Monday to Sunday. Abbreviations: MO, TU, WE, TH, FR, SA, SU.
date: The actual date on which the data was collected.
start_time: NOTE - Please ignore the date displayed in this column. The actual data collection date is correctly displayed in the 'date' column. The date displayed here is the date on which the report was run and extracted from the system, but the column correctly reflects the start time of the 15-minute intervals.
end_time: End time of the 15-minute intervals.
flow: A representation of demand (flow) for each link built up over several minutes by the SCOOT model. SCOOT has two profiles: (1) Short: raw data representing the actual values over the previous few minutes; (2) Long: a smoothed average of values over a longer period. SCOOT will choose to use the appropriate profile depending on a number of factors.
flow_pc: Same as above ref PC SCOOT.
cong: Congestion is directly measured from the detector. If the detector is placed beyond the normal end of queue in the street it is rarely covered by stationary traffic, except of course when congestion occurs. If any detector shows standing traffic for the whole of an interval this is recorded. The number of intervals of congestion in any cycle is also recorded. The percentage congestion is calculated from: (No. of congested intervals x 4 x 100) / cycle time in seconds. This percentage of congestion is available to view and, more importantly, for the optimisers to take into account.
cong_pc: Same as above ref PC SCOOT.
dsat: The ratio of the demand flow to the maximum possible discharge flow, i.e. it is the ratio of the demand to the discharge rate (Saturation Occupancy) multiplied by the duration of the effective green time. The Split optimiser will try to minimise the maximum degree of saturation on links approaching the node.
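The percentage congestion formula in the cong description can be sketched as follows. This is a minimal illustration, assuming the flattened formula reads as the number of congested intervals times 4 times 100, divided by the cycle time in seconds (which implies 4-second detection intervals). The function name and the example values are hypothetical.

```python
def percent_congestion(congested_intervals: int, cycle_time_s: float) -> float:
    """Percentage of a signal cycle during which the detector showed
    standing traffic, assuming each recorded interval lasts 4 seconds."""
    if cycle_time_s <= 0:
        raise ValueError("cycle time must be positive")
    return congested_intervals * 4 * 100 / cycle_time_s


# Hypothetical example: 6 congested intervals in a 120-second cycle.
print(percent_congestion(6, 120))
```

Six congested intervals cover 24 seconds, which is one fifth of a 120-second cycle.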
Table from the American Community Survey (ACS) 5-year series on poverty and employment status related topics for City of Seattle Council Districts, Comprehensive Plan Growth Areas and Community Reporting Areas. Table includes B23025 Employment Status for the Population 16 years and over, B23024 Poverty Status by Disability Status by Employment Status for the Population 20 to 64 years, B17010 Poverty Status of Families by Family Type by Presence of Related Children under 18 years, C17002 Ratio of Income to Poverty Level in the Past 12 Months. Data is pulled from block group tables for the most recent ACS vintage and summarized to the neighborhoods based on block group assignment.
Table created for and used in the Neighborhood Profiles application.
Vintages: 2023
ACS Table(s): B23025, B23024, B17010, C17002
Data downloaded from: Census Bureau's Explore Census Data
The United States Census Bureau's American Community Survey (ACS): About the Survey; Geography & ACS; Technical Documentation; News & Updates
This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the Census and ACS when using this data.
Data Note from the Census: Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.
Data Processing Notes:
Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2020 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).
The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico.
Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).
Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.
Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.
Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.
The estimate is controlled. A statistical test for sampling variability is not appropriate.
The data for this geographic area cannot be displayed because the number of sample cases is too small.
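The Census data note on margins of error translates into a short calculation. The sketch below assumes the published margin of error is at the 90 percent level, as stated in the note, and uses the factor 1.645 that the Census Bureau documents for converting a 90 percent margin of error into a standard error. The function names and the example figures are hypothetical.

```python
Z_90 = 1.645  # factor for 90 percent ACS margins of error


def confidence_bounds(estimate: float, moe: float) -> tuple:
    """Lower and upper 90 percent confidence bounds: estimate -/+ margin of error."""
    return (estimate - moe, estimate + moe)


def standard_error(moe: float) -> float:
    """Recover the standard error from a 90 percent margin of error."""
    return moe / Z_90


# Hypothetical estimate of 1,200 employed residents with a margin of error of 150.
low, high = confidence_bounds(1200, 150)
print(low, high)
print(round(standard_error(150), 1))
```

Null margins of error (the negative sentinel values described above) should be filtered out before applying either function.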
The data come from an empirical laboratory study of banana weevil movements. The data set 'records.txt' contains detection records of individuals moving between 2 patches, from which the population dynamics of the weevils can be determined. The data sets 'alldata5.txt', 'alldata10.txt', 'alldata15.txt', 'alldata20.txt', 'alldata25.txt', 'alldata30.txt', and 'alldata60.txt' contain the individual responses to environmental conditions (sex ratio, population density) when the 'records.txt' data set is analysed for a time interval of 5, 10, 15, 20, 25, 30, and 60 minutes, respectively. The data set 'id.txt' contains the individual characteristics (sex, identification number). The data set 'Starting_Dates.txt' contains the experimental conditions of the repetitions.
The American Community Survey Education Tabulation (ACS-ED) is a custom tabulation of the ACS produced for the National Center for Education Statistics (NCES) by the U.S. Census Bureau. The ACS-ED provides a rich collection of social, economic, demographic, and housing characteristics for school systems, school-age children, and the parents of school-age children. In addition to focusing on school-age children, the ACS-ED provides enrollment iterations for children enrolled in public school. The data profiles include percentages (along with associated margins of error) that allow for comparison of school district-level conditions across the U.S. For more information about the NCES ACS-ED collection, visit the NCES Education Demographic and Geographic Estimates (EDGE) program at: https://nces.ed.gov/programs/edge/Demographic/ACS
Annotation values are negative value representations of estimates and have values when non-integer information needs to be represented. See the table below for a list of common Estimate/Margin of Error (E/M) values and their corresponding Annotation (EA/MA) values.
All information contained in this file is in the public domain. Data users are advised to review NCES program documentation and feature class metadata to understand the limitations and appropriate use of these data.
-9: A '-9' entry in the estimate and margin of error columns indicates that data for this geographic area cannot be displayed because the number of sample cases is too small.
-8: A '-8' means that the estimate is not applicable or not available.
-6: A '-6' entry in the estimate column indicates that either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
-5: A '-5' entry in the margin of error column indicates that the estimate is controlled. A statistical test for sampling variability is not appropriate.
-3: A '-3' entry in the margin of error column indicates that the median falls in the lowest interval or upper interval of an open-ended distribution. A statistical test is not appropriate.
-2: A '-2' entry in the margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
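When loading these tables, the annotation codes listed above need to be treated as missing data rather than as numbers. The following is a minimal sketch of one way to do that. The dictionary keys are the codes from the list above, while the shortened meanings, the `clean_value` helper, and the sample row are illustrative assumptions. It also assumes that no genuine estimate in the column takes one of these negative values.

```python
from typing import Optional

# Annotation codes from the list above, with shortened meanings.
ANNOTATION_MEANINGS = {
    -9: "too few sample cases to display",
    -8: "not applicable or not available",
    -6: "estimate could not be computed",
    -5: "estimate is controlled",
    -3: "median falls in an open-ended interval",
    -2: "margin of error could not be computed",
}


def clean_value(value: float) -> Optional[float]:
    """Return None for annotation codes, otherwise the value unchanged."""
    return None if value in ANNOTATION_MEANINGS else value


# Hypothetical row of estimates containing two annotation codes.
row = [12.4, -9, 55.0, -6]
print([clean_value(v) for v in row])
```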
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Suboptimal utilization of antiretroviral therapy (ART) services remains a problem among adolescents in low- and middle-income countries, which has a negative impact on their response to treatment and increases the risk of developing resistance. Optimal use is essential to enhancing treatment efficacy. We investigated the optimal use of ART service and predictors among adolescents living with HIV (ALHIV) in northern Uganda.
Methods: We used a cross-sectional study design to collect quantitative data from 293 ALHIV at three health facilities in Lira municipality, northern Uganda. We used an interviewer-administered questionnaire and data abstraction form. Data were analysed using SPSS version 23 software. Descriptive analysis and logistic regressions were performed to determine the relationship between the predictor and outcome variables. Statistical significance was determined at P-value
Protein expression varies as a result of intricate regulation of synthesis and degradation of messenger RNAs (mRNA) and proteins. Studies of dynamic regulation typically rely on time-course data sets of mRNA and protein expression, yet there are no statistical methods that integrate these multiomics data and deconvolute individual regulatory processes of gene expression control underlying the observed concentration changes. To address this challenge, we developed Protein Expression Control Analysis (PECA), a method to quantitatively dissect protein expression variation into the contributions of mRNA synthesis/degradation and protein synthesis/degradation, termed RNA-level and protein-level regulation respectively. PECA computes the rate ratios of synthesis versus degradation as the statistical summary of expression control during a given time interval at each molecular level and computes the probability that the rate ratio changed between adjacent time intervals, indicating regulation change at the time point. Along with the associated false-discovery rates, PECA gives the complete description of dynamic expression control, that is, which proteins were up- or down-regulated at each molecular level and each time point. Using PECA, we analyzed two yeast data sets monitoring the cellular response to hyperosmotic and oxidative stress. The rate ratio profiles reported by PECA highlighted a large magnitude of RNA-level up-regulation of stress response genes in the early response and concordant protein-level regulation with time delay. However, the contributions of RNA- and protein-level regulation and their temporal patterns were different between the two data sets. We also observed several cases where protein-level regulation counterbalanced transcriptomic changes in the early stress response to maintain the stability of protein concentrations, suggesting that proteostasis is a proteome-wide phenomenon mediated by post-transcriptional regulation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CESNET-TimeSeries24: The dataset for network traffic forecasting and anomaly detection
The dataset called CESNET-TimeSeries24 was collected by long-term monitoring of selected statistical metrics for 40 weeks for each IP address on the ISP network CESNET3 (Czech Education and Science Network). The dataset encompasses network traffic from more than 275,000 active IP addresses, assigned to a wide variety of devices, including office computers, NATs, servers, WiFi routers, honeypots, and video-game consoles found in dormitories. Moreover, the dataset is also rich in network anomaly types since it contains all types of anomalies, ensuring a comprehensive evaluation of anomaly detection methods.
Last but not least, the CESNET-TimeSeries24 dataset provides traffic time series on institutional and IP subnet levels to cover all possible anomaly detection or forecasting scopes. Overall, the time series dataset was created from the 66 billion IP flows that contain 4 trillion packets that carry approximately 3.7 petabytes of data. The CESNET-TimeSeries24 dataset is a complex real-world dataset that will finally bring insights into the evaluation of forecasting models in real-world environments.
Please cite the usage of our dataset as:
Koumar, J., Hynek, K., Čejka, T. et al. CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting. Sci Data 12, 338 (2025). https://doi.org/10.1038/s41597-025-04603-x
@Article{cesnettimeseries24,
  author={Koumar, Josef and Hynek, Karel and {\v{C}}ejka, Tom{\'a}{\v{s}} and {\v{S}}i{\v{s}}ka, Pavel},
  title={CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting},
  journal={Scientific Data},
  year={2025},
  month={Feb},
  day={26},
  volume={12},
  number={1},
  pages={338},
  issn={2052-4463},
  doi={10.1038/s41597-025-04603-x},
  url={https://doi.org/10.1038/s41597-025-04603-x}
}
Time series
We create evenly spaced time series for each IP address by aggregating IP flow records into time series datapoints. The created datapoints represent the behavior of IP addresses within a defined time window of 10 minutes. The vector of time-series metrics v_{ip, i} describes the IP address ip in the i-th time window. Thus, IP flows for vector v_{ip, i} are captured in time windows starting at t_i and ending at t_{i+1}. The time series are built from these datapoints.
Datapoints created by the aggregation of IP flows contain the following time-series metrics:
Simple volumetric metrics: the number of IP flows, the number of packets, and the transmitted data size (i.e. number of bytes)
Unique volumetric metrics: the number of unique destination IP addresses, the number of unique destination Autonomous System Numbers (ASNs), and the number of unique destination transport layer ports. The aggregation of unique volumetric metrics is memory intensive since all unique values must be stored in an array. We used a server with 41 GB of RAM, which was enough for 10-minute aggregation on the ISP network.
Ratio metrics: the ratio of UDP/TCP packets, the ratio of UDP/TCP transmitted data size, the direction ratio of packets, and the direction ratio of transmitted data size
Average metrics: the average flow duration, and the average Time To Live (TTL)
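The aggregation of flow records into 10-minute datapoints described above can be sketched as follows. This is not the authors' pipeline. The `Flow` record and its fields are hypothetical, each flow is assigned to a window by its start time only, windows are keyed by source IP address, and the output keys `avg_duration` and `avg_ttl` are illustrative names. The ratio metrics are left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass

WINDOW_S = 600  # 10-minute aggregation window, in seconds


@dataclass
class Flow:  # hypothetical flow record, not the dataset's schema
    src_ip: str
    dst_ip: str
    dst_asn: int
    dst_port: int
    start_s: float
    duration_s: float
    packets: int
    n_bytes: int
    ttl: int


def aggregate(flows, t0=0.0):
    """Group flows by (source IP, window index i) and compute the
    volumetric, unique, and average metrics for each datapoint."""
    windows = defaultdict(list)
    for f in flows:
        i = int((f.start_s - t0) // WINDOW_S)  # window [t_i, t_{i+1})
        windows[(f.src_ip, i)].append(f)
    result = {}
    for key, group in windows.items():
        result[key] = {
            "n_flows": len(group),
            "n_packets": sum(f.packets for f in group),
            "n_bytes": sum(f.n_bytes for f in group),
            # Unique counts need every distinct value kept in memory.
            "n_dest_ip": len({f.dst_ip for f in group}),
            "n_dest_asn": len({f.dst_asn for f in group}),
            "n_dest_port": len({f.dst_port for f in group}),
            "avg_duration": sum(f.duration_s for f in group) / len(group),
            "avg_ttl": sum(f.ttl for f in group) / len(group),
        }
    return result
```

The sets built for the unique metrics are the reason this step is memory intensive at ISP scale, as noted in the list above.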
Multiple time aggregation: The original datapoints in the dataset are aggregated by 10 minutes of network traffic. The size of the aggregation interval influences anomaly detection procedures, mainly the training speed of the detection model. However, the 10-minute intervals can be too short for longitudinal anomaly detection methods. Therefore, we added two more aggregation intervals to the datasets--1 hour and 1 day.
Time series of institutions: We identify 283 institutions inside the CESNET3 network. These time series, aggregated per institution ID, provide a view of the institution's data.
Time series of institutional subnets: We identify 548 institution subnets inside the CESNET3 network. These time series, aggregated per institution ID, provide a view of the institution subnet's data.
Data Records
The file hierarchy is described below:
cesnet-timeseries24/
|- institution_subnets/
| |- agg_10_minutes/.csv
| |- agg_1_hour/.csv
| |- agg_1_day/.csv
| |- identifiers.csv
|- institutions/
| |- agg_10_minutes/.csv
| |- agg_1_hour/.csv
| |- agg_1_day/.csv
| |- identifiers.csv
|- ip_addresses_full/
| |- agg_10_minutes//.csv
| |- agg_1_hour//.csv
| |- agg_1_day//.csv
| |- identifiers.csv
|- ip_addresses_sample/
| |- agg_10_minutes/.csv
| |- agg_1_hour/.csv
| |- agg_1_day/.csv
| |- identifiers.csv
|- times/
| |- times_10_minutes.csv
| |- times_1_hour.csv
| |- times_1_day.csv
|- ids_relationship.csv
|- weekends_and_holidays.csv
The following list describes time series data fields in CSV files:
id_time: Unique identifier for each aggregation interval within the time series, used to segment the dataset into specific time periods for analysis.
n_flows: Total number of flows observed in the aggregation interval, indicating the volume of distinct sessions or connections for the IP address.
n_packets: Total number of packets transmitted during the aggregation interval, reflecting the packet-level traffic volume for the IP address.
n_bytes: Total number of bytes transmitted during the aggregation interval, representing the data volume for the IP address.
n_dest_ip: Number of unique destination IP addresses contacted by the IP address during the aggregation interval, showing the diversity of endpoints reached.
n_dest_asn: Number of unique destination Autonomous System Numbers (ASNs) contacted by the IP address during the aggregation interval, indicating the diversity of networks reached.
n_dest_port: Number of unique destination transport layer ports contacted by the IP address during the aggregation interval, representing the variety of services accessed.
tcp_udp_ratio_packets: Ratio of packets sent using TCP versus UDP by the IP address during the aggregation interval, providing insight into the transport protocol usage pattern. This metric belongs to the interval <0, 1> where 1 is when all packets are sent over TCP, and 0 is when all packets are sent over UDP.
tcp_udp_ratio_bytes: Ratio of bytes sent using TCP versus UDP by the IP address during the aggregation interval, highlighting the data volume distribution between protocols. This metric belongs to the interval <0, 1> with the same rule as tcp_udp_ratio_packets.
dir_ratio_packets: Ratio of packet directions (inbound versus outbound) for the IP address during the aggregation interval, indicating the balance of traffic flow directions. This metric belongs to the interval <0, 1>, where 1 is when all packets are sent in the outgoing direction from the monitored IP address, and 0 is when all packets are sent in the incoming direction to the monitored IP address.
dir_ratio_bytes: Ratio of byte directions (inbound versus outbound) for the IP address during the aggregation interval, showing the data volume distribution in traffic flows. This metric belongs to the interval <0, 1> with the same rule as dir_ratio_packets.
avg_duration: Average duration of IP flows for the IP address during the aggregation interval, measuring the typical session length.
avg_ttl: Average Time To Live (TTL) of IP flows for the IP address during the aggregation interval, providing insight into the lifespan of packets.
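The four ratio fields above are bounded to <0, 1>. The following is a minimal sketch of how such a value could be derived from raw counters, assuming the ratio is the TCP (or outgoing) share of the total. The text only fixes the two endpoints, so the exact formula, the function name `share`, and the counter values are assumptions made for illustration.

```python
from typing import Optional


def share(numerator: float, other: float) -> Optional[float]:
    """Return numerator / (numerator + other), or None when both are zero."""
    total = numerator + other
    if total == 0:
        return None  # no traffic in the interval, so the ratio is undefined
    return numerator / total


# Hypothetical counters for one aggregation interval (not taken from the dataset).
tcp_packets, udp_packets = 750, 250
out_packets, in_packets = 400, 600

tcp_udp_ratio_packets = share(tcp_packets, udp_packets)  # 1.0 would mean all TCP
dir_ratio_packets = share(out_packets, in_packets)       # 1.0 would mean all outgoing
print(tcp_udp_ratio_packets, dir_ratio_packets)
```

How the dataset itself encodes intervals without any traffic is not stated here, so the `None` return value is only one possible convention.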
Moreover, the time series created by re-aggregation contain the following metrics instead of n_dest_ip, n_dest_asn, and n_dest_port:
sum_n_dest_ip: Sum of numbers of unique destination IP addresses.
avg_n_dest_ip: The average number of unique destination IP addresses.
std_n_dest_ip: Standard deviation of numbers of unique destination IP addresses.
sum_n_dest_asn: Sum of numbers of unique destination ASNs.
avg_n_dest_asn: The average number of unique destination ASNs.
std_n_dest_asn: Standard deviation of numbers of unique destination ASNs.
sum_n_dest_port: Sum of numbers of unique destination transport layer ports.
avg_n_dest_port: The average number of unique destination transport layer ports.
std_n_dest_port: Standard deviation of numbers of unique destination transport layer ports.
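A count of unique destinations over one hour cannot be recovered from six 10-minute unique counts, because the same destination may appear in several intervals. That is a likely reason the re-aggregated series carry sum, average, and standard deviation instead. The sketch below illustrates this kind of re-aggregation under stated assumptions: `id_time` is taken to be a consecutive integer index of 10-minute intervals, six intervals form one hour, and the population standard deviation is used. None of these details is confirmed by the description, and the helper name `reaggregate` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def reaggregate(rows, factor=6):
    """Group consecutive 10-minute rows into coarser bins of `factor` intervals."""
    bins = defaultdict(list)
    for row in rows:
        bins[row["id_time"] // factor].append(row)

    result = []
    for bin_id in sorted(bins):
        group = bins[bin_id]
        dest = [r["n_dest_ip"] for r in group]
        result.append({
            "id_time": bin_id,
            # Plain counters can simply be summed.
            "n_flows": sum(r["n_flows"] for r in group),
            # Unique counts are summarised instead of merged.
            "sum_n_dest_ip": sum(dest),
            "avg_n_dest_ip": mean(dest),
            "std_n_dest_ip": pstdev(dest),  # population std (assumption)
        })
    return result


# Synthetic 10-minute rows covering two hours, for illustration only.
rows = [{"id_time": t, "n_flows": 10, "n_dest_ip": t % 3} for t in range(12)]
hourly = reaggregate(rows)
print(hourly[0])
```

The same pattern would apply to n_dest_asn and n_dest_port, and a factor of 144 would give daily bins under the same assumptions.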
Moreover, the identifiers.csv file in each dataset type contains the IDs of the time series that are present in that dataset. Furthermore, the ids_relationship.csv file contains the relationships between IP addresses, institutions, and institution subnets. The weekends_and_holidays.csv file contains information about non-working days in the Czech Republic.
The data comprise attributes derived from ECG signals recorded for different individuals, each with a different heart rate at the time of measurement. These features contribute to the heart rate of the individual at a given instant.
You have been provided with a total of 7 CSV files with the names as follows:
time_domain_features_train.csv - This file contains all time domain features of heart rate for training data
frequency_domain_features_train.csv - This file contains all frequency domain features of heart rate for training data
heart_rate_non_linear_features_train.csv - This file contains all non linear features of heart rate for training data
time_domain_features_test.csv - This file contains all time domain features of heart rate for testing data
frequency_domain_features_test.csv - This file contains all frequency domain features of heart rate for testing data
heart_rate_non_linear_features_test.csv - This file contains all non linear features of heart rate for testing data
sample_submission.csv - This file contains the format in which you need to make submissions to the portal
Following is the data dictionary for the features you will come across in the files mentioned:
MEAN_RR - Mean of RR intervals
MEDIAN_RR - Median of RR intervals
SDRR - Standard deviation of RR intervals
RMSSD - Root mean square of successive RR interval differences
SDSD - Standard deviation of successive RR interval differences
SDRR_RMSSD - Ratio of SDRR / RMSSD
pNN25 - Percentage of successive RR intervals that differ by more than 25 ms
pNN50 - Percentage of successive RR intervals that differ by more than 50 ms
KURT - Kurtosis of distribution of successive RR intervals
SKEW - Skew of distribution of successive RR intervals
MEAN_REL_RR - Mean of relative RR intervals
MEDIAN_REL_RR - Median of relative RR intervals
SDRR_REL_RR - Standard deviation of relative RR intervals
RMSSD_REL_RR - Root mean square of successive relative RR interval differences
SDSD_REL_RR - Standard deviation of successive relative RR interval differences
SDRR_RMSSD_REL_RR - Ratio of SDRR/RMSSD for relative RR interval differences
KURT_REL_RR - Kurtosis of distribution of relative RR intervals
SKEW_REL_RR - Skewness of distribution of relative RR intervals
uuid - Unique ID for each patient
VLF - Absolute power of the very low frequency band (0.0033 - 0.04 Hz)
VLF_PCT - Principal component transform of VLF
LF - Absolute power of the low frequency band (0.04 - 0.15 Hz)
LF_PCT - Principal component transform of LF
LF_NU - Absolute power of the low frequency band in normal units
HF - Absolute power of the high frequency band (0.15 - 0.4 Hz)
HF_PCT - Principal component transform of HF
HF_NU - Absolute power of the highest frequency band in normal units
TP - Total power of RR intervals
LF_HF - Ratio of LF to HF
HF_LF - Ratio of HF to LF
SD1 - Poincaré plot standard deviation perpendicular to the line of identity
SD2 - Poincaré plot standard deviation along the line of identity
Sampen - Sample entropy, which measures the regularity and complexity of a time series
higuci - Higuchi fractal dimension of heart rate
datasetId - ID of the whole dataset
condition - Condition of the patient at the time the data was recorded
HR - Heart rate of the patient at the time the data was recorded
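Several of the time-domain entries in the dictionary follow directly from their definitions. The sketch below computes a few of them from a short list of RR intervals in milliseconds. It is an illustration only: the RR values are made up, the function name `time_domain_features` is hypothetical, and the choice of the sample standard deviation is an assumption, since the description does not say how the dataset's values were computed.

```python
from math import sqrt
from statistics import mean, median, stdev


def time_domain_features(rr_ms):
    """Compute a few time-domain features from RR intervals in milliseconds."""
    # Successive differences between neighbouring RR intervals.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    rmssd = sqrt(mean([d * d for d in diffs]))
    sdrr = stdev(rr_ms)  # sample standard deviation (assumption)
    return {
        "MEAN_RR": mean(rr_ms),
        "MEDIAN_RR": median(rr_ms),
        "SDRR": sdrr,
        "RMSSD": rmssd,
        "SDSD": stdev(diffs),
        "SDRR_RMSSD": sdrr / rmssd,
        # Share of successive differences larger than 25 ms and 50 ms.
        "pNN25": 100 * sum(abs(d) > 25 for d in diffs) / len(diffs),
        "pNN50": 100 * sum(abs(d) > 50 for d in diffs) / len(diffs),
    }


rr = [800, 810, 790, 860, 850]  # made-up RR intervals, for illustration only
features = time_domain_features(rr)
print(features)
```

The function needs at least three intervals with some variation between them, otherwise the standard deviations or the SDRR/RMSSD ratio are undefined.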
Objective
The objective is to build a regression model that predicts the heart rate of an individual. This prediction can help to monitor the individual's stress levels.
Reference: Great Learning
The American Community Survey Education Tabulation (ACS-ED) is a custom tabulation of the ACS produced for the National Center for Education Statistics (NCES) by the U.S. Census Bureau. The ACS-ED provides a rich collection of social, economic, demographic, and housing characteristics for school systems, school-age children, and the parents of school-age children. In addition to focusing on school-age children, the ACS-ED provides enrollment iterations for children enrolled in public school. The data profiles include percentages (along with associated margins of error) that allow for comparison of school district-level conditions across the U.S. For more information about the NCES ACS-ED collection, visit the NCES Education Demographic and Geographic Estimates (EDGE) program at: https://nces.ed.gov/programs/edge/Demographic/ACS
Annotation values are negative value representations of estimates and have values when non-integer information needs to be represented. See the table below for a list of common Estimate/Margin of Error (E/M) values and their corresponding Annotation (EA/MA) values.
All information contained in this file is in the public domain. Data users are advised to review NCES program documentation and feature class metadata to understand the limitations and appropriate use of these data.
-9: A '-9' entry in the estimate and margin of error columns indicates that data for this geographic area cannot be displayed because the number of sample cases is too small.
-8: A '-8' means that the estimate is not applicable or not available.
-6: A '-6' entry in the estimate column indicates that either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
-5: A '-5' entry in the margin of error column indicates that the estimate is controlled. A statistical test for sampling variability is not appropriate.
-3: A '-3' entry in the margin of error column indicates that the median falls in the lowest interval or upper interval of an open-ended distribution. A statistical test is not appropriate.
-2: A '-2' entry in the margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
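Because these codes sit in the same columns as real estimates and margins of error, they should be separated out before any arithmetic. The following is a minimal sketch of one way to do that. The shortened annotation texts, the `ANNOTATIONS` mapping, and the `decode` helper are illustrative assumptions and not part of the NCES files.

```python
# Shortened paraphrases of the annotation meanings listed above.
ANNOTATIONS = {
    -9: "not displayed: too few sample cases",
    -8: "not applicable or not available",
    -6: "estimate could not be computed",
    -5: "estimate is controlled; no test for sampling variability",
    -3: "median in lowest or upper interval of an open-ended distribution",
    -2: "margin of error could not be computed",
}


def decode(value):
    """Split a raw cell into (numeric value or None, annotation or None)."""
    if value in ANNOTATIONS:
        return None, ANNOTATIONS[value]
    return value, None


# Hypothetical cells from an estimate column, for illustration only.
raw_estimates = [12.5, -9, 48.0, -6]
decoded = [decode(v) for v in raw_estimates]
usable = [value for value, note in decoded if note is None]
print(usable)
```

This sketch treats every listed code the same way in every column. A stricter version could apply -6 only to estimate columns and -5, -3, and -2 only to margin of error columns, as the descriptions above suggest.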
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CityPropStats provides aggregated property statistics for 795 cities and towns (i.e., Metropolitan and Micropolitan statistical areas) in the conterminous United States. These statistics include sum, mean, median, Gini index and entropy of residential floor space, cadastral parcel size, floor-area ratio, and property value, approximately for the reference year 2020, aggregated by building construction year in decadal steps (cumulative and incremental) from 1910 to 2020.
Cumulative statistics: CBSA_Property_Statistics_1910-2020_cumulative.csv
Decadal time slices statistics: CBSA_Property_Statistics_1910-2020_decadal_slices.csv
Data source: Zillow Transaction and Assessment Dataset (ZTRAX), provided to University of Colorado Boulder via a data share agreement (2016-2023).
CityPropStats is a supplementary dataset to:
Ortman, Scott G., Amy Bogaard, Jessica Munson, Dan Lawrence, Adam S. Green, Gary M. Feinman, Shadreck Chirikure, Johannes H. Uhl, and Stefan Leyk. "Changes in agglomeration and productivity are poor predictors of inequality across the archaeological record." Proceedings of the National Academy of Sciences 122, no. 16 (2025): e2400693122. https://doi.org/10.1073/pnas.2400693122
Column description:
cbsa_id: CBSA GEOID
cbsa_name: Full name
cbsa_type: CBSA type (metro vs micropolitan statistical area)
year_from: Earliest year for selection interval of properties based on their construction year
year_to: Latest year for selection interval of properties based on their construction year
cbsa_pop: CBSA population or population change (US Census)
tot_res_props: Total residential properties
tot_res_area_sqkm: Total indoor area of residential properties in sqkm
avg_res_area_sqm: Average indoor area of residential properties in sqm
median_res_area_sqm: Median indoor area of residential properties in sqm
q25_res_area_sqm: 25th percentile of indoor area of residential properties in sqm
q75_res_area_sqm: 75th percentile of indoor area of residential properties in sqm
gini_res_area: Gini index of residential property indoor area
tot_prop_value_usd: Total residential property value in USD
median_prop_value_usd: Median residential property value in USD
q25_prop_value_usd: 25th percentile of residential property values in USD
q75_prop_value_usd: 75th percentile of residential property values in USD
gini_prop_value: Gini index of residential property values
tot_lot_area_sqkm: Total lot (cadastral parcel) area in sqkm
avg_lot_area_sqm: Mean lot area in sqm
median_lot_area_sqm: Median lot area in sqm
q25_lot_area_sqm: 25th percentile of lot area in sqm
q75_lot_area_sqm: 75th percentile of lot area in sqm
gini_lot_area: Gini index of lot area
avg_far: Mean floor-area-ratio (FAR), with FAR being the ratio of building indoor area and lot area, based on residential properties
median_far: Median floor-area-ratio (FAR), with FAR being the ratio of building indoor area and lot area, based on residential properties
q25_far: 25th percentile of floor-area-ratio (FAR), with FAR being the ratio of building indoor area and lot area, based on residential properties
q75_far: 75th percentile of floor-area-ratio (FAR), with FAR being the ratio of building indoor area and lot area, based on residential properties
entropy_res_area: Shannon entropy of the indoor area of residential properties, based on properties
entropy_prop_value: Shannon entropy of the property value of residential properties, based on properties
entropy_lot_area: Shannon entropy of the lot size of residential properties, based on properties
area_completeness: Ratio of properties with a valid indoor area attribute [0,1]
value_completeness: Ratio of properties with a valid property value attribute [0,1]
lotsize_completeness: Ratio of properties with a valid lot size attribute [0,1]
area_value_completeness: Ratio of properties with both a valid indoor area and property value attribute [0,1]
area_value_lotsize_completeness: Ratio of properties with a valid indoor area, property value, and lot size attribute [0,1]
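Two of the quantities behind these columns, the Gini index and the floor-area ratio, are easy to illustrate on a handful of values. The sketch below uses the standard rank-based Gini formula for a sorted sample. Whether the dataset applies exactly this estimator is not stated, so the formula choice, the function names, and the sample values are assumptions.

```python
def gini(values):
    """Gini index of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values, with ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def floor_area_ratio(indoor_area_sqm, lot_area_sqm):
    """FAR: building indoor area divided by lot area."""
    return indoor_area_sqm / lot_area_sqm


# Made-up indoor areas in sqm, for illustration only.
equal_homes = [120, 120, 120, 120]
unequal_homes = [0, 0, 0, 480]
print(gini(equal_homes), gini(unequal_homes))
print(floor_area_ratio(150, 600))
```

With this estimator the maximum value for n properties is (n - 1) / n, which is why four properties with all floor space in one of them give 0.75 rather than 1.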
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SDCC Traffic Congestion Saturation Flow Data for January to June 2022. Traffic volumes, traffic saturation, and congestion data for sites across South Dublin County. Used by traffic management to control stage timings on junctions. It is recommended that this dataset is read in conjunction with the ‘Traffic Data Site Names SDCC’ dataset.
A detailed description of each column heading can be referenced below:
scn: Site Serial number
region: A group of Nodes that are operated under SCOOT control at the same common cycle time. Normally these will be nodes between which co-ordination is desirable. Some of the nodes may be double cycling at half of the region cycle time.
system: SCOOT STC UTC (UTC-MX)
locn: Location
ssite: Site number
sday: Days of the week Monday to Sunday. Abbreviations: MO, TU, WE, TH, FR, SA, SU.
date: Reflects correct actual Date of when data was collected.
start_time: NOTE - Please ignore the date displayed in this column. The actual data collection date is correctly displayed in the 'date' column. The date displayed here is the date of when report was run and extracted from the system, but correctly reflects start time of 15 minute intervals.
end_time: End time of 15 minute intervals.
flow: A representation of demand (flow) for each link built up over several minutes by the SCOOT model. SCOOT has two profiles: (1) Short – Raw data representing the actual values over the previous few minutes. (2) Long – A smoothed average of values over a longer period. SCOOT will choose to use the appropriate profile depending on a number of factors.
flow_pc: Same as above ref PC SCOOT
cong: Congestion is directly measured from the detector. If the detector is placed beyond the normal end of queue in the street it is rarely covered by stationary traffic, except of course when congestion occurs. If any detector shows standing traffic for the whole of an interval this is recorded. The number of intervals of congestion in any cycle is also recorded. The percentage congestion is calculated from: (No. of congested intervals x 4 x 100) / cycle time in seconds. This percentage of congestion is available to view and more importantly for the optimisers to take into account.
cong_pc: Same as above ref PC SCOOT
dsat: The ratio of the demand flow to the maximum possible discharge flow, i.e. it is the ratio of the demand to the discharge rate (Saturation Occupancy) multiplied by the duration of the effective green time. The Split optimiser will try to minimise the maximum degree of saturation on links approaching the node.
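The percentage congestion and degree of saturation described above can be worked through with small numbers. The sketch below assumes the congestion formula reads as the number of congested intervals times 4 times 100, divided by the cycle time in seconds, and that the factor 4 is the interval length in seconds. Both readings are assumptions, as are the function names and the example values.

```python
INTERVAL_SECONDS = 4  # the factor 4 in the formula, read here as interval length (assumption)


def congestion_percent(congested_intervals, cycle_time_s):
    """Percentage of the cycle during which the detector showed standing traffic."""
    if cycle_time_s <= 0:
        raise ValueError("cycle time must be positive")
    return congested_intervals * INTERVAL_SECONDS * 100 / cycle_time_s


def degree_of_saturation(demand_flow, max_discharge_flow):
    """Ratio of demand flow to the maximum possible discharge flow."""
    if max_discharge_flow <= 0:
        raise ValueError("maximum discharge flow must be positive")
    return demand_flow / max_discharge_flow


# Hypothetical values: 6 congested intervals in a 120-second cycle.
print(congestion_percent(6, 120))
# Hypothetical values: demand of 450 against a maximum discharge of 600.
print(degree_of_saturation(450, 600))
```

Under these assumptions, 6 congested intervals cover 24 seconds of a 120-second cycle, which gives 20 percent congestion.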
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Spatial correlation raises challenges in estimating confidence intervals for region-specific event rates and rate ratios between geographic units that are nested. Methods have been proposed to incorporate spatial correlation by assuming various distributions for the structure of autocorrelation patterns. However, the derivation of these statistics based on approximation may have to condition on the distributional assumption underlying the data generating process, which may not hold in certain situations. This paper explores the feasibility of utilizing a Bayesian convolution model (BCM), which includes an uncorrelated heterogeneity (UH) and a conditional autoregression (CAR) component to accommodate both uncorrelated and correlated spatial heterogeneity, to estimate the 95% confidence intervals for age-adjusted rate ratios among geographic regions with existing spatial correlations. A simulation study is conducted and a BCM method is applied to two cancer incidence datasets to calculate age-adjusted rate ratios for the counties in the State of Kentucky relative to the entire state. In comparison to three existing methods, without and with spatial correlation, the Bayesian convolution model-based estimation provides a moderate shrinkage effect for the point estimates based on the neighbor structure across regions and produces a wider interval due to the inclusion of uncertainty in the spatial autocorrelation parameters. The overall spatial pattern of regional incidence rates from the BCM approach appears similar to that of the direct estimates and other methods for both datasets, even though “smoothing” occurs in some local regions. The Bayesian Convolution Model allows flexibility in the specification of risk components and can improve the accuracy of interval estimates of age-adjusted rate ratios among geographical regions as it considers spatial correlation.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The p-value is a likelihood ratio p-value and thus identical for both comparison measures. The numbers needed to treat (NNT) were based on the estimated risk difference.