35 datasets found
  1. ECMWF ERA5: 10 ensemble member surface level analysis parameter data

    • data-search.nerc.ac.uk
    • catalogue.ceda.ac.uk
    Updated Dec 8, 2023
    Cite
    (2023). ECMWF ERA5: 10 ensemble member surface level analysis parameter data [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=ensemble%20runs
    Dataset updated
    Dec 8, 2023
    Description

    This dataset contains ERA5 surface level analysis parameter data from 10 ensemble runs. ERA5 is the 5th generation reanalysis project from the European Centre for Medium-Range Weather Forecasts (ECMWF) - see linked documentation for further details. The ensemble members were used to derive means and spread data (see linked datasets). Ensemble means and spreads were calculated from the ERA5t 10 member ensemble, run at a reduced resolution compared with the single high resolution (hourly output at 31 km grid spacing) 'HRES' realisation, for which these data have been produced to provide an uncertainty estimate. This dataset contains a limited selection of all available variables, which have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the Copernicus Data Store (CDS) data tool, linked to from this record.

    Note that ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10 members in the ensemble (i.e., including the control). It is not the sample standard deviation, and was thus calculated by dividing by 10 rather than 9 (N-1). See linked datasets for ensemble member and ensemble mean data.

    The ERA5 global atmospheric reanalysis covers 1979 to 2 months behind the present month. This follows on from the ERA-15, ERA-40 and ERA-Interim reanalysis projects. An initial release of ERA5 data (ERA5t) is made roughly 5 days behind the present date. These data are subsequently reviewed ahead of being released by ECMWF as quality-assured data within 3 months. CEDA holds a 6 month rolling copy of the latest ERA5t data. See related datasets linked to from this record.

    However, for the period 2000-2006 the initial ERA5 release was found to suffer from stratospheric temperature biases, so new runs were performed to address this issue, resulting in the ERA5.1 release (see linked datasets). Note, though, that Simmons et al. 2020 (technical memo 859) report that "ERA5.1 is very close to ERA5 in the lower and middle troposphere." Users of data from this period should nevertheless read technical memo 859 for further details.
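    The spread convention described in this record (dividing by N = 10 rather than N - 1 = 9) can be illustrated with a minimal sketch. The function name and the sample values are hypothetical and are not taken from the dataset; the sketch only contrasts the population and sample standard deviations for one grid point.

```python
from statistics import fmean, pstdev, stdev


def ensemble_mean_and_spread(member_values):
    """Return (mean, spread) for the values of all ensemble members at one grid point.

    The spread is the population standard deviation (divide by N),
    matching the convention stated in the record, not the sample
    standard deviation (divide by N - 1).
    """
    return fmean(member_values), pstdev(member_values)


# Hypothetical values for a 10-member ensemble at a single grid point.
values = [float(i) for i in range(10)]
mean, spread = ensemble_mean_and_spread(values)

# For comparison only: the sample standard deviation is slightly larger.
sample_sd = stdev(values)
```

    With only 10 members the two conventions differ by a factor of sqrt(10/9), about 5 %, which is worth keeping in mind when comparing the spread with other uncertainty estimates.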

  2. ECMWF ERA5t: ensemble spreads of surface level analysis parameter data

    • data-search.nerc.ac.uk
    • catalogue.ceda.ac.uk
    Updated Jul 28, 2021
    Cite
    (2021). ECMWF ERA5t: ensemble spreads of surface level analysis parameter data [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?format=Data%20are%20netCDF%20formatted%20with%20internal%20compression.
    Dataset updated
    Jul 28, 2021
    Description

    This dataset contains ensemble spreads for the ERA5 initial release (ERA5t) surface level analysis parameter data ensemble means (see linked dataset). ERA5t is the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 reanalysis project initial release, available up to 5 days behind the present date. CEDA will maintain a 6 month rolling archive of these data with overlap to the verified ERA5 data - see linked datasets on this record. The ensemble means and spreads are calculated from the ERA5t 10 member ensemble, run at a reduced resolution compared with the single high resolution (hourly output at 31 km grid spacing) 'HRES' realisation, for which these data have been produced to provide an uncertainty estimate. This dataset contains a limited selection of all available variables, which have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the Copernicus Data Store (CDS) data tool, linked to from this record.

    Note that ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10 members in the ensemble (i.e., including the control). It is not the sample standard deviation, and was thus calculated by dividing by 10 rather than 9 (N-1). See linked datasets for ensemble member and ensemble mean data.

    The ERA5 global atmospheric reanalysis covers 1979 to 2 months behind the present month. This follows on from the ERA-15, ERA-40 and ERA-Interim reanalysis projects. An initial release of ERA5 data (ERA5t) is made roughly 5 days behind the present date. These data are subsequently reviewed and, if required, amended before the full ERA5 release. CEDA holds a 6 month rolling copy of the latest ERA5t data. See related datasets linked to from this record.

  3. ERA5 hourly data on pressure levels from 1940 to present

    • cds.climate.copernicus.eu
    grib
    Updated Jun 9, 2025
    Cite
    ECMWF (2025). ERA5 hourly data on pressure levels from 1940 to present [Dataset]. http://doi.org/10.24381/cds.bd0915c6
    Available download formats: grib
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    European Centre for Medium-Range Weather Forecasts (http://ecmwf.int/)
    Authors
    ECMWF
    License

    https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/licence-to-use-copernicus-products/licence-to-use-copernicus-products_b4b9451f54cffa16ecef5c912c9cebd6979925a956e3fa677976e0cf198c2c18.pdf

    Time period covered
    Jan 1, 1940 - Jun 3, 2025
    Description

    ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations and, when going further back in time, to allow for the ingestion of improved versions of the original observations, all of which benefit the quality of the reanalysis product.

    ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system, which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread.

    ERA5 is updated daily with a latency of about 5 days. If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later. If this occurs, users are notified.

    The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines. Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper air fields) and single levels (atmospheric, ocean-wave and land surface quantities). The present entry is "ERA5 hourly data on pressure levels from 1940 to present".
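    To give a sense of what the stated grid spacings mean for array sizes, the sketch below computes the shape of a global regular lat-lon grid. It assumes the grid includes both poles and does not repeat the wrap-around longitude; the record does not state this explicitly, and the function name is hypothetical.

```python
def regular_grid_shape(step_deg):
    """Return (n_lat, n_lon) for a global regular lat-lon grid.

    Assumption: latitudes run from 90 to -90 inclusive (both poles),
    and longitudes cover 360 degrees without repeating the first column.
    """
    n_lat = int(round(180 / step_deg)) + 1
    n_lon = int(round(360 / step_deg))
    return n_lat, n_lon


reanalysis_shape = regular_grid_shape(0.25)   # reanalysis fields
uncertainty_shape = regular_grid_shape(0.5)   # ensemble mean and spread
```

    Under these assumptions the uncertainty fields hold roughly a quarter as many grid points per time step as the reanalysis fields.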

  4. NCEP GEFS Mean Spread West Atlantic Forecast Products Imagery

    • data.ucar.edu
    image
    Updated Dec 26, 2024
    Cite
    National Centers for Environmental Prediction (NCEP) (2024). NCEP GEFS Mean Spread West Atlantic Forecast Products Imagery [Dataset]. http://doi.org/10.26023/PZ24-3M8N-TA0G
    Available download formats: image
    Dataset updated
    Dec 26, 2024
    Dataset provided by
    University Corporation for Atmospheric Research
    Authors
    National Centers for Environmental Prediction (NCEP)
    Time period covered
    Jun 15, 2011 - Aug 1, 2011
    Description

    This dataset contains gif images from the National Weather Service - National Centers for Environmental Prediction (NCEP) Global Ensemble Forecast System (GEFS) Mean Spread West Atlantic forecasts during the Ice in Clouds Experiment - Tropical (ICE-T) project. Note: There are no data available for 20110715-20110722.

  5. Data from: Data and code from: Environmental influences on drying rate of...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated May 31, 2024
    Cite
    Agricultural Research Service (2024). Data and code from: Environmental influences on drying rate of spray applied disinfestants from horticultural production services [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-environmental-influences-on-drying-rate-of-spray-applied-disinfestants-
    Dataset updated
    May 31, 2024
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript: Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.

    Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials, such as wood and concrete, were quickly absorbed into the body of the material. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min.

    Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created.

    These files are included:
    • Drying2022.csv: drying rate data for the 2022 experimental run
    • Weather2022.csv: weather data for the 2022 experimental run
    • Drying2023.csv: drying rate data for the 2023 experimental run
    • Weather2023.csv: weather data for the 2023 experimental run
    • disinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation code
    • disinfestant_drying_analysis.html: rendered output of notebook
    • MS_figures.R: additional R code to create figures formatted for journal requirements
    • fit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This will allow users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing cluster
    • fit2023_discretetime_weather_solar.rds: fitted brms model object for 2023
    • data_dictionary.xlsx: descriptions of each column in the CSV data files

  6. The SPARC Data Initiative CFC-11, CFC-12, HF and SF6 climatologies from...

    • doi.pangaea.de
    • search.dataone.org
    • +1 more
    html, tsv
    Updated 2016
    Cite
    Michaela I Hegglin; Bernd Funke; Susann Tegtmeier; John Anderson; John C Gille; Ashley Jones; Lesley Smith; Thomas von Clarmann; Kaley A Walker (2016). The SPARC Data Initiative CFC-11, CFC-12, HF and SF6 climatologies from international satellite limb sounders [Dataset]. http://doi.org/10.1594/PANGAEA.849223
    Available download formats: tsv, html
    Dataset updated
    2016
    Dataset provided by
    PANGAEA
    Authors
    Michaela I Hegglin; Bernd Funke; Susann Tegtmeier; John Anderson; John C Gille; Ashley Jones; Lesley Smith; Thomas von Clarmann; Kaley A Walker
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Variables measured
    File name, Uniform resource locator/link to file
    Description

    A quality assessment of the CFC-11 (CCl3F), CFC-12 (CCl2F2), HF, and SF6 products from limb-viewing satellite instruments is provided by means of a detailed intercomparison. The climatologies in the form of monthly zonal mean time series are obtained from HALOE, MIPAS, ACE-FTS, and HIRDLS within the time period 1991-2010. The intercomparisons focus on the mean biases of the monthly and annual zonal mean fields and aim to identify their vertical, latitudinal and temporal structure. The CFC evaluations (based on MIPAS, ACE-FTS and HIRDLS) reveal that the uncertainty in our knowledge of the atmospheric CFC-11 and CFC-12 mean state, as given by satellite data sets, is smallest in the tropics and mid-latitudes at altitudes below 50 and 20 hPa, respectively, with a 1σ multi-instrument spread of up to ±5 %. For HF, the situation is reversed. The two available data sets (HALOE and ACE-FTS) agree well above 100 hPa, with a spread in this region of ±5 to ±10 %, while at altitudes below 100 hPa the HF annual mean state is less well known, with a spread of ±30 % and larger. The atmospheric SF6 annual mean states derived from two satellite data sets (MIPAS and ACE-FTS) show only very small differences with a spread of less than ±5 % and often below ±2.5 %. While the overall agreement among the climatological data sets is very good for large parts of the upper troposphere and lower stratosphere (CFCs, SF6) or middle stratosphere (HF), individual discrepancies have been identified. Pronounced deviations between the instrument climatologies exist for particular atmospheric regions which differ from gas to gas. Notable features are differently shaped isopleths in the subtropics, deviations in the vertical gradients in the lower stratosphere and in the meridional gradients in the upper troposphere, and inconsistencies in the seasonal cycle. Additionally, long-term drifts between the instruments have been identified for the CFC-11 and CFC-12 time series. The evaluations as a whole provide guidance on which data sets are the most reliable for applications such as studies of atmospheric transport and variability, model-measurement comparisons and detection of long-term trends.

  7. ECMWF ERA5: ensemble spreads of surface level analysis parameter data

    • data-search.nerc.ac.uk
    • catalogue.ceda.ac.uk
    Updated Jul 28, 2021
    Cite
    (2021). ECMWF ERA5: ensemble spreads of surface level analysis parameter data [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?orgName=European%20Centre%20for%20Medium-Range%20Weather%20Forecasts%20(ECMWF)
    Dataset updated
    Jul 28, 2021
    Description

    This dataset contains ensemble spreads for the ERA5 surface level analysis parameter data ensemble means (see linked dataset). ERA5 is the 5th generation reanalysis project from the European Centre for Medium-Range Weather Forecasts (ECMWF) - see linked documentation for further details. The ensemble means and spreads are calculated from the ERA5 10 member ensemble, run at a reduced resolution compared with the single high resolution (hourly output at 31 km grid spacing) 'HRES' realisation, for which these data have been produced to provide an uncertainty estimate. This dataset contains a limited selection of all available variables, which have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the Copernicus Data Store (CDS) data tool, linked to from this record.

    Note that ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10 members in the ensemble (i.e., including the control). It is not the sample standard deviation, and was thus calculated by dividing by 10 rather than 9 (N-1). See linked datasets for ensemble member and ensemble mean data.

    The ERA5 global atmospheric reanalysis covers 1979 to 2 months behind the present month. This follows on from the ERA-15, ERA-40 and ERA-Interim reanalysis projects. An initial release of ERA5 data (ERA5t) is made roughly 5 days behind the present date. These data are subsequently reviewed ahead of being released by ECMWF as quality-assured data within 3 months. CEDA holds a 6 month rolling copy of the latest ERA5t data. See related datasets linked to from this record.

    However, for the period 2000-2006 the initial ERA5 release was found to suffer from stratospheric temperature biases, so new runs were performed to address this issue, resulting in the ERA5.1 release (see linked datasets). Note, though, that Simmons et al. 2020 (technical memo 859) report that "ERA5.1 is very close to ERA5 in the lower and middle troposphere." Users of data from this period should nevertheless read technical memo 859 for further details.

  8. Statistics of ‘diffrate’.

    • figshare.com
    xls
    Updated Mar 13, 2024
    Cite
    Yuancheng Si; Saralees Nadarajah; Zongxin Zhang; Chunmin Xu (2024). Statistics of ‘diffrate’. [Dataset]. http://doi.org/10.1371/journal.pone.0299164.t002
    Available download formats: xls
    Dataset updated
    Mar 13, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Yuancheng Si; Saralees Nadarajah; Zongxin Zhang; Chunmin Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the dynamic landscape of financial markets, accurate forecasting of stock indices remains a pivotal yet challenging task, essential for investors and policymakers alike. This study is motivated by the need to enhance the precision of predicting the Shanghai Composite Index’s opening price spread, a critical measure reflecting market volatility and investor sentiment. Traditional time series models like ARIMA have shown limitations in capturing the complex, nonlinear patterns inherent in stock price movements, prompting the exploration of advanced methodologies. The aim of this research is to bridge the gap in forecasting accuracy by developing a hybrid model that integrates the strengths of ARIMA with deep learning techniques, specifically Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. This novel approach leverages the ARIMA model’s proficiency in linear trend analysis and the deep learning models’ capability in modeling nonlinear dependencies, aiming to provide a comprehensive tool for market prediction. Utilizing a comprehensive dataset covering the period from December 20, 1990, to June 2, 2023, the study develops and assesses the efficacy of ARIMA, LSTM, GRU, ARIMA-LSTM, and ARIMA-GRU models in forecasting the Shanghai Composite Index’s opening price spread. The evaluation of these models is based on key statistical metrics, including Mean Squared Error (MSE) and Mean Absolute Error (MAE), to gauge their predictive accuracy. The findings indicate that the hybrid models, ARIMA-LSTM and ARIMA-GRU, perform better in forecasting the opening price spread of the Shanghai Composite Index than their standalone counterparts. This outcome suggests that combining traditional statistical methods with advanced deep learning algorithms can enhance stock market prediction. 
The research contributes to the field by providing evidence of the potential benefits of integrating different modeling approaches for financial forecasting, offering insights that could inform investment strategies and financial decision-making.
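    The two evaluation metrics named in the description, Mean Squared Error (MSE) and Mean Absolute Error (MAE), can be sketched as follows. This is a generic illustration with hypothetical function names and made-up numbers, not the authors' evaluation code.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average of absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up opening price spreads and forecasts, for illustration only.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
mse = mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
```

    MSE penalises large individual errors more heavily than MAE, which is one reason both are usually reported together.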

  9. Heat index at 2 m above ground: A globally gridded dataset based on...

    • service.tib.eu
    Updated Nov 30, 2024
    Cite
    (2024). Heat index at 2 m above ground: A globally gridded dataset based on reanalysis data from 1979-2013, links to GeoTIFFs - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/png-doi-10-1594-pangaea-841057
    Dataset updated
    Nov 30, 2024
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    The increase in global mean temperatures resulting from climate change has wide reaching consequences for the earth's ecosystems and other natural systems. Many studies have been devoted to evaluating the distribution and effects of these changes. We go a step further and evaluate global changes to the heat index, a measure of temperature as perceived by humans. Heat index, which is computed from temperature and relative humidity, is more important than temperature for the health of humans and other animals. Even in cases where the heat index does not reach dangerous levels from a health perspective, it has been shown to be an important factor in worker productivity and thus in economic productivity. We compute heat index from dewpoint temperature and absolute temperature 2 m above ground from the ERA-Interim reanalysis dataset for the years 1979-2013. The data is provided aggregated to daily minima, means and maxima. Furthermore, the data is temporally aggregated to monthly and yearly values and spatially aggregated to the level of countries after being weighted by population density in order to demonstrate its usefulness for the analysis of its impact on human health and productivity. The resulting data deliver insights into the spatiotemporal development of near-ground heat index during the course of the past 3 decades. It is shown that the impact of changing heat index is unevenly distributed through space and time, affecting some areas differently than others. The likelihood of dangerous heat index events has increased globally. Also, heat index climate groups that would formerly be expected closer to the tropics have spread latitudinally to include areas closer to the poles. The data can serve in future studies as a basis for evaluating and understanding the evolution of heat index in the course of climate change, as well as its impact on human health and productivity.

  10. Network traffic datasets with novel extended IP flow called NetTiSA flow

    • data.niaid.nih.gov
    Updated Apr 18, 2024
    Cite
    Josef Koumar (2024). Network traffic datasets with novel extended IP flow called NetTiSA flow [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8301042
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    Josef Koumar
    Jaroslav Pešek
    Tomáš Čejka
    Karel Hynek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Network traffic datasets with novel extended IP flow called NetTiSA flow

    Datasets were created for the paper: NetTiSA: Extended IP Flow with Time-series Features for Universal Bandwidth-constrained High-speed Network Traffic Classification -- Josef Koumar, Karel Hynek, Jaroslav Pešek, Tomáš Čejka -- which is published in The International Journal of Computer and Telecommunications Networking https://doi.org/10.1016/j.comnet.2023.110147

    Please cite the usage of our datasets as:

    Josef Koumar, Karel Hynek, Jaroslav Pešek, Tomáš Čejka, "NetTiSA: Extended IP flow with time-series features for universal bandwidth-constrained high-speed network traffic classification", Computer Networks, Volume 240, 2024, 110147, ISSN 1389-1286

    @article{KOUMAR2024110147,
      title = {NetTiSA: Extended IP flow with time-series features for universal bandwidth-constrained high-speed network traffic classification},
      journal = {Computer Networks},
      volume = {240},
      pages = {110147},
      year = {2024},
      issn = {1389-1286},
      doi = {https://doi.org/10.1016/j.comnet.2023.110147},
      url = {https://www.sciencedirect.com/science/article/pii/S1389128623005923},
      author = {Josef Koumar and Karel Hynek and Jaroslav Pešek and Tomáš Čejka}
    }

    This Zenodo repository contains 23 datasets created from 15 well-known published datasets, which are cited in the table below. Each dataset contains the NetTiSA flow feature vector.

    NetTiSA flow feature vector

    The novel extended IP flow called NetTiSA (Network Time Series Analysed) flow contains a universal bandwidth-constrained feature vector consisting of 20 features. We divide the NetTiSA flow classification features into three groups by computation. The first group of features is based on classical bidirectional flow information---a number of transferred bytes, and packets. The second group contains statistical and time-based features calculated using the time-series analysis of the packet sequences. The third type of features can be computed from the previous groups (i.e., on the flow collector) and improve the classification performance without any impact on the telemetry bandwidth.

    Flow features

    The flow features are:

    Packets is the number of packets in the direction from the source to the destination IP address.

    Packets in reverse order is the number of packets in the direction from the destination to the source IP address.

    Bytes is the size of the payload in bytes transferred in the direction from the source to the destination IP address.

    Bytes in reverse order is the size of the payload in bytes transferred in the direction from the destination to the source IP address.

    Statistical and Time-based features

    These features are exported in the extended part of the flow. All of them can be computed (exactly or approximately) by stream-wise computation, which is necessary for keeping memory requirements low. The second feature set contains the following features:

    Mean represents mean of the payload lengths of packets

    Min is the minimal value from payload lengths of all packets in a flow

    Max is the maximum value from payload lengths of all packets in a flow

    Standard deviation is a measure of the variation of payload lengths from the mean payload length

    Root mean square is the measure of the magnitude of payload lengths of packets

    Average dispersion is the average absolute difference between each payload length of the packet and the mean value

    Kurtosis is the measure describing the extent to which the tails of a distribution differ from the tails of a normal distribution

    Mean of relative times is the mean of the relative times, which is a sequence defined as \(st = \{t_1 - t_1, t_2 - t_1, \dots, t_n - t_1\}\)

    Mean of time differences is the mean of the time differences, which is a sequence defined as \(dt = \{ t_j - t_i \mid j = i + 1,\ i \in \{1, 2, \dots, n - 1\} \}\)

    Min from time differences is the minimal value from all time differences, i.e., min space between packets.

    Max from time differences is the maximum value from all time differences, i.e., max space between packets.

    Time distribution describes the deviation of time differences between individual packets within the time series. The feature is computed by the following equation: \(tdist = \frac{ \frac{1}{n-1} \sum_{i=1}^{n-1} \left| \mu_{dt_{n-1}} - dt_i \right| }{ \frac{1}{2} \left( \max\left(dt_{n-1}\right) - \min\left(dt_{n-1}\right) \right) }\)

    Switching ratio represents a value change ratio (switching) between payload lengths. The switching ratio is computed by the equation \(sr = \frac{s_n}{\frac{1}{2} (n - 1)}\), where \(s_n\) is the number of switches.
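    The time-based formulas above can be sketched in a few lines. This is an illustrative batch computation over a list of packet timestamps, not the stream-wise implementation used by the exporter; the function names are hypothetical. Because the text does not define precisely what counts as a switch, the switching ratio takes the switch count \(s_n\) as an input rather than deriving it.

```python
from statistics import fmean


def time_features(timestamps):
    """Time-based features from packet arrival times.

    Assumes timestamps are in ascending order and contain at least two packets.
    """
    t1 = timestamps[0]
    st = [t - t1 for t in timestamps]                          # relative times
    dt = [b - a for a, b in zip(timestamps, timestamps[1:])]   # time differences
    mu_dt = fmean(dt)
    mean_abs_dev = fmean([abs(mu_dt - d) for d in dt])
    half_range = 0.5 * (max(dt) - min(dt))
    # Assumption: report 0.0 when all gaps are equal and the denominator is zero.
    tdist = mean_abs_dev / half_range if half_range > 0 else 0.0
    return {
        "mean_relative_time": fmean(st),
        "mean_time_diff": mu_dt,
        "min_time_diff": min(dt),
        "max_time_diff": max(dt),
        "time_distribution": tdist,
    }


def switching_ratio(num_switches, n_packets):
    """sr = s_n / (0.5 * (n - 1)), with the switch count supplied by the caller."""
    return num_switches / (0.5 * (n_packets - 1))


features = time_features([0.0, 1.0, 3.0, 6.0])
```

    A stream-wise version would keep running sums, minima and maxima instead of the full lists, which is what keeps memory requirements low on the probe.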
    

    Features computed at the collector

    The third set contains features that are computed from the previous two groups prior to classification. Therefore, they do not influence the network telemetry size and their computation does not put additional load on resource-constrained flow monitoring probes. The NetTiSA flow combined with this feature set is called the Enhanced NetTiSA flow and contains the following features:

    Max minus min is the difference between minimum and maximum payload lengths

    Percent deviation is the dispersion of the average absolute difference to the mean value

    Variance is the spread measure of the data from its mean

    Burstiness is the degree of peakedness in the central part of the distribution

    Coefficient of variation is a dimensionless quantity that compares the dispersion of a time series to its mean value and is often used to compare the variability of different time series that have different units of measurement

    Directions describe a percentage ratio of packet direction computed as \(\frac{d_1}{d_1 + d_0}\), where \(d_1\) is the number of packets in the direction from source to destination IP address and \(d_0\) is the number in the opposite direction. Both \(d_1\) and \(d_0\) are counted inside the classical bidirectional flow.

    Duration is the duration of the flow
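    Because these features are derived at the collector from values already exported, they can be sketched as simple post-processing. The snippet below is a hedged illustration only: the function name, argument names and output keys are hypothetical, percent deviation is assumed to be the average dispersion divided by the mean, the variance is assumed to be the population variance, and duration is assumed to be the last packet time minus the first.

```python
from statistics import mean, pstdev, pvariance


def enhanced_features(payload_lengths, d1, d0, t_first, t_last):
    """Derive collector-side features from already exported flow values."""
    mu = mean(payload_lengths)
    average_dispersion = mean([abs(x - mu) for x in payload_lengths])
    total_packets = d1 + d0
    return {
        "max_minus_min": max(payload_lengths) - min(payload_lengths),
        # Assumption: average dispersion expressed relative to the mean.
        "percent_deviation": average_dispersion / mu if mu else 0.0,
        # Assumption: population variance (divide by n, not n - 1).
        "variance": pvariance(payload_lengths),
        "coefficient_of_variation": pstdev(payload_lengths) / mu if mu else 0.0,
        # d1 / (d1 + d0): share of packets sent from source to destination.
        "directions": d1 / total_packets if total_packets else 0.0,
        # Assumption: duration is last timestamp minus first timestamp.
        "duration": t_last - t_first,
    }
```

    Keeping this step on the collector matches the point made above: none of these values has to be carried in the exported telemetry.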

    The NetTiSA flow is implemented in the IP flow exporter ipfixprobe.

    Description of dataset files

    The following table describes each dataset file:

    File name | Detection problem | Citation of the original raw dataset
    botnet_binary.csv | Binary detection of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
    botnet_multiclass.csv | Multi-class classification of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
    cryptomining_design.csv | Binary detection of cryptomining; the design part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022
    cryptomining_evaluation.csv | Binary detection of cryptomining; the evaluation part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022
    dns_malware.csv | Binary detection of malware DNS | Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021.
    doh_cic.csv | Binary detection of DoH | Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020
    doh_real_world.csv | Binary detection of DoH | Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022
    dos.csv | Binary detection of DoS | Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019.
    edge_iiot_binary.csv | Binary detection of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022.
    edge_iiot_multiclass.csv | Multi-class classification of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022.
    https_brute_force.csv | Binary detection of HTTPS Brute Force | Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020
    ids_cic_binary.csv | Binary detection of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018.
    ids_cic_multiclass.csv | Multi-class classification of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018.
    unsw_binary.csv | Binary detection of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015.
    unsw_multiclass.csv | Multi-class classification of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015.
    iot_23.csv | Binary detection of IoT malware | Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details here https://www.stratosphereips.org/datasets-iot23
    ton_iot_binary.csv | Binary detection of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021
    ton_iot_multiclass.csv | Multi-class classification of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets.

  11. Copernicus

    • sextant.ifremer.fr
    • pigma.org
    www:link +1
    Updated Sep 6, 2021
    Cite
    ERA5 monthly averaged data on single levels from 1979 to present (2021). Copernicus [Dataset]. https://sextant.ifremer.fr/geonetwork/srv/api/records/ff2cd349-ecab-48e1-817a-1ed87dc0c4be
    Explore at:
    Available download formats: www:link-1.0-http--publication-url, www:link
    Dataset updated
    Sep 6, 2021
    Dataset provided by
    ERA5 monthly averaged data on single levels from 1979 to present
    Area covered
    Description

    ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 4 to 7 decades. Currently data is available from 1950, split into Climate Data Store entries for 1950-1978 (preliminary back extension) and from 1979 onwards (final release plus timely updates, this page). ERA5 replaces the ERA-Interim reanalysis.

    Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product.

    ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread.

    ERA5 is updated daily with a latency of about 5 days (monthly means are available around the 6th of each month). If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later. So far this has not happened; if it does, users will be notified.

    The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications.

    An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines.

    Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main sub sets: hourly and monthly products, both on pressure levels (upper air fields) and single levels (atmospheric, ocean-wave and land surface quantities).

    The present entry is "ERA5 monthly mean data on single levels from 1979 to present".

  12. ERA5 hourly time-series data on single levels from 1940 to present

    • cds-stable-bopen.copernicus-climate.eu
    • cds.climate.copernicus.eu
    netcdf
    Updated Apr 8, 2025
    Cite
    ECMWF (2025). ERA5 hourly time-series data on single levels from 1940 to present [Dataset]. https://cds-stable-bopen.copernicus-climate.eu/datasets/reanalysis-era5-single-levels-timeseries
    Explore at:
    Available download formats: netcdf
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    European Centre for Medium-Range Weather Forecasts (http://ecmwf.int/)
    Authors
    ECMWF
    License

    https://object-store.os-api.cci2.ecmwf.int:443/bopen-cds2-stable-catalogue/licences/licence-to-use-copernicus-products/licence-to-use-copernicus-products_b4b9451f54cffa16ecef5c912c9cebd6979925a956e3fa677976e0cf198c2c18.pdf

    Time period covered
    Jan 1, 1940 - Dec 6, 2024
    Description

    ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product. ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread. ERA5 is updated daily with a latency of about 5 days. In case that serious flaws are detected in this early release (called ERA5T), this data could be different from the final release 2 to 3 months later. 
If this occurs, users are notified. The dataset presented here is a regridded subset of the full ERA5 dataset on native resolution, stored in a format designed for retrieving long time series for a single point. When the requested location does not match the exact location of a grid point, the nearest grid point is used instead. This is the source of ERA5 data used by the ERA-Explorer to ensure the response times required for the interactive web application. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines.
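
    The nearest-grid-point rule mentioned above can be illustrated with a short Python sketch. It is a simplified illustration, not the CDS implementation: the function name is hypothetical, and the grid is assumed to be a regular 0.25 degree latitude-longitude grid with latitudes running from 90 to -90 and longitudes from 0 to 359.75.

```python
def nearest_grid_point(lat, lon, step=0.25):
    """Snap a requested location to the nearest point of a regular lat-lon grid.

    Assumption: latitudes run from 90 down to -90 and longitudes from 0 up to
    360 - step, both in increments of `step` degrees.
    """
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude must be between -90 and 90 degrees")

    n_lon = round(360.0 / step)
    # Index of the closest latitude row, counted from the north pole.
    lat_idx = round((90.0 - lat) / step)
    # Wrap longitude into [0, 360) before snapping, then wrap the index too,
    # so that values just below 360 snap to the 0 degree column.
    lon_idx = round((lon % 360.0) / step) % n_lon

    return 90.0 - lat_idx * step, lon_idx * step
```

    With these assumptions, a request for latitude 51.48 and longitude -0.1 is served from the grid point at latitude 51.5 and longitude 0.0. Exact ties between two grid points follow Python's round-half-to-even behaviour, which is a property of this sketch rather than of the dataset.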

  13. NOAA/CIRES Twentieth Century Global Reanalysis Version 2c

    • rda.ucar.edu
    • oidc.rda.ucar.edu
    Updated Mar 16, 2015
    Cite
    Gilbert Compo; Jeffrey Whitaker; Prashant Sardeshmukh; Robert Allan; Chesley McColl; Xungang Yin; Benjamin Giese; Russell Vose; Nobuki Matsui; Linden Ashcroft; Renate Auchmann; Mac Benoy; Pierre Bessemoulin; Theo Brandsma; Philip Brohan; Manola Brunet; Joseph Comeaux; Thomas Cram; Richard Crouthamel; Pavel Groisman; Hans Hersbach; Philip Jones; Trausti Jonsson; Sylvie Jourdain; Gail Kelly; Kenneth Knapp; Andries Kruger; Hisayuki Kubota; Gianluca Lentini; Andrew Lorrey; Neal Lott; Sandra Lubker; Jurg Luterbacher; Gareth Marshall; Maurizio Maugeri; Cary Mock; Hing Mok; Oyvind Nordli; Rajmund Przybylak; Mark Rodwell; Thomas Ross; Douglas Schuster; Lidija Srnec; Maria Valente; Zsuzsanna Vizi; Xiaolan Wang; Nancy Westcott; John Woollen; Steven Worley (2015). NOAA/CIRES Twentieth Century Global Reanalysis Version 2c [Dataset]. http://doi.org/10.5065/D6N877TW
    Explore at:
    Dataset updated
    Mar 16, 2015
    Dataset provided by
    University Corporation for Atmospheric Research
    Authors
    Gilbert Compo; Jeffrey Whitaker; Prashant Sardeshmukh; Robert Allan; Chesley McColl; Xungang Yin; Benjamin Giese; Russell Vose; Nobuki Matsui; Linden Ashcroft; Renate Auchmann; Mac Benoy; Pierre Bessemoulin; Theo Brandsma; Philip Brohan; Manola Brunet; Joseph Comeaux; Thomas Cram; Richard Crouthamel; Pavel Groisman; Hans Hersbach; Philip Jones; Trausti Jonsson; Sylvie Jourdain; Gail Kelly; Kenneth Knapp; Andries Kruger; Hisayuki Kubota; Gianluca Lentini; Andrew Lorrey; Neal Lott; Sandra Lubker; Jurg Luterbacher; Gareth Marshall; Maurizio Maugeri; Cary Mock; Hing Mok; Oyvind Nordli; Rajmund Przybylak; Mark Rodwell; Thomas Ross; Douglas Schuster; Lidija Srnec; Maria Valente; Zsuzsanna Vizi; Xiaolan Wang; Nancy Westcott; John Woollen; Steven Worley
    Time period covered
    Dec 31, 1850 - Dec 31, 2014
    Area covered
    Earth
    Description

    The Twentieth Century Reanalysis Project, produced by the Earth System Research Laboratory Physical Sciences Division from NOAA and the University of Colorado Cooperative Institute for Research in Environmental Sciences, is an effort to produce a global reanalysis dataset spanning a portion of the nineteenth century and the entire twentieth century (1851 - near present), assimilating only surface observations of synoptic pressure. Boundary conditions of pentad sea surface temperature and monthly sea ice concentration and time-varying solar, volcanic, and carbon dioxide radiative forcings are prescribed. Products include 6-hourly ensemble mean and spread analysis fields on a 2 by 2 degree global latitude-longitude grid, and 3 and 6-hourly ensemble mean and spread forecast (first guess) fields on a global Gaussian T62 grid. Fields are accessible in yearly time series (1 file per parameter) and monthly synoptic time (all parameters per synoptic hour) files. This dataset provides the first estimates of global tropospheric variability spanning 1851 to 2012 at six-hourly resolution. Fields from 1851 to 1860 are a first attempt at this period and will be improved in future versions. Fields from 1861 to 2011 are most relevant for climate and weather studies. 20th Century Reanalysis Version 2c uses the same model as version 2 with new sea ice boundary conditions from the COBE-SST2 (Hirahara et al. 2014), new pentad Simple Ocean Data Assimilation with sparse input (SODAsi.2, Giese et al. 2015) sea surface temperature fields through 2012, Daily High-Resolution-Blended Analyses for Sea Surface Temperature starting with 2013, and additional observations from ISPD version 3.2.9.

    A low pressure bias in marine pressures from the US Maury Collection (Woodruff et al. 2005, Wallbrink et al. 2009) appears to have affected the resultant 20CR version 2c mass-related fields (e.g., pressure, geopotential height) from 1851 to about 1865. Please see opportunities for improvement [https://www.esrl.noaa.gov/psd/data/gridded/20thC_ReanV2c/opportunities.html] for additional information.

    The Twentieth Century Reanalysis Project version 2c used resources of the National Energy Research Scientific Computing Center managed by Lawrence Berkeley National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Version 2c is a contribution to the international Atmospheric Circulation Reconstructions over the Earth initiative. Support for the Twentieth Century Reanalysis Project is provided by the U.S. Department of Energy Office of Science (BER) and the NOAA Climate Program Office MAPP program.

  14. MEDSEA_CH1_Product_1 / Wind and wave data set from MARINA project

    • sextant.ifremer.fr
    • pigma.org
    doi, ogc:ows-c +1
    Updated Nov 21, 2016
    Cite
    University of Athens (2016). MEDSEA_CH1_Product_1 / Wind and wave data set from MARINA project [Dataset]. https://sextant.ifremer.fr/geonetwork/srv/api/records/c669abf9-31a3-40d0-9954-3e8a31f2bf73
    Explore at:
    Available download formats: ogc:ows-c, www:link, doi
    Dataset updated
    Nov 21, 2016
    Dataset provided by
    EMODnet Medsea Checkpoint
    University of Athens
    Time period covered
    Jan 1, 2001 - Dec 31, 2010
    Area covered
    Description

    Today's normative and regulatory requirements for assessing the producible energy from wind rely on in situ measurements (masts with anemometric sensors), which are extremely costly to implement offshore. However, proof should be provided that hindcast model results are highly reliable, in order to provide an equivalent assessment. Very high resolution models are also key to decision making for proper siting, which relies on the consistency of all datasets provided in the assessment. In this tender the products of the FP7 MARINA project will be used: 10-year (2001-2010) high-resolution atmospheric, wave, tidal and ocean current simulations. The model outputs are at high resolution (0.05 x 0.05 degree horizontal resolution, 1-hour time resolution, 5 vertical levels at 10, 40, 80, 120 and 180 m). The wave parameters are co-located with the meteorological output fields. Satellite altimetry data from the ENVISAT and JASON satellites have been assimilated in the system. Other wind and wave satellite data sets will also be analyzed (Synthetic Aperture Radar, SAR, for example). At the same co-located points the tidal and ocean current data together with bathymetry are available. For preselected points in the North Western Mediterranean (Spain-France-Italy areas) directional wave spectra data have been saved and are available. From the SKIRON meteorological model the available parameters are: WIND SPEED (m/s), WIND DIRECTION (deg), AIR PRESSURE (hPa), AIR DENSITY (kg/m3), TEMPERATURE (K), MODEL SEAMASK. From the wave model the available parameters are: SIGNIFICANT WAVE HEIGHT (m), MEAN WAVE DIRECTION (deg), WAVE MEAN PERIOD (s), PEAK WAVE PERIOD (s), SWELL WAVE HEIGHT (m), MEAN SWELL PERIOD (s), MEAN DIRECTIONAL SPREAD, WINDSEA MEAN DIRECTIONAL SPREAD, SWELL MEAN DIRECTIONAL SPREAD, MAXIMUM WAVE HEIGHT (m).

  15. SmartBay Ireland Galway Bay Buoy Wave - Dataset - data.gov.ie

    • data.gov.ie
    Updated Nov 2, 2016
    Cite
    data.gov.ie (2016). SmartBay Ireland Galway Bay Buoy Wave - Dataset - data.gov.ie [Dataset]. https://data.gov.ie/dataset/smartbay-ireland-galway-bay-buoy-wave
    Explore at:
    Dataset updated
    Nov 2, 2016
    Dataset provided by
    data.gov.ie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Ireland, Galway
    Description

    This data comprises wave data collected from the SmartBay buoy moored in Galway Bay. The TRIAXYS Directional Wave Sensor collects wave data and returns the following parameters:

    • No Zero Crossings (Number)
    • HAvg (Average Wave Height) (Meters)
    • Tz (Mean Spectral Period) (Seconds)
    • HMax (Max Wave Height) (Meters)
    • HSig (Significant Wave Height) (Meters)
    • TSig (Significant Period) (Seconds)
    • H10 (Meters)
    • T10 (Seconds)
    • TAvg (Mean Wave Period) (Seconds)
    • TP (Peak Period) (Seconds)
    • TP5 (Seconds)
    • HMO (Meters)
    • Mean Direction (Degrees)
    • Mean Spread (Degrees)

    The TRIAXYS Directional Wave Sensor is comprised of three accelerometers and three rate sensors that ultimately measure the total displacement along the three orthogonal axes of the floating platform. In addition, this sensor is equipped with a gimballed fluxgate compass to measure true magnetic direction.

  16. Trace-Share Dataset for Evaluation of Trace Meaning Preservation

    • zenodo.org
    • data.niaid.nih.gov
    csv, zip
    Updated May 7, 2020
    Cite
    Milan Cermak; Tomas Madeja (2020). Trace-Share Dataset for Evaluation of Trace Meaning Preservation [Dataset]. http://doi.org/10.5281/zenodo.3547528
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    May 7, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Milan Cermak; Tomas Madeja
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains all data used during the evaluation of trace meaning preservation. Archives are protected by password "trace-share" to avoid false detection by antivirus software.

    For more information, see the project repository at https://github.com/Trace-Share.

    Selected Attack Traces

    The following list contains trace datasets used for evaluation. Each attack was chosen to have not only a different meaning but also different statistical properties.

    • dos_http_flood — a capture of GET and POST requests sent to one server by one attacker (HTTP traffic);
    • ftp_bruteforce — a short and unsuccessful attempt to guess a user’s password for an FTP service (FTP traffic);
    • ponyloader_botnet — Pony Loader botnet used for stealing credentials from 3 target devices reporting to a single IP with a large number of intermediate addresses (DNS and HTTP traffic);
    • scan — a capture of the nmap tool scanning a given subnet using ICMP echo and TCP SYN requests (consists of ARP, ICMP, and TCP traffic);
    • wannacry_ransomware — a capture of WannaCry ransomware spreading in a domain with three workstations, a domain controller, and a file-sharing server (SMB and SMBv2 traffic).

    Background Traffic Data

    The publicly available dataset CSE-CIC-IDS-2018 was used as background traffic data. The evaluation uses data from the day Thursday-01-03-2018, which contains a sufficient proportion of regular traffic without any statistically significant attacks. Only traffic aimed at victim machines (range 172.31.69.0/24) is used, to reduce less significant traffic.
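
    Restricting background traffic to the victim range can be sketched with Python's standard ipaddress module. This is an illustration only and not the project's own tooling: the function names and the flow records (dictionaries with a dst_ip key) are hypothetical, while the subnet is the one stated above.

```python
import ipaddress

# Victim range stated in the dataset description.
VICTIM_NET = ipaddress.ip_network("172.31.69.0/24")


def is_victim_traffic(dst_ip):
    """Return True when the destination address lies inside the victim range."""
    return ipaddress.ip_address(dst_ip) in VICTIM_NET


def filter_flows(flows):
    """Keep only flows aimed at victim machines.

    Assumption: each flow is a dict with a 'dst_ip' string field.
    """
    return [flow for flow in flows if is_victim_traffic(flow["dst_ip"])]
```

    A flow with destination 172.31.69.25 is kept, while one addressed to 172.31.70.1 or 8.8.8.8 is dropped.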

    Evaluation Results and Dataset Structure

    • Traces variants (traces.zip)
      • ./traces-original/ — trace PCAP files and crawled details in YAML format;
      • ./traces-normalized — normalized PCAP files and details in YAML format;
      • ./traces-adjusted — adjusted PCAP files using various timestamp generation settings, combination configuration in YAML format, and labels provided by ID2T in XML format.
    • Extracted alerts (alerts.zip)
      • ./alerts-original/ — extracted Suricata alerts, Suricata log, and full Suricata output for all original trace files;
      • ./alerts-normalized/ — extracted Suricata alerts, Suricata log, and full Suricata output for all normalized trace files;
      • ./alerts-adjusted/ — extracted Suricata alerts, Suricata log, and full Suricata output for all adjusted trace files.
    • Evaluation results
      • *.csv files in the root directory — data contains extracted alert signatures and their count per each trace variant.

  17. Datasheet1_Mobility data shows effectiveness of control strategies for...

    • frontiersin.figshare.com
    • figshare.com
    pdf
    Updated Mar 7, 2024
    Cite
    Yuval Berman; Shannon D. Algar; David M. Walker; Michael Small (2024). Datasheet1_Mobility data shows effectiveness of control strategies for COVID-19 in remote, sparse and diffuse populations.pdf [Dataset]. http://doi.org/10.3389/fepid.2023.1201810.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Mar 7, 2024
    Dataset provided by
    Frontiers
    Authors
    Yuval Berman; Shannon D. Algar; David M. Walker; Michael Small
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data that is collected at the individual-level from mobile phones is typically aggregated to the population-level for privacy reasons. If we are interested in answering questions regarding the mean, or working with groups appropriately modeled by a continuum, then this data is immediately informative. However, coupling such data regarding a population to a model that requires information at the individual-level raises a number of complexities. This is the case if we aim to characterize human mobility and simulate the spatial and geographical spread of a disease by dealing in discrete, absolute numbers. In this work, we highlight the hurdles faced and outline how they can be overcome to effectively leverage the specific dataset: Google COVID-19 Aggregated Mobility Research Dataset (GAMRD). Using a case study of Western Australia, which has many sparsely populated regions with incomplete data, we firstly demonstrate how to overcome these challenges to approximate absolute flow of people around a transport network from the aggregated data. Overlaying this evolving mobility network with a compartmental model for disease that incorporated vaccination status we run simulations and draw meaningful conclusions about the spread of COVID-19 throughout the state without de-anonymizing the data. We can see that towns in the Pilbara region are highly vulnerable to an outbreak originating in Perth. Further, we show that regional restrictions on travel are not enough to stop the spread of the virus from reaching regional Western Australia. The methods explained in this paper can be therefore used to analyze disease outbreaks in similarly sparse populations. We demonstrate that using this data appropriately can be used to inform public health policies and have an impact in pandemic responses.

  18. E-commerce Sales Prediction Dataset

    • kaggle.com
    Updated Dec 14, 2024
    Cite
    Nevil Dhinoja (2024). E-commerce Sales Prediction Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/10197264
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 14, 2024
    Dataset provided by
    Kaggle
    Authors
    Nevil Dhinoja
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    E-commerce Sales Prediction Dataset

    This repository contains a comprehensive and clean dataset for predicting e-commerce sales, tailored for data scientists, machine learning enthusiasts, and researchers. The dataset is crafted to analyze sales trends, optimize pricing strategies, and develop predictive models for sales forecasting.

    📂 Dataset Overview

    The dataset includes 1,000 records across the following features:

    Column Name | Description
    Date | The date of the sale (01-01-2023 onward).
    Product_Category | Category of the product (e.g., Electronics, Sports, Other).
    Price | Price of the product (numerical).
    Discount | Discount applied to the product (numerical).
    Customer_Segment | Buyer segment (e.g., Regular, Occasional, Other).
    Marketing_Spend | Marketing budget allocated for sales (numerical).
    Units_Sold | Number of units sold per transaction (numerical).

    📊 Data Summary

    General Properties

    Date:
      • Range: 01-01-2023 to 12-31-2023.
      • Contains 1,000 unique values without missing data.

    Product_Category:
      • Categories: Electronics (21%), Sports (21%), Other (58%).
      • Most common category: Electronics (21%).

    Price:
      • Range: From 244 to 999.
      • Mean: 505, Standard Deviation: 290.
      • Most common price range: 14.59 - 113.07.

    Discount:
      • Range: From 0.01% to 49.92%.
      • Mean: 24.9%, Standard Deviation: 14.4%.
      • Most common discount range: 0.01 - 5.00%.

    Customer_Segment:
      • Segments: Regular (35%), Occasional (34%), Other (31%).
      • Most common segment: Regular.

    Marketing_Spend:
      • Range: From 2.41k to 10k.
      • Mean: 4.91k, Standard Deviation: 2.84k.

    Units_Sold:
      • Range: From 5 to 57.
      • Mean: 29.6, Standard Deviation: 7.26.
      • Most common range: 24 - 34 units sold.

    📈 Data Visualizations

    The dataset is suitable for creating the following visualizations:

    1. Price Distribution: Histogram to show the spread of prices.
    2. Discount Distribution: Histogram to analyze promotional offers.
    3. Marketing Spend Distribution: Histogram to understand marketing investment patterns.
    4. Customer Segment Distribution: Bar plot of customer segments.
    5. Price vs Units Sold: Scatter plot to show pricing effects on sales.
    6. Discount vs Units Sold: Scatter plot to explore the impact of discounts.
    7. Marketing Spend vs Units Sold: Scatter plot for marketing effectiveness.
    8. Correlation Heatmap: Identify relationships between features.
    9. Pairplot: Visualize pairwise feature interactions.

    💡 How the Data Was Created

    The dataset is synthetically generated to mimic realistic e-commerce sales trends. Below are the steps taken for data generation:

    1. Feature Engineering:

      • Identified key attributes such as product category, price, discount, and marketing spend, typically observed in e-commerce data.
      • Generated dependent features like units sold based on logical relationships.
    2. Data Simulation:

      • Python Libraries: Used NumPy and Pandas to generate and distribute values.
      • Statistical Modeling: Ensured feature distributions aligned with real-world sales data patterns.
    3. Validation:

      • Verified data consistency with no missing or invalid values.
      • Ensured logical correlations (e.g., higher discounts → increased units sold).

    Note: The dataset is synthetic and not sourced from any real-world e-commerce platform.
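
    The generation steps above could look roughly like the following sketch. It is not the author's generator: it uses Python's standard random module in place of NumPy and Pandas to stay dependency-free, and every range, coefficient and the function name are illustrative assumptions. Only the column names and the rule that higher discounts increase units sold come from the description.

```python
import random
from datetime import date, timedelta

CATEGORIES = ["Electronics", "Sports", "Other"]
SEGMENTS = ["Regular", "Occasional", "Other"]


def generate_sales(n_rows=1000, seed=42):
    """Generate synthetic sales rows with a discount-driven Units_Sold column."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    start = date(2023, 1, 1)
    rows = []
    for _ in range(n_rows):
        # Assumed value ranges; the real generator's ranges are not documented.
        price = round(rng.uniform(10.0, 1000.0), 2)
        discount = round(rng.uniform(0.0, 50.0), 2)
        spend = round(rng.uniform(100.0, 10000.0), 2)
        # Assumed linear relationship: discounts and marketing raise sales,
        # higher prices lower them, plus Gaussian noise.
        expected = 20.0 + 0.2 * discount + 0.001 * spend - 0.005 * price
        units = max(1, round(expected + rng.gauss(0.0, 3.0)))
        rows.append({
            "Date": (start + timedelta(days=rng.randrange(365))).isoformat(),
            "Product_Category": rng.choice(CATEGORIES),
            "Price": price,
            "Discount": discount,
            "Customer_Segment": rng.choice(SEGMENTS),
            "Marketing_Spend": spend,
            "Units_Sold": units,
        })
    return rows
```

    The returned list of dictionaries can be passed to pandas.DataFrame or written out with csv.DictWriter to obtain a file with the same columns as ecommerce_sales.csv.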

    🛠 Example Usage: Sales Prediction Model

    Here’s an example of building a predictive model using Linear Regression:

    Written in python

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    
    # Load the dataset
    df = pd.read_csv('ecommerce_sales.csv')
    
    # Feature selection
    X = df[['Price', 'Discount', 'Marketing_Spend']]
    y = df['Units_Sold']
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Model training
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    
    # Evaluation
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print(f'Mean Squared Error: {mse:.2f}')
    print(f'R-squared: {r2:.2f}')
    
  19. g

    Corona data donation - Partial data set Vital data

    • gimi9.com
    Cite
    Corona data donation - Partial data set Vital data [Dataset]. https://gimi9.com/dataset/eu_https-zenodo-org-record-8229284/
    Explore at:
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data from fitness wristbands and smartwatches, so-called wearables, can provide indications of symptoms of COVID-19 disease. With the help of the Corona data donation app (CDA), citizens were able to make such data available to the Robert Koch Institute for scientific purposes. Together with information from other sources, e.g. official reporting data on case numbers, these data help scientists to better record and understand the spread of the coronavirus.

    The data points provided in this repository contain spatially and temporally aggregated information on the mean resting heart rate, the mean daily step count and the mean sleep duration per day and per county and district. A visual and interactive preparation of the data can already be found in the Vitaldaten-Explorer, which was provided by the CDA team.

    The data points provided here serve the further use in science and the interested public. They cover the full CDA survey period from April 2020 to December 2022. Since the data provided are spatial averages, it is not possible to draw conclusions about individuals.

  20. n

    ECMWF ERA5.1: ensemble spreads of surface level analysis parameter data for...

    • data-search.nerc.ac.uk
    • catalogue.ceda.ac.uk
    Updated Sep 18, 2021
    Cite
    (2021). ECMWF ERA5.1: ensemble spreads of surface level analysis parameter data for 2000-2006 [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=ensemble%20run
    Explore at:
    Dataset updated
    Sep 18, 2021
    Description

    This dataset contains spreads for the ERA5.1 surface level analysis parameter data ensemble means (see linked dataset) over the period 2000-2006. ERA5.1 is the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 reanalysis project re-run for 2000-2006 to improve upon the cold bias in the lower stratosphere seen in ERA5 (see technical memorandum 859 in the linked documentation section for further details). The ensemble means and spreads are calculated from the ERA5.1 10 member ensemble, run at a reduced resolution compared with the single high resolution (hourly output at 31 km grid spacing) 'HRES' realisation, for which these data have been produced to provide an uncertainty estimate. The data comprise a limited selection of all available variables and have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the Copernicus Data Store (CDS) data tool, linked to from this record. Note that the ensemble standard deviation is often referred to as the ensemble spread and is calculated as the standard deviation of the 10 members in the ensemble (i.e., including the control). It is not the sample standard deviation: it was calculated by dividing by 10 (N) rather than 9 (N-1). The main ERA5 global atmospheric reanalysis covers 1979 to 2 months behind the present month. This follows on from the ERA-15, ERA-40 and ERA-Interim re-analysis projects. An initial release of ERA5 data, ERA5t, is also available up to 5 days behind the present. A limited selection of data from these runs is also available via CEDA, whilst full access is available via the Copernicus Data Store.
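
The distinction between dividing by 10 (N) and by 9 (N-1) can be illustrated with a small NumPy sketch. It is not taken from the ECMWF or CEDA processing chain: the array shape, the synthetic temperature-like values and the helper name `ensemble_mean_and_spread` are assumptions made purely for illustration. The only detail drawn from the description is that the spread is the standard deviation over the 10 members with divisor N.

```python
from typing import Tuple

import numpy as np


def ensemble_mean_and_spread(members: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Return the mean and spread over the leading (member) axis.

    The spread uses ddof=0, i.e. the sum of squared deviations is divided
    by N (the number of members), as the description states.
    """
    mean = members.mean(axis=0)
    spread = members.std(axis=0, ddof=0)
    return mean, spread


# Hypothetical field: 10 members on a 3 x 4 latitude-longitude grid.
rng = np.random.default_rng(0)
members = 280.0 + rng.normal(0.0, 1.5, size=(10, 3, 4))

mean, spread = ensemble_mean_and_spread(members)

# For comparison only: the sample standard deviation divides by N-1 = 9.
sample_sd = members.std(axis=0, ddof=1)

# The two differ by the constant factor sqrt(N / (N - 1)) = sqrt(10 / 9).
assert np.allclose(sample_sd, spread * np.sqrt(10.0 / 9.0))
```

With 10 members the sample standard deviation is about 5% larger than the spread as defined here, which matters when comparing these values against spreads computed with other tools whose defaults may differ.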

Cite
(2023). ECMWF ERA5: 10 ensemble member surface level analysis parameter data [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=ensemble%20runs

ECMWF ERA5: 10 ensemble member surface level analysis parameter data

Explore at:
Dataset updated
Dec 8, 2023
Description

This dataset contains ERA5 surface level analysis parameter data from 10 ensemble runs. ERA5 is the 5th generation reanalysis project from the European Centre for Medium-Range Weather Forecasts (ECMWF) - see linked documentation for further details. The ensemble members were used to derive means and spread data (see linked datasets). Ensemble means and spreads were calculated from the ERA5t 10 member ensemble, run at a reduced resolution compared with the single high resolution (hourly output at 31 km grid spacing) 'HRES' realisation, for which these data have been produced to provide an uncertainty estimate. The data comprise a limited selection of all available variables and have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the Copernicus Data Store (CDS) data tool, linked to from this record. Note that the ensemble standard deviation is often referred to as the ensemble spread and is calculated as the standard deviation of the 10 members in the ensemble (i.e., including the control). It is not the sample standard deviation: it was calculated by dividing by 10 (N) rather than 9 (N-1). See linked datasets for ensemble member and ensemble mean data. The ERA5 global atmospheric reanalysis covers 1979 to 2 months behind the present month. This follows on from the ERA-15, ERA-40 and ERA-Interim re-analysis projects. An initial release of ERA5 data (ERA5t) is made roughly 5 days behind the present date. These data will subsequently be reviewed ahead of being released by ECMWF as quality assured data within 3 months. CEDA holds a 6 month rolling copy of the latest ERA5t data. See related datasets linked to from this record.
However, for the period 2000-2006 the initial ERA5 release was found to suffer from stratospheric temperature biases, so new runs were performed to address this issue, resulting in the ERA5.1 release (see linked datasets). Note, though, that Simmons et al. 2020 (technical memo 859) report that "ERA5.1 is very close to ERA5 in the lower and middle troposphere", but users of data from this period should read technical memo 859 for further details.
