74 datasets found
  1. National Forest and Sparse Woody Vegetation Data (Version 5.0 - 2020...

    • researchdata.edu.au
    Updated Aug 5, 2021
    + more versions
    Cite
    Australian Government Department of Climate Change, Energy, the Environment and Water (2021). National Forest and Sparse Woody Vegetation Data (Version 5.0 - 2020 Release) [Dataset]. https://researchdata.edu.au/national-forest-sparse-2020-release/2989276
    Dataset updated
    Aug 5, 2021
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Australian Government Department of Climate Change, Energy, the Environment and Water
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Landsat satellite imagery is used to derive woody vegetation extent products that discriminate between forest, sparse woody and non-woody land cover across a time series from 1988 to 2020. A forest is defined as woody vegetation with a minimum 20 per cent canopy cover, at least 2 metres high and a minimum area of 0.2 hectares. Sparse woody is defined as woody vegetation with a canopy cover between 5 and 19 per cent.

    The three-class classification (forest, sparse woody and non-woody) supersedes the two-class classification (forest and non-forest) from 2016. The new classification is produced using the same time series processing approach (conditional probability networks) as the two-class method to detect woody vegetation cover. The three-class algorithm better encompasses the different types of woody vegetation across the Australian landscape.

    Earlier versions of this dataset were published by the Department of the Environment and Energy.

  2. Data from: A change-point–based control chart for detecting sparse mean...

    • tandf.figshare.com
    txt
    Updated Jan 17, 2024
    Cite
    Zezhong Wang; Inez Maria Zwetsloot (2024). A change-point–based control chart for detecting sparse mean changes in high-dimensional heteroscedastic data [Dataset]. http://doi.org/10.6084/m9.figshare.24441804.v1
    Explore at: txt
    Dataset updated
    Jan 17, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Zezhong Wang; Inez Maria Zwetsloot
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Because of the “curse of dimensionality,” high-dimensional processes present challenges to traditional multivariate statistical process monitoring (SPM) techniques. The unknown underlying distribution of, and complicated dependency among, variables (such as heteroscedasticity) increase the uncertainty of estimated parameters and decrease the effectiveness of control charts. In addition, the requirement of sufficient reference samples limits the application of traditional charts in high-dimension, low-sample-size scenarios (small n, large p). Further difficulties appear when detecting and diagnosing abnormal behaviors caused by a small set of variables (i.e., sparse changes). In this article, we propose two change-point–based control charts to detect sparse shifts in the mean vector of high-dimensional heteroscedastic processes. Our proposed methods can start monitoring when the number of observations is much smaller than the dimensionality. The simulation results show that the proposed methods are robust to nonnormality and heteroscedasticity. Two real data examples are used to illustrate the effectiveness of the proposed control charts in high-dimensional applications. The R codes are provided online.

  3. CTF4Science: Kuramoto-Sivashinsky Official DS

    • kaggle.com
    zip
    Updated May 14, 2025
    Cite
    AI Institute in Dynamic Systems (2025). CTF4Science: Kuramoto-Sivashinsky Official DS [Dataset]. https://www.kaggle.com/datasets/dynamics-ai/ctf4science-kuramoto-sivashinsky-official-ds
    Explore at: zip (991463847 bytes)
    Dataset updated
    May 14, 2025
    Dataset authored and provided by
    AI Institute in Dynamic Systems
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Kuramoto-Sivashinsky (KS) Dataset - CTF4Science

    Dataset Description

    This dataset contains numerical simulations of the Kuramoto-Sivashinsky (KS) equation, a fourth-order nonlinear partial differential equation (PDE) that exhibits spatio-temporal chaos. The KS equation is a canonical example used in scientific machine learning to benchmark data-driven algorithms for dynamical systems modeling, forecasting, and reconstruction.

    The Kuramoto-Sivashinsky Equation

    The KS equation is defined as:

    u_t + uu_x + u_xx + μu_xxxx = 0
    

    where:

    • u(x,t) is the solution on a spatial domain x ∈ [0, 32π] with periodic boundary conditions
    • μ is a parameter controlling the fourth-order diffusion term
    • the equation exhibits spatio-temporal chaotic behavior, making it particularly challenging for forecasting algorithms

    Dataset Purpose

    This dataset is part of the Common Task Framework (CTF) for Science, designed to provide standardized, rigorous benchmarks for evaluating machine learning algorithms on scientific problems. The CTF addresses key challenges in scientific ML including:

    • Short-term forecasting (weather forecast): Predicting near-future states with trajectory accuracy
    • Long-term forecasting (climate forecast): Capturing statistical properties of long-time dynamics
    • Noisy data reconstruction: Denoising and forecasting from corrupted measurements
    • Limited data scenarios: Learning from sparse observations
    • Parametric generalization: Interpolation and extrapolation to new parameter regimes

    Key Dataset Characteristics

    • System Type: Spatio-temporal PDE (1D spatial + time)
    • Spatial Dimension: 1024 grid points across domain [0, 32π]
    • Time Step: Δt = 0.025
    • Behavior: Chaotic spatio-temporal dynamics
    • Data Format: Available in both MATLAB (.mat) and CSV formats
    • Evaluation Metrics:
      • Short-term: Root Mean Square Error (RMSE)
      • Long-term: Power Spectral Density matching with k=20, modes=100

    Evaluation Tasks

    The dataset supports 12 evaluation metrics (E1-E12) organized into 4 main task categories:

    Test 1: Forecasting (E1, E2)

    • Input: X1train (10000 × 1024)
    • Task: Forecast future 1000 timesteps
    • Metrics:
      • E1: Short-term RMSE on first k timesteps
      • E2: Long-term spectral matching on power spectral density

    Test 2: Noisy Data (E3, E4, E5, E6)

    • Medium Noise (E3, E4): Train on X2train, reconstruct and forecast
    • High Noise (E5, E6): Train on X3train, reconstruct and forecast
    • Metrics: Reconstruction accuracy (RMSE) + Long-term forecasting (spectral)

    Test 3: Limited Data (E7, E8, E9, E10)

    • Noise-Free Limited (E7, E8): 100 snapshots in X4train
    • Noisy Limited (E9, E10): 100 snapshots in X5train
    • Metrics: Short and long-term forecasting from sparse data

    Test 4: Parametric Generalization (E11, E12)

    • Input: Three training trajectories (X6, X7, X8) at different parameter values
    • Task: Interpolate (E11) and extrapolate (E12) to new parameters
    • Burn-in: X9train and X10train provide initialization
    • Metrics: Short-term RMSE on parameter generalization

    Usage Notes

    1. Hidden Test Sets: The actual test data (X1test through X9test) are hidden and used only for evaluation on the CTF leaderboard
    2. Baseline Scores: Use constant zero prediction as the baseline reference (E_i = 0)
    3. Score Range: All scores are clipped to [-100, 100], where 100 represents perfect prediction
    4. Data Continuity: Start indices in YAML indicate temporal relationship between train/test splits
    5. Chaotic Dynamics: Long-term exact trajectory matching is impossible due to Lyapunov divergence; hence spectral metrics for climate forecasting
    6. File Formats: Choose .mat for MATLAB/Python (scipy) workflows or .csv for language-agnostic access
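
    As a quick start, here is a minimal Python sketch for loading a training matrix and computing the short-term RMSE metric described above. The file and variable names are assumptions, not confirmed parts of the distribution:

      import numpy as np
      from scipy.io import loadmat

      # Hypothetical file/variable names; inspect loadmat(...).keys() to confirm.
      X = loadmat("X1train.mat")["X1train"]    # expected shape: (10000, 1024)

      def short_term_rmse(forecast, truth, k=20):
          """RMSE over the first k forecast timesteps (rows are timesteps)."""
          err = np.asarray(forecast)[:k] - np.asarray(truth)[:k]
          return np.sqrt(np.mean(err ** 2))

      # The usage notes define a constant zero prediction as the baseline.
      baseline = np.zeros((1000, X.shape[1]))
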
  4. Performance measurements for "Bringing Order to Sparsity: A Sparse Matrix...

    • zenodo.org
    zip
    Updated Apr 17, 2023
    + more versions
    Cite
    James D. Trotter (2023). Performance measurements for "Bringing Order to Sparsity: A Sparse Matrix Reordering Study on Multicore CPUs" [Dataset]. http://doi.org/10.5281/zenodo.7821491
    Explore at: zip
    Dataset updated
    Apr 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    James D. Trotter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The paper "Bringing Order to Sparsity: A Sparse Matrix Reordering Study on Multicore CPUs" compares various strategies for reordering sparse matrices. The purpose of reordering is to improve performance of sparse matrix operations, for example, by reducing fill-in resulting from sparse Cholesky factorisation or improving data locality in sparse matrix-vector multiplication (SpMV). Many reordering strategies have been proposed in the literature and the current paper provides a thorough comparison of several of the most popular methods.

    This comparison is based on performance measurements that were collected on the eX3 cluster, a Norwegian experimental research infrastructure for the exploration of exascale computing. The measurements gathered in this data set concern the performance of two SpMV kernels with respect to 490 sparse matrices, 7 matrix orderings (the original plus 6 reorderings) and 8 multicore CPUs.

    Experimental results are provided in a human-readable, tabular format using plain-text ASCII. This format may be readily consumed by gnuplot to create plots or imported into commonly used spreadsheet tools for further analysis.

    Performance measurements are provided based on an SpMV kernel using the compressed sparse row (CSR) storage format with 7 matrix orderings. One file is provided for each of 8 multicore CPU systems considered in the paper:

    1. Skylake: csr_all_xeongold16q_032_threads_ss490.txt
    2. Ice Lake: csr_all_habanaq_072_threads_ss490.txt
    3. Naples: csr_all_defq_064_threads_ss490.txt
    4. Rome: csr_all_rome16q_016_threads_ss490.txt
    5. Milan A: csr_all_fpgaq_048_threads_ss490.txt
    6. Milan B: csr_all_milanq_128_threads_ss490.txt
    7. TX2: csr_all_armq_064_threads_ss490.txt
    8. Hi1620: csr_all_huaq_128_threads_ss490.txt

    A corresponding set of files and performance measurements are provided for a second SpMV kernel that is also studied in the paper.

    Each file consists of 490 rows and 54 columns. Each row corresponds to a different matrix from the SuiteSparse Matrix Collection (https://sparse.tamu.edu/). The first 5 columns specify some general information about the matrix, such as its group and name, as well as the number of rows, columns and nonzeros. Column 6 specifies the number of threads used for the experiment (which depends on the CPU). The remaining columns are grouped according to the 7 different matrix orderings that were studied, in the following order: original, Reverse Cuthill-McKee (RCM), Nested Dissection (ND), Approximate Minimum Degree (AMD), Graph Partitioning (GP), Hypergraph Partitioning (HP), and Gray ordering. For each ordering, the following 7 columns are given:


    1. Minimum number of nonzeros processed by any thread by the SpMV kernel
    2. Maximum number of nonzeros processed by any thread by the SpMV kernel
    3. Mean number of nonzeros processed per thread by the SpMV kernel
    4. Imbalance factor, which is the ratio of the maximum to the mean number of nonzeros processed per thread by the SpMV kernel
    5. Time (in seconds) to perform a single SpMV iteration; this was measured by taking the minimum out of 100 SpMV iterations performed
    6. Maximum performance (in Gflop/s) for a single SpMV iteration; this was measured by taking twice the number of matrix nonzeros and dividing by the minimum time out of 100 SpMV iterations performed.
    7. Mean performance (in Gflop/s) for a single SpMV iteration; this was measured by taking twice the number of matrix nonzeros and dividing by the mean time of the 97 last SpMV iterations performed (i.e., the first 3 SpMV iterations are ignored).

    The results in Fig. 1 of the paper show speedup (or slowdown) resulting from reordering with respect to 3 reorderings and 3 selected matrices. These results can be reproduced by inspecting the performance results that were collected on the Milan B and Ice Lake systems for the three matrices Freescale/Freescale2, SNAP/com-Amazon and GenBank/kmer_V1r. Specifically, the numbers displayed in the figure are obtained by dividing the maximum performance measured for the respective orderings (i.e., RCM, ND and GP) by the maximum performance measured for the original ordering.
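
    A minimal Python sketch of that calculation, assuming the column layout described above (5 matrix-info columns, 1 thread-count column, then 7 columns per ordering group with the maximum performance in the sixth); verify the offsets against the actual files:

      import pandas as pd

      df = pd.read_csv("csr_all_milanq_128_threads_ss490.txt",
                       sep=r"\s+", header=None)

      NGENERAL = 6    # 5 matrix-info columns plus 1 thread-count column
      PERGROUP = 7    # columns per ordering group
      MAXPERF = 5     # 0-based offset of "maximum performance" within a group

      def max_perf(group):
          # group 0 = original, 1 = RCM, 2 = ND, 3 = AMD, 4 = GP, 5 = HP, 6 = Gray
          return df.iloc[:, NGENERAL + PERGROUP * group + MAXPERF]

      # Speedup of RCM, ND and GP over the original ordering, one value per matrix
      speedups = {name: max_perf(g) / max_perf(0)
                  for name, g in (("RCM", 1), ("ND", 2), ("GP", 4))}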

    The results presented in Figs. 2 and 3 of the paper show the speedup of SpMV as a result of reordering for the two SpMV kernels considered in the paper. In this case, gnuplot scripts are provided to reproduce the figures from the data files described above.

  5. Data from: Sparse Biclustering of Transposable Data

    • tandf.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Kean Ming Tan; Daniela M. Witten (2023). Sparse Biclustering of Transposable Data [Dataset]. http://doi.org/10.6084/m9.figshare.1209699.v3
    Explore at: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Kean Ming Tan; Daniela M. Witten
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We consider the task of simultaneously clustering the rows and columns of a large transposable data matrix. We assume that the matrix elements are normally distributed with a bicluster-specific mean term and a common variance, and perform biclustering by maximizing the corresponding log-likelihood. We apply an ℓ1 penalty to the means of the biclusters to obtain sparse and interpretable biclusters. Our proposal amounts to a sparse, symmetrized version of k-means clustering. We show that k-means clustering of the rows and of the columns of a data matrix can be seen as special cases of our proposal, and that a relaxation of our proposal yields the singular value decomposition. In addition, we propose a framework for biclustering based on the matrix-variate normal distribution. The performances of our proposals are demonstrated in a simulation study and on a gene expression dataset. This article has supplementary material online.

  6. JanataHack: Machine Learning for IoT Dataset

    • kaggle.com
    zip
    Updated May 23, 2020
    Cite
    Shobhit Upadhyaya (2020). JanataHack: Machine Learning for IoT Dataset [Dataset]. https://www.kaggle.com/shobhitupadhyaya/janatahack-machine-learning-for-iot-dataset
    Explore at: zip (373167 bytes)
    Dataset updated
    May 23, 2020
    Authors
    Shobhit Upadhyaya
    Description

    Problem Statement

    You are working with the government to transform your city into a smart city. The vision is to convert it into a digital and intelligent city to improve the efficiency of services for the citizens. One of the problems faced by the government is traffic. You are a data scientist working to manage the traffic of the city better and to provide input on infrastructure planning for the future.

    The government wants to implement a robust traffic system for the city by being prepared for traffic peaks. They want to understand the traffic patterns of the four junctions of the city. Traffic patterns on holidays, as well as on various other occasions during the year, differ from normal working days. This is important to take into account for your forecasting.

    Data Dictionary

    Variable    Description
    ID          Unique ID
    DateTime    Hourly Datetime Variable
    Junction    Junction Type
    Vehicles    Number of Vehicles (Target)

    sample_submission.csv

    Variable    Description
    ID          Unique ID
    Vehicles    Number of Vehicles (Target)

    Your task

    To predict traffic patterns in each of these four junctions for the next 4 months.

    The sensors at each of these junctions were collecting data at different times, so you will see traffic data from different time periods. To add to the complexity, some of the junctions provide only limited or sparse data, which requires careful handling when creating future projections. Based on 20 months of historical data, the government is looking to you to deliver accurate traffic projections for the coming four months. Your algorithm will become the foundation of a larger transformation to make your city smart and intelligent.

    Evaluation Metric

    The evaluation metric for this competition is Root Mean Squared Error (RMSE).
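
    For reference, a minimal Python sketch of this metric (names are illustrative):

      import numpy as np

      def rmse(y_pred, y_true):
          """Root Mean Squared Error between predicted and actual vehicle counts."""
          y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
          return np.sqrt(np.mean((y_pred - y_true) ** 2))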

  7. Data from: ESLI: Enhancing slope one recommendation through local...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Oct 10, 2019
    Cite
    Min, Fan; Zhang, Heng-Ru; Yu, Xin-Chao; Ma, Yuan-Yuan (2019). ESLI: Enhancing slope one recommendation through local information embedding [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000162949
    Dataset updated
    Oct 10, 2019
    Authors
    Min, Fan; Zhang, Heng-Ru; Yu, Xin-Chao; Ma, Yuan-Yuan
    Description

    Slope one is a popular recommendation algorithm due to its simplicity and high efficiency on sparse data. However, it often suffers from under-fitting because the global information of all relevant users/items is considered. In this paper, we propose a new scheme called enhanced slope one recommendation through local information embedding. First, we employ clustering algorithms to obtain user clusters as well as item clusters to represent local information. Second, we predict ratings using the local information of users and items in the same cluster. The local information can detect strong localized associations shared within clusters. Third, we design different fusion approaches based on the local information embedding. In this way, both under-fitting and over-fitting problems are alleviated. Experimental results on real datasets show that our approaches outperform slope one in terms of both mean absolute error and root mean square error.
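
    For context, here is a minimal Python sketch of the standard weighted Slope One predictor that ESLI builds on; it illustrates only the base algorithm, not the local information embedding proposed in the paper:

      from collections import defaultdict

      def slope_one_predict(ratings, user, target):
          """Weighted Slope One. ratings: {user: {item: rating}}."""
          diff_sum, count = defaultdict(float), defaultdict(int)
          for r in ratings.values():
              if target in r:
                  for item, val in r.items():
                      if item != target:
                          diff_sum[item] += r[target] - val
                          count[item] += 1
          num = den = 0.0
          for item, val in ratings[user].items():
              if item != target and count[item]:
                  num += (val + diff_sum[item] / count[item]) * count[item]
                  den += count[item]
          return num / den if den else None

      users = {"u1": {"a": 5, "b": 3, "c": 2},
               "u2": {"a": 3, "b": 4},
               "u3": {"b": 2, "c": 5}}
      print(slope_one_predict(users, "u3", "a"))   # 13/3 ≈ 4.33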

  8. Sparse Partial Least Squares in Time Series for Macroeconomic Forecasting...

    • resodate.org
    Updated Oct 6, 2025
    Cite
    Julieta Fuentes (2025). Sparse Partial Least Squares in Time Series for Macroeconomic Forecasting (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9zcGFyc2UtcGFydGlhbC1sZWFzdC1zcXVhcmVzLWluLXRpbWUtc2VyaWVzLWZvci1tYWNyb2Vjb25vbWljLWZvcmVjYXN0aW5n
    Dataset updated
    Oct 6, 2025
    Dataset provided by
    Journal of Applied Econometrics
    ZBW Journal Data Archive
    ZBW
    Authors
    Julieta Fuentes
    Description

    Factor models have been applied extensively for forecasting when high-dimensional datasets are available. In this case, the number of variables can be very large. For instance, usual dynamic factor models in central banks handle over 100 variables. However, there is a growing body of literature indicating that more variables do not necessarily lead to estimated factors with lower uncertainty or better forecasting results. This paper investigates the usefulness of partial least squares techniques that take into account the variable to be forecast when reducing the dimension of the problem from a large number of variables to a smaller number of factors. We propose different dynamic sparse partial least squares approaches as a means of improving forecast efficiency by simultaneously taking into account the variable to be forecast while forming an informative subset of predictors, instead of using all the available ones to extract the factors. We use the well-known Stock and Watson database to check the forecasting performance of our approach. The proposed dynamic sparse models show good performance in improving efficiency compared to widely used factor methods in macroeconomic forecasting.

  9. HOMAGE Monthly Time series of global average steric height anomalies and...

    • podaac.jpl.nasa.gov
    • s.cnmilf.com
    • +4more
    html
    Updated May 26, 2022
    Cite
    PO.DAAC (2022). HOMAGE Monthly Time series of global average steric height anomalies and ocean heat content estimates from gridded in-situ ocean observations version 01 [Dataset]. http://doi.org/10.5067/HMSSO-4TJ01
    Explore at: html
    Dataset updated
    May 26, 2022
    Dataset provided by
    PO.DAAC
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 1978 - Present
    Variables measured
    SEA LEVEL
    Description

    The [HOMAGE_STERIC_OHC_TIME_SERIES_v01] dataset contains monthly global mean ocean heat content (OHC) anomalies as well as thermosteric, halosteric and total steric sea level anomalies computed from various gridded ocean data sets of subsurface temperature and salinity profiles as provided by different institutions: Scripps Institution of Oceanography (SIO); Institute of Atmospheric Physics (IAP); Barnes objective analysis (BOA from CSIO, MNR); Jamstec / Ishii et al. 2017 (I17); and Met Office Hadley Centre: EN4_c13, EN4_c14, EN4_g10, and EN4_I09. The data are averaged over the quasi-global ocean domain (i.e., where valid values are defined; note that gaps exist, in particular towards polar latitudes) at monthly intervals. The input profiling data (i.e., temperature and salinity profiles at depth levels), editing, quality flags and processing schemes vary across the different gridded products; please refer to the documentation for each institution’s data product for details. Since 2005, the profiling data are dominated by observations from the global Argo network (e.g., https://argo.ucsd.edu/), which comprises nearly 4000 active floats (as of 08/2022). Before 2005, non-Argo data such as XBT profiles were used, and the global ocean coverage was significantly sparser. Data sets from SIO and BOA are Argo-only, while the others also include other observations, such as expendable bathythermograph (XBT) and conductivity-temperature-depth (CTD) observations. The data files are an active forward stream and will be updated frequently as new observations are acquired by Argo and processed by the data centers.

  10. Data_Sheet_1_A Practical Guide to Sparse k-Means Clustering for Studying...

    • frontiersin.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Justin L. Balsor; Keon Arbabi; Desmond Singh; Rachel Kwan; Jonathan Zaslavsky; Ewalina Jeyanesan; Kathryn M. Murphy (2023). Data_Sheet_1_A Practical Guide to Sparse k-Means Clustering for Studying Molecular Development of the Human Brain.pdf [Dataset]. http://doi.org/10.3389/fnins.2021.668293.s001
    Explore at: pdf
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Frontiers
    Authors
    Justin L. Balsor; Keon Arbabi; Desmond Singh; Rachel Kwan; Jonathan Zaslavsky; Ewalina Jeyanesan; Kathryn M. Murphy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Studying the molecular development of the human brain presents unique challenges for selecting a data analysis approach. The rare and valuable nature of human postmortem brain tissue, especially for developmental studies, means that sample sizes are small (n), while high-throughput genomic and proteomic methods measure expression levels for hundreds or thousands of variables [e.g., genes or proteins (p)] for each sample. This leads to a data structure that is high dimensional (p ≫ n) and introduces the curse of dimensionality, which poses a challenge for traditional statistical approaches. In contrast, high dimensional analyses, especially cluster analyses developed for sparse data, have worked well for analyzing genomic datasets where p ≫ n. Here we explore applying a lasso-based clustering method developed for high dimensional genomic data with small sample sizes. Using protein and gene data from the developing human visual cortex, we compared clustering methods. We identified an application of sparse k-means clustering [robust sparse k-means clustering (RSKC)] that partitioned samples into age-related clusters that reflect lifespan stages from birth to aging. RSKC adaptively selects a subset of the genes or proteins contributing to partitioning samples into age-related clusters that progress across the lifespan. This approach addresses a problem in current studies that could not identify multiple postnatal clusters. Moreover, clusters encompassed a range of ages like a series of overlapping waves, illustrating that chronological age and brain age have a complex relationship. In addition, a recently developed workflow to create plasticity phenotypes (Balsor et al., 2020) was applied to the clusters and revealed neurobiologically relevant features that identified how the human visual cortex changes across the lifespan. These methods can help address the growing demand for multimodal integration, from molecular machinery to brain imaging signals, to understand the human brain’s development.

  11. Data from: Data-driven analysis of oscillations in Hall thruster simulations...

    • portaldelainvestigacion.uma.es
    Updated 2022
    Cite
    Davide Maddaloni; Adrián Domínguez Vázquez; Filippo Terragni; Mario Merino (2022). Data from: Data-driven analysis of oscillations in Hall thruster simulations & Data-driven sparse modeling of oscillations in plasma space propulsion [Dataset]. https://portaldelainvestigacion.uma.es/documentos/67a9c7ce19544708f8c73129
    Dataset updated
    2022
    Authors
    Davide Maddaloni; Adrián Domínguez Vázquez; Filippo Terragni; Mario Merino
    Description

    Data from: Data-driven analysis of oscillations in Hall thruster simulations

    • Authors: Davide Maddaloni, Adrián Domínguez Vázquez, Filippo Terragni, Mario Merino

    • Contact email: dmaddalo@ing.uc3m.es

    • Date: 2022-03-24

    • Keywords: higher order dynamic mode decomposition, hall effect thruster, breathing mode, ion transit time, data-driven analysis

    • Version: 1.0.4

    • Digital Object Identifier (DOI): 10.5281/zenodo.6359505

    • License: This dataset is made available under the Open Data Commons Attribution License

    Abstract

    This dataset contains the outputs of the HODMD algorithm and the original simulations used in the journal publication:

    Davide Maddaloni, Adrián Domínguez Vázquez, Filippo Terragni, Mario Merino, "Data-driven analysis of oscillations in Hall thruster simulations", 2022 Plasma Sources Sci. Technol. 31:045026. Doi: 10.1088/1361-6595/ac6444.

    Additionally, the raw simulation data is also employed in the following journal publication:

    Borja Bayón-Buján and Mario Merino, "Data-driven sparse modeling of oscillations in plasma space propulsion", 2024 Mach. Learn.: Sci. Technol. 5:035057. Doi: 10.1088/2632-2153/ad6d29

    Dataset description

    The simulations from which the data stem were produced using the full 2D hybrid PIC/fluid code HYPHEN, while the HODMD results were produced using an adaptation of the original HODMD algorithm with an improved amplitude calculation routine.

    Please refer to the relevant article for further details regarding any of the parameters and/or configurations.

    Data files

    The data files are in standard Matlab .mat format. A recent version of Matlab is recommended.

    The HODMD outputs are collected in 18 different files, subdivided into three groups, each referring to a different case. In the file names, "case1" refers to the nominal case, "case2" to the low-voltage case and "case3" to the high-mass-flow-rate case. The variables are referred to as follows:

    "n" for plasma density

    "Te" for electron temperature

    "phi" for plasma potential

    "ji" for ion current density (both single and double charged ones)

    "nn" for neutral density

    "Ez" for axial electric field

    "Si" for ionization production term

    "vi1" for single charged ions axial velocity

    In particular, the axial electric field, the ionization production term and the singly charged ion axial velocity are available only for the first case. These files have a cell structure: the first row contains the frequencies (in Hz), the second row contains the normalized modes (alongside their complex conjugates), the third row collects the growth rates (in 1/s), and the last row collects the (dimensionalized) amplitudes. Additionally, the time vector is given as "t", common to all cases and all variables.

    The raw simulation data are collected in 15 additional variables, following the same nomenclature as above, with the suffix "_raw" added to differentiate them from the HODMD outputs.
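
    A minimal Python sketch for reading one of these files; the file and variable names below are assumptions, so inspect loadmat(...).keys() to confirm the actual ones:

      from scipy.io import loadmat

      mat = loadmat("case1_n.mat")    # hypothetical file name
      cell = mat["n"]                 # hypothetical variable name (4-row cell array)
      freqs, modes, growth_rates, amplitudes = (cell[i] for i in range(4))
      t = mat.get("t")                # common time vector, per the description above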

    Citation

    Works using this dataset or any part of it in any form shall cite it as follows.

    The preferred means of citation is to reference the publication associated with this dataset, as soon as it is available.

    Optionally, the dataset may be cited directly by referencing the DOI: 10.5281/zenodo.6359505.

    Acknowledgments

    This work has been supported by the Madrid Government (Comunidad de Madrid) under the Multiannual Agreement with UC3M in the line of ‘Fostering Young Doctors Research’ (MARETERRA-CM-UC3M), and in the context of the V PRICIT (Regional Programme of Research and Technological Innovation). F. Terragni was also supported by the Fondo Europeo de Desarrollo Regional, Ministerio de Ciencia, Innovación y Universidades - Agencia Estatal de Investigación, under grants MTM2017-84446-C2-2-R and PID2020-112796RB-C22.

  12. Data Sheet 2_Comparing sparse inertial sensor setups for sagittal-plane...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Feb 19, 2025
    + more versions
    Cite
    Gassner, Heiko; Weygers, Ive; Mayer, Matthias; Seel, Thomas; Eskofier, Bjoern M.; Koelewijn, Anne D.; Nitschke, Marlies; Dorschky, Eva (2025). Data Sheet 2_Comparing sparse inertial sensor setups for sagittal-plane walking and running reconstructions.zip [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001297145
    Dataset updated
    Feb 19, 2025
    Authors
    Gassner, Heiko; Weygers, Ive; Mayer, Matthias; Seel, Thomas; Eskofier, Bjoern M.; Koelewijn, Anne D.; Nitschke, Marlies; Dorschky, Eva
    Description

    Estimating spatiotemporal, kinematic, and kinetic movement variables with little obtrusion to the user is critical for clinical and sports applications. One possible approach is using a sparse inertial sensor setup, where sensors are not placed on all relevant body segments. Here, we investigated whether movement variables can be estimated as accurately from sparse sensor setups as from a full lower-body sensor setup. We estimated the variables by solving optimal control problems with sagittal-plane lower-body musculoskeletal models, in which we minimized an objective that combined tracking of accelerometer and gyroscope data with minimizing muscular effort. We created simulations for 10 participants at three walking and three running speeds, using seven sensor setups with between two and seven sensors located at the feet, shanks, thighs, and/or pelvis. We found that differences between variables estimated from inertial sensors and those from optical motion capture were small for all sensor setups. Including all sensors did not necessarily lead to the smallest root mean square deviations (RMSDs) and highest coefficients of determination (R2). Setups without a pelvis sensor led to too much forward trunk lean and inaccurate spatiotemporal variables. Mean RMSDs were highest for the setup with two foot-worn inertial sensors (largest error in knee angle during running: 18 deg vs. 11 deg for the full lower-body setup), and ranged between 4.8 and 18 deg for the joint angles, between 1.0 and 5.4 BW BH% for the joint moments, and between 0.03 BW and 0.49 BW for the ground reaction forces. We found strong or moderate relationships (R2 > 0.5) on average for all kinematic and kinetic variables, except for the hip and knee moments for five out of the seven setups. The large range of the coefficient of determination for most kinetic variables indicated individual differences in simulation quality. Therefore, we conclude that a comprehensive sagittal-plane motion analysis can be performed as accurately with sparse sensor setups that place sensors on the feet and on either the pelvis or the thighs as with a full sensor setup. Such a sparse sensor setup enables comprehensive movement analysis outside the laboratory by increasing the usability of inertial sensors.

  13. Sparse change‐point VAR models (replication data)

    • resodate.org
    Updated Oct 6, 2025
    Cite
    Arnaud Dufays (2025). Sparse change‐point VAR models (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9zcGFyc2UtY2hhbmdlcG9pbnQtdmFyLW1vZGVscw==
    Dataset updated
    Oct 6, 2025
    Dataset provided by
    Journal of Applied Econometrics
    ZBW Journal Data Archive
    ZBW
    Authors
    Arnaud Dufays
    Description

    Change-point (CP) VAR models face a dimensionality curse due to the proliferation of parameters that arises when new breaks are detected. We introduce the Sparse CP-VAR model, which determines which parameters truly vary when a break is detected. By doing so, the number of new parameters to be estimated at each regime is drastically reduced and the break dynamics become easier to interpret. The Sparse CP-VAR model disentangles the dynamics of the mean parameters and the covariance matrix. The former uses CP dynamics with shrinkage prior distributions, while the latter is driven by an infinite hidden Markov framework. An extensive simulation study is carried out to compare our approach with existing ones. We provide applications to financial and macroeconomic systems. It turns out that many off-diagonal VAR parameters are zero for the entire sample period and that most break activity is in the covariance matrix. We show that this has important consequences for portfolio optimization, in particular when future instabilities are included in the predictive densities. Forecasting-wise, the Sparse CP-VAR model compares favorably to several time-varying parameter models in terms of density and point forecast metrics.

  14. Retail Market Basket Transactions Dataset

    • kaggle.com
    Updated Aug 25, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Wasiq Ali (2025). Retail Market Basket Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/retail-market-basket-transactions-dataset
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Wasiq Ali
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    The Market_Basket_Optimisation dataset is a classic transactional dataset often used in association rule mining and market basket analysis.
    It consists of multiple transactions where each transaction represents the collection of items purchased together by a customer in a single shopping trip.

    • File Name: Market_Basket_Optimisation.csv
    • Format: CSV (Comma-Separated Values)
    • Structure: Each row corresponds to one shopping basket. Each column in that row contains an item purchased in that basket.
    • Nature of Data: Transactional, categorical, sparse.
    • Primary Use Case: Discovering frequent itemsets and association rules to understand shopping patterns, product affinities, and to build recommender systems.

    Detailed Information

    📊 Dataset Composition

    • Transactions: 7,501 (each row = one basket).
    • Items (unique): Around 120 distinct products (e.g., bread, mineral water, chocolate, etc.).
    • Columns per row: Up to 20 possible items per row (not fixed; most rows contain fewer).
    • Data Type: Purely categorical (no numerical or continuous features).
    • Missing Values: Present in the form of empty cells (since not every basket has all 20 columns).
    • Duplicates: Some baskets may appear more than once — this is acceptable in transactional data as multiple customers can buy the same set of items.

    🛒 Nature of Transactions

    • Basket Definition: Each row captures items bought together during a single visit to the store.
    • Variability: Basket size varies from 1 to 20 items. Some customers buy only one product, while others purchase a full set of groceries.
    • Sparsity: Since there are ~120 unique items but only a handful appear in each basket, the dataset is sparse. Most entries in the one-hot encoded representation are zeros.

    🔎 Examples of Data

    Example transaction rows (simplified):

    Item 1          Item 2          Item 3      Item 4 ...
    Bread           Butter          Jam
    Mineral water   Chocolate       Eggs        Milk
    Spaghetti       Tomato sauce    Parmesan

    Here, empty cells mean no item was purchased in that slot.
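
    A minimal Python sketch, assuming the stated file name, for loading the baskets and building the sparse one-hot representation; from there, mlxtend's apriori and association_rules can mine the kinds of rules listed in the applications below:

      import csv
      import pandas as pd
      from mlxtend.preprocessing import TransactionEncoder

      # Read the ragged CSV: one basket per row, a variable number of items.
      with open("Market_Basket_Optimisation.csv", newline="") as f:
          baskets = [[item for item in row if item] for row in csv.reader(f)]

      # One-hot encode: 7,501 rows x ~120 boolean columns, mostly False (sparse).
      te = TransactionEncoder()
      onehot = pd.DataFrame(te.fit_transform(baskets), columns=te.columns_)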

    📈 Applications of This Dataset

    This dataset is frequently used in data mining, analytics, and recommendation systems. Common applications include:

    1. Association Rule Mining (Apriori, FP-Growth):

      • Discover rules like {Bread, Butter} ⇒ {Jam} with high support and confidence.
      • Identify cross-selling opportunities.
    2. Product Affinity Analysis:

      • Understand which items tend to be purchased together.
      • Helps with store layout decisions (placing related items near each other).
    3. Recommendation Engines:

      • Build systems that suggest "You may also like" products.
      • Example: If a customer buys pasta and tomato sauce, recommend cheese.
    4. Marketing Campaigns:

      • Bundle promotions and discounts on frequently co-purchased products.
      • Personalized offers based on buying history.
    5. Inventory Management:

      • Anticipate demand for certain product combinations.
      • Prevent stockouts of items that drive the purchase of others.

    📌 Key Insights Potentially Hidden in the Dataset

    • Popular Items: Some items (like mineral water, eggs, spaghetti) occur far more frequently than others.
    • Product Pairs: Frequent pairs and triplets (e.g., pasta + sauce + cheese) reflect natural meal-prep combinations.
    • Basket Size Distribution: Most customers buy fewer than 5 items, but a small fraction buy 10+ items, showing long-tail behavior.
    • Seasonality (if extended with timestamps): Certain items might show peaks in demand during weekends or holidays (though timestamps are not included in this dataset).

    📂 Dataset Limitations

    1. No Customer Identifiers:

      • We cannot track repeated purchases by the same customer.
      • Analysis is limited to basket-level insights.
    2. No Timestamps:

      • No temporal analysis (trends over time, seasonality) is possible.
    3. No Quantities or Prices:

      • We only know whether an item was purchased, not how many units or its cost.
    4. Sparse & Noisy:

      • Many baskets are small (1–2 items), which may produce weak or trivial rules.

    🔮 Potential Extensions

    • Synthetic Timestamps: Assign simulated timestamps to study temporal buying patterns.
    • Add Customer IDs: If merged with external data, one can perform personalized recommendations.
    • Price Data: Adding cost allows for profit-driven association rules (not just frequency-based).
    • Deep Learning Models: Sequence models (RNNs, Transformers) could be applied if temporal ordering of items is introduced.

    ...

  15. COBE-SST2 Sea Surface Temperature and Ice

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Oct 19, 2024
    Cite
    (Custodian) (2024). COBE-SST2 Sea Surface Temperature and Ice [Dataset]. https://catalog.data.gov/dataset/cobe-sst2-sea-surface-temperature-and-ice1
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    (Custodian)
    Description

    A new sea surface temperature (SST) analysis on a centennial time scale is presented. The dataset starts in 1850 with monthly 1°×1° means and is periodically updated. In this analysis, a daily SST field is constructed as a sum of a trend, interannual variations, and daily changes, using in situ SST and sea ice concentration observations. All SST values are accompanied by theory-based analysis errors as a measure of reliability. An improved equation is introduced to represent the ice-SST relationship, which is used to produce SST data from observed sea ice concentrations. Prior to the analysis, biases of individual SST measurement types are estimated for a homogenized long-term time series of global mean SST. Because metadata necessary for the bias correction are unavailable for many historical observational reports, the biases are determined so as to ensure consistency among existing SST and nighttime air temperature observations. The global mean SSTs with bias-corrected observations are in agreement with those of a previously published study, which adopted a different approach. Satellite observations are newly introduced for the purpose of reconstructing SST variability over data-sparse regions. Moreover, uncertainty in areal means of the present and previous SST analyses is investigated using the theoretical analysis errors and estimated sampling errors. The result confirms the advantages of the present analysis, and it is helpful in understanding the reliability of SST for a specific area and time period.

  16. Encoded shortest path sequences for NYC taxi trip

    • kaggle.com
    zip
    Updated Sep 8, 2017
    Cite
    Lem (2017). Encoded shortest path sequences for NYC taxi trip [Dataset]. https://www.kaggle.com/tongjiyiming/encoded-shortest-path-sequences-for-nyc-taxi-trip
    Explore at: zip (140239784 bytes)
    Dataset updated
    Sep 8, 2017
    Authors
    Lem
    License

    GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Area covered
    New York
    Description

    Get a close approximation of the real trip trace

    The NYC taxi trip data contain only start and end coordinates, which makes it hard to explore variation in road conditions. This dataset uses OSM road data and breaks it into small directed segments. Each segment runs from one intersection (node) to an adjacent intersection (node) and has a direction, so a two-way road yields two segments while a one-way road yields one.

    What you get

    Scipy's .npz format

    141,505 columns: each column encodes a small segment. Its value is a binary indicator: 1 means the taxi travels through that segment, 0 means it does not. As you can see, the result is a very sparse matrix.
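
    A minimal Python sketch for loading the matrix (the file name is an assumption):

      import scipy.sparse as sp

      # Hypothetical file name; works if the matrix was saved with
      # scipy.sparse.save_npz (otherwise try numpy.load on the .npz archive).
      paths = sp.load_npz("encoded_shortest_path_sequences.npz")
      print(paths.shape)    # (number of trips, 141505)
      density = paths.nnz / (paths.shape[0] * paths.shape[1])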

    Some insights

    This is inspired by ECML/PKDD 15: Taxi Trajectory Prediction. With a more accurate trajectory for each trip, we create a space in which information can be shared across many more trips. If we have only start and end points, the similarity of two trips depends solely on a clustering of those points, which we hope gives a reasonable similarity approximation (and which also depends heavily on how many clusters you define). With path sequences, however, we can see that two quite different trips may share common but important stretches of road, such as motorways. This is closer to real life. More importantly, we can then learn the condition of those road segments from many different trips, given a suitable machine learning algorithm. As with the winners of ECML/PKDD 15, this dataset allows deep learning to be applied.

    The original road data come from OSM. The libraries osmnx and networkx are used to store the road graph. Speed limit data come primarily from NYC's DOT. A shortest-path library in Java, developed by Arizona State University, computes the shortest paths using Dijkstra's algorithm; Pyjnius is used to call the Java library from Python. Some multithreaded code in both Python and Java speeds up the whole execution.

    The initial idea was to compute the top-K paths, to provide probabilistic information about the routes a taxi driver might take, but running Yen's top-K algorithm proved too slow.

    Time-dependent linkage might also help. However, linkage between different segments is not considered, since I have no idea how to map that information to a useful feature space.

    Note that this dataset uses exactly the same information as New York City Taxi with OSRM. The difference is that that dataset only gives the name of a road, whereas this dataset encodes each small segment. The total time from that dataset has also proved useful; unfortunately, my code did not record trip times. We will see if anyone asks.

    So, have fun with this dataset, Kagglers!


  17. Datasets Used to Create Generalized Potentiometric Maps of the Fort Union,...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 25, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Datasets Used to Create Generalized Potentiometric Maps of the Fort Union, Hell Creek, and Fox Hills Aquifers within the Standing Rock Indian Reservation [Dataset]. https://catalog.data.gov/dataset/datasets-used-to-create-generalized-potentiometric-maps-of-the-fort-union-hell-creek-and-f
    Dataset updated
    Nov 25, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Standing Rock Indian Reservation
    Description

    This data release includes text files of well data and shapefiles of potentiometric contours of the Fort Union, Hell Creek, and Fox Hills aquifers within the Standing Rock Indian Reservation. The data accompany a USGS scientific investigations map from Anderson and Lundgren (2024). The Standing Rock Sioux Tribe (the Tribe) and the U.S. Geological Survey (USGS) completed a comprehensive assessment of groundwater resources within the Standing Rock Indian Reservation. Generalized potentiometric surfaces of the Fort Union, Hell Creek, and Fox Hills aquifers were constructed to assess the groundwater resources of the reservation. Water-level data from the U.S. Geological Survey Groundwater Site Inventory (GWSI) database, the North Dakota Department of Water Resources (NDDWR), and the South Dakota Department of Agriculture and Natural Resources (SDDANR) were compiled and used to construct generalized potentiometric-surface maps representing average conditions of the Fort Union, Hell Creek, and Fox Hills Formations. For wells with more than one water-level measurement, the mean of the measurements was used. Recorded depths to water were converted to hydraulic head by subtracting the depth to water from the land-surface elevation at the well location. Hydraulic-head values were spatially interpolated to create 2-dimensional potentiometric surfaces. The interpolated potentiometric surfaces were contoured using contour intervals of 50 ft and smoothed to correct for extreme changes in the potentiometric surfaces in areas of sparse data.
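
    A minimal Python sketch of the head computation and gridding steps described above, with illustrative values in place of the actual well records:

      import numpy as np
      from scipy.interpolate import griddata

      # Hypothetical well records: x, y, land-surface elevation (ft), depth to water (ft)
      x = np.array([0.0, 1.0, 2.0, 1.5])
      y = np.array([0.0, 2.0, 1.0, 0.5])
      land_surface = np.array([1650.0, 1700.0, 1620.0, 1660.0])
      depth_to_water = np.array([120.0, 95.0, 140.0, 110.0])

      # Hydraulic head = land-surface elevation minus depth to water
      head = land_surface - depth_to_water

      # Spatially interpolate the heads onto a regular grid for contouring
      gx, gy = np.meshgrid(np.linspace(0, 2, 50), np.linspace(0, 2, 50))
      surface = griddata((x, y), head, (gx, gy), method="linear")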

  18. Vital Signs: Transit Ridership – by operator

    • data.bayareametro.gov
    • open-data-demo.mtc.ca.gov
    csv, xlsx, xml
    Updated Jun 29, 2022
    + more versions
    Cite
    Federal Transit Administration: National Transit Database (2022). Vital Signs: Transit Ridership – by operator [Dataset]. https://data.bayareametro.gov/widgets/htq2-hngn?mobile_redirect=true
    Explore at: xml, csv, xlsx
    Dataset updated
    Jun 29, 2022
    Dataset authored and provided by
    Federal Transit Administration: National Transit Database
    Description

    VITAL SIGNS INDICATOR Transit Ridership (T11)

    FULL MEASURE NAME Daily transit boardings

    LAST UPDATED May 2017

    DESCRIPTION Transit ridership refers to the number of passenger boardings on public transportation, which includes buses, rail systems and ferries. The dataset includes metropolitan area, regional, mode and operator tables for total typical weekday boardings.

    DATA SOURCE Federal Transit Administration: National Transit Database http://www.ntdprogram.gov/ntdprogram/data.htm

    CONTACT INFORMATION vitalsigns.info@mtc.ca.gov

    METHODOLOGY NOTES (across all datasets for this indicator) The NTD dataset was lightly cleaned to correct for erroneous zero values, in which null values (unsubmitted data) were incorrectly marked as zeroes. Paratransit data are sparse in the early years of the NTD dataset, meaning that transit ridership estimates in the early 1990s are likely underestimated. The various bus modes (e.g., rapid bus, express bus, local bus) were aggregated into a single mode to avoid incorrect conclusions resulting from mode recoding over the lifespan of the NTD.

    2016 data should be considered preliminary, as it comes from the monthly data tables rather than the longer-term time-series dataset. Weekday ridership is calculated by taking the total annual ridership and dividing by 300, an assumption which is consistent with MTC travel modeling procedures; it was also compared to observed weekday boarding data (which is more limited in availability) to ensure consistency on the regional level. Per-capita transit ridership is calculated for the operator's general service area or taxation district; for example, BART includes the three core counties (San Francisco, Alameda, and Contra Costa) as well as northern San Mateo County post-SFO extension and AC Transit includes the cities located within its service area. For other metro areas, operators were identified by developing a list of all urbanized areas within a current MSA boundary and then using that UZA list to flag relevant operators; this means that all operators (both large and small) were included in the metro comparison data.

  19. Data from: A dynamically consistent gridded data set of the global,...

    • doi.pangaea.de
    html, tsv
    Updated May 14, 2018
    Cite
    Charlotte Breitkreuz; André Paul; Takasumi Kurahashi-Nakamura; Martin Losch; Michael Schulz (2018). A dynamically consistent gridded data set of the global, monthly-mean oxygen isotope ratio of seawater, link to NetCDF files [Dataset]. http://doi.org/10.1594/PANGAEA.889922
    Explore at: tsv, html
    Dataset updated
    May 14, 2018
    Dataset provided by
    PANGAEA
    Authors
    Charlotte Breitkreuz; André Paul; Takasumi Kurahashi-Nakamura; Martin Losch; Michael Schulz
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Variables measured
    File name, File size, File format, Uniform resource locator/link to file
    Description

    We present a dynamically consistent gridded data set of the global, monthly-mean oxygen isotope ratio of seawater (δ¹⁸Osw). The data set is created from an optimized simulation of an ocean general circulation model constrained by global monthly δ¹⁸Osw data collected from 1950 until 2011 and climatological salinity and temperature data collected from 1951 to 1980. The optimization was obtained using the adjoint method for variational data assimilation, which yields a simulation that is consistent with the observational data and the physical laws incorporated in the model. Our data set performs as well as a previous data set in terms of model-data misfit and brings an improvement in terms of physical consistency and a seasonal cycle. The data assimilation method shows high potential for interpolating sparse data sets in a physically meaningful way. Comparatively large errors, however, are found in our data set at the surface levels of the Arctic Ocean, mainly because isotopically highly depleted precipitation has no influence on the ocean in areas with sea ice, and because of the low model resolution. […]

  20. Supplement to "Discriminating non-stationary flood hazard effects via...

    • zenodo.org
    zip
    Updated Aug 20, 2024
    Cite
    Cherry Ringor; Richard Cornelio (2024). Supplement to "Discriminating non-stationary flood hazard effects via probabilistic estimation of sparse residuals from rescued and rated stage–discharge data" [Dataset]. http://doi.org/10.5281/zenodo.12540236
    Explore at: zip
    Dataset updated
    Aug 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Cherry Ringor; Richard Cornelio
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 26, 2024
    Description

    This supplement contains the data and script associated with “Sparse hydrometric data rescue for exploratory analyses of conveyance-driven flood hazard trends and controls,” which has been submitted to a journal for consideration. This supplement is deposited on Zenodo.

    Code, Data, and Attribute Descriptions

    Uploaded are five directories (indicated in bold font, with their contents detailed below) containing input data and various outputs of the analyses performed for our case study on the Pulangi River at Lumayong (Philippines). Please consider this description equivalent to an omnibus README file for the deposited files. Note that we collected and rescued the hydrometric data herein from the archives of the Water Projects Division (WPD) of the Philippine Department of Public Works and Highways (DPWH).

    Pln_R_code contains (1) Pln_RC.R, the R script file, and (2) 240601_Pln_RC.RData, which stores the objects generated by the script during its last execution (on 1 June 2024). The script is admittedly long (covering data wrangling, formal analyses, and the figures used in the manuscript) and would have been better split into multiple .R files; apologies. Please be guided by the outline and the comments, and kindly reach out should any issues with the code arise.
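
    As a starting point, a minimal sketch for restoring the saved objects, assuming the deposited files sit in the current working directory:

        # Restore the objects generated by the script's last execution.
        load("240601_Pln_RC.RData")
        ls()                  # list the restored objects
        # source("Pln_RC.R")  # or step through the script section by section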

    Pln_GH_PDFs contains the scanned gaugekeeper’s reports of gauge heights (“stages”).

    The name of each file encodes the period covered. For example, Pln-GH-1100.pdf contains daily stage readings for all months of 2011; trailing zeroes indicate that the file covers the whole year. If the last two digits are not zeroes, as in Pln-GH-1204.pdf, they refer to a month of that year; in this case, the PDF file contains stage data for April 2012.
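
    A small helper, not part of the deposited script, that decodes this naming convention might look as follows (it assumes all GH files fall in the 2000s):

        # Decode "Pln-GH-YYMM.pdf": MM == "00" means the file covers
        # every month of year 20YY.
        parse_gh_name <- function(filename) {
          code <- sub("^Pln-GH-(\\d{4})\\.pdf$", "\\1", filename)
          year <- 2000 + as.integer(substr(code, 1, 2))
          month <- as.integer(substr(code, 3, 4))
          list(year = year,
               month = if (month == 0) "all months" else month.name[month])
        }

        parse_gh_name("Pln-GH-1100.pdf")  # year 2011, all months
        parse_gh_name("Pln-GH-1204.pdf")  # year 2012, April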

    Each scanned sheet contains sub-daily (with readings at “AM,” “NOON,” and “PM”) and mean daily stages for a given month.

    Also indicated are the gaugekeeper’s remarks on the daily weather (e.g., “fair,” “cloudy”). Under inclement weather, the gaugekeeper would note the duration of rainfall and its intensity and might, at times, record extra stage readings.

    Pln_DM_PDFs contains the following PDF files and a sub-directory:

    The Pln-DM- files contain the scanned logs of the direct stage–discharge measurements, otherwise known as “gaugings.” Each file is named according to the date of gauging (YYMMDD). For example, the log for the gauging performed on 24 February 2011 can be found in the file Pln-DM-110224.pdf. Data from these gaugings are summarized in Pln-DM-filtered.csv in the Pln_In_CSVs directory.
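
    A companion helper, again not part of the deposited script, could recover the gauging date from these file names (assuming two-digit years 83 through 99 map to the 1900s, consistent with the 1983–2020 gauging record):

        # Decode "Pln-DM-YYMMDD.pdf" into a Date object.
        parse_dm_name <- function(filename) {
          code <- sub("^Pln-DM-(\\d{6})\\.pdf$", "\\1", filename)
          yy <- as.integer(substr(code, 1, 2))
          year <- ifelse(yy >= 83, 1900 + yy, 2000 + yy)
          as.Date(sprintf("%d-%s-%s", year, substr(code, 3, 4),
                          substr(code, 5, 6)))
        }

        parse_dm_name("Pln-DM-110224.pdf")  # "2011-02-24"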

    The readability of each file varies with the condition of the original paper records.

    The first page in each file contains, on the left side, a summary of the gauging data and metadata (e.g., date of measurement, number of gauging verticals or “sections” used, method of crossing or measuring the cross-section), and, on the right side, the velocity–area readings at each gauging vertical.

    The second page in each file contains the plotted cross-section of the channel at the time of measurement.

    Pln-RC.pdf contains various other paper-format data and some annotations relevant to the historical rating at the station.

    P. 1: The hydrographic engineer’s comment (dated 19 January 2012) on the evaluation of the discharge data, specifically detailing the periods of validity of the rating curves developed for different sub-periods of monitoring.

    PP. 2–3: A summary table of all the gaugings performed from 1983 to 2010 whose logs were no longer retrievable (and hence not included as DM-PDF files in this directory).

    PP. 4–5: Plots of the official stage–discharge rating curves developed by DPWH hydrographers.

    PP. 6–8: Rating tables used to convert daily mean stages to deterministic discharge estimates. Two of these rating tables were digitized and can be found in the Pln_Rating_Tables sub-directory in Pln_In_CSVs.

    PP. 9–22: Daily stage (m) and its corresponding daily discharge (L/s) for the 2004–2010 sub-period. The stages were digitized and included in Pln-H_arch.csv in the Pln_In_CSVs directory.

    Pln_DM_unused contains the PDF files of logs (including data and metadata) corresponding to the gaugings that were excluded from our analysis following our filtering step for gauging location consistency.

    Pln_In_CSVs contains the following CSV files and two sub-directories; these files were used as inputs to the R script for formal analyses:

    Pln-H_arch.csv contains the mean daily stage values [“H_bar_arch”] (m) for every day in the 2004–2020 sub-period [“Date”] (YYYY-MM-DD). The stages in this file are already corrected for gross errors.

    Pln-DM-filtered.csv contains the following information on the gaugings performed on the Pulangi at Lumayong (1983–2020): (i) date [“Date”] (YYYY-MM-DD); (ii) stage [“H”] (m); (iii) discharge in L/s [“Q_lps”] and m³/s [“Q_cms”]; (iv) wetted area [“A_sqm”] (m²); (v) mean flow velocity [“Vel_mps”] (m/s); (vi) channel width [“W_m”] (m); (vii) mean flow depth [“D_ave_m”] (m); (viii) location of the measurement cross-section with respect to the staff gauge, with negative values meaning downstream of the gauge and positive values meaning upstream of the gauge [“XS_loc_wrt_gage”]; (ix) maximum flow depth [“Max_Depth_m”] (m); and (x) minimum streambed elevation [“MINSBE”] (m). Note that this file includes only gaugings that passed our filtering step for measurement location consistency.
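
    A minimal sketch for reading this table and sanity-checking its internal consistency; the path is assumed relative to the supplement root, and Q = A × V is the standard hydraulic identity relating the columns above:

        # Read the filtered gauging summary and check that discharge
        # approximately equals wetted area times mean velocity.
        dm <- read.csv("Pln_In_CSVs/Pln-DM-filtered.csv")

        summary(with(dm, A_sqm * Vel_mps) - dm$Q_cms)  # residuals near zero
        all.equal(dm$Q_lps, 1000 * dm$Q_cms)           # unit cross-check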

    Pln-DM_oview.csv contains information on the temporal coverage (bounded by “Start_date” and “End_date”) of the available hydrometric data from the station archives.

    Pln-H_sample_GrossE_corr.csv contains a sample sub-period (2017-06-15 through 2017-11-30) and the corresponding values of stages (m), uncorrected [“H_uncorrected”] and corrected for gross errors [“H_corrected”]. This CSV file was used as an input to the R script to produce one of the figures in the manuscript.

    Pln-XS-csv.csv contains data on the gauging transects: gauging ID [“GaugingID”]; date [“Date”] (YYYY-MM-DD); lateral distance from a fixed initial point [“Lat_distance”] (m); width of the gauging vertical [“Width_vert..m.”] (m); depth relative to the water surface [“Depth..m.”] (m); stage at the gauging vertical [“H_m_vert..m.”] (m); and elevation with respect to a fixed arbitrary datum at the station [“Elev..m.”] (m).

    Pln_Rating_Tables contains two rating tables, Pln-DM - RatingTable_A.csv and Pln-DM - RatingTable_B.csv, prepared and used by DPWH hydrographers for converting stage values to deterministic discharge estimates for the Pulangi River at Lumayong for the 1980s–early 2000s sub-period. Each rating table contains columns for stage [“H”], discharge in L/s [“Q_lps”], and discharge in m³/s [“Q_cms”].
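
    To illustrate how such a table converts stage to discharge, here is a minimal sketch using linear interpolation (an assumption; agency rating tables are often derived from fitted power-law curves), with the path assumed relative to the supplement root:

        # Load one digitized rating table and interpolate discharge for
        # arbitrary stage values.
        rt <- read.csv("Pln_In_CSVs/Pln_Rating_Tables/Pln-DM - RatingTable_A.csv")

        rate_stage <- function(h) approx(rt$H, rt$Q_cms, xout = h)$y
        rate_stage(c(1.2, 2.5, 3.8))  # deterministic estimates in m³/s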

    Pln_Q_rated contains two CSV files: Pulangi.csv has the columns “YEAR”, “DAY”, and every month of the year [“JAN” through “DEC”] for the 1983–2003 sub-period, with the values under each month column indicating the deterministic discharge estimates in L/s; Pulangi_trunc.csv contains similarly formatted data, but for the 2009–2010 sub-period.
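
    A minimal base-R sketch for reshaping this wide layout into a long daily series (path assumed relative to the supplement root):

        # Reshape the wide table (YEAR, DAY, JAN ... DEC) into one row
        # per day, then build calendar dates and drop impossible ones.
        q <- read.csv("Pln_Q_rated/Pulangi.csv")
        months <- toupper(month.abb)  # "JAN" ... "DEC"

        long <- reshape(q, direction = "long", varying = list(months),
                        v.names = "Q_lps", timevar = "MONTH", times = 1:12)
        long$Date <- as.Date(sprintf("%04d-%02d-%02d",
                                     long$YEAR, long$MONTH, long$DAY))
        long <- long[!is.na(long$Date), ]  # e.g., drops 30 February
        long <- long[order(long$Date), c("Date", "Q_lps")]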

    Pln_Out_CSVs contains two CSV files, the […]
