100+ datasets found
  1. Discovered Process Models from Noisy Logs

    • figshare.unimelb.edu.au
    zip
    Updated Nov 28, 2025
    Cite
    Anandi Karunaratne; Artem Polyvyanyy; Alistair Moffat (2025). Discovered Process Models from Noisy Logs [Dataset]. http://doi.org/10.26188/30739082.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Nov 28, 2025
    Dataset provided by
    The University of Melbourne
    Authors
    Anandi Karunaratne; Artem Polyvyanyy; Alistair Moffat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of three folders:

    • Systems: Three public event logs: Sepsis Cases, RTFMS, and BPIC 2012.
    • Logs: Clean and noisy logs derived from the base systems. From each base log, we created samples of seven sizes (1000, 2000, 4000, 10000, 20000, 40000, 100000 traces) using sampling with replacement, yielding 21 clean logs. Noise was then added using $\snip$ across seven intensity levels (0.1%, 0.2%, 0.4%, 1.0%, 2.0%, 4.0%, 10.0%) and five noise types (absence, insertion, ordering, substitution, mixed); percentages refer to the number of trace-level injections. Each configuration was repeated five times, producing 3,675 noisy logs and a total of 3,696 logs.
    • Models: Discovered models for all clean logs and a random subset of noisy logs (incomplete), using the Alpha, Heuristics, and Inductive miners.
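
    As a rough illustration of how this generation grid multiplies out, the following Python sketch enumerates the sampling and noise configurations; the base-log names are taken from the description above, while the helper function and seeds are placeholders rather than the authors' actual tooling:

      import itertools
      import random

      BASE_LOGS = ["Sepsis Cases", "RTFMS", "BPIC 2012"]         # the three base systems
      SIZES = [1000, 2000, 4000, 10000, 20000, 40000, 100000]    # traces per clean sample
      LEVELS = [0.001, 0.002, 0.004, 0.01, 0.02, 0.04, 0.10]     # trace-level injection rates
      TYPES = ["absence", "insertion", "ordering", "substitution", "mixed"]
      REPEATS = 5

      def sample_with_replacement(traces, size, seed):
          """Draw a clean log of `size` traces, sampling with replacement."""
          rng = random.Random(seed)
          return [rng.choice(traces) for _ in range(size)]

      # Clean logs: one per (base log, size) pair.
      clean = list(itertools.product(BASE_LOGS, SIZES))
      assert len(clean) == 21

      # Noisy logs: every clean log crossed with intensity, type, and repetition.
      noisy = list(itertools.product(clean, LEVELS, TYPES, range(REPEATS)))
      assert len(noisy) == 3675

      print(len(clean) + len(noisy))   # 3696 logs in total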

  2. 2010 Census Production Settings Redistricting Data (P.L. 94-171)...

    • icpsr.umich.edu
    • registry.opendata.aws
    Updated Nov 10, 2023
    + more versions
    Cite
    Abowd, John M.; Ashmead, Robert; Cumings-Menon, Ryan; Garfinkel, Simson; Heineck, Micah; Heiss, Christine; Johns, Robert; Kifer, Daniel; Leclerc, Philip; Machanavajjhala, Ashwin; Moran, Brett; Sexton, William; Spence, Matthew; Zhuravlev, Pavel (2023). 2010 Census Production Settings Redistricting Data (P.L. 94-171) Demonstration Noisy Measurement File [Dataset]. http://doi.org/10.3886/ICPSR38777.v2
    Explore at:
    Dataset updated
    Nov 10, 2023
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Abowd, John M.; Ashmead, Robert; Cumings-Menon, Ryan; Garfinkel, Simson; Heineck, Micah; Heiss, Christine; Johns, Robert; Kifer, Daniel; Leclerc, Philip; Machanavajjhala, Ashwin; Moran, Brett; Sexton, William; Spence, Matthew; Zhuravlev, Pavel
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/38777/terms

    Time period covered
    2010
    Area covered
    United States
    Description

    The 2010 Census Production Settings Redistricting Data (P.L. 94-171) Demonstration Noisy Measurement Files are an intermediate output of the 2020 Census Disclosure Avoidance System (DAS) TopDown Algorithm (TDA) (as described in Abowd, J., et al. [2022], and implemented in https://github.com/uscensusbureau/DAS_2020_Redistricting_Production_Code). The NMF was produced using the official "production settings," the final set of algorithmic parameters and privacy-loss budget allocations that were used to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File and the 2020 Census Demographic and Housing Characteristics File. The NMF consists of the full set of privacy-protected statistical queries (counts of individuals or housing units with particular combinations of characteristics) of confidential 2010 Census data relating to the redistricting data portion of the 2010 Demonstration Data Products Suite - Redistricting and Demographic and Housing Characteristics File - Production Settings (2023-04-03). These statistical queries, called "noisy measurements," were produced under the zero-Concentrated Differential Privacy framework (Bun, M. and Steinke, T. [2016]; see also Dwork, C. and Roth, A. [2014]) implemented via the discrete Gaussian mechanism (Canonne, C., et al. [2023]), which added positive or negative integer-valued noise to each of the resulting counts. The noisy measurements are an intermediate stage of the TDA, prior to the post-processing that the TDA then performs to ensure internal and hierarchical consistency within the resulting tables. The Census Bureau has released these 2010 Census demonstration data to enable data users to evaluate the expected impact of disclosure avoidance variability on 2020 Census data. The 2010 Census Production Settings Redistricting Data (P.L. 94-171) Demonstration Noisy Measurement Files (2023-04-03) have been cleared for public dissemination by the Census Bureau Disclosure Review Board (CBDRB-FY22-DSEP-004). The data include zero-Concentrated Differentially Private (zCDP) (Bun, M. and Steinke, T. [2016]) noisy measurements, implemented via the discrete Gaussian mechanism. These are estimated counts of individuals and housing units included in the 2010 Census Edited File (CEF), which includes confidential data initially collected in the 2010 Census of Population and Housing. The noisy measurements included in this file were subsequently post-processed by the TopDown Algorithm (TDA) to produce the 2010 Census Production Settings Privacy-Protected Microdata File - Redistricting (P.L. 94-171) and Demographic and Housing Characteristics File (2023-04-03) (https://www2.census.gov/programs-surveys/decennial/2020/program-management/data-product-planning/2010-demonstration-data-products/04-Demonstration_Data_Products_Suite/2023-04-03/). As these 2010 Census demonstration data are intended to support study of the design and expected impacts of the 2020 Disclosure Avoidance System, the 2010 CEF records were pre-processed before application of the zCDP framework. This pre-processing converted the 2010 CEF records into the input-file format, response codes, and tabulation categories used for the 2020 Census, which differ in substantive ways from the format, response codes, and tabulation categories originally used for the 2010 Census.
    The NMF provides estimates of counts of persons in the CEF by various characteristics and combinations of characteristics, including their reported race and ethnicity, whether they were of voting age, whether they resided in a housing unit or one of 7 group quarters types, and their census block of residence, after the addition of discrete Gaussian noise (with the scale parameter determined by the privacy-loss budget allocation for that particular query under zCDP). Noisy measurements of the counts of occupied and vacant housing units by census block are also included. Lastly, data on constraints -- information into which no noise was infused by the Disclosure Avoidance System (DAS) and which the TDA used to post-process the noisy measurements into the 2010 Census Production Settings Privacy-Protected Microdata File - Redistricting (P.L. 94-171) and Demographic and Housing Characteristics File (2023-04-03) -- are provided. These data are available for download (i.e., not restricted access). Due to their size, they must be downloaded through the link on this page.
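
    For intuition about how such integer-valued noise can be generated, here is a minimal Python sketch of a discrete Gaussian sampler in the spirit of Canonne et al. [2023] (rejection sampling from a discrete Laplace proposal); it is an illustration under simplified assumptions, not the Census Bureau's production implementation:

      import math
      import random

      def geometric0(t, rng):
          """Geometric sample on {0, 1, 2, ...} with success probability 1 - exp(-1/t)."""
          u = 1.0 - rng.random()                 # u in (0, 1], avoids log(0)
          return int(math.floor(-t * math.log(u)))

      def discrete_laplace(t, rng):
          """Discrete Laplace, pmf proportional to exp(-|y|/t), as a difference of geometrics."""
          return geometric0(t, rng) - geometric0(t, rng)

      def discrete_gaussian(sigma, rng=random):
          """Discrete Gaussian on the integers via rejection from a discrete Laplace proposal."""
          t = math.floor(sigma) + 1
          while True:
              y = discrete_laplace(t, rng)
              accept_p = math.exp(-((abs(y) - sigma**2 / t) ** 2) / (2.0 * sigma**2))
              if rng.random() < accept_p:
                  return y

      # A "noisy measurement": positive or negative integer noise added to a count.
      true_count = 1234                                           # hypothetical confidential count
      noisy_count = true_count + discrete_gaussian(sigma=10.0)    # sigma set by the budget allocation
      print(noisy_count)   # may be negative; TDA post-processing restores consistency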

  3. Noise control areas

    • opendata.vancouver.ca
    csv, excel, geojson +1
    Updated Mar 8, 2019
    Cite
    (2019). Noise control areas [Dataset]. https://opendata.vancouver.ca/explore/dataset/noise-control-areas/
    Explore at:
    geojson, excel, json, csv (available download formats)
    Dataset updated
    Mar 8, 2019
    License

    https://opendata.vancouver.ca/pages/licence/

    Description

    This dataset contains the boundaries of areas where noise levels are limited by City bylaws.

    Data currency: The extract for this dataset is updated weekly. There may be no change in data content from one week to the next because there is no change in the source data. Priorities and resources will also determine how fast a change in reality is reflected in the database.

    Data accuracy: These boundaries follow street and/or lane centrelines, so their placement in the street right-of-way is approximate.

    Websites for further information: Manage noise

  4. Data from: Machine-Learning-Based Data Analysis Method for Cell-Based...

    • acs.figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    Rui Hou; Chao Xie; Yuhan Gui; Gang Li; Xiaoyu Li (2023). Machine-Learning-Based Data Analysis Method for Cell-Based Selection of DNA-Encoded Libraries [Dataset]. http://doi.org/10.1021/acsomega.3c02152.s002
    Explore at:
    xlsx (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Rui Hou; Chao Xie; Yuhan Gui; Gang Li; Xiaoyu Li
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    A DNA-encoded library (DEL) is a powerful ligand discovery technology that has been widely adopted in the pharmaceutical industry. DEL selections are typically performed with a purified protein target immobilized on a matrix or in solution phase. Recently, DELs have also been used to interrogate targets in complex biological environments, such as membrane proteins on live cells. However, due to the complex landscape of the cell surface, the selection inevitably involves significant nonspecific interactions, and the selection data are much noisier than those obtained with purified proteins, making reliable hit identification highly challenging. Researchers have developed several approaches to denoise DEL datasets, but it remains unclear whether they are suitable for cell-based DEL selections. Here, we report the proof of principle of a new machine-learning (ML)-based approach to process cell-based DEL selection datasets by using a Maximum A Posteriori (MAP) estimation loss function, a probabilistic framework that can account for and quantify uncertainties of noisy data. We applied the approach to a DEL selection dataset, where a library of 7,721,415 compounds was selected against a purified carbonic anhydrase 2 (CA-2) and a cell line expressing the membrane protein carbonic anhydrase 12 (CA-12). The extended-connectivity fingerprint (ECFP)-based regression model using the MAP loss function was able to identify true binders and also reliable structure–activity relationships (SAR) from the noisy cell-based selection datasets. In addition, the regularized enrichment metric (known as MAP enrichment) could also be calculated directly without involving the specific machine-learning model, effectively suppressing low-confidence outliers and enhancing the signal-to-noise ratio. Future applications of this method will focus on de novo ligand discovery from cell-based DEL selections.
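
    The paper's exact loss function is not reproduced here, but the general shape of a MAP objective for noisy count data is easy to sketch: a likelihood term for the observed reads plus a prior that shrinks low-confidence enrichments toward zero. The following Python illustration assumes, for concreteness, a Poisson read-count likelihood and a Gaussian prior; it is a generic sketch, not the authors' model:

      import numpy as np

      def map_log_enrichment(pre_counts, post_counts, prior_var=1.0, iters=500, lr=0.02):
          """MAP estimate of per-compound log-enrichment under an illustrative model:
          post[i] ~ Poisson(pre[i] * exp(theta[i])), theta[i] ~ Normal(0, prior_var).
          The Gaussian prior shrinks estimates for compounds with few reads,
          suppressing low-confidence outliers in noisy selection data."""
          pre = np.asarray(pre_counts, dtype=float)
          post = np.asarray(post_counts, dtype=float)
          theta = np.zeros_like(pre)
          for _ in range(iters):
              lam = pre * np.exp(theta)
              grad = (lam - post) + theta / prior_var   # gradient of the negative log-posterior
              theta -= lr * grad
          return theta

      # 5 -> 40 reads yields a confident positive score; 1 -> 3 reads is shrunk toward zero.
      print(map_log_enrichment(pre_counts=[5, 1], post_counts=[40, 3]))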

  5. Data from: Learning to cope: vocal adjustment to urban noise is correlated...

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +2more
    zip
    Updated Jun 13, 2016
    Cite
    Stefanie E. LaZerte; Hans Slabbekoorn; Ken A. Otter (2016). Learning to cope: vocal adjustment to urban noise is correlated with prior experience in black-capped chickadees [Dataset]. http://doi.org/10.5061/dryad.669qn
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 13, 2016
    Dataset provided by
    Leiden University
    University of Northern British Columbia
    Authors
    Stefanie E. LaZerte; Hans Slabbekoorn; Ken A. Otter
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    British Columbia
    Description

    Urban noise can interfere with avian communication through masking, but birds can reduce this interference by altering their vocalizations. Although several experimental studies indicate that birds can rapidly change their vocalizations in response to sudden increases in ambient noise, none have investigated whether this is a learned response that depends on previous exposure. Black-capped chickadees (Poecile atricapillus) change the frequency of their songs in response to both fluctuating traffic noise and experimental noise. We investigated whether these responses to fluctuating noise depend on familiarity with noise. We confirmed that males in noisy areas sang higher-frequency songs than those in quiet areas, but found that only males in already-noisy territories shifted songs upwards in immediate response to experimental noise. Unexpectedly, males in quieter territories shifted songs downwards in response to experimental noise. These results suggest that chickadees may require prior experience with fluctuating noise to adjust vocalizations in such a way as to minimize masking. Thus, learning to cope may be an important part of adjusting to acoustic life in the city.

  6. Data from: Clustering High-Dimensional Noisy Categorical Data

    • tandf.figshare.com
    pdf
    Updated Jan 16, 2024
    Cite
    Zhiyi Tian; Jiaming Xu; Jen Tang (2024). Clustering High-Dimensional Noisy Categorical Data [Dataset]. http://doi.org/10.6084/m9.figshare.24925957.v1
    Explore at:
    pdf (available download formats)
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Zhiyi Tian; Jiaming Xu; Jen Tang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clustering is a widely used unsupervised learning technique that groups data into homogeneous clusters. However, when dealing with real-world data that contain categorical values, existing algorithms can be computationally costly in high dimensions and can struggle with noisy data that have missing values. Furthermore, except for one algorithm, no others provide theoretical guarantees of clustering accuracy. In this article, we propose a general categorical data encoding method and a computationally efficient spectral-based algorithm to cluster high-dimensional noisy categorical data (nominal or ordinal). Under a statistical model for data on $m$ attributes from $n$ subjects in $r$ clusters with missing probability $\epsilon$, we show that our algorithm exactly recovers the true clusters with high probability when $mn(1-\epsilon) \ge C M r^2 \log^3 M$, with $M = \max(n, m)$ and a fixed constant $C$. In addition, we show that $mn(1-\epsilon)^2 \ge r\delta/2$ with $0 <$
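
    As a rough illustration of the spectral approach described above (the paper's own encoding and algorithm are more refined and come with the stated guarantees), a generic pipeline one-hot encodes the categories, embeds the subjects in a rank-r singular subspace, and clusters there:

      import pandas as pd
      from sklearn.cluster import KMeans
      from sklearn.decomposition import TruncatedSVD

      def spectral_cluster_categorical(df, r, seed=0):
          """Illustrative pipeline: one-hot encode categorical attributes, project onto
          the top-r singular subspace, and run k-means in the embedded space."""
          Z = pd.get_dummies(df, dummy_na=False).to_numpy(dtype=float)  # missing values -> all-zero block
          emb = TruncatedSVD(n_components=r, random_state=seed).fit_transform(Z)
          return KMeans(n_clusters=r, n_init=10, random_state=seed).fit_predict(emb)

      # Toy example: n = 6 subjects, m = 3 categorical attributes, r = 2 clusters.
      df = pd.DataFrame({"a1": ["a", "a", "a", "b", "b", "b"],
                         "a2": ["x", "x", "y", "y", "y", "x"],
                         "a3": ["p", "q", "p", "q", "p", "q"]})
      print(spectral_cluster_categorical(df, r=2))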

  7. Data from: The HIWIRE database, a noisy and non-native English speech corpus...

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Nov 25, 2008
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2008). The HIWIRE database, a noisy and non-native English speech corpus for cockpit communication [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0293/
    Explore at:
    Dataset updated
    Nov 25, 2008
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    This database has been collected and packaged under the auspices of the IST-EU STREP project HIWIRE (Human Input that Works In Real Environments). The database was designed as a tool for the development and testing of speech processing and recognition techniques dealing with robust non-native speech recognition.

    The database contains 8,099 English utterances pronounced by non-native speakers (31 French, 20 Greek, 20 Italian, and 10 Spanish speakers). The collected utterances correspond to human input in a command-and-control aeronautics application. The data were recorded in a studio with a close-talking microphone, and real noise recorded in an airplane cockpit was artificially added to the data. The signals are provided in clean (studio recordings with a close-talking microphone), low-, mid-, and high-noise conditions. The three noise levels correspond approximately to signal-to-noise ratios of 10 dB, 5 dB, and -5 dB, respectively.

    Clean audio data were recorded in different office rooms using a close-talking microphone (Plantronics USB-45) to minimize ambient acoustic effects. The sampling frequency is 16 kHz and the data are stored in Windows PCM WAV 16-bit mono format.

    Recordings correspond to prompts extracted from an aeronautic command-and-control application. A total of 8,099 utterances were recorded, corresponding to 81 speakers pronouncing 100 utterances each. The speaker distribution is as follows:

    Country   # Speakers   # Utterances
    France    31 (38.3%)   3,100
    Greece    20 (24.7%)   2,000
    Italy     20 (24.7%)   2,000
    Spain     10 (12.3%)   999
    Total     81           8,099
    To generate the noisy utterances, the speech level is maintained and only the noise amplitude is modified to obtain the desired SNR. The noise amplitude is adjusted to obtain three different averaged SNR values of 10 dB, 5 dB, and -5 dB, which are referred to as the low-noise (LN), mid-noise (MN), and high-noise (HN) conditions. For each given condition the noise level remains constant. The speech data are PCM WAV files (16 kHz / 16 bits / mono) stored on one DVD. The total size is 3.03 GB for 33,053 files.
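
    The mixing rule described above (speech level fixed, noise scaled to hit a target average SNR) can be sketched as follows; the signals in the example are random placeholders, not corpus files:

      import numpy as np

      def mix_at_snr(speech, noise, snr_db):
          """Scale `noise` so the speech-to-noise power ratio equals `snr_db`,
          keeping the speech level unchanged, then mix."""
          p_speech = np.mean(np.square(speech, dtype=float))
          p_noise = np.mean(np.square(noise, dtype=float))
          gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
          return speech + gain * noise

      rng = np.random.default_rng(0)
      speech = rng.standard_normal(16000)    # placeholder for a 1 s utterance at 16 kHz
      noise = rng.standard_normal(16000)     # placeholder for cockpit noise
      for label, snr in [("LN", 10), ("MN", 5), ("HN", -5)]:
          noisy = mix_at_snr(speech, noise, snr)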

  8. FSDnoisy18k

    • zenodo.org
    • opendatalab.com
    • +1more
    zip
    Updated Jan 24, 2020
    Cite
    Eduardo Fonseca; Mercedes Collado; Manoj Plakal; Daniel P. W. Ellis; Frederic Font; Xavier Favory; Xavier Serra (2020). FSDnoisy18k [Dataset]. http://doi.org/10.5281/zenodo.2529934
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Fonseca; Mercedes Collado; Manoj Plakal; Daniel P. W. Ellis; Frederic Font; Xavier Favory; Xavier Serra
    Description

    FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

    Data curators

    Eduardo Fonseca and Mercedes Collado

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    Citation

    If you use this dataset or part of it, please cite the following ICASSP 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra, “Learning Sound Event Classifiers from Web Audio with Noisy Labels”, arXiv preprint arXiv:1901.01189, 2019

    You can also consider citing our ISMIR 2017 paper that describes the Freesound Annotator, which was used to gather the manual annotations included in FSDnoisy18k:

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, “Freesound Datasets: A Platform for the Creation of Open Audio Datasets”, In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    FSDnoisy18k description

    What follows is a summary of the most basic aspects of FSDnoisy18k. For a complete description of FSDnoisy18k, make sure to check the FSDnoisy18k companion site.

    FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

    The source of audio content is Freesound, a sound-sharing site created and maintained by the Music Technology Group that hosts over 400,000 clips uploaded by its community of users, who additionally provide some basic metadata (e.g., tags and title). The 20 classes of FSDnoisy18k are drawn from the AudioSet Ontology and are selected based on data availability as well as on their suitability for the study of label noise. The 20 classes are: "Acoustic guitar", "Bass guitar", "Clapping", "Coin (dropping)", "Crash cymbal", "Dishes, pots, and pans", "Engine", "Fart", "Fire", "Fireworks", "Glass", "Hi-hat", "Piano", "Rain", "Slam", "Squeak", "Tearing", "Walk, footsteps", "Wind", and "Writing". FSDnoisy18k was created with the Freesound Annotator, which is a platform for the collaborative creation of open audio datasets.

    We defined a clean portion of the dataset consisting of correct and complete labels. The remaining portion is referred to as the noisy portion. Each clip in the dataset has a single ground truth label (singly-labeled data).

    The clean portion of the data consists of audio clips whose labels are rated as present in the clip and predominant (almost all with full inter-annotator agreement), meaning that the label is correct and, in most cases, there is no additional acoustic material other than the labeled class. A few clips may contain some additional sound events, but they occur in the background and do not belong to any of the 20 target classes. This is more common for some classes that rarely occur alone, e.g., “Fire”, “Glass”, “Wind” or “Walk, footsteps”.

    The noisy portion of the data consists of audio clips that received no human validation. In this case, they are categorized on the basis of the user-provided tags in Freesound. Hence, the noisy portion features a certain amount of label noise.

    Code

    We've released the code for our ICASSP 2019 paper at https://github.com/edufonseca/icassp19. The framework comprises all the basic stages: feature extraction, training, inference, and evaluation. After loading the FSDnoisy18k dataset, log-mel energies are computed and a CNN baseline is trained and evaluated. The code also allows testing of four noise-robust loss functions. Please check our paper for more details.
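
    A minimal sketch of the log-mel front end mentioned above, using librosa; the analysis parameters and the file path are placeholders and may differ from those used in the paper:

      import librosa

      def log_mel(path, sr=44100, n_fft=2048, hop_length=512, n_mels=96):
          """Load a clip at the dataset's native rate and compute log-mel energies."""
          y, _ = librosa.load(path, sr=sr, mono=True)
          mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                               hop_length=hop_length, n_mels=n_mels)
          return librosa.power_to_db(mel)     # shape: (n_mels, n_frames)

      features = log_mel("FSDnoisy18k.audio_train/example.wav")   # hypothetical file name
      print(features.shape)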

    Label noise characteristics

    FSDnoisy18k features real label noise that is representative of audio data retrieved from the web, particularly from Freesound. The analysis of a per-class, random, 15% of the noisy portion of FSDnoisy18k revealed that roughly 40% of the analyzed labels are correct and complete, whereas 60% of the labels show some type of label noise. Please check the FSDnoisy18k companion site for a detailed characterization of the label noise in the dataset, including a taxonomy of label noise for singly-labeled data as well as a per-class description of the label noise.

    FSDnoisy18k basic characteristics

    The dataset's most relevant characteristics are as follows:

    • FSDnoisy18k contains 18,532 audio clips (42.5h) unequally distributed in the 20 aforementioned classes drawn from the AudioSet Ontology.
    • The audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.
    • The audio clips are of variable length ranging from 300ms to 30s, and each clip has a single ground truth label (singly-labeled data).
    • The dataset is split into a test set and a train set. The test set is drawn entirely from the clean portion, while the remainder of data forms the train set.
    • The train set is composed of 17,585 clips (41.1h) unequally distributed among the 20 classes. It features a clean subset and a noisy subset. In terms of number of clips their proportion is 10%/90%, whereas in terms of duration the proportion is slightly more extreme (6%/94%). The per-class percentage of clean data within the train set is also imbalanced, ranging from 6.1% to 22.4%. The number of audio clips per class ranges from 51 to 170, and from 250 to 1000 in the clean and noisy subsets, respectively. Further, a noisy small subset is defined, which includes an amount of (noisy) data comparable (in terms of duration) to that of the clean subset.
    • The test set is composed of 947 clips (1.4h) that belong to the clean portion of the data. Its class distribution is similar to that of the clean subset of the train set. The number of per-class audio clips in the test set ranges from 30 to 72. The test set enables a multi-class classification problem.
    • FSDnoisy18k is an expandable dataset that features a per-class varying degree of types and amount of label noise. The dataset allows investigation of label noise as well as of other approaches, from semi-supervised learning (e.g., self-training) to learning with minimal supervision.

    License

    FSDnoisy18k has licenses at two different levels, as explained next. All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. In particular, all Freesound clips included in FSDnoisy18k are released under either CC-BY or CC0. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of audio clips and their corresponding license in the LICENSE-INDIVIDUAL-CLIPS file downloaded with the dataset.

    In addition, FSDnoisy18k as a whole is the result of a curation process and it has an additional license. FSDnoisy18k is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the dataset.

    Files

    FSDnoisy18k can be downloaded as a series of zip files with the following directory structure:

    root
    │ 
    └───FSDnoisy18k.audio_train/     Audio clips in the train set
    │  
    └───FSDnoisy18k.audio_test/      Audio clips in the test set
    │  
    └───FSDnoisy18k.meta/         Files for evaluation setup
    │  │      
    │  └───train.csv           Data split and ground truth for the train set
    │  │      
    │  └───test.csv           Ground truth for the test set     
    │  
    └───FSDnoisy18k.doc/
      │      
      └───README.md           The dataset description file that you are reading
      │      
      └───LICENSE-DATASET        License of the FSDnoisy18k dataset as an entity  
      │      
      └───LICENSE-INDIVIDUAL-CLIPS.csv Licenses of the individual audio clips from Freesound 
    

    Each row (i.e. audio clip) of the train.csv file contains the following

  9. mindaffectBCI

    • kaggle.com
    Updated Nov 23, 2020
    Cite
    MindAffect (2020). mindaffectBCI [Dataset]. https://www.kaggle.com/mindaffect/mindaffectbci/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 23, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    MindAffect
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    MindAffect is a startup working to make Brain Computer Interfaces (BCI), with a mission to “open up new dimensions of interaction” by developing technologies that allow users to control computers directly with their brains. So far we have pursued this mission by:

    • working directly with patients and patient support groups (such as ALS-liga Belgium) to deliver BCI technologies directly to end users,
    • partnering with groups in sectors interested in adding brain control, such as VR gaming or home automation, to develop product prototypes,
    • direct sales, via our Kickstarter campaign, of complete BCI development kits to makers and hackers interested in developing brain-controlled projects.

    Modern BCIs (including our own) rely heavily on machine learning techniques to process the noisy data gathered from EEG sensors and to cope with the high degree of variability in responses across different individuals and environments. MindAffect firmly believes that the key to enabling the new BCI applications we all want is a combination of more sophisticated machine learning algorithms and larger, more diverse datasets on which to train these algorithms.

    Content

    As a first step towards enabling machine learning experts to improve the BCI experience, we are publishing our internal testing datasets and the analysis code used to develop and refine our own algorithms. We hope these datasets will help people develop new and improved algorithms for this type of data.

    Initially, we have committed about 60 datasets from our development team. We are committed to adding more datasets as we gather them, to build as large a database of cVEP EEG data as possible for algorithm development. Further, as more users gather their own data with our open-source BCI, we hope they will be willing to donate their datasets so as to rapidly grow a large and diverse dataset for further algorithm enhancement.

    Specifically, this dataset was gathered by one of our developers in Nijmegen in the Netherlands using our on-line BCI system, exactly as shown in this video.

    Acknowledgements

    This dataset was gathered and donated by MindAffect B.V.

    Inspiration

    What can you do with this data?

    1. Get better performance in less time? How about using a deep-learning approach?
    2. Generalize your algorithm to transfer between datasets, so the user does not have to re-calibrate for each new dataset?
    3. Generalize over multiple users (as we add new user data)?
    4. Generalize to different BCI types (as we add P300 and SSVEP datasets)?

  10. Data and Code for: Noise, Cognitive Function, and Worker Productivity

    • openicpsr.org
    delimited
    Updated Sep 8, 2023
    Cite
    Joshua T Dean (2023). Data and Code for: Noise, Cognitive Function, and Worker Productivity [Dataset]. http://doi.org/10.3886/E193705V1
    Explore at:
    delimited (available download formats)
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    American Economic Association
    Authors
    Joshua T Dean
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2016 - 2017
    Area covered
    Nairobi, Kenya
    Description

    This is the data and code needed to replicate the results in "Noise, Cognitive Function, and Worker Productivity."

    Paper abstract: Cognitive science research suggests the noisy workplaces common in low- and middle-income countries can impair workers' cognitive functions. However, whether this translates into lower earnings for workers depends on the importance of these functions for productivity and on whether workers understand these effects. I use two randomized experiments in Nairobi, Kenya to answer these questions. First, I randomize exposure to engine noise during a textile training course at a government training facility. An increase of 7 dB reduces productivity by approximately 3%. In order to study what mechanism drives this effect, I then randomize engine noise during tests of cognitive function and an effort task. The same noise change impairs cognitive function but not effort task performance. Finally, in both experiments, I examine whether individuals appreciate the impact of noise on their performance by eliciting participants' willingness to pay for quiet working conditions while randomly varying whether they are compensated based on their performance. Individuals' willingness to pay does not depend on the wage structure, suggesting that they are not aware that quiet working conditions would increase their performance pay. Thus, workers may fail to mitigate earnings losses by sorting into quieter jobs where they are more productive.

  11. Gaussian Processes with Noisy Regression Inputs for Dynamical Systems

    • data.uni-hannover.de
    zip
    Updated Aug 14, 2024
    + more versions
    Cite
    Institut für Regelungstechnik (2024). Gaussian Processes with Noisy Regression Inputs for Dynamical Systems [Dataset]. https://data.uni-hannover.de/dataset/gaussian-processes-with-noisy-regression-inputs-for-dynamical-systems
    Explore at:
    zip (available download formats)
    Dataset updated
    Aug 14, 2024
    Dataset authored and provided by
    Institut für Regelungstechnik
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Here we provide the code related to our recent paper "Gaussian Processes with Noisy Regression Inputs for Dynamical Systems".

    To run the code, execute the 'offline_phase.mat' or 'offline_phase_all.mat' files.

  12. Strategic noise mapping (2017)

    • gov.uk
    Updated Dec 21, 2022
    Cite
    Department for Environment, Food & Rural Affairs (2022). Strategic noise mapping (2017) [Dataset]. https://www.gov.uk/government/publications/strategic-noise-mapping-2019
    Explore at:
    Dataset updated
    Dec 21, 2022
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    Department for Environment, Food & Rural Affairs
    Description

    Defra has published strategic noise map data that give a snapshot of the estimated noise from major road and rail sources across England in 2017. The data was developed as part of implementing the Environmental Noise Directive.

    This publication explains which noise sources were included in the 2017 strategic noise mapping process. It provides summary maps for major road and rail sources and provides links to the detailed Geographic Information Systems (GIS) noise datasets.

    This data will help transport authorities to better identify and prioritise relevant local action on noise. It will also be useful for planners, academics and others working to assess noise and its impacts.

    Noise mapping Geographic Information Systems (GIS) datasets

    Rail noise

    Road Noise

    Other

    Noise exposure data

    We’ve published data which shows the estimated number of people affected by noise from road traffic, railway and industrial sources.

  13. Table_1_Multisensory benefits for speech recognition in noisy...

    • frontiersin.figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    Yonghee Oh; Meg Schwalm; Nicole Kalpin (2023). Table_1_Multisensory benefits for speech recognition in noisy environments.XLSX [Dataset]. http://doi.org/10.3389/fnins.2022.1031424.s001
    Explore at:
    xlsx (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Yonghee Oh; Meg Schwalm; Nicole Kalpin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A series of our previous studies explored the use of an abstract visual representation of the amplitude envelope cues from target sentences to benefit speech perception in complex listening environments. The purpose of this study was to expand this auditory-visual speech perception to the tactile domain. Twenty adults participated in speech recognition measurements in four different sensory modalities (AO, auditory-only; AV, auditory-visual; AT, auditory-tactile; AVT, auditory-visual-tactile). The target sentences were fixed at 65 dB sound pressure level and embedded within a simultaneous speech-shaped noise masker of varying degrees of signal-to-noise ratios (−7, −5, −3, −1, and 1 dB SNR). The amplitudes of both abstract visual and vibrotactile stimuli were temporally synchronized with the target speech envelope for comparison. Average results showed that adding temporally-synchronized multimodal cues to the auditory signal did provide significant improvements in word recognition performance across all three multimodal stimulus conditions (AV, AT, and AVT), especially at the lower SNR levels of −7, −5, and −3 dB for both male (8–20% improvement) and female (5–25% improvement) talkers. The greatest improvement in word recognition performance (15–19% improvement for males and 14–25% improvement for females) was observed when both visual and tactile cues were integrated (AVT). Another interesting finding in this study is that temporally synchronized abstract visual and vibrotactile stimuli additively stack in their influence on speech recognition performance. Our findings suggest that a multisensory integration process in speech perception requires salient temporal cues to enhance speech recognition ability in noisy environments.

  14. Iron Ore Pellet Size Prediction

    • kaggle.com
    zip
    Updated Jan 3, 2025
    Cite
    André Moreira (2025). Iron Ore Pellet Size Prediction [Dataset]. https://www.kaggle.com/datasets/deathmetalbrazil/dados-pelotas/data
    Explore at:
    zip (238,948 bytes; available download formats)
    Dataset updated
    Jan 3, 2025
    Authors
    André Moreira
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Iron Ore Pellet Size Prediction

    The aim is to predict the size of the pellets (pellet feed) at the end of the production process in a steel industry operating in the global market.

    The prediction will be carried out using historical data from sensors that capture information from each stage of the production process, statistical models and artificial intelligence algorithms, which will seek to identify trends and patterns in order to estimate the size of the pellets at the end of the process.

    Dataset Overview

    The dataset contains 10 columns and 9,997 rows, where each row represents a stage of the production process with its respective information.

    This data can be extremely useful for process engineers, data scientists and other professionals involved in the steel industry.

    For process engineers, detailed analysis of variables can provide valuable insights into operational efficiency. They can identify bottlenecks in the process, assess the impact of different operating conditions and implement improvements that result in more efficient and higher quality production.

    For data scientists, the dataset offers a rich source of information for building predictive models. Using machine learning techniques, they can develop algorithms that predict pellet size based on input variables, allowing for real-time adjustments and optimization of the production process. In addition, statistical analysis can reveal hidden patterns and trends that may not be evident at first glance.

    Note: there are many outliers and noisy data points (zeros) in the database; these have intentionally been left untreated.

    Column Descriptions

    • Superficie_Especifica (explanatory): Quality measure that goes into pelletizing
    • Taxa_Alimentacao_Disco (explanatory): Disc feed rate
    • Taxa_Alimentacao_Misturador (explanatory): Mixer feed rate
    • Umidade (explanatory): Humidity feed rate
    • Bentonita (explanatory): Bentonite feed rate
    • Velocidade_Disco (explanatory): Disc speed feed rate
    • Velocidade_Misturador (explanatory): Mixer speed feed rate
    • Retorno_1 (explanatory): Return feed rate
    • Retorno_2 (explanatory): Return feed rate
    • Distribuicao_Tamanho_Pelotas (target): Size of the pellets at the end of the process

    How to Use this Dataset

    Exploratory Data Analysis (EDA):

    Perform univariate and multivariate analysis. Visualize data distributions for variables such as Umidade, Bentonita, and Taxa_Alimentacao_Disco.

    Data Visualization:

    Create plots to study relationships between features. Use heatmaps to analyze correlations between numerical features.

    Predictive Modeling:

    Build machine learning models to predict Distribuicao_Tamanho_Pelotas using features. Test different regression models (machine learning or deep learning) for better insights.
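
    Following the modeling suggestion above, here is a baseline regression sketch; the CSV file name is a placeholder, while the column names are those listed under "Column Descriptions":

      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.metrics import r2_score
      from sklearn.model_selection import train_test_split

      df = pd.read_csv("dados_pelotas.csv")     # placeholder; use the CSV from this dataset

      # The note above says zeros/outliers were left untreated; as a crude first
      # pass, drop rows whose explanatory features are all zero (tune for real use).
      target = "Distribuicao_Tamanho_Pelotas"
      df = df[(df.drop(columns=[target]) != 0).any(axis=1)]

      X, y = df.drop(columns=[target]), df[target]
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

      model = RandomForestRegressor(n_estimators=200, random_state=42)
      model.fit(X_tr, y_tr)
      print("R^2 on held-out data:", r2_score(y_te, model.predict(X_te)))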

  15. Generalized entropy based possibilistic fuzzy C-means

    • data.mendeley.com
    Updated Oct 31, 2016
    + more versions
    Cite
    Salar Askari Lasaki (2016). Generalized entropy based possibilistic fuzzy C-means [Dataset]. http://doi.org/10.17632/b3xkmxrz88.1
    Explore at:
    Dataset updated
    Oct 31, 2016
    Authors
    Salar Askari Lasaki
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Dear Researcher,

    Thank you for using this code and these datasets. I explain below how the GEPFCM code related to my paper "Generalized entropy based possibilistic fuzzy C-Means for clustering noisy data and its convergence proof," published in Neurocomputing, works. The main datasets mentioned in the paper, together with the GEPFCM code, are included. If there is any question, feel free to contact me at: bas_salaraskari@yahoo.com or s_askari@aut.ac.ir

    Regards,

    S. Askari

    Guidelines for the GEPFCM algorithm:

    1. Open the file GEPFCM Code using MATLAB. This is the relaxed form of the algorithm for handling noisy data.
    2. Enter or paste the name of the dataset you wish to cluster in line 15 after "load". It loads the dataset into the workspace.
    3. For details of the parameters cFCM, cPCM, c1E, c2E, eta, and m, please read the paper.
    4. Lines 17 and 18: "N" is the number of data vectors and "D" is the number of independent variables.
    5. Line 26: "C" is the number of clusters. To input your own desired value for the number of clusters, "uncomment" this line and then enter the value. Since the datasets provided here include "C", this line is "comment".
    6. Line 28: "ruopt" is the optimal value of ρ discussed in equation 13 of the paper. To enter your own value of ρ, "uncomment" this line. Since the datasets provided here include "ruopt", this line is "comment".
    7. If line 50 is "comment", the covariance norm (Mahalanobis distance) is used; if it is "uncomment", the identity norm (Euclidean distance) is used.
    8. When you run the algorithm, FCM is first applied to the data. Cluster centers calculated by FCM initialize PFCM. Then PFCM is applied to the data, and cluster centers computed by PFCM initialize GEPFCM. Finally, GEPFCM is applied to the data.
    9. For a two-dimensional plot, "uncomment" lines 419-421 and "comment" lines 423-425. For a three-dimensional plot, "comment" lines 419-421 and "uncomment" lines 423-425.
    10. To run the algorithm, press Ctrl+Enter on your keyboard.
    11. For your own dataset, please arrange the data as in the datasets described in the MS Word file "Read Me".

  16. Large-Scale Dataset for Emergency Vehicle Siren and Road Noises

    • resodate.org
    Updated Jan 1, 2021
    Cite
    Muhammad Usaid; Tabarka Rajab; Sarwer Wasi; Prof. Dr. Muhammad Asif; Prof. Dr. Sheikh Muhammad Munaf; Prof. Dr. Engr. Samreen Hussain (2021). Large-Scale Dataset for Emergency Vehicle Siren and Road Noises [Dataset]. http://doi.org/10.6084/M9.FIGSHARE.17560865
    Explore at:
    Dataset updated
    Jan 1, 2021
    Dataset provided by
    figshare
    Authors
    Muhammad Usaid; Tabarka Rajab; Sarwer Wasi; Prof. Dr. Muhammad Asif; Prof. Dr. Sheikh Muhammad Munaf; Prof. Dr. Engr. Samreen Hussain
    Description

    In our research work, we accumulated a dataset of two thousand sound files from different sources and extracted the features that can be further utilized in the deep-learning problem of emergency sound classification. In addition, we share the link to our dataset both as WAV files, cropped to a specific time window at a fixed frequency, and as a CSV file of extracted features. Our dataset contains ambulance sounds and other road-noise data in the form of audio files.
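
    The description does not name the extracted features, so as one plausible example, here is a short MFCC-extraction sketch of the kind often used for siren-versus-road-noise classification; the file name and parameters are illustrative only:

      import librosa
      import numpy as np

      def mfcc_features(path, sr=22050, n_mfcc=20):
          """Load an audio clip and summarize it with mean/std MFCC statistics."""
          y, _ = librosa.load(path, sr=sr, mono=True)
          mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
          return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

      row = mfcc_features("ambulance_001.wav")   # hypothetical file from the dataset
      print(row.shape)                           # one CSV row of 40 summary features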

  17. Data from: A meta-analysis of the influence of anthropogenic noise on...

    • nde-dev.biothings.io
    • data-staging.niaid.nih.gov
    • +2more
    zip
    Updated Mar 26, 2021
    Cite
    Cameron Duquette; Cameron Duquette; Torre Hovick; Scott Loss (2021). A meta-analysis of the influence of anthropogenic noise on terrestrial wildlife communication strategies [Dataset]. http://doi.org/10.5061/dryad.k6djh9w61
    Explore at:
    zip (available download formats)
    Dataset updated
    Mar 26, 2021
    Dataset provided by
    Oklahoma State University
    North Dakota State University
    Authors
    Cameron Duquette; Cameron Duquette; Torre Hovick; Scott Loss
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description
    1. Human-caused noise pollution dominates the soundscape of modern ecosystems, from urban centers to national parks. Though wildlife can generally alter their communication to accommodate many types of natural noise (e.g. wind, wave action, heterospecific communication), noise pollution from anthropogenic sources pushes the limits of wildlife communication flexibility by causing loud, low-pitched, and near-continuous interference. Because responses to noise pollution are variable and taxa-specific, multi-species risk assessments and mitigation are not currently possible.
    2. We conducted a meta-analysis to synthesize noise pollution effects on terrestrial wildlife communication. Specifically, we assessed: 1) the impacts of noise pollution on modulation of call rate, duration, amplitude, and frequency (including peak, minimum, and maximum frequency); and 2) the literature on anthropogenic noise pollution by region, taxa, study design, and disturbance type.
    3. Terrestrial wildlife (results driven by avian studies) generally respond to noise pollution by calling with higher minimum frequencies, while they generally do not alter the amplitude, maximum frequency, peak frequency, duration, and rate of calling.
    4. The literature on noise pollution research is biased towards birds, population-level studies, urban noise sources, and study systems in North America.
    5. Policy applications: Our study reveals the ways in which wildlife can alter their signals to contend with anthropogenic noise, and discusses the potential fitness and management consequences of these signal alterations. This information, combined with an identification of current research needs, will allow researchers and managers to better develop noise pollution risk assessment protocols and prioritize mitigation efforts to reduce anthropogenic noise.

    Methods

    Literature Search Strategy and Inclusion Criteria

        We searched the peer-reviewed scientific literature to synthesize information regarding noise pollution impacts on wildlife acoustic communication and to assess research gaps and biases. We restricted the search to terrestrial systems because general approaches to noise pollution risk assessment and recommendations for noise mitigation already exist for some coastal and marine systems (Southall et al. 2007). Perhaps more importantly, a vast body of research conducted to date on marine wildlife has yielded valuable knowledge such as species-specific spectral sensitivity, critical impact thresholds, and mitigation effectiveness which can be drawn upon to advance general theory and research and to develop further regulatory guidelines (Erbe et al. 2016). Finally, the physics of sound transmission differ between water and air, affecting both how sound is perceived by organisms and potential mitigation strategies (Würsig et al. 2000, Shannon et al. 2015). We used Web of Science (search conducted 4/5/2018) to search for studies investigating the impact of noise pollution on wildlife modulation of call frequency, rate, duration, and amplitude (see Table 2 for specific search terms). We assessed these multiple communication response variables even though they may be related because each response may have different ecological and/or evolutionary implications. An initial search produced 815 studies. After implementing all inclusion criteria (see below), our search resulted in 181 data points from 32 studies representing six continents (Table 3).
      

    We used the “Analyze Results” feature in Web of Science to filter out irrelevant disciplines (e.g., Audiology, Speech Pathology; n_excluded = 347). After compiling remaining results into a database, we removed duplicate studies (n_excluded = 5) and studies determined to be topically irrelevant based on reading of all titles (n_excluded = 117). We excluded studies broadcasting white noise as a treatment, as we were interested in responses to spectral characteristics that more closely match environmental noise pollution (i.e., loud, low-frequency sounds; n_excluded = 3). However, we retained one study that explicitly manipulated the characteristics of white noise to approximate low-frequency traffic sounds. We excluded studies conducted in a laboratory setting, as we were only interested in responses of free-living wildlife to noises experienced in their natural habitat (n_excluded = 5). After detailed screening of article texts, we removed studies that did not assess effects of noise pollution on the above focal response variables and studies with analysis methods or reporting that precluded us from extracting a relevant effect size (n_excluded = 59).

    For remaining studies, we extracted the location, focal taxa, response variable, sound source, and study design. We also extracted means, sample sizes, and standard deviations of response variables for studies assessing categorical predictor variables (e.g., call characteristics at quiet and noisy sites), or values of Pearson’s r for studies assessing continuous predictor variables (e.g., response characteristics over a gradient of decibel levels). In studies with multiple treatments, we used the two extreme ends of the environmental sound spectrum for analysis. For example, if a study tested call rates in “quiet”, “moderate”, and “loud” environments, we compared responses between “quiet” and “loud” sites. Sound sources included airplane (n = 2), construction (n = 6), energy development (n = 17), roadway (n = 52), urban (n = 101), and white noise (n = 3). We also distinguished study designs as event-based (n = 41) versus continuous (n = 140). Event-based study designs evaluated instantaneous signal flexibility in the presence of anthropogenic sound (e.g., a grasshopper calling more loudly during an airplane overflight compared to normal conditions, Fig. 2). Continuous study designs, on the other hand, evaluated differences in acoustic properties between populations in loud and quiet environments (e.g., communication characteristics of red-winged blackbirds (Agelaius phoeniceus) in rural versus urban environments; Fig. 2). Following our literature search, we incorporated a specific search for bat studies, as they were underrepresented in our initial search and we felt that they are good models for the study of anthropogenic sound impacts due to their reliance on acoustic information for both communication and foraging.
    

    Analysis

    To assess potential biases in the noise pollution literature, we assessed observed versus expected proportions of studies using Pearson’s χ2 tests. We conducted these tests to analyze numbers of studies for each response variable, sound source, focal taxa, continent, and study design; in each case we tested a null hypothesis that an equal proportion of studies have been conducted for each category (e.g., 50% of studies each for event-based and continuous study designs). To control the Type I error rate, we employed a Holm’s Sequential Bonferroni correction.

    We conducted a meta-analysis to assess wildlife responses to noise pollution using the metafor package (Viechtbauer, 2010) in the R statistical environment (version 3.4.1, R Core Team 2017). We ran mixed-effects meta-regression models with study design (event-based versus continuous) and taxa as fixed effects and study ID as a random effect.

    When possible, we calculated Hedges' g for each study that used a categorical noise treatment. When studies evaluated responses to noise along a continuous gradient, we calculated Hedges' g using Pearson's r. To evaluate the overall effect for each response variable (minimum frequency, maximum frequency, peak frequency, duration, rate, and amplitude), as well as the effect of study type and taxa, we evaluated overlap of 95% confidence intervals with zero. After conducting analyses, we constructed Q-Q plots to visually assess model fit.
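
    For reference, a small Python sketch of the effect-size computations described above (Hedges' g from group summaries, and the standard r-to-d conversion used for continuous predictors); the numbers in the example are made up:

      import math

      def hedges_g(m1, sd1, n1, m2, sd2, n2):
          """Hedges' g: bias-corrected standardized mean difference."""
          df = n1 + n2 - 2
          s_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
          d = (m1 - m2) / s_pooled
          j = 1.0 - 3.0 / (4.0 * df - 1.0)      # small-sample correction factor
          return j * d

      def r_to_g(r, n):
          """Convert Pearson's r to Hedges' g via Cohen's d."""
          d = 2.0 * r / math.sqrt(1.0 - r**2)
          j = 1.0 - 3.0 / (4.0 * (n - 2) - 1.0)
          return j * d

      # e.g., minimum song frequency (kHz) at noisy vs. quiet sites.
      print(hedges_g(m1=3.2, sd1=0.4, n1=25, m2=3.0, sd2=0.5, n2=25))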

  18. Bylaw No. 6980 A Bylaw to Regulate the Control of Noise within Regina -...

    • openregina.ca
    Updated Jan 16, 2017
    Cite
    (2017). Bylaw No. 6980 A Bylaw to Regulate the Control of Noise within Regina - Dataset - City of Regina Open Data [Dataset]. https://openregina.ca/dataset/bylaw-no-6980-a-bylaw-to-regulate-the-control-of-noise-within-regina
    Explore at:
    Dataset updated
    Jan 16, 2017
    Area covered
    Regina
    Description

    To prohibit, eliminate and abate loud, unusual and unnecessary noise or noises which annoy, disturb, injure or endanger the comfort, repose, health, peace or safety of others within the City of Regina.

  19. 2010 Census Production Settings Demographic and Housing Characteristics...

    • registry.opendata.aws
    + more versions
    Cite
    United States Census Bureau, 2010 Census Production Settings Demographic and Housing Characteristics (DHC) Demonstration Noisy Measurement File [Dataset]. https://registry.opendata.aws/census-2010-dhc-nmf/
    Explore at:
    Dataset provided by
    United States Census Bureau (http://census.gov/)
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The 2010 Census Production Settings Demographic and Housing Characteristics (DHC) Demonstration Noisy Measurement File (2023-06-30) is an intermediate output of the 2020 Census Disclosure Avoidance System (DAS) TopDown Algorithm (TDA) (as described in Abowd, J., et al. [2022], https://doi.org/10.1162/99608f92.529e3cb9, and implemented in https://github.com/uscensusbureau/DAS_2020_Redistricting_Production_Code). The NMF was produced using the official “production settings,” the final set of algorithmic parameters and privacy-loss budget allocations that were used to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File and the 2020 Census Demographic and Housing Characteristics File. The NMF consists of the full set of privacy-protected statistical queries (counts of individuals or housing units with particular combinations of characteristics) of confidential 2010 Census data relating to the 2010 Demonstration Data Products Suite – Redistricting (P.L. 94-171) and Demographic and Housing Characteristics File – Production Settings (2023-04-03). These statistical queries, called “noisy measurements,” were produced under the zero-Concentrated Differential Privacy framework (Bun, M. and Steinke, T. [2016], https://arxiv.org/abs/1605.02065; see also Dwork, C. and Roth, A. [2014], https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf) implemented via the discrete Gaussian mechanism (Canonne, C., et al. [2023], https://arxiv.org/abs/2004.00010), which added positive or negative integer-valued noise to each of the resulting counts. The noisy measurements are an intermediate stage of the TDA, prior to the post-processing that the TDA then performs to ensure internal and hierarchical consistency within the resulting tables. The Census Bureau has released these 2010 Census demonstration data to enable data users to evaluate the expected impact of disclosure avoidance variability on 2020 Census data. The 2010 Census Production Settings Demographic and Housing Characteristics (DHC) Demonstration Noisy Measurement File (2023-04-03) has been cleared for public dissemination by the Census Bureau Disclosure Review Board (CBDRB-FY22-DSEP-004).

    The 2010 Census Production Settings Demographic and Housing Characteristics Demonstration Noisy Measurement File includes zero-Concentrated Differentially Private (zCDP) (Bun, M. and Steinke, T [2016]) noisy measurements, implemented via the discrete Gaussian mechanism. These are estimated counts of individuals and housing units included in the 2010 Census Edited File (CEF), which includes confidential data initially collected in the 2010 Census of Population and Housing. The noisy measurements included in this file were subsequently post-processed by the TopDown Algorithm (TDA) to produce the 2010 Census Production Settings Privacy-Protected Microdata File - Redistricting (P.L. 94-171) and Demographic and Housing Characteristics File (2023-04-03) (https://www2.census.gov/programs-surveys/decennial/2020/program-management/data-product-planning/2010-demonstration-data-products/04-Demonstration_Data_Products_Suite/2023-04-03/). As these 2010 Census demonstration data are intended to support study of the design and expected impacts of the 2020 Disclosure Avoidance System, the 2010 CEF records were pre-processed before application of the zCDP framework. This pre-processing converted the 2010 CEF records into the input-file format, response codes, and tabulation categories used for the 2020 Census, which differ in substantive ways from the format, response codes, and tabulation categories originally used for the 2010 Census.

    The NMF provides estimates of counts of persons in the CEF by various characteristics and combinations of characteristics including their reported race and ethnicity, whether they were of voting age, whether they resided in a housing unit or one of 7 group quarters types, and their census block of residence after the addition of discrete Gaussian noise (with the scale parameter determined by the privacy-loss budget allocation for that particular query under zCDP). Noisy measurements of the counts of occupied and vacant housing units by census block are also included. Lastly, data on constraints—information into which no noise was infused by the Disclosure Avoidance System (DAS) and used by the TDA to post-process the noisy measurements into the 2010 Census Production Settings Privacy-Protected Microdata File - Redistricting (P.L. 94-171) and Demographic and Housing Characteristics File (2023-04-03) —are provided.

  20. Data from: The Renewed Role of Sweep Functions in Noisy Shortcuts to...

    • resodate.org
    Updated Oct 11, 2021
    Cite
    Michele Delvecchio; Francesco Petiziol; Sandro Wimberger (2021). The Renewed Role of Sweep Functions in Noisy Shortcuts to Adiabaticity [Dataset]. http://doi.org/10.14279/depositonce-12487
    Explore at:
    Dataset updated
    Oct 11, 2021
    Dataset provided by
    Technische Universität Berlin
    DepositOnce
    Authors
    Michele Delvecchio; Francesco Petiziol; Sandro Wimberger
    Description

    We study the robustness of different sweep protocols for accelerated adiabatic following in the presence of static errors and of dissipative and dephasing phenomena. While in the noise-free case counterdiabatic driving is, by definition, insensitive to the form of the original sweep function, this property may be lost when the quantum system is open. We indeed observe that, depending on the decay and dephasing channels investigated here, the performance of the system becomes highly dependent on the sweep function. Our findings are relevant for the experimental implementation of robust shortcuts-to-adiabaticity techniques for the control of quantum systems.
