100+ datasets found
  1. Data from: Fast Bayesian Record Linkage With Record-Specific Disagreement...

    • tandf.figshare.com
    Updated Jun 2, 2023
    Cite
    Thomas Stringham (2023). Fast Bayesian Record Linkage With Record-Specific Disagreement Parameters [Dataset]. http://doi.org/10.6084/m9.figshare.14687696.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Thomas Stringham
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Researchers are often interested in linking individuals between two datasets that lack a common unique identifier. Matching procedures often struggle to match records with common names, birthplaces, or other field values. Computational feasibility is also a challenge, particularly when linking large datasets. We develop a Bayesian method for automated probabilistic record linkage and show it recovers more than 50% more true matches, holding accuracy constant, than comparable methods in a matching of military recruitment data to the 1900 U.S. Census for which expert-labeled matches are available. Our approach, which builds on a recent state-of-the-art Bayesian method, refines the modeling of comparison data, allowing disagreement probability parameters conditional on nonmatch status to be record-specific in the smaller of the two datasets. This flexibility significantly improves matching when many records share common field values. We show that our method is computationally feasible in practice, despite the added complexity, with an R/C++ implementation that achieves a significant improvement in speed over comparable recent methods. We also suggest a lightweight method for treatment of very common names and show how to estimate true positive rate and positive predictive value when true match status is unavailable.
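    As a toy illustration of the comparison-vector machinery that probabilistic record linkage methods such as this one build on, the sketch below scores candidate pairs with Fellegi-Sunter-style agreement probabilities. All records and the m/u values are made up; the paper's record-specific nonmatch parameters and R/C++ implementation are far more sophisticated.

```python
# Toy comparison-vector record linkage (hypothetical records and
# probabilities; not the paper's model or implementation).
from math import log

def compare(rec_a, rec_b, fields):
    # 1 if the field values agree exactly, else 0, for each field.
    return tuple(int(rec_a[f] == rec_b[f]) for f in fields)

def score(gamma, m, u):
    # Fellegi-Sunter-style log-likelihood ratio of match vs. nonmatch.
    # m[i] / u[i]: P(agree on field i | match / nonmatch), assumed here;
    # the paper makes the nonmatch-side parameters record-specific.
    s = 0.0
    for g, mi, ui in zip(gamma, m, u):
        s += log(mi / ui) if g else log((1 - mi) / (1 - ui))
    return s

fields = ("name", "birthplace")
recruit = {"name": "John Smith", "birthplace": "Ohio"}
census = [{"name": "John Smith", "birthplace": "Ohio"},
          {"name": "John Smith", "birthplace": "Texas"}]

m = (0.95, 0.90)   # illustrative agreement probabilities given a match
u = (0.01, 0.05)   # illustrative agreement probabilities given a nonmatch

scores = [score(compare(recruit, c, fields), m, u) for c in census]
best = max(range(len(census)), key=lambda j: scores[j])
```

    The candidate agreeing on both fields scores highest, which is the basic signal any linkage model of this family exploits.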

  2. Dataset for testing and training map-matching methods

    • data-staging.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Kubicka, Matej; Arben Cela; Philippe Moulin; Hugues Mounier; S. I. Niculescu (2020). Dataset for testing and training map-matching methods [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_57731
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    IFPEN
    L2S/CentraleSupelec (CNRS UMR 8506)
    UPE, ESIEE Paris
    Authors
    Kubicka, Matej; Arben Cela; Philippe Moulin; Hugues Mounier; S. I. Niculescu
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    We present a dataset for testing, benchmarking, and offline learning of map-matching algorithms. For the first time, a dataset large enough to prove or disprove map-matching hypotheses on a worldwide scale is available. Several hundred map-matching algorithms have been published, each tested only on a limited scale due to the difficulty of collecting truly large-scale data. Our contribution provides a convenient gold standard for comparing map-matching algorithms against each other. Moreover, as many state-of-the-art map-matching algorithms rely on techniques that require offline learning, our dataset can be readily used as the training set. Because of its global coverage, learning does not have to be biased toward the part of the world where the algorithm was tested.
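    A minimal baseline helps make the task concrete: the simplest possible map matcher snaps each GPS point to the nearest road segment by perpendicular distance. The sketch below is purely illustrative, with hypothetical road and track coordinates, and is not one of the published algorithms this dataset is meant to benchmark.

```python
# Naive point-to-segment map matching (illustrative coordinates only).
from math import hypot

def point_segment_distance(p, a, b):
    # Distance from point p to the segment a-b (clamped projection).
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

roads = {"main_st": ((0.0, 0.0), (10.0, 0.0)),
         "side_st": ((0.0, 0.0), (0.0, 10.0))}

track = [(1.0, 0.2), (5.0, -0.1), (0.3, 7.0)]  # noisy GPS fixes
matched = [min(roads, key=lambda r: point_segment_distance(p, *roads[r]))
           for p in track]
```

    Real map matchers add topology, heading, and sequence constraints (e.g., HMM-based methods); this nearest-segment rule is only the degenerate starting point they improve on.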

  3. Replication Data for: Looking for twins: how to build better counterfactuals...

    • dataverse.harvard.edu
    Updated Feb 3, 2021
    Cite
    Stefano Costalli; Fedra Negri (2021). Replication Data for: Looking for twins: how to build better counterfactuals with matching [Dataset]. http://doi.org/10.7910/DVN/CYZFCC
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 3, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Stefano Costalli; Fedra Negri
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A primary challenge for researchers who use observational data is selection bias (i.e., the units of analysis exhibit systematic differences due to non-random selection into treatment). This article encourages researchers to acknowledge this problem and discusses how, and more importantly under which assumptions, they may resort to statistical matching techniques to reduce the imbalance in the empirical distribution of pre-treatment observable variables between the treatment and control groups. To provide practical guidance, the article evaluates the effectiveness of peacekeeping missions in the case of the Bosnian civil war, a research topic in which selection bias is a structural feature of the observational data researchers have to use, and shows how to apply Coarsened Exact Matching (CEM), the most widely used matching algorithm in the fields of Political Science and International Relations.
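    The core of CEM can be sketched in a few lines: coarsen each covariate into analyst-chosen bins, then keep only the strata that contain both treated and control units. The units and cutpoints below are made up for illustration and are not from the article.

```python
# Minimal Coarsened Exact Matching sketch (illustrative data).

def coarsen(value, cutpoints):
    # Index of the bin the value falls into, given sorted cutpoints.
    return sum(value >= c for c in cutpoints)

# (age, treated) records; cutpoints are the analyst's choice.
units = [(22, 1), (24, 0), (37, 1), (61, 0), (63, 1), (25, 0)]
cutpoints = (30, 50)  # bins: under 30, 30-49, 50 and over

strata = {}
for age, treated in units:
    strata.setdefault(coarsen(age, cutpoints), []).append((age, treated))

# Keep only strata with at least one treated AND one control unit;
# everything else is pruned, which is how CEM reduces imbalance.
matched = {s: us for s, us in strata.items()
           if any(t for _, t in us) and any(1 - t for _, t in us)}
n_matched = sum(len(us) for us in matched.values())
```

    Here the middle stratum holds only a treated unit and is dropped, trading sample size for balance, which is exactly the trade-off the article asks researchers to make explicit.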

  4. Web Data Commons Training and Test Sets for Large-Scale Product Matching -...

    • demo-b2find.dkrz.de
    Updated Nov 27, 2020
    + more versions
    Cite
    (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 Product Matching Task derived from the WDC Product Data Corpus - Version 2.0 - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/8f288eb3-f541-5fca-a337-d519f903668f
    Explore at:
    Dataset updated
    Nov 27, 2020
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (each labeled "match" or "no match") for four product categories: computers, cameras, watches, and shoes. To support the evaluation of machine-learning-based matching methods, the data is split into training, validation, and test sets. For each product category, we provide training sets in four different sizes (2,000 to 70,000 pairs). Furthermore, for each training set there are sets of ids available for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived via weak supervision from product identifiers shared on the Web. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
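    A stratified validation draw like the provided id sets can be sketched as follows. The pair ids, labels, and the `stratified_split` helper are illustrative; the real splits ship with the dataset.

```python
# Stratified validation split over labeled pairs (illustrative data).
import random

# Fabricated (pair_id, label) records: every fifth pair is a "match".
pairs = [(i, "match" if i % 5 == 0 else "no match") for i in range(100)]

def stratified_split(pairs, frac, seed=0):
    # Draw frac of the ids from each label stratum, so the validation
    # set preserves the match / no-match ratio of the training set.
    rng = random.Random(seed)
    by_label = {}
    for pid, label in pairs:
        by_label.setdefault(label, []).append(pid)
    val_ids = set()
    for ids in by_label.values():
        rng.shuffle(ids)
        val_ids.update(ids[: int(len(ids) * frac)])
    return val_ids

val_ids = stratified_split(pairs, 0.2)
val_matches = sum(1 for i in val_ids if i % 5 == 0)
```

    With 20 matches among 100 pairs, a 20% stratified draw yields 20 validation pairs of which exactly 4 are matches, preserving the class balance.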

  5. Data from: Comparative study on matching methods for the distinction of...

    • figshare.com
    Updated Jan 26, 2022
    Cite
    Martin Schorcht; Robert Hecht; Gotthard Meinel (2022). Comparative study on matching methods for the distinction of building modifications and replacements based on multi-temporal building footprint data [Dataset]. http://doi.org/10.6084/m9.figshare.18027683.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 26, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Martin Schorcht; Robert Hecht; Gotthard Meinel
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the input data and results used in the paper "Comparative study on matching methods for the distinction of building modifications and replacements based on multi-temporal building footprint data".

    License information: The LoD1 data used as input in this study are openly available at Transparenzportal Hamburg (https://transparenz.hamburg.de/), from Freie und Hansestadt Hamburg, Landesbetrieb Geoinformation und Vermessung (LGV), in compliance with the licence dl-de/by-2-0 (https://www.govdata.de/dl-de/by-2-0).

    Content:
    1. Input footprints of non-identical pairs: input_reference_objects.zip
    2. Results without additional position deviation: results_without_deviation.zip
    3. Results with generated position deviation, including geometries: results_with_deviation.zip

  6. Replication Data for: Adjusting for Confounding with Text Matching

    • datasetcatalog.nlm.nih.gov
    • dataverse.harvard.edu
    Updated Mar 24, 2020
    Cite
    Stewart, Brandon M.; Nielsen, Richard A.; Roberts, Margaret E. (2020). Replication Data for: Adjusting for Confounding with Text Matching [Dataset]. http://doi.org/10.7910/DVN/HTMX3K
    Explore at:
    Dataset updated
    Mar 24, 2020
    Authors
    Stewart, Brandon M.; Nielsen, Richard A.; Roberts, Margaret E.
    Description

    We identify situations in which conditioning on text can address confounding in observational studies. We argue that a matching approach is particularly well-suited to this task, but existing matching methods are ill-equipped to handle high-dimensional text data. Our proposed solution is to estimate a low-dimensional summary of the text and condition on this summary via matching. We propose a method of text matching, topical inverse regression matching, that allows the analyst to match both on the topical content of confounding documents and the probability that each of these documents is treated. We validate our approach and illustrate the importance of conditioning on text to address confounding with two applications: the effect of perceptions of author gender on citation counts in the international relations literature and the effects of censorship on Chinese social media users.

  7. Assessing the performance of matching algorithms when selection into...

    • resodate.org
    Updated Oct 2, 2025
    Cite
    Boris Augurzky (2025). Assessing the performance of matching algorithms when selection into treatment is strong (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9hc3Nlc3NpbmctdGhlLXBlcmZvcm1hbmNlLW9mLW1hdGNoaW5nLWFsZ29yaXRobXMtd2hlbi1zZWxlY3Rpb24taW50by10cmVhdG1lbnQtaXMtc3Ryb25n
    Explore at:
    Dataset updated
    Oct 2, 2025
    Dataset provided by
    Journal of Applied Econometrics
    ZBW
    ZBW Journal Data Archive
    Authors
    Boris Augurzky
    Description

    This paper investigates the method of matching with regard to two crucial implementation choices: the distance measure and the type of algorithm. We implement optimal full matching, a fully efficient algorithm, and present a framework for statistical inference. The implementation uses data from the NLSY79 to study the effect of college education on earnings. We find that decisions regarding the matching algorithm depend on the structure of the data: in the case of strong selection into treatment and treatment effect heterogeneity, full matching seems preferable. If heterogeneity is weak, pair matching suffices.

  8. Replication Data for: The Balance-Sample Size Frontier in Matching Methods...

    • dataverse.harvard.edu
    Updated Jul 1, 2017
    Cite
    Gary King; Christopher Lucas; Richard Nielsen (2017). Replication Data for: The Balance-Sample Size Frontier in Matching Methods for Causal Inference [Dataset]. http://doi.org/10.7910/DVN/SURSEO
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 1, 2017
    Dataset provided by
    Harvard Dataverse
    Authors
    Gary King; Christopher Lucas; Richard Nielsen
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/SURSEO

    Description

    We propose a simplified approach to matching for causal inference that simultaneously optimizes both balance (similarity between the treated and control groups) and matched sample size. Existing approaches either fix the matched sample size and maximize balance or fix balance and maximize sample size, leaving analysts to settle for suboptimal solutions or attempt manual optimization by iteratively tweaking their matching method and rechecking balance. To jointly maximize balance and sample size, we introduce the matching frontier, the set of matching solutions with maximum balance for each possible sample size. Rather than iterating, researchers can choose matching solutions from the frontier for analysis in one step. We derive fast algorithms that calculate the matching frontier for several commonly used balance metrics. We demonstrate with analyses of the effect of sex on judging and job training programs that show how the methods we introduce can extract new knowledge from existing data sets.
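    A toy version of the frontier idea, not the authors' algorithms: match greedily on one covariate, then trace the best attainable imbalance at each matched sample size by keeping only the best-balanced pairs. All numbers below are made up.

```python
# Toy balance-sample size frontier (illustrative; not the authors'
# algorithms). One covariate, greedy nearest-neighbor pairing, then the
# best attainable mean pair distance at each matched sample size.

treated = [1.0, 2.0, 3.0, 10.0]
control = [1.1, 2.2, 2.9, 4.0]

pairs = []
available = list(control)
for t in treated:
    c = min(available, key=lambda x: abs(x - t))  # closest unused control
    available.remove(c)
    pairs.append((t, c, abs(t - c)))

pairs.sort(key=lambda p: p[2])  # best-balanced pairs first
frontier = []                   # (sample size, imbalance) points
for k in range(1, len(pairs) + 1):
    imbalance = sum(d for _, _, d in pairs[:k]) / k
    frontier.append((k, imbalance))
```

    Imbalance is nondecreasing along the frontier: each added pair can only hold or worsen the average distance, so the analyst reads off the sample size at which balance degrades too much.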

  9. Valentine Datasets

    • data.niaid.nih.gov
    Updated Jul 9, 2021
    Cite
    Koutras, Christos; Siachamis, Georgios; Ionescu, Andra; Psarakis, Kyriakos; Brons, Jerry; Fragkoulis, Marios; Lofi, Christoph; Bonifati, Angela; Katsifodimos, Asterios (2021). Valentine Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5084604
    Explore at:
    Dataset updated
    Jul 9, 2021
    Dataset provided by
    TU Delft
    ING Bank Netherlands
    Lyon 1 University
    Authors
    Koutras, Christos; Siachamis, Georgios; Ionescu, Andra; Psarakis, Kyriakos; Brons, Jerry; Fragkoulis, Marios; Lofi, Christoph; Bonifati, Angela; Katsifodimos, Asterios
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets used for evaluating state-of-the-art schema matching methods in the paper "Valentine: Evaluating Matching Techniques for Dataset Discovery", which was accepted for presentation at IEEE ICDE 2021. They come in the form of fabricated pairs respecting a relatedness scenario, as discussed in the paper.

  10. Index match, Index match Advance

    • kaggle.com
    Updated Mar 15, 2024
    Cite
    Sanjana Murthy (2024). Index match, Index match Advance [Dataset]. https://www.kaggle.com/datasets/sanjanamurthy392/index-match-index-match-advance
    Explore at:
    Available download formats: zip (10258 bytes)
    Dataset updated
    Mar 15, 2024
    Authors
    Sanjana Murthy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains Index Match and Index Match Advance examples.

  11. datasets for geospatial data matching

    • figshare.com
    Updated Feb 2, 2022
    Cite
    Wenbin Zhang (2022). datasets for geospatial data matching [Dataset]. http://doi.org/10.6084/m9.figshare.11521389.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 2, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Wenbin Zhang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Run main.m in Matlab to produce matching results. data_BJ and data_HK are two datasets for testing the geospatial data matching methods. The objects are represented by their centroids and corresponding vertices.

  12. Product Datasets from the MWPD2020 Challenge at the ISWC2020 Conference...

    • da-ra.de
    • linkagelibrary.icpsr.umich.edu
    Updated Nov 2020
    + more versions
    Cite
    Christian Bizer; Ralph Peeters; Anna Primpeli (2020). Product Datasets from the MWPD2020 Challenge at the ISWC2020 Conference (Task 1) [Dataset]. http://doi.org/10.7801/352
    Explore at:
    Dataset updated
    Nov 2020
    Dataset provided by
    Mannheim University Library
    da|ra
    Authors
    Christian Bizer; Ralph Peeters; Anna Primpeli
    Description

    The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (each labeled "match" or "no match") from the product category computers. The data is available as training, validation, and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets, as it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g., products without examples in the training set or products with introduced typos. These can be used to measure the performance of methods on these kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites marking up their offers with schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.

  13. Replication Data for: Leveraging Large Language Models for Fuzzy String...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Mar 29, 2024
    Cite
    Yu Wang (2024). Replication Data for: Leveraging Large Language Models for Fuzzy String Matching in Political Science [Dataset]. http://doi.org/10.7910/DVN/A8MKLO
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 29, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Yu Wang
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Fuzzy string matching remains a key issue when political scientists combine data from different sources. Existing matching methods invariably rely on string distances, such as Levenshtein distance and cosine similarity. As such, they are inherently incapable of matching strings that refer to the same entity with different names such as ''JP Morgan'' and ''Chase Bank'', ''DPRK'' and ''North Korea'', ''Chuck Fleischmann (R)'' and ''Charles Fleischmann (R)''. In this letter, we propose to use large language models to entirely sidestep this problem in an easy and intuitive manner. Extensive experiments show that our proposed methods can improve the state of the art by as much as 39% in terms of average precision while being substantially easier and more intuitive to use by political scientists. Moreover, our results are robust against various temperatures. We further note that enhanced prompting can lead to additional performance improvements.
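    The failure mode the letter targets is easy to reproduce: an edit-distance measure rates a typo-level variant as close while rating an alias as far apart, even though both pairs name the same entity. A minimal sketch with a standard dynamic-programming Levenshtein distance, using example strings from the description:

```python
# Why pure string distance fails on aliases: classic Levenshtein distance.

def levenshtein(a, b):
    # Dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# A spelling variant of the same person: small edit distance.
near_typo = levenshtein("Chuck Fleischmann", "Charles Fleischmann")

# An alias for the same country: large edit distance despite co-reference.
alias = levenshtein("DPRK", "North Korea")
```

    The alias pair scores far worse than the typo pair, so any threshold on edit distance either misses aliases or floods the match set, which is the gap the LLM-based approach sidesteps.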

  14. IPL Match Dataset

    • kaggle.com
    Updated Jun 13, 2025
    Cite
    Sanika Chaudhari (2025). IPL Match Dataset [Dataset]. https://www.kaggle.com/datasets/chaudharisanika/ipl-match-dataset
    Explore at:
    Available download formats: zip (16591 bytes)
    Dataset updated
    Jun 13, 2025
    Authors
    Sanika Chaudhari
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides match-level information from the Indian Premier League (IPL), featuring details from multiple seasons. It is curated for those interested in analyzing trends, performances, and patterns.

    Key Features:

    Covers multiple IPL seasons

    Includes team names, toss winner, toss decision

    Match winner and result type (normal, tie, no result)

    Player of the match, venue, and city information

    Useful for data analysis, visualization, and machine learning tasks

    This dataset is perfect for beginners exploring sports analytics, students practicing data wrangling, and enthusiasts building IPL-based projects or dashboards.

  15. Data from: Graded Matching for Large Observational Studies

    • datasetcatalog.nlm.nih.gov
    • tandf.figshare.com
    Updated Mar 28, 2022
    Cite
    Yu, Ruoqi; Rosenbaum, Paul R. (2022). Graded Matching for Large Observational Studies [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000420071
    Explore at:
    Dataset updated
    Mar 28, 2022
    Authors
    Yu, Ruoqi; Rosenbaum, Paul R.
    Description

    Observational studies of causal effects often use multivariate matching to control imbalances in measured covariates. For instance, using network optimization, one may seek the closest possible pairing for key covariates among all matches that balance a propensity score and finely balance a nominal covariate, perhaps one with many categories. This is all straightforward when matching thousands of individuals, but requires some adjustments when matching tens or hundreds of thousands of individuals. In various senses, a sparser network—one with fewer edges—permits optimization in larger samples. The question is: What is the best way to make the network sparse for matching? A network that is too sparse will eliminate from consideration possible pairings that it should consider. A network that is not sparse enough will waste computation considering pairings that do not deserve serious consideration. We propose a new graded strategy in which potential pairings are graded, with a preference for higher grade pairings. We try to match with pairs of the best grade, incorporating progressively lower grade pairs only to the degree they are needed. In effect, only sparse networks are built, stored and optimized. Two examples are discussed, a small example with 1567 matched pairs from clinical medicine, and a slightly larger example with 22,111 matched pairs from economics. The method is implemented in an R package RBestMatch available at https://github.com/ruoqiyu/RBestMatch. Supplementary materials for this article are available online.

  16. football match actions dataset

    • kaggle.com
    Updated Mar 30, 2022
    Cite
    Muhammad Tarek Refaat (2022). football match actions dataset [Dataset]. https://www.kaggle.com/datasets/itarek898/football-match-dataset-first-version
    Explore at:
    Available download formats: zip (2126007204 bytes)
    Dataset updated
    Mar 30, 2022
    Authors
    Muhammad Tarek Refaat
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Football Match Dataset

    This dataset contains detailed information and statistics from football matches. It is designed for sports analytics, machine learning projects, and research related to football match events, outcomes, and player performance.

    Table of Contents

    Overview

    Dataset Description

    Data Fields

    Data Files

    Usage

    Installation

    About the Creator

    Contributing

    License

    Citation

    Contact

    Overview

    The Football Match Dataset – First Version provides comprehensive historical data about football matches. This dataset is ideal for:

    Analyzing match outcomes and trends

    Building machine learning models (e.g., match result prediction, event classification)

    Researching match statistics and game dynamics

    The dataset is available on Kaggle.

    Dataset Description

    This dataset includes various data points collected from football matches. Key aspects of the dataset include:

    Match Information: Dates, venues, participating teams, and final scores.

    Match Events: Detailed events during the match such as goals, assists, fouls, substitutions, etc.

    Additional Metrics: Depending on the version, statistics on possession, shots, passes, and player performance might be included.

    Note: For a complete list of fields and detailed descriptions, please refer to the accompanying documentation on the Kaggle page.

    Data Fields

    Although the fields may be updated in future versions, the current dataset generally contains:

    Match_ID: A unique identifier for each match.

    Date: The match date.

    Home_Team: Name of the home team.

    Away_Team: Name of the away team.

    Home_Score: Goals scored by the home team.

    Away_Score: Goals scored by the away team.

    Events: A detailed log of match events (e.g., goal scorers, cards, substitutions).

    Additional Metrics: Columns with advanced match statistics (if available).

    Data Files

    The dataset is distributed as one or more CSV files. Common file names include:

    football_match_dataset_first_version.csv

    Additional metadata or supporting files may be included

  17. IPL Dataset 2008-2019

    • kaggle.com
    Updated Aug 21, 2020
    Cite
    👨🏼‍💻 (2020). IPL Dataset 2008-2019 [Dataset]. https://www.kaggle.com/lazycoder00/ipl-dataset-20082019
    Explore at:
    Available download formats: zip (20313 bytes)
    Dataset updated
    Aug 21, 2020
    Authors
    👨🏼‍💻
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context: There's a story behind every dataset and here's your opportunity to share yours.

    Content: What's inside is more than just rows and columns. Make it easy for others to get started.

    Acknowledgements: I got this set of data from an online source.

    Inspiration: Just work on this data, using your own techniques and tools.

  18. Replication Data for: The Balance-Sample Size Frontier in Matching Methods...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    King, Gary, Christopher Lucas, and Richard Nielsen (2023). Replication Data for: The Balance-Sample Size Frontier in Matching Methods for Causal Inference [Dataset]. http://doi.org/10.7910/DVN/TRTXLP
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    King, Gary, Christopher Lucas, and Richard Nielsen
    Description

    Replication Data for: The Balance-Sample Size Frontier in Matching Methods for Causal Inference

  19. Data from: Highly Scalable Matching Pursuit Signal Decomposition Algorithm

    • catalog.data.gov
    • datasets.ai
    • + 2 more
    Updated Apr 10, 2025
    + more versions
    Cite
    Dashlink (2025). Highly Scalable Matching Pursuit Signal Decomposition Algorithm [Dataset]. https://catalog.data.gov/dataset/highly-scalable-matching-pursuit-signal-decomposition-algorithm
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    In this research, we propose a variant of the classical Matching Pursuit Decomposition (MPD) algorithm with significantly improved scalability and computational performance. MPD is a powerful iterative algorithm that decomposes a signal into linear combinations of its dictionary elements, or “atoms”. A best-fit atom from an arbitrarily defined dictionary is determined through cross-correlation. The selected atom is subtracted from the signal, and this procedure is repeated on the residual in subsequent iterations until a stopping criterion is met. A sufficiently large dictionary is required for an accurate reconstruction; this in turn increases the computational burden of the algorithm, thus limiting its applicability and level of adoption. Our main contribution lies in improving the computational efficiency of the algorithm to allow faster decomposition while maintaining a similar level of accuracy. The Correlation Thresholding and Multiple Atom Extraction techniques are proposed to decrease the computational burden of the algorithm. Correlation thresholds prune insignificant atoms from the dictionary. The ability to extract multiple atoms within a single iteration enhances the effectiveness and efficiency of each iteration. The proposed algorithm, entitled MPD++, was demonstrated using a real-world dataset.
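    The loop structure described above, with a correlation threshold as the stopping criterion, can be sketched as follows. The tiny dictionary and signal are hand-made for illustration; the actual MPD++ algorithm adds multiple atom extraction per iteration and other optimizations not shown here.

```python
# Classical matching pursuit with a correlation threshold (toy example).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Unit-norm dictionary atoms over 4 samples (illustrative).
atoms = [
    (1.0, 0.0, 0.0, 0.0),
    (0.0, 1.0, 0.0, 0.0),
    (0.5, 0.5, 0.5, 0.5),
]

signal = [3.0, 1.0, 0.0, 0.0]
residual = list(signal)
decomposition = []          # (atom index, coefficient) pairs
threshold = 0.5             # prune atoms with weak correlation

for _ in range(10):         # cap on the number of iterations
    coeffs = [dot(residual, a) for a in atoms]
    best = max(range(len(atoms)), key=lambda i: abs(coeffs[i]))
    if abs(coeffs[best]) < threshold:
        break               # no atom correlates significantly: stop
    decomposition.append((best, coeffs[best]))
    for k in range(len(residual)):
        residual[k] -= coeffs[best] * atoms[best][k]
```

    Each iteration subtracts the best-correlated atom's contribution from the residual; the threshold is the simplest form of the correlation-based pruning the abstract describes.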

  20. Data from: A simplified linear feature matching method using decision tree...

    • tandf.figshare.com
    Updated May 31, 2023
    Cite
    Ick-Hoi Kim; Chen-Chieh Feng; Yi-Chen Wang (2023). A simplified linear feature matching method using decision tree analysis, weighted linear directional mean, and topological relationships [Dataset]. http://doi.org/10.6084/m9.figshare.4497074.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Ick-Hoi Kim; Chen-Chieh Feng; Yi-Chen Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Linear feature matching is one of the crucial components of data conflation, useful for updating existing data through the integration of newer data and for evaluating data accuracy. This article presents a simplified linear feature matching method to conflate historical and current road data. To measure similarity, the shorter line median Hausdorff distance (SMHD), the absolute value of the cosine similarity (aCS) of the weighted linear directional mean values, and topological relationships are adopted. Decision tree analysis is employed to derive thresholds for the SMHD and the aCS. To demonstrate the usefulness of the simple linear feature matching method, four models with incremental configurations are designed and tested: (1) Model 1: one-to-one matching based on the SMHD; (2) Model 2: matching with only the SMHD threshold; (3) Model 3: matching with the SMHD and aCS thresholds; and (4) Model 4: matching with the SMHD, the aCS, and topological relationships. These experiments suggest that Model 2, which considers only distance, does not provide stable results, while Models 3 and 4, which consider direction and topological relationships, produce stable results with accuracy around 90% and 95%, respectively. The results suggest that the proposed method is simple yet robust for linear feature matching.
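    As a rough illustration of distance-based line matching, the sketch below computes a directed Hausdorff distance between polyline vertex sets. It is a simplified stand-in for the SMHD measure used in the article, and all coordinates are made up.

```python
# Directed Hausdorff distance between polyline vertex sets (illustrative).
from math import hypot

def directed_hausdorff(pts_a, pts_b):
    # Worst-case nearest-neighbor distance: for each vertex of A, find
    # its closest vertex of B, then take the maximum of those distances.
    return max(min(hypot(ax - bx, ay - by) for bx, by in pts_b)
               for ax, ay in pts_a)

road_old = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
road_new = [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)]   # slightly shifted update
unrelated = [(0.0, 5.0), (2.0, 5.0)]               # a different road

d_match = directed_hausdorff(road_old, road_new)
d_nonmatch = directed_hausdorff(road_old, unrelated)
```

    A small distance suggests the same feature re-surveyed, a large one a different feature; the article's SMHD refines this by taking the median over the shorter line, and thresholds like those from the decision tree analysis separate the two cases.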
