Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Researchers are often interested in linking individuals between two datasets that lack a common unique identifier. Matching procedures often struggle to match records with common names, birthplaces, or other field values. Computational feasibility is also a challenge, particularly when linking large datasets. We develop a Bayesian method for automated probabilistic record linkage and show it recovers more than 50% more true matches, holding accuracy constant, than comparable methods in a matching of military recruitment data to the 1900 U.S. Census for which expert-labeled matches are available. Our approach, which builds on a recent state-of-the-art Bayesian method, refines the modeling of comparison data, allowing disagreement probability parameters conditional on nonmatch status to be record-specific in the smaller of the two datasets. This flexibility significantly improves matching when many records share common field values. We show that our method is computationally feasible in practice, despite the added complexity, with an R/C++ implementation that achieves a significant improvement in speed over comparable recent methods. We also suggest a lightweight method for treatment of very common names and show how to estimate true positive rate and positive predictive value when true match status is unavailable.
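The core ingredients of such linkage methods are field-by-field comparisons between record pairs and probabilities of agreement under match and nonmatch status. The Python sketch below shows only the generic Fellegi-Sunter-style scoring that such methods build on, with hypothetical m/u probabilities; it is not the paper's Bayesian model, which additionally lets the nonmatch disagreement parameters vary by record.

```python
# A minimal, generic Fellegi-Sunter-style sketch of probabilistic record linkage.
# It is NOT the Bayesian model described above; it only illustrates how field-level
# agreement patterns translate into match scores. The m/u probabilities are
# hypothetical placeholders.
import math

FIELDS = ["first_name", "last_name", "birthplace"]
# m[f]: P(field agrees | records are a true match); u[f]: P(field agrees | nonmatch).
m = {"first_name": 0.95, "last_name": 0.97, "birthplace": 0.90}
u = {"first_name": 0.05, "last_name": 0.01, "birthplace": 0.10}

def match_weight(rec_a, rec_b):
    """Sum of log-likelihood ratios over field agreements and disagreements."""
    weight = 0.0
    for f in FIELDS:
        if rec_a[f] == rec_b[f]:
            weight += math.log(m[f] / u[f])
        else:
            weight += math.log((1 - m[f]) / (1 - u[f]))
    return weight

a = {"first_name": "john", "last_name": "smith", "birthplace": "ohio"}
b = {"first_name": "john", "last_name": "smith", "birthplace": "iowa"}
print(match_weight(a, b))  # higher values indicate a more plausible match
```

With record-specific nonmatch parameters, agreement on a very common surname would count as weaker evidence of a match for records whose field values are common, which is the refinement the abstract describes.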
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
We present a dataset for testing, benchmarking, and offline learning of map-matching algorithms. For the first time, a large enough dataset is available to prove or disprove map-matching hypotheses on a world-wide scale. There are several hundred map-matching algorithms published in the literature, each tested only on a limited scale due to difficulties in collecting truly large-scale data. Our contribution aims to provide a convenient gold standard to compare various map-matching algorithms against each other. Moreover, as many state-of-the-art map-matching algorithms are based on techniques that require offline learning, our dataset can be readily used as the training set. Because of the global coverage of our dataset, learning does not have to be biased to the part of the world where the algorithm was tested.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A primary challenge for researchers who make use of observational data is selection bias (i.e., the units of analysis exhibit systematic differences due to non-random selection into treatment). This article encourages researchers to acknowledge this problem and discusses how and, more importantly, under which assumptions they may resort to statistical matching techniques to reduce the imbalance in the empirical distribution of pre-treatment observable variables between the treatment and control groups. With the aim of providing practical guidance, the article engages with the evaluation of the effectiveness of peacekeeping missions in the case of the Bosnian civil war, a research topic in which selection bias is a structural feature of the observational data researchers have to use, and shows how to apply Coarsened Exact Matching (CEM), the most widely used matching algorithm in the fields of Political Science and International Relations.
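As a rough illustration of what CEM does, the following Python sketch coarsens two covariates into bins and keeps only strata that contain both treated and control units. The data and column names are hypothetical, and the canonical implementations are the cem and MatchIt R packages rather than this snippet.

```python
# A rough sketch of the coarsened exact matching (CEM) idea using pandas:
# coarsen continuous covariates into bins, then keep only strata containing
# at least one treated and one control unit.
import pandas as pd

def cem_prune(df, treatment_col, coarsening):
    """coarsening maps covariate name -> bin edges (passed to pd.cut)."""
    strata = df.copy()
    for col, edges in coarsening.items():
        strata[col + "_bin"] = pd.cut(strata[col], bins=edges)
    bin_cols = [c + "_bin" for c in coarsening]
    # keep only strata containing both treated and control units
    def has_both(group):
        return group[treatment_col].nunique() == 2
    return strata.groupby(bin_cols, observed=True).filter(has_both)

# Hypothetical example: units with a binary peacekeeping 'treatment' indicator
df = pd.DataFrame({
    "treatment": [1, 0, 1, 0, 0],
    "prewar_violence": [3.0, 2.8, 9.1, 0.5, 3.2],
    "population": [12_000, 11_500, 90_000, 3_000, 12_500],
})
matched = cem_prune(
    df, "treatment",
    {"prewar_violence": [0, 1, 5, 10], "population": [0, 10_000, 50_000, 100_000]},
)
print(matched)
```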
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches, and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation, and test sets. For each product category, we provide training sets in four different sizes (2,000 to 70,000 pairs). Furthermore, for each training set there are sets of IDs available for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
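To make the binary-pair format concrete, here is a deliberately naive baseline in Python that scores hard-coded title pairs by TF-IDF cosine similarity and thresholds the score. The pairs and the threshold are invented for illustration; real experiments would load the released training/validation/test files instead.

```python
# A trivial baseline for binary product-pair matching: predict "match" when the
# TF-IDF cosine similarity of the two offer titles exceeds a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pairs = [  # (title of offer A, title of offer B, gold label)
    ("lenovo thinkpad x1 carbon gen 9", "thinkpad x1 carbon 9th gen 14 inch", 1),
    ("lenovo thinkpad x1 carbon gen 9", "canon eos 250d dslr camera", 0),
]

titles = [t for left, right, _ in pairs for t in (left, right)]
vec = TfidfVectorizer().fit(titles)

for left, right, label in pairs:
    sim = cosine_similarity(vec.transform([left]), vec.transform([right]))[0, 0]
    pred = int(sim >= 0.5)  # arbitrary illustrative threshold
    print(f"sim={sim:.2f} pred={pred} label={label}")
```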
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the input data and results used in the paper "Comparative study on matching methods for the distinction of building modifications and replacements based on multi-temporal building footprint data".
License information: The LoD1 data used as input in this study are openly available at Transparenzportal Hamburg (https://transparenz.hamburg.de/), from Freie und Hansestadt Hamburg, Landesbetrieb Geoinformation und Vermessung (LGV), in compliance with the licence dl-de/by-2-0 (https://www.govdata.de/dl-de/by-2-0).
Content:
1. Input footprints of non-identical pairs: input_reference_objects.zip
2. Results without additional position deviation: results_without_deviation.zip
3. Results with generated position deviation, including geometries: results_with_deviation.zip
We identify situations in which conditioning on text can address confounding in observational studies. We argue that a matching approach is particularly well-suited to this task, but existing matching methods are ill-equipped to handle high-dimensional text data. Our proposed solution is to estimate a low-dimensional summary of the text and condition on this summary via matching. We propose a method of text matching, topical inverse regression matching, that allows the analyst to match both on the topical content of confounding documents and the probability that each of these documents is treated. We validate our approach and illustrate the importance of conditioning on text to address confounding with two applications: the effect of perceptions of author gender on citation counts in the international relations literature and the effects of censorship on Chinese social media users.
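The general recipe described above (summarize text in low dimension, then match) can be sketched with off-the-shelf tools: LDA topic proportions plus an estimated treatment probability, with nearest-neighbor pairing in that joint space. This is a simplified stand-in on synthetic data, not the paper's topical inverse regression matching.

```python
# Simplified text-matching sketch: topic proportions + estimated treatment
# probability, then nearest-neighbor pairing of treated and control documents.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

docs = ["trade policy and tariffs", "war and conflict resolution",
        "tariff negotiations and trade", "peacekeeping and conflict"]
treated = np.array([1, 0, 1, 0])  # synthetic treatment indicator

counts = CountVectorizer().fit_transform(docs)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)
propensity = LogisticRegression().fit(topics, treated).predict_proba(topics)[:, 1]

features = np.column_stack([topics, propensity])
controls = np.where(treated == 0)[0]
for i in np.where(treated == 1)[0]:
    j = controls[np.argmin(np.linalg.norm(features[controls] - features[i], axis=1))]
    print(f"treated doc {i} matched to control doc {j}")
```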
This paper investigates the method of matching with regard to two crucial implementation choices: the distance measure and the type of algorithm. We implement optimal full matching, a fully efficient algorithm, and present a framework for statistical inference. The implementation uses data from the NLSY79 to study the effect of college education on earnings. We find that decisions regarding the matching algorithm depend on the structure of the data: in the case of strong selection into treatment and treatment effect heterogeneity, full matching seems preferable. If heterogeneity is weak, pair matching suffices.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/SURSEO
We propose a simplified approach to matching for causal inference that simultaneously optimizes both balance (similarity between the treated and control groups) and matched sample size. Existing approaches either fix the matched sample size and maximize balance or fix balance and maximize sample size, leaving analysts to settle for suboptimal solutions or attempt manual optimization by iteratively tweaking their matching method and rechecking balance. To jointly maximize balance and sample size, we introduce the matching frontier, the set of matching solutions with maximum balance for each possible sample size. Rather than iterating, researchers can choose matching solutions from the frontier for analysis in one step. We derive fast algorithms that calculate the matching frontier for several commonly used balance metrics. With analyses of the effect of sex on judging and of job training programs, we demonstrate how the methods we introduce can extract new knowledge from existing data sets.
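A crude way to build intuition for the frontier is to greedily drop, one at a time, the control unit whose removal most reduces an imbalance metric, recording balance at every sample size. The Python sketch below does this for a single covariate on synthetic data; the paper's algorithms compute the frontier exactly for specific balance metrics, which this greedy loop does not guarantee.

```python
# Greedy illustration of the matching-frontier idea: trace balance (absolute
# standardized mean difference on one covariate) as control units are pruned.
import numpy as np

rng = np.random.default_rng(0)
x_treated = rng.normal(1.0, 1.0, 50)
x_control = list(rng.normal(0.0, 1.0, 200))

def imbalance(control):
    pooled_sd = np.sqrt((np.var(x_treated) + np.var(control)) / 2)
    return abs(np.mean(x_treated) - np.mean(control)) / pooled_sd

frontier = []  # (number of controls retained, imbalance)
while len(x_control) > 1:
    frontier.append((len(x_control), imbalance(x_control)))
    # drop the control unit whose removal yields the lowest remaining imbalance
    best = min(range(len(x_control)),
               key=lambda i: imbalance(x_control[:i] + x_control[i + 1:]))
    x_control.pop(best)

for n, b in frontier[::50]:
    print(f"{n} controls retained, imbalance = {b:.3f}")
```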
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets used for evaluating state-of-the-art schema matching methods in the paper "Valentine: Evaluating Matching Techniques for Dataset Discovery", which was accepted for presentation at IEEE ICDE 2021. They come in the form of fabricated pairs respecting a relatedness scenario as discussed in the paper.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains Index Match and advanced Index Match examples.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Run main.m in Matlab to produce the matching results. data_BJ and data_HK are two datasets for testing the geospatial data matching methods. The objects are represented by their centroids and corresponding vertices.
The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") from the product category computers. The data is available in the form of training, validation, and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets, as it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g., products without training data in the training set or products with introduced typos. These can be used to measure the performance of methods on these kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites marking up their offers with schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Fuzzy string matching remains a key issue when political scientists combine data from different sources. Existing matching methods invariably rely on string distances, such as Levenshtein distance and cosine similarity. As such, they are inherently incapable of matching strings that refer to the same entity with different names, such as "JP Morgan" and "Chase Bank", "DPRK" and "North Korea", or "Chuck Fleischmann (R)" and "Charles Fleischmann (R)". In this letter, we propose to use large language models to entirely sidestep this problem in an easy and intuitive manner. Extensive experiments show that our proposed methods can improve the state of the art by as much as 39% in terms of average precision while being substantially easier and more intuitive to use by political scientists. Moreover, our results are robust against various temperatures. We further note that enhanced prompting can lead to additional performance improvements.
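The prompting idea can be reduced to a few lines: ask the model directly whether two strings name the same entity. In the sketch below, call_llm is a hypothetical placeholder for whatever chat-completion client is used, and the prompt wording is ours rather than the letter's.

```python
# Bare-bones sketch of LLM-based entity matching: prompt the model for a
# yes/no judgment instead of comparing edit distances.
def build_prompt(name_a: str, name_b: str) -> str:
    return (
        "Do the following two names refer to the same real-world entity? "
        "Answer only 'yes' or 'no'.\n"
        f"Name A: {name_a}\nName B: {name_b}"
    )

def call_llm(prompt: str, temperature: float = 0.0) -> str:
    # Hypothetical stand-in; replace with your LLM client of choice.
    raise NotImplementedError

def same_entity(name_a: str, name_b: str) -> bool:
    answer = call_llm(build_prompt(name_a, name_b))
    return answer.strip().lower().startswith("yes")

# e.g. same_entity("DPRK", "North Korea") should return True with a capable model,
# whereas the Levenshtein distance between the two strings is large.
```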
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides match-level information from the Indian Premier League (IPL), featuring details from multiple seasons. It is curated for those interested in analyzing trends, performances, and patterns.
Key Features:
Covers multiple IPL seasons
Includes team names, toss winner, toss decision
Match winner and result type (normal, tie, no result)
Player of the match, venue, and city information
Useful for data analysis, visualization, and machine learning tasks
This dataset is perfect for beginners exploring sports analytics, students practicing data wrangling, and enthusiasts building IPL-based projects or dashboards.
Observational studies of causal effects often use multivariate matching to control imbalances in measured covariates. For instance, using network optimization, one may seek the closest possible pairing for key covariates among all matches that balance a propensity score and finely balance a nominal covariate, perhaps one with many categories. This is all straightforward when matching thousands of individuals, but requires some adjustments when matching tens or hundreds of thousands of individuals. In various senses, a sparser network—one with fewer edges—permits optimization in larger samples. The question is: What is the best way to make the network sparse for matching? A network that is too sparse will eliminate from consideration possible pairings that it should consider. A network that is not sparse enough will waste computation considering pairings that do not deserve serious consideration. We propose a new graded strategy in which potential pairings are graded, with a preference for higher grade pairings. We try to match with pairs of the best grade, incorporating progressively lower grade pairs only to the degree they are needed. In effect, only sparse networks are built, stored and optimized. Two examples are discussed, a small example with 1567 matched pairs from clinical medicine, and a slightly larger example with 22,111 matched pairs from economics. The method is implemented in an R package RBestMatch available at https://github.com/ruoqiyu/RBestMatch. Supplementary materials for this article are available online.
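The graded strategy can be caricatured with calipers: treat progressively wider propensity-score calipers as grades, and consider lower-grade pairings only for treated units that remain unmatched. The greedy Python sketch below conveys why the candidate-edge set stays sparse; it is not the paper's optimal network algorithm or the RBestMatch package.

```python
# Toy graded matching: match within the tightest caliper first, then widen the
# caliper only for treated units that are still unmatched.
import numpy as np

rng = np.random.default_rng(1)
ps_treated = rng.uniform(0.3, 0.8, 20)   # synthetic propensity scores
ps_control = rng.uniform(0.1, 0.7, 60)

grades = [0.01, 0.05, 0.20]  # calipers defining grade 1, 2, 3 pairings
matched = {}                 # treated index -> control index
used = set()

for caliper in grades:
    for t in range(len(ps_treated)):
        if t in matched:
            continue
        candidates = [c for c in range(len(ps_control))
                      if c not in used and abs(ps_treated[t] - ps_control[c]) <= caliper]
        if candidates:
            best = min(candidates, key=lambda c: abs(ps_treated[t] - ps_control[c]))
            matched[t] = best
            used.add(best)

print(f"{len(matched)} of {len(ps_treated)} treated units matched")
```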
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Football Match Dataset
This dataset contains detailed information and statistics from football matches. Curated and created by Your Name, it is designed for sports analytics, machine learning projects, and research related to football match events, outcomes, and player performance.
Table of Contents
Overview
Dataset Description
Data Fields
Data Files
Usage
Installation
About the Creator
Contributing
License
Citation
Contact
Overview
The Football Match Dataset – First Version provides comprehensive historical data about football matches. This dataset is ideal for:
Analyzing match outcomes and trends
Building machine learning models (e.g., match result prediction, event classification)
Researching match statistics and game dynamics
The dataset is available on Kaggle and was created by me, [Your Name].
Dataset Description
This dataset includes various data points collected from football matches. Key aspects of the dataset include:
Match Information: Dates, venues, participating teams, and final scores.
Match Events: Detailed events during the match such as goals, assists, fouls, substitutions, etc.
Additional Metrics: Depending on the version, statistics on possession, shots, passes, and player performance might be included.
Note: For a complete list of fields and detailed descriptions, please refer to the accompanying documentation on the Kaggle page.
Data Fields
Although the fields may be updated in future versions, the current dataset generally contains:
Match_ID: A unique identifier for each match.
Date: The match date.
Home_Team: Name of the home team.
Away_Team: Name of the away team.
Home_Score: Goals scored by the home team.
Away_Score: Goals scored by the away team.
Events: A detailed log of match events (e.g., goal scorers, cards, substitutions).
Additional Metrics: Columns with advanced match statistics (if available).
Data Files
The dataset is distributed as one or more CSV files. Common file names include:
football_match_dataset_first_version.csv
Additional metadata or supporting files may be included
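Assuming the file name above and the fields listed earlier, a minimal pandas starting point might look like this; the column names may need adjusting to the actual file.

```python
# Minimal exploration of the match-level CSV described in this README.
import pandas as pd

matches = pd.read_csv("football_match_dataset_first_version.csv", parse_dates=["Date"])
print(matches[["Match_ID", "Date", "Home_Team", "Away_Team",
               "Home_Score", "Away_Score"]].head())

# e.g., total goals per match
matches["Total_Goals"] = matches["Home_Score"] + matches["Away_Score"]
print(matches["Total_Goals"].describe())
```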
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Acknowledgements: This data was obtained from an online source.
Inspiration: Apply your own techniques and tools to this data.
Replication Data for: The Balance-Sample Size Frontier in Matching Methods for Causal Inference
In this research, we propose a variant of the classical Matching Pursuit Decomposition (MPD) algorithm with significantly improved scalability and computational performance. MPD is a powerful iterative algorithm that decomposes a signal into linear combinations of its dictionary elements or "atoms". A best-fit atom from an arbitrarily defined dictionary is determined through cross-correlation. The selected atom is subtracted from the signal, and this procedure is repeated on the residual in the subsequent iterations until a stopping criterion is met. A sufficiently large dictionary is required for an accurate reconstruction; this in turn increases the computational burden of the algorithm, thus limiting its applicability and level of adoption. Our main contribution lies in improving the computational efficiency of the algorithm to allow faster decomposition while maintaining a similar level of accuracy. The Correlation Thresholding and Multiple Atom Extractions techniques were proposed to decrease the computational burden of the algorithm. Correlation thresholds prune insignificant atoms from the dictionary. The ability to extract multiple atoms within a single iteration enhanced the effectiveness and efficiency of each iteration. The proposed algorithm, entitled MPD++, was demonstrated using a real-world data set.
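To make the loop described above concrete, here is a plain matching pursuit sketch in Python with a simple correlation-based stopping threshold and a random illustrative dictionary; the MPD++ optimizations (dictionary pruning via correlation thresholding and multiple atom extraction per iteration) are deliberately omitted.

```python
# Plain matching pursuit: pick the atom with the largest correlation to the
# residual, subtract its contribution, and repeat until the correlation is tiny.
import numpy as np

rng = np.random.default_rng(0)
n, n_atoms = 256, 512
dictionary = rng.standard_normal((n, n_atoms))
dictionary /= np.linalg.norm(dictionary, axis=0)           # unit-norm atoms

signal = 3.0 * dictionary[:, 7] - 2.0 * dictionary[:, 100]  # sparse ground truth
residual = signal.copy()
coeffs = np.zeros(n_atoms)

for _ in range(10):                                         # max iterations
    correlations = dictionary.T @ residual                  # inner products with all atoms
    best = np.argmax(np.abs(correlations))
    if np.abs(correlations[best]) < 1e-6:                   # stopping threshold
        break
    coeffs[best] += correlations[best]
    residual -= correlations[best] * dictionary[:, best]    # subtract the selected atom

print("selected atoms:", np.nonzero(coeffs)[0],
      "residual norm:", np.linalg.norm(residual))
```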
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Linear feature matching is one of the crucial components for data conflation that sees its usefulness in updating existing data through the integration of newer data and in evaluating data accuracy. This article presents a simplified linear feature matching method to conflate historical and current road data. To measure the similarity, the shorter line median Hausdorff distance (SMHD), the absolute value of cosine similarity (aCS) of the weighted linear directional mean values, and topological relationships are adopted. The decision tree analysis is employed to derive thresholds for the SMHD and the aCS. To demonstrate the usefulness of the simple linear feature matching method, four models with incremental configurations are designed and tested: (1) Model 1: one-to-one matching based on the SMHD; (2) Model 2: matching with only the SMHD threshold; (3) Model 3: matching with the SMHD and the aCS thresholds; and (4) Model 4: matching with the SMHD, the aCS, and topological relationships. These experiments suggest that Model 2, which considers only distance, does not provide stable results, while Models 3 and 4, which consider direction and topological relationships, produce stable results with levels of accuracy around 90% and 95%, respectively. The results suggest that the proposed method is simple yet robust for linear feature matching.