CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Abstract: Scholars studying organizations often work with multiple datasets that lack shared unique identifiers or covariates. In such situations, researchers usually combine datasets with approximate string ("fuzzy") matching methods. String matching, although useful, faces fundamental challenges: even when two strings appear similar to humans, fuzzy matching often fails because it does not adapt to the informativeness of particular character combinations. In response, a number of machine-learning methods have been developed to refine string matching, but their effectiveness is limited by the size and diversity of available training data. This paper introduces data from a prominent employment networking site (LinkedIn) as a massive training corpus that addresses these limitations. We show how information from LinkedIn about organizational name-to-name links can improve upon existing matching benchmarks: the trillions of name-pair examples from LinkedIn can be incorporated into various methods to improve performance by explicitly maximizing match probabilities inferred from the LinkedIn corpus. We also show how relationships between organization names can be modeled using a network representation of the LinkedIn data. In illustrative merging tasks involving lobbying firms, we document improvements when using the LinkedIn corpus in matching calibration, and we make all data and methods open source. Keywords: Record linkage; Interest groups; Text as data; Unstructured data
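To illustrate the calibration idea described above, the sketch below estimates empirical match rates as a function of a generic string-similarity score using labeled name pairs. The `labeled_pairs` list is a hypothetical stand-in for name-to-name links such as those derived from the LinkedIn corpus; the similarity function and the binning rule are illustrative assumptions, not the authors' method.

```python
from difflib import SequenceMatcher
from collections import defaultdict

# `labeled_pairs` is a hypothetical stand-in for organizational name-to-name
# links such as those derived from the corpus described above.
labeled_pairs = [
    ("JPMorgan Chase & Co.", "JP Morgan", 1),
    ("JPMorgan Chase & Co.", "Morgan Stanley", 0),
    # ... many more (name_a, name_b, same_organization) examples
]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Estimate the empirical match rate within similarity bins; a matching rule
# can then pick the lowest bin that clears a desired precision target.
bins = defaultdict(lambda: [0, 0])  # bin -> [matches, total]
for a, b, y in labeled_pairs:
    k = round(similarity(a, b), 1)
    bins[k][0] += y
    bins[k][1] += 1

for k in sorted(bins):
    m, n = bins[k]
    print(f"similarity ~{k:.1f}: empirical match rate {m / n:.2f} (n={n})")
```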
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/ZTDHVE
Matching methods improve the validity of causal inference by reducing model dependence and offering intuitive diagnostics. While they have become a part of the standard tool kit across disciplines, matching methods are rarely used when analyzing time-series cross-sectional data. We fill this methodological gap. In the proposed approach, we first match each treated observation with control observations from other units in the same time period that have an identical treatment history up to the pre-specified number of lags. We use standard matching and weighting methods to further refine this matched set so that the treated and matched control observations have similar covariate values. Assessing the quality of matches is done by examining covariate balance. Finally, we estimate both short-term and long-term average treatment effects using the difference-in-differences estimator, accounting for a time trend. We illustrate the proposed methodology through simulation and empirical studies. An open-source software package is available for implementing the proposed methods.
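As a rough illustration of the first step described above, the sketch below collects, for each treated unit-period, the control units observed in the same period whose treatment history over the previous lags is identical. It assumes a hypothetical long-format pandas DataFrame with columns "unit", "time", and a binary "treat" indicator; covariate refinement and the difference-in-differences step are omitted.

```python
import pandas as pd

def matched_sets(df: pd.DataFrame, lags: int = 3) -> dict:
    """Map (treated unit, period) -> control units with the same treatment history."""
    wide = df.pivot(index="unit", columns="time", values="treat")
    sets = {}
    for t in sorted(wide.columns)[lags:]:
        prev = sorted(c for c in wide.columns if c < t)[-lags:]
        treated = wide.index[wide[t] == 1]
        controls = wide.index[wide[t] == 0]
        for u in treated:
            history = wide.loc[u, prev]
            # controls in the same period with an identical lagged treatment history
            sets[(u, t)] = [c for c in controls if wide.loc[c, prev].equals(history)]
    return sets
```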
Fuzzy string matching remains a key issue when political scientists combine data from different sources. Existing matching methods invariably rely on string distances, such as Levenshtein distance and cosine similarity. As such, they are inherently incapable of matching strings that refer to the same entity under different names, such as "JP Morgan" and "Chase Bank", "DPRK" and "North Korea", or "Chuck Fleischmann (R)" and "Charles Fleischmann (R)". In this letter, we propose using large language models to sidestep this problem entirely in an easy and intuitive manner. Extensive experiments show that our proposed methods can improve the state of the art by as much as 39% in terms of average precision while being substantially easier and more intuitive for political scientists to use. Moreover, our results are robust across various temperature settings. We further note that enhanced prompting can lead to additional performance improvements.
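The sketch below shows one way such an LLM-based matcher could be set up for pairwise queries; the prompt wording and the model name are illustrative assumptions rather than the authors' exact configuration.

```python
# A minimal sketch of pairwise entity matching with a chat-based LLM.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def same_entity(name_a: str, name_b: str, model: str = "gpt-4o-mini") -> bool:
    prompt = (
        "Do the following two names refer to the same real-world entity? "
        f'Answer only "yes" or "no".\nName 1: {name_a}\nName 2: {name_b}'
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # the letter reports robustness across temperatures
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

print(same_entity("DPRK", "North Korea"))
```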
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Motivation: Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets that include both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance reproducibility and comparability, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits. Dataset description: An augmented version of the WDC phones dataset for benchmarking entity matching/record linkage methods, found at: http://webdatacommons.org/productcorpus/index.html#toc4. The augmented version adds fixed splits for training, validation and testing as well as their corresponding feature vectors. The feature vectors are built using data-type-specific similarity metrics. The dataset contains 447 records describing products from 17 e-shops, which are matched against a product catalog of 50 products. The gold standard has manual annotations for 258 matching and 22,092 non-matching pairs. The total number of attributes used to describe the product records is 26, and the attribute density is 0.25. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
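For context, the sketch below shows how a similarity-based feature vector for a single record pair can be assembled from data-type-specific metrics; the attribute names and the particular metrics are illustrative and not necessarily those used to build this benchmark.

```python
# Build one feature vector for a record pair using type-specific similarities.
def token_jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def numeric_sim(a: float, b: float) -> float:
    return 1.0 - abs(a - b) / max(abs(a), abs(b), 1e-9)

offer = {"title": "Nokia 3310 dual sim", "price": 59.0}      # illustrative records
catalog = {"title": "Nokia 3310 (2017) Dual SIM", "price": 61.5}

features = [
    token_jaccard(offer["title"], catalog["title"]),   # string attribute
    numeric_sim(offer["price"], catalog["price"]),     # numeric attribute
]
print(features)  # feed into any binary classifier for match / non-match
```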
Separation is achieved by intelligence-based matching of the curvelet coefficients.
In this research, we propose a variant of the classical Matching Pursuit Decomposition (MPD) algorithm with significantly improved scalability and computational performance. MPD is a powerful iterative algorithm that decomposes a signal into linear combinations of its dictionary elements, or “atoms”. A best-fit atom from an arbitrarily defined dictionary is determined through cross-correlation. The selected atom is subtracted from the signal, and this procedure is repeated on the residual in subsequent iterations until a stopping criterion is met.
A sufficiently large dictionary is required for an accurate reconstruction; this in turn increases the computational burden of the algorithm, limiting its applicability and level of adoption. Our main contribution lies in improving the computational efficiency of the algorithm to allow faster decomposition while maintaining a similar level of accuracy. We propose Correlation Thresholding and Multiple Atom Extraction techniques to decrease the computational burden: correlation thresholds prune insignificant atoms from the dictionary, and extracting multiple atoms within a single iteration enhances the effectiveness and efficiency of each iteration. The proposed algorithm, entitled MPD++, was demonstrated on a real-world data set.
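A minimal numpy sketch of the two speedups named above is given below: a correlation threshold prunes weak atoms, and several atoms are extracted per iteration. Normalization, stopping rules and parameter values are illustrative assumptions, not the MPD++ implementation itself.

```python
import numpy as np

def mpd_plus(x, D, corr_thresh=0.1, atoms_per_iter=3, max_iter=50, tol=1e-6):
    """x: signal of shape (n,); D: dictionary of shape (n, k) with unit-norm columns."""
    residual = x.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(max_iter):
        if np.linalg.norm(residual) < tol:
            break
        corr = D.T @ residual                      # cross-correlation with every atom
        strong = np.where(np.abs(corr) >= corr_thresh)[0]   # correlation thresholding
        if strong.size == 0:
            break
        # extract the few strongest atoms this iteration; selection uses the
        # correlations computed at the start of the iteration (a greedy shortcut)
        picked = strong[np.argsort(-np.abs(corr[strong]))[:atoms_per_iter]]
        for j in picked:
            c = D[:, j] @ residual                 # refresh before subtracting
            coeffs[j] += c
            residual = residual - c * D[:, j]
    return coeffs, residual
```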
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Toponym matching, i.e. pairing strings that represent the same real-world location, is a fundamental problem for several practical applications. The current state of the art relies on string similarity metrics, either specifically developed for matching place names or integrated within methods that combine multiple metrics. However, these methods all rely on common sub-strings in order to establish similarity, and they do not effectively capture the character replacements involved in toponym changes due to transliterations or to changes in language and culture over time. In this article, we present a novel matching approach, leveraging a deep neural network to classify pairs of toponyms as either matching or non-matching. The proposed network architecture uses recurrent nodes to build representations from the sequences of bytes that correspond to the strings that are to be matched. These representations are then combined and passed to feed-forward nodes, finally leading to a classification decision. We present the results of a wide-ranging evaluation of the performance of the proposed method, using a large dataset collected from the GeoNames gazetteer. These results show that the proposed method can significantly outperform individual similarity metrics from previous studies, as well as previous methods based on supervised machine learning for combining multiple metrics.
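A minimal PyTorch sketch of this kind of architecture is shown below: byte-level recurrent encoders for each toponym, whose representations are concatenated and passed to a feed-forward classifier. Layer sizes, sequence length and the padding scheme are illustrative assumptions, not the exact network from the article.

```python
import torch
import torch.nn as nn

class ToponymMatcher(nn.Module):
    def __init__(self, emb_dim=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(256, emb_dim)          # one embedding per byte value
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def encode(self, byte_seq):                          # (batch, seq_len) int64
        _, h = self.encoder(self.embed(byte_seq))
        return h[-1]                                     # final hidden state

    def forward(self, a, b):
        return self.classifier(torch.cat([self.encode(a), self.encode(b)], dim=1))

def to_bytes(name, max_len=40):
    raw = name.encode("utf-8")[:max_len].ljust(max_len, b"\0")
    return torch.tensor(list(raw), dtype=torch.long).unsqueeze(0)

model = ToponymMatcher()
logit = model(to_bytes("Lisboa"), to_bytes("Lisbon"))    # train with BCEWithLogitsLoss
```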
We introduce a new "Monotonic Imbalance Bounding" (MIB) class of matching methods for causal inference with a surprisingly large number of attractive statistical properties. MIB generalizes and extends in several new directions the only existing class, "Equal Percent Bias Reducing" (EPBR), which is designed to satisfy weaker properties and only in expectation. We also offer strategies to obtain specific members of the MIB class and analyze in more detail one member, called Coarsened Exact Matching, whose properties we examine from this new perspective. We offer a variety of analytical results and numerical simulations that demonstrate how members of the MIB class can dramatically improve inferences relative to EPBR-based matching methods. See also: Causal Inference
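Since Coarsened Exact Matching is analyzed above, the sketch below illustrates its basic mechanics: coarsen numeric covariates into bins, form strata from the coarsened signature, and keep only strata containing both treated and control units. The DataFrame layout, the column name "treated" and the bin counts are illustrative assumptions.

```python
import pandas as pd

def cem_match(df: pd.DataFrame, covariates: dict) -> pd.DataFrame:
    """covariates maps column name -> number of coarsening bins."""
    work = df.copy()
    coarsened = pd.DataFrame({
        col: pd.cut(work[col], bins=k, labels=False) for col, k in covariates.items()
    })
    work["_stratum"] = coarsened.astype(str).agg("-".join, axis=1)
    # keep only strata that contain both treated and control observations
    counts = work.groupby("_stratum")["treated"].nunique()
    keep = counts[counts == 2].index
    return work[work["_stratum"].isin(keep)].drop(columns="_stratum")
```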
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") from the product category computers. The data is available in the form of training, validation and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets because it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g. products without training data in the training set or products which have had typos introduced. These can be used to measure the performance of methods on these kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites marking up their offers with schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes.
In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs for a possible validation split (stratified random draw) are available for each training set. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived from shared product identifiers on the Web via weak supervision.
The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
In this paper we introduce the Random Recursive Partitioning (RRP) matching method. RRP generates a proximity matrix which may be useful in econometric applications such as average treatment effect estimation. RRP is a Monte Carlo method that randomly generates non-empty recursive partitions of the data and evaluates the proximity between two observations as the empirical frequency with which they fall in the same cell of these random partitions over all Monte Carlo replications. From the proximity matrix it is possible to derive both graphical and analytical tools to evaluate the extent of the common support between data sets. The RRP method is honest in that it does not match observations at any cost: if data sets are separated, the method clearly states it. The match obtained with RRP is invariant under monotonic transformations of the data. Average treatment effect estimators derived from the proximity matrix seem to be competitive compared with more commonly used estimators. The RRP method does not require a particular structure of the data, and for this reason it can be applied when distances like Mahalanobis or Euclidean are not suitable, in the presence of missing data, or when the estimated propensity score is too sensitive to model specification.
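The sketch below illustrates the proximity construction described above: repeatedly draw random recursive partitions of the data and record how often each pair of observations falls in the same terminal cell. The splitting rule (random covariate, random quantile cut, minimum cell size) is an illustrative assumption rather than the exact RRP procedure.

```python
import numpy as np

def rrp_proximity(X, n_rep=200, min_cell=5, rng=None):
    """X: (n, k) data matrix. Returns an (n, n) proximity matrix in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = X.shape[0]
    prox = np.zeros((n, n))

    def split(idx):
        if idx.size <= min_cell:
            return [idx]
        j = rng.integers(X.shape[1])                          # random covariate
        cut = np.quantile(X[idx, j], rng.uniform(0.2, 0.8))   # random cut point
        left, right = idx[X[idx, j] <= cut], idx[X[idx, j] > cut]
        if left.size == 0 or right.size == 0:                 # keep partitions non-empty
            return [idx]
        return split(left) + split(right)

    for _ in range(n_rep):
        for cell in split(np.arange(n)):
            prox[np.ix_(cell, cell)] += 1                     # same-cell co-occurrence
    return prox / n_rep
```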
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Field data is often limited with regard to causal inference. This is partly because randomization techniques are often impractical or unethical within certain fields (e.g., randomly assigning individuals to different types of classroom instruction in educational settings). Matching procedures, like propensity score matching (PSM; Rosenbaum & Rubin, 1983), are regularly used to strengthen interpretations of group-membership effects in field research. By matching individuals from different subgroups of a field sample (e.g., participation vs. nonparticipation in a special education program), relevant confounds of group-membership effects (e.g., socio-economic status) can be balanced out and thereby eliminated retrospectively. In this way, matching turns field data into quasi-experimental data. Currently, the most prominent approach to matching individuals is nearest neighbor matching (NNM) (see Austin, 2014; Austin & Stuart, 2015; Heinz et al., 2022; Jacovidis, 2017). Available statistical software (e.g., R packages like MatchIt; Ho et al., 2011), however, does not fully realize the potential of NNM to reduce sample-related bias in field data, due to unsystematic procedures for identifying apt pairs to match. Furthermore, existing matching applications are limited to two-group designs (that said, weighting applications for more than two groups do exist, e.g., MMW-S; Hong, 2012). In addition, balance estimation, as a matching quality check, is often conducted only rudimentarily (e.g., by solely reporting between-group post-matching differences). So far, conventions on balance estimation for more than two groups are absent. To address these shortcomings, we developed a systematic algorithm designed for matching individuals from two or more groups, alongside a set of adequate balance estimates. We call it “MAGMA” (for MAny-Group MAtching). In this work, we demonstrate and evaluate the MAGMA algorithm using two empirical examples from extensive field data.
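For orientation, the sketch below shows a bare-bones greedy nearest-neighbor matching across groups on a one-dimensional matching score, together with a standardized-mean-difference balance check. It illustrates the general idea only and is not the MAGMA algorithm; the score arrays and group labels are hypothetical.

```python
import numpy as np

def greedy_nn_match(scores: dict) -> list:
    """scores: group label -> 1-D numpy array of matching scores (e.g., propensity)."""
    labels = list(scores)
    ref = labels[0]                                   # first group sets the pace
    used = {g: set() for g in labels[1:]}
    matched = []
    for i, s in enumerate(scores[ref]):
        row = {ref: i}
        for g in labels[1:]:
            free = [j for j in range(len(scores[g])) if j not in used[g]]
            if not free:
                return matched                        # a group ran out of candidates
            j = min(free, key=lambda j: abs(scores[g][j] - s))
            used[g].add(j)
            row[g] = j
        matched.append(row)
    return matched

def smd(a: np.ndarray, b: np.ndarray) -> float:
    """Standardized mean difference of one covariate between two matched groups."""
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled if pooled > 0 else 0.0
```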
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Currently, the repository provides code for two such methods. The ABE fully automated approach: this is a fully automated method for linking historical datasets (e.g., complete-count censuses) by first name, last name and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chance of false positives, our approach suggests testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact. A fully automated probabilistic approach (EM): this approach (Abramitzky, Mill, and Perez 2019) is a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between any two potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
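The sketch below illustrates the name-and-age linking idea using the jellyfish library for NYSIIS standardization and Jaro-Winkler similarity; the thresholds, the age band and the record layout are illustrative assumptions, and the repository's own code implements the full set of robustness checks (e.g., uniqueness within a five-year window).

```python
import jellyfish

def standardize(first: str, last: str) -> tuple:
    return jellyfish.nysiis(first), jellyfish.nysiis(last)

def candidate_links(records_a, records_b, max_age_diff=2, min_sim=0.9):
    """Each record: dict with 'first', 'last', 'age'. Returns candidate record pairs."""
    links = []
    for a in records_a:
        for b in records_b:
            if abs(a["age"] - b["age"]) > max_age_diff:
                continue                               # ages too far apart
            if standardize(a["first"], a["last"]) == standardize(b["first"], b["last"]):
                links.append((a, b))                   # match on standardized names
            elif (jellyfish.jaro_winkler_similarity(a["last"], b["last"]) >= min_sim
                  and jellyfish.jaro_winkler_similarity(a["first"], b["first"]) >= min_sim):
                links.append((a, b))                   # match on raw-name similarity
    return links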
Replication Data for: The Balance-Sample Size Frontier in Matching Methods for Causal Inference
Climate Similarity Support Scripts. This compressed file contains scripts and supporting data necessary to reproduce the analyses and products associated with Doherty et al. (2017), “Matching seed to site by climate similarity: Techniques to prioritize plant materials development and use in restoration.” File: Climate Similarity Scripts.zip
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
We present a robust Global Matching technique focused on 3D mapping applications using laser range-finders. Our approach works under the assumption that places can be recognized by analyzing the projection of the observed points along the gravity direction. Relative poses between pairs of 3D point clouds are estimated by aligning their 2D projective representations, benefiting from the corresponding dimensional reduction. We present the complete processing pipeline for two different applications that use the global matcher as a core component. First, the global matcher is used for the registration of static scan sets where no a priori information about the relative poses is available. It is combined with an effective procedure for validating the matches that exploits the implicit empty-space information associated with single acquisitions. In the second use case, the global matcher is used for the loop detection required by 3D SLAM applications. We use an Extended Kalman Filter to obtain a belief over the map poses, which allows us to validate matches and to execute hierarchical overlap tests that reduce the number of potential matches to be evaluated. Additionally, the global matcher is combined with a fast local technique. In both use cases, the global reconstruction problem is modeled as a sparse graph, where scan poses (nodes) are connected through matches (edges). The graph structure allows formulating a sparse global optimization problem that optimizes scan poses, considering all accepted matches simultaneously. Our approach is being used in production systems and has been successfully evaluated on several real and publicly available datasets.
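The sketch below illustrates only the projection step described above: each 3D point cloud is dropped onto the plane orthogonal to gravity as a 2D occupancy grid, and a relative translation is estimated by 2D cross-correlation. The grid resolution, the extent, and the assumption that gravity is aligned with the z axis are illustrative; the full pipeline also recovers rotation and validates matches.

```python
import numpy as np
from scipy.signal import correlate2d

def occupancy_grid(points, cell=0.2, extent=50.0):
    """points: (N, 3) array; gravity assumed along the z axis, which is dropped."""
    edges = np.arange(-extent, extent + cell, cell)
    grid, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=[edges, edges])
    return (grid > 0).astype(float)

def estimate_translation(cloud_a, cloud_b, cell=0.2):
    ga, gb = occupancy_grid(cloud_a, cell), occupancy_grid(cloud_b, cell)
    corr = correlate2d(ga, gb, mode="same")
    di, dj = np.unravel_index(np.argmax(corr), corr.shape)
    center = np.array(corr.shape) // 2
    # approximate planar offset in the grid's units; sign convention depends
    # on the correlation definition and should be checked against the data
    return (np.array([di, dj]) - center) * cell
```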
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
Supplementary materials – tables:
Table E1a. Characteristics of PSM strategies for comparison one
Table E1b. Characteristics of PSM strategies for comparison two
Table E2a. Coarsening of covariates used in CEM for comparison one
Table E2b. Coarsening of covariates used in CEM for comparison two
Table E3a. Characteristics of CEM strategies for comparison one
Table E3b. Characteristics of CEM strategies for comparison two
Supplementary materials – figures:
Figure E1a. Selection process for comparison one
Figure E1b. Selection process for comparison two
Figure E2a. Distribution of baseline covariates by PSM caliper width for comparison one
Figure E2b. Distribution of baseline covariates by PSM caliper width for comparison two
Figure E3a. Distribution of baseline covariates by CEM strategy for comparison one
Figure E3b. Distribution of baseline covariates by CEM strategy for comparison two
Abstract
Aims & Methods: We compared propensity score matching (PSM) and coarsened exact matching (CEM) in balancing baseline characteristics between treatment groups using observational data obtained from a pan-Canadian prostate cancer radiotherapy database. Changes in effect estimates were evaluated as a function of improvements in balance, using results from RCTs to guide interpretation. Results: CEM and PSM improved balance between groups in both comparisons, while retaining the majority of original data. Improvements in balance were associated with effect estimates closer to those obtained in RCTs. Conclusions: CEM and PSM led to substantial improvements in balance between comparison groups, while retaining a considerable proportion of original data. This could lead to improved accuracy in effect estimates obtained using observational data in a variety of clinical situations.
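To make the PSM strategy concrete, the sketch below estimates propensity scores with logistic regression and performs 1:1 greedy nearest-neighbor matching on the logit of the propensity score within a caliper; the column names and the 0.2-standard-deviation caliper rule are common conventions used for illustration, not necessarily the exact strategies compared in this study.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def psm_caliper_match(df: pd.DataFrame, covs: list, caliper_sd: float = 0.2):
    """1:1 greedy matching on the logit of the propensity score, within a caliper."""
    model = LogisticRegression(max_iter=1000).fit(df[covs], df["treated"])
    ps = model.predict_proba(df[covs])[:, 1]
    logit = pd.Series(np.log(ps / (1 - ps)), index=df.index)
    caliper = caliper_sd * logit.std()
    controls = list(df.index[df["treated"] == 0])
    pairs = []
    for t in df.index[df["treated"] == 1]:
        if not controls:
            break
        c = min(controls, key=lambda c: abs(logit[c] - logit[t]))
        if abs(logit[c] - logit[t]) <= caliper:
            pairs.append((t, c))
            controls.remove(c)       # matching without replacement
    return pairs
```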
This research aims to solve a selection problem leading to a decision. In the profile matching method, each parameter is assessed by the difference (gap) between the target value and the value held by an individual. There are two important parameter groups in this method: core factors and secondary factors. These values are converted into percentage weights to produce a final score indicating which data are closest to the predetermined targets. With this method, the data can be sorted dynamically against specific criteria.
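The sketch below illustrates the gap-weighting logic described above: each criterion is scored by its gap from the target value, and core and secondary factors are combined with percentage weights. The gap-to-score rule and the 60/40 split are common conventions used here for illustration, not fixed parts of this dataset.

```python
def gap_score(value: int, target: int) -> float:
    return max(0.0, 5.0 - abs(value - target))   # smaller gap -> higher score

def profile_match(candidate: dict, targets: dict, core: set,
                  core_weight: float = 0.6) -> float:
    core_scores = [gap_score(candidate[k], targets[k]) for k in core]
    sec_scores = [gap_score(candidate[k], targets[k]) for k in targets if k not in core]
    ncf = sum(core_scores) / len(core_scores)                    # core factor average
    nsf = sum(sec_scores) / len(sec_scores) if sec_scores else 0.0
    return core_weight * ncf + (1 - core_weight) * nsf           # weighted final score

# illustrative criteria and weights
targets = {"skill": 4, "experience": 3, "teamwork": 5}
print(profile_match({"skill": 3, "experience": 3, "teamwork": 4}, targets, core={"skill"}))
```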
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Quantitative measures computed at the whole-brain scale.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Descriptive results: Math GPA for summer school participants and non-participants.