73 datasets found
  1. Data for: "Linking Datasets on Organizations Using Half a Billion...

    • dataverse.harvard.edu
    Updated Jan 13, 2025
    Cite
    Connor Jerzak (2025). Data for: "Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records" [Dataset]. http://doi.org/10.7910/DVN/EHRQQL
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Connor Jerzak
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Abstract: Scholars studying organizations often work with multiple datasets lacking shared unique identifiers or covariates. In such situations, researchers usually use approximate string ("fuzzy") matching methods to combine datasets. String matching, although useful, faces fundamental challenges. Even when two strings appear similar to humans, fuzzy matching often does not work because it fails to adapt to the informativeness of the character combinations. In response, a number of machine-learning methods have been developed to refine string matching. Yet, the effectiveness of these methods is limited by the size and diversity of training data. This paper introduces data from a prominent employment networking site (LinkedIn) as a massive training corpus to address these limitations. We show how, by leveraging information from LinkedIn regarding organizational name-to-name links, we can improve upon existing matching benchmarks, incorporating the trillions of name pair examples from LinkedIn into various methods to improve performance by explicitly maximizing match probabilities inferred from the LinkedIn corpus. We also show how relationships between organization names can be modeled using a network representation of the LinkedIn data. In illustrative merging tasks involving lobbying firms, we document improvements when using the LinkedIn corpus in matching calibration and make all data and methods open source. Keywords: Record linkage; Interest groups; Text as data; Unstructured data
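    To make the abstract's failure mode concrete, here is a minimal sketch (not the authors' pipeline; the names are invented) of how a character-level fuzzy matcher can score a true alias pair below an unrelated pair:

```python
# Illustration only: plain character-overlap similarity ignores how
# informative a character combination is, so generic tokens such as
# "Holdings Inc" inflate similarity between distinct firms.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level ratio, as plain fuzzy matchers compute it."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("International Business Machines", "IBM"),    # true alias, little overlap
    ("Apex Holdings Inc", "Apogee Holdings Inc"),  # different firms, high overlap
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {similarity(a, b):.2f}")
# The alias pair scores far lower than the non-matching pair; a corpus of
# observed name-to-name links is one way to correct this.
```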

  2. Replication Data for: Matching Methods for Causal Inference with Time-Series...

    • dataverse.harvard.edu
    Updated Oct 13, 2021
    Cite
    Kosuke Imai; In Song Kim; Erik Wang (2021). Replication Data for: Matching Methods for Causal Inference with Time-Series Cross-Section Data [Dataset]. https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZTDHVE
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 13, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Kosuke Imai; In Song Kim; Erik Wang
    License

    Custom dataset license: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/ZTDHVE

    Description

    Matching methods improve the validity of causal inference by reducing model dependence and offering intuitive diagnostics. While they have become a part of the standard tool kit across disciplines, matching methods are rarely used when analyzing time-series cross-sectional data. We fill this methodological gap. In the proposed approach, we first match each treated observation with control observations from other units in the same time period that have an identical treatment history up to the pre-specified number of lags. We use standard matching and weighting methods to further refine this matched set so that the treated and matched control observations have similar covariate values. Assessing the quality of matches is done by examining covariate balance. Finally, we estimate both short-term and long-term average treatment effects using the difference-in-differences estimator, accounting for a time trend. We illustrate the proposed methodology through simulation and empirical studies. An open-source software package is available for implementing the proposed methods.
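    A minimal sketch of the first step described above, assuming a hypothetical long-format panel with columns unit, time, and treat (the refinement and difference-in-differences stages are omitted):

```python
# Sketch: find control observations in the same period whose treatment
# history matches the treated unit's over the previous `lags` periods.
import pandas as pd

def matched_controls(df: pd.DataFrame, unit, time, lags: int):
    hist = df.set_index(["unit", "time"])["treat"]
    window = range(time - lags, time)
    target = tuple(hist.get((unit, t), 0) for t in window)
    controls = []
    for u in df["unit"].unique():
        if u == unit:
            continue
        # A control must be untreated in the current period and share
        # the treated unit's lagged treatment history exactly.
        if hist.get((u, time), 1) == 0 and \
           tuple(hist.get((u, t), 0) for t in window) == target:
            controls.append(u)
    return controls

panel = pd.DataFrame({
    "unit":  ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "time":  [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "treat": [0, 0, 1, 0, 0, 0, 0, 1, 1],
})
print(matched_controls(panel, unit="A", time=3, lags=2))  # ['B']
```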

  3. Replication Data for: Leveraging Large Language Models for Fuzzy String...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Sep 24, 2024
    Cite
    Wang, Yu (2024). Replication Data for: Leveraging Large Language Models for Fuzzy String Matching in Political Science [Dataset]. https://search.dataone.org/view/sha256%3Aab5973eb91ac4a62bc4d7b721a7d7e54f3ae1877e4cc917d10307d74a45a97fe
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Wang, Yu
    Description

    Fuzzy string matching remains a key issue when political scientists combine data from different sources. Existing matching methods invariably rely on string distances, such as Levenshtein distance and cosine similarity. As such, they are inherently incapable of matching strings that refer to the same entity with different names, such as "JP Morgan" and "Chase Bank", "DPRK" and "North Korea", "Chuck Fleischmann (R)" and "Charles Fleischmann (R)". In this letter, we propose to use large language models to entirely sidestep this problem in an easy and intuitive manner. Extensive experiments show that our proposed methods can improve the state of the art by as much as 39% in terms of average precision while being substantially easier and more intuitive to use by political scientists. Moreover, our results are robust across various temperatures. We further note that enhanced prompting can lead to additional performance improvements.
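    Schematically, the letter's idea replaces a string-distance threshold with a direct question to a model. In the sketch below, ask_llm is a hypothetical stand-in for whatever completion client is available; only the prompt structure is the point:

```python
# Sketch of LLM-based matching: the decision is delegated to a language
# model instead of a string distance. `ask_llm` is a placeholder stub.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire up an LLM client here")

def llm_match(name_a: str, name_b: str) -> bool:
    prompt = (
        "Do the following two names refer to the same entity? "
        f"Answer Yes or No.\nName 1: {name_a}\nName 2: {name_b}"
    )
    return ask_llm(prompt).strip().lower().startswith("yes")

# e.g. llm_match("DPRK", "North Korea") should come back True even though
# the two strings share almost no characters.
```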

  4. Web Data Commons Phones Dataset, Augmented Version, Fixed Splits

    • linkagelibrary.icpsr.umich.edu
    delimited
    Updated Nov 23, 2020
    Cite
    Anna Primpeli; Christian Bizer (2020). Web Data Commons Phones Dataset, Augmented Version, Fixed Splits [Dataset]. http://doi.org/10.3886/E127243V1
    Explore at:
    Available download formats: delimited
    Dataset updated
    Nov 23, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Motivation: Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets including both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance the reproducibility and comparability of the experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits. Dataset Description: An augmented version of the WDC phones dataset for benchmarking entity matching/record linkage methods, found at: http://webdatacommons.org/productcorpus/index.html#toc4 The augmented version adds fixed splits for training, validation and testing as well as their corresponding feature vectors. The feature vectors are built using data-type-specific similarity metrics. The dataset contains 447 records describing products deriving from 17 e-shops, which are matched against a product catalog of 50 products. The gold standards have manual annotations for 258 matching and 22,092 non-matching pairs. The total number of attributes used to describe the product records is 26, while the attribute density is 0.25. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
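    As an illustration of such data-type-specific feature vectors (attribute names here are hypothetical, not the dataset's actual schema):

```python
# Toy pairwise feature vector: one similarity score per attribute, with the
# metric chosen according to the attribute's data type.
from difflib import SequenceMatcher

def string_sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio() if a and b else 0.0

def numeric_sim(a: float, b: float) -> float:
    return 1.0 - abs(a - b) / max(abs(a), abs(b), 1e-9)

def feature_vector(rec1: dict, rec2: dict) -> list:
    return [
        string_sim(rec1["brand"], rec2["brand"]),   # string attribute
        string_sim(rec1["title"], rec2["title"]),   # string attribute
        numeric_sim(rec1["price"], rec2["price"]),  # numeric attribute
    ]

offer   = {"brand": "Nokia", "title": "Nokia 3310 blue", "price": 59.0}
catalog = {"brand": "Nokia", "title": "3310",            "price": 60.0}
print(feature_vector(offer, catalog))
```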

  5. Ground-roll separation using intelligence based-matching method

    • narcis.nl
    • data.mendeley.com
    Updated Feb 27, 2020
    Cite
    Li, J (via Mendeley Data) (2020). Ground-roll separation using intelligence based-matching method [Dataset]. http://doi.org/10.17632/xg237bzyxb.1
    Explore at:
    Dataset updated
    Feb 27, 2020
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Li, J (via Mendeley Data)
    Description

    Separation is achieved by intelligence-based matching of the curvelet coefficients.

  6. Data from: Highly Scalable Matching Pursuit Signal Decomposition Algorithm

    • datasets.ai
    • s.cnmilf.com
    • +3 more
    33
    Updated Sep 18, 2024
    Cite
    National Aeronautics and Space Administration (2024). Highly Scalable Matching Pursuit Signal Decomposition Algorithm [Dataset]. https://datasets.ai/datasets/highly-scalable-matching-pursuit-signal-decomposition-algorithm
    Explore at:
    Available download formats: 33
    Dataset updated
    Sep 18, 2024
    Dataset authored and provided by
    National Aeronautics and Space Administration
    Description

    In this research, we propose a variant of the classical Matching Pursuit Decomposition (MPD) algorithm with significantly improved scalability and computational performance. MPD is a powerful iterative algorithm that decomposes a signal into linear combinations of its dictionary elements or “atoms”. A best-fit atom from an arbitrarily defined dictionary is determined through cross-correlation. The selected atom is subtracted from the signal, and this procedure is repeated on the residual in subsequent iterations until a stopping criterion is met.

    A sufficiently large dictionary is required for an accurate reconstruction; this in turn increases the computational burden of the algorithm, thus limiting its applicability and level of adoption. Our main contribution lies in improving the computational efficiency of the algorithm to allow faster decomposition while maintaining a similar level of accuracy. The Correlation Thresholding and Multiple Atom Extraction techniques were proposed to decrease the computational burden of the algorithm. Correlation thresholds prune insignificant atoms from the dictionary. The ability to extract multiple atoms within a single iteration enhances the effectiveness and efficiency of each iteration. The proposed algorithm, entitled MPD++, was demonstrated using a real-world data set.
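    A compact NumPy sketch of the classical MPD loop described above, with correlation thresholding shown as a simple pruning rule (the full MPD++ optimizations are not reproduced here):

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_iter=10, corr_threshold=0.0):
    """Greedily decompose `signal` over unit-norm dictionary columns."""
    residual = signal.astype(float).copy()
    atoms = []
    for _ in range(n_iter):
        corr = dictionary.T @ residual           # cross-correlate with atoms
        k = int(np.argmax(np.abs(corr)))         # best-fit atom
        if np.abs(corr[k]) < corr_threshold:     # stopping criterion / pruning
            break
        atoms.append((k, float(corr[k])))
        residual -= corr[k] * dictionary[:, k]   # subtract and iterate
    return atoms, residual

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0)                   # unit-norm atoms
x = 3.0 * D[:, 5] - 2.0 * D[:, 17]
atoms, r = matching_pursuit(x, D, n_iter=5)
print(atoms, float(np.linalg.norm(r)))
```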

  7. Data from: Toponym matching through deep neural networks

    • tandf.figshare.com
    • figshare.com
    txt
    Updated Jun 3, 2023
    Cite
    Rui Santos; Patricia Murrieta-Flores; Pável Calado; Bruno Martins (2023). Toponym matching through deep neural networks [Dataset]. http://doi.org/10.6084/m9.figshare.5554192
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Rui Santos; Patricia Murrieta-Flores; Pável Calado; Bruno Martins
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Toponym matching, i.e. pairing strings that represent the same real-world location, is a fundamental problem for several practical applications. The current state of the art relies on string similarity metrics, either specifically developed for matching place names or integrated within methods that combine multiple metrics. However, these methods all rely on common sub-strings in order to establish similarity, and they do not effectively capture the character replacements involved in toponym changes due to transliterations or to changes in language and culture over time. In this article, we present a novel matching approach, leveraging a deep neural network to classify pairs of toponyms as either matching or non-matching. The proposed network architecture uses recurrent nodes to build representations from the sequences of bytes that correspond to the strings that are to be matched. These representations are then combined and passed to feed-forward nodes, finally leading to a classification decision. We present the results of a wide-ranging evaluation of the performance of the proposed method, using a large dataset collected from the GeoNames gazetteer. These results show that the proposed method can significantly outperform individual similarity metrics from previous studies, as well as previous methods based on supervised machine learning for combining multiple metrics.
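    A schematic PyTorch sketch of such an architecture, with illustrative sizes rather than the paper's hyperparameters:

```python
# Sketch: a recurrent encoder reads each toponym as a byte sequence; the two
# representations are combined and fed to feed-forward layers for a decision.
import torch
import torch.nn as nn

class ToponymMatcher(nn.Module):
    def __init__(self, emb: int = 32, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(256, emb)        # one entry per byte value
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def encode(self, name: str) -> torch.Tensor:
        ids = torch.tensor([list(name.encode("utf-8"))])
        _, h = self.encoder(self.embed(ids))
        return h[-1]                               # final hidden state

    def forward(self, a: str, b: str) -> torch.Tensor:
        pair = torch.cat([self.encode(a), self.encode(b)], dim=-1)
        return torch.sigmoid(self.classifier(pair))  # match probability

print(ToponymMatcher()("Lisboa", "Lisbon"))  # untrained, so near 0.5
```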

  8. Replication data for: Multivariate Matching Methods That are Monotonic...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Iacus, Stefano M.; King, Gary; Porro, Giuseppe (2023). Replication data for: Multivariate Matching Methods That are Monotonic Imbalance Bounding [Dataset]. http://doi.org/10.7910/DVN/OMHQFP
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Iacus, Stefano M.; King, Gary; Porro, Giuseppe
    Description

    We introduce a new "Monotonic Imbalance Bounding" (MIB) class of matching methods for causal inference with a surprisingly large number of attractive statistical properties. MIB generalizes and extends in several new directions the only existing class, "Equal Percent Bias Reducing" (EPBR), which is designed to satisfy weaker properties, and only in expectation. We also offer strategies to obtain specific members of the MIB class, and analyze in more detail one member of this class, Coarsened Exact Matching, whose properties we examine from this new perspective. We offer a variety of analytical results and numerical simulations that demonstrate how members of the MIB class can dramatically improve inferences relative to EPBR-based matching methods. See also: Causal Inference
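    For intuition, a minimal sketch of Coarsened Exact Matching, the MIB member analyzed above; the bin widths here are arbitrary illustrations, not recommended settings:

```python
# CEM in two lines of logic: coarsen each covariate into bins, then
# exact-match records on the tuple of bin indices.
import math
from collections import defaultdict

def cem_strata(records, bin_widths):
    strata = defaultdict(list)
    for rec in records:
        key = tuple(math.floor(rec[c] / w) for c, w in bin_widths.items())
        strata[key].append(rec)
    return strata

people = [
    {"id": 1, "treated": 1, "age": 34, "income": 52_000},
    {"id": 2, "treated": 0, "age": 36, "income": 54_000},
    {"id": 3, "treated": 0, "age": 61, "income": 90_000},
]
# Keep only strata containing both treated and control units.
for key, group in cem_strata(people, {"age": 10, "income": 10_000}).items():
    if {r["treated"] for r in group} == {0, 1}:
        print(key, [r["id"] for r in group])   # -> (3, 5) [1, 2]
```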

  9. Data from: Product Datasets from the MWPD2020 Challenge at the ISWC2020...

    • linkagelibrary.icpsr.umich.edu
    • da-ra.de
    Updated Nov 26, 2020
    Cite
    Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Product Datasets from the MWPD2020 Challenge at the ISWC2020 Conference (Task 1) [Dataset]. http://doi.org/10.3886/E127482V1
    Explore at:
    Dataset updated
    Nov 26, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Ralph Peeters; Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) from the product category computers. The data is available in the form of training, validation and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets, as it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g. products without training data in the training set or products which have had typos introduced. These can be used to measure the performance of methods on these kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites marking up their offers with schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.

  10. WDC LSPM Dataset

    • paperswithcode.com
    Updated May 31, 2022
    Cite
    WDC LSPM Dataset [Dataset]. https://paperswithcode.com/dataset/wdc-products
    Explore at:
    Dataset updated
    May 31, 2022
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes.

    In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, there are sets of IDs for each training set for a possible validation split (stratified random draw) available. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web via weak supervision.

    The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.

  11. Random Recursive Partitioning: a matching method for the estimation of the...

    • journaldata.zbw.eu
    • jda-test.zbw.eu
    .rda, csv, txt, zip
    Updated Dec 8, 2022
    + more versions
    Cite
    Giuseppe Porro; Stefano Maria Iacus; Giuseppe Porro; Stefano Maria Iacus (2022). Random Recursive Partitioning: a matching method for the estimation of the average treatment effect (replication data) [Dataset]. http://doi.org/10.15456/jae.2022319.1304251755
    Explore at:
    Available download formats: csv(13692), .rda(118659), zip(18375), txt(3478), csv(40569), csv(166644), csv(169710), csv(21498), csv(177445)
    Dataset updated
    Dec 8, 2022
    Dataset provided by
    ZBW - Leibniz Informationszentrum Wirtschaft
    Authors
    Giuseppe Porro; Stefano Maria Iacus; Giuseppe Porro; Stefano Maria Iacus
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this paper we introduce the Random Recursive Partitioning (RRP) matching method. RRP generates a proximity matrix which can be useful in econometric applications such as average treatment effect estimation. RRP is a Monte Carlo method that randomly generates non-empty recursive partitions of the data and evaluates the proximity between two observations as the empirical frequency with which they fall in the same cell of these random partitions over all Monte Carlo replications. From the proximity matrix it is possible to derive both graphical and analytical tools to evaluate the extent of the common support between data sets. The RRP method is honest in that it does not match observations at any cost: if data sets are separated, the method clearly states it. The match obtained with RRP is invariant under monotonic transformation of the data. Average treatment effect estimators derived from the proximity matrix seem to be competitive compared to more commonly used estimators. The RRP method does not require a particular structure of the data, and for this reason it can be applied when distances like Mahalanobis or Euclidean are not suitable, in the presence of missing data, or when the estimated propensity score is too sensitive to model specifications.
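    An illustrative sketch of the RRP idea, with the partition law and stopping rule simplified:

```python
# Sketch: draw random recursive partitions of the covariate space and record
# how often each pair of observations lands in the same terminal cell.
import numpy as np

def rrp_proximity(X, n_rep=200, min_leaf=2, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(X)
    prox = np.zeros((n, n))
    for _ in range(n_rep):
        cells, leaves = [np.arange(n)], []
        while cells:
            idx = cells.pop()
            if len(idx) <= min_leaf:              # cell small enough: keep it
                leaves.append(idx)
                continue
            j = int(rng.integers(X.shape[1]))     # random covariate ...
            cut = rng.uniform(X[idx, j].min(), X[idx, j].max())  # ... random cut
            left, right = idx[X[idx, j] <= cut], idx[X[idx, j] > cut]
            if len(left) and len(right):
                cells += [left, right]
            else:                                 # degenerate cut: keep cell
                leaves.append(idx)
        for idx in leaves:
            prox[np.ix_(idx, idx)] += 1.0
    return prox / n_rep                           # empirical co-occurrence

X = np.random.default_rng(1).normal(size=(8, 2))
print(rrp_proximity(X).round(2))
```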

  12. Introduction and Demonstration of the Many-Group Matching (MAGMA)-Algorithm:...

    • osf.io
    url
    Updated Dec 13, 2022
    Cite
    Markus D. Feuchter; Julian Urban; Vsevolod Scherrer; Moritz Breit; Franzis Preckel (2022). Introduction and Demonstration of the Many-Group Matching (MAGMA)-Algorithm: Matching Solutions for Two or More Groups [Dataset]. http://doi.org/10.17605/OSF.IO/AEDXB
    Explore at:
    Available download formats: url
    Dataset updated
    Dec 13, 2022
    Dataset provided by
    Center For Open Science
    Authors
    Markus D. Feuchter; Julian Urban; Vsevolod Scherrer; Moritz Breit; Franzis Preckel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Field data is often limited regarding causal inference. This is partly because randomization techniques are often impractical or unethical within certain fields (e.g., randomly assigning individuals to different types of classroom instruction in educational settings). Matching procedures, like propensity score matching (PSM; Rosenbaum & Rubin, 1983), are regularly used to strengthen interpretations of group-membership effects in field research. By matching individuals from different subgroups of a field sample (e.g., participation vs. nonparticipation in a special education program), relevant confounds of group-membership effects (e.g., socio-economic status) can be balanced out and thereby eliminated retrospectively. That way, matching turns field data into quasi-experimental data. Currently, the most prominent approach to matching individuals is nearest neighbor matching (NNM) (see Austin, 2014; Austin & Stuart, 2015; Heinz et al., 2022; Jacovidis, 2017). Available statistical software (e.g., R packages like MatchIt; Ho et al., 2011), however, does not fully realize the potential of NNM to reduce sample-related bias in field data, due to unsystematic procedures for the identification of apt pairs to match. Furthermore, existing matching applications are limited to two-group designs (that being said, weighting applications for more than two groups do exist, e.g., MMW-S; Hong, 2012). In addition, balance estimation, as a matching quality check, is often conducted rudimentarily (e.g., by solely reporting between-group post-matching differences). So far, conventions on balance estimation for more than two groups are absent. To address these shortcomings, we developed a systematic algorithm, designed for matching individuals from two or more groups, alongside a set of adequate balance estimates. We call it “MAGMA” (for MAny-Group MAtching). In this work, we demonstrate and evaluate the MAGMA algorithm using two empirical examples from extensive field data.

  13. Data from: Automated Linking of Historical Data

    • linkagelibrary.icpsr.umich.edu
    Updated Aug 20, 2020
    Cite
    Ran Abramitzky; Leah Boustan; Katherine Eriksson; James Feigenbaum; Santiago Perez (2020). Automated Linking of Historical Data [Dataset]. http://doi.org/10.3886/E120703V1
    Explore at:
    Dataset updated
    Aug 20, 2020
    Authors
    Ran Abramitzky; Leah Boustan; Katherine Eriksson; James Feigenbaum; Santiago Perez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1850 - 1940
    Area covered
    United States
    Description

    Currently, the repository provides code for two such methods:

    The ABE fully automated approach: This approach is a fully automated method for linking historical datasets (e.g. complete-count Censuses) by first name, last name and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chances of false positives, our approach suggests testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact.

    A fully automated probabilistic approach (EM): This approach (Abramitzky, Mill, and Perez 2019) suggests a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between any two potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
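    A condensed sketch of the ABE-style linking rule, assuming the third-party jellyfish package for NYSIIS codes and Jaro-Winkler similarity (the repository's actual code covers many more robustness variants):

```python
import jellyfish  # provides nysiis() and jaro_winkler_similarity()

def candidates(rec, other_records, age_band=0):
    """Records agreeing on NYSIIS-standardized names, with age within a band."""
    key = (jellyfish.nysiis(rec["first"]), jellyfish.nysiis(rec["last"]))
    return [
        o for o in other_records
        if (jellyfish.nysiis(o["first"]), jellyfish.nysiis(o["last"])) == key
        and abs(o["age"] - rec["age"]) <= age_band
    ]

def link(rec, other_records):
    """Accept only a unique match, guarding against false positives."""
    found = candidates(rec, other_records)
    return found[0] if len(found) == 1 else None

def similar_names(a: str, b: str, cutoff: float = 0.9) -> bool:
    """Alternative robustness check on raw names via Jaro-Winkler."""
    return jellyfish.jaro_winkler_similarity(a, b) >= cutoff
```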

  14. Replication Data for: The Balance-Sample Size Frontier in Matching Methods...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    King, Gary, Christopher Lucas, and Richard Nielsen (2023). Replication Data for: The Balance-Sample Size Frontier in Matching Methods for Causal Inference [Dataset]. http://doi.org/10.7910/DVN/TRTXLP
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    King, Gary, Christopher Lucas, and Richard Nielsen
    Description

    Replication Data for: The Balance-Sample Size Frontier in Matching Methods for Causal Inference

  15. Data from: Matching seed to site by climate similarity: Techniques to...

    • datadryad.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jan 10, 2017
    Cite
    Kyle D. Doherty; Bradley J. Butterfield; Troy E. Wood (2017). Matching seed to site by climate similarity: Techniques to prioritize plant materials development and use in restoration [Dataset]. http://doi.org/10.5061/dryad.43bv0
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 10, 2017
    Dataset provided by
    Dryad
    Authors
    Kyle D. Doherty; Bradley J. Butterfield; Troy E. Wood
    Time period covered
    2017
    Area covered
    global, western North America
    Description

    Climate Similarity Support Scripts: This compressed file (Climate Similarity Scripts.zip) contains scripts and supporting data necessary to reproduce the analyses and products associated with Doherty et al. (2017), “Matching seed to site by climate similarity: Techniques to prioritize plant materials development and use in restoration.”

  16. Data from: Global matching of point clouds for scan registration and loop...

    • data.europa.eu
    zip
    Updated Oct 29, 2019
    Cite
    Joint Research Centre (2019). Global matching of point clouds for scan registration and loop detection [Dataset]. https://data.europa.eu/data/datasets/8c002004-8cd1-4998-89e1-8875a386743a?locale=pl
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 29, 2019
    Dataset authored and provided by
    Joint Research Centre
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    We present a robust Global Matching technique focused on 3D mapping applications using laser range-finders. Our approach works under the assumption that places can be recognized by analyzing the projection of the observed points along the gravity direction. Relative poses between pairs of 3D point clouds are estimated by aligning their 2D projective representations and benefiting from the corresponding dimensional reduction. We present the complete processing pipeline for two different applications that use the global matcher as a core component: First, the global matcher is used for the registration of static scan sets where no a priori information about the relative poses is available. It is combined with an effective procedure for validating the matches that exploits the implicit empty-space information associated with single acquisitions. In the second use case, the global matcher is used for the loop detection required by 3D SLAM applications. We use an Extended Kalman Filter to obtain a belief over the map poses, which allows us to validate matches and to execute hierarchical overlap tests, reducing the number of potential matches to be evaluated. Additionally, the global matcher is combined with a fast local technique. In both use cases, the global reconstruction problem is modeled as a sparse graph, where scan poses (nodes) are connected through matches (edges). The graph structure allows formulating a sparse global optimization problem that optimizes scan poses, considering all accepted matches simultaneously. Our approach is being used in production systems and has been successfully evaluated on several real and publicly available datasets.
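    A toy sketch of the core projection idea (real pipelines additionally search over 2D rotations and translations before scoring):

```python
# Sketch: collapse each 3D cloud along the gravity (z) axis into a 2D
# occupancy grid, then score two scans by the overlap of their grids.
import numpy as np

def project_along_gravity(points, cell=0.5, size=32):
    grid = np.zeros((size, size))
    ij = np.floor(points[:, :2] / cell).astype(int) + size // 2
    ok = (ij >= 0).all(axis=1) & (ij < size).all(axis=1)
    grid[ij[ok, 0], ij[ok, 1]] = 1.0          # mark occupied cells
    return grid

def overlap_score(cloud_a, cloud_b):
    a, b = project_along_gravity(cloud_a), project_along_gravity(cloud_b)
    return (a * b).sum() / max(a.sum(), b.sum())

rng = np.random.default_rng(2)
scan = rng.normal(size=(500, 3)) * 3.0
print(overlap_score(scan, scan + [0.1, 0.1, 5.0]))  # height offset is ignored
```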

  17. Supplementary Tables E1-E3. Propensity score matching versus coarsened exact...

    • future-science-group.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    David Guy; Igor Karp; Piotr Wilk; Joseph Chin; George Rodrigues (2023). Supplementary Tables E1-E3. Propensity score matching versus coarsened exact matching in observational comparative effectiveness research [Dataset]. http://doi.org/10.25402/FSG.14710737.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Future Science Group
    Authors
    David Guy; Igor Karp; Piotr Wilk; Joseph Chin; George Rodrigues
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Supplementary materials – tables:
    Table E1a. Characteristics of PSM strategies for comparison one
    Table E1b. Characteristics of PSM strategies for comparison two
    Table E2a. Coarsening of covariates used in CEM for comparison one
    Table E2b. Coarsening of covariates used in CEM for comparison two
    Table E3a. Characteristics of CEM strategies for comparison one
    Table E3b. Characteristics of CEM strategies for comparison two

    Supplementary materials – figures:
    Figure E1a. Selection process for comparison one
    Figure E1b. Selection process for comparison two
    Figure E2a. Distribution of baseline covariates by PSM caliper width for comparison one
    Figure E2b. Distribution of baseline covariates by PSM caliper width for comparison two
    Figure E3a. Distribution of baseline covariates by CEM strategy for comparison one
    Figure E3b. Distribution of baseline covariates by CEM strategy for comparison two

    Abstract: Aims & Methods: We compared propensity score matching (PSM) and coarsened exact matching (CEM) in balancing baseline characteristics between treatment groups using observational data obtained from a pan-Canadian prostate cancer radiotherapy database. Changes in effect estimates were evaluated as a function of improvements in balance, using results from RCTs to guide interpretation. Results: CEM and PSM improved balance between groups in both comparisons, while retaining the majority of original data. Improvements in balance were associated with effect estimates closer to those obtained in RCTs. Conclusions: CEM and PSM led to substantial improvements in balance between comparison groups, while retaining a considerable proportion of original data. This could lead to improved accuracy in effect estimates obtained using observational data in a variety of clinical situations.

  18. Data from: Profile Matching in Solving Rank Problem

    • osf.io
    Updated Apr 3, 2023
    Cite
    Andysah Putera Utama Siahaan (2023). Profile Matching in Solving Rank Problem [Dataset]. http://doi.org/10.17605/OSF.IO/ND8EA
    Explore at:
    Dataset updated
    Apr 3, 2023
    Dataset provided by
    Center for Open Science (https://cos.io/)
    Authors
    Andysah Putera Utama Siahaan
    Description

    This research aims to support selection decisions. In the profile matching method, a parameter is assessed by the difference between the target value and the value held by an individual. There are two important kinds of parameters in this method: core factors and secondary factors. These values are converted into percentage weights so as to produce a final decision that determines which data come closest to the predetermined targets. With this method, sorting the data against specific criteria is performed dynamically.
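    A small sketch of profile-matching scoring; the gap-to-weight table and the 60/40 core/secondary split below are illustrative assumptions, not values from this dataset:

```python
# Sketch: map the gap between target and actual value to a weight, then
# combine core- and secondary-factor averages into a ranking score.
GAP_WEIGHT = {0: 5.0, 1: 4.5, -1: 4.0, 2: 3.5, -2: 3.0}  # illustrative table

def profile_score(actual, target, core, w_core=0.6):
    gaps = {k: GAP_WEIGHT.get(actual[k] - target[k], 1.0) for k in target}
    core_avg = sum(gaps[k] for k in core) / len(core)
    secondary = [k for k in target if k not in core]
    sec_avg = sum(gaps[k] for k in secondary) / len(secondary)
    return w_core * core_avg + (1 - w_core) * sec_avg

target    = {"skill": 4, "attitude": 3, "experience": 3}
candidate = {"skill": 3, "attitude": 3, "experience": 5}
print(profile_score(candidate, target, core=["skill", "attitude"]))  # 4.1
```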

  19. Quantitative measures computed at the whole brain scale.

    • figshare.com
    xls
    Updated Nov 9, 2023
    Cite
    Rohit Yadav; François-Xavier Dupé; Sylvain Takerkart; Guillaume Auzias (2023). Quantitative measures computed at the whole brain scale. [Dataset]. http://doi.org/10.1371/journal.pone.0293886.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Nov 9, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Rohit Yadav; François-Xavier Dupé; Sylvain Takerkart; Guillaume Auzias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Quantitative measures computed at the whole brain scale.

  20. Descriptive results Math GPA for summer school participants and...

    • plos.figshare.com
    xls
    Updated Apr 11, 2024
    + more versions
    Cite
    Mélanie Monfrance; Carla Haelermans; Trudie Schils (2024). Descriptive results Math GPA for summer school participants and non-participants. [Dataset]. http://doi.org/10.1371/journal.pone.0302060.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Apr 11, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mélanie Monfrance; Carla Haelermans; Trudie Schils
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Descriptive results Math GPA for summer school participants and non-participants.
