CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Abstract: Scholars studying organizations often work with multiple datasets that lack shared unique identifiers or covariates. In such situations, researchers usually combine datasets with approximate string ("fuzzy") matching methods. String matching, although useful, faces fundamental challenges: even when two strings appear similar to humans, fuzzy matching often fails because it does not adapt to the informativeness of particular character combinations. In response, a number of machine-learning methods have been developed to refine string matching, but their effectiveness is limited by the size and diversity of available training data. This paper introduces data from a prominent employment networking site (LinkedIn) as a massive training corpus that addresses these limitations. We show how information from LinkedIn about organizational name-to-name links can improve upon existing matching benchmarks: the trillions of name-pair examples from LinkedIn can be incorporated into various methods to improve performance by explicitly maximizing match probabilities inferred from the LinkedIn corpus. We also show how relationships between organization names can be modeled using a network representation of the LinkedIn data. In illustrative merging tasks involving lobbying firms, we document improvements when using the LinkedIn corpus in matching calibration, and we make all data and methods open source. Keywords: Record linkage; Interest groups; Text as data; Unstructured data
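To illustrate the calibration idea described above, the sketch below estimates empirical match rates as a function of a generic string-similarity score using labeled name pairs. The `labeled_pairs` list is a hypothetical stand-in for name-to-name links such as those derived from the LinkedIn corpus; the similarity function and the binning rule are illustrative assumptions, not the authors' method.

```python
from difflib import SequenceMatcher
from collections import defaultdict

# `labeled_pairs` is a hypothetical stand-in for organizational name-to-name
# links such as those derived from the corpus described above.
labeled_pairs = [
    ("JPMorgan Chase & Co.", "JP Morgan", 1),
    ("JPMorgan Chase & Co.", "Morgan Stanley", 0),
    # ... many more (name_a, name_b, same_organization) examples
]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Estimate the empirical match rate within similarity bins; a matching rule
# can then pick the lowest bin that clears a desired precision target.
bins = defaultdict(lambda: [0, 0])  # bin -> [matches, total]
for a, b, y in labeled_pairs:
    k = round(similarity(a, b), 1)
    bins[k][0] += y
    bins[k][1] += 1

for k in sorted(bins):
    m, n = bins[k]
    print(f"similarity ~{k:.1f}: empirical match rate {m / n:.2f} (n={n})")
```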
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/ZTDHVE
Matching methods improve the validity of causal inference by reducing model dependence and offering intuitive diagnostics. While they have become a part of the standard tool kit across disciplines, matching methods are rarely used when analyzing time-series cross-sectional data. We fill this methodological gap. In the proposed approach, we first match each treated observation with control observations from other units in the same time period that have an identical treatment history up to the pre-specified number of lags. We use standard matching and weighting methods to further refine this matched set so that the treated and matched control observations have similar covariate values. Assessing the quality of matches is done by examining covariate balance. Finally, we estimate both short-term and long-term average treatment effects using the difference-in-differences estimator, accounting for a time trend. We illustrate the proposed methodology through simulation and empirical studies. An open-source software package is available for implementing the proposed methods.
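As a rough illustration of the first step described above, the sketch below collects, for each treated unit-period, the control units observed in the same period whose treatment history over the previous lags is identical. It assumes a hypothetical long-format pandas DataFrame with columns "unit", "time", and a binary "treat" indicator; covariate refinement and the difference-in-differences step are omitted.

```python
import pandas as pd

def matched_sets(df: pd.DataFrame, lags: int = 3) -> dict:
    """Map (treated unit, period) -> control units with the same treatment history."""
    wide = df.pivot(index="unit", columns="time", values="treat")
    sets = {}
    for t in sorted(wide.columns)[lags:]:
        prev = sorted(c for c in wide.columns if c < t)[-lags:]
        treated = wide.index[wide[t] == 1]
        controls = wide.index[wide[t] == 0]
        for u in treated:
            history = wide.loc[u, prev]
            # controls in the same period with an identical lagged treatment history
            sets[(u, t)] = [c for c in controls if wide.loc[c, prev].equals(history)]
    return sets
```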
Fuzzy string matching remains a key issue when political scientists combine data from different sources. Existing matching methods invariably rely on string distances, such as Levenshtein distance and cosine similarity. As such, they are inherently incapable of matching strings that refer to the same entity under different names, such as "JP Morgan" and "Chase Bank", "DPRK" and "North Korea", or "Chuck Fleischmann (R)" and "Charles Fleischmann (R)". In this letter, we propose using large language models to sidestep this problem entirely in an easy and intuitive manner. Extensive experiments show that our proposed methods can improve the state of the art by as much as 39% in terms of average precision while being substantially easier and more intuitive for political scientists to use. Moreover, our results are robust across various temperature settings. We further note that enhanced prompting can lead to additional performance improvements.
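The sketch below shows one way such an LLM-based matcher could be set up for pairwise queries; the prompt wording and the model name are illustrative assumptions rather than the authors' exact configuration.

```python
# A minimal sketch of pairwise entity matching with a chat-based LLM.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def same_entity(name_a: str, name_b: str, model: str = "gpt-4o-mini") -> bool:
    prompt = (
        "Do the following two names refer to the same real-world entity? "
        f'Answer only "yes" or "no".\nName 1: {name_a}\nName 2: {name_b}'
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # the letter reports robustness across temperatures
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

print(same_entity("DPRK", "North Korea"))
```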
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Motivation: Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets that include both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance reproducibility and comparability, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits. Dataset description: An augmented version of the WDC phones dataset for benchmarking entity matching/record linkage methods, found at: http://webdatacommons.org/productcorpus/index.html#toc4. The augmented version adds fixed splits for training, validation and testing as well as their corresponding feature vectors. The feature vectors are built using data-type-specific similarity metrics. The dataset contains 447 records describing products from 17 e-shops, which are matched against a product catalog of 50 products. The gold standard has manual annotations for 258 matching and 22,092 non-matching pairs. The total number of attributes used to describe the product records is 26, and the attribute density is 0.25. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
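For context, the sketch below shows how a similarity-based feature vector for a single record pair can be assembled from data-type-specific metrics; the attribute names and the particular metrics are illustrative and not necessarily those used to build this benchmark.

```python
# Build one feature vector for a record pair using type-specific similarities.
def token_jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def numeric_sim(a: float, b: float) -> float:
    return 1.0 - abs(a - b) / max(abs(a), abs(b), 1e-9)

offer = {"title": "Nokia 3310 dual sim", "price": 59.0}      # illustrative records
catalog = {"title": "Nokia 3310 (2017) Dual SIM", "price": 61.5}

features = [
    token_jaccard(offer["title"], catalog["title"]),   # string attribute
    numeric_sim(offer["price"], catalog["price"]),     # numeric attribute
]
print(features)  # feed into any binary classifier for match / non-match
```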
Separation is achieved by intelligence-based matching of the curvelet coefficients.
In this research, we propose a variant of the classical Matching Pursuit Decomposition (MPD) algorithm with significantly improved scalability and computational performance. MPD is a powerful iterative algorithm that decomposes a signal into linear combinations of its dictionary elements, or “atoms”. A best-fit atom from an arbitrarily defined dictionary is determined through cross-correlation. The selected atom is subtracted from the signal, and this procedure is repeated on the residual in subsequent iterations until a stopping criterion is met.
A sufficiently large dictionary is required for an accurate reconstruction; this in turn increases the computational burden of the algorithm, limiting its applicability and level of adoption. Our main contribution lies in improving the computational efficiency of the algorithm to allow faster decomposition while maintaining a similar level of accuracy. We propose Correlation Thresholding and Multiple Atom Extraction techniques to decrease the computational burden: correlation thresholds prune insignificant atoms from the dictionary, and extracting multiple atoms within a single iteration enhances the effectiveness and efficiency of each iteration. The proposed algorithm, entitled MPD++, was demonstrated on a real-world data set.
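A minimal numpy sketch of the two speedups named above is given below: a correlation threshold prunes weak atoms, and several atoms are extracted per iteration. Normalization, stopping rules and parameter values are illustrative assumptions, not the MPD++ implementation itself.

```python
import numpy as np

def mpd_plus(x, D, corr_thresh=0.1, atoms_per_iter=3, max_iter=50, tol=1e-6):
    """x: signal of shape (n,); D: dictionary of shape (n, k) with unit-norm columns."""
    residual = x.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(max_iter):
        if np.linalg.norm(residual) < tol:
            break
        corr = D.T @ residual                      # cross-correlation with every atom
        strong = np.where(np.abs(corr) >= corr_thresh)[0]   # correlation thresholding
        if strong.size == 0:
            break
        # extract the few strongest atoms this iteration; selection uses the
        # correlations computed at the start of the iteration (a greedy shortcut)
        picked = strong[np.argsort(-np.abs(corr[strong]))[:atoms_per_iter]]
        for j in picked:
            c = D[:, j] @ residual                 # refresh before subtracting
            coeffs[j] += c
            residual = residual - c * D[:, j]
    return coeffs, residual
```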
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Toponym matching, i.e. pairing strings that represent the same real-world location, is a fundamental problem for several practical applications. The current state of the art relies on string similarity metrics, either specifically developed for matching place names or integrated within methods that combine multiple metrics. However, these methods all rely on common sub-strings in order to establish similarity, and they do not effectively capture the character replacements involved in toponym changes due to transliterations or to changes in language and culture over time. In this article, we present a novel matching approach, leveraging a deep neural network to classify pairs of toponyms as either matching or non-matching. The proposed network architecture uses recurrent nodes to build representations from the sequences of bytes that correspond to the strings that are to be matched. These representations are then combined and passed to feed-forward nodes, finally leading to a classification decision. We present the results of a wide-ranging evaluation of the performance of the proposed method, using a large dataset collected from the GeoNames gazetteer. These results show that the proposed method can significantly outperform individual similarity metrics from previous studies, as well as previous methods based on supervised machine learning for combining multiple metrics.
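A minimal PyTorch sketch of this kind of architecture is shown below: byte-level recurrent encoders for each toponym, whose representations are concatenated and passed to a feed-forward classifier. Layer sizes, sequence length and the padding scheme are illustrative assumptions, not the exact network from the article.

```python
import torch
import torch.nn as nn

class ToponymMatcher(nn.Module):
    def __init__(self, emb_dim=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(256, emb_dim)          # one embedding per byte value
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def encode(self, byte_seq):                          # (batch, seq_len) int64
        _, h = self.encoder(self.embed(byte_seq))
        return h[-1]                                     # final hidden state

    def forward(self, a, b):
        return self.classifier(torch.cat([self.encode(a), self.encode(b)], dim=1))

def to_bytes(name, max_len=40):
    raw = name.encode("utf-8")[:max_len].ljust(max_len, b"\0")
    return torch.tensor(list(raw), dtype=torch.long).unsqueeze(0)

model = ToponymMatcher()
logit = model(to_bytes("Lisboa"), to_bytes("Lisbon"))    # train with BCEWithLogitsLoss
```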
We introduce a new "Monotonic Imbalance Bounding" (MIB) class of matching methods for causal inference with a surprisingly large number of attractive statistical properties. MIB generalizes and extends in several new directions the only existing class, "Equal Percent Bias Reducing" (EPBR), which is designed to satisfy weaker properties and only in expectation. We also offer strategies to obtain specific members of the MIB class and analyze in more detail one member, called Coarsened Exact Matching, whose properties we examine from this new perspective. We offer a variety of analytical results and numerical simulations that demonstrate how members of the MIB class can dramatically improve inferences relative to EPBR-based matching methods. See also: Causal Inference
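Since Coarsened Exact Matching is analyzed above, the sketch below illustrates its basic mechanics: coarsen numeric covariates into bins, form strata from the coarsened signature, and keep only strata containing both treated and control units. The DataFrame layout, the column name "treated" and the bin counts are illustrative assumptions.

```python
import pandas as pd

def cem_match(df: pd.DataFrame, covariates: dict) -> pd.DataFrame:
    """covariates maps column name -> number of coarsening bins."""
    work = df.copy()
    coarsened = pd.DataFrame({
        col: pd.cut(work[col], bins=k, labels=False) for col, k in covariates.items()
    })
    work["_stratum"] = coarsened.astype(str).agg("-".join, axis=1)
    # keep only strata that contain both treated and control observations
    counts = work.groupby("_stratum")["treated"].nunique()
    keep = counts[counts == 2].index
    return work[work["_stratum"].isin(keep)].drop(columns="_stratum")
```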
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") from the product category computers. The data is available in the form of training, validation and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets because it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g. products without training data in the training set or products which have had typos introduced. These can be used to measure the performance of methods on these kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites marking up their offers with schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes.
In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs for a possible validation split (stratified random draw) are available for each training set. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived from shared product identifiers on the Web via weak supervision.
The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
In this paper we introduce the Random Recursive Partitioning (RRP) matching method. RRP generates a proximity matrix which may be useful in econometric applications such as average treatment effect estimation. RRP is a Monte Carlo method that randomly generates non-empty recursive partitions of the data and evaluates the proximity between two observations as the empirical frequency with which they fall in the same cell of these random partitions over all Monte Carlo replications. From the proximity matrix it is possible to derive both graphical and analytical tools to evaluate the extent of the common support between data sets. The RRP method is honest in that it does not match observations at any cost: if data sets are separated, the method clearly states it. The match obtained with RRP is invariant under monotonic transformations of the data. Average treatment effect estimators derived from the proximity matrix seem to be competitive compared with more commonly used estimators. The RRP method does not require a particular structure of the data, and for this reason it can be applied when distances like Mahalanobis or Euclidean are not suitable, in the presence of missing data, or when the estimated propensity score is too sensitive to model specification.
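The sketch below illustrates the proximity construction described above: repeatedly draw random recursive partitions of the data and record how often each pair of observations falls in the same terminal cell. The splitting rule (random covariate, random quantile cut, minimum cell size) is an illustrative assumption rather than the exact RRP procedure.

```python
import numpy as np

def rrp_proximity(X, n_rep=200, min_cell=5, rng=None):
    """X: (n, k) data matrix. Returns an (n, n) proximity matrix in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = X.shape[0]
    prox = np.zeros((n, n))

    def split(idx):
        if idx.size <= min_cell:
            return [idx]
        j = rng.integers(X.shape[1])                          # random covariate
        cut = np.quantile(X[idx, j], rng.uniform(0.2, 0.8))   # random cut point
        left, right = idx[X[idx, j] <= cut], idx[X[idx, j] > cut]
        if left.size == 0 or right.size == 0:                 # keep partitions non-empty
            return [idx]
        return split(left) + split(right)

    for _ in range(n_rep):
        for cell in split(np.arange(n)):
            prox[np.ix_(cell, cell)] += 1                     # same-cell co-occurrence
    return prox / n_rep
```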
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Field data is often limited with regard to causal inference. This is partly because randomization techniques are often impractical or unethical within certain fields (e.g., randomly assigning individuals to different types of classroom instruction in educational settings). Matching procedures, like propensity score matching (PSM; Rosenbaum & Rubin, 1983), are regularly used to strengthen interpretations of group-membership effects in field research. By matching individuals from different subgroups of a field sample (e.g., participation vs. nonparticipation in a special education program), relevant confounds of group-membership effects (e.g., socio-economic status) can be balanced out and thereby eliminated retrospectively. In this way, matching turns field data into quasi-experimental data. Currently, the most prominent approach to matching individuals is nearest neighbor matching (NNM) (see Austin, 2014; Austin & Stuart, 2015; Heinz et al., 2022; Jacovidis, 2017). Available statistical software (e.g., R packages like MatchIt; Ho et al., 2011), however, does not fully realize the potential of NNM to reduce sample-related bias in field data, due to unsystematic procedures for identifying apt pairs to match. Furthermore, existing matching applications are limited to two-group designs (that said, weighting applications for more than two groups do exist, e.g., MMW-S; Hong, 2012). In addition, balance estimation, as a matching quality check, is often conducted only rudimentarily (e.g., by solely reporting between-group post-matching differences). So far, conventions on balance estimation for more than two groups are absent. To address these shortcomings, we developed a systematic algorithm designed for matching individuals from two or more groups, alongside a set of adequate balance estimates. We call it “MAGMA” (for MAny-Group MAtching). In this work, we demonstrate and evaluate the MAGMA algorithm using two empirical examples from extensive field data.
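For orientation, the sketch below shows a bare-bones greedy nearest-neighbor matching across groups on a one-dimensional matching score, together with a standardized-mean-difference balance check. It illustrates the general idea only and is not the MAGMA algorithm; the score arrays and group labels are hypothetical.

```python
import numpy as np

def greedy_nn_match(scores: dict) -> list:
    """scores: group label -> 1-D numpy array of matching scores (e.g., propensity)."""
    labels = list(scores)
    ref = labels[0]                                   # first group sets the pace
    used = {g: set() for g in labels[1:]}
    matched = []
    for i, s in enumerate(scores[ref]):
        row = {ref: i}
        for g in labels[1:]:
            free = [j for j in range(len(scores[g])) if j not in used[g]]
            if not free:
                return matched                        # a group ran out of candidates
            j = min(free, key=lambda j: abs(scores[g][j] - s))
            used[g].add(j)
            row[g] = j
        matched.append(row)
    return matched

def smd(a: np.ndarray, b: np.ndarray) -> float:
    """Standardized mean difference of one covariate between two matched groups."""
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled if pooled > 0 else 0.0
```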
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Currently, the repository provides code for two such methods. The ABE fully automated approach: this is a fully automated method for linking historical datasets (e.g., complete-count censuses) by first name, last name and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chance of false positives, our approach suggests testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact. A fully automated probabilistic approach (EM): this approach (Abramitzky, Mill, and Perez 2019) is a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between any two potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
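The sketch below illustrates the name-and-age linking idea using the jellyfish library for NYSIIS standardization and Jaro-Winkler similarity; the thresholds, the age band and the record layout are illustrative assumptions, and the repository's own code implements the full set of robustness checks (e.g., uniqueness within a five-year window).

```python
import jellyfish

def standardize(first: str, last: str) -> tuple:
    return jellyfish.nysiis(first), jellyfish.nysiis(last)

def candidate_links(records_a, records_b, max_age_diff=2, min_sim=0.9):
    """Each record: dict with 'first', 'last', 'age'. Returns candidate record pairs."""
    links = []
    for a in records_a:
        for b in records_b:
            if abs(a["age"] - b["age"]) > max_age_diff:
                continue                               # ages too far apart
            if standardize(a["first"], a["last"]) == standardize(b["first"], b["last"]):
                links.append((a, b))                   # match on standardized names
            elif (jellyfish.jaro_winkler_similarity(a["last"], b["last"]) >= min_sim
                  and jellyfish.jaro_winkler_similarity(a["first"], b["first"]) >= min_sim):
                links.append((a, b))                   # match on raw-name similarity
    return links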
Replication Data for: The Balance-Sample Size Frontier in Matching Methods for Causal Inference
Climate Similarity Support Scripts. This compressed file contains scripts and supporting data necessary to reproduce the analyses and products associated with Doherty et al. (2017), “Matching seed to site by climate similarity: Techniques to prioritize plant materials development and use in restoration.” File: Climate Similarity Scripts.zip
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
We present a robust Global Matching technique focused on 3D mapping applications using laser range-finders. Our approach works under the assumption that places can be recognized by analyzing the projection of the observed points along the gravity direction. Relative poses between pairs of 3D point clouds are estimated by aligning their 2D projective representations, benefiting from the corresponding dimensional reduction. We present the complete processing pipeline for two different applications that use the global matcher as a core component. First, the global matcher is used for the registration of static scan sets where no a priori information about the relative poses is available. It is combined with an effective procedure for validating the matches that exploits the implicit empty-space information associated with single acquisitions. In the second use case, the global matcher is used for the loop detection required by 3D SLAM applications. We use an Extended Kalman Filter to obtain a belief over the map poses, which allows us to validate matches and to execute hierarchical overlap tests that reduce the number of potential matches to be evaluated. Additionally, the global matcher is combined with a fast local technique. In both use cases, the global reconstruction problem is modeled as a sparse graph, where scan poses (nodes) are connected through matches (edges). The graph structure allows formulating a sparse global optimization problem that optimizes scan poses, considering all accepted matches simultaneously. Our approach is being used in production systems and has been successfully evaluated on several real and publicly available datasets.
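The sketch below illustrates only the projection step described above: each 3D point cloud is dropped onto the plane orthogonal to gravity as a 2D occupancy grid, and a relative translation is estimated by 2D cross-correlation. The grid resolution, the extent, and the assumption that gravity is aligned with the z axis are illustrative; the full pipeline also recovers rotation and validates matches.

```python
import numpy as np
from scipy.signal import correlate2d

def occupancy_grid(points, cell=0.2, extent=50.0):
    """points: (N, 3) array; gravity assumed along the z axis, which is dropped."""
    edges = np.arange(-extent, extent + cell, cell)
    grid, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=[edges, edges])
    return (grid > 0).astype(float)

def estimate_translation(cloud_a, cloud_b, cell=0.2):
    ga, gb = occupancy_grid(cloud_a, cell), occupancy_grid(cloud_b, cell)
    corr = correlate2d(ga, gb, mode="same")
    di, dj = np.unravel_index(np.argmax(corr), corr.shape)
    center = np.array(corr.shape) // 2
    # approximate planar offset in the grid's units; sign convention depends
    # on the correlation definition and should be checked against the data
    return (np.array([di, dj]) - center) * cell
```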
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
Supplementary materials – tables:
Table E1a. Characteristics of PSM strategies for comparison one
Table E1b. Characteristics of PSM strategies for comparison two
Table E2a. Coarsening of covariates used in CEM for comparison one
Table E2b. Coarsening of covariates used in CEM for comparison two
Table E3a. Characteristics of CEM strategies for comparison one
Table E3b. Characteristics of CEM strategies for comparison two
Supplementary materials – figures:
Figure E1a. Selection process for comparison one
Figure E1b. Selection process for comparison two
Figure E2a. Distribution of baseline covariates by PSM caliper width for comparison one
Figure E2b. Distribution of baseline covariates by PSM caliper width for comparison two
Figure E3a. Distribution of baseline covariates by CEM strategy for comparison one
Figure E3b. Distribution of baseline covariates by CEM strategy for comparison two
Abstract
Aims & Methods: We compared propensity score matching (PSM) and coarsened exact matching (CEM) in balancing baseline characteristics between treatment groups using observational data obtained from a pan-Canadian prostate cancer radiotherapy database. Changes in effect estimates were evaluated as a function of improvements in balance, using results from RCTs to guide interpretation. Results: CEM and PSM improved balance between groups in both comparisons, while retaining the majority of original data. Improvements in balance were associated with effect estimates closer to those obtained in RCTs. Conclusions: CEM and PSM led to substantial improvements in balance between comparison groups, while retaining a considerable proportion of original data. This could lead to improved accuracy in effect estimates obtained using observational data in a variety of clinical situations.
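To make the PSM strategy concrete, the sketch below estimates propensity scores with logistic regression and performs 1:1 greedy nearest-neighbor matching on the logit of the propensity score within a caliper; the column names and the 0.2-standard-deviation caliper rule are common conventions used for illustration, not necessarily the exact strategies compared in this study.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def psm_caliper_match(df: pd.DataFrame, covs: list, caliper_sd: float = 0.2):
    """1:1 greedy matching on the logit of the propensity score, within a caliper."""
    model = LogisticRegression(max_iter=1000).fit(df[covs], df["treated"])
    ps = model.predict_proba(df[covs])[:, 1]
    logit = pd.Series(np.log(ps / (1 - ps)), index=df.index)
    caliper = caliper_sd * logit.std()
    controls = list(df.index[df["treated"] == 0])
    pairs = []
    for t in df.index[df["treated"] == 1]:
        if not controls:
            break
        c = min(controls, key=lambda c: abs(logit[c] - logit[t]))
        if abs(logit[c] - logit[t]) <= caliper:
            pairs.append((t, c))
            controls.remove(c)       # matching without replacement
    return pairs
```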
This research aims to solve a selection problem leading to a decision. In the profile matching method, each parameter is assessed by the difference (gap) between the target value and the value held by an individual. There are two important parameter groups in this method: core factors and secondary factors. These values are converted into percentage weights to produce a final score indicating which data are closest to the predetermined targets. With this method, the data can be sorted dynamically against specific criteria.
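The sketch below illustrates the gap-weighting logic described above: each criterion is scored by its gap from the target value, and core and secondary factors are combined with percentage weights. The gap-to-score rule and the 60/40 split are common conventions used here for illustration, not fixed parts of this dataset.

```python
def gap_score(value: int, target: int) -> float:
    return max(0.0, 5.0 - abs(value - target))   # smaller gap -> higher score

def profile_match(candidate: dict, targets: dict, core: set,
                  core_weight: float = 0.6) -> float:
    core_scores = [gap_score(candidate[k], targets[k]) for k in core]
    sec_scores = [gap_score(candidate[k], targets[k]) for k in targets if k not in core]
    ncf = sum(core_scores) / len(core_scores)                    # core factor average
    nsf = sum(sec_scores) / len(sec_scores) if sec_scores else 0.0
    return core_weight * ncf + (1 - core_weight) * nsf           # weighted final score

# illustrative criteria and weights
targets = {"skill": 4, "experience": 3, "teamwork": 5}
print(profile_match({"skill": 3, "experience": 3, "teamwork": 4}, targets, core={"skill"}))
```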
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Quantitative measures computed at the whole-brain scale.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Descriptive results: Math GPA for summer school participants and non-participants.