99 datasets found
1. Data for: "Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records"

    • dataverse.harvard.edu
    Updated Jan 13, 2025
    Cite
    Connor Jerzak (2025). Data for: "Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records" [Dataset]. http://doi.org/10.7910/DVN/EHRQQL
    Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Connor Jerzak
    License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

Abstract: Scholars studying organizations often work with multiple datasets lacking shared unique identifiers or covariates. In such situations, researchers usually use approximate string ("fuzzy") matching methods to combine datasets. String matching, although useful, faces fundamental challenges. Even when two strings appear similar to humans, fuzzy matching often does not work because it fails to adapt to the informativeness of the character combinations. In response, a number of machine-learning methods have been developed to refine string matching. Yet, the effectiveness of these methods is limited by the size and diversity of training data. This paper introduces data from a prominent employment networking site (LinkedIn) as a massive training corpus to address these limitations. We show how, by leveraging information from LinkedIn regarding organizational name-to-name links, we can improve upon existing matching benchmarks, incorporating the trillions of name pair examples from LinkedIn into various methods to improve performance by explicitly maximizing match probabilities inferred from the LinkedIn corpus. We also show how relationships between organization names can be modeled using a network representation of the LinkedIn data. In illustrative merging tasks involving lobbying firms, we document improvements when using the LinkedIn corpus in matching calibration and make all data and methods open source.

Keywords: Record linkage; Interest groups; Text as data; Unstructured data
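For readers new to the baseline being critiqued, a minimal fuzzy-matching sketch in Python using only the standard library is shown below. This is generic character-level similarity, not the paper's LinkedIn-calibrated method, and the example names are illustrative.

```python
# Generic character-level fuzzy matching (the baseline the abstract
# critiques), using only the Python standard library.
from difflib import SequenceMatcher

def top_matches(query, candidates, k=3):
    """Rank candidate organization names by string similarity to query."""
    scored = [(SequenceMatcher(None, query.lower(), c.lower()).ratio(), c)
              for c in candidates]
    return sorted(scored, reverse=True)[:k]

names = ["JPMorgan Chase & Co.", "Chase Bank", "Morgan Stanley"]
print(top_matches("JP Morgan", names))
# Character similarity can rank "Morgan Stanley" above "Chase Bank":
# the failure mode that corpus-trained match probabilities address.
```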

  2. Amazon-Google, Augmented Version, Fixed Splits

    • linkagelibrary.icpsr.umich.edu
    Updated Nov 23, 2020
    Cite
    Anna Primpeli; Christian Bizer (2020). Amazon-Google, Augmented Version, Fixed Splits [Dataset]. http://doi.org/10.3886/E127241V1
    Dataset updated
    Nov 23, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Anna Primpeli; Christian Bizer
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Motivation: Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets including both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. To enhance reproducibility and comparability, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits.

Dataset description: An augmented version of the Amazon-Google products dataset for benchmarking entity matching/record linkage methods, found at https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolutio... The augmented version adds a fixed set of non-matching pairs to the original dataset. In addition, fixed splits for training, validation and testing, as well as their corresponding feature vectors, are provided. The feature vectors are built using data type specific similarity metrics. The dataset contains 1,363 records describing products from Amazon which are matched against 3,226 product records from Google. The gold standard has manual annotations for 1,298 matching and 6,306 non-matching pairs. The total number of attributes used to describe the product records is 4, while the attribute density is 0.75. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
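To illustrate what "feature vectors built using data type specific similarity metrics" means in practice, here is a hypothetical sketch for a single record pair. The attribute names and metric choices are illustrative; the published feature vectors may be constructed differently.

```python
# One similarity feature per attribute, with the metric chosen by type.
from difflib import SequenceMatcher

def string_sim(a, b):
    if not a or not b:                    # values may be missing
        return -1.0                       # sentinel for missing
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def numeric_sim(a, b):
    if a is None or b is None:
        return -1.0
    return 1.0 - abs(a - b) / max(abs(a), abs(b), 1e-9)

def pair_features(rec_a, rec_b):
    return [string_sim(rec_a.get("title"), rec_b.get("title")),
            string_sim(rec_a.get("manufacturer"), rec_b.get("manufacturer")),
            numeric_sim(rec_a.get("price"), rec_b.get("price"))]

amazon = {"title": "canon powershot sd1000", "manufacturer": "canon", "price": 299.0}
google = {"title": "canon power shot sd-1000", "manufacturer": None, "price": 295.0}
print(pair_features(amazon, google))      # roughly [0.9, -1.0, 0.99]
```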

3. Replication Data for: Matching Methods for Causal Inference with Time-Series Cross-Section Data

    • dataverse.harvard.edu
    Updated Oct 13, 2021
    Cite
    Kosuke Imai; In Song Kim; Erik Wang (2021). Replication Data for: Matching Methods for Causal Inference with Time-Series Cross-Section Data [Dataset]. http://doi.org/10.7910/DVN/ZTDHVE
    Explore at:
Croissant
    Dataset updated
    Oct 13, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Kosuke Imai; In Song Kim; Erik Wang
    License

https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/ZTDHVE

    Description

    Matching methods improve the validity of causal inference by reducing model dependence and offering intuitive diagnostics. While they have become a part of the standard tool kit across disciplines, matching methods are rarely used when analyzing time-series cross-sectional data. We fill this methodological gap. In the proposed approach, we first match each treated observation with control observations from other units in the same time period that have an identical treatment history up to the pre-specified number of lags. We use standard matching and weighting methods to further refine this matched set so that the treated and matched control observations have similar covariate values. Assessing the quality of matches is done by examining covariate balance. Finally, we estimate both short-term and long-term average treatment effects using the difference-in-differences estimator, accounting for a time trend. We illustrate the proposed methodology through simulation and empirical studies. An open-source software package is available for implementing the proposed methods.
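To make the matched-set construction concrete, here is a minimal sketch assuming a long-format pandas DataFrame with columns unit, time, and treat. The covariate refinement, weighting, and difference-in-differences steps are omitted, and the treated unit is simply required to be untreated throughout the lag window (a simplification).

```python
# Minimal matched-set construction: for each treated (unit, time), find
# other units untreated at that time whose treatment history over the
# previous `lags` periods matches. Column names are assumptions.
import pandas as pd

def matched_sets(df: pd.DataFrame, lags: int = 3):
    panel = df.pivot(index="unit", columns="time", values="treat")
    sets = {}
    for unit in panel.index:
        for t in panel.columns:
            past = [t - k for k in range(1, lags + 1) if t - k in panel.columns]
            history = panel.loc[unit, past]
            # treated now, untreated through the lag window (simplification)
            if panel.loc[unit, t] == 1 and (history == 0).all():
                sets[(unit, t)] = [
                    j for j in panel.index
                    if j != unit and panel.loc[j, t] == 0
                    and (panel.loc[j, past] == history).all()
                ]
    return sets

df = pd.DataFrame({"unit": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "time": [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   "treat": [0, 0, 1, 0, 0, 0, 0, 1, 1]})
print(matched_sets(df, lags=2))   # e.g. {(1, 3): [2], (3, 2): [1, 2]}
```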

  4. Data from: Automated Linking of Historical Data

    • linkagelibrary.icpsr.umich.edu
    Updated Aug 20, 2020
    Cite
    Ran Abramitzky; Leah Boustan; Katherine Eriksson; James Feigenbaum; Santiago Perez (2020). Automated Linking of Historical Data [Dataset]. http://doi.org/10.3886/E120703V1
    Dataset updated
    Aug 20, 2020
    Authors
    Ran Abramitzky; Leah Boustan; Katherine Eriksson; James Feigenbaum; Santiago Perez
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1850 - 1940
    Area covered
    United States
    Description

Currently, the repository provides code for two such methods:

The ABE fully automated approach: a fully automated method for linking historical datasets (e.g., complete-count Censuses) by first name, last name and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chances of false positives, our approach suggests testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact.

A fully automated probabilistic approach (EM): this approach (Abramitzky, Mill, and Perez 2019) suggests a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between any two potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
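A compressed sketch of the ABE-style rule follows, assuming the `jellyfish` library for NYSIIS standardization. The released code adds Jaro-Winkler variants and the five-year-window uniqueness checks, none of which are shown here.

```python
# ABE-style sketch: standardize names with NYSIIS, require an (exact or
# banded) age match, and keep only unique hits.
import jellyfish

def abe_match(record, census, age_band=0):
    """Return the census row matching `record`, or None if not unique."""
    f = jellyfish.nysiis(record["first"])
    l = jellyfish.nysiis(record["last"])
    hits = [r for r in census
            if jellyfish.nysiis(r["first"]) == f
            and jellyfish.nysiis(r["last"]) == l
            and abs(r["age"] - record["age"]) <= age_band]
    return hits[0] if len(hits) == 1 else None  # discard ambiguous matches

rec = {"first": "Jon", "last": "Smith", "age": 34}
census = [{"first": "John", "last": "Smith", "age": 34},
          {"first": "Jane", "last": "Smyth", "age": 30}]
print(abe_match(rec, census))
```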

  5. Data from: Stochastic Matching DataSet

    • kaggle.com
    Updated Jul 2, 2023
    Cite
    knightwayne (2023). Stochastic Matching DataSet [Dataset]. https://www.kaggle.com/datasets/knightwayne/stochastic-matching-dataset/data
    Explore at:
Croissant
    Dataset updated
    Jul 2, 2023
    Dataset provided by
Kaggle: http://kaggle.com/
    Authors
    knightwayne
    License

Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by knightwayne

    Released under CC BY-SA 3.0


6. Data from: Comparative study on matching methods for the distinction of building modifications and replacements based on multi-temporal building footprint data

    • figshare.com
    zip
    Updated Jan 26, 2022
    Cite
Martin Schorcht; Robert Hecht; Gotthard Meinel (2022). Comparative study on matching methods for the distinction of building modifications and replacements based on multi-temporal building footprint data [Dataset]. http://doi.org/10.6084/m9.figshare.18027683.v1
    Explore at:
zip
    Dataset updated
    Jan 26, 2022
    Dataset provided by
Figshare: http://figshare.com/
    Authors
    Martin Schorcht; Robert Hecht; Gotthard Meinel
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset contains the input data and results used in the paper "Comparative study on matching methods for the distinction of building modifications and replacements based on multi-temporal building footprint data".

License information: The LoD1 data used as input in this study are openly available at Transparenzportal Hamburg (https://transparenz.hamburg.de/), from Freie und Hansestadt Hamburg, Landesbetrieb Geoinformation und Vermessung (LGV), in compliance with the licence dl-de/by-2-0 (https://www.govdata.de/dl-de/by-2-0).

Content:
1. Input footprints of non-identical pairs: input_reference_objects.zip
2. Results without additional position deviation: results_without_deviation.zip
3. Results with generated position deviation including geometries: results_with_deviation.zip

7. Valentine Datasets

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Jul 9, 2021
    Cite
    Katsifodimos, Asterios (2021). Valentine Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5084604
    Dataset updated
    Jul 9, 2021
    Dataset provided by
    Psarakis, Kyriakos
    Lofi, Christoph
    Katsifodimos, Asterios
    Fragkoulis, Marios
    Brons, Jerry
    Bonifati, Angela
    Ionescu, Andra
    Siachamis, Georgios
    Koutras, Christos
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Datasets used for evaluating state-of-the-art schema matching methods in the paper "Valentine: Evaluating Matching Techniques for Dataset Discovery", which was accepted for presentation at IEEE ICDE 2021. They come in the form of fabricated pairs respecting a relatedness scenario, as discussed in the paper.

8. Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0

    • b2find.eudat.eu
    Updated Nov 27, 2020
    + more versions
    Cite
    (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 Product Matching Task derived from the WDC Product Data Corpus - Version 2.0 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/720b440c-eda0-5182-af9f-f868ed999bd7
    Dataset updated
    Nov 27, 2020
    Description

Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, there are sets of ids for each training set for a possible validation split (stratified random draw) available. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using the weak supervision of shared product identifiers from the Web. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
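For readers unfamiliar with the mechanics, a stratified validation draw of the kind described (ids drawn so the match/no-match ratio is preserved) looks like this with scikit-learn; the published fixed splits should of course be used for comparability.

```python
# Stratified validation draw: sample ids so the label ratio is preserved.
import pandas as pd
from sklearn.model_selection import train_test_split

pairs = pd.DataFrame({"pair_id": range(10),
                      "label": [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]})  # match / no match
train_ids, val_ids = train_test_split(pairs["pair_id"], test_size=0.2,
                                      stratify=pairs["label"], random_state=42)
print(sorted(val_ids))
```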

9. Replication Data for: The Balance-Sample Size Frontier in Matching Methods for Causal Inference

    • dataverse.harvard.edu
    pdf, tsv, txt +1
    Updated Jul 1, 2017
    Cite
    Harvard Dataverse (2017). Replication Data for: The Balance-Sample Size Frontier in Matching Methods for Causal Inference [Dataset]. http://doi.org/10.7910/DVN/SURSEO
    Explore at:
tsv(184878), tsv(925446), type/x-r-syntax(42824), pdf(66052), txt(1742)
    Dataset updated
    Jul 1, 2017
    Dataset provided by
    Harvard Dataverse
    License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We propose a simplified approach to matching for causal inference that simultaneously optimizes both balance (similarity between the treated and control groups) and matched sample size. Existing approaches either fix the matched sample size and maximize balance or fix balance and maximize sample size, leaving analysts to settle for suboptimal solutions or attempt manual optimization by iteratively tweaking their matching method and rechecking balance. To jointly maximize balance and sample size, we introduce the matching frontier, the set of matching solutions with maximum balance for each possible sample size. Rather than iterating, researchers can choose matching solutions from the frontier for analysis in one step. We derive fast algorithms that calculate the matching frontier for several commonly used balance metrics. We demonstrate with analyses of the effect of sex on judging and job training programs that show how the methods we introduce can extract new knowledge from existing data sets.
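One balance metric such a frontier can be computed over is the average standardized mean difference between treated and control covariates. A minimal version is sketched below, assuming numpy arrays; the paper's fast frontier algorithms are considerably more involved.

```python
# Average absolute standardized mean difference across covariates.
import numpy as np

def mean_smd(X_treated: np.ndarray, X_control: np.ndarray) -> float:
    diff = X_treated.mean(axis=0) - X_control.mean(axis=0)
    pooled_sd = np.sqrt((X_treated.var(axis=0, ddof=1)
                         + X_control.var(axis=0, ddof=1)) / 2)
    return float(np.mean(np.abs(diff) / pooled_sd))

rng = np.random.default_rng(0)
Xt, Xc = rng.normal(0.3, 1, (50, 4)), rng.normal(0.0, 1, (200, 4))
print(round(mean_smd(Xt, Xc), 3))   # imbalance before any pruning
```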

10. Product Datasets from the MWPD2020 Challenge at the ISWC2020 Conference (Task 1)

    • b2find.eudat.eu
    Updated Nov 27, 2020
    + more versions
    Cite
    (2020). Product Datasets from the MWPD2020 Challenge at the ISWC2020 Conference (Task 1) Product Data Matching Task derived from the WDC Product Data Corpus Large-Scale Product Matching - Version 2.0 used for the MWPD2020 Challenge at the ISWC2020 Conference - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/d1288b98-f495-5258-855e-3d153a7b62ed
    Dataset updated
    Nov 27, 2020
    Description

The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) from the product category computers. The data is available in the form of training, validation and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets, as it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g. products without training data in the training set or products with introduced typos. These can be used to measure the performance of methods on specific kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites marking up their offers with the schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.

11. Data from: Disparity Selective Stereo Matching Using Correlation Confidence Measure

    • figshare.com
    txt
    Updated Aug 1, 2018
    Cite
    Sangkeun Lee (2018). Disparity Selective Stereo Matching Using Correlation Confidence Measure [Dataset]. http://doi.org/10.6084/m9.figshare.6885158.v1
    Explore at:
txt
    Dataset updated
    Aug 1, 2018
    Dataset provided by
    figshare
    Authors
    Sangkeun Lee
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

Test code to reproduce the results of the paper. This work presents a robust stereo matching method for occluded regions. First, we generate cost volumes using the CENSUS transform and the scale-invariant feature transform (SIFT). Then, label-based cost volumes are aggregated from the two generated cost volumes using an adaptive support weight and the SLIC scheme. To obtain the optimal disparity from the two label-based cost volumes, we select the disparity corresponding to the higher-confidence similarity (CENSUS or SIFT) at the minimum-cost point.
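For orientation, a minimal CENSUS transform (one of the two cost-volume inputs named above) is sketched below; the window size and the wrap-around border handling of np.roll are illustrative choices, not the paper's settings.

```python
# Minimal CENSUS transform: each pixel is encoded by a bit string
# recording whether each neighbor in a window is darker than the center.
import numpy as np

def census_transform(img: np.ndarray, window: int = 3) -> np.ndarray:
    """Return one integer census code per pixel (np.roll wraps at borders)."""
    r = window // 2
    codes = np.zeros(img.shape, dtype=np.uint64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            codes = (codes << np.uint64(1)) | (shifted < img).astype(np.uint64)
    return codes

def hamming(a, b):
    """Matching cost between two census codes is the Hamming distance."""
    return bin(int(a) ^ int(b)).count("1")
```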

12. Development of Automatic History Matching Techniques for Geothermal Applications

    • data.wu.ac.at
    Updated Dec 29, 2015
    Cite
    (2015). Development of Automatic History Matching Techniques for Geothermal Applications [Dataset]. https://data.wu.ac.at/odso/geothermaldata_org/Yzk3YjMyODAtZjhiMC00YmFkLWE1NGMtNmU2ODRjMzc0OTU5
    Dataset updated
    Dec 29, 2015
    Description

    No Publication Abstract is Available

13. Data from: Highly Scalable Matching Pursuit Signal Decomposition Algorithm

    • catalog.data.gov
    • datasets.ai
    • +3more
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Highly Scalable Matching Pursuit Signal Decomposition Algorithm [Dataset]. https://catalog.data.gov/dataset/highly-scalable-matching-pursuit-signal-decomposition-algorithm
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

In this research, we propose a variant of the classical Matching Pursuit Decomposition (MPD) algorithm with significantly improved scalability and computational performance. MPD is a powerful iterative algorithm that decomposes a signal into linear combinations of its dictionary elements or “atoms”. A best-fit atom from an arbitrarily defined dictionary is determined through cross-correlation. The selected atom is subtracted from the signal, and this procedure is repeated on the residual in subsequent iterations until a stopping criterion is met. A sufficiently large dictionary is required for an accurate reconstruction; this in turn increases the computational burden of the algorithm, thus limiting its applicability and level of adoption. Our main contribution lies in improving the computational efficiency of the algorithm to allow faster decomposition while maintaining a similar level of accuracy. The Correlation Thresholding and Multiple Atom Extraction techniques were proposed to decrease the computational burden of the algorithm. Correlation thresholds prune insignificant atoms from the dictionary. The ability to extract multiple atoms within a single iteration enhances the effectiveness and efficiency of each iteration. The proposed algorithm, entitled MPD++, was demonstrated on a real-world data set.
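The core MPD loop is compact enough to sketch. Below is a plain greedy version with a correlation threshold used as a stopping rule; note that the paper's Correlation Thresholding instead prunes weak atoms from the dictionary, and MPD++ additionally extracts multiple atoms per iteration, neither of which is shown here.

```python
# Greedy matching pursuit: signal ~ sum_k coef_k * dictionary[:, atom_k].
# Dictionary columns are assumed unit-norm.
import numpy as np

def matching_pursuit(signal, dictionary, n_iter=10, corr_threshold=0.05):
    residual = signal.astype(float).copy()
    atoms, coefs = [], []
    for _ in range(n_iter):
        corr = dictionary.T @ residual           # cross-correlation with atoms
        best = int(np.argmax(np.abs(corr)))
        if np.abs(corr[best]) < corr_threshold:  # stopping criterion
            break
        atoms.append(best)
        coefs.append(corr[best])
        residual -= corr[best] * dictionary[:, best]
    return atoms, coefs, residual

rng = np.random.default_rng(1)
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0)                   # unit-norm atoms
x = 2.0 * D[:, 5] - 1.5 * D[:, 40]
print(matching_pursuit(x, D, n_iter=5)[0])       # typically recovers 5 and 40 first
```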

14. Replication Data for: Leveraging Large Language Models for Fuzzy String Matching in Political Science

    • search.dataone.org
    Updated Sep 24, 2024
    Cite
    Wang, Yu (2024). Replication Data for: Leveraging Large Language Models for Fuzzy String Matching in Political Science [Dataset]. http://doi.org/10.7910/DVN/A8MKLO
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Wang, Yu
    Description

Fuzzy string matching remains a key issue when political scientists combine data from different sources. Existing matching methods invariably rely on string distances, such as Levenshtein distance and cosine similarity. As such, they are inherently incapable of matching strings that refer to the same entity with different names, such as "JP Morgan" and "Chase Bank", "DPRK" and "North Korea", or "Chuck Fleischmann (R)" and "Charles Fleischmann (R)". In this letter, we propose to use large language models to entirely sidestep this problem in an easy and intuitive manner. Extensive experiments show that our proposed methods can improve the state of the art by as much as 39% in terms of average precision while being substantially easier and more intuitive to use by political scientists. Moreover, our results are robust against various temperatures. We further note that enhanced prompting can lead to additional performance improvements.
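Schematically, the approach replaces string-distance scoring with a direct question to a model. In the sketch below, the prompt wording is hypothetical and `ask_llm` stands in for whatever chat-completion client is used; the paper's actual prompts are in the replication archive.

```python
# Hypothetical prompt-based entity matching; ask_llm is a placeholder.
def build_prompt(query, candidates):
    options = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates))
    return (f"Which of the following names refers to the same entity as "
            f"'{query}'? Answer with the number only, or 'none'.\n{options}")

def llm_match(query, candidates, ask_llm):
    """ask_llm: callable str -> str, e.g. a wrapper around a chat API."""
    answer = ask_llm(build_prompt(query, candidates)).strip().lower()
    return None if answer == "none" else candidates[int(answer)]

# Example with a stubbed model:
print(llm_match("DPRK", ["South Korea", "North Korea", "Japan"],
                ask_llm=lambda prompt: "1"))
```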

15. AgLiMatch dataset

    • dataverse.csuc.cat
    • portalrecerca.udl.cat
    • +1more
    pcap, png, tsv, txt
    Updated Jun 6, 2025
    Cite
Javier Guevara; Jordi Gené Mola; Eduard Gregorio López; Miguel Torres-Torriti; Giulio Reina; Fernando Auat Cheein (2025). AgLiMatch dataset [Dataset]. http://doi.org/10.34810/data2320
    Explore at:
pcap(588461494), txt(89361), pcap(355137568), pcap(731088388), pcap(699472654), tsv(1135), txt(258148), txt(2774), pcap(552702982), txt(164443), txt(104279), txt(69031), pcap(514651510), txt(175790), png(1269595), pcap(368572624), pcap(64437102), txt(98805), txt(69652), pcap(265225278), txt(265982), txt(182), pcap(347767372), txt(211224), pcap(161080724), txt(205850)
    Dataset updated
    Jun 6, 2025
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
Javier Guevara; Jordi Gené Mola; Eduard Gregorio López; Miguel Torres-Torriti; Giulio Reina; Fernando Auat Cheein
    License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Dataset funded by
    Agencia Estatal de Investigación
    Generalitat de Catalunya
    Description

The agricultural LiDAR data to evaluate scan matching techniques (AgLiMatch dataset) comprises a set of Velodyne VLP-16 LiDAR captures and the corresponding GNSS-RTK tracks acquired in a Fuji apple orchard using an autonomous platform. This dataset was used in [1] to evaluate scan matching techniques by comparing the platform path calculated using LiDAR scan matching against the actual platform path ground truth measured with a GNSS-RTK system. The correspondence between each LiDAR file (inside the /velodyne_data folder) and GNSS track file (inside the /GNSS_data folder) is detailed in the “Velodyne-GNSS_correspondence-data.xlsx” file. The relative position between the LiDAR sensor and the GNSS rover is shown in “experimental_setup.png”. Distance units are in mm.
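As background, a single rigid-alignment step of point-to-point scan matching (the ICP family) can be sketched as follows, assuming numpy/scipy and 2D clouds; the methods evaluated against the GNSS-RTK ground truth in [1] operate on 3D Velodyne scans and are substantially more robust.

```python
# One point-to-point ICP step: pair points by nearest neighbor, then
# solve for the optimal rigid transform (Kabsch algorithm).
import numpy as np
from scipy.spatial import cKDTree

def icp_step(source: np.ndarray, target: np.ndarray):
    """One rigid alignment of source (N,2) toward target (M,2)."""
    _, idx = cKDTree(target).query(source)       # nearest-neighbor pairs
    matched = target[idx]
    mu_s, mu_t = source.mean(axis=0), matched.mean(axis=0)
    H = (source - mu_s).T @ (matched - mu_t)     # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T                               # optimal rotation
    if np.linalg.det(R) < 0:                     # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_t - R @ mu_s
    return source @ R.T + t, R, t
```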

  16. Serie A Matches Dataset (2020-2025)

    • kaggle.com
    Updated Jul 6, 2025
    Cite
    Marcel Biezunski (2025). Serie A Matches Dataset (2020-2025) [Dataset]. https://www.kaggle.com/datasets/marcelbiezunski/serie-a-matches-dataset-2020-2025
    Explore at:
Croissant
    Dataset updated
    Jul 6, 2025
    Dataset provided by
Kaggle: http://kaggle.com/
    Authors
    Marcel Biezunski
    License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Don't forget to upvote if you enjoy my work :)

    Serie A Match Results Dataset (2020–2025) was created in response to community requests following the release of my LaLiga Match Results Dataset.

    This dataset contains match-level results and performance stats from the Italian Serie A football league, covering seasons 2020 to 2025.

    Source: Data was collected using a custom Python web scraper from FBref.com (https://fbref.com/en/comps/11/Serie-A-Stats).
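The scraper itself is not included in the dataset description, but the general idea can be sketched with pandas, which parses HTML tables directly. Treat this as illustrative only: check FBref's terms, and expect rate limiting of scripted clients.

```python
# Illustrative only: parse the tables on an FBref page with pandas.
from io import StringIO
import pandas as pd
import requests

url = "https://fbref.com/en/comps/11/Serie-A-Stats"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text
tables = pd.read_html(StringIO(html))   # every <table> on the page
print(len(tables))
print(tables[0].head())                 # inspect what was parsed
```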

Uses:
- Match prediction models
- Sports analytics
- Feature engineering experiments
- Educational ML datasets

Licensing: Intended for educational and research use only. All rights remain with the original data providers.

17. Replication Data for: Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and of Measuring Match Quality

    • dataverse.harvard.edu
    Updated Dec 24, 2019
    Cite
    Reagan Mozer (2019). Replication Data for: Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and of Measuring Match Quality [Dataset]. http://doi.org/10.7910/DVN/K8IL3V
    Explore at:
Croissant
    Dataset updated
    Dec 24, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Reagan Mozer
    License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This repository contains the materials needed to replicate the results presented in Mozer et al. (2019), "Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and of Measuring Match Quality", forthcoming in Political Analysis.

18. Supplementary Material for "Towards Robust Plagiarism Detection in Programming Education: Introducing Tolerant Token Matching Techniques to Counter Novel Obfuscation Methods"

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 2, 2025
    Cite
    Hagel, Nathan (2025). Supplementary Material for "Towards Robust Plagiarism Detection in Programming Education: Introducing Tolerant Token Matching Techniques to Counter Novel Obfuscation Methods" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_15069763
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    Hagel, Nathan
    Maisch, Robin
    Bartel, Alexander
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Material for the paper "Towards Robust Plagiarism Detection in Programming Education: Introducing Tolerant Token Matching Techniques to Counter Novel Obfuscation Methods".

    We include the following artefacts:

    code: the implementation of our approach based on JPlag (code and jars)

    datasets: the student programs and obfuscated plagiarisms used in our evaluation

    gpt: the prompts and scripts used for the AI-based obfuscation and generation

    results: the raw result data of our evaluation

19. Replication data for: Comparing Experimental and Matching Methods Using a Large-Scale Voter Mobilization Experiment

    • dataverse.harvard.edu
    bin +3
    Updated Mar 16, 2010
    Cite
    Harvard Dataverse (2010). Replication data for: Comparing Experimental and Matching Methods Using a Large-Scale Voter Mobilization Experiment [Dataset]. http://doi.org/10.7910/DVN/CTT87V
    Explore at:
text/plain; charset=us-ascii(915), tsv(262834774), tsv(113549293), text/plain; charset=us-ascii(2582), tsv(204400530), tsv(215197093), tsv(85310237), tsv(199263452), text/x-stata-syntax; charset=us-ascii(11089), text/plain; charset=us-ascii(1560), tsv(111871194), tsv(201570993), tsv(264903007), tsv(87808030), bin(16292373), tsv(202577033), tsv(207072190), tsv(85249733), tsv(261656381), tsv(111975888)
    Dataset updated
    Mar 16, 2010
    Dataset provided by
    Harvard Dataverse
    License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In the social sciences, randomized experimentation is the optimal research design for establishing causation. However, for a number of practical reasons, researchers are sometimes unable to conduct experiments and must rely on observational data. In an effort to develop estimators that can approximate experimental results using observational data, scholars have given increasing attention to matching. In this article, we test the performance of matching by gauging the success with which matching approximates experimental results. The voter mobilization experiment presented here comprises a large number of observations (60,000 randomly assigned to the treatment group and nearly two million assigned to the control group) and a rich set of covariates. This study is analyzed in two ways. The first method, instrumental variables estimation, takes advantage of random assignment in order to produce consistent estimates. The second method, matching estimation, ignores random assignment and analyzes the data as though they were nonexperimental. Matching is found to produce biased results in this application because even a rich set of covariates is insufficient to control for preexisting differences between the treatment and control group. Matching, in fact, produces estimates that are no more accurate than those generated by ordinary least squares regression. The experimental findings show that brief paid get-out-the-vote phone calls do not increase turnout, while matching and regression show a large and significant effect.
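The contrast between the two estimation strategies can be illustrated with the Wald form of the IV estimator: the intent-to-treat difference scaled by the first-stage difference in contact rates. The simulated arrays below are stand-ins, not the study's data.

```python
# Wald IV sketch: random assignment Z instruments actual contact D.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
Z = rng.binomial(1, 0.03, n)               # assigned to phone-call group
D = Z * rng.binomial(1, 0.5, n)            # actually reached (compliance)
Y = rng.binomial(1, 0.45 + 0.0 * D, n)     # turnout; true effect set to zero

itt = Y[Z == 1].mean() - Y[Z == 0].mean()  # intent-to-treat difference
first_stage = D[Z == 1].mean() - D[Z == 0].mean()
print("Wald IV estimate:", itt / first_stage)
```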

20. Replication data for: Multivariate Matching Methods That are Monotonic Imbalance Bounding

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Iacus, Stefano M.; King, Gary; Porro, Giuseppe (2023). Replication data for: Multivariate Matching Methods That are Monotonic Imbalance Bounding [Dataset]. http://doi.org/10.7910/DVN/OMHQFP
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Iacus, Stefano M.; King, Gary; Porro, Giuseppe
    Description

We introduce a new "Monotonic Imbalance Bounding" (MIB) class of matching methods for causal inference with a surprisingly large number of attractive statistical properties. MIB generalizes and extends in several new directions the only existing class, "Equal Percent Bias Reducing" (EPBR), which is designed to satisfy weaker properties and only in expectation. We also offer strategies to obtain specific members of the MIB class, and analyze in more detail one member of this class, called Coarsened Exact Matching, whose properties we analyze from this new perspective. We offer a variety of analytical results and numerical simulations that demonstrate how members of the MIB class can dramatically improve inferences relative to EPBR-based matching methods. See also: Causal Inference
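As a concrete reference point for Coarsened Exact Matching, the MIB member analyzed in the paper, here is a bare-bones sketch: coarsen covariates into bins, then exactly match on the bin signature. Bin counts and column names are illustrative; production implementations (e.g. MatchIt's CEM) handle weights and pruning properly.

```python
# Bare-bones CEM: bin covariates, match exactly on the bin signature.
import pandas as pd

def cem_strata(df: pd.DataFrame, covariates, bins: int = 4):
    """Assign each row a stratum from coarsened covariates; keep only
    strata containing both treated and control units."""
    coarse = df[covariates].apply(lambda col: pd.cut(col, bins, labels=False))
    df = df.assign(stratum=coarse.astype(str).agg("-".join, axis=1))
    mixed = df.groupby("stratum")["treat"].transform(lambda s: s.nunique() == 2)
    return df[mixed]

df = pd.DataFrame({"age": [23, 25, 47, 51, 30, 33],
                   "income": [30, 32, 90, 95, 40, 42],
                   "treat": [1, 0, 1, 0, 1, 0]})
print(cem_strata(df, ["age", "income"]))
```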
