100+ datasets found
  1. Data Matching Imputation System

    • catalog.data.gov
    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    • +2 more
    Updated Oct 19, 2024
    Cite
    (Point of Contact, Custodian) (2024). Data Matching Imputation System [Dataset]. https://catalog.data.gov/dataset/data-matching-imputation-system1
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    (Point of Contact, Custodian)
    Description

    The DMIS dataset is a flat file record of the matching of several data set collections. Primarily it consists of VTRs, dealer records, Observer data in conjunction with vessel permit information for the purpose of supporting North East Regional quota monitoring projects.

  2. Data and Code for: Correlation Neglect in Student-to-School Matching

    • openicpsr.org
    delimited
    Updated Jun 6, 2023
    Cite
    Alex Rees-Jones; Ran Shorrer; Chloe Tergiman (2023). Data and Code for: Correlation Neglect in Student-to-School Matching [Dataset]. http://doi.org/10.3886/E192088V1
    Available download formats: delimited
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    American Economic Association
    Authors
    Alex Rees-Jones; Ran Shorrer; Chloe Tergiman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2019 - 2022
    Area covered
    United States
    Description

    Data and Code to accompany the paper "Correlation Neglect in Student-to-School Matching." Abstract: We present results from three experiments containing incentivized school-choice scenarios. In these scenarios, we vary whether schools' assessments of students are based on a common priority (inducing correlation in admissions decisions) or are based on independent assessments (eliminating correlation in admissions decisions). The quality of students' application strategies declines in the presence of correlated admissions: application strategies become substantially more aggressive and fail to include attractive "safety" options. We provide a battery of tests suggesting that this phenomenon is at least partially driven by correlation neglect, and we discuss implications for the design and deployment of student-to-school matching mechanisms.

  3. Maryland Counties Match Tool for Data Quality

    • catalog.data.gov
    • opendata.maryland.gov
    • +1 more
    Updated Sep 15, 2023
    Cite
    opendata.maryland.gov (2023). Maryland Counties Match Tool for Data Quality [Dataset]. https://catalog.data.gov/dataset/maryland-counties-match-tool-for-data-quality
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    opendata.maryland.gov
    Area covered
    Maryland
    Description

    Data standardization is an important part of effective management. However, sometimes people have data that doesn't match. This dataset includes different ways that counties might get written by different people. It can be used as a lookup table when you need County to be your unique identifier. For example, it allows you to match St. Mary's, St Marys, and Saint Mary's so that you can use it with disparate data from other data sets.
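A lookup table of this kind is typically applied by normalizing each raw spelling and mapping it to one canonical name. The Python sketch below illustrates the idea with the three spellings from the description. The variant-to-canonical layout and the function names are assumptions, since the dataset's actual column names are not given here.

```python
from typing import Optional

# Hypothetical variant -> canonical rows; the real dataset's columns may differ.
COUNTY_VARIANTS = {
    "St. Mary's": "St. Mary's",
    "St Marys": "St. Mary's",
    "Saint Mary's": "St. Mary's",
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't block a lookup."""
    return " ".join(name.lower().split())


# Build the lookup once, keyed by the normalized variant spelling.
LOOKUP = {normalize(variant): canonical for variant, canonical in COUNTY_VARIANTS.items()}


def canonical_county(raw: str) -> Optional[str]:
    """Return the canonical county name, or None when the spelling is not in the table."""
    return LOOKUP.get(normalize(raw))


print(canonical_county("saint  mary's"))
print(canonical_county("Unlisted County"))
```

Returning `None` for unknown spellings, rather than guessing, keeps unmatched records visible so they can be added to the table.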

  4. Data for: "Linking Datasets on Organizations Using Half a Billion...

    • dataverse.harvard.edu
    Updated Jan 13, 2025
    Cite
    Connor Jerzak (2025). Data for: "Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records" [Dataset]. http://doi.org/10.7910/DVN/EHRQQL
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Connor Jerzak
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Abstract: Scholars studying organizations often work with multiple datasets lacking shared unique identifiers or covariates. In such situations, researchers usually use approximate string ("fuzzy") matching methods to combine datasets. String matching, although useful, faces fundamental challenges. Even when two strings appear similar to humans, fuzzy matching often does not work because it fails to adapt to the informativeness of the character combinations. In response, a number of machine-learning methods have been developed to refine string matching. Yet, the effectiveness of these methods is limited by the size and diversity of training data. This paper introduces data from a prominent employment networking site (LinkedIn) as a massive training corpus to address these limitations. We show how, by leveraging information from LinkedIn regarding organizational name-to-name links, we can improve upon existing matching benchmarks, incorporating the trillions of name pair examples from LinkedIn into various methods to improve performance by explicitly maximizing match probabilities inferred from the LinkedIn corpus. We also show how relationships between organization names can be modeled using a network representation of the LinkedIn data. In illustrative merging tasks involving lobbying firms, we document improvements when using the LinkedIn corpus in matching calibration and make all data and methods open source. Keywords: Record linkage; Interest groups; Text as data; Unstructured data
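For context, the approximate string matching baseline that the abstract refers to can be sketched with Python's standard library. This is not the paper's method: it uses `difflib.SequenceMatcher` as a generic stand-in similarity measure, and the organization names, the 0.8 threshold and the function names are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib (a generic stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name, candidates, threshold=0.8):
    """Return the most similar candidate, or None if nothing reaches the threshold."""
    score, match = max((similarity(name, c), c) for c in candidates)
    return match if score >= threshold else None


# Hypothetical organization names, for illustration only.
candidates = ["Acme Consulting Group", "Apex Lobbying Partners"]
print(best_match("ACME Consulting Grp", candidates))
print(best_match("Zenith Holdings", candidates))
```

A measure like this weighs every character equally, which is the limitation the abstract describes: it cannot tell that a generic word such as "Group" carries less information than a distinctive name.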

  5. Match Data Collection Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 5, 2025
    Cite
    Archive Market Research (2025). Match Data Collection Report [Dataset]. https://www.archivemarketresearch.com/reports/match-data-collection-19382
    Available download formats: ppt, doc, pdf
    Dataset updated
    Feb 5, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global match data collection market is projected to grow from USD 940 million in 2023 to USD 3,530 million by 2033, at a CAGR of 16.7%. Growing adoption of data-driven decision-making in the sports industry, the increasing popularity of esports, and advancements in sensor technology are the primary factors driving the market growth. The use of match data allows teams, players, and coaches to gain insights into their performance, identify strengths and weaknesses, and make informed decisions. The market is segmented by type (sensor data, video data, and others), application (sports industry and esports), and region (North America, South America, Europe, Middle East & Africa, and Asia Pacific). North America is the largest market, followed by Europe. The Asia Pacific region is expected to witness the highest growth rate due to the increasing popularity of esports and the growing number of professional sports leagues in the region. Key players in the market include Opta, Sportradar, N3XT Sports, Sportsdata, OUTFORZ, KINEXON Sports, Stats Perform, Baidu Cloud, Bestdata, Gracenote, Genius Sports, Statscore, and Broadage.
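The growth figures can be checked with the standard compound annual growth rate formula. The Python sketch below uses only the two endpoints quoted in the description; the function name is illustrative. On those endpoints the implied rate over 2023 to 2033 comes out at roughly 14%, so the quoted 16.7% presumably refers to a different base year or window, such as the 2025 - 2033 period listed for this entry.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Endpoints quoted in the description, in USD million.
implied = cagr(940, 3530, 2033 - 2023)
print(f"Implied CAGR 2023-2033: {implied:.1%}")
```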

  6. Data_Sheet_1_Polar Similars: Using Massive Mobile Dating Data to Predict...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Jon Levy; Devin Markell; Moran Cerf (2023). Data_Sheet_1_Polar Similars: Using Massive Mobile Dating Data to Predict Synchronization and Similarity in Dating Preferences.docx [Dataset]. http://doi.org/10.3389/fpsyg.2019.02010.s001
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Jon Levy; Devin Markell; Moran Cerf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Leveraging a massive dataset of over 421 million potential matches between single users on a leading mobile dating application, we were able to identify numerous characteristics of effective matching. Effective matching is defined as the exchange of contact information with the likely intent to meet in person. The characteristics of effective match include alignment of psychological traits (i.e., extroversion), physical traits (i.e., height), personal choices (i.e., desiring the same relationship type), and shared experiences. For nearly all characteristics, the more similar the individuals were, the higher the likelihood was of them finding each other desirable and opting to meet in person. The only exception was introversion, where introverts rarely had an effective match with other introverts. When investigating the preliminary stages of the choice process we looked at the consistency between the choice of men/women, the time it took users to make these binary choices, and the tendency of yes/no decisions. We used a biologically inspired choice model to estimate the decision process and could predict the selection and response time with nearly 60% accuracy. Given that people make their initial selection in no more than 11 s, and ultimately prefer a partner who shares numerous attributes with them, we suggest that users are less selective in their early preferences and gradually, during their conversation, converge onto clusters that share a high degree of similarity in characteristics.

  7. Replication Data for: Matching Methods for Causal Inference with Time-Series...

    • dataverse.harvard.edu
    Updated Oct 13, 2021
    Cite
    Kosuke Imai; In Song Kim; Erik Wang (2021). Replication Data for: Matching Methods for Causal Inference with Time-Series Cross-Section Data [Dataset]. http://doi.org/10.7910/DVN/ZTDHVE
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 13, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Kosuke Imai; In Song Kim; Erik Wang
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/ZTDHVE

    Description

    Matching methods improve the validity of causal inference by reducing model dependence and offering intuitive diagnostics. While they have become a part of the standard tool kit across disciplines, matching methods are rarely used when analyzing time-series cross-sectional data. We fill this methodological gap. In the proposed approach, we first match each treated observation with control observations from other units in the same time period that have an identical treatment history up to the pre-specified number of lags. We use standard matching and weighting methods to further refine this matched set so that the treated and matched control observations have similar covariate values. Assessing the quality of matches is done by examining covariate balance. Finally, we estimate both short-term and long-term average treatment effects using the difference-in-differences estimator, accounting for a time trend. We illustrate the proposed methodology through simulation and empirical studies. An open-source software package is available for implementing the proposed methods.

  8. Assessment of States' Use of Computer Matching Protocols in SNAP

    • catalog.data.gov
    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    Updated Apr 21, 2025
    Cite
    Food and Nutrition Service (2025). Assessment of States' Use of Computer Matching Protocols in SNAP [Dataset]. https://catalog.data.gov/dataset/assessment-of-states-use-of-computer-matching-protocols-in-snap
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Food and Nutrition Service (https://www.fns.usda.gov/)
    Description

    As required by federal law, state SNAP agencies verify financial and non-financial information by matching SNAP applicant and participant information to various national and state data sources to ensure they meet the program’s eligibility criteria. Data matching is an important tool for ensuring program integrity and benefit accuracy. However, information on states’ data matching practices and protocols is limited. This study was undertaken to address this knowledge gap.

  9. Map Matching Dataset

    • ieee-dataport.org
    Updated Oct 10, 2023
    Cite
    Youliang chen (2023). Map Matching Dataset [Dataset]. https://ieee-dataport.org/documents/map-matching-dataset
    Dataset updated
    Oct 10, 2023
    Authors
    Youliang chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    China

  10. Data from: Web Data Commons Training and Test Sets for Large-Scale Product...

    • linkagelibrary.icpsr.umich.edu
    • da-ra.de
    Updated Nov 26, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.3886/E127481V1
    Dataset updated
    Nov 26, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Ralph Peeters; Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of ids are available for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
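The stratified random draw mentioned for the validation split can be illustrated in Python. This is a sketch rather than the script the authors used: the pair ids, the 20% fraction, the seed and the function name are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_validation_ids(labels, fraction=0.2, seed=0):
    """Draw a validation subset of pair ids that keeps the match / no-match ratio.

    `labels` maps a pair id to its binary label (1 = match, 0 = no match).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for pair_id, label in labels.items():
        by_label[label].append(pair_id)
    validation = set()
    for ids in by_label.values():
        k = round(len(ids) * fraction)  # same share drawn from every label group
        validation.update(rng.sample(sorted(ids), k))
    return validation


# Hypothetical labelled pairs: 20 matches and 80 non-matches.
labels = {f"pair{i}": int(i < 20) for i in range(100)}
val_ids = stratified_validation_ids(labels)
print(len(val_ids), sum(labels[p] for p in val_ids))
```

Sampling within each label group separately is what keeps the validation set's class balance equal to that of the full training set.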

  11. Data from: Image Matching System Dataset

    • universe.roboflow.com
    zip
    Updated Jun 1, 2023
    Cite
    Hyojin (2023). Image Matching System Dataset [Dataset]. https://universe.roboflow.com/hyojin-jwabh/image-matching-system
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset authored and provided by
    Hyojin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Recaptcha
    Description

    Image Matching System

    ## Overview
    
    Image Matching System is a dataset for classification tasks - it contains Recaptcha annotations for 8,828 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  12. Web Data Commons - The WDC Data Training Dataset and Gold Standard for...

    • webdatacommons.org
    json
    Cite
    Christian Bizer; Anna Primpeli; Ralph Peeters, Web Data Commons - The WDC Data Training Dataset and Gold Standard for Large-Scale Product Matching [Dataset]. http://www.webdatacommons.org/largescaleproductcorpus/
    Available download formats: json
    Authors
    Christian Bizer; Anna Primpeli; Ralph Peeters
    Description

    The training dataset consists of 20 million pairs of product offers referring to the same products. The offers were extracted from 43 thousand e-shops which provide schema.org annotations including some form of product ID such as a GTIN or MPN. We also created a gold standard by manually verifying 2000 pairs of offers belonging to four different product categories.

  13. Data from: Matching Treatment and Offender: North Carolina, 1980-1982

    • catalog.data.gov
    • gimi9.com
    • +1 more
    Updated Mar 12, 2025
    + more versions
    Cite
    National Institute of Justice (2025). Matching Treatment and Offender: North Carolina, 1980-1982 [Dataset]. https://catalog.data.gov/dataset/matching-treatment-and-offender-north-carolina-1980-1982-bdbd9
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    National Institute of Justice (http://nij.ojp.gov/)
    Area covered
    North Carolina
    Description

    These data were gathered in order to evaluate the implications of rational choice theory for offender rehabilitation. The hypothesis of the research was that income-enhancing prison rehabilitation programs are most effective for the economically motivated offender. The offender was characterized by demographic and socio-economic characteristics, criminal history and behavior, and work activities during incarceration. Information was also collected on type of release and post-release recidivistic and labor market measures. Recidivism was measured by arrests, convictions, and reincarcerations, length of time until first arrest after release, and seriousness of offense leading to reincarceration.

  14. Data from: imatch for matching in Stata

    • researchdata.edu.au
    ado, doc, txt
    Updated Jan 1, 2017
    Cite
    Associate Professor Zhiqiang Wang (2017). imatch for matching in Stata [Dataset]. http://doi.org/10.14264/UQL.2017.982
    Available download formats: doc (60416), txt (3648), ado (3224)
    Dataset updated
    Jan 1, 2017
    Dataset provided by
    The University of Queensland
    Authors
    Associate Professor Zhiqiang Wang
    License

    https://guides.library.uq.edu.au/deposit-your-data/license-reuse-data-agreement

    Description

    The imatch program was written for Stata users to match different groups according to multiple variables. Program file: imatch.ado. Help file: imatch.hlp.

  15. Semantic address matching dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Lin, Yue (2020). Semantic address matching dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3477006
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Lin, Yue
    Kang, Mengjun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data for our paper Lin, Y., Kang, M., Wu, Y., Du, Q. and Liu, T. (2019) A deep learning architecture for semantic address matching, International Journal of Geographical Information Science, DOI: 10.1080/13658816.2019.1681431

    Below is an overview of each file in this dataset.

    • train.txt: The training dataset
    • train_code_a.txt: The index representations of the address elements in Sa (i.e., address elements represented by the corresponding indexes in the vocabulary obtained by word2vec)
    • train_code_b.txt: The index representations of the address elements in Sb
    • train_lable.txt: The labels of address pairs in the training dataset
    • dev.txt: The development dataset
    • dev_code_a.txt: The index representations of the address elements in Sa
    • dev_code_b.txt: The index representations of the address elements in Sb
    • dev_lable.txt: The labels of address pairs in the development dataset
    • test.txt: The test dataset
    • test_code_a.txt: The index representations of the address elements in Sa
    • test_code_b.txt: The index representations of the address elements in Sb
    • test_lable.txt: The labels of address pairs in the test dataset
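To make the idea of an index representation concrete, here is a small Python sketch. It assumes each address has already been split into elements and builds a plain lookup vocabulary, whereas the dataset takes its vocabulary from word2vec. The element strings, the reserved unknown index and the function names are hypothetical.

```python
def build_vocab(token_lists):
    """Assign each distinct address element an integer index; 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Replace each address element with its vocabulary index."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]


# Hypothetical, already-segmented address elements for one address pair (Sa, Sb).
sa = ["Springfield", "Main Street", "No. 12"]
sb = ["Springfield", "Main St", "12"]
vocab = build_vocab([sa, sb])
print(encode(sa, vocab), encode(sb, vocab))
```

The two encoded sequences share an index only where the element strings are identical, which is why a learned model is needed to recognize that the remaining elements refer to the same place.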

  16. Replication Data for: Matching with Text Data: An Experimental Evaluation of...

    • dataverse.harvard.edu
    Updated Dec 24, 2019
    Cite
    Reagan Mozer (2019). Replication Data for: Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and of Measuring Match Quality [Dataset]. http://doi.org/10.7910/DVN/K8IL3V
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 24, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Reagan Mozer
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This repository contains the materials needed to replicate the results presented in Mozer et al. (2019), "Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and of Measuring Match Quality", forthcoming in Political Analysis.

  17. Pattern Matching Rules for Identifying Age Data.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Mar 2, 2015
    Cite
    Burnap, Pete; Sloan, Luke; Williams, Matthew; Morgan, Jeffrey (2015). Pattern Matching Rules for Identifying Age Data. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001929618
    Dataset updated
    Mar 2, 2015
    Authors
    Burnap, Pete; Sloan, Luke; Williams, Matthew; Morgan, Jeffrey
    Description

    Pattern Matching Rules for Identifying Age Data.

  18. FRC Match Dataset

    • cubig.ai
    Updated Jun 5, 2025
    Cite
    CUBIG (2025). FRC Match Dataset [Dataset]. https://cubig.ai/store/products/397/frc-match-dataset
    Dataset updated
    Jun 5, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction
    • The FRC Match Dataset is based on FIRST Robotics Competition (FRC) records from 2018 to 2025. It is robot competition match data that includes, for each match, information such as EPA (Expected Score Contribution), match win rate, team composition, and match results.

    2) Data Utilization
    (1) Characteristics of the FRC Match Data:
    • Each row contains numerical and categorical variables such as year, event, playoff status, match stage, winning team, EPA-based probability of victory, team name and composition, and match results, which together provide team and match performance and forecasting indicators.
    (2) Uses of the FRC Match Data:
    • Prediction and assessment of match results: Using EPA and past match data, machine learning models can predict match wins and losses, and the reliability of those models can be evaluated with indicators such as the Brier score.
    • Team strategy and performance analysis: EPA, win rate, and matchup data for each team can be analyzed to understand strategic contribution, cooperation effects, seasonal trends, and the characteristics of strong and weak teams.
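The Brier score named above is the mean squared difference between predicted probabilities and the 0/1 outcomes, so lower is better. A minimal Python sketch follows. The probabilities and results are made-up values for illustration and are not taken from the dataset.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted win probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


# Hypothetical predicted win probabilities and actual results (1 = win, 0 = loss).
predicted = [0.9, 0.6, 0.3, 0.5]
actual = [1, 1, 0, 0]
print(round(brier_score(predicted, actual), 4))
```

As a reference point, a model that always predicts 0.5 scores 0.25, so a useful win-probability model should come in below that.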

  19. string-matching-data

    • kaggle.com
    Updated Feb 14, 2025
    Cite
    ayushokaay (2025). string-matching-data [Dataset]. https://www.kaggle.com/datasets/ayushparwal/string-matching-data/versions/1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 14, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ayushokaay
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by neednot_toplay

    Released under Apache 2.0


  20. MODMatcher: Multi-Omics Data Matcher for Integrative Genomic Analysis

    • plos.figshare.com
    tiff
    Updated May 31, 2023
    Cite
    Seungyeul Yoo; Tao Huang; Joshua D. Campbell; Eunjee Lee; Zhidong Tu; Mark W. Geraci; Charles A. Powell; Eric E. Schadt; Avrum Spira; Jun Zhu (2023). MODMatcher: Multi-Omics Data Matcher for Integrative Genomic Analysis [Dataset]. http://doi.org/10.1371/journal.pcbi.1003790
    Available download formats: tiff
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Seungyeul Yoo; Tao Huang; Joshua D. Campbell; Eunjee Lee; Zhidong Tu; Mark W. Geraci; Charles A. Powell; Eric E. Schadt; Avrum Spira; Jun Zhu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Errors in sample annotation or labeling often occur in large-scale genetic or genomic studies and are difficult to avoid completely during data generation and management. For integrative genomic studies, it is critical to identify and correct these errors. Different types of genetic and genomic data are inter-connected by cis-regulations. On that basis, we developed a computational approach, Multi-Omics Data Matcher (MODMatcher), to identify and correct sample labeling errors in multiple types of molecular data, which can be used in further integrative analysis. Our results indicate that inspection of sample annotation and labeling error is an indispensable data quality assurance step. Applied to a large lung genomic study, MODMatcher increased statistically significant genetic associations and genomic correlations by more than two-fold. In a simulation study, MODMatcher provided more robust results by using three types of omics data than two types of omics data. We further demonstrate that MODMatcher can be broadly applied to large genomic data sets containing multiple types of omics data, such as The Cancer Genome Atlas (TCGA) data sets.

Data Matching Imputation System

2 scholarly articles cite this dataset (View in Google Scholar)