Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Researchers are often interested in linking individuals between two datasets that lack a common unique identifier. Matching procedures often struggle to match records with common names, birthplaces, or other field values. Computational feasibility is also a challenge, particularly when linking large datasets. We develop a Bayesian method for automated probabilistic record linkage and show it recovers more than 50% more true matches, holding accuracy constant, than comparable methods in a matching of military recruitment data to the 1900 U.S. Census for which expert-labeled matches are available. Our approach, which builds on a recent state-of-the-art Bayesian method, refines the modeling of comparison data, allowing disagreement probability parameters conditional on nonmatch status to be record-specific in the smaller of the two datasets. This flexibility significantly improves matching when many records share common field values. We show that our method is computationally feasible in practice, despite the added complexity, with an R/C++ implementation that achieves a significant improvement in speed over comparable recent methods. We also suggest a lightweight method for treatment of very common names and show how to estimate true positive rate and positive predictive value when true match status is unavailable.
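As a rough illustration of why record-specific nonmatch parameters help with common names, here is a minimal Fellegi–Sunter-style scoring sketch in Python. The field names, shared m-probabilities, and the simple frequency-based estimate of each record's u-probability are illustrative assumptions, not the paper's actual Bayesian estimator.

```python
# Minimal Fellegi-Sunter-style linkage sketch with record-specific
# u-probabilities, i.e. P(agree | nonmatch). All names, fields, and the
# frequency-based u estimate are illustrative assumptions.
from collections import Counter
import math

file_a = [{"first": "john", "last": "smith", "bpl": "ohio"},
          {"first": "zebulon", "last": "quist", "bpl": "utah"}]
file_b = [{"first": "john", "last": "smith", "bpl": "ohio"},
          {"first": "john", "last": "smith", "bpl": "iowa"},
          {"first": "zebulon", "last": "quist", "bpl": "utah"}]

FIELDS = ["first", "last", "bpl"]
M = {f: 0.95 for f in FIELDS}  # P(agree | match), shared across records

# Record-specific u: the chance a random nonmatch in file_b agrees with
# record a on field f is roughly the relative frequency of a's value there.
# Common values (e.g. "john smith") get high u and thus low match weight.
freq = {f: Counter(r[f] for r in file_b) for f in FIELDS}

def score(a, b):
    total = 0.0
    for f in FIELDS:
        u = min(max(freq[f][a[f]] / len(file_b), 1e-6), 1 - 1e-6)
        if a[f] == b[f]:
            total += math.log(M[f] / u)              # agreement evidence
        else:
            total += math.log((1 - M[f]) / (1 - u))  # disagreement evidence
    return total

for a in file_a:
    best = max(file_b, key=lambda b: score(a, b))
    print(a, "->", best, round(score(a, best), 2))
```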
The DMIS dataset is a flat-file record of the matching of several data set collections. Primarily it consists of VTRs, dealer records, and Observer data, in conjunction with vessel permit information, for the purpose of supporting North East Regional quota monitoring projects.
This paper investigates the method of matching with respect to two crucial implementation choices: the distance measure and the type of algorithm. We implement optimal full matching, a fully efficient algorithm, and present a framework for statistical inference. The implementation uses data from the NLSY79 to study the effect of college education on earnings. We find that decisions regarding the matching algorithm depend on the structure of the data: in the case of strong selection into treatment and treatment effect heterogeneity, full matching seems preferable; if heterogeneity is weak, pair matching suffices.
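For readers who want to experiment with this trade-off, the following is a minimal sketch of plain nearest-neighbour pair matching on an estimated propensity score, as a baseline to contrast with optimal full matching. The synthetic data and effect size are assumptions, not the NLSY79 variables.

```python
# Nearest-neighbour 1:1 pair matching on an estimated propensity score.
# Synthetic data; a simple baseline to contrast with optimal full matching.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=(n, 2))                      # covariates
t = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))  # selection into treatment
y = x[:, 0] + 2 * t + rng.normal(size=n)         # outcome, true effect = 2

ps = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]
treated = np.where(t == 1)[0]
control = set(np.where(t == 0)[0])

pairs = []
for i in treated:                                # greedy, without replacement
    if not control:
        break
    j = min(control, key=lambda j: abs(ps[i] - ps[j]))
    control.remove(j)
    pairs.append((i, j))

att = np.mean([y[i] - y[j] for i, j in pairs])
print(f"pair-matching ATT estimate: {att:.2f} (true effect: 2)")
```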
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In real-world scenarios, a major limitation of shape-matching datasets is that all meshes of the same subject share their connectivity across different poses. Shared connectivity can significantly bias shape-matching algorithms, simplifying the matching process and potentially leading to correspondences based on recurring triangle patterns rather than on geometric correspondence between mesh parts. As a consequence, the resulting correspondence may be meaningless, and the evaluation of the algorithm may be misled.
To overcome this limitation, we introduce TACO, a new dataset in which meshes representing the same subject in different poses do not share the same connectivity, and we compute new ground-truth correspondences between shapes. We extensively evaluate our dataset to ensure that ground-truth isometries are properly preserved. We also use our dataset to validate state-of-the-art shape-matching algorithms, observing a degradation in performance when the connectivity is altered.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/SURSEO
We propose a simplified approach to matching for causal inference that simultaneously optimizes both balance (similarity between the treated and control groups) and matched sample size. Existing approaches either fix the matched sample size and maximize balance or fix balance and maximize sample size, leaving analysts to settle for suboptimal solutions or attempt manual optimization by iteratively tweaking their matching method and rechecking balance. To jointly maximize balance and sample size, we introduce the matching frontier, the set of matching solutions with maximum balance for each possible sample size. Rather than iterating, researchers can choose matching solutions from the frontier for analysis in one step. We derive fast algorithms that calculate the matching frontier for several commonly used balance metrics. Analyses of the effect of sex on judging and of job training programs demonstrate how the methods we introduce can extract new knowledge from existing data sets.
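A toy version of the frontier idea follows, under the assumptions of a single covariate, mean-difference balance, and greedy pruning; the paper derives exact, fast algorithms for several balance metrics.

```python
# Sketch of a matching frontier: for each sample size, the best achievable
# balance. Here balance is |difference in covariate means| and controls are
# pruned greedily; the paper's algorithms are exact and cover other metrics.
import numpy as np

rng = np.random.default_rng(1)
x_t = rng.normal(0.5, 1, 150)              # treated covariate values
controls = list(rng.normal(0.0, 1, 300))   # control covariate values

frontier = []
while len(controls) > 10:
    frontier.append((len(controls), abs(x_t.mean() - np.mean(controls))))
    # Remove the control whose deletion minimises the resulting imbalance.
    s, m = sum(controls), len(controls)
    drop = min(range(m),
               key=lambda i: abs(x_t.mean() - (s - controls[i]) / (m - 1)))
    controls.pop(drop)

for size, imb in frontier[::72]:           # print a few frontier points
    print(f"controls kept: {size:3d}   imbalance: {imb:.3f}")
```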
Observational studies of causal effects often use multivariate matching to control imbalances in measured covariates. For instance, using network optimization, one may seek the closest possible pairing for key covariates among all matches that balance a propensity score and finely balance a nominal covariate, perhaps one with many categories. This is all straightforward when matching thousands of individuals, but requires some adjustments when matching tens or hundreds of thousands of individuals. In various senses, a sparser network—one with fewer edges—permits optimization in larger samples. The question is: What is the best way to make the network sparse for matching? A network that is too sparse will eliminate from consideration possible pairings that it should consider. A network that is not sparse enough will waste computation considering pairings that do not deserve serious consideration. We propose a new graded strategy in which potential pairings are graded, with a preference for higher grade pairings. We try to match with pairs of the best grade, incorporating progressively lower grade pairs only to the degree they are needed. In effect, only sparse networks are built, stored and optimized. Two examples are discussed, a small example with 1,567 matched pairs from clinical medicine, and a slightly larger example with 22,111 matched pairs from economics. The method is implemented in an R package RBestMatch available at https://github.com/ruoqiyu/RBestMatch. Supplementary materials for this article are available online.
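The following sketch illustrates the graded strategy on synthetic data: edges are admitted grade by grade until every treated unit can be matched. It uses SciPy's maximum-cardinality bipartite matching as a stand-in for the optimal (minimum-cost) matching that RBestMatch performs; the distances and grade thresholds are assumptions.

```python
# Graded sparsification sketch: only close (high-grade) pairings are admitted
# at first; lower grades are added only if some treated unit stays unmatched.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

rng = np.random.default_rng(2)
treated = rng.normal(0, 1, 200)
control = rng.normal(0, 1, 600)
dist = np.abs(treated[:, None] - control[None, :])

# Grade 1: very close pairs; grade 2: moderately close; grade 3: everything.
thresholds = [0.05, 0.25, np.inf]
for grade, thr in enumerate(thresholds, start=1):
    graph = csr_matrix((dist <= thr).astype(int))  # edges up to this grade
    match = maximum_bipartite_matching(graph, perm_type='column')
    n_matched = int((match >= 0).sum())
    print(f"grades <= {grade}: {graph.nnz} edges, {n_matched}/200 matched")
    if n_matched == len(treated):                  # stop: no lower grade needed
        break
```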
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/HTMX3K
We identify situations in which conditioning on text can address confounding in observational studies. We argue that a matching approach is particularly well-suited to this task, but existing matching methods are ill-equipped to handle high-dimensional text data. Our proposed solution is to estimate a low-dimensional summary of the text and condition on this summary via matching. We propose a method of text matching, topical inverse regression matching, that allows the analyst to match both on the topical content of confounding documents and the probability that each of these documents is treated. We validate our approach and illustrate the importance of conditioning on text to address confounding with two applications: the effect of perceptions of author gender on citation counts in the international relations literature and the effects of censorship on Chinese social media users.
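A rough analogue of this pipeline appears below: each document is reduced to a low-dimensional summary and matching is done on that summary plus an estimated treatment probability. Plain LDA topics and logistic regression are used here as simple stand-ins; topical inverse regression matching itself is more involved, and the documents and treatment labels are invented for illustration.

```python
# Simplified text-matching sketch: match treated to control documents on
# low-dimensional topic proportions plus an estimated treatment probability.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

docs = ["trade policy and tariffs", "tariffs hurt trade",
        "elections and voters", "voters choose candidates",
        "trade agreements and policy", "candidates debate policy"]
treated = np.array([1, 0, 1, 0, 1, 0])

counts = CountVectorizer().fit_transform(docs)
topics = LatentDirichletAllocation(n_components=2,
                                   random_state=0).fit_transform(counts)
pscore = LogisticRegression().fit(topics, treated).predict_proba(topics)[:, 1]
summary = np.column_stack([topics, pscore])   # low-dimensional summary

ctrl = np.where(treated == 0)[0]
for i in np.where(treated == 1)[0]:           # match each treated document
    j = ctrl[np.argmin(np.linalg.norm(summary[ctrl] - summary[i], axis=1))]
    print(f"treated doc {i} -> control doc {j}")
```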
Build highly targeted, custom datasets from a database of 200 million global contacts to match your target audience profile, and receive quarterly refreshes powered by Leadbook's proprietary A.I. data technology.
Build your dataset with custom attributes and conditions such as:
- Usage of a specific technology
- Minimum number of records per organisation
- Data matching against a list of area codes
- Data matching against a list of business registration numbers
- Specific headquarter and branch location combinations
Complimentary de-duplication is provided to ensure that you only pay for contacts that you don't already own.
All records include:
- Contact name
- Job title
- Contact email address
- Contact phone number
- Contact location
- Organisation name
- Organisation type
- Organisation headcount
- Primary industry
Additional information like social media handles, secondary industries, and organisation websites may be provided where available.
Pricing includes a one-time data processing fee and additional fees per data refresh.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset provides a comprehensive view of the dynamics of online matchmaking interactions. It captures essential variables that influence the likelihood of successful matches across genders, allowing researchers and analysts to explore how factors such as VIP subscription status, income level, parental status, age, and self-perceived attractiveness contribute to the outcomes of online dating.
The occurrence of zero matches for certain users within the dataset can be attributed to the presence of "ghost users." These are users who create an account but subsequently abandon the app without engaging further. Consequently, their profiles do not participate in any matching activities, leading to a recorded match count of zero. This phenomenon should be taken into account when analyzing user activity and match data, as it impacts the overall interpretation of user engagement and match success rates.
This dataset contains 1,000 records, which is relatively few for this category of dataset. Additionally, the dataset may not accurately reflect reality, as it was captured intermittently over different periods of time.
Furthermore, certain match categories are missing due to confidentiality constraints, and several other crucial variables are also absent for the same reason. Consequently, the machine learning models employed may not achieve high accuracy in predicting the number of matches.
It is important to acknowledge these limitations when interpreting the results derived from this dataset. Careful consideration of these factors is advised when drawing conclusions or making decisions based on the findings of any analyses conducted using this data.
Due to confidentiality constraints, only a small amount of data was collected. Additionally, only users with variables showing high correlation with the matching variable were included in the dataset.
As a result, the high performance of machine learning models on this dataset is primarily due to the data collection method (i.e., only high-correlation data was included).
Therefore, the findings you may derive from manipulating this dataset are not representative of the real dating world.
The source of this dataset is confidential, and it may be released in the future. For the present, this dataset can be utilized under the terms of the license visible on the dataset's card.
Users are advised to review and adhere to the terms specified in the dataset's license when using the data for any purpose.
This dataset provides insights into the dynamics of online dating interactions, allowing for predictive modeling and analysis of factors influencing matchmaking success.
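As a starting point for that kind of predictive modeling, here is a minimal sketch. The column names and the synthetic stand-in data are assumptions for illustration; substitute the dataset's actual schema in practice.

```python
# Sketch: predicting match counts from profile attributes. Column names and
# the synthetic stand-in data are assumptions, not the dataset's schema.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "is_vip": rng.integers(0, 2, n),
    "income": rng.normal(50_000, 15_000, n),
    "has_children": rng.integers(0, 2, n),
    "age": rng.integers(18, 60, n),
    "attractiveness": rng.uniform(1, 10, n),
})
df["matches"] = rng.poisson(1 + df["is_vip"] + df["attractiveness"] / 4)

# Per the card, zero-match "ghost users" may simply have abandoned the app;
# consider modelling them separately or excluding them before fitting.
active = df[df["matches"] > 0]

X_train, X_test, y_train, y_test = train_test_split(
    active.drop(columns="matches"), active["matches"], random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)
print(f"held-out R^2: {model.score(X_test, y_test):.2f}")
```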
This dataset, shared by Rabie El Kharoua, is original and has not been shared before. It is synthetic, generated for educational purposes, and well suited to data science and machine learning projects. It is made available under the CC BY 4.0 license, allowing anyone to use it in any form as long as proper citation is given to the author; a DOI is provided for referencing. The dataset is offered without any guarantees, and duplication of this work within Kaggle is not permitted. Details about the data provider will be shared soon.
In this research, we propose a variant of the classical Matching Pursuit Decomposition (MPD) algorithm with significantly improved scalability and computational performance. MPD is a powerful iterative algorithm that decomposes a signal into linear combinations of its dictionary elements, or "atoms". A best-fit atom from an arbitrarily defined dictionary is determined through cross-correlation. The selected atom is subtracted from the signal, and this procedure is repeated on the residual in subsequent iterations until a stopping criterion is met. A sufficiently large dictionary is required for an accurate reconstruction; this in turn increases the computational burden of the algorithm, thus limiting its applicability and level of adoption. Our main contribution lies in improving the computational efficiency of the algorithm to allow faster decomposition while maintaining a similar level of accuracy. We propose Correlation Thresholding and Multiple Atom Extraction techniques to decrease the computational burden: correlation thresholds prune insignificant atoms from the dictionary, and extracting multiple atoms within a single iteration enhances the effectiveness and efficiency of each iteration. The proposed algorithm, entitled MPD++, is demonstrated on a real-world data set.
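A minimal sketch of matching pursuit with the two speed-ups described above follows: a correlation threshold that restricts the search to strongly correlated atoms, and extraction of several atoms per iteration. The dictionary, thresholds, and stopping rule are illustrative choices, not MPD++'s exact ones.

```python
# Matching pursuit sketch with correlation thresholding and multiple-atom
# extraction per iteration. Parameters are illustrative, not MPD++'s.
import numpy as np

rng = np.random.default_rng(3)
n, n_atoms = 256, 512
D = rng.normal(size=(n, n_atoms))
D /= np.linalg.norm(D, axis=0)                  # unit-norm dictionary atoms
signal = 3.0 * D[:, 7] + 2.0 * D[:, 42] + 0.01 * rng.normal(size=n)

residual = signal.copy()
coeffs = {}
for it in range(10):
    corr = D.T @ residual                       # cross-correlate all atoms
    # Correlation thresholding: keep only atoms near the peak correlation.
    strong = np.flatnonzero(np.abs(corr) > 0.5 * np.abs(corr).max())
    # Multiple-atom extraction: take up to 3 strongest atoms this iteration.
    order = strong[np.argsort(-np.abs(corr[strong]))][:3]
    for k in order:
        c = D[:, k] @ residual                  # re-project on current residual
        coeffs[k] = coeffs.get(k, 0.0) + c
        residual -= c * D[:, k]
    if np.linalg.norm(residual) < 0.05 * np.linalg.norm(signal):
        break                                   # stopping criterion met

print("recovered atoms:", {k: round(v, 2) for k, v in coeffs.items()})
```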
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Figure 8: The dynamic sorting and sample-point exclusion process of the proposed method.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The authors would like to thank the Federal Government and the Heads of Government of the Länder, as well as the Joint Science Conference (GWK), for their funding and support within the framework of the NFDI4Ing consortium. Funded by the German Research Foundation (DFG) - project number 442146713.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Benchmark results for the paper "Mix-and-Match: A Model-driven Runtime Optimisation Strategy for BFS on GPUs".
Performance data for breadth-first search on an NVIDIA Titan X, including a trained binary decision tree model for predicting the best implementation for an input graph.
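The model-driven idea can be sketched as follows: train a decision tree on graph features to predict which BFS implementation will be fastest. The features, labels, and the toy rule standing in for measured timings are synthetic assumptions, not the paper's benchmark data.

```python
# Sketch: a decision tree that picks a BFS implementation from graph features.
# Features, labels, and the toy timing rule are synthetic assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
n_graphs = 500
# Features: log2 #vertices, log2 #edges, average degree, degree skew.
X = np.column_stack([
    rng.uniform(10, 24, n_graphs),
    rng.uniform(12, 30, n_graphs),
    rng.uniform(1, 200, n_graphs),
    rng.uniform(0, 5, n_graphs),
])
# Toy stand-in for measured timings: dense, skewed graphs favour bottom-up
# BFS (label 1); sparse, regular graphs favour top-down (label 0).
y = ((X[:, 2] > 60) & (X[:, 3] > 1.5)).astype(int)

clf = DecisionTreeClassifier(max_depth=4).fit(X, y)
new_graph = [[20.0, 26.0, 150.0, 3.0]]
print("predicted best implementation:",
      ["top-down", "bottom-up"][clf.predict(new_graph)[0]])
```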
https://opensource.org/licenses/GPL-3.0
A recent empirical literature models search and matching frictions by means of a reduced-form matching function. An alternative approach is to simulate the matching process directly. In this paper, we follow the latter approach to model matching in ride-hailing. We compute the matching function implied by the matching process. It exhibits increasing returns to scale, and it does not resemble the commonly used Cobb-Douglas functional form. We then use this matching function to quantify network externalities. A subsidy on the order of $2 per trip is needed to correct for these externalities and induce the market to operate efficiently. This repository contains the code and a subset of the data used for the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two datasets are included in this repository: a claim matching dataset and a claim detection dataset. The collections contain data in five languages: Bengali, English, Hindi, Malayalam, and Tamil.
The claim detection dataset contains textual claims from social media and fact-checking websites, annotated for the fact-check worthiness of the claims in each message. Data points carry one of three labels: "Yes" (the text contains one or more check-worthy claims), "No", or "Probably".
The claim matching dataset is a curated collection of pairs of textual claims from social media and fact-checking websites, built for automatic and multilingual claim matching. Pairs carry one of four labels: "Very Similar", "Somewhat Similar", "Somewhat Dissimilar", or "Very Dissimilar".
All personally identifiable information (PII), including phone numbers, email addresses, license plate numbers, and addresses, has been replaced with general tags to protect user anonymity. A detailed explanation of the curation and annotation process is provided in our ACL 2021 paper.
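A very simple baseline for data of this shape is sketched below: score candidate claim pairs by TF-IDF cosine similarity and bucket the scores into the four similarity labels. The thresholds are arbitrary illustrative choices, and the multilingual models described in the paper are far stronger than this sketch.

```python
# Baseline claim-matching sketch: TF-IDF cosine similarity over claim pairs,
# bucketed into the dataset's four labels. Thresholds are arbitrary.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pairs = [("5G towers spread the virus",
          "the virus is transmitted by 5G towers"),
         ("5G towers spread the virus",
          "drinking water cures the virus")]

vec = TfidfVectorizer().fit([t for p in pairs for t in p])
labels = ("Very Dissimilar", "Somewhat Dissimilar",
          "Somewhat Similar", "Very Similar")
for a, b in pairs:
    sim = cosine_similarity(vec.transform([a]), vec.transform([b]))[0, 0]
    print(f"{sim:.2f}  {labels[min(int(sim * 4), 3)]}  |  {a}  ~  {b}")
```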
Results of multi-party bargaining are usually described by concepts from cooperative game theory, in particular by the core. In one-on-one matching, core allocations are stable in the sense that no pair of unmatched or otherwise matched players can improve their incomes by forming a match. Because of incomplete information and bounded rationality, it is difficult to adopt a core allocation immediately. Theoretical investigations address the question of whether core allocations can be adopted in a stochastic process with repeated re-matching. In this paper, we investigate sequences of matching with data from an experimental 2×2 labor market with wage negotiations. This market has seven possible matching structures (states) and is additionally characterized by the negotiated wages and profits. First, we describe the stochastic process of transitions from one state to another, including the average transition times. Second, we identify different influences on the process parameters, such as the difference in incomes within a match. Third, allocations in the core should be completely durable, or at least more durable than comparable out-of-core allocations, but they are not. Final bargaining results (induced by a time limit) appear as snapshots of a stochastic process without absorbing states and with only weak systematic influences.
Data and R code of the analysis are provided.
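The stability notion used above can be checked mechanically: an allocation (a matching plus an income split) is in the core if no firm-worker pair could re-match and make both strictly better off. The productivities and incomes below are invented for a 2×2 market like the experiment's, purely for illustration.

```python
# Core-stability check sketch: find blocking pairs for a given allocation.
# Productivities and incomes are illustrative for a 2x2 labor market.
productivity = {("F1", "W1"): 100, ("F1", "W2"): 80,
                ("F2", "W1"): 90,  ("F2", "W2"): 70}
matching = [("F1", "W1"), ("F2", "W2")]
income = {"F1": 55, "W1": 45, "F2": 40, "W2": 30}  # splits each match's value

def blocking_pairs(matching, income):
    pairs = []
    for (f, w), value in productivity.items():
        if (f, w) in matching:
            continue
        # (f, w) blocks if re-matching could pay both more than they get now.
        if income[f] + income[w] < value:
            pairs.append((f, w))
    return pairs

print("blocking pairs:", blocking_pairs(matching, income))
# F2 and W1 jointly produce 90 > 40 + 45, so this allocation is not in the core.
```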
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The file names encode the following information: {type of sequence}_{type of measure}_{sequence properties}_{additional information}.csv
- {type of sequence}: 'synth' for synthetic data or 'london' for real mobility data from London, UK.
- {type of measure}: 'r2' for the R-squared measure or 'corr' for Spearman's correlation.
- {sequence properties}: for synthetic data, one of the three sequence types described in the research article (random, markovian, nonstationary). For real mobility data, this part encodes the data processing parameters: (...)_london_{type of mobility sequence}_{DBSCAN epsilon value}_{DBSCAN min_pts value}, where {type of mobility sequence} is 'seq' for next-place sequences, or '30min' or '1H' for next time-bin sequences, indicating the size of the time bin.
- Files ending in 'predictability' contain the R-squared and Spearman's correlation of measures calculated in relation to the predictability measure.
R2 files include values of R-squared for all types of modelled regression functions:
- 'line' indicates y = ax + b for a single variable and y = ax + by + c for two variables.
- 'expo' indicates y = a*x^b + c for a single variable and y = a*x^b + c*y^d + e for two variables.
- 'log' indicates y = a*log(x*b) + c for a single variable and y = a*x + c*log(y) + e + d*x*log(y) for two variables.
- 'logf' indicates y = a*log(x) + c*log(y) + e + b*log(x)*log(y) for two variables.
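As an illustration of how such R-squared values can be produced, here is a sketch fitting the single-variable functional forms above with SciPy; the synthetic data stands in for the measure-versus-predictability series in the CSV files.

```python
# Sketch: fit the single-variable regression forms and report R-squared.
# Synthetic data stands in for the measures in the CSV files.
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b): return a * x + b
def expo(x, a, b, c): return a * np.power(x, b) + c
def logm(x, a, b, c): return a * np.log(x * b) + c

rng = np.random.default_rng(6)
x = np.linspace(0.1, 10, 200)
y = 2.0 * np.log(x * 1.5) + 0.3 + 0.1 * rng.normal(size=200)

for name, f, p0 in [("line", line, (1, 0)), ("expo", expo, (1, 1, 0)),
                    ("log", logm, (1, 1, 0))]:
    params, _ = curve_fit(f, x, y, p0=p0, maxfev=10_000)
    resid = y - f(x, *params)
    r2 = 1 - resid.var() / y.var()   # R-squared from residual variance
    print(f"{name}: R^2 = {r2:.3f}")
```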
Separation is achieved by intelligence-based matching of the curvelet coefficients.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global machine learning assisted history matching market size reached USD 1.83 billion in 2024, reflecting a robust surge in adoption across the oil & gas, mining, and geothermal sectors. The market is on a strong growth trajectory, with a projected CAGR of 13.7% from 2025 to 2033. By the end of 2033, the market is forecast to reach USD 5.46 billion, driven by the increasing need for efficient reservoir management, enhanced production optimization, and the integration of advanced data analytics in subsurface modeling. The primary growth factor for this market is the escalating demand for digital transformation in upstream energy operations, where machine learning technologies are revolutionizing traditional history matching processes.
The rapid adoption of machine learning assisted history matching is largely attributed to the growing complexities of subsurface reservoirs and the ever-increasing volume of data generated by modern exploration and production activities. As energy companies strive to maximize reservoir recovery and minimize operational risks, machine learning algorithms offer unprecedented capabilities in automating the history matching process, reducing manual intervention, and providing more accurate reservoir models. This shift is further propelled by the oil & gas industry's ongoing transition towards digitalization, with operators seeking to leverage artificial intelligence and machine learning for predictive analytics, real-time decision making, and cost optimization. The ability of machine learning solutions to handle multi-dimensional datasets and deliver faster, more reliable results is a key driver behind the market’s impressive CAGR.
Another significant growth factor is the increasing focus on maximizing resource extraction while adhering to stringent environmental and regulatory standards. Machine learning assisted history matching allows operators to simulate numerous reservoir scenarios swiftly, enabling them to identify optimal production strategies and mitigate potential environmental impacts. The integration of cloud computing and advanced analytics platforms has further democratized access to these technologies, enabling small and medium enterprises (SMEs) to adopt sophisticated history matching solutions without the need for heavy upfront investments in IT infrastructure. Moreover, the rising demand for enhanced oil recovery (EOR) techniques, coupled with the depletion of conventional reserves, is compelling operators to invest in advanced machine learning solutions that can unlock new value from mature fields.
From a regional perspective, North America continues to dominate the machine learning assisted history matching market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The presence of major oil & gas companies, a mature digital ecosystem, and a strong focus on innovation are key factors underpinning North America’s leadership. Meanwhile, Asia Pacific is emerging as the fastest-growing regional market, bolstered by rising energy demand, significant investments in exploration activities, and the increasing adoption of digital technologies in countries such as China, India, and Australia. The Middle East & Africa region also presents substantial growth opportunities, driven by ongoing investments in upstream projects and the adoption of advanced reservoir management practices.
The machine learning assisted history matching market is segmented by solution type into software and services. The software segment currently holds the largest share, primarily due to the proliferation of advanced analytics platforms and specialized machine learning tools designed for reservoir engineers and geoscientists. These software solutions are continually evolving, incorporating new algorithms and user-friendly interfaces that streamline the history matching process. The availability of customizable software packages enables operators to tailor solutions to their specific reservoir characteristics, leading to improved model accuracy and reduced cycle times. Furthermore, the integration of cloud-based software has significantly enhanced scalability and collaboration, allowing geographically dispersed teams to work seamlessly on complex projects.
This data set was collected in a survey study on the effects of process alignment and process agility on coordination and outcomes in software project teams.