I am not the owner of this dataset. My sole intention is to make the dataset easily available to enthusiasts who are curious about Entity Resolution. Here is the original source of the dataset. The dataset is also available through an R package, from which I downloaded it.
The restaurant dataset contains 864 restaurant records drawn from two data sources (Fodor’s and Zagat’s restaurant guides), provided by Sheila Tejada. Each restaurant is described by name, address, city, phone, and category. Among these records, 112 pairs refer to the same entity.
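As a rough illustration of how candidate pairs from the two guides could be scored, the sketch below averages per-field string similarity using only the Python standard library. The sample records, the choice of fields, and the equal weighting are assumptions made for illustration; they are not taken from the dataset or from any published method.

```python
from difflib import SequenceMatcher

# Fields compared; "category" is left out here purely as an illustrative choice.
FIELDS = ("name", "address", "city", "phone")


def normalize(value):
    """Lowercase the text and replace punctuation with single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in value.lower())
    return " ".join(cleaned.split())


def record_similarity(a, b):
    """Average the per-field similarity of two records (0.0 to 1.0)."""
    scores = [
        SequenceMatcher(None, normalize(a[f]), normalize(b[f])).ratio()
        for f in FIELDS
    ]
    return sum(scores) / len(scores)


# Made-up records in the spirit of the dataset, not real rows from it.
rec_a = {"name": "Example Grill", "address": "12 Main St.",
         "city": "Los Angeles", "phone": "310/555-0100"}
rec_b = {"name": "Example Grill", "address": "12 Main Street",
         "city": "Los Angeles", "phone": "310-555-0100"}
rec_c = {"name": "Harbor Noodle House", "address": "980 Pier Ave.",
         "city": "New York", "phone": "212/555-0199"}

print(round(record_similarity(rec_a, rec_b), 3))
print(round(record_similarity(rec_a, rec_c), 3))
```

A threshold on this score would turn it into a match decision; where to set it is something to tune against the 112 known matching pairs.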
This dataset was created by Mustafa Fatakdawala
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The repository includes 13 established datasets for evaluating ML- and DL-based matching algorithms:
Additionally, the repository includes five new benchmark datasets that are drawn from the following databases using a principled approach based on DeepBlocker:
The datasets are available in six different formats so that they can be processed by the following matching algorithms:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SPIDER (Synthetic Person Information Dataset for Entity Resolution) offers researchers ready-to-use data for benchmarking duplicate detection and entity resolution algorithms. The dataset targets the person-level fields typical of customer data. Because real-world person-level data is hard to source due to Personally Identifiable Information (PII) concerns, very few synthetic datasets are publicly available, and those that exist are limited by small volume and missing core person-level fields. SPIDER addresses these challenges by focusing on core person-level attributes: first/last name, email, phone, address, and dob. Using the Python Faker library, 40,000 unique synthetic person records are created. An additional 10,000 duplicate records are generated from the base records using 7 real-world transformation rules. Each duplicate record is labelled with its original base record and the rule used to generate it through the is_duplicate_of and duplication_rule fields.
Duplicate Rules
- Duplicate record with a variation in email address
- Duplicate record with a variation in email address
- Duplicate record with last name variation
- Duplicate record with first name variation
- Duplicate record with a nickname
- Duplicate record with near exact spelling
- Duplicate record with only same email and name
Output Format
The dataset is presented in both JSON and CSV formats for use in data processing and machine learning tools.
Data Regeneration
The project includes the Python script used to generate the 50,000 person records. The script can be extended with additional duplicate rules, fuzzy name and geographical name variations, and volume adjustments.
Files Included
- spider_dataset_20250714_035016.csv
- spider_dataset_20250714_035016.json
- spider_readme.md
- DataDescriptions
- pythoncodeV1.py
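One way the is_duplicate_of and duplication_rule labels could be used is to measure how many true duplicates a matcher recovers under each rule. The sketch below does this for a deliberately naive exact-email baseline. The record_id and email column names and the rule label strings are assumptions; the real files may name them differently, so check spider_readme.md before adapting this.

```python
from collections import defaultdict

# Illustrative rows only. "record_id", "email", and the rule labels are assumed names.
rows = [
    {"record_id": "1", "email": "ann.lee@example.com",
     "is_duplicate_of": "", "duplication_rule": ""},
    {"record_id": "2", "email": "bob.ray@example.com",
     "is_duplicate_of": "", "duplication_rule": ""},
    {"record_id": "3", "email": "ann.lee@example.com",
     "is_duplicate_of": "1", "duplication_rule": "last_name_variation"},
    {"record_id": "4", "email": "bobray@example.org",
     "is_duplicate_of": "2", "duplication_rule": "email_variation"},
]


def truth_pairs_by_rule(rows):
    """Group ground-truth (base, duplicate) pairs by the rule that produced them."""
    pairs = defaultdict(set)
    for row in rows:
        if row["is_duplicate_of"]:
            pairs[row["duplication_rule"]].add(
                (row["is_duplicate_of"], row["record_id"])
            )
    return pairs


def exact_email_matches(rows):
    """Naive baseline: predict a pair whenever two records share an email."""
    by_email = defaultdict(list)
    for row in rows:
        by_email[row["email"].lower()].append(row["record_id"])
    predicted = set()
    for ids in by_email.values():
        for i, first in enumerate(ids):
            for second in ids[i + 1:]:
                predicted.add((first, second))
    return predicted


def recall_by_rule(rows):
    """Fraction of true duplicate pairs recovered, reported per duplication rule."""
    predicted = exact_email_matches(rows)
    result = {}
    for rule, pairs in truth_pairs_by_rule(rows).items():
        found = sum(
            1 for base, dup in pairs
            if (base, dup) in predicted or (dup, base) in predicted
        )
        result[rule] = found / len(pairs)
    return result


print(recall_by_rule(rows))
```

Reporting recall per rule rather than as a single number shows which kinds of variation a matcher misses, which is the main benefit of having the rule label on every duplicate.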
According to our latest research, the global Entity Resolution Software market size reached USD 2.48 billion in 2024. The market is exhibiting strong momentum and is expected to grow at a CAGR of 12.2% from 2025 to 2033, projecting the market to reach USD 7.03 billion by 2033. The surge in data-driven decision-making, rising regulatory compliance demands, and the proliferation of digital customer touchpoints are primary growth drivers fueling the expansion of the Entity Resolution Software market worldwide.
The growth of the Entity Resolution Software market is primarily propelled by the exponential increase in data volumes across enterprises and industries. As organizations accumulate massive amounts of structured and unstructured data from diverse sources, the ability to accurately identify, match, and resolve entities such as customers, suppliers, and transactions becomes critical. The rise of digital transformation initiatives has made data quality and integrity a top priority, leading to increased adoption of entity resolution solutions. These platforms enable organizations to consolidate disparate data points, eliminate duplicates, and create unified, accurate records, thereby enhancing operational efficiency, customer experience, and business intelligence capabilities. The growing emphasis on data-driven strategies continues to drive demand for sophisticated entity resolution software that can seamlessly integrate with existing data management systems.
Another significant growth factor for the Entity Resolution Software market is the heightened focus on regulatory compliance and risk management. Industries such as banking, financial services, insurance (BFSI), healthcare, and government are subject to stringent data privacy and security regulations, including GDPR, HIPAA, and anti-money laundering (AML) directives. Entity resolution software plays a pivotal role in ensuring compliance by accurately linking and verifying entities across multiple datasets, thereby reducing the risk of fraud, identity theft, and regulatory breaches. The ability to maintain a single, consistent view of entities not only streamlines compliance processes but also supports advanced analytics and reporting, making these solutions indispensable for organizations operating in highly regulated environments.
The rapid adoption of cloud-based solutions and advancements in artificial intelligence (AI) and machine learning (ML) technologies are also accelerating the growth of the Entity Resolution Software market. Cloud deployment offers scalability, flexibility, and cost-efficiency, enabling organizations of all sizes to implement entity resolution capabilities without significant upfront investments in infrastructure. AI and ML algorithms enhance the accuracy and speed of entity resolution processes by automating complex matching, deduplication, and relationship discovery tasks. These technological advancements are making entity resolution solutions more accessible and effective, thereby expanding their adoption across a broad spectrum of industries, including retail, telecommunications, and e-commerce.
From a regional perspective, North America continues to dominate the Entity Resolution Software market, driven by the presence of major technology providers, high digital maturity, and strong regulatory frameworks. However, Asia Pacific is emerging as the fastest-growing region, fueled by rapid digitalization, increasing investments in data infrastructure, and expanding e-commerce and financial sectors. Europe remains a significant market, supported by robust data protection regulations and growing adoption among enterprises seeking to enhance data quality and compliance. The Middle East & Africa and Latin America are also witnessing increased uptake, particularly among government and financial institutions aiming to improve data governance and combat fraud.
The Entity Resolution Software market is segmented by component into software and se
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Tough Tables (2T) is a dataset designed to evaluate table annotation approaches on the Cell-Entity Annotation (CEA) task.
The dataset is compliant with the data format used in SemTab2019, and it can be used as an additional dataset without any modification. Annotations are based on DBpedia 2016-10.
Note on License: This dataset includes data from the following sources. Refer to each source for license details:
- Wikipedia https://www.wikipedia.org/
- DBpedia http://dbpedia.org/
- SemTab2019 https://doi.org/10.5281/zenodo.3518539
- GeoDatos https://www.geodatos.net
- The Pudding https://pudding.cool/
- Offices.net https://offices.net
- DATA.GOV https://www.data.gov/
THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
According to our latest research, the global Entity Resolution market size in 2024 stands at USD 2.1 billion, demonstrating a robust expansion trajectory. The market is expected to grow at a CAGR of 12.7% from 2025 to 2033, reaching a projected value of USD 6.1 billion by 2033. This impressive growth is primarily driven by the increasing demand for accurate data management, rising concerns over fraud detection, and the proliferation of digital transformation initiatives across various industries. As organizations worldwide strive to harness the power of big data and ensure regulatory compliance, the adoption of entity resolution solutions has become indispensable for maintaining data integrity and operational efficiency.
The primary growth factor propelling the Entity Resolution market is the exponential rise in data volumes generated from diverse sources such as IoT devices, social media, enterprise applications, and transactional systems. With the digitalization of business operations, organizations are faced with the challenge of managing and integrating vast datasets to extract meaningful insights. Entity resolution technology plays a crucial role in this context by accurately identifying, matching, and consolidating data entities across disparate sources, thereby eliminating duplicates and inconsistencies. This capability is vital for businesses seeking to enhance customer experiences, optimize operational processes, and make data-driven decisions. The growing emphasis on data quality and governance further underscores the necessity of robust entity resolution solutions, especially in highly regulated sectors like BFSI and healthcare.
Another significant driver for market growth is the escalating incidence of fraudulent activities and financial crimes, which necessitates advanced fraud detection and risk management capabilities. Entity resolution platforms enable organizations to detect hidden relationships and patterns among entities, facilitating early identification of fraudulent transactions and suspicious behaviors. As financial institutions and e-commerce platforms continue to battle sophisticated fraud schemes, the integration of entity resolution with artificial intelligence and machine learning algorithms has emerged as a game-changer. These technologies enhance the accuracy and speed of entity matching, enabling real-time risk assessment and compliance monitoring. Consequently, the demand for entity resolution solutions is witnessing a marked uptick across sectors where security and trust are paramount.
The rapid adoption of cloud computing and the proliferation of Software-as-a-Service (SaaS) models are also fueling the growth of the Entity Resolution market. Cloud-based entity resolution solutions offer unparalleled scalability, flexibility, and cost-effectiveness, making them attractive to organizations of all sizes. Small and medium enterprises (SMEs), in particular, are leveraging these solutions to overcome resource constraints and compete effectively with larger counterparts. Furthermore, the integration of entity resolution with advanced analytics and business intelligence platforms is enabling organizations to unlock new value from their data assets. This trend is expected to gain further momentum as enterprises prioritize digital transformation and data-driven innovation in the post-pandemic era.
From a regional perspective, North America currently dominates the global entity resolution market, accounting for the largest revenue share in 2024. This leadership position is attributed to the presence of major technology providers, early adoption of advanced analytics, and stringent regulatory frameworks governing data privacy and security. However, the Asia Pacific region is poised to exhibit the highest growth rate over the forecast period, driven by rapid digitalization, increasing investments in IT infrastructure, and the rising adoption of cloud-based solutions across emerging economies. Europe and Latin America are also witnessing steady growth, supported by the expanding footprint of multinational corporations and the growing emphasis on data compliance.
Identity Resolution is a critical component in the realm of data management, especially as organizations seek to unify disparate data sources into a single coheren
According to our latest research, the global Entity Resolution Graph for Investigations market size stood at USD 2.41 billion in 2024, underlining the sector’s robust presence in the global analytics and investigation ecosystem. The market is anticipated to expand at a compound annual growth rate (CAGR) of 18.2% from 2025 to 2033, reaching a forecasted size of USD 12.26 billion by 2033. This remarkable growth trajectory is primarily driven by the rising need for advanced data analytics, the proliferation of digital fraud, and increasing regulatory scrutiny across industries. As organizations face mounting pressure to manage complex data relationships and uncover hidden connections, the Entity Resolution Graph for Investigations market is poised for significant expansion over the coming decade.
One of the principal growth factors for the Entity Resolution Graph for Investigations market is the escalating volume and complexity of data generated by modern enterprises. As businesses digitize their operations, the data landscape has become fragmented, making it difficult to establish clear relationships between entities such as individuals, organizations, and transactions. Entity resolution graph solutions offer a sophisticated approach to integrating disparate datasets, enabling investigators to identify patterns, detect anomalies, and uncover hidden relationships. This capability is increasingly vital for sectors such as BFSI, government, and healthcare, where the accuracy of entity identification directly impacts risk management, compliance, and investigative outcomes. The integration of artificial intelligence and machine learning algorithms into these solutions further enhances their ability to deliver real-time insights, driving adoption across industries.
Another significant driver is the surge in regulatory requirements and compliance mandates globally. Financial institutions, healthcare providers, and government agencies are under unprecedented pressure to comply with anti-money laundering (AML), know your customer (KYC), and data privacy regulations. Entity resolution graph technology enables these organizations to efficiently reconcile and validate data from multiple sources, ensuring compliance while minimizing manual intervention. The technology’s ability to provide a unified view of entities across vast datasets is critical for timely and accurate reporting, audit readiness, and risk mitigation. As regulatory frameworks continue to evolve and become more stringent, demand for robust entity resolution solutions is expected to intensify, further propelling market growth.
The rise of sophisticated fraud schemes and cyber threats is also fueling demand for entity resolution graph solutions. Fraud detection and risk management applications rely heavily on the ability to correlate seemingly unrelated data points to uncover fraudulent activities. Entity resolution graphs empower organizations to visualize and analyze complex networks of relationships, making it easier to detect fraud rings, insider threats, and other malicious activities. The growing adoption of digital channels in banking, retail, and other sectors has expanded the attack surface for fraudsters, necessitating advanced investigative tools. As organizations invest in strengthening their security postures, the adoption of entity resolution graph technology is set to accelerate, underpinning the market’s sustained growth.
From a regional perspective, North America currently dominates the Entity Resolution Graph for Investigations market, driven by the early adoption of advanced analytics, a strong regulatory environment, and significant investments in digital transformation. However, Asia Pacific is emerging as a high-growth region, fueled by rapid digitization, increasing awareness of data-driven investigations, and expanding regulatory frameworks. Europe also represents a substantial share of the market, with stringent data protection laws and a mature financial services sector contributing to steady demand. As organizations across these regions continue to grapple with complex data challenges and evolving threats, the adoption of entity resolution graph solutions is expected to rise, supporting robust market growth globally.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Mock customer data used for testing identity resolution. The dataset has 50 records, 5 of which are duplicate customers. In addition, 2 records represent the same customer, but each of them is missing a feature needed to match them.
Column Descriptions:
1. customer_id: A unique identifier for each customer in the dataset.
2. first_name: The first name of the customer.
3. last_name: The last name (surname) of the customer.
4. email: The email address associated with the customer.
5. phone_number: The contact phone number of the customer.
6. address: The street address where the customer resides or is associated with.
7. city: The city in which the customer is located.
8. state: The state or region associated with the customer's address.
9. country: The country of the customer's address.
10. postal_code: The postal or ZIP code for the customer's address.
11. company_name: The name of the company the customer is associated with, if applicable.
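To show how records with missing features might still be linked, here is a minimal sketch that normalizes the email and phone_number columns into match keys and merges any records sharing a key with a small union-find. The sample rows and the rule "same email or same phone means same customer" are assumptions for illustration only; the dataset does not prescribe a matching method.

```python
import re
from collections import defaultdict


def match_keys(record):
    """Build normalized keys from whichever of email/phone_number is present."""
    keys = []
    email = (record.get("email") or "").strip().lower()
    if email:
        keys.append(("email", email))
    digits = re.sub(r"\D", "", record.get("phone_number") or "")
    if digits:
        keys.append(("phone", digits))
    return keys


def cluster_customers(records):
    """Group customer_ids that share at least one key, directly or transitively."""
    parent = {r["customer_id"]: r["customer_id"] for r in records}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}
    for r in records:
        for key in match_keys(r):
            if key in first_seen:
                parent[find(r["customer_id"])] = find(first_seen[key])
            else:
                first_seen[key] = r["customer_id"]

    clusters = defaultdict(list)
    for r in records:
        clusters[find(r["customer_id"])].append(r["customer_id"])
    return sorted(sorted(ids) for ids in clusters.values())


# Made-up rows: record 2 lacks a phone number and record 3 lacks an email.
customers = [
    {"customer_id": 1, "email": "a.smith@example.com", "phone_number": "(555) 010-2000"},
    {"customer_id": 2, "email": " A.Smith@Example.com", "phone_number": None},
    {"customer_id": 3, "email": None, "phone_number": "555-010-2000"},
    {"customer_id": 4, "email": "j.doe@example.com", "phone_number": "555-010-3000"},
]

print(cluster_customers(customers))
```

Records 2 and 3 share no field with each other, yet both link to record 1, so transitive merging places all three in one cluster. Note that this digit-only phone normalization does not reconcile country-code prefixes.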
This dataset was created by HitendraVaghela
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Record linkage is the task of combining records from multiple files which refer to overlapping sets of entities when there is no unique identifying field. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file. This problem arises in settings such as longitudinal surveys, electronic health records, and online events databases, among others. The challenge in streaming record linkage is to efficiently update parameter estimates as new data arrive. We approach the problem from a Bayesian perspective with estimates calculated from posterior samples of parameters and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. In this article, we generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and propose two methods to perform streaming updates. We examine the effect of prior distribution on the resulting linkage accuracy as well as the computational tradeoffs between the methods when compared to a Gibbs sampler through simulated and real-world survey panel data. We achieve near-equivalent posterior inference at a small fraction of the compute time. Supplementary materials for this article are available online.
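For readers unfamiliar with the Fellegi-Sunter framework that the abstract builds on, the sketch below computes the classic match weight for a record pair: a sum of log-likelihood ratios over field comparisons. This shows only the basic, non-Bayesian scoring idea. It is not the multi-file model or the streaming update methods proposed in the article, and the field names and m/u probabilities are hypothetical values chosen for illustration.

```python
import math

# Hypothetical probabilities per field:
#   m = P(field agrees | records refer to the same entity)
#   u = P(field agrees | records refer to different entities)
M_U = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "zip": (0.85, 0.10),
}


def match_weight(agreements, m_u=M_U):
    """Sum log2 likelihood ratios over fields; higher means more likely a match."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = m_u[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


all_agree = match_weight({"last_name": True, "birth_year": True, "zip": True})
mixed = match_weight({"last_name": True, "birth_year": False, "zip": True})
none_agree = match_weight({"last_name": False, "birth_year": False, "zip": False})
print(all_agree, mixed, none_agree)
```

In the classic setting, pairs above an upper threshold are declared links and pairs below a lower threshold non-links; a Bayesian treatment instead places priors on the m and u parameters and on the linkage structure.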
The Radio Station dataset contains around 10K entities, each represented as a 256-dimensional vector.
Source page: DBLP-Source
In the VLDB 2010 paper [1] we present a first comparative evaluation on the relative match quality and runtime efficiency of entity resolution approaches using challenging real-world match tasks. The evaluation considers existing approaches both with and without using machine learning to find suitable parameterization and combination of similarity functions. In addition to approaches from the research community a state-of-the-art commercial entity resolution implementation is considered. Our results indicate significant quality and efficiency differences between different approaches. We also find that some challenging resolution tasks such as matching product entities from online shops are not sufficiently solved with conventional approaches based on the similarity of attribute values.
Two lists of academic publications: DBLP and Scholar.
1. DBLP1.csv: Contains no redundant entities.
2. Scholar.csv: Contains messy data with redundant entities.
3. DBLP-Scholar_PerfectMapping.csv: The perfect mapping for entities between both tables.
Provide an approach that recovers the perfect mapping between entities in the DBLP1 and Scholar datasets, that is, one that identifies the documents in DBLP that also appear in Scholar, including those duplicated within Scholar.
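Whatever matching approach is used, its output can be scored against the perfect mapping with precision, recall, and F1 over pairs of identifiers. The sketch below uses small in-memory sets of made-up ID pairs. Loading the real pairs from DBLP-Scholar_PerfectMapping.csv is left out because the column names in that file are not stated here.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f1) of predicted ID pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up (dblp_id, scholar_id) pairs. One DBLP entry can map to several
# Scholar entries because Scholar contains redundant records.
gold_pairs = {("d1", "s1"), ("d1", "s2"), ("d2", "s3")}
predicted_pairs = {("d1", "s1"), ("d2", "s3"), ("d3", "s9")}

precision, recall, f1 = evaluate(predicted_pairs, gold_pairs)
print(precision, recall, f1)
```

Because the mapping is one-to-many, missing the second Scholar copy of a DBLP entry lowers recall even when the first copy is matched correctly.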
https://dataintelo.com/privacy-and-policy
According to our latest research, the global market size for Entity Resolution for Law Enforcement reached USD 1.42 billion in 2024. The market is experiencing robust expansion, supported by a CAGR of 14.8% from 2025 to 2033. By the end of 2033, the market is forecasted to achieve a value of USD 4.32 billion. This impressive growth is primarily driven by the increasing need for advanced data analytics and identity management solutions in law enforcement to combat sophisticated criminal activities and enhance operational efficiencies.
The growth of the Entity Resolution for Law Enforcement market is underpinned by the rapid digitalization of law enforcement agencies globally. As agencies transition from traditional paper-based systems to digital platforms, the volume, variety, and velocity of data generated have grown exponentially. This transformation necessitates robust entity resolution solutions capable of accurately identifying, linking, and deduplicating entities across disparate data sources. The proliferation of smart devices, surveillance systems, and interconnected databases has further intensified the demand for advanced software that can process and analyze massive datasets in real time. The market is also benefiting from government initiatives aimed at modernizing public safety infrastructure, which often include investments in advanced data management and analytics platforms.
Another significant driver for the Entity Resolution for Law Enforcement market is the escalating complexity and sophistication of criminal activities. Criminals are increasingly leveraging technology to obscure their identities, create false records, and exploit gaps in law enforcement data systems. This has made traditional investigative methods less effective, pushing agencies to adopt entity resolution solutions that use artificial intelligence, machine learning, and natural language processing to uncover hidden connections and relationships. The integration of these advanced technologies enables law enforcement to detect fraud, analyze intelligence, and solve cases more efficiently. Furthermore, the growing emphasis on data-driven policing and predictive analytics is accelerating the adoption of entity resolution platforms to support proactive crime prevention and resource allocation.
Additionally, the rising concerns around national security, terrorism, and cross-border crimes have compelled federal and intelligence agencies to invest heavily in entity resolution technologies. These solutions are critical for consolidating fragmented data from multiple jurisdictions and sources, enabling agencies to build comprehensive profiles of suspects, organizations, and criminal networks. The ability to accurately resolve entities across complex datasets not only enhances investigative outcomes but also supports intelligence sharing and collaboration between local, national, and international agencies. As data privacy and regulatory compliance become more stringent, entity resolution platforms are evolving to incorporate robust security features and audit trails, further boosting their adoption in the law enforcement sector.
From a regional perspective, North America continues to dominate the Entity Resolution for Law Enforcement market, driven by substantial investments in public safety technologies, a high incidence of cyber and financial crimes, and the presence of leading solution providers. Europe and Asia Pacific are also witnessing significant growth, fueled by increasing government focus on digital transformation and public safety modernization. Emerging economies in Latin America and the Middle East & Africa are gradually adopting entity resolution solutions as part of broader efforts to enhance law enforcement capabilities and address rising crime rates. The regional dynamics are shaped by varying levels of technological maturity, regulatory frameworks, and law enforcement priorities, contributing to a diverse and evolving global market landscape.
The Component segment of the Entity Resolution for Law Enforcement market is bifurcated into Software and Services. Software solutions represent the backbone of entity resolution, providing the algorithms, analytics engines, and user interfaces necessary for data integration, matching, and deduplication. These platforms are designed to handle
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data is a benchmark for coreference resolution system evaluation on knowledge graphs. It contains the information about Cruise entities in GeoLink repository.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Merging datafiles containing information on overlapping sets of entities is a challenging task in the absence of unique identifiers, and is further complicated when some entities are duplicated in the datafiles. Most approaches to this problem have focused on linking two files assumed to be free of duplicates, or on detecting which records in a single file are duplicates. However, it is common in practice to encounter scenarios that fit somewhere in between or beyond these two settings. We propose a Bayesian approach for the general setting of multifile record linkage and duplicate detection. We use a novel partition representation to propose a structured prior for partitions that can incorporate prior information about the data collection processes of the datafiles in a flexible manner, and extend previous models for comparison data to accommodate the multifile setting. We also introduce a family of loss functions to derive Bayes estimates of partitions that allow uncertain portions of the partitions to be left unresolved. The performance of our proposed methodology is explored through extensive simulations. Supplementary materials for this article are available online.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Sets from the ISWC 2024 Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, Round 1, Wikidata Tables. Links to other datasets can be found on the challenge website: https://sem-tab-challenge.github.io/2024/ as well as the proceedings of the challenge published on CEUR.
For details about the challenge, see: http://www.cs.ox.ac.uk/isg/challenges/sem-tab/
For the 2024 edition, see: https://sem-tab-challenge.github.io/2024/
Note on License: This data includes data from the following sources. Refer to each source for license details:
- Wikidata https://www.wikidata.org/
THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de675664
Abstract (en): The Cora data contains bibliographic records of machine learning papers that have been manually clustered into groups that refer to the same publication. Originally, Cora was prepared by Andrew McCallum, and his versions of this data set are available on his Data web page. The data is also hosted here. Note that various versions of the Cora data set have been used by many publications in record linkage and entity resolution over the years.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
ProductER Dataset: Product Entity Resolution
The ProductER (Product Entity Resolution) dataset is a collection of 10,000 tuples manually curated and designed to showcase the practical task of product deduplication. The objective is to determine whether two product names refer to the exact same product. Each question presents a pair of product names, and the answer is categorized as yes, no, or maybe, indicating whether the products are identical or not.
Purpose and… See the full description on the dataset page: https://huggingface.co/datasets/crossingminds/productER.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a collection of the real-world datasets that were used in the publications:
Vasilis Efthymiou, George Papadakis, George Papastefanatos, Kostas Stefanidis, Themis Palpanas: Parallel meta-blocking for scaling entity resolution over big heterogeneous data. Inf. Syst. 65: 137-157 (2017)
Vasilis Efthymiou, George Papadakis, George Papastefanatos, Kostas Stefanidis, Themis Palpanas: Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data. IEEE BigData 2015: 411-420
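For orientation, meta-blocking in general restructures a set of blocks by building a graph over co-occurring entities, weighting each edge, and pruning weak edges before any detailed comparison. The sketch below is a small sequential illustration of that idea using token blocking, a common-blocks edge weight, and pruning below the mean weight. It is an assumption-laden toy, not the parallel algorithms from the publications above, and the entity descriptions are made up.

```python
from collections import defaultdict
from itertools import combinations


def token_blocks(entities):
    """Token blocking: every token defines a block of the entities containing it."""
    blocks = defaultdict(set)
    for entity_id, text in entities.items():
        for token in set(text.lower().split()):
            blocks[token].add(entity_id)
    # Blocks with a single entity produce no comparisons, so drop them.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def weighted_edges(blocks):
    """Weight each co-occurring entity pair by the number of blocks they share."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune_below_average(weights):
    """Keep only the edges whose weight is at least the mean edge weight."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}


# Made-up entity descriptions for illustration.
entities = {
    "e1": "apple iphone 12 black",
    "e2": "iphone 12 apple 64gb",
    "e3": "samsung galaxy s21 black",
    "e4": "galaxy s21 samsung phone",
}

edges = weighted_edges(token_blocks(entities))
kept = prune_below_average(edges)
print(edges)
print(kept)
```

The pair (e1, e3) shares only the token "black", so its edge falls below the mean and is pruned, while the two pairs sharing three tokens survive as candidate comparisons.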