Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Record linkage is the task of combining records from multiple files which refer to overlapping sets of entities when there is no unique identifying field. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file. This problem arises in settings such as longitudinal surveys, electronic health records, and online events databases, among others. The challenge in streaming record linkage is to efficiently update parameter estimates as new data arrive. We approach the problem from a Bayesian perspective with estimates calculated from posterior samples of parameters and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. In this article, we generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and propose two methods to perform streaming updates. We examine the effect of prior distribution on the resulting linkage accuracy as well as the computational tradeoffs between the methods when compared to a Gibbs sampler through simulated and real-world survey panel data. We achieve near-equivalent posterior inference at a small fraction of the compute time. Supplementary materials for this article are available online.
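For background, the two-file Fellegi-Sunter framework that the article generalizes scores each candidate record pair by a likelihood ratio over its field-comparison vector. A minimal statement under the usual conditional-independence assumption (notation ours, not the article's):

```latex
% Match weight for a record pair (a, b) with binary comparison
% vector \gamma = (\gamma_1, \dots, \gamma_K) over K fields.
% M and U are the true-match and non-match classes;
% m_k = P(\gamma_k = 1 \mid M) and u_k = P(\gamma_k = 1 \mid U).
w(a, b) = \log \frac{P(\gamma \mid M)}{P(\gamma \mid U)}
        = \sum_{k=1}^{K} \left[ \gamma_k \log \frac{m_k}{u_k}
          + (1 - \gamma_k) \log \frac{1 - m_k}{1 - u_k} \right]
```

In the Bayesian version, the m_k and u_k receive prior distributions and are sampled alongside the linkage structure rather than estimated once up front.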
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Merging datafiles containing information on overlapping sets of entities is a challenging task in the absence of unique identifiers, and is further complicated when some entities are duplicated in the datafiles. Most approaches to this problem have focused on linking two files assumed to be free of duplicates, or on detecting which records in a single file are duplicates. However, it is common in practice to encounter scenarios that fit somewhere in between or beyond these two settings. We propose a Bayesian approach for the general setting of multifile record linkage and duplicate detection. We use a novel partition representation to propose a structured prior for partitions that can incorporate prior information about the data collection processes of the datafiles in a flexible manner, and extend previous models for comparison data to accommodate the multifile setting. We also introduce a family of loss functions to derive Bayes estimates of partitions that allow uncertain portions of the partitions to be left unresolved. The performance of our proposed methodology is explored through extensive simulations. Supplementary materials for this article are available online.
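To make the partition representation concrete, here is a minimal illustrative sketch in Python (ours, not the authors' implementation): each record across all files receives a cluster label, and records that share a label are taken to refer to the same entity, which covers cross-file links and within-file duplicates in a single structure.

```python
from collections import defaultdict

# Hypothetical records from three datafiles, keyed as (file, record_id).
records = [("A", 1), ("A", 2), ("B", 1), ("B", 2), ("C", 1)]

# A partition encoded as cluster labels: records that share a label are
# coreferent; duplicates within one file simply share a label too.
labels = {("A", 1): 0, ("A", 2): 1, ("B", 1): 0, ("B", 2): 2, ("C", 1): 0}

# Recover the partition as entity clusters (sets of records).
clusters = defaultdict(set)
for rec, z in labels.items():
    clusters[z].add(rec)

# Cluster 0 links ("A", 1), ("B", 1), and ("C", 1) across all three files.
print(dict(clusters))
```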
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This work proposes a standardized CS-NER task by defining a set of seven contribution-centric scholarly entity types, viz., research problem, solution, resource, language, tool, method, and dataset.
The main contributions are:
1) It merges annotations for contribution-centric named entities from related work into the following datasets:
- The dataset proposed in Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers (Gupta & Manning, IJCNLP 2011) is the source for ftd, annotated for both titles and abstracts, with select entities mapped to our standardized types: focus -> solution; domain -> research problem; and technique -> method.
- The dataset proposed in Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction (Luan et al., EMNLP 2018) is the source for scierc, annotated for abstracts, with the mapping task -> research problem.
- The dataset proposed in SemEval-2021 Task 11: NLPContributionGraph - Structuring Scholarly NLP Contributions for a Research Knowledge Graph (D’Souza et al., SemEval 2021) is the source for ncg, annotated for both titles and abstracts for research problem.
- https://paperswithcode.com/ is the source for pwc, annotated for both titles and abstracts for task -> research problem and for method entities.
2) Additionally, it supplies a new annotated dataset of titles from the ACL Anthology in the acl repository, where titles are annotated with all seven entity types.
train.data
| Entity type | Count |
| --- | --- |
| solution | 65,213 |
| research problem | 43,033 |
| resource | 19,759 |
| method | 19,645 |
| tool | 4,856 |
| dataset | 4,062 |
| language | 1,704 |
dev.data
| Entity type | Count |
| --- | --- |
| solution | 3,685 |
| research problem | 2,717 |
| resource | 1,224 |
| method | 1,172 |
| tool | 264 |
| dataset | 191 |
| language | 79 |
test.data
| Entity type | Count |
| --- | --- |
| solution | 29,287 |
| research problem | 11,093 |
| resource | 8,511 |
| method | 7,009 |
| tool | 2,272 |
| dataset | 947 |
| language | 690 |
train-abs.data
| Entity type | Count |
| --- | --- |
| research problem | 15,498 |
| method | 12,932 |
dev-abs.data
| Entity type | Count |
| --- | --- |
| research problem | 1,450 |
| method | 839 |
test-abs.data
| Entity type | Count |
| --- | --- |
| research problem | 4,123 |
| method | 3,170 |
The remaining repositories have dedicated README files with the respective dataset statistics.
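For orientation, below is a minimal sketch of loading one of the .data files. It assumes a CoNLL-style layout with one token and tag per line and blank lines between sentences; the layout is an assumption on our part, so check the repository README for the actual format.

```python
def read_conll(path):
    """Parse an assumed CoNLL-style file into (tokens, tags) pairs.

    Assumes one whitespace-separated `token tag` pair per line and a
    blank line between sentences; adjust to the repository's layout.
    """
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line closes the current sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.rsplit(maxsplit=1)
            tokens.append(token)
            tags.append(tag)
    if tokens:  # handle a file that does not end with a blank line
        sentences.append((tokens, tags))
    return sentences

# Example: sents = read_conll("train.data")
```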
Accepted for publication in the ICADL 2022 proceedings.
Citation information forthcoming.
Preprint
@article{d2022computer,
title={Computer Science Named Entity Recognition in the Open Research Knowledge Graph},
author={D'Souza, Jennifer and Auer, S{\"o}ren},
journal={arXiv preprint arXiv:2203.14579},
year={2022}
}
Codebase: https://gitlab.com/TIBHannover/orkg/nlp/orkg-nlp-experiments/-/tree/master/orkg_cs_ner
Service URL - REST API: https://orkg.org/nlp/api/docs#/annotation/annotates_paper_annotation_csner_post
Service URL - PyPi: https://orkg-nlp-pypi.readthedocs.io/en/latest/services/services.html#cs-ner-computer-science-named-entity-recognition
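As a usage illustration, here is a hedged sketch of calling the CS-NER annotation service with Python's requests library. The endpoint path and payload fields below are assumptions inferred from the docs URL above, not a confirmed schema; consult the linked REST API documentation before use.

```python
import requests

# Hypothetical payload; the field names are assumptions, see the API docs.
payload = {
    "title": "Computer Science Named Entity Recognition in the Open Research Knowledge Graph",
}

# The exact path is an assumption derived from the Swagger docs anchor.
resp = requests.post(
    "https://orkg.org/nlp/api/annotation/csner",
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # expected: entities grouped by the seven standardized types
```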
According to our latest research, the global Multi-INT Knowledge Graphs market size reached USD 2.3 billion in 2024, reflecting robust adoption across defense, intelligence, and commercial sectors. The market is set to expand at a CAGR of 17.2% from 2025 to 2033, with the total market value projected to reach USD 7.5 billion by 2033. This impressive growth is primarily attributed to the increasing demand for real-time, multi-source intelligence analysis and the integration of advanced AI-driven analytics in security and defense applications.
The primary growth driver for the Multi-INT Knowledge Graphs market is the exponential rise in the volume and complexity of data generated from diverse intelligence sources. As modern defense and intelligence operations require the fusion of multiple intelligence types—including SIGINT, HUMINT, GEOINT, MASINT, and OSINT—organizations are turning to knowledge graph technologies to synthesize, contextualize, and visualize this data effectively. These solutions enable analysts to uncover hidden patterns, enhance situational awareness, and support rapid, data-driven decision-making. The proliferation of sophisticated threats and the need for actionable intelligence underscore the critical role of Multi-INT Knowledge Graphs in national security, law enforcement, and cyber defense operations.
Another significant factor fueling market growth is the advancement of AI and machine learning algorithms, which are increasingly being integrated into knowledge graph platforms. These technologies accelerate the automation of data ingestion, entity resolution, and relationship mapping, reducing manual effort and minimizing the risk of human error. Furthermore, the adoption of cloud-based deployment models has democratized access to Multi-INT Knowledge Graphs, allowing organizations of all sizes to harness scalable, flexible, and cost-effective intelligence solutions. This trend is particularly evident in the commercial sector, where enterprises leverage knowledge graphs for threat intelligence, fraud detection, and compliance monitoring.
The evolving regulatory landscape and the growing emphasis on data privacy and security are also shaping the Multi-INT Knowledge Graphs market. Governments and organizations are investing heavily in secure, compliant platforms that ensure the integrity and confidentiality of sensitive intelligence data. This is driving innovation in encryption, access control, and auditability features within knowledge graph solutions. Additionally, the increasing frequency and sophistication of cyberattacks are compelling stakeholders to adopt advanced intelligence fusion platforms capable of providing holistic, real-time threat visibility across multiple domains.
From a regional perspective, North America continues to dominate the Multi-INT Knowledge Graphs market, accounting for the largest share in 2024 due to significant investments by the U.S. Department of Defense, intelligence agencies, and leading technology vendors. Europe follows closely, driven by cross-border security initiatives and the modernization of defense infrastructure. The Asia Pacific region is experiencing the fastest growth, fueled by rising geopolitical tensions, expanding military budgets, and the rapid adoption of advanced surveillance and intelligence technologies. Meanwhile, the Middle East & Africa and Latin America are witnessing steady uptake, supported by increasing government focus on national security and counterterrorism efforts.
The Multi-INT Knowledge Graphs market is segmented by component into Software, Hardware, and Services, each playing a pivotal role in the overall ecosystem. The Software segment holds the largest share, driven by the demand for advanced analytics, data fusion, and visualization platforms. These software solutions are designed to process vast volumes of structured and unstructured intelligence data, enabling organizations to ex
The Alesco Phone ID Database ties together a consumer's true identity, and with linkage to the Alesco Power Identity Graph, we are perfectly positioned to help customers solve today's most challenging marketing, analytics, and identity resolution problems.
Our proprietary Phone ID database combines public and private sources and validates phone numbers against current and historical data 24 hours a day, 365 days a year.
With over 650 million unique phone numbers plus device and service information, our one-of-a-kind solutions are now available for your marketing and identity resolution challenges in both B2C and B2B applications!
• Alesco Phone ID provides more than 860 million phone numbers monthly, linked to a consumer or business name, including landline, mobile, VoIP, private, and business numbers, all permissibly obtained, privacy-compliant, and linked to other Alesco data sets.
• How we do it: Alesco Phone ID is multi-sourced with daily information and delivered monthly or quarterly to clients. Our proprietary machine learning and advanced analytics processes ensure quality levels far above industry standards. Alesco processes over 100 million phone signals per day, compiling, normalizing, and standardizing phone information from 37 input sources.
• Accuracy: Each of Alesco’s phone data sources is vetted to ensure it is authoritative, giving you confidence in the accuracy of the information. Every record is validated, verified, and processed to ensure the widest, most reliable coverage combined with stunning precision.
• Ease of use: Alesco’s Phone ID Database is available as an on-premise phone database license, giving you full control to host and access this powerful resource on-site. Ongoing updates are provided on a monthly basis to ensure your data stays up to date.
https://www.statsndata.org/how-to-order
The Patient Identity Resolution Software market is an essential sector within the healthcare industry, addressing the critical issue of accurately matching patients with their medical records amidst a myriad of data sources. As healthcare systems globally become increasingly complex due to the rise of electronic health records.
641_Is product switching or closure among the top-three issues complained about to the alternative dispute resolution (ADR) entity?_#VHNA_08