7 datasets found
  1. Data from: Fast Bayesian Record Linkage for Streaming Data Contexts

    • tandf.figshare.com
    zip
    Updated Jan 3, 2024
    Cite
    Ian Taylor; Andee Kaplan; Brenda Betancourt (2024). Fast Bayesian Record Linkage for Streaming Data Contexts [Dataset]. http://doi.org/10.6084/m9.figshare.24565758.v1
    Available download formats: zip
    Dataset updated
    Jan 3, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Ian Taylor; Andee Kaplan; Brenda Betancourt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Record linkage is the task of combining records from multiple files which refer to overlapping sets of entities when there is no unique identifying field. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file. This problem arises in settings such as longitudinal surveys, electronic health records, and online events databases, among others. The challenge in streaming record linkage is to efficiently update parameter estimates as new data arrive. We approach the problem from a Bayesian perspective with estimates calculated from posterior samples of parameters and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. In this article, we generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and propose two methods to perform streaming updates. We examine the effect of prior distribution on the resulting linkage accuracy as well as the computational tradeoffs between the methods when compared to a Gibbs sampler through simulated and real-world survey panel data. We achieve near-equivalent posterior inference at a small fraction of the compute time. Supplementary materials for this article are available online.
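For readers unfamiliar with the Fellegi-Sunter framework that this abstract builds on, the following is a minimal Python sketch of its classical two-file scoring step. It is not the authors' Bayesian or streaming method. The field names, the m- and u-probabilities, and the helper names are illustrative assumptions, and fields are assumed conditionally independent.

```python
import math

# Hypothetical per-field probabilities (assumed values, not taken from the dataset):
# m = P(field agrees | the two records refer to the same entity)
# u = P(field agrees | the two records refer to different entities)
M_PROBS = {"first_name": 0.95, "last_name": 0.97, "birth_year": 0.90}
U_PROBS = {"first_name": 0.05, "last_name": 0.02, "birth_year": 0.10}


def compare(rec_a, rec_b):
    """Binary comparison vector: 1 if a field agrees exactly, else 0."""
    return {field: int(rec_a[field] == rec_b[field]) for field in M_PROBS}


def match_weight(gamma):
    """Sum of per-field log-likelihood ratios for a comparison vector."""
    total = 0.0
    for field, agrees in gamma.items():
        m, u = M_PROBS[field], U_PROBS[field]
        # Agreement contributes log(m/u); disagreement log((1-m)/(1-u)).
        total += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    return total


# Toy records, invented for illustration only.
a = {"first_name": "ana", "last_name": "silva", "birth_year": 1980}
b = {"first_name": "ana", "last_name": "silva", "birth_year": 1981}
c = {"first_name": "bo", "last_name": "chen", "birth_year": 1975}

w_ab = match_weight(compare(a, b))  # agrees on two of three fields
w_ac = match_weight(compare(a, c))  # agrees on no fields
```

A larger weight favours declaring a link. In this toy setup `w_ab` is positive and `w_ac` is negative, so only the pair (a, b) would be a link candidate.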

  2. Data from: Multifile Partitioning for Record Linkage and Duplicate Detection...

    • tandf.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Serge Aleshin-Guendel; Mauricio Sadinle (2023). Multifile Partitioning for Record Linkage and Duplicate Detection [Dataset]. http://doi.org/10.6084/m9.figshare.17137007.v1
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Serge Aleshin-Guendel; Mauricio Sadinle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Merging datafiles containing information on overlapping sets of entities is a challenging task in the absence of unique identifiers, and is further complicated when some entities are duplicated in the datafiles. Most approaches to this problem have focused on linking two files assumed to be free of duplicates, or on detecting which records in a single file are duplicates. However, it is common in practice to encounter scenarios that fit somewhere in between or beyond these two settings. We propose a Bayesian approach for the general setting of multifile record linkage and duplicate detection. We use a novel partition representation to propose a structured prior for partitions that can incorporate prior information about the data collection processes of the datafiles in a flexible manner, and extend previous models for comparison data to accommodate the multifile setting. We also introduce a family of loss functions to derive Bayes estimates of partitions that allow uncertain portions of the partitions to be left unresolved. The performance of our proposed methodology is explored through extensive simulations. Supplementary materials for this article are available online.

  3. CS-NER

    • data.uni-hannover.de
    txt
    Updated Oct 7, 2022
    Cite
    TIB (2022). CS-NER [Dataset]. https://data.uni-hannover.de/dataset/cs-ner-dataset
    Available download formats: txt
    Dataset updated
    Oct 7, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Computer Science Named Entity Recognition in the Open Research Knowledge Graph

    1) About

    This work proposes a standardized CS-NER task by defining a set of seven contribution-centric scholarly entities for CS NER, viz., research problem, solution, resource, language, tool, method, and dataset.

    The main contributions are:

    1) Merges annotations for contribution-centric named entities from related work as the following datasets:

    2) Additionally, supplies a new annotated dataset for the titles in the ACL anthology in the acl repository where titles are annotated with all seven entities.

    2) Dataset Statistics for full dataset

    Titles

    train.data

    | NER | Count |
    | --- | --- |
    | solution | 65,213 |
    | research problem | 43,033 |
    | resource | 19,759 |
    | method | 19,645 |
    | tool | 4,856 |
    | dataset | 4,062 |
    | language | 1,704 |

    dev.data

    | NER | Count |
    | --- | --- |
    | solution | 3,685 |
    | research problem | 2,717 |
    | resource | 1,224 |
    | method | 1,172 |
    | tool | 264 |
    | dataset | 191 |
    | language | 79 |

    test.data

    | NER | Count |
    | --- | --- |
    | solution | 29,287 |
    | research problem | 11,093 |
    | resource | 8,511 |
    | method | 7,009 |
    | tool | 2,272 |
    | dataset | 947 |
    | language | 690 |

    Abstracts

    train-abs.data

    | NER | Count |
    | --- | --- |
    | research problem | 15,498 |
    | method | 12,932 |

    dev-abs.data

    | NER | Count |
    | --- | --- |
    | research problem | 1,450 |
    | method | 839 |

    test-abs.data

    | NER | Count |
    | --- | --- |
    | research problem | 4,123 |
    | method | 3,170 |

    The remaining repositories have specialized README files with the respective dataset statistics.

    3) Citation

    Accepted for publication in ICADL 2022 proceedings.

    Citation information forthcoming

    Preprint

    @article{d2022computer,
     title={Computer Science Named Entity Recognition in the Open Research Knowledge Graph},
     author={D'Souza, Jennifer and Auer, S{\"o}ren},
     journal={arXiv preprint arXiv:2203.14579},
     year={2022}
    }
    

    4) Additional resources

    CS NER Software trained on the dataset in this repository

    Codebase: https://gitlab.com/TIBHannover/orkg/nlp/orkg-nlp-experiments/-/tree/master/orkg_cs_ner

    Service URL - REST API: https://orkg.org/nlp/api/docs#/annotation/annotates_paper_annotation_csner_post

    Service URL - PyPi: https://orkg-nlp-pypi.readthedocs.io/en/latest/services/services.html#cs-ner-computer-science-named-entity-recognition

  4. Multi-INT Knowledge Graphs Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 7, 2025
    Cite
    Growth Market Reports (2025). Multi-INT Knowledge Graphs Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/multi-int-knowledge-graphs-market
    Available download formats: pdf, pptx, csv
    Dataset updated
    Oct 7, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Multi-INT Knowledge Graphs Market Outlook

    According to our latest research, the global Multi-INT Knowledge Graphs market size reached USD 2.3 billion in 2024, reflecting robust adoption across defense, intelligence, and commercial sectors. The market is set to expand at a CAGR of 17.2% from 2025 to 2033, with the total market value projected to reach USD 7.5 billion by 2033. This impressive growth is primarily attributed to the increasing demand for real-time, multi-source intelligence analysis and the integration of advanced AI-driven analytics in security and defense applications.
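The growth figures quoted above can be sanity-checked with the standard compound-growth formula. The Python sketch below assumes that the 2024 value is the base and that growth compounds annually over the nine years to 2033. The report does not state its exact convention, so the period count is an assumption, and the function names are illustrative.

```python
def project(start_value, rate, periods):
    """Value after compounding `rate` annually for `periods` years."""
    return start_value * (1 + rate) ** periods


def implied_cagr(start_value, end_value, periods):
    """Annual growth rate that takes start_value to end_value in `periods` years."""
    return (end_value / start_value) ** (1 / periods) - 1


base_2024 = 2.3        # USD billion, as quoted
target_2033 = 7.5      # USD billion, as quoted
quoted_cagr = 0.172    # 17.2%, as quoted
periods = 2033 - 2024  # assumed: nine annual compounding periods

# End value implied by the quoted rate, and rate implied by the quoted end value.
projected_2033 = project(base_2024, quoted_cagr, periods)
implied_rate = implied_cagr(base_2024, target_2033, periods)
```

Under these assumptions the quoted rate and the quoted end value need not agree exactly. Comparing `projected_2033` with `target_2033` shows how sensitive the projection is to the choice of base year and period count.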

    The primary growth driver for the Multi-INT Knowledge Graphs market is the exponential rise in the volume and complexity of data generated from diverse intelligence sources. As modern defense and intelligence operations require the fusion of multiple intelligence types—including SIGINT, HUMINT, GEOINT, MASINT, and OSINT—organizations are turning to knowledge graph technologies to synthesize, contextualize, and visualize this data effectively. These solutions enable analysts to uncover hidden patterns, enhance situational awareness, and support rapid, data-driven decision-making. The proliferation of sophisticated threats and the need for actionable intelligence underscore the critical role of Multi-INT Knowledge Graphs in national security, law enforcement, and cyber defense operations.

    Another significant factor fueling market growth is the advancement of AI and machine learning algorithms, which are increasingly being integrated into knowledge graph platforms. These technologies accelerate the automation of data ingestion, entity resolution, and relationship mapping, reducing manual effort and minimizing the risk of human error. Furthermore, the adoption of cloud-based deployment models has democratized access to Multi-INT Knowledge Graphs, allowing organizations of all sizes to harness scalable, flexible, and cost-effective intelligence solutions. This trend is particularly evident in the commercial sector, where enterprises leverage knowledge graphs for threat intelligence, fraud detection, and compliance monitoring.

    The evolving regulatory landscape and the growing emphasis on data privacy and security are also shaping the Multi-INT Knowledge Graphs market. Governments and organizations are investing heavily in secure, compliant platforms that ensure the integrity and confidentiality of sensitive intelligence data. This is driving innovation in encryption, access control, and auditability features within knowledge graph solutions. Additionally, the increasing frequency and sophistication of cyberattacks are compelling stakeholders to adopt advanced intelligence fusion platforms capable of providing holistic, real-time threat visibility across multiple domains.

    From a regional perspective, North America continues to dominate the Multi-INT Knowledge Graphs market, accounting for the largest share in 2024 due to significant investments by the U.S. Department of Defense, intelligence agencies, and leading technology vendors. Europe follows closely, driven by cross-border security initiatives and the modernization of defense infrastructure. The Asia Pacific region is experiencing the fastest growth, fueled by rising geopolitical tensions, expanding military budgets, and the rapid adoption of advanced surveillance and intelligence technologies. Meanwhile, the Middle East & Africa and Latin America are witnessing steady uptake, supported by increasing government focus on national security and counterterrorism efforts.

    Component Analysis

    The Multi-INT Knowledge Graphs market is segmented by component into Software, Hardware, and Services, each playing a pivotal role in the overall ecosystem. The Software segment holds the largest share, driven by the demand for advanced analytics, data fusion, and visualization platforms. These software solutions are designed to process vast volumes of structured and unstructured intelligence data, enabling organizations to ex

  5. Alesco Phone ID Database - Phone Data with over 860 Million Phone Number...

    • datarade.ai
    .csv, .xls, .txt
    Updated Jul 5, 2018
    Cite
    Alesco Data (2018). Alesco Phone ID Database - Phone Data with over 860 Million Phone Number with Carrier Name, covers 94% of the US population - available for licensing! [Dataset]. https://datarade.ai/data-products/alesco-phone-id-database-the-industry-s-largest-and-most-ac-alesco-data
    Available download formats: .csv, .xls, .txt
    Dataset updated
    Jul 5, 2018
    Dataset authored and provided by
    Alesco Data
    Area covered
    United States
    Description

    The Alesco Phone ID Database data ties together a consumer's true identity, and with linkage to the Alesco Power Identity Graph, we are perfectly positioned to help customers solve today's most challenging marketing, analytics, and identity resolution problems.

    Our proprietary Phone ID database combines public and private sources and validates phone numbers against current and historical data 24 hours a day, 365 days a year.

    With over 650 million unique phone numbers, device and service information, our one-of-a-kind solutions are now available for your marketing and identity resolution challenges in both B2C and B2B applications!

    • Alesco Phone ID provides more than 860 million phone numbers monthly linked to a consumer or business name and includes landline, mobile phone number, VoIP, private and business phone numbers — all permissibly obtained and privacy-compliant and linked to other Alesco data sets

    • How we do it: Alesco Phone ID is multi-sourced with daily information and delivered monthly or quarterly to clients. Our proprietary machine learning and advanced analytics processes ensure quality levels far above industry standards. Alesco processes over 100 million phone signals per day, compiling, normalizing, and standardizing phone information from 37 input sources.

    • Accuracy: Each of Alesco’s phone data sources is vetted to ensure it is authoritative, giving you confidence in the accuracy of the information. Every record is validated, verified and processed to ensure the widest, most reliable coverage combined with stunning precision.

    • Ease of use: Alesco’s Phone ID Database is available as an on-premise phone database license, giving you full control to host and access this powerful resource on-site. Ongoing updates, provided on a monthly basis, ensure your data is up to date.

  6. Global Patient Identity Resolution Software Market Growth Opportunities...

    • statsndata.org
    excel, pdf
    Updated Oct 2025
    Cite
    Stats N Data (2025). Global Patient Identity Resolution Software Market Growth Opportunities 2025-2032 [Dataset]. https://www.statsndata.org/report/patient-identity-resolution-software-market-117919
    Available download formats: pdf, excel
    Dataset updated
    Oct 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Patient Identity Resolution Software market is an essential sector within the healthcare industry, addressing the critical issue of accurately matching patients with their medical records amidst a myriad of data sources. As healthcare systems globally become increasingly complex due to the rise of electronic hea

  7. 641_Is product switching or

    • repository.econdata.tech
    Updated Oct 22, 2025
    Cite
    (2025). 641_Is product switching or [Dataset]. https://repository.econdata.tech/dataset/wb-fb-fcp-disr-as-ic-pc
    Dataset updated
    Oct 22, 2025
    Description

    641_Is product switching or closure among the top-three issues complained about to the alternative dispute resolution (ADR) entity?_#VHNA_08

