Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The construction of a robust healthcare information system is fundamental to enhancing countries’ capabilities in the surveillance and control of hepatitis B virus (HBV). Making use of China’s rapidly expanding primary healthcare system, this innovative approach using big data and machine learning (ML) could help towards the World Health Organization’s (WHO) HBV infection elimination goals of reaching 90% diagnosis and treatment rates by 2030. We aimed to develop and validate HBV detection models using routine clinical data to improve the detection of HBV and support the development of effective interventions to mitigate the impact of this disease in China. Relevant data records extracted from the Hospital Information System of the Family Medicine Clinic of the University of Hong Kong-Shenzhen Hospital were structured using state-of-the-art natural language processing (NLP) techniques. Several ML methods were used to develop HBV risk assessment models. Model performance was then interpreted using Shapley additive explanation (SHAP) values and validated using cohort data randomly divided at a ratio of 2:1 within a five-fold cross-validation framework. The patterns of physical complaints of patients with and without HBV infection were identified by processing 158,988 clinic attendance records. After removing cases without any clinical parameters from the derivation sample (n = 105,992), 27,392 cases were analysed using six modelling methods. A simplified model for HBV using patients’ physical complaints and parameters was developed with good discrimination (AUC = 0.78) and calibration (goodness-of-fit test p-value > 0.05). Suspected case detection models for HBV, showing potential for clinical deployment, have been developed to improve HBV surveillance in primary care settings in China. This study has developed a suspected case detection model for HBV, which can facilitate early identification and treatment of HBV in the primary care setting in China, contributing towards the achievement of the WHO’s elimination goals for HBV infection. We used state-of-the-art natural language processing techniques to structure the data records, supporting the development of a robust healthcare information system that enhances the surveillance and control of HBV in China.
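To make the validation scheme in this abstract concrete, the sketch below reproduces its general shape on synthetic stand-in data: a 2:1 derivation/validation split, five-fold cross-validated AUC on the derivation sample, and a held-out AUC check. The gradient-boosting learner, the simulated feature matrix, and all names are assumptions for illustration, not the study's actual pipeline; SHAP-based interpretation could be layered on top with the shap package.

```python
# Minimal sketch of the validation scheme described above (2:1 split,
# five-fold cross-validation, AUC).  The stand-in features below replace
# the study's structured complaint and clinical-parameter matrix.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in data (hypothetical; not the study's records).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0)

# 2:1 derivation/validation split, as in the abstract.
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=1/3, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0)

# Five-fold cross-validated AUC on the derivation sample.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(model, X_dev, y_dev, cv=cv, scoring="roc_auc")
print("CV AUC: %.3f +/- %.3f" % (cv_auc.mean(), cv_auc.std()))

# Final fit and held-out discrimination, analogous to the reported AUC = 0.78.
model.fit(X_dev, y_dev)
print("Validation AUC: %.3f" % roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```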
This short, 8-minute technical training video was created by The Children's Bureau Data Analytics and Reporting Team and gives a brief demonstration showing agencies how to validate their XML files against the XSD. For more information, or to access the XSD, please see AFCARS Technical Bulletin 21.
Audio Descriptive Version
Metadata-only record linking to the original dataset.
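For readers who prefer a scripted check to the video, the following is a minimal sketch of XML-against-XSD validation using Python's lxml library; the file names are placeholders, and the real XSD is the one distributed with AFCARS Technical Bulletin 21.

```python
# Minimal sketch of validating an XML file against an XSD with lxml.
# Both file paths below are hypothetical placeholders.
from lxml import etree

schema = etree.XMLSchema(etree.parse("afcars.xsd"))   # hypothetical schema path
doc = etree.parse("submission.xml")                   # hypothetical XML path

if schema.validate(doc):
    print("XML is valid against the XSD")
else:
    for error in schema.error_log:
        print(f"line {error.line}: {error.message}")
```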
The goal of SHRP 2 Project L33, Validation of Urban Freeway Models, was to assess and enhance the predictive travel time reliability models developed in SHRP 2 Project L03, Analytic Procedures for Determining the Impacts of Reliability Mitigation Strategies. SHRP 2 Project L03, which concluded in 2010, developed two categories of reliability models to be used for the estimation or prediction of travel time reliability within planning, programming, and systems management contexts: data-rich and data-poor models. The objectives of Project L33 were to:
• validate the most important models – the “Data Poor” and “Data Rich” models – with new datasets;
• assess the validation outcomes to recommend potential enhancements;
• explore enhancements and develop a final set of predictive equations;
• validate the enhanced models; and
• develop a clear set of application guidelines for practitioner use of the project outputs.
The datasets in these 5 zip files are in support of SHRP 2 Report S2-L33-RW-1, Validation of Urban Freeway Models, https://rosap.ntl.bts.gov/view/dot/3604. The 5 zip files contain a total of 60 comma-separated value (.csv) files and total 3.8 GB in size. The files have been uploaded as-is; no further documentation was supplied. They can be unzipped using any zip compression/decompression software and read in any simple text editor. [software requirements] Note: data files are larger than 1 GB each. Direct data download links:
L03-01: https://doi.org/10.21949/1500858
L03-02: https://doi.org/10.21949/1500868
L03-03: https://doi.org/10.21949/1500869
L03-04: https://doi.org/10.21949/1500870
L03-05: https://doi.org/10.21949/1500871
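As a convenience, here is a minimal sketch of listing and previewing the CSV files inside one of the downloaded archives using only the Python standard library; the local archive name is a placeholder for whichever of the five zip files has been downloaded.

```python
# Minimal sketch: list the CSV files in one L33 archive and preview the first one.
import csv
import io
import zipfile

with zipfile.ZipFile("L03-01.zip") as zf:   # hypothetical local file name for one download
    csv_names = [n for n in zf.namelist() if n.lower().endswith(".csv")]
    print(f"{len(csv_names)} CSV files in this archive")

    # Peek at the header and the first data row of the first CSV.
    with zf.open(csv_names[0]) as raw:
        reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
        print(next(reader))   # column names
        print(next(reader))   # first record
```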
1Spatial has put together a rule package (“Location Data to Emergency Service Zone Validation”) to validate location data (Road Centerline and Address Point layers) against a respective Emergency Service Zone (ESZ) layer. This rule catalog provides documentation for the current state of the rule package.
https://www.nist.gov/open/license
The NIST Extensible Resource Data Model (NERDm) is a set of schemas for encoding in JSON format metadata that describe digital resources. The variety of digital resources it can describe includes not only digital data sets and collections, but also software, digital services, web sites and portals, and digital twins. It was created to serve as the internal metadata format used by the NIST Public Data Repository and Science Portal to drive rich presentations on the web and to enable discovery; however, it was also designed to enable programmatic access to resources and their metadata by external users. Interoperability was also a key design aim: the schemas are defined using the JSON Schema standard, metadata are encoded as JSON-LD, and their semantics are tied to community ontologies, with an emphasis on DCAT and the US federal Project Open Data (POD) models. Finally, extensibility is also central to its design: the schemas are composed of a central core schema and various extension schemas. New extensions to support richer metadata concepts can be added over time without breaking existing applications. Validation is central to NERDm's extensibility model. Consuming applications should be able to choose which metadata extensions they care to support and ignore terms and extensions they don't support. Furthermore, they should not fail when a NERDm document leverages extensions they don't recognize, even when on-the-fly validation is required. To support this flexibility, the NERDm framework allows documents to declare what extensions are being used and where. We have developed an optional extension to the standard JSON Schema validation (see ejsonschema below) to support flexible validation: while a standard JSON Schema validator can validate a NERDm document against the NERDm core schema, our extension will validate a NERDm document against any recognized extensions and ignore those that are not recognized. The NERDm data model is based around the concept of a resource, semantically equivalent to a schema.org Resource, and as in schema.org, there can be different types of resources, such as data sets and software. A NERDm document indicates what types the resource qualifies as via the JSON-LD "@type" property. All NERDm Resources are described by metadata terms from the core NERDm schema; however, different resource types can be described by additional metadata properties (often drawing on particular NERDm extension schemas). A Resource contains Components of various types (including DCAT-defined Distributions) that are considered part of the Resource; specifically, these can include downloadable data files, hierarchical data collections, links to web sites (like software repositories), software tools, or other NERDm Resources. Through the NERDm extension system, domain-specific metadata can be included at either the resource or component level. The direct semantic and syntactic connections to the DCAT, POD, and schema.org schemas are intended to ensure unambiguous conversion of NERDm documents into those schemas. As of this writing, the core NERDm schema and its framework stand at version 0.7 and are compatible with the "draft-04" version of JSON Schema. Version 1.0 is projected to be released in 2025. In that release, the NERDm schemas will be updated to the "draft2020" version of JSON Schema. Other improvements will include stronger support for RDF and the Linked Data Platform through its support of JSON-LD.
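The flexible-validation idea described above can be illustrated, in simplified form, with the standard jsonschema package (this is not NIST's ejsonschema extension, and the schema and record below are illustrative rather than actual NERDm): a core schema that does not forbid additional properties will accept documents carrying unrecognized extension terms.

```python
# Illustrative sketch only: validate a record against a "core" schema while
# tolerating unknown extension properties.  Schema and record are made up.
import jsonschema

core_schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "required": ["@type", "title"],
    "properties": {
        "@type": {"type": "array", "items": {"type": "string"}},
        "title": {"type": "string"},
    },
    # additionalProperties is not set to False, so properties coming from
    # unrecognized extension schemas are simply ignored rather than rejected.
}

record = {
    "@type": ["nrdp:DataPublication", "dcat:Dataset"],
    "title": "Example resource",
    "ext:domainSpecificField": {"value": 42},   # unknown extension term, ignored
}

jsonschema.Draft4Validator(core_schema).validate(record)
print("record passes core-schema validation")
```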
Several genes involved in different processes were measured by qPCR in order to determine expression levels and to validate the data gathered by DNA microarrays.
https://spdx.org/licenses/CC0-1.0.html
Biology has become a highly mathematical discipline in which probabilistic models play a central role. As a result, research in the biological sciences is now dependent on computational tools capable of carrying out complex analyses. These tools must be validated before they can be used, but what is understood as validation varies widely among methodological contributions. This may be a consequence of the still embryonic stage of the literature on statistical software validation for computational biology. Our manuscript aims to advance this literature. Here, we describe, illustrate, and introduce new good practices for assessing the correctness of a model implementation, with an emphasis on Bayesian methods. We also introduce a suite of functionalities for automating validation protocols. It is our hope that the guidelines presented here help sharpen the focus of discussions on (as well as elevate) expected standards of statistical software for biology.
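One widely used correctness check in the spirit of this abstract (offered as a generic illustration, not necessarily the protocol the manuscript proposes) is to run a sampler on a model whose posterior is known in closed form and confirm that the two agree; the sketch below does this for a normal mean with known variance and a conjugate normal prior.

```python
# Correctness check for a sampler: compare its output against a closed-form
# conjugate posterior.  Everything here is a generic illustration.
import numpy as np

rng = np.random.default_rng(1)
sigma, mu0, tau0 = 1.0, 0.0, 2.0          # likelihood sd, prior mean, prior sd
data = rng.normal(0.7, sigma, size=50)

# Closed-form conjugate posterior for the mean.
prec = 1 / tau0**2 + len(data) / sigma**2
post_mean = (mu0 / tau0**2 + data.sum() / sigma**2) / prec
post_sd = prec ** -0.5

def log_post(mu):
    # log prior + log likelihood, up to an additive constant
    return -0.5 * ((mu - mu0) / tau0) ** 2 - 0.5 * np.sum((data - mu) ** 2) / sigma**2

# Random-walk Metropolis sampler.
draws, mu = [], 0.0
for _ in range(20000):
    prop = mu + rng.normal(0, 0.3)
    if np.log(rng.uniform()) < log_post(prop) - log_post(mu):
        mu = prop
    draws.append(mu)
draws = np.array(draws[5000:])            # discard burn-in

print(f"analytic posterior: mean={post_mean:.3f}, sd={post_sd:.3f}")
print(f"sampler estimate:   mean={draws.mean():.3f}, sd={draws.std():.3f}")
```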
According to our latest research, the global synthetic data validation for ADAS market size reached USD 820 million in 2024, reflecting a robust and expanding sector within the automotive industry. The market is projected to grow at a CAGR of 23.7% from 2025 to 2033, culminating in a forecasted market size of approximately USD 6.5 billion by 2033. This remarkable growth is primarily fueled by the increasing adoption of advanced driver-assistance systems (ADAS) in both passenger and commercial vehicles, the rising complexity of autonomous driving functions, and the need for scalable, safe, and cost-effective validation processes.
A significant growth factor for the synthetic data validation for ADAS market is the accelerating integration of ADAS technologies across automotive OEMs and Tier 1 suppliers. As regulatory bodies worldwide tighten safety standards and mandate the inclusion of features such as automatic emergency braking, lane-keeping assistance, and adaptive cruise control, manufacturers are compelled to validate these systems rigorously. Traditional data collection for ADAS validation is not only time-consuming and resource-intensive but also limited in its ability to reproduce rare or hazardous scenarios. Synthetic data validation addresses these challenges by enabling the creation of diverse, customizable datasets that accurately simulate real-world driving conditions, substantially reducing development timelines and costs while ensuring compliance with safety regulations.
Another critical driver is the rapid advancement of artificial intelligence (AI) and machine learning (ML) technologies, which underpin both synthetic data generation and validation processes. As ADAS algorithms become increasingly sophisticated, the demand for high-quality, annotated, and scalable datasets grows in tandem. Synthetic data validation empowers developers to generate massive volumes of data that cover edge cases and rare events, which are otherwise difficult or dangerous to capture in real-world testing. This capability not only expedites the training and validation of perception models but also enhances their robustness, reliability, and generalizability, paving the way for higher levels of vehicle autonomy and improved road safety.
The proliferation of connected and autonomous vehicles is further amplifying the need for synthetic data validation within the ADAS market. As vehicles become more reliant on sensor fusion, object detection, and path planning algorithms, the complexity of validation scenarios increases exponentially. Synthetic data validation enables the simulation of intricate driving environments, sensor malfunctions, and unpredictable human behaviors, ensuring that ADAS-equipped vehicles can safely navigate diverse and dynamic real-world conditions. The scalability and flexibility offered by synthetic data solutions are particularly attractive to automotive OEMs, Tier 1 suppliers, and research institutes striving to maintain a competitive edge in the fast-evolving mobility landscape.
Regionally, North America and Europe are leading adopters of synthetic data validation for ADAS, driven by stringent safety regulations, a strong presence of automotive technology pioneers, and significant investments in autonomous vehicle research. However, Asia Pacific is emerging as a high-growth market, fueled by the rapid expansion of the automotive sector, increasing consumer demand for advanced safety features, and government initiatives supporting smart mobility. Latin America and the Middle East & Africa are also witnessing gradual adoption, primarily through collaborations with global OEMs and technology providers. The global landscape is characterized by a dynamic interplay of regulatory frameworks, technological advancements, and evolving consumer expectations, shaping the future trajectory of the synthetic data validation for ADAS market.
The synthetic data validation for ADAS market is segmented by component ...
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
U.S. Geological Survey (USGS) scientists conducted field data collection efforts during the time periods of April 25 - 26, 2017, October 24 - 28, 2017, and July 25 - 26, 2018, using a combination of surveying technologies to map and validate topography, structures, and other features at five sites in central South Dakota. The five sites included the Chamberlain Explorers Athletic Complex and the Chamberlain High School in Chamberlain, SD, Hanson Lake State Public Shooting Area near Corsica, SD, the State Capital Grounds in Pierre, SD, and Platte Creek State Recreation Area near Platte, SD. The work was initiated as an effort to evaluate airborne Geiger-Mode and Single Photon light detection and ranging (lidar) data that were collected over parts of central South Dakota. Both Single Photon and Geiger-Mode lidar offer the promise of being able to map areas at high altitudes, thus requiring less time than traditional airborne lidar collections, while acquiring higher point densiti ...
https://dataintelo.com/privacy-and-policy
According to our latest research, the global PLACI Data Quality Validation for Airfreight market size reached USD 1.18 billion in 2024, with a robust CAGR of 14.6% projected through the forecast period. By 2033, the market is expected to attain a value of USD 3.58 billion, driven by the increasing adoption of digital transformation initiatives and regulatory compliance requirements across the airfreight sector. The growth in this market is primarily fueled by the rising need for accurate, real-time data validation to ensure security, compliance, and operational efficiency in air cargo processes.
The surge in e-commerce and global trade has significantly contributed to the expansion of the PLACI Data Quality Validation for Airfreight market. As airfreight volumes continue to soar, the demand for rapid, secure, and compliant cargo movement has never been higher. This has necessitated the implementation of advanced data quality validation solutions to manage the vast amounts of information generated during air cargo operations. Regulatory mandates such as the Pre-Loading Advance Cargo Information (PLACI) requirements in various regions have further compelled airlines, freight forwarders, and customs authorities to adopt robust data validation systems. These solutions not only help in mitigating risks associated with incorrect or incomplete data but also streamline cargo screening and documentation processes, leading to improved efficiency and reduced operational bottlenecks.
Technological advancements have played a pivotal role in shaping the PLACI Data Quality Validation for Airfreight market. The integration of artificial intelligence, machine learning, and big data analytics has enabled stakeholders to automate and enhance data validation processes. These technologies facilitate real-time risk assessment, anomaly detection, and compliance checks, ensuring that only accurate and verified data is transmitted across the airfreight ecosystem. The shift towards cloud-based deployment models has further accelerated the adoption of these solutions, offering scalability, flexibility, and cost-effectiveness to both large enterprises and small and medium-sized businesses. As the market matures, we expect to see increased collaboration between technology providers and airfreight stakeholders to develop customized solutions tailored to specific operational and regulatory needs.
The evolving regulatory landscape is another key growth driver for the PLACI Data Quality Validation for Airfreight market. Governments and international organizations are continuously updating air cargo security protocols to address emerging threats and enhance global supply chain security. Compliance with these regulations requires airfreight operators to validate data accuracy at multiple touchpoints, from cargo screening to documentation validation. Failure to comply can result in severe penalties, shipment delays, and reputational damage. Consequently, there is a growing emphasis on implementing end-to-end data validation frameworks that not only meet regulatory requirements but also provide actionable insights for risk management and operational optimization. This trend is expected to persist throughout the forecast period, further propelling market growth.
From a regional perspective, North America currently dominates the PLACI Data Quality Validation for Airfreight market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The presence of major air cargo hubs, stringent regulatory frameworks, and high technology adoption rates in these regions have contributed to their market leadership. Asia Pacific is expected to witness the fastest growth during the forecast period, driven by the rapid expansion of cross-border e-commerce, increasing air cargo volumes, and ongoing investments in digital infrastructure. Meanwhile, Latin America and the Middle East & Africa are gradually emerging as key markets, supported by improving logistics networks and growing awareness of data quality validation benefits.
The PLACI Data Quality Validation for Airfreight market is segmented by solution type into software and services, each playing a critical role in ensuring data integrity and compliance across the airfreight value chain. Software solutions encompass a wide range of applications, including automated data validation tools, risk assessment engines
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Loan Boarding Data Validation market size reached USD 1.42 billion in 2024, demonstrating robust momentum driven by increasing digitalization in the financial sector and stringent regulatory requirements. The market is projected to grow at a CAGR of 11.8% from 2025 to 2033, reaching an estimated USD 4.06 billion by 2033. This dynamic growth is underpinned by the escalating need for accurate data validation, risk mitigation, and compliance management across lending institutions worldwide.
A key growth factor propelling the Loan Boarding Data Validation market is the intensifying demand for automated solutions that ensure data accuracy throughout the loan lifecycle. With the proliferation of digital lending platforms, financial institutions are under increasing pressure to verify and validate vast volumes of loan data in real time. The integration of advanced analytics, machine learning, and artificial intelligence into validation processes has significantly enhanced the speed, accuracy, and efficiency of loan boarding. This technological evolution is not only reducing manual errors but also minimizing operational costs, thereby driving the adoption of sophisticated data validation tools across banks, mortgage lenders, and credit unions.
Another pivotal driver is the ever-tightening regulatory landscape governing the global financial services industry. Regulatory bodies such as the Basel Committee, the European Banking Authority, and the US Federal Reserve have imposed rigorous guidelines around data integrity, anti-money laundering (AML), and Know Your Customer (KYC) protocols. As a result, organizations are compelled to invest in comprehensive data validation solutions to ensure compliance, avoid penalties, and maintain customer trust. The increasing complexity and frequency of regulatory audits have made the deployment of robust validation frameworks not just a best practice, but a necessity for sustainable operations in the lending sector.
The surge in digital transformation initiatives across both developed and emerging economies is further accelerating market growth. Financial institutions are leveraging cloud-based solutions and digital onboarding platforms to enhance customer experience and streamline back-office operations. This shift is fostering the adoption of Loan Boarding Data Validation platforms that offer scalable, secure, and real-time validation capabilities. Moreover, the growing trend of mergers and acquisitions in the banking sector is necessitating seamless data migration and integration, which in turn fuels the demand for advanced validation technologies. The convergence of these factors is expected to sustain the market's upward trajectory throughout the forecast period.
Regionally, North America continues to dominate the Loan Boarding Data Validation market, accounting for the largest share in 2024, followed closely by Europe and the Asia Pacific. The presence of leading financial institutions, early adoption of digital technologies, and a robust regulatory environment have cemented North America's leadership position. Meanwhile, Asia Pacific is witnessing the fastest growth, driven by rapid digitalization, expanding financial inclusion, and government-led digital lending initiatives. Latin America and the Middle East & Africa are also emerging as promising markets, as local banks and lenders increasingly recognize the value of automated data validation in enhancing operational efficiency and regulatory compliance.
The Loan Boarding Data Validation market by component is primarily segmented into Software and Services. The software segment is witnessing substantial growth due to the rising adoption of automated validation tools that streamline the loan boarding process. These software solutions are equipped with features such as real-time data verification, audit trails, and customizable rule engines, which significantly reduce manual intervention and associated errors. Financial institutions are increasingly investing in advanced software platforms to ensure data accuracy, enhance compliance, and improve customer experience. The integration of artificial intelligence and machine learning algorithms within these software solutions is further elevating their efficiency and scalability, making them indispensable for modern lending operations.
One table with data used to validate aerial fish surveys in Prince William Sound, Alaska. Data includes: date, location, latitude, longitude, aerial ID, validation ID, total length and validation method. Various catch methods were used to obtain fish samples for aerial validations, including: cast net, GoPro, hydroacoustics, jig, dip net, gillnet, purse seine, photo and visual identification.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a synthetic smart card data set that can be used to test pattern detection methods for the extraction of temporal and spatial data. The data set is tab-separated and based on a stylized travel pattern description for the city of Utrecht in the Netherlands; it was developed and used in Chapter 6 of the PhD thesis of Paul Bouman.
This dataset contains the following files:
journeys.tsv : the actual data set of synthetic smart card data
utrecht.xml : the activity pattern definition that was used to randomly generate the synthetic smart card data
validate.ref : a file derived from the activity pattern definition that can be used for validation purposes. It specifies which activity types occur at each location in the smart card data set.
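A minimal loading sketch for these files is given below; since the column layout of journeys.tsv is not documented here, the script only inspects what is present rather than assuming field names.

```python
# Minimal sketch of loading the dataset files.  No column names are assumed;
# the script simply reports what the files contain.
import pandas as pd
import xml.etree.ElementTree as ET

journeys = pd.read_csv("journeys.tsv", sep="\t")        # tab-separated journeys
print(journeys.shape)
print(journeys.columns.tolist())                        # inspect the schema
print(journeys.head())

pattern = ET.parse("utrecht.xml").getroot()             # activity pattern definition
print(pattern.tag, len(list(pattern)))                  # root element and child count
```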
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Experimental data obtained from the inspection of steel disks used to validate the UT-SAFT resolution formulas in the referencing paper 'UT-SAFT resolution'. The purpose is to show that the indication size of small test reflectors matches well with the resolution formulas derived in this paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT Purpose: to validate the content and usability of the “Network NASF” application, intended for the teams of the Extended Family Health and Primary Care Center (NASF-AB). Methods: eighteen specialists, researchers, and professionals from different fields of study participated to validate the content and usability of the application, carried out in four stages: adjustment of the instrument; administration of the Suitability Assessment of Materials (SAM); validation of the content by calculating the content validity index (CVI); and usability evaluation through the System Usability Scale (SUS), in this order. Results: the participants classified the material as valid regarding both its content and usability. The index achieved in the SAM was 83.5%, as four out of the six topics in the instrument had values over 0.78; these four were therefore considered excellent, while the other two were considered good. The recommendations given by the specialized judges were accepted and the usability index (5.5%) was considered relevant. Conclusion: the application developed for NASF-AB professionals was considered valid regarding its content and usability.
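For readers unfamiliar with the content validity index used in this study, the sketch below shows the standard item-level CVI calculation (the proportion of the eighteen judges rating an item as relevant, e.g. 3 or 4 on a 4-point scale); the item names and ratings are invented for illustration.

```python
# Minimal sketch of an item-level content validity index (I-CVI) calculation.
# Ratings below are made up; 18 judges mirror the panel size in the abstract.
ratings = {                      # item -> list of judges' relevance ratings (1-4)
    "content_topic_1": [4, 4, 3, 4, 3, 4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 3, 4],
    "content_topic_2": [3, 4, 2, 4, 3, 4, 3, 2, 3, 4, 4, 3, 2, 4, 3, 4, 3, 3],
}

for item, scores in ratings.items():
    i_cvi = sum(s >= 3 for s in scores) / len(scores)
    label = "excellent" if i_cvi > 0.78 else "acceptable/review"
    print(f"{item}: I-CVI = {i_cvi:.2f} ({label})")
```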
Data producers or those who maintain parcel data can use this tool to validate their data against the state Geospatial Advisory Committee (GAC) Parcel Data Standard. The validations within the tool were originally created as part of a MetroGIS Regional Parcel Dataset workflow.
Counties using this tool can obtain a schema geodatabase from the Parcel Data Standard page hosted by MnGeo (link below). All counties, cities, or those maintaining authoritative data on a local jurisdiction's behalf are encouraged to use and modify the tool as needed to support local workflows.
Parcel Data Standard Page
http://www.mngeo.state.mn.us/committee/standards/parcel_attrib/parcel_attrib.html
Specific validation information and tool requirements can be found in the following documents included within this resource.
Readme_HowTo.pdf
Readme_Validations.pdf
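As a rough illustration of the kind of check such a tool performs, the sketch below verifies that a parcel layer carries a set of required attribute fields; the field names, layer name, and geodatabase path are hypothetical placeholders, and the authoritative schema is the MnGeo geodatabase referenced above.

```python
# Hedged sketch of a parcel attribute-schema check.  The required-field list
# and data source below are placeholders, not the actual GAC standard.
import geopandas as gpd

REQUIRED_FIELDS = ["COUNTY_PIN", "OWNER_NAME", "ACRES_POLY"]   # hypothetical subset

parcels = gpd.read_file("parcels.gdb", layer="Parcels")        # hypothetical source
missing = [f for f in REQUIRED_FIELDS if f not in parcels.columns]

if missing:
    print("Missing required fields:", ", ".join(missing))
else:
    print("All required fields present; %d parcels checked" % len(parcels))
```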
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes the date and time, latitude (“lat”), longitude (“lon”), sun angle (“sun_angle”, in degrees [°]), rainbow presence (TRUE = rainbow, FALSE = no rainbow), cloud cover (“cloud_cover”, proportion), and liquid precipitation (“liquid_precip”, kg m⁻² s⁻¹) for each record used to train and/or validate the models.
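A minimal sketch of how these columns might be used to fit and validate a simple presence/absence model follows; the file name, the rainbow column name, and the choice of logistic regression are assumptions, since the record does not specify the modelling method.

```python
# Illustrative sketch: fit and validate a presence/absence model on the
# listed columns.  File and "rainbow" column names are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("rainbow_records.csv")                 # hypothetical file name
X = df[["sun_angle", "cloud_cover", "liquid_precip"]]
y = df["rainbow"].astype(str).str.upper().eq("TRUE")    # column name assumed

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```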
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets were used to validate and test the data pipeline deployment following the RADON approach. The dataset has a single CSV file that contains around 32,000 Twitter tweets. 100 CSV files have been created from this single CSV file, each containing 320 tweets. Those 100 CSV files are used to validate and test (performance/load testing) the data pipeline components.
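A minimal sketch of how the single tweets CSV can be split into 100 files of 320 rows each, mirroring the layout described above, is given below; the input and output file names are placeholders.

```python
# Minimal sketch: split one tweets CSV into 100 chunks of 320 rows each.
# File names are hypothetical placeholders.
import pandas as pd

tweets = pd.read_csv("tweets.csv")                 # ~32,000 tweets
chunk_size = 320

for i in range(100):
    chunk = tweets.iloc[i * chunk_size:(i + 1) * chunk_size]
    chunk.to_csv(f"tweets_part_{i:03d}.csv", index=False)
```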
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Word document and Excel data were used to validate two Flow 2 sensors prior to experiments for the paper:"New methodology to evaluate and optimize indoor ventilation based on rapid response sensors"by María del Mar Durán del Amor, Antonia Baeza Caracena, Francisco Esquembre, and Mercedes Llorens Pascual del Riquelme(Under consideration)
Measurements made in the Florida Keys as part of efforts to validate the VIIRS instrument.