31 datasets found
  1. GNHK-Synthetic-OCR-Dataset

    • huggingface.co
    Cite
    Shreyansh Sharma, GNHK-Synthetic-OCR-Dataset [Dataset]. https://huggingface.co/datasets/shreyansh1347/GNHK-Synthetic-OCR-Dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Shreyansh Sharma
    Description

    GNHK Synthetic OCR Dataset

      Overview
    

    Welcome to the GNHK Synthetic OCR Dataset repository. Here I have generated synthetic data from the GNHK Dataset using open-source LLMs such as Mixtral. The dataset contains queries on the images and their answers.

      What's Inside?
    

    Dataset Folder: The Dataset folder contains the images; alongside each image there is a JSON file carrying that image's OCR information.

    Parquet File: For easy handling and analysis… See the full description on the dataset page: https://huggingface.co/datasets/shreyansh1347/GNHK-Synthetic-OCR-Dataset.
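The per-image JSON layout described above (one OCR file next to each image) can be consumed with a short loader. This is a hedged sketch only: the `.jpg` extension, the same-stem naming convention, and the JSON contents are assumptions for illustration, not documented specifics of this dataset.

```python
import json
from pathlib import Path

def pair_images_with_ocr(dataset_dir):
    """Pair each image with its same-stem OCR JSON file.

    The .jpg extension and the per-image JSON schema are assumptions
    made for illustration; see the dataset page for the real layout.
    """
    pairs = []
    for img_path in sorted(Path(dataset_dir).glob("*.jpg")):
        json_path = img_path.with_suffix(".json")
        if json_path.exists():  # skip images without an OCR file
            ocr_info = json.loads(json_path.read_text(encoding="utf-8"))
            pairs.append((img_path.name, ocr_info))
    return pairs
```

For bulk analysis, the Parquet file mentioned below is the more convenient entry point; the loader above is only useful when working with the raw folder.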

  2. UTRSet-Synth Dataset

    • paperswithcode.com
    Updated Jun 26, 2023
    + more versions
    Cite
    Abdur Rahman; Arjun Ghosh; Chetan Arora (2023). UTRSet-Synth Dataset [Dataset]. https://paperswithcode.com/dataset/utrset-synth
    Explore at:
    Dataset updated
    Jun 26, 2023
    Authors
    Abdur Rahman; Arjun Ghosh; Chetan Arora
    Description

    The UTRSet-Synth dataset is introduced as a complementary training resource to the UTRSet-Real Dataset, specifically designed to enhance the effectiveness of Urdu OCR models. It is a high-quality synthetic dataset comprising 20,000 lines that closely resemble real-world representations of Urdu text.

    To generate the dataset, a custom-designed synthetic data generation module was employed, offering precise control over variations in crucial factors such as font, text size, colour, resolution, orientation, noise, style, and background. Moreover, the UTRSet-Synth dataset tackles the limitations observed in existing datasets. It addresses the challenge of standardizing fonts by incorporating over 130 diverse Urdu fonts, thoroughly refined to ensure consistent rendering. It overcomes the scarcity of Arabic words, numerals, and Urdu digits by incorporating a significant number of samples representing these elements. Additionally, the dataset is enriched by randomly selecting words from a vocabulary of 100,000 words during text generation. As a result, UTRSet-Synth contains a total of 28,187 unique words, with an average word length of 7 characters.

    The availability of UTRSet-Synth, a synthetic dataset that closely emulates real-world variations, addresses the scarcity of comprehensive real-world printed Urdu OCR datasets. By providing researchers with a valuable resource for developing and benchmarking Urdu OCR models, it promotes standardized evaluation and reproducibility and fosters advancements in the field of Urdu OCR. For more information about the UTRSet-Real and UTRSet-Synth datasets, please refer to the paper "UTRNet: High-Resolution Urdu Text Recognition In Printed Documents".

  3. Wikipedia corpus for synthetic data made for Handwritten Text Recognition...

    • zenodo.org
    txt, zip
    Updated Jul 3, 2025
    Cite
    Thomas CONSTUM (2025). Wikipedia corpus for synthetic data made for Handwritten Text Recognition and Named Entity Recognition [Dataset]. http://doi.org/10.1007/s10032-024-00511-9
    Explore at:
    Available download formats: zip, txt
    Dataset updated
    Jul 3, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Thomas CONSTUM
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This repository contains the corpus necessary for the synthetic data generation of DANIEL, which is available on GitHub and described in the paper "DANIEL: a fast document attention network for information extraction and labelling of handwritten documents", authored by Thomas Constum, Pierrick Tranouez, and Thierry Paquet (LITIS, University of Rouen Normandie).

    The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR) and is also accessible on arXiv.

    The contents of the archive should be placed in the Datasets/raw directory of the DANIEL codebase.

    Contents of the archive:

    • wiki_en: An English text corpus stored in the Hugging Face datasets library format. Each entry contains the full text of a Wikipedia article.

    • wiki_en_ner: An English text corpus enriched with named entity annotations following the OntoNotes v5 ontology. Named entities are encoded using special symbols. The corpus is stored in the Hugging Face datasets format, and each entry corresponds to a Wikipedia article with annotated entities.

    • wiki_fr: A French text corpus for synthetic data generation, also stored in the Hugging Face datasets format. Each entry contains the full text of a French Wikipedia article.

    • wiki_de.txt: A German text corpus in plain text format, with one sentence per line. The content originates from the Wortschatz Leipzig repository and has been normalized to match the vocabulary used in DANIEL.

    Data format for corpora in Hugging Face datasets structure:

    Each record in the datasets follows the dictionary structure below:

    {
    "id": "

  4. Synthetic Data Generation Engine Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jun 29, 2025
    Cite
    Growth Market Reports (2025). Synthetic Data Generation Engine Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/synthetic-data-generation-engine-market
    Explore at:
    Available download formats: pptx, pdf, csv
    Dataset updated
    Jun 29, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Synthetic Data Generation Engine Market Outlook



    According to our latest research, the global Synthetic Data Generation Engine market size reached USD 1.42 billion in 2024, reflecting a rapidly expanding sector driven by the escalating demand for advanced data solutions. The market is expected to achieve a robust CAGR of 37.8% from 2025 to 2033, propelling it to an estimated value of USD 21.8 billion by 2033. This exceptional growth is primarily fueled by the increasing need for high-quality, privacy-compliant datasets to train artificial intelligence and machine learning models in sectors such as healthcare, BFSI, and IT & telecommunications. As per our latest research, the proliferation of data-centric applications and stringent data privacy regulations are acting as significant catalysts for the adoption of synthetic data generation engines globally.



    One of the key growth factors for the synthetic data generation engine market is the mounting emphasis on data privacy and compliance with regulations such as GDPR and CCPA. Organizations are under immense pressure to protect sensitive customer information while still deriving actionable insights from data. Synthetic data generation engines offer a compelling solution by creating artificial datasets that mimic real-world data without exposing personally identifiable information. This not only ensures compliance but also enables organizations to accelerate their AI and analytics initiatives without the constraints of data access or privacy risks. The rising awareness among enterprises about the benefits of synthetic data in mitigating data breaches and regulatory penalties is further propelling market expansion.



    Another significant driver is the exponential growth in artificial intelligence and machine learning adoption across industries. Training robust and unbiased models requires vast and diverse datasets, which are often difficult to obtain due to privacy concerns, labeling costs, or data scarcity. Synthetic data generation engines address this challenge by providing scalable and customizable datasets for various applications, including machine learning model training, data augmentation, and fraud detection. The ability to generate balanced and representative data has become a critical enabler for organizations seeking to improve model accuracy, reduce bias, and accelerate time-to-market for AI solutions. This trend is particularly pronounced in sectors such as healthcare, automotive, and finance, where data diversity and privacy are paramount.



    Furthermore, the increasing complexity of data types and the need for multi-modal data synthesis are shaping the evolution of the synthetic data generation engine market. With the proliferation of unstructured data in the form of images, videos, audio, and text, organizations are seeking advanced engines capable of generating synthetic data across multiple modalities. This capability enhances the versatility of synthetic data solutions, enabling their application in emerging use cases such as autonomous vehicle simulation, natural language processing, and biometric authentication. The integration of generative AI techniques, such as GANs and diffusion models, is further enhancing the realism and utility of synthetic datasets, expanding the addressable market for synthetic data generation engines.



    From a regional perspective, North America continues to dominate the synthetic data generation engine market, accounting for the largest revenue share in 2024. The region's leadership is attributed to the strong presence of technology giants, early adoption of AI and machine learning, and stringent regulatory frameworks. Europe follows closely, driven by robust data privacy regulations and increasing investments in digital transformation. Meanwhile, the Asia Pacific region is emerging as the fastest-growing market, supported by expanding IT infrastructure, government-led AI initiatives, and a burgeoning startup ecosystem. Latin America and the Middle East & Africa are also witnessing gradual adoption, fueled by the growing recognition of synthetic data's potential to overcome data access and privacy challenges.






  5. Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training

    • datarade.ai
    .json, .csv
    Cite
    Xverum, Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training [Dataset]. https://datarade.ai/data-products/xverum-company-data-b2b-data-belgium-netherlands-denm-xverum
    Explore at:
    Available download formats: .json, .csv
    Dataset provided by
    Xverum LLC
    Authors
    Xverum
    Area covered
    Dominican Republic, India, United Kingdom, Sint Maarten (Dutch part), Cook Islands, Western Sahara, Norway, Oman, Jordan, Barbados
    Description

    Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.

    What Makes Our Data Unique?

    Scale and Coverage:
    - A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies.
    - Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.

    Rich Attributes for Training Models:
    - Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights.
    - Tailored for training models in NLP, recommendation systems, and predictive algorithms.

    Compliance and Quality:
    - Fully GDPR and CCPA compliant, providing secure and ethically sourced data.
    - Extensive data cleaning and validation processes ensure reliability and accuracy.

    Annotation-Ready:
    - Pre-structured and formatted datasets that are easily ingestible into AI workflows.
    - Ideal for supervised learning with tagging options such as entities, sentiment, or categories.

    How Is the Data Sourced?
    - Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques.
    - Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets.

    This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.

    Primary Use Cases and Verticals

    Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.

    Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.

    B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.

    HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.

    How This Product Fits Into Xverum’s Broader Data Offering Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.

    Why Choose Xverum?
    - Experience and Expertise: A trusted name in structured web data with a proven track record.
    - Flexibility: Datasets can be tailored for any AI/ML application.
    - Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data.
    - Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.

    Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.

    Contact us for sample datasets or to discuss your specific needs.

  6. Synthetic Data Generation Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jun 28, 2025
    Cite
    Growth Market Reports (2025). Synthetic Data Generation Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/synthetic-data-generation-market
    Explore at:
    Available download formats: csv, pdf, pptx
    Dataset updated
    Jun 28, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Synthetic Data Generation Market Outlook




    According to our latest research, the global synthetic data generation market size reached USD 1.6 billion in 2024, demonstrating robust expansion driven by increasing demand for high-quality, privacy-preserving datasets. The market is projected to grow at a CAGR of 38.2% over the forecast period, reaching USD 19.2 billion by 2033. This remarkable growth trajectory is fueled by the growing adoption of artificial intelligence (AI) and machine learning (ML) technologies across industries, coupled with stringent data privacy regulations that necessitate innovative data solutions. As per our latest research, organizations worldwide are increasingly leveraging synthetic data to address data scarcity, enhance AI model training, and ensure compliance with evolving privacy standards.




    One of the primary growth factors for the synthetic data generation market is the rising emphasis on data privacy and regulatory compliance. With the implementation of stringent data protection laws such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States, enterprises are under immense pressure to safeguard sensitive information. Synthetic data offers a compelling solution by enabling organizations to generate artificial datasets that mirror the statistical properties of real data without exposing personally identifiable information. This not only facilitates regulatory compliance but also empowers organizations to innovate without the risk of data breaches or privacy violations. As businesses increasingly recognize the value of privacy-preserving data, the demand for advanced synthetic data generation solutions is set to surge.




    Another significant driver is the exponential growth in AI and ML adoption across various sectors, including healthcare, finance, automotive, and retail. High-quality, diverse, and unbiased data is the cornerstone of effective AI model development. However, acquiring such data is often challenging due to privacy concerns, limited availability, or high acquisition costs. Synthetic data generation bridges this gap by providing scalable, customizable datasets tailored to specific use cases, thereby accelerating AI training and reducing dependency on real-world data. Organizations are leveraging synthetic data to enhance algorithm performance, mitigate data bias, and simulate rare events, which are otherwise difficult to capture in real datasets. This capability is particularly valuable in sectors like autonomous vehicles, where training models on rare but critical scenarios is essential for safety and reliability.




    Furthermore, the growing complexity of data types—ranging from tabular and image data to text, audio, and video—has amplified the need for versatile synthetic data generation tools. Enterprises are increasingly seeking solutions that can generate multi-modal synthetic datasets to support diverse applications such as fraud detection, product testing, and quality assurance. The flexibility offered by synthetic data generation platforms enables organizations to simulate a wide array of scenarios, test software systems, and validate AI models in controlled environments. This not only enhances operational efficiency but also drives innovation by enabling rapid prototyping and experimentation. As the digital ecosystem continues to evolve, the ability to generate synthetic data across various formats will be a critical differentiator for businesses striving to maintain a competitive edge.




    Regionally, North America leads the synthetic data generation market, accounting for the largest revenue share in 2024, followed closely by Europe and Asia Pacific. The dominance of North America can be attributed to the strong presence of technology giants, advanced research institutions, and a favorable regulatory environment that encourages AI innovation. Europe is witnessing rapid growth due to proactive data privacy regulations and increasing investments in digital transformation initiatives. Meanwhile, Asia Pacific is emerging as a high-growth region, driven by the proliferation of digital technologies and rising adoption of AI-powered solutions across industries. Latin America and the Middle East & Africa are also expected to experience steady growth, supported by government-led digitalization programs and expanding IT infrastructure.




  7. MJSynth_text_recognition

    • huggingface.co
    Updated Apr 21, 2025
    Cite
    priyank (2025). MJSynth_text_recognition [Dataset]. https://huggingface.co/datasets/priyank-m/MJSynth_text_recognition
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 21, 2025
    Authors
    priyank
    Description

    Dataset Card for "MJSynth_text_recognition"

    This is the MJSynth dataset for text recognition on document images, synthetically generated and covering 90K English words. It includes training, validation, and test splits. Source of the dataset: https://www.robots.ox.ac.uk/~vgg/data/text/. Use the dataset streaming functionality to try out the dataset quickly without downloading it entirely (refer: https://huggingface.co/docs/datasets/stream). Citation details provided on the source… See the full description on the dataset page: https://huggingface.co/datasets/priyank-m/MJSynth_text_recognition.
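As the description recommends, streaming lets you peek at a large dataset without a full download. Below is a minimal sketch of the "take the first n records" pattern; the `load_dataset` call is commented out because it needs network access and the Hugging Face `datasets` library, and both the `split` name and the stand-in record layout are assumptions for illustration.

```python
from itertools import islice

def take(stream, n):
    """Materialize only the first n records of a streamed (lazy) dataset."""
    return list(islice(stream, n))

# With the Hugging Face `datasets` library installed, a real stream would be
# created like this (commented out: requires network; the split name is an
# assumption based on the splits mentioned in the description):
#   from datasets import load_dataset
#   ds = load_dataset("priyank-m/MJSynth_text_recognition",
#                     split="train", streaming=True)
#   print(take(ds, 3))

# Stand-in stream with a hypothetical {"image", "text"} record layout:
fake_stream = ({"image": f"img_{i}.png", "text": f"word{i}"}
               for i in range(1_000_000))
first_three = take(fake_stream, 3)
```

Because the stream is lazy, only the three requested records are ever produced, which is what makes this pattern cheap even for multi-gigabyte datasets.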

  8. Synthetic Data Generation Market - Global Outlook 2020-2032

    • htfmarketinsights.com
    pdf & excel
    Updated Jun 18, 2025
    Cite
    HTF Market Intelligence (2025). Synthetic Data Generation Market - Global Outlook 2020-2032 [Dataset]. https://www.htfmarketinsights.com/report/4360591-synthetic-data-generation-market
    Explore at:
    Available download formats: pdf & excel
    Dataset updated
    Jun 18, 2025
    Dataset authored and provided by
    HTF Market Intelligence
    License

    https://www.htfmarketinsights.com/privacy-policy

    Time period covered
    2019 - 2031
    Area covered
    Global
    Description

    Global Synthetic Data Generation is segmented by Application (AI training, Software testing, Fraud detection, Privacy preservation, Autonomous driving), Type (Tabular, Image, Video, Text, Time-series), and Geography (North America, LATAM, West Europe, Central & Eastern Europe, Northern Europe, Southern Europe, East Asia, Southeast Asia, South Asia, Central Asia, Oceania, MEA).

  9. Invoice Management Dataset

    • universe.roboflow.com
    zip
    Updated Dec 28, 2024
    Cite
    CVIP Workspace (2024). Invoice Management Dataset [Dataset]. https://universe.roboflow.com/cvip-workspace/invoice-management
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 28, 2024
    Dataset authored and provided by
    CVIP Workspace
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Text Bounding Boxes
    Description

    Intelligent Invoice Management System

    Project Description:
    The Intelligent Invoice Management System is an advanced AI-powered platform designed to revolutionize traditional invoice processing. By automating the extraction, validation, and management of invoice data, this system addresses the inefficiencies, inaccuracies, and high costs associated with manual methods. It enables businesses to streamline operations, reduce human error, and expedite payment cycles.

    Problem Statement:
    Manual invoice processing involves labor-intensive tasks such as data entry, verification, and reconciliation. These processes are time-consuming, prone to errors, and can result in financial losses and delays. The diversity of invoice formats from various vendors adds complexity, making automation a critical need for efficiency and scalability.

    Proposed Solution:
    The Intelligent Invoice Management System automates the end-to-end process of invoice handling using AI and machine learning techniques. Core functionalities include:
    1. Invoice Generation: Automatically generate PDF invoices in at least four formats, populated with synthetic data.
    2. Data Development: Leverage a dataset containing fields such as receipt numbers, company details, sales tax information, and itemized tables to create realistic invoice samples.
    3. AI-Powered Labeling: Use Tesseract OCR to extract labeled data from invoice images, and train YOLO for label recognition, ensuring precise identification of fields.
    4. Database Integration: Store extracted information in a structured database for seamless retrieval and analysis.
    5. Web-Based Information System: Provide a user-friendly platform to upload invoices and retrieve key metrics, such as:
    - Total sales within a specified duration.
    - Total sales tax paid during a given timeframe.
    - Detailed invoice information in tabular form for specific date ranges.

    Key Features and Deliverables:
    1. Invoice Generation:
      • Generate 20,000 invoices using an automated script.
      • Include dummy logos, company details, and itemized tables for four items per invoice.
    2. Label Definition and Format:
      • Define structured labels (TBLR, CLASS Name, Recognized Text).
      • Provide labels in both XML and JSON formats for seamless integration.
    3. OCR and AI Training:
      • Automate labeling using Tesseract OCR for high-accuracy text recognition.
      • Train and test YOLO to detect and classify invoice fields (TBLR and CLASS).
    4. Database Management:
      • Store OCR-extracted labels and field data in a database.
      • Enable efficient search and aggregation of invoice data.
    5. Web-Based Interface:
      • Build a responsive system for users to upload invoices and retrieve data based on company name or NTN.
      • Display metrics and reports for total sales, tax paid, and invoice details over custom date ranges.

    Expected Outcomes:
    - Reduction in manual effort and operational costs.
    - Improved accuracy in invoice processing and financial reporting.
    - Enhanced scalability and adaptability for diverse invoice formats.
    - Faster turnaround time for invoice-related tasks.

    By automating critical aspects of invoice management, this system delivers a robust and intelligent solution to meet the evolving needs of businesses.
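The invoice-generation deliverable above (20,000 invoices, four line items each, populated with synthetic data) can be sketched as a record generator. This is a hedged illustration only: the field names, company names, and 17% tax rate are invented placeholders, and a real pipeline would render these records into PDF templates before the OCR and labeling stages.

```python
import random
from datetime import date, timedelta

TAX_RATE = 0.17  # illustrative sales-tax rate; an assumption, not from the dataset

def make_invoice(receipt_no, rng):
    """Generate one synthetic invoice record with four line items,
    mirroring the fields described above (receipt number, company
    details, sales tax, itemized table). All values are dummies."""
    items = [
        {
            "description": f"Item-{rng.randint(100, 999)}",
            "qty": rng.randint(1, 10),
            "unit_price": round(rng.uniform(5.0, 500.0), 2),
        }
        for _ in range(4)  # four items per invoice, per the deliverables
    ]
    subtotal = round(sum(it["qty"] * it["unit_price"] for it in items), 2)
    tax = round(subtotal * TAX_RATE, 2)
    return {
        "receipt_no": f"INV-{receipt_no:06d}",
        "company": rng.choice(["Acme Traders", "Globex Pvt Ltd", "Initech Supplies"]),
        "date": str(date(2024, 1, 1) + timedelta(days=rng.randint(0, 364))),
        "items": items,
        "subtotal": subtotal,
        "sales_tax": tax,
        "total": round(subtotal + tax, 2),
    }

def make_batch(n, seed=0):
    """Generate n invoices reproducibly (seeded RNG)."""
    rng = random.Random(seed)
    return [make_invoice(i + 1, rng) for i in range(n)]
```

Rendering each record into one of several PDF templates (the deliverables call for at least four formats) would be the next step, using a PDF library such as ReportLab; that part is omitted here.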

  10. Synthetic-Text

    • huggingface.co
    Cite
    CLEAR Global, Synthetic-Text [Dataset]. https://huggingface.co/datasets/CLEAR-Global/Synthetic-Text
    Explore at:
    Dataset authored and provided by
    CLEAR Global
    Description

    Synthetic Text Dataset for 10 African Languages

    This dataset contains synthetic text generated using large language models for ten African languages. It is intended to support research and evaluation in automatic speech recognition (ASR), natural language processing (NLP), and related fields for low-resource languages.

      Data Generation and Licensing
    

    I acknowledge that this dataset contains synthetic data generated through the process described in this paper. It is not… See the full description on the dataset page: https://huggingface.co/datasets/CLEAR-Global/Synthetic-Text.

  11. myanmar-ocr-dataset

    • huggingface.co
    Updated Jun 19, 2025
    Cite
    Chuu Htet Naing (2025). myanmar-ocr-dataset [Dataset]. https://huggingface.co/datasets/chuuhtetnaing/myanmar-ocr-dataset
    Explore at:
    Dataset updated
    Jun 19, 2025
    Authors
    Chuu Htet Naing
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Myanmar OCR Dataset

    A synthetic dataset for training and fine-tuning Optical Character Recognition (OCR) models specifically for the Myanmar language.

      Dataset Description
    

    This dataset contains synthetically generated OCR images created specifically for Myanmar text recognition tasks. The images were generated using myanmar-ocr-data-generator, a fork of TextRecognitionDataGenerator with fixes for proper Myanmar character splitting.

      Direct Download
    

    Available… See the full description on the dataset page: https://huggingface.co/datasets/chuuhtetnaing/myanmar-ocr-dataset.

  12. Synthetic Data Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jun 28, 2025
    Cite
    Growth Market Reports (2025). Synthetic Data Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/synthetic-data-market
    Explore at:
    Available download formats: csv, pdf, pptx
    Dataset updated
    Jun 28, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Synthetic Data Market Outlook



    According to our latest research, the synthetic data market size reached USD 1.52 billion in 2024, reflecting robust growth driven by increasing demand for privacy-preserving data and the acceleration of AI and machine learning initiatives across industries. The market is projected to expand at a compelling CAGR of 34.7% from 2025 to 2033, with the forecasted market size expected to reach USD 21.4 billion by 2033. Key growth factors include the rising necessity for high-quality, diverse, and privacy-compliant datasets, the proliferation of AI-driven applications, and stringent data protection regulations worldwide.




    The primary growth driver for the synthetic data market is the escalating need for advanced data privacy and compliance. Organizations across sectors such as healthcare, BFSI, and government are under increasing pressure to comply with regulations like GDPR, HIPAA, and CCPA. Synthetic data offers a viable solution by enabling the creation of realistic yet anonymized datasets, thus mitigating the risk of data breaches and privacy violations. This capability is especially crucial for industries handling sensitive personal and financial information, where traditional data anonymization techniques often fall short. As regulatory scrutiny intensifies, the adoption of synthetic data solutions is set to expand rapidly, ensuring organizations can leverage data-driven innovation without compromising on privacy or compliance.




    Another significant factor propelling the synthetic data market is the surge in AI and machine learning deployment across enterprises. AI models require vast, diverse, and high-quality datasets for effective training and validation. However, real-world data is often scarce, incomplete, or biased, limiting the performance of these models. Synthetic data addresses these challenges by generating tailored datasets that represent a wide range of scenarios and edge cases. This not only enhances the accuracy and robustness of AI systems but also accelerates the development cycle by reducing dependencies on real data collection and labeling. As the demand for intelligent automation and predictive analytics grows, synthetic data is emerging as a foundational enabler for next-generation AI applications.




    In addition to privacy and AI training, synthetic data is gaining traction in test data management and fraud detection. Enterprises are increasingly leveraging synthetic datasets to simulate complex business environments, test software systems, and identify vulnerabilities in a controlled manner. In fraud detection, synthetic data allows organizations to model and anticipate new fraudulent behaviors without exposing sensitive customer data. This versatility is driving adoption across diverse verticals, from automotive and manufacturing to retail and telecommunications. As digital transformation initiatives intensify and the need for robust data testing environments grows, the synthetic data market is poised for sustained expansion.




    Regionally, North America dominates the synthetic data market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The strong presence of technology giants, a mature AI ecosystem, and early regulatory adoption are key factors supporting North America’s leadership. Meanwhile, Asia Pacific is witnessing the fastest growth, driven by rapid digitalization, expanding AI investments, and increasing awareness of data privacy. Europe continues to see steady adoption, particularly in sectors like healthcare and finance where data protection regulations are stringent. Latin America and the Middle East & Africa are also emerging as promising markets, albeit at a nascent stage, as organizations in these regions begin to recognize the value of synthetic data for digital innovation and compliance.





    Component Analysis



    The synthetic data market is segmented by component into software and services. The software segment currently holds the largest market

  13. Paper2Fig100k dataset

    • zenodo.org
    application/gzip
    Updated Nov 8, 2022
    + more versions
    Cite
Juan A. Rodríguez; David Vázquez; Issam Laradji; Marco Pedersoli; Pau Rodríguez (2022). Paper2Fig100k dataset [Dataset]. http://doi.org/10.5281/zenodo.7299423
Explore at:
Available download formats: application/gzip
Dataset updated
Nov 8, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Juan A. Rodríguez; David Vázquez; Issam Laradji; Marco Pedersoli; Pau Rodríguez
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Paper2Fig100k dataset

A dataset with over 100k images of figures and text captions from research papers. The figures display diagrams, methodologies, and architectures from research papers on arXiv.org. We also provide text captions for each figure, as well as OCR detections and recognitions on the figures (bounding boxes and texts).

    The dataset structure consists of a directory called "figures" and two JSON files (train and test), that contain data from each figure. Each JSON object contains the following information about a figure:

    • figure_id: Figure identification based on the arXiv identifier:
    • captions: Text pairs extracted from the paper that relate to the figure. For instance, the actual caption of the figure or references to the figure in the manuscript.
    • ocr_result: Result of performing OCR text recognition over the image. We provide a list of triplets (bounding box, confidence, text) present in the image.
    • aspect: Aspect ratio of the image (H/W).
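The fields above map naturally onto plain JSON. As a minimal sketch (the record below is invented for illustration; the released train/test JSON files may nest these fields differently), one entry could be loaded and filtered like this:

```python
import json

# Hypothetical record following the fields described above; the released
# JSON files may organize these differently.
record = {
    "figure_id": "1802.04825v2-Figure1",  # assumed arXiv-based id format
    "captions": ["Figure 1: Overview of the proposed architecture."],
    "ocr_result": [
        # (bounding box, confidence, text) triplets
        [[10, 12, 80, 30], 0.98, "Encoder"],
        [[120, 12, 190, 30], 0.95, "Decoder"],
    ],
    "aspect": 0.62,  # H/W
}

# Round-trips through JSON, since the record is plain lists/dicts.
assert record == json.loads(json.dumps(record))

# Keep only high-confidence OCR texts, e.g. to drive a text-perceptual loss.
texts = [text for _bbox, conf, text in record["ocr_result"] if conf > 0.9]
print(texts)  # ['Encoder', 'Decoder']
```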

    Take a look at the OCR-VQGAN GitHub repository, which uses the Paper2Fig100k dataset to train an image encoder for figures and diagrams with an OCR perceptual loss, encouraging the model to render clear and readable texts inside images.

    The dataset is explained in more detail in the paper "OCR-VQGAN: Taming Text-within-Image Generation" (WACV 2023).

    Paper abstract

    Synthetic image generation has recently experienced significant improvements in domains such as natural image or art generation. However, the problem of figure and diagram generation remains unexplored. A challenging aspect of generating figures and diagrams is effectively rendering readable texts within the images. To alleviate this problem, we present OCR-VQGAN, an image encoder and decoder that leverages OCR pre-trained features to optimize a text perceptual loss, encouraging the architecture to preserve high-fidelity text and diagram structure. To explore our approach, we introduce the Paper2Fig100k dataset, with over 100k images of figures and texts from research papers. The figures show architecture diagrams and methodologies of articles available at arXiv.org from fields like artificial intelligence and computer vision. Figures usually include text and discrete objects, e.g., boxes in a diagram, with lines and arrows that connect them. We demonstrate the superiority of our method by conducting several experiments on the task of figure reconstruction. Additionally, we explore the qualitative and quantitative impact of weighting different perceptual metrics in the overall loss function.

  14. LLM - Detect AI Datamix

    • kaggle.com
    Updated Feb 2, 2024
    Cite
    Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Raja Biswas
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It facilitates a text-classification task: separating LLM-generated essays from student-written ones.

    It was developed incrementally, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug the blind spots of the previous generation of models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire PERSUADE corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories:

    • Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
    • Open source LLMs (llama, falcon, mistral, mixtral)
    • Existing LLM generated text datasets: synthetic dataset made by T5, DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset

    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:
      • Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.
      • Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

    We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following:

    • Contrastive search
    • Use of guidance scale, typical_p, suppress_tokens
    • High temperature and large values of top-k
    • Prompting to fill in the blanks: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
    • Prompting without source texts
    • Prompting with source texts
    • Prompting to rewrite existing essays
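As a sketch, pools of decoding settings like those described can be sampled per essay. The parameter names follow the Hugging Face transformers `GenerationConfig` conventions, but the concrete values below are illustrative assumptions, not the team's actual settings:

```python
import random

# Illustrative pools of decoding settings like those described above;
# the actual values used are not given in this excerpt.
GENERATION_CONFIGS = [
    # Contrastive search (penalty_alpha + small top_k)
    {"penalty_alpha": 0.6, "top_k": 4},
    # High temperature with a large top-k
    {"do_sample": True, "temperature": 1.3, "top_k": 640},
    # Typical sampling with token suppression
    {"do_sample": True, "typical_p": 0.9, "suppress_tokens": [0]},
]

config = random.choice(GENERATION_CONFIGS)
# In practice, config would be passed as **config to model.generate(...)
# when using the transformers library.
print(config)
```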

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and of obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:

    • Spelling correction
    • Deletion/insertion/swapping of characters
    • Replacement with synonyms
    • Introduction of obfuscations
    • Back translation
    • Random capitalization
    • Sentence swapping

  15. TIMIT-TTS: a Text-to-Speech Dataset for Synthetic Speech Detection

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Sep 21, 2022
    Cite
    Davide Salvi; Brian Hosler; Paolo Bestagini; Matthew C. Stamm; Stefano Tubaro (2022). TIMIT-TTS: a Text-to-Speech Dataset for Synthetic Speech Detection [Dataset]. http://doi.org/10.5281/zenodo.6560159
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Davide Salvi; Brian Hosler; Paolo Bestagini; Matthew C. Stamm; Stefano Tubaro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Forged media are also getting more and more complex, with manipulated videos (e.g., deepfakes, where both the visual and audio contents can be counterfeited) taking over the scene from still images.
    The multimedia forensic community has addressed the possible threats that this situation implies by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools only analyze one modality at a time.
    This was not a problem as long as still images were considered the most widely edited media, but now that manipulated videos are becoming customary, performing monomodal analyses could be reductive. Nonetheless, there is a gap in the literature regarding multimodal detectors (systems that consider both audio and video components). This is due to the difficulty of developing them, but also to the scarcity of datasets containing forged multimodal data to train and test the designed algorithms.

    In this paper we focus on the generation of an audio-visual deepfake dataset.
    First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio) and multimodal (i.e., audio and video) conditions.
    This highlights the need for multimodal forensic detectors and more multimodal deepfake data.

  16. GLINER-multi-task-synthetic-data

    • huggingface.co
    Updated Jul 15, 2024
    Cite
    Knowledgator Engineering (2024). GLINER-multi-task-synthetic-data [Dataset]. https://huggingface.co/datasets/knowledgator/GLINER-multi-task-synthetic-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 15, 2024
    Dataset authored and provided by
    Knowledgator Engineering
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the official synthetic dataset used to train the GLiNER multi-task model. The dataset is a list of dictionaries, each consisting of a tokenized text with named entity recognition (NER) information. Each item has two main components:

    'tokenized_text': A list of individual words and punctuation marks from the original text, split into tokens.

    'ner': A list of lists containing named entity recognition information. Each inner list has three elements:

    Start index of the named entity in the… See the full description on the dataset page: https://huggingface.co/datasets/knowledgator/GLINER-multi-task-synthetic-data.
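A minimal sketch of that layout, assuming the NER spans are `[start_token, end_token, label]` with an inclusive end index (the dataset card has the authoritative convention), and a sample item invented for illustration:

```python
# Hypothetical item in the layout described above. The 'ner' spans are
# assumed to be [start_token, end_token, label] with an inclusive end index.
item = {
    "tokenized_text": ["Acme", "Corp", "hired", "Jane", "Doe", "in", "2021", "."],
    "ner": [
        [0, 1, "organization"],
        [3, 4, "person"],
    ],
}

def decode_entities(item):
    """Turn token-index spans back into surface strings."""
    tokens = item["tokenized_text"]
    return [(" ".join(tokens[s:e + 1]), label) for s, e, label in item["ner"]]

print(decode_entities(item))  # [('Acme Corp', 'organization'), ('Jane Doe', 'person')]
```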

  17. Synthetic Dyslexia Handwriting Dataset (YOLO-Format)

    • zenodo.org
    zip
    Updated Feb 11, 2025
    Cite
    Nora Fink (2025). Synthetic Dyslexia Handwriting Dataset (YOLO-Format) [Dataset]. http://doi.org/10.5281/zenodo.14852659
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nora Fink
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description
    This synthetic dataset has been generated to facilitate object detection (in YOLO format) for research on dyslexia-related handwriting patterns. It builds upon an original corpus of uppercase and lowercase letters obtained from multiple sources: the NIST Special Database 19 [1], the Kaggle dataset “A-Z Handwritten Alphabets in .csv format” [2], as well as handwriting samples from dyslexic primary school children of Seberang Jaya, Penang (Malaysia).

    In the original dataset, uppercase letters originated from NIST Special Database 19, while lowercase letters came from the Kaggle dataset curated by S. Patel. Additional images (categorized as Normal, Reversal, and Corrected) were collected and labeled based on handwriting samples of dyslexic and non-dyslexic students, resulting in:

    • 78,275 images labeled as Normal
    • 52,196 images labeled as Reversal
    • 8,029 images labeled as Corrected

    Building upon this foundation, the Synthetic Dyslexia Handwriting Dataset presented here was programmatically generated to produce labeled examples suitable for training and validating object detection models. Each synthetic image arranges multiple letters of various classes (Normal, Reversal, Corrected) in a “text line” style on a black background, providing YOLO-compatible .txt annotations that specify bounding boxes for each letter.

    Key Points of the Synthetic Generation Process

    1. Letter-Level Source Data
      Individual characters were sampled from the original image sets.
    2. Randomized Layout
      Letters are randomly assembled into words and lines, ensuring a wide variety of visual arrangements.
    3. Bounding Box Labels
      Each character is assigned a bounding box with (x, y, width, height) in YOLO format.
    4. Class Annotations
      Classes include 0 = Normal, 1 = Reversal, and 2 = Corrected.
    5. Preservation of Visual Characteristics
      Letters retain their key dyslexia-relevant features (e.g., reversals).
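The YOLO-compatible annotations described above can be read back with a few lines. This sketch assumes the standard YOLO convention of one `class cx cy w h` line per letter, with coordinates normalized to [0, 1] relative to the image size; verify against the released .txt files:

```python
# Sketch of reading one YOLO .txt annotation line for these images.
# Assumes the standard YOLO convention: "class cx cy w h", normalized to [0, 1].
CLASS_NAMES = {0: "Normal", 1: "Reversal", 2: "Corrected"}

def parse_yolo_line(line, img_w, img_h):
    cls, cx, cy, w, h = line.split()
    cx, cy = float(cx) * img_w, float(cy) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    # Convert the center-based box to top-left pixel coordinates.
    x0, y0 = cx - w / 2, cy - h / 2
    return CLASS_NAMES[int(cls)], (round(x0), round(y0), round(w), round(h))

label, box = parse_yolo_line("1 0.25 0.5 0.1 0.2", img_w=640, img_h=480)
print(label, box)  # Reversal (128, 192, 64, 96)
```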

    Historical References & Credits

    If you are using this synthetic dataset or the original Dyslexia Handwriting Dataset, please cite the following papers:

    • M. S. A. B. Rosli, I. S. Isa, S. A. Ramlan, S. N. Sulaiman and M. I. F. Maruzuki, "Development of CNN Transfer Learning for Dyslexia Handwriting Recognition," 2021 11th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 2021, pp. 194–199, doi: 10.1109/ICCSCE52189.2021.9530971.
    • N. S. L. Seman, I. S. Isa, S. A. Ramlan, W. Li-Chih and M. I. F. Maruzuki, "Notice of Removal: Classification of Handwriting Impairment Using CNN for Potential Dyslexia Symptom," 2021 11th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 2021, pp. 188–193, doi: 10.1109/ICCSCE52189.2021.9530989.
    • Isa, Iza Sazanita. CNN Comparisons Models On Dyslexia Handwriting Classification / Iza Sazanita Isa … [et Al.]. Universiti Teknologi MARA Cawangan Pulau Pinang, 2021.
    • Isa, I. S., Rahimi, W. N. S., Ramlan, S. A., & Sulaiman, S. N. (2019). Automated detection of dyslexia symptom based on handwriting image for primary school children. Procedia Computer Science, 163, 440–449.

    References to Original Data Sources

    [1] P. J. Grother, “NIST Special Database 19,” NIST, 2016. [Online]. Available:
    https://www.nist.gov/srd/nist-special-database-19

    [2] S. Patel, “A-Z Handwritten Alphabets in .csv format,” Kaggle, 2017. [Online]. Available:
    https://www.kaggle.com/sachinpatel21/az-handwritten-alphabets-in-csv-format

    Usage & Citation

    Researchers and practitioners are encouraged to integrate this synthetic dataset into their computer vision pipelines for tasks such as dyslexia pattern analysis, character recognition, and educational technology development. Please cite the original authors and publications if you utilize this synthetic dataset in your work.

    Password Note (Original Data)

    The original RAR file was password-protected with the password: WanAsy321. This synthetic dataset, however, is provided openly for streamlined usage.

  18. Synthetic nursing handover training and development data set - audio files

    • data.csiro.au
    • researchdata.edu.au
    Updated Mar 21, 2017
    + more versions
    Cite
    Maricel Angel; Hanna Suominen; Liyuan Zhou; Leif Hanlen (2017). Synthetic nursing handover training and development data set - audio files [Dataset]. http://doi.org/10.4225/08/58d0977ab4888
    Explore at:
    Dataset updated
    Mar 21, 2017
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Maricel Angel; Hanna Suominen; Liyuan Zhou; Leif Hanlen
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Dataset funded by
    NICTA
    Description

    This is one of two collection records. Please see the link below for the other collection of associated text files.

    The two collections together comprise an open clinical dataset of three sets of 10 nursing handover records, very similar to real documents in Australian English. Each record consists of a patient profile, spoken free-form text document, written free-form text document, and written structured document.

    This collection contains 3 × 100 spoken free-form audio files in WAV format.

    Lineage: Data creation included the following steps: generation of patient profiles; creation of written, free-form text documents; development of a structured handover form; using this form and the written, free-form text documents to create written, structured documents; creation of spoken, free-form text documents; using a speech recognition engine with different vocabularies to convert the spoken documents to written, free-form text; and using an information extraction system to fill out the handover form from the written, free-form text documents.

    See Suominen et al (2015) in the links below for a detailed description and examples.

  19. SDADDS-Guelma: A Multi-purpose Dataset for Synthetic Degraded Arabic...

    • zenodo.org
    bin
    Updated Jan 9, 2025
    Cite
    Abderrahmane Kefali (2025). SDADDS-Guelma: A Multi-purpose Dataset for Synthetic Degraded Arabic Documents [Dataset]. http://doi.org/10.5281/zenodo.10896124
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 9, 2025
    Dataset provided by
    Zenodo
    Authors
    Abderrahmane Kefali
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Guelma
    Description

    SDADDS-Guelma: A Multi-purpose Dataset for Synthetic Degraded Arabic Documents

    Description:

    This is a partial release of the SDADDS-Guelma dataset.

    SDADDS-Guelma (Synthetic Degraded Arabic Document DataSet of the University of Guelma) is a database of synthetic noisy or degraded Arabic document images. It was created by Dr. Abderrahmane Kefali and his team to support research on preprocessing, analysis, and recognition of degraded Arabic documents, where having a large set of images for training and testing is essential. This dataset is made publicly available to researchers in the field of document analysis and recognition, with the hope that it will be useful and contribute to their research endeavors.

    In this first release of the dataset, 84 handwritten images and 120 printed images have been used, along with 25 images of historical backgrounds, forming a total of 26316 synthetic images of degraded Arabic documents along with their corresponding ground-truth files.

    This release is separated into two parts to facilitate upload and use: one for the handwritten documents and the second for the printed documents.

    Composition of the dataset:

    Each of the parts of the SDADDS-Guelma dataset is organized into directories as follows:

    • TXT_Files: Contains texts in UTF-8 format.
    • IMG: Contains images of printed and handwritten Arabic text constructed from the text files.
    • Bin_IMG: Contains binary images corresponding to the original images.
    • BG_IMG: Contains images of empty old document backgrounds used for the generation of synthetic historical document images.
    • GT_Files: Contains XML annotation files corresponding to the text images.
    • Degraded_IMG: This directory contains synthetically generated degraded images, separated into sub-directories based on noise types such as Local_Noise, Show_through, Rotation, Curvature, Comb_IMG, etc.

    Ground-truth information:

    Ground truth information is essential for a document dataset, as it annotates documents and represents their essential characteristics. Our dataset is designed to be a large-scale and multipurpose dataset. As such, our methodology ensures that ground truth information is provided at three levels: text level (character codes), pixel level (binary and cleaned image), and document physical structure and other annotation information level.

    • Textual Ground Truth: these are identical to the original texts.
    • Pixel-level ground truth: presented in the form of binary images.
    • Ground truth at the document structure level: the structure of each document image, alongside the textual transcription of the words and PAWs, is recorded in a corresponding XML annotation file. The XML format utilized resembles that employed in similar works with adjustments made according to the specific characteristics of Arabic texts, including the presence of PAWs.

    Consequently, each original text image in our dataset is associated with an XML file detailing the entire ground truth and associated metadata.

    Structure of XML file:

    Each XML annotation file contains metadata about the document image and text content within the image, including the language, number of lines, and font attributes. It also provides detailed information about each text line, word, and Part of Arabic Words (PAWs), including their bounding boxes and textual transcriptions.

    Thus, each ground truth file takes the following form:
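The concrete XML layout is not included in this excerpt. Purely as an illustration, a hypothetical layout consistent with the description above (lines containing words, words containing PAWs, each with a bounding box and a transcription) could be read with the Python standard library; every element and attribute name below is invented, not the real SDADDS-Guelma schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation snippet; element and attribute names are assumed
# for illustration and will differ from the real SDADDS-Guelma schema.
xml_snippet = """
<document language="Arabic" lines="1">
  <line x="10" y="20" w="300" h="40">
    <word x="12" y="22" w="80" h="36" text="كتاب">
      <paw x="12" y="22" w="40" h="36" text="كتا"/>
      <paw x="54" y="22" w="38" h="36" text="ب"/>
    </word>
  </line>
</document>
"""

root = ET.fromstring(xml_snippet)
# Collect each word's transcription together with its PAW transcriptions.
words = [(w.get("text"), [p.get("text") for p in w.findall("paw")])
         for w in root.iter("word")]
print(words)  # one word with its two PAWs
```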

    Contact:

    Name: Dr. Abderrahmane Kefali
    Affiliation: University of 8 May 1945-Guelma, Algeria
    Email: kefali.abderrahmane@univ-guelma.dz

  20. Tran et al. Final_Dataset.xlsx

    • figshare.com
    xlsx
    Updated Nov 12, 2024
    Cite
    Van Hieu Tran; Yakub Sebastian; Asif Karim; Sami Azam (2024). Tran et al. Final_Dataset.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.27619839.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Nov 12, 2024
    Dataset provided by
    figshare
    Authors
    Van Hieu Tran; Yakub Sebastian; Asif Karim; Sami Azam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Artificial Intelligence (AI) has emerged as a critical challenge to the authenticity of journalistic content, raising concerns over the ease with which artificially generated articles can mimic human-written news. This study focuses on using machine learning to identify distinguishing features, or “stylistic fingerprints,” of AI-generated and human-authored journalism. By analyzing these unique characteristics, we aim to classify news pieces with high accuracy, enhancing our ability to verify the authenticity of digital news.

    To conduct this study, we gathered a balanced dataset of 150 original journalistic articles and their 150 AI-generated counterparts, sourced from popular news websites. A variety of lexical, syntactic, and readability features were extracted from each article to serve as input data for training machine learning models. Five classifiers were then trained to evaluate how accurately they could distinguish between authentic and artificial articles, with each model learning specific patterns and variations in writing style.

    In addition to model training, BERTopic, a topic modeling technique, was applied to extract salient keywords from the journalistic articles. These keywords were used to prompt Google’s Gemini, an AI text generation model, to create artificial articles on the same topics as the original human-written pieces. This ensured a high level of relevance between authentic and AI-generated articles, which added complexity to the classification task.

    Among the five classifiers tested, the Random Forest model delivered the best performance, achieving an accuracy of 98.3% along with high precision (0.984), recall (0.983), and F1-score (0.983). Feature importance analyses were conducted using methods like Random Forest Feature Importance, Analysis of Variance (ANOVA), Mutual Information, and Recursive Feature Elimination. This analysis revealed that the top five discriminative features were sentence length range, paragraph length coefficient of variation, verb ratio, sentence complexity tags, and paragraph length range. These features appeared to encapsulate subtle but meaningful stylistic differences between human and AI-generated content.

    This research makes a significant contribution to combating disinformation by offering a robust method for authenticating journalistic content. By employing machine learning to identify subtle linguistic patterns, this study not only advances our understanding of AI in journalism but also enhances the tools available to ensure the credibility of news in the digital age.
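Two of those top-ranked features are simple enough to sketch with the Python standard library. The sentence and paragraph splitting heuristics below are assumptions for illustration, not the paper's exact procedure:

```python
import re
import statistics

def sentence_length_range(text):
    """Range (max - min) of sentence lengths in words; naive punctuation split."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return max(lengths) - min(lengths)

def paragraph_length_cv(text):
    """Coefficient of variation (stdev / mean) of paragraph lengths in words."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    lengths = [len(p.split()) for p in paragraphs]
    return statistics.pstdev(lengths) / statistics.mean(lengths)

doc = ("Short one. This sentence is quite a bit longer than the first.\n\n"
       "Second paragraph here.")
print(sentence_length_range(doc))  # → 8
```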
