GNHK Synthetic OCR Dataset
Overview
Welcome to the GNHK Synthetic OCR Dataset repository. Here I have generated synthetic data using the GNHK dataset and open-source LLMs like Mixtral. The dataset contains queries on the images and their corresponding answers.
What's Inside?
Dataset Folder: The Dataset folder contains the images; corresponding to each image, there is a JSON file that carries the OCR information for that image
Parquet File: For easy handling and analysis… See the full description on the dataset page: https://huggingface.co/datasets/shreyansh1347/GNHK-Synthetic-OCR-Dataset.
The UTRSet-Synth dataset is introduced as a complementary training resource to the UTRSet-Real Dataset, specifically designed to enhance the effectiveness of Urdu OCR models. It is a high-quality synthetic dataset comprising 20,000 lines that closely resemble real-world representations of Urdu text.
To generate the dataset, a custom-designed synthetic data generation module was employed, offering precise control over variations in crucial factors such as font, text size, colour, resolution, orientation, noise, style, and background. Moreover, the UTRSet-Synth dataset tackles the limitations observed in existing datasets. It addresses the challenge of standardizing fonts by incorporating over 130 diverse Urdu fonts, thoroughly refined to ensure consistent rendering schemes. It overcomes the scarcity of Arabic words, numerals, and Urdu digits by incorporating a significant number of samples representing these elements. Additionally, the dataset is enriched by randomly selecting words from a vocabulary of 100,000 words during the text generation process. As a result, UTRSet-Synth contains a total of 28,187 unique words, with an average word length of 7 characters.
The availability of the UTRSet-Synth dataset, a synthetic dataset that closely emulates real-world variations, addresses the scarcity of comprehensive real-world printed Urdu OCR datasets. By providing researchers with a valuable resource for developing and benchmarking Urdu OCR models, this dataset promotes standardized evaluation and reproducibility, and fosters advancements in the field of Urdu OCR. For more information about the UTRSet-Real and UTRSet-Synth datasets, please refer to the paper "UTRNet: High-Resolution Urdu Text Recognition In Printed Documents".
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This repository contains the corpus necessary for the synthetic data generation of DANIEL, which is available on GitHub and described in the paper "DANIEL: a fast document attention network for information extraction and labelling of handwritten documents", authored by Thomas Constum, Pierrick Tranouez, and Thierry Paquet (LITIS, University of Rouen Normandie).
The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR) and is also accessible on arXiv.
The contents of the archive should be placed in the Datasets/raw directory of the DANIEL codebase.
Contents of the archive:
wiki_en: An English text corpus stored in the Hugging Face datasets library format. Each entry contains the full text of a Wikipedia article.
wiki_en_ner: An English text corpus enriched with named entity annotations following the OntoNotes v5 ontology. Named entities are encoded using special symbols. The corpus is stored in the Hugging Face datasets format, and each entry corresponds to a Wikipedia article with annotated entities.
wiki_fr: A French text corpus for synthetic data generation, also stored in the Hugging Face datasets format. Each entry contains the full text of a French Wikipedia article.
wiki_de.txt: A German text corpus in plain text format, with one sentence per line. The content originates from the Wortschatz Leipzig repository and has been normalized to match the vocabulary used in DANIEL.
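The corpora stored in the Hugging Face datasets format can be read with the datasets library. A minimal, hedged sketch follows; the Datasets/raw path follows the DANIEL layout above, and whether each corpus is stored as a single Dataset or a DatasetDict is not specified here:

```python
# Minimal sketch: load one of the corpora above from the DANIEL layout.
# Field names beyond the "id" shown below are not documented here.
from datasets import load_from_disk

wiki_en = load_from_disk("Datasets/raw/wiki_en")
# If the corpus is stored as a DatasetDict, pick a split first,
# e.g. wiki_en = wiki_en["train"] (split names are an assumption).
record = wiki_en[0]
print(record.keys())  # inspect the available fields
```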
Data format for corpora in Hugging Face datasets structure:
Each record in the datasets follows the dictionary structure below:
{
"id": "
According to our latest research, the global Synthetic Data Generation Engine market size reached USD 1.42 billion in 2024, reflecting a rapidly expanding sector driven by the escalating demand for advanced data solutions. The market is expected to achieve a robust CAGR of 37.8% from 2025 to 2033, propelling it to an estimated value of USD 21.8 billion by 2033. This exceptional growth is primarily fueled by the increasing need for high-quality, privacy-compliant datasets to train artificial intelligence and machine learning models in sectors such as healthcare, BFSI, and IT & telecommunications. As per our latest research, the proliferation of data-centric applications and stringent data privacy regulations are acting as significant catalysts for the adoption of synthetic data generation engines globally.
One of the key growth factors for the synthetic data generation engine market is the mounting emphasis on data privacy and compliance with regulations such as GDPR and CCPA. Organizations are under immense pressure to protect sensitive customer information while still deriving actionable insights from data. Synthetic data generation engines offer a compelling solution by creating artificial datasets that mimic real-world data without exposing personally identifiable information. This not only ensures compliance but also enables organizations to accelerate their AI and analytics initiatives without the constraints of data access or privacy risks. The rising awareness among enterprises about the benefits of synthetic data in mitigating data breaches and regulatory penalties is further propelling market expansion.
Another significant driver is the exponential growth in artificial intelligence and machine learning adoption across industries. Training robust and unbiased models requires vast and diverse datasets, which are often difficult to obtain due to privacy concerns, labeling costs, or data scarcity. Synthetic data generation engines address this challenge by providing scalable and customizable datasets for various applications, including machine learning model training, data augmentation, and fraud detection. The ability to generate balanced and representative data has become a critical enabler for organizations seeking to improve model accuracy, reduce bias, and accelerate time-to-market for AI solutions. This trend is particularly pronounced in sectors such as healthcare, automotive, and finance, where data diversity and privacy are paramount.
Furthermore, the increasing complexity of data types and the need for multi-modal data synthesis are shaping the evolution of the synthetic data generation engine market. With the proliferation of unstructured data in the form of images, videos, audio, and text, organizations are seeking advanced engines capable of generating synthetic data across multiple modalities. This capability enhances the versatility of synthetic data solutions, enabling their application in emerging use cases such as autonomous vehicle simulation, natural language processing, and biometric authentication. The integration of generative AI techniques, such as GANs and diffusion models, is further enhancing the realism and utility of synthetic datasets, expanding the addressable market for synthetic data generation engines.
From a regional perspective, North America continues to dominate the synthetic data generation engine market, accounting for the largest revenue share in 2024. The region's leadership is attributed to the strong presence of technology giants, early adoption of AI and machine learning, and stringent regulatory frameworks. Europe follows closely, driven by robust data privacy regulations and increasing investments in digital transformation. Meanwhile, the Asia Pacific region is emerging as the fastest-growing market, supported by expanding IT infrastructure, government-led AI initiatives, and a burgeoning startup ecosystem. Latin America and the Middle East & Africa are also witnessing gradual adoption, fueled by the growing recognition of synthetic data's potential to overcome data access and privacy challenges.
Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.
What Makes Our Data Unique?
Scale and Coverage:
- A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies.
- Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.
Rich Attributes for Training Models:
- Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights.
- Tailored for training models in NLP, recommendation systems, and predictive algorithms.
Compliance and Quality:
- Fully GDPR and CCPA compliant, providing secure and ethically sourced data.
- Extensive data cleaning and validation processes ensure reliability and accuracy.
Annotation-Ready:
- Pre-structured and formatted datasets that are easily ingestible into AI workflows.
- Ideal for supervised learning with tagging options such as entities, sentiment, or categories.
How Is the Data Sourced?
- Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques.
- Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets.
This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.
Primary Use Cases and Verticals
Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.
Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.
B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.
HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.
How This Product Fits Into Xverum’s Broader Data Offering Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.
Why Choose Xverum?
- Experience and Expertise: A trusted name in structured web data with a proven track record.
- Flexibility: Datasets can be tailored for any AI/ML application.
- Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data.
- Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.
Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.
Contact us for sample datasets or to discuss your specific needs.
According to our latest research, the global synthetic data generation market size reached USD 1.6 billion in 2024, demonstrating robust expansion driven by increasing demand for high-quality, privacy-preserving datasets. The market is projected to grow at a CAGR of 38.2% over the forecast period, reaching USD 19.2 billion by 2033. This remarkable growth trajectory is fueled by the growing adoption of artificial intelligence (AI) and machine learning (ML) technologies across industries, coupled with stringent data privacy regulations that necessitate innovative data solutions. As per our latest research, organizations worldwide are increasingly leveraging synthetic data to address data scarcity, enhance AI model training, and ensure compliance with evolving privacy standards.
One of the primary growth factors for the synthetic data generation market is the rising emphasis on data privacy and regulatory compliance. With the implementation of stringent data protection laws such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States, enterprises are under immense pressure to safeguard sensitive information. Synthetic data offers a compelling solution by enabling organizations to generate artificial datasets that mirror the statistical properties of real data without exposing personally identifiable information. This not only facilitates regulatory compliance but also empowers organizations to innovate without the risk of data breaches or privacy violations. As businesses increasingly recognize the value of privacy-preserving data, the demand for advanced synthetic data generation solutions is set to surge.
Another significant driver is the exponential growth in AI and ML adoption across various sectors, including healthcare, finance, automotive, and retail. High-quality, diverse, and unbiased data is the cornerstone of effective AI model development. However, acquiring such data is often challenging due to privacy concerns, limited availability, or high acquisition costs. Synthetic data generation bridges this gap by providing scalable, customizable datasets tailored to specific use cases, thereby accelerating AI training and reducing dependency on real-world data. Organizations are leveraging synthetic data to enhance algorithm performance, mitigate data bias, and simulate rare events, which are otherwise difficult to capture in real datasets. This capability is particularly valuable in sectors like autonomous vehicles, where training models on rare but critical scenarios is essential for safety and reliability.
Furthermore, the growing complexity of data types—ranging from tabular and image data to text, audio, and video—has amplified the need for versatile synthetic data generation tools. Enterprises are increasingly seeking solutions that can generate multi-modal synthetic datasets to support diverse applications such as fraud detection, product testing, and quality assurance. The flexibility offered by synthetic data generation platforms enables organizations to simulate a wide array of scenarios, test software systems, and validate AI models in controlled environments. This not only enhances operational efficiency but also drives innovation by enabling rapid prototyping and experimentation. As the digital ecosystem continues to evolve, the ability to generate synthetic data across various formats will be a critical differentiator for businesses striving to maintain a competitive edge.
Regionally, North America leads the synthetic data generation market, accounting for the largest revenue share in 2024, followed closely by Europe and Asia Pacific. The dominance of North America can be attributed to the strong presence of technology giants, advanced research institutions, and a favorable regulatory environment that encourages AI innovation. Europe is witnessing rapid growth due to proactive data privacy regulations and increasing investments in digital transformation initiatives. Meanwhile, Asia Pacific is emerging as a high-growth region, driven by the proliferation of digital technologies and rising adoption of AI-powered solutions across industries. Latin America and the Middle East & Africa are also expected to experience steady growth, supported by government-led digitalization programs and expanding IT infrastructure.
Dataset Card for "MJSynth_text_recognition"
This is the MJSynth dataset for text recognition on document images, synthetically generated, covering 90K English words. It includes training, validation and test splits. Source of the dataset: https://www.robots.ox.ac.uk/~vgg/data/text/ Use dataset streaming functionality to try out the dataset quickly without downloading the entire dataset (refer: https://huggingface.co/docs/datasets/stream) Citation details provided on the source… See the full description on the dataset page: https://huggingface.co/datasets/priyank-m/MJSynth_text_recognition.
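The streaming functionality mentioned above lets you iterate over records without downloading the full dataset. A minimal sketch; the repo id comes from the dataset page, and the split name "train" is assumed from the train/validation/test description:

```python
# Minimal sketch: stream the dataset instead of downloading it entirely.
from datasets import load_dataset

ds = load_dataset("priyank-m/MJSynth_text_recognition",
                  split="train", streaming=True)
sample = next(iter(ds))  # fetches just the first record over the network
print(sample.keys())
```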
https://www.htfmarketinsights.com/privacy-policy
The global Synthetic Data Generation market is segmented by Application (AI training, Software testing, Fraud detection, Privacy preservation, Autonomous driving), Type (Tabular, Image, Video, Text, Time-series), and Geography (North America, LATAM, West Europe, Central & Eastern Europe, Northern Europe, Southern Europe, East Asia, Southeast Asia, South Asia, Central Asia, Oceania, MEA)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Intelligent Invoice Management System
Project Description:
The Intelligent Invoice Management System is an advanced AI-powered platform designed to revolutionize traditional invoice processing. By automating the extraction, validation, and management of invoice data, this system addresses the inefficiencies, inaccuracies, and high costs associated with manual methods. It enables businesses to streamline operations, reduce human error, and expedite payment cycles.
Problem Statement:
Manual invoice processing involves labor-intensive tasks such as data entry, verification, and reconciliation. These processes are time-consuming, prone to errors, and can result in financial losses and delays. The diversity of invoice formats from various vendors adds complexity, making automation a critical need for efficiency and scalability.
Proposed Solution:
The Intelligent Invoice Management System automates the end-to-end process of invoice handling using AI and machine learning techniques. Core functionalities include:
1. Invoice Generation: Automatically generate PDF invoices in at least four formats, populated with synthetic data (a generation sketch follows this list).
2. Data Development: Leverage a dataset containing fields such as receipt numbers, company details, sales tax information, and itemized tables to create realistic invoice samples.
3. AI-Powered Labeling: Use Tesseract OCR to extract labeled data from invoice images, and train YOLO for label recognition, ensuring precise identification of fields.
4. Database Integration: Store extracted information in a structured database for seamless retrieval and analysis.
5. Web-Based Information System: Provide a user-friendly platform to upload invoices and retrieve key metrics, such as:
- Total sales within a specified duration.
- Total sales tax paid during a given timeframe.
- Detailed invoice information in tabular form for specific date ranges.
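As referenced in item 1, here is a minimal, hedged sketch of how such synthetic invoices might be generated. The project does not name its libraries, so faker (fake company data) and reportlab (PDF rendering) are illustrative choices:

```python
# Hypothetical sketch of synthetic invoice generation; faker and reportlab
# are illustrative, not the project's stated tooling.
from faker import Faker
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

fake = Faker()

def generate_invoice(path: str) -> None:
    c = canvas.Canvas(path, pagesize=A4)
    c.drawString(50, 800, f"Invoice No: {fake.uuid4()}")
    c.drawString(50, 780, f"Company: {fake.company()}")
    c.drawString(50, 760, f"Date: {fake.date()}")
    # Itemized table: four items per invoice, as in the deliverables.
    y = 720
    for i in range(4):
        c.drawString(50, y, f"Item {i + 1}: {fake.word()}  qty=1  price={fake.pricetag()}")
        y -= 20
    c.drawString(50, y - 20, "Sales tax: 17%")  # illustrative rate
    c.save()

generate_invoice("invoice_0001.pdf")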
Key Features and Deliverables:
1. Invoice Generation:
- Generate 20,000 invoices using an automated script.
- Include dummy logos, company details, and itemized tables for four items per invoice.
2. Label Definition and Format:
3. OCR and AI Training:
4. Database Management:
5. Web-Based Interface:
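For the OCR and AI training deliverable, a hedged sketch of extracting word-level text and bounding boxes from a rendered invoice image with Tesseract follows. The pytesseract wrapper and the file name are assumptions; the project's exact pipeline is not specified above:

```python
# Sketch of the Tesseract extraction step: word-level text plus bounding
# boxes, which could then seed YOLO training labels.
from PIL import Image
import pytesseract

img = Image.open("invoice_0001.png")  # hypothetical rendered invoice image
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

for text, x, y, w, h in zip(data["text"], data["left"], data["top"],
                            data["width"], data["height"]):
    if text.strip():
        print(f"{text!r} at box ({x}, {y}, {w}, {h})")
```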
Expected Outcomes:
- Reduction in manual effort and operational costs.
- Improved accuracy in invoice processing and financial reporting.
- Enhanced scalability and adaptability for diverse invoice formats.
- Faster turnaround time for invoice-related tasks.
By automating critical aspects of invoice management, this system delivers a robust and intelligent solution to meet the evolving needs of businesses.
Synthetic Text Dataset for 10 African Languages
This dataset contains synthetic text generated using large language models for ten African languages. It is intended to support research and evaluation in automatic speech recognition (ASR), natural language processing (NLP), and related fields for low-resource languages.
Data Generation and Licensing
I acknowledge that this dataset contains synthetic data generated through the process described in this paper. It is not… See the full description on the dataset page: https://huggingface.co/datasets/CLEAR-Global/Synthetic-Text.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Myanmar OCR Dataset
A synthetic dataset for training and fine-tuning Optical Character Recognition (OCR) models specifically for the Myanmar language.
Dataset Description
This dataset contains synthetically generated OCR images created specifically for Myanmar text recognition tasks. The images were generated using myanmar-ocr-data-generator, a fork of TextRecognitionDataGenerator with fixes for proper Myanmar character splitting.
Direct Download
Available… See the full description on the dataset page: https://huggingface.co/datasets/chuuhtetnaing/myanmar-ocr-dataset.
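As a hedged illustration of how such images are produced, the upstream TextRecognitionDataGenerator exposes a Python API like the following; the myanmar-ocr-data-generator fork referenced above may differ in details such as fonts and character splitting:

```python
# Sketch using the upstream TextRecognitionDataGenerator API; the
# Myanmar fork may expose different options.
from trdg.generators import GeneratorFromStrings

generator = GeneratorFromStrings(
    ["မင်္ဂလာပါ"],   # sample Myanmar text; real runs would use a large corpus
    count=10,        # number of images to synthesize
)

for i, (image, label) in enumerate(generator):
    image.save(f"myanmar_{i:04d}.png")
    # keep (filename, label) pairs as ground truth, e.g. in a labels.tsv
```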
According to our latest research, the synthetic data market size reached USD 1.52 billion in 2024, reflecting robust growth driven by increasing demand for privacy-preserving data and the acceleration of AI and machine learning initiatives across industries. The market is projected to expand at a compelling CAGR of 34.7% from 2025 to 2033, with the forecasted market size expected to reach USD 21.4 billion by 2033. Key growth factors include the rising necessity for high-quality, diverse, and privacy-compliant datasets, the proliferation of AI-driven applications, and stringent data protection regulations worldwide.
The primary growth driver for the synthetic data market is the escalating need for advanced data privacy and compliance. Organizations across sectors such as healthcare, BFSI, and government are under increasing pressure to comply with regulations like GDPR, HIPAA, and CCPA. Synthetic data offers a viable solution by enabling the creation of realistic yet anonymized datasets, thus mitigating the risk of data breaches and privacy violations. This capability is especially crucial for industries handling sensitive personal and financial information, where traditional data anonymization techniques often fall short. As regulatory scrutiny intensifies, the adoption of synthetic data solutions is set to expand rapidly, ensuring organizations can leverage data-driven innovation without compromising on privacy or compliance.
Another significant factor propelling the synthetic data market is the surge in AI and machine learning deployment across enterprises. AI models require vast, diverse, and high-quality datasets for effective training and validation. However, real-world data is often scarce, incomplete, or biased, limiting the performance of these models. Synthetic data addresses these challenges by generating tailored datasets that represent a wide range of scenarios and edge cases. This not only enhances the accuracy and robustness of AI systems but also accelerates the development cycle by reducing dependencies on real data collection and labeling. As the demand for intelligent automation and predictive analytics grows, synthetic data is emerging as a foundational enabler for next-generation AI applications.
In addition to privacy and AI training, synthetic data is gaining traction in test data management and fraud detection. Enterprises are increasingly leveraging synthetic datasets to simulate complex business environments, test software systems, and identify vulnerabilities in a controlled manner. In fraud detection, synthetic data allows organizations to model and anticipate new fraudulent behaviors without exposing sensitive customer data. This versatility is driving adoption across diverse verticals, from automotive and manufacturing to retail and telecommunications. As digital transformation initiatives intensify and the need for robust data testing environments grows, the synthetic data market is poised for sustained expansion.
Regionally, North America dominates the synthetic data market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The strong presence of technology giants, a mature AI ecosystem, and early regulatory adoption are key factors supporting North America’s leadership. Meanwhile, Asia Pacific is witnessing the fastest growth, driven by rapid digitalization, expanding AI investments, and increasing awareness of data privacy. Europe continues to see steady adoption, particularly in sectors like healthcare and finance where data protection regulations are stringent. Latin America and the Middle East & Africa are also emerging as promising markets, albeit at a nascent stage, as organizations in these regions begin to recognize the value of synthetic data for digital innovation and compliance.
The synthetic data market is segmented by component into software and services. The software segment currently holds the largest market
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Paper2Fig100k dataset
A dataset with over 100k images of figures and text captions from research papers. The figures display diagrams, methodologies, and architectures of research papers on arXiv.org. We also provide text captions for each figure, as well as OCR detections and recognitions on the figures (bounding boxes and texts).
The dataset structure consists of a directory called "figures" and two JSON files (train and test) that contain data for each figure. Each JSON object contains the following information about a figure:
Take a look at the OCR-VQGAN GitHub repository, which uses the Paper2Fig100k dataset to train an image encoder for figures and diagrams that uses an OCR perceptual loss to render clear and readable text inside images.
The dataset is explained in more detail in the paper "OCR-VQGAN: Taming Text-within-Image Generation" (WACV 2023).
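The per-figure fields are not enumerated above, so the following hedged sketch simply loads one of the JSON files and inspects the keys; the file name is an assumption:

```python
# Sketch: inspect the per-figure records in the train JSON file.
# The file name and field names are assumptions; only the figures/
# directory and the two JSON files are documented above.
import json

with open("paper2fig_train.json") as f:
    figures = json.load(f)

print(len(figures))       # number of figure records
print(figures[0].keys())  # discover the per-figure fields
```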
Paper abstract
Synthetic image generation has recently experienced significant improvements in domains such as natural image or art generation. However, the problem of figure and diagram generation remains unexplored. A challenging aspect of generating figures and diagrams is effectively rendering readable texts within the images. To alleviate this problem, we present OCR-VQGAN, an image encoder and decoder that leverages OCR pre-trained features to optimize a text perceptual loss, encouraging the architecture to preserve high-fidelity text and diagram structure. To explore our approach, we introduce the Paper2Fig100k dataset, with over 100k images of figures and texts from research papers. The figures show architecture diagrams and methodologies of articles available at arXiv.org from fields like artificial intelligence and computer vision. Figures usually include text and discrete objects, e.g., boxes in a diagram, with lines and arrows that connect them. We demonstrate the superiority of our method by conducting several experiments on the task of figure reconstruction. Additionally, we explore the qualitative and quantitative impact of weighting different perceptual metrics in the overall loss function.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us to win the competition. It facilitates a text-classification task to separate LLM-generated essays from student-written ones.
It was developed in an incremental way focusing on size, diversity and complexity. For each datamix iteration, we attempted to plug blindspots of the previous generation models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open-source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM-generated text datasets: DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset
- A synthetic dataset made by T5
We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following:
- Contrastive search
- Use of guidance scale, typical_p, and suppress_tokens
- High temperature and large values of top-k
- Prompting to fill in the blanks: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
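To make the decoding strategies above concrete, here is a hedged sketch of how a few of them map onto Hugging Face transformers generation parameters; the model name and parameter values are illustrative, not the competition's settings:

```python
# Illustrative decoding configs with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Write an essay about car-free cities.", return_tensors="pt")

# Contrastive search: deterministic decoding that penalizes degeneration.
out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=200)

# High temperature with a large top-k, plus typical_p and suppressed tokens.
out = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.4,
    top_k=256,
    typical_p=0.9,
    suppress_tokens=[tok.eos_token_id],  # e.g., discourage early stopping
    max_new_tokens=200,
)
print(tok.decode(out[0], skip_special_tokens=True))
```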
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introduction of obfuscations
- Back-translation
- Random capitalization
- Sentence swapping
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward to perform. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Also, forged media are getting more and more complex, with manipulated videos (e.g., deepfakes where both the visual and audio contents can be counterfeited) taking the scene over from still images.
The multimedia forensic community has addressed the possible threats that this situation could imply by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools only analyze one modality at a time.
This was not a problem as long as still images were considered the most widely edited media, but now, since manipulated videos are becoming customary, performing monomodal analyses could be reductive. Nonetheless, there is a lack in the literature regarding multimodal detectors (systems that consider both audio and video components). This is due to the difficulty of developing them but also to the scarcity of datasets containing forged multimodal data to train and test the designed algorithms.
In this paper we focus on the generation of an audio-visual deepfake dataset.
First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio) and multimodal (i.e., audio and video) conditions.
This highlights the need for multimodal forensic detectors and more multimodal deepfake data.
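As a hedged illustration of the DTW alignment step in the pipeline above (the paper's exact implementation is not given here), MFCC features of the synthetic and reference tracks can be aligned with librosa:

```python
# Sketch of DTW alignment between a TTS track and the reference speech,
# in the spirit of the pipeline described above; librosa and the feature
# choices are illustrative, not the paper's implementation.
import librosa

ref, sr = librosa.load("reference_speech.wav", sr=16000)
tts, _ = librosa.load("tts_output.wav", sr=16000)

ref_mfcc = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=13)
tts_mfcc = librosa.feature.mfcc(y=tts, sr=sr, n_mfcc=13)

# Accumulated cost matrix D and the optimal warping path wp.
D, wp = librosa.sequence.dtw(X=ref_mfcc, Y=tts_mfcc)
print("alignment cost:", D[-1, -1], "path length:", len(wp))
```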
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the official synthetic dataset used to train the GLiNER multi-task model. The dataset is a list of dictionaries, each consisting of tokenized text with named entity recognition (NER) information. Each item consists of two main components:
'tokenized_text': A list of individual words and punctuation marks from the original text, split into tokens.
'ner': A list of lists containing named entity recognition information. Each inner list has three elements:
Start index of the named entity in the… See the full description on the dataset page: https://huggingface.co/datasets/knowledgator/GLINER-multi-task-synthetic-data.
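The description above is truncated. Assuming the two remaining elements of each inner list are the end token index and the entity label (an assumption based on common conventions for such datasets, not confirmed here), a record might look like this:

```python
# Hypothetical record, hedged: only the first element (start index) is
# documented above; the end index and entity label are assumed.
record = {
    "tokenized_text": ["Acme", "Corp", "hired", "Jane", "Doe", "."],
    "ner": [
        [0, 1, "organization"],  # "Acme Corp"
        [3, 4, "person"],        # "Jane Doe"
    ],
}
```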
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This synthetic dataset has been generated to facilitate object detection (in YOLO format) for research on dyslexia-related handwriting patterns. It builds upon an original corpus of uppercase and lowercase letters obtained from multiple sources: the NIST Special Database 19 [1], the Kaggle dataset “A-Z Handwritten Alphabets in .csv format” [2], as well as handwriting samples from dyslexic primary school children of Seberang Jaya, Penang (Malaysia).
In the original dataset, uppercase letters originated from NIST Special Database 19, while lowercase letters came from the Kaggle dataset curated by S. Patel. Additional images (categorized as Normal, Reversal, and Corrected) were collected and labeled based on handwriting samples of dyslexic and non-dyslexic students, resulting in:
Building upon this foundation, the Synthetic Dyslexia Handwriting Dataset presented here was programmatically generated to produce labeled examples suitable for training and validating object detection models. Each synthetic image arranges multiple letters of various classes (Normal, Reversal, Corrected) in a “text line” style on a black background, providing YOLO-compatible .txt annotations that specify a bounding box for each letter. Each box is given as (x, y, width, height) in YOLO format, with class IDs 0 = Normal, 1 = Reversal, and 2 = Corrected.
If you are using this synthetic dataset or the original Dyslexia Handwriting Dataset, please cite the following papers:
[1] P. J. Grother, “NIST Special Database 19,” NIST, 2016. [Online]. Available: https://www.nist.gov/srd/nist-special-database-19
[2] S. Patel, “A-Z Handwritten Alphabets in .csv format,” Kaggle, 2017. [Online]. Available: https://www.kaggle.com/sachinpatel21/az-handwritten-alphabets-in-csv-format
Researchers and practitioners are encouraged to integrate this synthetic dataset into their computer vision pipelines for tasks such as dyslexia pattern analysis, character recognition, and educational technology development. Please cite the original authors and publications if you utilize this synthetic dataset in your work.
The original RAR file was password-protected with the password: WanAsy321. This synthetic dataset, however, is provided openly for streamlined usage.
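A hedged sketch of reading the YOLO-format .txt annotations described above; the file name is hypothetical, and standard YOLO stores center coordinates normalized by image size:

```python
# Parse one YOLO-format annotation file from the synthetic dataset.
# Per the description above: class 0 = Normal, 1 = Reversal, 2 = Corrected,
# followed by the box as (x, y, width, height); standard YOLO normalizes
# these to [0, 1] relative to the image dimensions.
CLASS_NAMES = {0: "Normal", 1: "Reversal", 2: "Corrected"}

with open("synthetic_line_0001.txt") as f:  # hypothetical file name
    for line in f:
        cls, x, y, w, h = line.split()
        print(CLASS_NAMES[int(cls)], float(x), float(y), float(w), float(h))
```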
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This is one of two collection records. Please see the link below for the other collection of associated text files.
The two collections together comprise an open clinical dataset of three sets of 10 nursing handover records, very similar to real documents in Australian English. Each record consists of a patient profile, spoken free-form text document, written free-form text document, and written structured document.
This collection contains 3 × 100 spoken free-form audio files in WAV format. Lineage: Data creation included the following steps: generation of patient profiles; creation of written, free-form text documents; development of a structured handover form; using this form and the written, free-form text documents to create written, structured documents; creation of spoken, free-form text documents; using a speech recognition engine with different vocabularies to convert the spoken documents to written, free-form text; and using an information extraction system to fill out the handover form from the written, free-form text documents.
See Suominen et al (2015) in the links below for a detailed description and examples.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a partial release of the SDADDS-Guelma dataset.
SDADDS-Guelma (Synthetic Degraded Arabic Document DataSet of the University of Guelma) is a database of synthetic noisy or degraded Arabic document images. It was created by Dr. Abderrahmane Kefali and his team to support research on preprocessing, analysis, and recognition of degraded Arabic documents, where having a large set of images for training and testing is essential. This dataset is made publicly available to researchers in the field of document analysis and recognition, with the hope that it will be useful and contribute to their research endeavors.
In this first release of the dataset, 84 handwritten images and 120 printed images have been used, along with 25 images of historical backgrounds, forming a total of 26316 synthetic images of degraded Arabic documents along with their corresponding ground-truth files.
This release is separated into two parts to facilitate upload and use: one for the handwritten documents and the second for the printed documents.
Each of the parts of the SDADDS-Guelma dataset is organized into directories as follows:
Ground truth information is essential for a document dataset, as it annotates documents and represents their essential characteristics. Our dataset is designed to be a large-scale and multipurpose dataset. As such, our methodology ensures that ground truth information is provided at three levels: text level (character codes), pixel level (binary and cleaned image), and document physical structure and other annotation information level.
Consequently, each original text image in our dataset is associated with an XML file detailing the entire ground truth and associated metadata.
Each XML annotation file contains metadata about the document image and text content within the image, including the language, number of lines, and font attributes. It also provides detailed information about each text line, word, and Part of Arabic Words (PAWs), including their bounding boxes and textual transcriptions.
Thus, each ground truth file takes the following form:
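The ground-truth form itself is not reproduced here. As a hedged illustration based only on the fields described above (language, line counts, font attributes, and per-line/word/PAW bounding boxes with transcriptions), an annotation file might look roughly like this, with all element and attribute names hypothetical:

```xml
<!-- Hypothetical sketch only: element names are illustrative,
     reflecting the fields described above, not the actual schema. -->
<document image="doc_0001.png" language="Arabic" lines="2" font="...">
  <line bbox="10,20,500,60" text="...">
    <word bbox="10,20,120,60" text="...">
      <paw bbox="10,20,55,60" text="..."/>
    </word>
  </line>
</document>
```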
Name: Dr. Abderrahmane Kefali
Affiliation: University of 8 May 1945-Guelma, Algeria
Email: kefali.abderrahmane@univ-guelma.dz
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artificial Intelligence (AI) has emerged as a critical challenge to the authenticity of journalistic content, raising concerns over the ease with which artificially generated articles can mimic human-written news. This study focuses on using machine learning to identify distinguishing features, or “stylistic fingerprints,” of AI-generated and human-authored journalism. By analyzing these unique characteristics, we aim to classify news pieces with high accuracy, enhancing our ability to verify the authenticity of digital news.
To conduct this study, we gathered a balanced dataset of 150 original journalistic articles and their 150 AI-generated counterparts, sourced from popular news websites. A variety of lexical, syntactic, and readability features were extracted from each article to serve as input data for training machine learning models. Five classifiers were then trained to evaluate how accurately they could distinguish between authentic and artificial articles, with each model learning specific patterns and variations in writing style.
In addition to model training, BERTopic, a topic modeling technique, was applied to extract salient keywords from the journalistic articles. These keywords were used to prompt Google’s Gemini, an AI text generation model, to create artificial articles on the same topics as the original human-written pieces. This ensured a high level of relevance between authentic and AI-generated articles, which added complexity to the classification task.
Among the five classifiers tested, the Random Forest model delivered the best performance, achieving an accuracy of 98.3% along with high precision (0.984), recall (0.983), and F1-score (0.983). Feature importance analyses were conducted using methods like Random Forest Feature Importance, Analysis of Variance (ANOVA), Mutual Information, and Recursive Feature Elimination. This analysis revealed that the top five discriminative features were sentence length range, paragraph length coefficient of variation, verb ratio, sentence complexity tags, and paragraph length range. These features appeared to encapsulate subtle but meaningful stylistic differences between human and AI-generated content.
This research makes a significant contribution to combating disinformation by offering a robust method for authenticating journalistic content. By employing machine learning to identify subtle linguistic patterns, this study not only advances our understanding of AI in journalism but also enhances the tools available to ensure the credibility of news in the digital age.
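As a hedged sketch of the classification and feature-importance analysis described above (the study's actual features, data, and hyperparameters are not given here), scikit-learn's Random Forest exposes both steps directly; everything below is placeholder data:

```python
# Illustrative Random Forest classification over stylistic features, in the
# spirit of the study above; data, shapes, and hyperparameters are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))      # placeholder: 5 stylistic features per article
y = rng.integers(0, 2, size=300)   # placeholder labels: 0 = human, 1 = AI

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
print("feature importances:", clf.feature_importances_)
```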