GNHK Synthetic OCR Dataset
Overview
Welcome to the GNHK Synthetic OCR Dataset repository. Here I have generated synthetic data using the GNHK dataset and open-source LLMs like Mixtral. The dataset contains queries on the images and their corresponding answers.
What's Inside?
Dataset Folder: The Dataset folder contains the images; corresponding to each image, there is a JSON file that carries the OCR information for that image
Parquet File: For easy handling and analysis… See the full description on the dataset page: https://huggingface.co/datasets/shreyansh1347/GNHK-Synthetic-OCR-Dataset.
The UTRSet-Synth dataset is introduced as a complementary training resource to the UTRSet-Real Dataset, specifically designed to enhance the effectiveness of Urdu OCR models. It is a high-quality synthetic dataset comprising 20,000 lines that closely resemble real-world representations of Urdu text.
To generate the dataset, a custom-designed synthetic data generation module was employed, offering precise control over variations in crucial factors such as font, text size, colour, resolution, orientation, noise, style, and background. Moreover, the UTRSet-Synth dataset tackles the limitations observed in existing datasets. It addresses the challenge of standardizing fonts by incorporating over 130 diverse Urdu fonts, thoroughly refined to ensure consistent rendering schemes. It overcomes the scarcity of Arabic words, numerals, and Urdu digits by incorporating a significant number of samples representing these elements. Additionally, the dataset is enriched by randomly selecting words from a vocabulary of 100,000 words during the text generation process. As a result, UTRSet-Synth contains a total of 28,187 unique words, with an average word length of 7 characters.
The availability of the UTRSet-Synth dataset, a synthetic dataset that closely emulates real-world variations, addresses the scarcity of comprehensive real-world printed Urdu OCR datasets. By providing researchers with a valuable resource for developing and benchmarking Urdu OCR models, this dataset promotes standardized evaluation and reproducibility, and fosters advancements in the field of Urdu OCR. For more information about the UTRSet-Real and UTRSet-Synth datasets, please refer to the paper "UTRNet: High-Resolution Urdu Text Recognition In Printed Documents".
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This repository contains the corpus necessary for the synthetic data generation of DANIEL, which is available on GitHub and described in the paper "DANIEL: a fast document attention network for information extraction and labelling of handwritten documents", authored by Thomas Constum, Pierrick Tranouez, and Thierry Paquet (LITIS, University of Rouen Normandie).
The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR) and is also accessible on arXiv.
The contents of the archive should be placed in the Datasets/raw directory of the DANIEL codebase.
Contents of the archive:
wiki_en: An English text corpus stored in the Hugging Face datasets library format. Each entry contains the full text of a Wikipedia article.
wiki_en_ner: An English text corpus enriched with named entity annotations following the OntoNotes v5 ontology. Named entities are encoded using special symbols. The corpus is stored in the Hugging Face datasets format, and each entry corresponds to a Wikipedia article with annotated entities.
wiki_fr: A French text corpus for synthetic data generation, also stored in the Hugging Face datasets format. Each entry contains the full text of a French Wikipedia article.
wiki_de.txt: A German text corpus in plain text format, with one sentence per line. The content originates from the Wortschatz Leipzig repository and has been normalized to match the vocabulary used in DANIEL.
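The corpora stored in the Hugging Face datasets format can be read with the datasets library. A minimal, hedged sketch follows; the Datasets/raw path follows the DANIEL layout above, and whether each corpus is stored as a single Dataset or a DatasetDict is not specified here:

```python
# Minimal sketch: load one of the corpora above from the DANIEL layout.
# Field names beyond the "id" shown below are not documented here.
from datasets import load_from_disk

wiki_en = load_from_disk("Datasets/raw/wiki_en")
# If the corpus is stored as a DatasetDict, pick a split first,
# e.g. wiki_en = wiki_en["train"] (split names are an assumption).
record = wiki_en[0]
print(record.keys())  # inspect the available fields
```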
Data format for corpora in Hugging Face datasets structure:
Each record in the datasets follows the dictionary structure below:
{
"id": "
According to our latest research, the global Synthetic Data Generation Engine market size reached USD 1.42 billion in 2024, reflecting a rapidly expanding sector driven by the escalating demand for advanced data solutions. The market is expected to achieve a robust CAGR of 37.8% from 2025 to 2033, propelling it to an estimated value of USD 21.8 billion by 2033. This exceptional growth is primarily fueled by the increasing need for high-quality, privacy-compliant datasets to train artificial intelligence and machine learning models in sectors such as healthcare, BFSI, and IT & telecommunications. As per our latest research, the proliferation of data-centric applications and stringent data privacy regulations are acting as significant catalysts for the adoption of synthetic data generation engines globally.
One of the key growth factors for the synthetic data generation engine market is the mounting emphasis on data privacy and compliance with regulations such as GDPR and CCPA. Organizations are under immense pressure to protect sensitive customer information while still deriving actionable insights from data. Synthetic data generation engines offer a compelling solution by creating artificial datasets that mimic real-world data without exposing personally identifiable information. This not only ensures compliance but also enables organizations to accelerate their AI and analytics initiatives without the constraints of data access or privacy risks. The rising awareness among enterprises about the benefits of synthetic data in mitigating data breaches and regulatory penalties is further propelling market expansion.
Another significant driver is the exponential growth in artificial intelligence and machine learning adoption across industries. Training robust and unbiased models requires vast and diverse datasets, which are often difficult to obtain due to privacy concerns, labeling costs, or data scarcity. Synthetic data generation engines address this challenge by providing scalable and customizable datasets for various applications, including machine learning model training, data augmentation, and fraud detection. The ability to generate balanced and representative data has become a critical enabler for organizations seeking to improve model accuracy, reduce bias, and accelerate time-to-market for AI solutions. This trend is particularly pronounced in sectors such as healthcare, automotive, and finance, where data diversity and privacy are paramount.
Furthermore, the increasing complexity of data types and the need for multi-modal data synthesis are shaping the evolution of the synthetic data generation engine market. With the proliferation of unstructured data in the form of images, videos, audio, and text, organizations are seeking advanced engines capable of generating synthetic data across multiple modalities. This capability enhances the versatility of synthetic data solutions, enabling their application in emerging use cases such as autonomous vehicle simulation, natural language processing, and biometric authentication. The integration of generative AI techniques, such as GANs and diffusion models, is further enhancing the realism and utility of synthetic datasets, expanding the addressable market for synthetic data generation engines.
From a regional perspective, North America continues to dominate the synthetic data generation engine market, accounting for the largest revenue share in 2024. The region's leadership is attributed to the strong presence of technology giants, early adoption of AI and machine learning, and stringent regulatory frameworks. Europe follows closely, driven by robust data privacy regulations and increasing investments in digital transformation. Meanwhile, the Asia Pacific region is emerging as the fastest-growing market, supported by expanding IT infrastructure, government-led AI initiatives, and a burgeoning startup ecosystem. Latin America and the Middle East & Africa are also witnessing gradual adoption, fueled by the growing recognition of synthetic data's potential to overcome data access and privacy challenges.
Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.
What Makes Our Data Unique?
Scale and Coverage:
- A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies.
- Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.
Rich Attributes for Training Models:
- Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights.
- Tailored for training models in NLP, recommendation systems, and predictive algorithms.
Compliance and Quality:
- Fully GDPR and CCPA compliant, providing secure and ethically sourced data.
- Extensive data cleaning and validation processes ensure reliability and accuracy.
Annotation-Ready:
- Pre-structured and formatted datasets that are easily ingestible into AI workflows.
- Ideal for supervised learning with tagging options such as entities, sentiment, or categories.
How Is the Data Sourced?
- Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques.
- Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets.
This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.
Primary Use Cases and Verticals
Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.
Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.
B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.
HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.
How This Product Fits Into Xverum’s Broader Data Offering Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.
Why Choose Xverum?
- Experience and Expertise: A trusted name in structured web data with a proven track record.
- Flexibility: Datasets can be tailored for any AI/ML application.
- Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data.
- Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.
Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.
Contact us for sample datasets or to discuss your specific needs.
According to our latest research, the global synthetic data generation market size reached USD 1.6 billion in 2024, demonstrating robust expansion driven by increasing demand for high-quality, privacy-preserving datasets. The market is projected to grow at a CAGR of 38.2% over the forecast period, reaching USD 19.2 billion by 2033. This remarkable growth trajectory is fueled by the growing adoption of artificial intelligence (AI) and machine learning (ML) technologies across industries, coupled with stringent data privacy regulations that necessitate innovative data solutions. As per our latest research, organizations worldwide are increasingly leveraging synthetic data to address data scarcity, enhance AI model training, and ensure compliance with evolving privacy standards.
One of the primary growth factors for the synthetic data generation market is the rising emphasis on data privacy and regulatory compliance. With the implementation of stringent data protection laws such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States, enterprises are under immense pressure to safeguard sensitive information. Synthetic data offers a compelling solution by enabling organizations to generate artificial datasets that mirror the statistical properties of real data without exposing personally identifiable information. This not only facilitates regulatory compliance but also empowers organizations to innovate without the risk of data breaches or privacy violations. As businesses increasingly recognize the value of privacy-preserving data, the demand for advanced synthetic data generation solutions is set to surge.
Another significant driver is the exponential growth in AI and ML adoption across various sectors, including healthcare, finance, automotive, and retail. High-quality, diverse, and unbiased data is the cornerstone of effective AI model development. However, acquiring such data is often challenging due to privacy concerns, limited availability, or high acquisition costs. Synthetic data generation bridges this gap by providing scalable, customizable datasets tailored to specific use cases, thereby accelerating AI training and reducing dependency on real-world data. Organizations are leveraging synthetic data to enhance algorithm performance, mitigate data bias, and simulate rare events, which are otherwise difficult to capture in real datasets. This capability is particularly valuable in sectors like autonomous vehicles, where training models on rare but critical scenarios is essential for safety and reliability.
Furthermore, the growing complexity of data types—ranging from tabular and image data to text, audio, and video—has amplified the need for versatile synthetic data generation tools. Enterprises are increasingly seeking solutions that can generate multi-modal synthetic datasets to support diverse applications such as fraud detection, product testing, and quality assurance. The flexibility offered by synthetic data generation platforms enables organizations to simulate a wide array of scenarios, test software systems, and validate AI models in controlled environments. This not only enhances operational efficiency but also drives innovation by enabling rapid prototyping and experimentation. As the digital ecosystem continues to evolve, the ability to generate synthetic data across various formats will be a critical differentiator for businesses striving to maintain a competitive edge.
Regionally, North America leads the synthetic data generation market, accounting for the largest revenue share in 2024, followed closely by Europe and Asia Pacific. The dominance of North America can be attributed to the strong presence of technology giants, advanced research institutions, and a favorable regulatory environment that encourages AI innovation. Europe is witnessing rapid growth due to proactive data privacy regulations and increasing investments in digital transformation initiatives. Meanwhile, Asia Pacific is emerging as a high-growth region, driven by the proliferation of digital technologies and rising adoption of AI-powered solutions across industries. Latin America and the Middle East & Africa are also expected to experience steady growth, supported by government-led digitalization programs and expanding IT infrastructure.
Dataset Card for "MJSynth_text_recognition"
This is the MJSynth dataset for text recognition on document images, synthetically generated, covering 90K English words. It includes training, validation and test splits. Source of the dataset: https://www.robots.ox.ac.uk/~vgg/data/text/ Use dataset streaming functionality to try out the dataset quickly without downloading the entire dataset (refer: https://huggingface.co/docs/datasets/stream) Citation details provided on the source… See the full description on the dataset page: https://huggingface.co/datasets/priyank-m/MJSynth_text_recognition.
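The streaming functionality mentioned above lets you iterate over records without downloading the full dataset. A minimal sketch; the repo id comes from the dataset page, and the split name "train" is assumed from the train/validation/test description:

```python
# Minimal sketch: stream the dataset instead of downloading it entirely.
from datasets import load_dataset

ds = load_dataset("priyank-m/MJSynth_text_recognition",
                  split="train", streaming=True)
sample = next(iter(ds))  # fetches just the first record over the network
print(sample.keys())
```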
https://www.htfmarketinsights.com/privacy-policy
The global Synthetic Data Generation market is segmented by Application (AI training, Software testing, Fraud detection, Privacy preservation, Autonomous driving), Type (Tabular, Image, Video, Text, Time-series), and Geography (North America, LATAM, West Europe, Central & Eastern Europe, Northern Europe, Southern Europe, East Asia, Southeast Asia, South Asia, Central Asia, Oceania, MEA)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Intelligent Invoice Management System
Project Description:
The Intelligent Invoice Management System is an advanced AI-powered platform designed to revolutionize traditional invoice processing. By automating the extraction, validation, and management of invoice data, this system addresses the inefficiencies, inaccuracies, and high costs associated with manual methods. It enables businesses to streamline operations, reduce human error, and expedite payment cycles.
Problem Statement:
Manual invoice processing involves labor-intensive tasks such as data entry, verification, and reconciliation. These processes are time-consuming, prone to errors, and can result in financial losses and delays. The diversity of invoice formats from various vendors adds complexity, making automation a critical need for efficiency and scalability.
Proposed Solution:
The Intelligent Invoice Management System automates the end-to-end process of invoice handling using AI and machine learning techniques. Core functionalities include:
1. Invoice Generation: Automatically generate PDF invoices in at least four formats, populated with synthetic data (a generation sketch follows this list).
2. Data Development: Leverage a dataset containing fields such as receipt numbers, company details, sales tax information, and itemized tables to create realistic invoice samples.
3. AI-Powered Labeling: Use Tesseract OCR to extract labeled data from invoice images, and train YOLO for label recognition, ensuring precise identification of fields.
4. Database Integration: Store extracted information in a structured database for seamless retrieval and analysis.
5. Web-Based Information System: Provide a user-friendly platform to upload invoices and retrieve key metrics, such as:
- Total sales within a specified duration.
- Total sales tax paid during a given timeframe.
- Detailed invoice information in tabular form for specific date ranges.
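As referenced in item 1, here is a minimal, hedged sketch of how such synthetic invoices might be generated. The project does not name its libraries, so faker (fake company data) and reportlab (PDF rendering) are illustrative choices:

```python
# Hypothetical sketch of synthetic invoice generation; faker and reportlab
# are illustrative, not the project's stated tooling.
from faker import Faker
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

fake = Faker()

def generate_invoice(path: str) -> None:
    c = canvas.Canvas(path, pagesize=A4)
    c.drawString(50, 800, f"Invoice No: {fake.uuid4()}")
    c.drawString(50, 780, f"Company: {fake.company()}")
    c.drawString(50, 760, f"Date: {fake.date()}")
    # Itemized table: four items per invoice, as in the deliverables.
    y = 720
    for i in range(4):
        c.drawString(50, y, f"Item {i + 1}: {fake.word()}  qty=1  price={fake.pricetag()}")
        y -= 20
    c.drawString(50, y - 20, "Sales tax: 17%")  # illustrative rate
    c.save()

generate_invoice("invoice_0001.pdf")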
Key Features and Deliverables:
1. Invoice Generation:
- Generate 20,000 invoices using an automated script.
- Include dummy logos, company details, and itemized tables for four items per invoice.
2. Label Definition and Format:
3. OCR and AI Training:
4. Database Management:
5. Web-Based Interface:
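For the OCR and AI training deliverable, a hedged sketch of extracting word-level text and bounding boxes from a rendered invoice image with Tesseract follows. The pytesseract wrapper and the file name are assumptions; the project's exact pipeline is not specified above:

```python
# Sketch of the Tesseract extraction step: word-level text plus bounding
# boxes, which could then seed YOLO training labels.
from PIL import Image
import pytesseract

img = Image.open("invoice_0001.png")  # hypothetical rendered invoice image
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

for text, x, y, w, h in zip(data["text"], data["left"], data["top"],
                            data["width"], data["height"]):
    if text.strip():
        print(f"{text!r} at box ({x}, {y}, {w}, {h})")
```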
Expected Outcomes:
- Reduction in manual effort and operational costs.
- Improved accuracy in invoice processing and financial reporting.
- Enhanced scalability and adaptability for diverse invoice formats.
- Faster turnaround time for invoice-related tasks.
By automating critical aspects of invoice management, this system delivers a robust and intelligent solution to meet the evolving needs of businesses.
Synthetic Text Dataset for 10 African Languages
This dataset contains synthetic text generated using large language models for ten African languages. It is intended to support research and evaluation in automatic speech recognition (ASR), natural language processing (NLP), and related fields for low-resource languages.
Data Generation and Licensing
I acknowledge that this dataset contains synthetic data generated through the process described in this paper. It is not… See the full description on the dataset page: https://huggingface.co/datasets/CLEAR-Global/Synthetic-Text.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Myanmar OCR Dataset
A synthetic dataset for training and fine-tuning Optical Character Recognition (OCR) models specifically for the Myanmar language.
Dataset Description
This dataset contains synthetically generated OCR images created specifically for Myanmar text recognition tasks. The images were generated using myanmar-ocr-data-generator, a fork of TextRecognitionDataGenerator with fixes for proper Myanmar character splitting.
Direct Download
Available… See the full description on the dataset page: https://huggingface.co/datasets/chuuhtetnaing/myanmar-ocr-dataset.
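As a hedged illustration of how such images are produced, the upstream TextRecognitionDataGenerator exposes a Python API like the following; the myanmar-ocr-data-generator fork referenced above may differ in details such as fonts and character splitting:

```python
# Sketch using the upstream TextRecognitionDataGenerator API; the
# Myanmar fork may expose different options.
from trdg.generators import GeneratorFromStrings

generator = GeneratorFromStrings(
    ["မင်္ဂလာပါ"],   # sample Myanmar text; real runs would use a large corpus
    count=10,        # number of images to synthesize
)

for i, (image, label) in enumerate(generator):
    image.save(f"myanmar_{i:04d}.png")
    # keep (filename, label) pairs as ground truth, e.g. in a labels.tsv
```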
According to our latest research, the synthetic data market size reached USD 1.52 billion in 2024, reflecting robust growth driven by increasing demand for privacy-preserving data and the acceleration of AI and machine learning initiatives across industries. The market is projected to expand at a compelling CAGR of 34.7% from 2025 to 2033, with the forecasted market size expected to reach USD 21.4 billion by 2033. Key growth factors include the rising necessity for high-quality, diverse, and privacy-compliant datasets, the proliferation of AI-driven applications, and stringent data protection regulations worldwide.
The primary growth driver for the synthetic data market is the escalating need for advanced data privacy and compliance. Organizations across sectors such as healthcare, BFSI, and government are under increasing pressure to comply with regulations like GDPR, HIPAA, and CCPA. Synthetic data offers a viable solution by enabling the creation of realistic yet anonymized datasets, thus mitigating the risk of data breaches and privacy violations. This capability is especially crucial for industries handling sensitive personal and financial information, where traditional data anonymization techniques often fall short. As regulatory scrutiny intensifies, the adoption of synthetic data solutions is set to expand rapidly, ensuring organizations can leverage data-driven innovation without compromising on privacy or compliance.
Another significant factor propelling the synthetic data market is the surge in AI and machine learning deployment across enterprises. AI models require vast, diverse, and high-quality datasets for effective training and validation. However, real-world data is often scarce, incomplete, or biased, limiting the performance of these models. Synthetic data addresses these challenges by generating tailored datasets that represent a wide range of scenarios and edge cases. This not only enhances the accuracy and robustness of AI systems but also accelerates the development cycle by reducing dependencies on real data collection and labeling. As the demand for intelligent automation and predictive analytics grows, synthetic data is emerging as a foundational enabler for next-generation AI applications.
In addition to privacy and AI training, synthetic data is gaining traction in test data management and fraud detection. Enterprises are increasingly leveraging synthetic datasets to simulate complex business environments, test software systems, and identify vulnerabilities in a controlled manner. In fraud detection, synthetic data allows organizations to model and anticipate new fraudulent behaviors without exposing sensitive customer data. This versatility is driving adoption across diverse verticals, from automotive and manufacturing to retail and telecommunications. As digital transformation initiatives intensify and the need for robust data testing environments grows, the synthetic data market is poised for sustained expansion.
Regionally, North America dominates the synthetic data market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The strong presence of technology giants, a mature AI ecosystem, and early regulatory adoption are key factors supporting North America’s leadership. Meanwhile, Asia Pacific is witnessing the fastest growth, driven by rapid digitalization, expanding AI investments, and increasing awareness of data privacy. Europe continues to see steady adoption, particularly in sectors like healthcare and finance where data protection regulations are stringent. Latin America and the Middle East & Africa are also emerging as promising markets, albeit at a nascent stage, as organizations in these regions begin to recognize the value of synthetic data for digital innovation and compliance.
The synthetic data market is segmented by component into software and services. The software segment currently holds the largest market
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Paper2Fig100k dataset
A dataset with over 100k images of figures and text captions from research papers. The figures display diagrams, methodologies, and architectures of research papers on arXiv.org. We also provide text captions for each figure, as well as OCR detections and recognitions on the figures (bounding boxes and texts).
The dataset structure consists of a directory called "figures" and two JSON files (train and test) that contain data for each figure. Each JSON object contains the following information about a figure:
Take a look at the OCR-VQGAN GitHub repository, which uses the Paper2Fig100k dataset to train an image encoder for figures and diagrams that uses an OCR perceptual loss to render clear and readable text inside images.
The dataset is explained in more detail in the paper "OCR-VQGAN: Taming Text-within-Image Generation" (WACV 2023).
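The per-figure fields are not enumerated above, so the following hedged sketch simply loads one of the JSON files and inspects the keys; the file name is an assumption:

```python
# Sketch: inspect the per-figure records in the train JSON file.
# The file name and field names are assumptions; only the figures/
# directory and the two JSON files are documented above.
import json

with open("paper2fig_train.json") as f:
    figures = json.load(f)

print(len(figures))       # number of figure records
print(figures[0].keys())  # discover the per-figure fields
```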
Paper abstract
Synthetic image generation has recently experienced significant improvements in domains such as natural image or art generation. However, the problem of figure and diagram generation remains unexplored. A challenging aspect of generating figures and diagrams is effectively rendering readable texts within the images. To alleviate this problem, we present OCR-VQGAN, an image encoder and decoder that leverages OCR pre-trained features to optimize a text perceptual loss, encouraging the architecture to preserve high-fidelity text and diagram structure. To explore our approach, we introduce the Paper2Fig100k dataset, with over 100k images of figures and texts from research papers. The figures show architecture diagrams and methodologies of articles available at arXiv.org from fields like artificial intelligence and computer vision. Figures usually include text and discrete objects, e.g., boxes in a diagram, with lines and arrows that connect them. We demonstrate the superiority of our method by conducting several experiments on the task of figure reconstruction. Additionally, we explore the qualitative and quantitative impact of weighting different perceptual metrics in the overall loss function.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us to win the competition. It facilitates a text-classification task to separate LLM-generated essays from student-written ones.
It was developed in an incremental way focusing on size, diversity and complexity. For each datamix iteration, we attempted to plug blindspots of the previous generation models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open-source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM-generated text datasets: DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset
- A synthetic dataset made by T5
We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following:
- Contrastive search
- Use of guidance scale, typical_p, and suppress_tokens
- High temperature and large values of top-k
- Prompting to fill in the blanks: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
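To make the decoding strategies above concrete, here is a hedged sketch of how a few of them map onto Hugging Face transformers generation parameters; the model name and parameter values are illustrative, not the competition's settings:

```python
# Illustrative decoding configs with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Write an essay about car-free cities.", return_tensors="pt")

# Contrastive search: deterministic decoding that penalizes degeneration.
out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=200)

# High temperature with a large top-k, plus typical_p and suppressed tokens.
out = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.4,
    top_k=256,
    typical_p=0.9,
    suppress_tokens=[tok.eos_token_id],  # e.g., discourage early stopping
    max_new_tokens=200,
)
print(tok.decode(out[0], skip_special_tokens=True))
```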
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introduction of obfuscations
- Back-translation
- Random capitalization
- Sentence swapping
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward to perform. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Also, forged media are getting more and more complex, with manipulated videos (e.g., deepfakes where both the visual and audio contents can be counterfeited) taking the scene over from still images.
The multimedia forensic community has addressed the possible threats that this situation could imply by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools only analyze one modality at a time.
This was not a problem as long as still images were considered the most widely edited media, but now, since manipulated videos are becoming customary, performing monomodal analyses could be reductive. Nonetheless, there is a lack in the literature regarding multimodal detectors (systems that consider both audio and video components). This is due to the difficulty of developing them but also to the scarcity of datasets containing forged multimodal data to train and test the designed algorithms.
In this paper we focus on the generation of an audio-visual deepfake dataset.
First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio) and multimodal (i.e., audio and video) conditions.
This highlights the need for multimodal forensic detectors and more multimodal deepfake data.
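As a hedged illustration of the DTW alignment step in the pipeline above (the paper's exact implementation is not given here), MFCC features of the synthetic and reference tracks can be aligned with librosa:

```python
# Sketch of DTW alignment between a TTS track and the reference speech,
# in the spirit of the pipeline described above; librosa and the feature
# choices are illustrative, not the paper's implementation.
import librosa

ref, sr = librosa.load("reference_speech.wav", sr=16000)
tts, _ = librosa.load("tts_output.wav", sr=16000)

ref_mfcc = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=13)
tts_mfcc = librosa.feature.mfcc(y=tts, sr=sr, n_mfcc=13)

# Accumulated cost matrix D and the optimal warping path wp.
D, wp = librosa.sequence.dtw(X=ref_mfcc, Y=tts_mfcc)
print("alignment cost:", D[-1, -1], "path length:", len(wp))
```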
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the official synthetic dataset used to train the GLiNER multi-task model. The dataset is a list of dictionaries, each consisting of tokenized text with named entity recognition (NER) information. Each item consists of two main components:
'tokenized_text': A list of individual words and punctuation marks from the original text, split into tokens.
'ner': A list of lists containing named entity recognition information. Each inner list has three elements:
Start index of the named entity in the… See the full description on the dataset page: https://huggingface.co/datasets/knowledgator/GLINER-multi-task-synthetic-data.
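The description above is truncated. Assuming the two remaining elements of each inner list are the end token index and the entity label (an assumption based on common conventions for such datasets, not confirmed here), a record might look like this:

```python
# Hypothetical record, hedged: only the first element (start index) is
# documented above; the end index and entity label are assumed.
record = {
    "tokenized_text": ["Acme", "Corp", "hired", "Jane", "Doe", "."],
    "ner": [
        [0, 1, "organization"],  # "Acme Corp"
        [3, 4, "person"],        # "Jane Doe"
    ],
}
```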
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This synthetic dataset has been generated to facilitate object detection (in YOLO format) for research on dyslexia-related handwriting patterns. It builds upon an original corpus of uppercase and lowercase letters obtained from multiple sources: the NIST Special Database 19 [1], the Kaggle dataset “A-Z Handwritten Alphabets in .csv format” [2], as well as handwriting samples from dyslexic primary school children of Seberang Jaya, Penang (Malaysia).
In the original dataset, uppercase letters originated from NIST Special Database 19, while lowercase letters came from the Kaggle dataset curated by S. Patel. Additional images (categorized as Normal, Reversal, and Corrected) were collected and labeled based on handwriting samples of dyslexic and non-dyslexic students, resulting in:
Building upon this foundation, the Synthetic Dyslexia Handwriting Dataset presented here was programmatically generated to produce labeled examples suitable for training and validating object detection models. Each synthetic image arranges multiple letters of various classes (Normal, Reversal, Corrected) in a “text line” style on a black background, providing YOLO-compatible .txt annotations that specify a bounding box for each letter. Each box is given as (x, y, width, height) in YOLO format, with class IDs 0 = Normal, 1 = Reversal, and 2 = Corrected.
If you are using this synthetic dataset or the original Dyslexia Handwriting Dataset, please cite the following papers:
[1] P. J. Grother, “NIST Special Database 19,” NIST, 2016. [Online]. Available: https://www.nist.gov/srd/nist-special-database-19
[2] S. Patel, “A-Z Handwritten Alphabets in .csv format,” Kaggle, 2017. [Online]. Available: https://www.kaggle.com/sachinpatel21/az-handwritten-alphabets-in-csv-format
Researchers and practitioners are encouraged to integrate this synthetic dataset into their computer vision pipelines for tasks such as dyslexia pattern analysis, character recognition, and educational technology development. Please cite the original authors and publications if you utilize this synthetic dataset in your work.
The original RAR file was password-protected with the password: WanAsy321. This synthetic dataset, however, is provided openly for streamlined usage.
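A hedged sketch of reading the YOLO-format .txt annotations described above; the file name is hypothetical, and standard YOLO stores center coordinates normalized by image size:

```python
# Parse one YOLO-format annotation file from the synthetic dataset.
# Per the description above: class 0 = Normal, 1 = Reversal, 2 = Corrected,
# followed by the box as (x, y, width, height); standard YOLO normalizes
# these to [0, 1] relative to the image dimensions.
CLASS_NAMES = {0: "Normal", 1: "Reversal", 2: "Corrected"}

with open("synthetic_line_0001.txt") as f:  # hypothetical file name
    for line in f:
        cls, x, y, w, h = line.split()
        print(CLASS_NAMES[int(cls)], float(x), float(y), float(w), float(h))
```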
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This is one of two collection records. Please see the link below for the other collection of associated text files.
The two collections together comprise an open clinical dataset of three sets of 10 nursing handover records, very similar to real documents in Australian English. Each record consists of a patient profile, spoken free-form text document, written free-form text document, and written structured document.
This collection contains 3 × 100 spoken free-form audio files in WAV format. Lineage: Data creation included the following steps: generation of patient profiles; creation of written, free-form text documents; development of a structured handover form; using this form and the written, free-form text documents to create written, structured documents; creation of spoken, free-form text documents; using a speech recognition engine with different vocabularies to convert the spoken documents to written, free-form text; and using an information extraction system to fill out the handover form from the written, free-form text documents.
See Suominen et al (2015) in the links below for a detailed description and examples.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a partial release of the SDADDS-Guelma dataset.
SDADDS-Guelma (Synthetic Degraded Arabic Document DataSet of the University of Guelma) is a database of synthetic noisy or degraded Arabic document images. It was created by Dr. Abderrahmane Kefali and his team to support research on preprocessing, analysis, and recognition of degraded Arabic documents, where having a large set of images for training and testing is essential. This dataset is made publicly available to researchers in the field of document analysis and recognition, with the hope that it will be useful and contribute to their research endeavors.
In this first release of the dataset, 84 handwritten images and 120 printed images have been used, along with 25 images of historical backgrounds, forming a total of 26316 synthetic images of degraded Arabic documents along with their corresponding ground-truth files.
This release is separated into two parts to facilitate upload and use: one for the handwritten documents and the second for the printed documents.
Each of the parts of the SDADDS-Guelma dataset is organized into directories as follows:
Ground truth information is essential for a document dataset, as it annotates documents and represents their essential characteristics. Our dataset is designed to be a large-scale and multipurpose dataset. As such, our methodology ensures that ground truth information is provided at three levels: text level (character codes), pixel level (binary and cleaned image), and document physical structure and other annotation information level.
Consequently, each original text image in our dataset is associated with an XML file detailing the entire ground truth and associated metadata.
Each XML annotation file contains metadata about the document image and text content within the image, including the language, number of lines, and font attributes. It also provides detailed information about each text line, word, and Part of Arabic Words (PAWs), including their bounding boxes and textual transcriptions.
Thus, each ground truth file takes the following form:
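The ground-truth form itself is not reproduced here. As a hedged illustration based only on the fields described above (language, line counts, font attributes, and per-line/word/PAW bounding boxes with transcriptions), an annotation file might look roughly like this, with all element and attribute names hypothetical:

```xml
<!-- Hypothetical sketch only: element names are illustrative,
     reflecting the fields described above, not the actual schema. -->
<document image="doc_0001.png" language="Arabic" lines="2" font="...">
  <line bbox="10,20,500,60" text="...">
    <word bbox="10,20,120,60" text="...">
      <paw bbox="10,20,55,60" text="..."/>
    </word>
  </line>
</document>
```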
Name: Dr. Abderrahmane Kefali
Affiliation: University of 8 May 1945-Guelma, Algeria
Email: kefali.abderrahmane@univ-guelma.dz
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artificial Intelligence (AI) has emerged as a critical challenge to the authenticity of journalistic content, raising concerns over the ease with which artificially generated articles can mimic human-written news. This study focuses on using machine learning to identify distinguishing features, or “stylistic fingerprints,” of AI-generated and human-authored journalism. By analyzing these unique characteristics, we aim to classify news pieces with high accuracy, enhancing our ability to verify the authenticity of digital news.
To conduct this study, we gathered a balanced dataset of 150 original journalistic articles and their 150 AI-generated counterparts, sourced from popular news websites. A variety of lexical, syntactic, and readability features were extracted from each article to serve as input data for training machine learning models. Five classifiers were then trained to evaluate how accurately they could distinguish between authentic and artificial articles, with each model learning specific patterns and variations in writing style.
In addition to model training, BERTopic, a topic modeling technique, was applied to extract salient keywords from the journalistic articles. These keywords were used to prompt Google’s Gemini, an AI text generation model, to create artificial articles on the same topics as the original human-written pieces. This ensured a high level of relevance between authentic and AI-generated articles, which added complexity to the classification task.
Among the five classifiers tested, the Random Forest model delivered the best performance, achieving an accuracy of 98.3% along with high precision (0.984), recall (0.983), and F1-score (0.983). Feature importance analyses were conducted using methods like Random Forest Feature Importance, Analysis of Variance (ANOVA), Mutual Information, and Recursive Feature Elimination. This analysis revealed that the top five discriminative features were sentence length range, paragraph length coefficient of variation, verb ratio, sentence complexity tags, and paragraph length range. These features appeared to encapsulate subtle but meaningful stylistic differences between human and AI-generated content.
This research makes a significant contribution to combating disinformation by offering a robust method for authenticating journalistic content. By employing machine learning to identify subtle linguistic patterns, this study not only advances our understanding of AI in journalism but also enhances the tools available to ensure the credibility of news in the digital age.
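As a hedged sketch of the classification and feature-importance analysis described above (the study's actual features, data, and hyperparameters are not given here), scikit-learn's Random Forest exposes both steps directly; everything below is placeholder data:

```python
# Illustrative Random Forest classification over stylistic features, in the
# spirit of the study above; data, shapes, and hyperparameters are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))      # placeholder: 5 stylistic features per article
y = rng.integers(0, 2, size=300)   # placeholder labels: 0 = human, 1 = AI

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
print("feature importances:", clf.feature_importances_)
```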