41 datasets found
  1. Best Books Ever Dataset

    • zenodo.org
    csv
    Updated Nov 10, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
    Explore at:
    csvAvailable download formats
    Dataset updated
    Nov 10, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset has been collected in the frame of the Prac1 of the subject Tipology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).

    The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the larges list on the site).

    Original code used to retrieve the dataset can be found on github repository: github.com/scostap/goodreads_bbe_dataset

    The data was retrieved in two sets, the first 30000 books and then the remainig 22478. Dates were not parsed and reformated on the second chunk so publishDate and firstPublishDate are representet in a mm/dd/yyyy format for the first 30000 records and Month Day Year for the rest.

    Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.

    The 25 fields of the dataset are:

    | Attributes | Definition | Completeness |
    | ------------- | ------------- | ------------- | 
    | bookId | Book Identifier as in goodreads.com | 100 |
    | title | Book title | 100 |
    | series | Series Name | 45 |
    | author | Book's Author | 100 |
    | rating | Global goodreads rating | 100 |
    | description | Book's description | 97 |
    | language | Book's language | 93 |
    | isbn | Book's ISBN | 92 |
    | genres | Book's genres | 91 |
    | characters | Main characters | 26 |
    | bookFormat | Type of binding | 97 |
    | edition | Type of edition (ex. Anniversary Edition) | 9 |
    | pages | Number of pages | 96 |
    | publisher | Editorial | 93 |
    | publishDate | publication date | 98 |
    | firstPublishDate | Publication date of first edition | 59 |
    | awards | List of awards | 20 |
    | numRatings | Number of total ratings | 100 |
    | ratingsByStars | Number of ratings by stars | 97 |
    | likedPercent | Derived field, percent of ratings over 2 starts (as in GoodReads) | 99 |
    | setting | Story setting | 22 |
    | coverImg | URL to cover image | 99 |
    | bbeScore | Score in Best Books Ever list | 100 |
    | bbeVotes | Number of votes in Best Books Ever list | 100 |
    | price | Book's price (extracted from Iberlibro) | 73 |

  2. F

    Bahasa Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Bahasa Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/bahasa-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Bahasa Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Bahasa language.

    Dataset Contain & Diversity:

    Containing a total of 5000 images, this Bahasa OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcases distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition model we allow limited (less than five) unique images from a single resource. Stringent measures have been taken to exclude any personal identifiable information (PII), and in each image a minimum of 80% space is contain visible Bahasa text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

    All these images were captured by native Bahasa people to ensure the text quality, avoid toxic content and PII text. We used latest iOS and android mobile devices above 5MP camera to click all these images to maintain the image quality. In this training dataset images are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata like device information, source type like newspaper, magazine or book image, and image type like portrait or landscape etc. Each image is properly renamed corresponding to the metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Bahasa text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Bahasa crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Bahasa language. Your journey to enhanced language understanding and processing starts here.

  3. F

    Arabic Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Arabic Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/arabic-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Arabic Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Arabic language.

    Dataset Contain & Diversity:

    Containing a total of 5000 images, this Arabic OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcases distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition model we allow limited (less than five) unique images from a single resource. Stringent measures have been taken to exclude any personal identifiable information (PII), and in each image a minimum of 80% space is contain visible Arabic text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

    All these images were captured by native Arabic people to ensure the text quality, avoid toxic content and PII text. We used latest iOS and android mobile devices above 5MP camera to click all these images to maintain the image quality. In this training dataset images are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata like device information, source type like newspaper, magazine or book image, and image type like portrait or landscape etc. Each image is properly renamed corresponding to the metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Arabic text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Arabic crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Arabic language. Your journey to enhanced language understanding and processing starts here.

  4. F

    Punjabi Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Punjabi Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/punjabi-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Punjabi Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Punjabi language.

    Dataset Contain & Diversity:

    Containing a total of 5000 images, this Punjabi OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcases distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition model we allow limited (less than five) unique images from a single resource. Stringent measures have been taken to exclude any personal identifiable information (PII), and in each image a minimum of 80% space is contain visible Punjabi text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

    All these images were captured by native Punjabi people to ensure the text quality, avoid toxic content and PII text. We used latest iOS and android mobile devices above 5MP camera to click all these images to maintain the image quality. In this training dataset images are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata like device information, source type like newspaper, magazine or book image, and image type like portrait or landscape etc. Each image is properly renamed corresponding to the metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Punjabi text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Punjabi crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Punjabi language. Your journey to enhanced language understanding and processing starts here.

  5. u

    PDMX

    • cseweb.ucsd.edu
    json
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    UCSD CSE Research Project, PDMX [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Explore at:
    jsonAvailable download formats
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.

  6. F

    Bengali Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Bengali Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/bengali-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Bengali Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Bengali language.

    Dataset Contain & Diversity:

    Containing a total of 5000 images, this Bengali OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcases distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition model we allow limited (less than five) unique images from a single resource. Stringent measures have been taken to exclude any personal identifiable information (PII), and in each image a minimum of 80% space is contain visible Bengali text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

    All these images were captured by native Bengali people to ensure the text quality, avoid toxic content and PII text. We used latest iOS and android mobile devices above 5MP camera to click all these images to maintain the image quality. In this training dataset images are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata like device information, source type like newspaper, magazine or book image, and image type like portrait or landscape etc. Each image is properly renamed corresponding to the metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Bengali text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Bengali crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Bengali language. Your journey to enhanced language understanding and processing starts here.

  7. f

    Word Dataset (Sword6k)

    • figshare.com
    zip
    Updated Jan 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Payel Sengupta; Ayatullah Faruk Mollah (2024). Word Dataset (Sword6k) [Dataset]. http://doi.org/10.6084/m9.figshare.21523479.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 18, 2024
    Dataset provided by
    figshare
    Authors
    Payel Sengupta; Ayatullah Faruk Mollah
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Roman word dataset of scene word text images named Scene Word Dataset (SWord6k) is developed for character segmentation, Character recognition, text detection, and script identification. All images of SWord6k datasets are captured from an outdoor environment. These images are collected from banners, advertisements, shop names, and posters from different sources, like shopping malls, book fairs, and puja pandals. The SWord6k dataset contains 6,661 scene word images in "png" file format. Three types of ground truth annotations are composed for the SWord6k dataset, viz (i) component level i.e. to check whether a component is text or not, (ii) Script level i.e. to identify the script, (iii) recognition level i.e., character/word recognition of text. Each image's ground truth level annotations are stored in XML(extensible markup language) file format.

  8. Z

    FRLL-Morphs

    • data.niaid.nih.gov
    Updated Jul 7, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sarkar, Eklavya (2022). FRLL-Morphs [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4415159
    Explore at:
    Dataset updated
    Jul 7, 2022
    Dataset provided by
    Korshunov, Pavel
    Sarkar, Eklavya
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    DISCLAIMER

    The preprocess.py script included in this dataset is no longer necessary, and thus should NOT be run.

    Database Description

    FRLL-Morphs is a dataset of morphed faces based on images selected from the publicly available Face Research London Lab dataset. We created 4 types of morphs for each pre-selected pair of images using the following morphing tools:

    OpenCV

    FaceMorpher

    StyleGAN 2

    WebMorpher

    Instructions

    This dataset is planned for vulnerability analysis experiments in the context of face recognition. Therefore, it is intended to be used in conjunction with the original Face Research London Lab dataset.

    To prepare this folder's file structure so it may easily be used for such experiments:

    Download and extract only the neutral_front and smiling_front datasets from the Face London Research Dataset.

    Place them in a new facelab_london/raw folder.

    Rename them simply as neutral and smiling respectively.

    Remove the .tem files from the neutral folder if not specifically required for any experiments as these could clash with other operations.

    Once completed the directory's structure should be as given below:

    +-- facelab_london | +-- morph_amsl | +-- morph_facemorpher | +-- morph_opencv | +-- morph_stylegan | +-- morph_webmorph | +-- raw | +-- protocols | +-- preprocess.py | +-- README.txt

    Protocols

    The vulnerability analysis can be conducted in two ways, using:

    morphed images as references (reverse-protocol)

    morphed images as probes (scores-protocol)

    The protocols for both types of experiments are provided in the protocols folder, each of which contains the file lists of detailing the exact images used as references (for_models.lst) and as probes (for_probes.lst) for each morphing tool.

    The data is split into two sets, development (dev) and evaluation (eval), in order to be easily used by a toolkit such as bob (https://www.idiap.ch/software/bob). The split is made such as that no original identities used to make the morphed images overlap with one another, make the two sets completely independent of one another.

    References

    Any publication (eg. conference paper, journal article, technical report, book chapter, etc) resulting from the usage of FRLL_Morphs must cite the following papers:

    @INPROCEEDINGS{9746477, author = {Sarkar, Eklavya and Korshunov, Pavel and Colbois, Laurent and Marcel, Sébastien}, booktitle = {ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, title = {Are GAN-based morphs threatening face recognition?}, year={2022}, pages={2959-2963}, url={https://doi.org/10.1109/ICASSP43922.2022.9746477} doi={10.1109/ICASSP43922.2022.9746477} }

    @article{Sarkar2020, title={Vulnerability Analysis of Face Morphing Attacks from Landmarks and Generative Adversarial Networks}, author={Eklavya Sarkar and Pavel Korshunov and Laurent Colbois and S\'{e}bastien Marcel}, year={2020}, month=oct, journal={arXiv preprint}, url={https://arxiv.org/abs/2012.05344} }

    Any publication (eg. conference paper, journal article, technical report, book chapter, etc) resulting from the usage of Face Research London Lab must cite the following source:

    @misc{debruine_jones_2017, title={Face Research Lab London Set}, url={https://figshare.com/articles/dataset/Face_Research_Lab_London_Set/5047666/3}, DOI={10.6084/m9.figshare.5047666.v3}, publisher={figshare}, author={DeBruine, Lisa and Jones, Benedict}, year={2017}, month={May} }

    Any publication (eg. conference paper, journal article, technical report, book chapter, etc) resulting from the usage of Advanced Multimedia Security Lab’s (AMSL) Face Morph Image dataset must cite the following source:

    @article{https://doi.org/10.1049/iet-bmt.2017.0147, author={Neubert, Tom and Makrushin, Andrey and Hildebrandt, Mario and Kraetzer, Christian and Dittmann, Jana}, title={Extended StirTrace benchmarking of biometric and forensic qualities of morphed face images}, journal={IET Biometrics}, volume={7}, number={4}, pages={325-332}, doi={https://doi.org/10.1049/iet-bmt.2017.0147}, url={https://ietresearch.onlinelibrary.wiley.com/doi/abs/10.1049/iet-bmt.2017.0147}, eprint={https://ietresearch.onlinelibrary.wiley.com/doi/pdf/10.1049/iet-bmt.2017.0147}, year={2018} }

  9. F

    French Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). French Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/french-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    French
    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the French Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the French language.

    Dataset Contain & Diversity:

    Containing a total of 5000 images, this French OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcases distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition model we allow limited (less than five) unique images from a single resource. Stringent measures have been taken to exclude any personal identifiable information (PII), and in each image a minimum of 80% space is contain visible French text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

    All these images were captured by native French people to ensure the text quality, avoid toxic content and PII text. We used latest iOS and android mobile devices above 5MP camera to click all these images to maintain the image quality. In this training dataset images are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata like device information, source type like newspaper, magazine or book image, and image type like portrait or landscape etc. Each image is properly renamed corresponding to the metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of French text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native French crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the French language. Your journey to enhanced language understanding and processing starts here.

  10. F

    Gujarati Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Gujarati Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/gujarati-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Gujarati Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Gujarati language.

    Dataset Contain & Diversity:

    Containing a total of 5000 images, this Gujarati OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcases distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition model we allow limited (less than five) unique images from a single resource. Stringent measures have been taken to exclude any personal identifiable information (PII), and in each image a minimum of 80% space is contain visible Gujarati text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

    All these images were captured by native Gujarati people to ensure the text quality, avoid toxic content and PII text. We used latest iOS and android mobile devices above 5MP camera to click all these images to maintain the image quality. In this training dataset images are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata like device information, source type like newspaper, magazine or book image, and image type like portrait or landscape etc. Each image is properly renamed corresponding to the metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Gujarati text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Gujarati crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Gujarati language. Your journey to enhanced language understanding and processing starts here.

  11. F

    Danish Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Danish Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/danish-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Danish Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Danish language.

    Dataset Contain & Diversity:

    Containing a total of 5000 images, this Danish OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcases distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition model we allow limited (less than five) unique images from a single resource. Stringent measures have been taken to exclude any personal identifiable information (PII), and in each image a minimum of 80% space is contain visible Danish text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

    All these images were captured by native Danish people to ensure the text quality, avoid toxic content and PII text. We used latest iOS and android mobile devices above 5MP camera to click all these images to maintain the image quality. In this training dataset images are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata like device information, source type like newspaper, magazine or book image, and image type like portrait or landscape etc. Each image is properly renamed corresponding to the metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Danish text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Danish crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Danish language. Your journey to enhanced language understanding and processing starts here.

  12. F

    Filipino Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Filipino Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/filipino-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Filipino Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Filipino language.

    Dataset Contain & Diversity:

    Containing a total of 5000 images, this Filipino OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcases distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition model we allow limited (less than five) unique images from a single resource. Stringent measures have been taken to exclude any personal identifiable information (PII), and in each image a minimum of 80% space is contain visible Filipino text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

    All these images were captured by native Filipino people to ensure the text quality, avoid toxic content and PII text. We used latest iOS and android mobile devices above 5MP camera to click all these images to maintain the image quality. In this training dataset images are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata like device information, source type like newspaper, magazine or book image, and image type like portrait or landscape etc. Each image is properly renamed corresponding to the metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Filipino text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Filipino crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Filipino language. Your journey to enhanced language understanding and processing starts here.

  13. f

    Comparison based on average CPU time in image type identification.

    • figshare.com
    xls
    Updated Feb 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Faqir Gul; Mohsin Shah; Mushtaq Ali; Lal Hussain; Touseef Sadiq; Adeel Ahmed Abbasi; Mohammad Shahbaz Khan; Badr S. Alkahtani (2025). Comparison based on average CPU time in image type identification. [Dataset]. http://doi.org/10.1371/journal.pone.0315823.t005
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Feb 20, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Faqir Gul; Mohsin Shah; Mushtaq Ali; Lal Hussain; Touseef Sadiq; Adeel Ahmed Abbasi; Mohammad Shahbaz Khan; Badr S. Alkahtani
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison based on average CPU time in image type identification.

  14. F

    Chinese Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Chinese Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/chinese-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Chinese Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Chinese language.

    Dataset Contain & Diversity:

    Containing a total of 5000 images, this Chinese OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcases distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition model we allow limited (less than five) unique images from a single resource. Stringent measures have been taken to exclude any personal identifiable information (PII), and in each image a minimum of 80% space is contain visible Chinese text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

    All these images were captured by native Chinese people to ensure the text quality, avoid toxic content and PII text. We used latest iOS and android mobile devices above 5MP camera to click all these images to maintain the image quality. In this training dataset images are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata like device information, source type like newspaper, magazine or book image, and image type like portrait or landscape etc. Each image is properly renamed corresponding to the metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Chinese text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Chinese crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Chinese language. Your journey to enhanced language understanding and processing starts here.

  15. f

    Performance comparison of our method with existing methods in image type...

    • plos.figshare.com
    xls
    Updated Feb 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Faqir Gul; Mohsin Shah; Mushtaq Ali; Lal Hussain; Touseef Sadiq; Adeel Ahmed Abbasi; Mohammad Shahbaz Khan; Badr S. Alkahtani (2025). Performance comparison of our method with existing methods in image type identification. [Dataset]. http://doi.org/10.1371/journal.pone.0315823.t001
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Feb 20, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Faqir Gul; Mohsin Shah; Mushtaq Ali; Lal Hussain; Touseef Sadiq; Adeel Ahmed Abbasi; Mohammad Shahbaz Khan; Badr S. Alkahtani
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance comparison of our method with existing methods in image type identification.

  16. f

    Classification performance of the Wide ResNet-32 architectures on the...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Junhyeok An; Soojin Jang; Junehyoung Kwon; Kyohoon Jin; YoungBin Kim (2023). Classification performance of the Wide ResNet-32 architectures on the CIFAR-10 and CIFAR-100 datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0274767.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Junhyeok An; Soojin Jang; Junehyoung Kwon; Kyohoon Jin; YoungBin Kim
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification performance of the Wide ResNet-32 architectures on the CIFAR-10 and CIFAR-100 datasets.

  17. F

    Norwegian Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Norwegian Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/norwegian-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Norwegian Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Norwegian language.

    Dataset Contain & Diversity:

    Containing a total of 5000 images, this Norwegian OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcases distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition model we allow limited (less than five) unique images from a single resource. Stringent measures have been taken to exclude any personal identifiable information (PII), and in each image a minimum of 80% space is contain visible Norwegian text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

    All these images were captured by native Norwegian people to ensure the text quality, avoid toxic content and PII text. We used latest iOS and android mobile devices above 5MP camera to click all these images to maintain the image quality. In this training dataset images are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata like device information, source type like newspaper, magazine or book image, and image type like portrait or landscape etc. Each image is properly renamed corresponding to the metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Norwegian text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Norwegian crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Norwegian language. Your journey to enhanced language understanding and processing starts here.

  18. F

    Polish Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Polish Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/polish-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Polish Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Polish language.

    Dataset Contain & Diversity:

    Containing a total of 5000 images, this Polish OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcases distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition model we allow limited (less than five) unique images from a single resource. Stringent measures have been taken to exclude any personal identifiable information (PII), and in each image a minimum of 80% space is contain visible Polish text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

    All these images were captured by native Polish people to ensure the text quality, avoid toxic content and PII text. We used latest iOS and android mobile devices above 5MP camera to click all these images to maintain the image quality. In this training dataset images are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata like device information, source type like newspaper, magazine or book image, and image type like portrait or landscape etc. Each image is properly renamed corresponding to the metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Polish text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Polish crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Polish language. Your journey to enhanced language understanding and processing starts here.

  19. f

    Comparison based on sub-images segmentation accuracy in terms of percentage....

    • plos.figshare.com
    xls
    Updated Feb 20, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Faqir Gul; Mohsin Shah; Mushtaq Ali; Lal Hussain; Touseef Sadiq; Adeel Ahmed Abbasi; Mohammad Shahbaz Khan; Badr S. Alkahtani (2025). Comparison based on sub-images segmentation accuracy in terms of percentage. [Dataset]. http://doi.org/10.1371/journal.pone.0315823.t003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Feb 20, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Faqir Gul; Mohsin Shah; Mushtaq Ali; Lal Hussain; Touseef Sadiq; Adeel Ahmed Abbasi; Mohammad Shahbaz Khan; Badr S. Alkahtani
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison based on sub-images segmentation accuracy in terms of percentage.

  20. F

    Swedish Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Swedish Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/swedish-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Swedish Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Swedish language.

    Dataset Contain & Diversity:

    Containing a total of 5000 images, this Swedish OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcases distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition model we allow limited (less than five) unique images from a single resource. Stringent measures have been taken to exclude any personal identifiable information (PII), and in each image a minimum of 80% space is contain visible Swedish text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

    All these images were captured by native Swedish people to ensure the text quality, avoid toxic content and PII text. We used latest iOS and android mobile devices above 5MP camera to click all these images to maintain the image quality. In this training dataset images are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata like device information, source type like newspaper, magazine or book image, and image type like portrait or landscape etc. Each image is properly renamed corresponding to the metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Swedish text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Swedish crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Swedish language. Your journey to enhanced language understanding and processing starts here.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
Organization logo

Best Books Ever Dataset

Explore at:
4 scholarly articles cite this dataset (View in Google Scholar)
csvAvailable download formats
Dataset updated
Nov 10, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells
License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

The dataset has been collected in the frame of the Prac1 of the subject Tipology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).

The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the larges list on the site).

Original code used to retrieve the dataset can be found on github repository: github.com/scostap/goodreads_bbe_dataset

The data was retrieved in two sets, the first 30000 books and then the remainig 22478. Dates were not parsed and reformated on the second chunk so publishDate and firstPublishDate are representet in a mm/dd/yyyy format for the first 30000 records and Month Day Year for the rest.

Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.

The 25 fields of the dataset are:

| Attributes | Definition | Completeness |
| ------------- | ------------- | ------------- | 
| bookId | Book Identifier as in goodreads.com | 100 |
| title | Book title | 100 |
| series | Series Name | 45 |
| author | Book's Author | 100 |
| rating | Global goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (ex. Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Editorial | 93 |
| publishDate | publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 starts (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |

Search
Clear search
Close search
Google apps
Main menu