41 datasets found

Best Books Ever Dataset
zenodo.org
csv
Updated Nov 10, 2020
Cite
Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.4265096
Dataset updated
Nov 10, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Lorena Casanova Lozano; Sergio Costa Planells
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The dataset was collected as part of Prac1 of the course Typology and Data Life Cycle in the Master's Degree in Data Science at the Universitat Oberta de Catalunya (UOC).

The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset

The data was retrieved in two sets: the first 30000 books and then the remaining 22478. Dates were not parsed and reformatted in the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30000 records and in Month Day Year format for the rest.
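Because the two chunks use different date layouts, a consumer of the CSV will probably want to normalize them. The following is a minimal sketch, not the authors' code: it assumes the two layouts map to the `strptime` patterns `%m/%d/%Y` and `%B %d %Y`, that month names are in English (`%B` is locale-dependent), and that ordinal suffixes such as "14th" may occur in the textual form. The function name `parse_publish_date` is hypothetical.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumption: the textual form may carry ordinal suffixes ("14th"); strip them.
_ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b")

# Assumption: these two patterns correspond to the two chunks described above.
_FORMATS = ("%m/%d/%Y", "%B %d %Y")


def parse_publish_date(raw: Optional[str]) -> Optional[date]:
    """Return a date for either layout, or None if the value is missing or unparseable."""
    if not raw or not raw.strip():
        return None
    text = _ORDINAL_SUFFIX.sub("", raw.strip())
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next layout
    return None
```

Returning `None` instead of raising keeps the roughly 2% of records without a publishDate from aborting a bulk conversion; values that match neither layout are left for manual inspection.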

Book cover images can optionally be downloaded from the URL in the 'coverImg' field. Python code for doing so, along with an example, can be found in the GitHub repo.
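For readers who only need the general idea, here is a standard-library sketch of fetching one cover. It is not the script from the repository: the helper names `cover_path` and `download_cover`, the `covers` output folder, and the `.jpg` fallback extension are all assumptions made for illustration.

```python
import os
import urllib.request
from urllib.parse import urlparse


def cover_path(book_id: str, cover_url: str, out_dir: str = "covers") -> str:
    """Build a local file path for a cover, keeping the URL's file extension."""
    ext = os.path.splitext(urlparse(cover_url).path)[1] or ".jpg"  # assumed fallback
    safe_id = "".join(c if c.isalnum() or c in "-_." else "_" for c in book_id)
    return os.path.join(out_dir, safe_id + ext)


def download_cover(book_id: str, cover_url: str, out_dir: str = "covers") -> str:
    """Download the image behind a coverImg URL and return the saved path."""
    os.makedirs(out_dir, exist_ok=True)
    target = cover_path(book_id, cover_url, out_dir)
    urllib.request.urlretrieve(cover_url, target)
    return target
```

Since coverImg is about 99% complete, callers should skip rows where the field is empty before calling `download_cover`.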

The 25 fields of the dataset are:

| Attributes | Definition | Completeness (%) |
| ------------- | ------------- | ------------- |
| bookId | Book Identifier as in goodreads.com | 100 |
| title | Book title | 100 |
| series | Series Name | 45 |
| author | Book's Author | 100 |
| rating | Global goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (ex. Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Editorial | 93 |
| publishDate | Publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |
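The description does not define how the completeness column was computed. A plausible reading is the percentage of records with a non-empty value in each field; the sketch below implements that reading on a tiny in-memory sample. The function name `completeness` and the sample rows are hypothetical, and list-like fields in the real file may encode "no value" differently (for example as an empty list), which this sketch does not handle.

```python
import csv
import io
from typing import Dict, Iterable, Mapping


def completeness(rows: Iterable[Mapping[str, str]], fields: Iterable[str]) -> Dict[str, int]:
    """Percentage (rounded) of rows with a non-empty value, per field."""
    rows = list(rows)
    total = len(rows)
    result = {}
    for field in fields:
        filled = sum(1 for row in rows if (row.get(field) or "").strip())
        result[field] = round(100 * filled / total) if total else 0
    return result


# Made-up sample rows, only to show the calculation.
sample = io.StringIO(
    "bookId,title,series\n"
    "1,Book A,Series X\n"
    "2,Book B,\n"
    "3,Book C,\n"
    "4,Book D,Series Y\n"
)
report = completeness(csv.DictReader(sample), ["bookId", "title", "series"])
print(report)  # {'bookId': 100, 'title': 100, 'series': 50}
```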
Bahasa Newspaper, Magazine, and Books OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Bahasa Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/bahasa-newspaper-book-magazine-ocr-image-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Bahasa Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Bahasa language.
Dataset Contents & Diversity:
Containing a total of 5000 images, this Bahasa OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Bahasa text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Bahasa speakers to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata:
Along with the image data, you will also receive detailed, structured metadata in CSV format. For each image, it includes details such as device information, source type (newspaper, magazine, or book), and image type (portrait or landscape). Each image is renamed to correspond to its metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Bahasa text recognition models.
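As one illustration of how such metadata could be used, the sketch below tallies the share of images per source type, which is a quick way to check the "equal distribution" claim on a delivered batch. The column name `source_type` and the sample rows are assumptions; the listing does not document the actual CSV header names.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable, Mapping


def source_type_shares(rows: Iterable[Mapping[str, str]],
                       column: str = "source_type") -> Dict[str, float]:
    """Fraction of images per source type (column name is hypothetical)."""
    counts = Counter(row[column].strip().lower() for row in rows)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


# Made-up metadata rows standing in for the vendor's CSV.
sample = io.StringIO(
    "file_name,source_type,image_type\n"
    "img_0001.jpg,newspaper,portrait\n"
    "img_0002.jpg,book,landscape\n"
    "img_0003.jpg,magazine,portrait\n"
    "img_0004.heic,Newspaper,portrait\n"
    "img_0005.jpg,book,portrait\n"
    "img_0006.jpg,magazine,landscape\n"
)
shares = source_type_shares(csv.DictReader(sample))
for kind, share in sorted(shares.items()):
    print(f"{kind}: {share:.1%}")
```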
Update & Custom Collection:
We're committed to expanding this dataset by continuously adding more images with the assistance of our native Bahasa crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, with the help of our crowd community, we can annotate the images with bounding boxes or transcribe the text in them to meet your specific requirements.
License:
This Image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Bahasa language. Your journey to enhanced language understanding and processing starts here.
Arabic Newspaper, Magazine, and Books OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Arabic Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/arabic-newspaper-book-magazine-ocr-image-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Arabic Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Arabic language.
Dataset Contents & Diversity:
Containing a total of 5000 images, this Arabic OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Arabic text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Arabic speakers to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata:
Along with the image data, you will also receive detailed, structured metadata in CSV format. For each image, it includes details such as device information, source type (newspaper, magazine, or book), and image type (portrait or landscape). Each image is renamed to correspond to its metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Arabic text recognition models.
Update & Custom Collection:
We're committed to expanding this dataset by continuously adding more images with the assistance of our native Arabic crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, with the help of our crowd community, we can annotate the images with bounding boxes or transcribe the text in them to meet your specific requirements.
License:
This Image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Arabic language. Your journey to enhanced language understanding and processing starts here.
Punjabi Newspaper, Magazine, and Books OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Punjabi Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/punjabi-newspaper-book-magazine-ocr-image-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Punjabi Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Punjabi language.
Dataset Contents & Diversity:
Containing a total of 5000 images, this Punjabi OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Punjabi text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Punjabi speakers to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata:
Along with the image data, you will also receive detailed, structured metadata in CSV format. For each image, it includes details such as device information, source type (newspaper, magazine, or book), and image type (portrait or landscape). Each image is renamed to correspond to its metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Punjabi text recognition models.
Update & Custom Collection:
We're committed to expanding this dataset by continuously adding more images with the assistance of our native Punjabi crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, with the help of our crowd community, we can annotate the images with bounding boxes or transcribe the text in them to meet your specific requirements.
License:
This Image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Punjabi language. Your journey to enhanced language understanding and processing starts here.
PDMX
cseweb.ucsd.edu
json
Cite
UCSD CSE Research Project, PDMX [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
Explore at:
Available download formats: json
Dataset authored and provided by
UCSD CSE Research Project
Description
We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.
Bengali Newspaper, Magazine, and Books OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Bengali Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/bengali-newspaper-book-magazine-ocr-image-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Bengali Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Bengali language.
Dataset Contents & Diversity:
Containing a total of 5000 images, this Bengali OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Bengali text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Bengali speakers to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata:
Along with the image data, you will also receive detailed, structured metadata in CSV format. For each image, it includes details such as device information, source type (newspaper, magazine, or book), and image type (portrait or landscape). Each image is renamed to correspond to its metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Bengali text recognition models.
Update & Custom Collection:
We're committed to expanding this dataset by continuously adding more images with the assistance of our native Bengali crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, with the help of our crowd community, we can annotate the images with bounding boxes or transcribe the text in them to meet your specific requirements.
License:
This Image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Bengali language. Your journey to enhanced language understanding and processing starts here.
Word Dataset (Sword6k)
figshare.com
zip
Updated Jan 18, 2024
Cite
Payel Sengupta; Ayatullah Faruk Mollah (2024). Word Dataset (Sword6k) [Dataset]. http://doi.org/10.6084/m9.figshare.21523479.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.21523479.v1
Dataset updated
Jan 18, 2024
Dataset provided by
figshare
Authors
Payel Sengupta; Ayatullah Faruk Mollah
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A Roman-script dataset of scene word images, named Scene Word Dataset (SWord6k), was developed for character segmentation, character recognition, text detection, and script identification. All images in SWord6k were captured in outdoor environments. They were collected from banners, advertisements, shop names, and posters at different locations, such as shopping malls, book fairs, and puja pandals. The SWord6k dataset contains 6,661 scene word images in PNG format. Three types of ground truth annotations are provided for the SWord6k dataset: (i) component level, i.e., whether a component is text or not; (ii) script level, i.e., identifying the script; and (iii) recognition level, i.e., character/word recognition of the text. Each image's ground truth annotations are stored in XML (Extensible Markup Language) format.
FRLL-Morphs
data.niaid.nih.gov
Updated Jul 7, 2022
Cite
Sarkar, Eklavya (2022). FRLL-Morphs [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4415159
Explore at:
Dataset updated
Jul 7, 2022
Dataset provided by
Korshunov, Pavel
Sarkar, Eklavya
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
DISCLAIMER

The preprocess.py script included in this dataset is no longer necessary, and thus should NOT be run.

Database Description

FRLL-Morphs is a dataset of morphed faces based on images selected from the publicly available Face Research London Lab dataset. We created 4 types of morphs for each pre-selected pair of images using the following morphing tools:

OpenCV

FaceMorpher

StyleGAN 2

WebMorpher

Instructions

This dataset is planned for vulnerability analysis experiments in the context of face recognition. Therefore, it is intended to be used in conjunction with the original Face Research London Lab dataset.

To prepare this folder's file structure so it may easily be used for such experiments:

Download and extract only the neutral_front and smiling_front datasets from the Face Research London Lab dataset.

Place them in a new facelab_london/raw folder.

Rename them simply as neutral and smiling respectively.

Remove the .tem files from the neutral folder if not specifically required for any experiments as these could clash with other operations.

Once completed the directory's structure should be as given below:

+-- facelab_london
|   +-- morph_amsl
|   +-- morph_facemorpher
|   +-- morph_opencv
|   +-- morph_stylegan
|   +-- morph_webmorph
|   +-- raw
|   +-- protocols
|   +-- preprocess.py
|   +-- README.txt
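The preparation steps above can be scripted. The following is a sketch under stated assumptions rather than part of the dataset: the function name `prepare_raw` is hypothetical, and it assumes the two archives were already extracted into folders named neutral_front and smiling_front inside some download directory. It deliberately does not call preprocess.py, which the disclaimer says should not be run.

```python
import shutil
from pathlib import Path


def prepare_raw(extracted_dir: Path, dataset_root: Path) -> Path:
    """Move the two extracted FRLL folders into <dataset_root>/raw and tidy them."""
    raw = dataset_root / "raw"
    raw.mkdir(parents=True, exist_ok=True)

    # Rename neutral_front -> neutral and smiling_front -> smiling while moving.
    for src_name, dst_name in (("neutral_front", "neutral"), ("smiling_front", "smiling")):
        src = extracted_dir / src_name
        dst = raw / dst_name
        if src.is_dir() and not dst.exists():
            shutil.move(str(src), str(dst))

    # Remove the .tem files from the neutral folder (skip this if you need them).
    neutral = raw / "neutral"
    if neutral.is_dir():
        for tem_file in neutral.glob("*.tem"):
            tem_file.unlink()
    return raw


# Example call (paths are placeholders):
# prepare_raw(Path("downloads"), Path("facelab_london"))
```

The function is safe to re-run: folders that were already moved are left alone, and only `.tem` files directly inside `raw/neutral` are deleted.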

Protocols

The vulnerability analysis can be conducted in two ways, using:

morphed images as references (reverse-protocol)

morphed images as probes (scores-protocol)

The protocols for both types of experiments are provided in the protocols folder, each of which contains file lists detailing the exact images used as references (for_models.lst) and as probes (for_probes.lst) for each morphing tool.

The data is split into two sets, development (dev) and evaluation (eval), so that it can easily be used by a toolkit such as bob (https://www.idiap.ch/software/bob). The split is made such that no original identities used to make the morphed images overlap between the sets, making the two sets completely independent of one another.

References

Any publication (e.g., conference paper, journal article, technical report, book chapter, etc.) resulting from the usage of FRLL_Morphs must cite the following papers:

@INPROCEEDINGS{9746477, author = {Sarkar, Eklavya and Korshunov, Pavel and Colbois, Laurent and Marcel, Sébastien}, booktitle = {ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, title = {Are GAN-based morphs threatening face recognition?}, year={2022}, pages={2959-2963}, url={https://doi.org/10.1109/ICASSP43922.2022.9746477}, doi={10.1109/ICASSP43922.2022.9746477} }

@article{Sarkar2020, title={Vulnerability Analysis of Face Morphing Attacks from Landmarks and Generative Adversarial Networks}, author={Eklavya Sarkar and Pavel Korshunov and Laurent Colbois and S\'{e}bastien Marcel}, year={2020}, month=oct, journal={arXiv preprint}, url={https://arxiv.org/abs/2012.05344} }

Any publication (e.g., conference paper, journal article, technical report, book chapter, etc.) resulting from the usage of Face Research London Lab must cite the following source:

@misc{debruine_jones_2017, title={Face Research Lab London Set}, url={https://figshare.com/articles/dataset/Face_Research_Lab_London_Set/5047666/3}, DOI={10.6084/m9.figshare.5047666.v3}, publisher={figshare}, author={DeBruine, Lisa and Jones, Benedict}, year={2017}, month={May} }

Any publication (e.g., conference paper, journal article, technical report, book chapter, etc.) resulting from the usage of Advanced Multimedia Security Lab’s (AMSL) Face Morph Image dataset must cite the following source:

@article{https://doi.org/10.1049/iet-bmt.2017.0147, author={Neubert, Tom and Makrushin, Andrey and Hildebrandt, Mario and Kraetzer, Christian and Dittmann, Jana}, title={Extended StirTrace benchmarking of biometric and forensic qualities of morphed face images}, journal={IET Biometrics}, volume={7}, number={4}, pages={325-332}, doi={https://doi.org/10.1049/iet-bmt.2017.0147}, url={https://ietresearch.onlinelibrary.wiley.com/doi/abs/10.1049/iet-bmt.2017.0147}, eprint={https://ietresearch.onlinelibrary.wiley.com/doi/pdf/10.1049/iet-bmt.2017.0147}, year={2018} }
French Newspaper, Magazine, and Books OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). French Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/french-newspaper-book-magazine-ocr-image-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
French
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the French Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the French language.
Dataset Contents & Diversity:
Containing a total of 5000 images, this French OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible French text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native French speakers to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata:
Along with the image data, you will also receive detailed, structured metadata in CSV format. For each image, it includes details such as device information, source type (newspaper, magazine, or book), and image type (portrait or landscape). Each image is renamed to correspond to its metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of French text recognition models.
Update & Custom Collection:
We're committed to expanding this dataset by continuously adding more images with the assistance of our native French crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, with the help of our crowd community, we can annotate the images with bounding boxes or transcribe the text in them to meet your specific requirements.
License:
This Image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the French language. Your journey to enhanced language understanding and processing starts here.
Gujarati Newspaper, Magazine, and Books OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Gujarati Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/gujarati-newspaper-book-magazine-ocr-image-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Gujarati Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Gujarati language.
Dataset Contents & Diversity:
Containing a total of 5000 images, this Gujarati OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Gujarati text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Gujarati speakers to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata:
Along with the image data, you will also receive detailed, structured metadata in CSV format. For each image, it includes details such as device information, source type (newspaper, magazine, or book), and image type (portrait or landscape). Each image is renamed to correspond to its metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Gujarati text recognition models.
Update & Custom Collection:
We're committed to expanding this dataset by continuously adding more images with the assistance of our native Gujarati crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, with the help of our crowd community, we can annotate the images with bounding boxes or transcribe the text in them to meet your specific requirements.
License:
This Image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Gujarati language. Your journey to enhanced language understanding and processing starts here.
Danish Newspaper, Magazine, and Books OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Danish Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/danish-newspaper-book-magazine-ocr-image-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Danish Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Danish language.
Dataset Contents & Diversity:
Containing a total of 5000 images, this Danish OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build a robust text recognition model, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Danish text.
Images have been captured under varying lighting conditions (both day and night), along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Danish people to ensure text quality and to avoid toxic content and PII. We used the latest iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata:
Along with the image data, you will also receive detailed, structured metadata in CSV format. For each image it includes details such as device information, source type (newspaper, magazine, or book), and image type (portrait or landscape). Each image is renamed to correspond to its metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Danish text recognition models.
Update & Custom Collection:
We're committed to expanding this dataset by continuously adding more images with the assistance of our native Danish crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, we can annotate or label the images with bounding boxes, or transcribe the text in the images, to align with your specific requirements using our crowd community.
License:
This Image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Danish language. Your journey to enhanced language understanding and processing starts here.
Filipino Newspaper, Magazine, and Books OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Filipino Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/filipino-newspaper-book-magazine-ocr-image-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Filipino Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Filipino language.
Dataset Contents & Diversity:
Containing a total of 5000 images, this Filipino OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build a robust text recognition model, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Filipino text.
Images have been captured under varying lighting conditions (both day and night), along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Filipino people to ensure text quality and to avoid toxic content and PII. We used the latest iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata:
Along with the image data, you will also receive detailed, structured metadata in CSV format. For each image it includes details such as device information, source type (newspaper, magazine, or book), and image type (portrait or landscape). Each image is renamed to correspond to its metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Filipino text recognition models.
Update & Custom Collection:
We're committed to expanding this dataset by continuously adding more images with the assistance of our native Filipino crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, we can annotate or label the images with bounding boxes, or transcribe the text in the images, to align with your specific requirements using our crowd community.
License:
This Image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Filipino language. Your journey to enhanced language understanding and processing starts here.
Comparison based on average CPU time in image type identification.
figshare.com
xls
Updated Feb 20, 2025
Cite
Faqir Gul; Mohsin Shah; Mushtaq Ali; Lal Hussain; Touseef Sadiq; Adeel Ahmed Abbasi; Mohammad Shahbaz Khan; Badr S. Alkahtani (2025). Comparison based on average CPU time in image type identification. [Dataset]. http://doi.org/10.1371/journal.pone.0315823.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0315823.t005
Dataset updated
Feb 20, 2025
Dataset provided by
PLOS ONE
Authors
Faqir Gul; Mohsin Shah; Mushtaq Ali; Lal Hussain; Touseef Sadiq; Adeel Ahmed Abbasi; Mohammad Shahbaz Khan; Badr S. Alkahtani
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparison based on average CPU time in image type identification.
Chinese Newspaper, Magazine, and Books OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Chinese Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/chinese-newspaper-book-magazine-ocr-image-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Chinese Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Chinese language.
Dataset Contents & Diversity:
Containing a total of 5000 images, this Chinese OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build a robust text recognition model, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Chinese text.
Images have been captured under varying lighting conditions (both day and night), along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Chinese people to ensure text quality and to avoid toxic content and PII. We used the latest iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata:
Along with the image data, you will also receive detailed, structured metadata in CSV format. For each image it includes details such as device information, source type (newspaper, magazine, or book), and image type (portrait or landscape). Each image is renamed to correspond to its metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Chinese text recognition models.
Update & Custom Collection:
We're committed to expanding this dataset by continuously adding more images with the assistance of our native Chinese crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, we can annotate or label the images with bounding boxes, or transcribe the text in the images, to align with your specific requirements using our crowd community.
License:
This Image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Chinese language. Your journey to enhanced language understanding and processing starts here.
Performance comparison of our method with existing methods in image type...
plos.figshare.com
xls
Updated Feb 20, 2025
Cite
Faqir Gul; Mohsin Shah; Mushtaq Ali; Lal Hussain; Touseef Sadiq; Adeel Ahmed Abbasi; Mohammad Shahbaz Khan; Badr S. Alkahtani (2025). Performance comparison of our method with existing methods in image type identification. [Dataset]. http://doi.org/10.1371/journal.pone.0315823.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0315823.t001
Dataset updated
Feb 20, 2025
Dataset provided by
PLOS ONE
Authors
Faqir Gul; Mohsin Shah; Mushtaq Ali; Lal Hussain; Touseef Sadiq; Adeel Ahmed Abbasi; Mohammad Shahbaz Khan; Badr S. Alkahtani
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance comparison of our method with existing methods in image type identification.
Classification performance of the Wide ResNet-32 architectures on the...
plos.figshare.com
xls
Updated May 31, 2023
Cite
Junhyeok An; Soojin Jang; Junehyoung Kwon; Kyohoon Jin; YoungBin Kim (2023). Classification performance of the Wide ResNet-32 architectures on the CIFAR-10 and CIFAR-100 datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0274767.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0274767.t002
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Junhyeok An; Soojin Jang; Junehyoung Kwon; Kyohoon Jin; YoungBin Kim
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification performance of the Wide ResNet-32 architectures on the CIFAR-10 and CIFAR-100 datasets.
Norwegian Newspaper, Magazine, and Books OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Norwegian Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/norwegian-newspaper-book-magazine-ocr-image-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Norwegian Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Norwegian language.
Dataset Contents & Diversity:
Containing a total of 5000 images, this Norwegian OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build a robust text recognition model, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Norwegian text.
Images have been captured under varying lighting conditions (both day and night), along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Norwegian people to ensure text quality and to avoid toxic content and PII. We used the latest iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata:
Along with the image data, you will also receive detailed, structured metadata in CSV format. For each image it includes details such as device information, source type (newspaper, magazine, or book), and image type (portrait or landscape). Each image is renamed to correspond to its metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Norwegian text recognition models.
Update & Custom Collection:
We're committed to expanding this dataset by continuously adding more images with the assistance of our native Norwegian crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, we can annotate or label the images with bounding boxes, or transcribe the text in the images, to align with your specific requirements using our crowd community.
License:
This Image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Norwegian language. Your journey to enhanced language understanding and processing starts here.
Polish Newspaper, Magazine, and Books OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Polish Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/polish-newspaper-book-magazine-ocr-image-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Polish Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Polish language.
Dataset Contents & Diversity:
Containing a total of 5000 images, this Polish OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build a robust text recognition model, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Polish text.
Images have been captured under varying lighting conditions (both day and night), along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Polish people to ensure text quality and to avoid toxic content and PII. We used the latest iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata:
Along with the image data, you will also receive detailed, structured metadata in CSV format. For each image it includes details such as device information, source type (newspaper, magazine, or book), and image type (portrait or landscape). Each image is renamed to correspond to its metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Polish text recognition models.
Update & Custom Collection:
We're committed to expanding this dataset by continuously adding more images with the assistance of our native Polish crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, we can annotate or label the images with bounding boxes, or transcribe the text in the images, to align with your specific requirements using our crowd community.
License:
This Image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Polish language. Your journey to enhanced language understanding and processing starts here.
Comparison based on sub-images segmentation accuracy in terms of percentage....
plos.figshare.com
xls
Updated Feb 20, 2025
Cite
Faqir Gul; Mohsin Shah; Mushtaq Ali; Lal Hussain; Touseef Sadiq; Adeel Ahmed Abbasi; Mohammad Shahbaz Khan; Badr S. Alkahtani (2025). Comparison based on sub-images segmentation accuracy in terms of percentage. [Dataset]. http://doi.org/10.1371/journal.pone.0315823.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0315823.t003
Dataset updated
Feb 20, 2025
Dataset provided by
PLOS ONE
Authors
Faqir Gul; Mohsin Shah; Mushtaq Ali; Lal Hussain; Touseef Sadiq; Adeel Ahmed Abbasi; Mohammad Shahbaz Khan; Badr S. Alkahtani
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparison based on sub-images segmentation accuracy in terms of percentage.
Swedish Newspaper, Magazine, and Books OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Swedish Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/swedish-newspaper-book-magazine-ocr-image-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Swedish Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Swedish language.
Dataset Contents & Diversity:
Containing a total of 5000 images, this Swedish OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build a robust text recognition model, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Swedish text.
Images have been captured under varying lighting conditions (both day and night), along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Swedish people to ensure text quality and to avoid toxic content and PII. We used the latest iOS and Android mobile devices with cameras above 5 MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata:
Along with the image data, you will also receive detailed, structured metadata in CSV format. For each image it includes details such as device information, source type (newspaper, magazine, or book), and image type (portrait or landscape). Each image is renamed to correspond to its metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Swedish text recognition models.
Update & Custom Collection:
We're committed to expanding this dataset by continuously adding more images with the assistance of our native Swedish crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, we can annotate or label the images with bounding boxes, or transcribe the text in the images, to align with your specific requirements using our crowd community.
License:
This Image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Swedish language. Your journey to enhanced language understanding and processing starts here.

Cite

Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096

Best Books Ever Dataset

Explore at:

4 scholarly articles cite this dataset (View in Google Scholar)

Available download formats: csv

Unique identifier

https://doi.org/10.5281/zenodo.4265096

Dataset updated

Nov 10, 2020

Dataset provided by

Zenodo, http://zenodo.org/

Authors

Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

The dataset was collected as part of Prac1 of the subject Typology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).

The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset

The data was retrieved in two sets: the first 30000 books and then the remaining 22478. Dates were not parsed and reformatted in the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30000 records and in Month Day Year format for the rest.
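
Because the two chunks use different date layouts, anyone loading publishDate or firstPublishDate will probably want to normalize them first. The following is a minimal sketch of one way to do that, not the authors' code. It assumes the "Month Day Year" layout uses full English month names, optionally with an ordinal suffix or comma (for example "September 14th 2008"), which the description does not specify. The helper name `parse_publish_date` is hypothetical.

```python
import re
from datetime import date, datetime

# Strips ordinal suffixes such as "14th" -> "14" (assumed to occur in the data).
_ORDINAL = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def parse_publish_date(raw):
    """Return a date for either layout, or None if neither layout matches."""
    text = raw.strip()
    # Layout described for the first 30000 records: mm/dd/yyyy.
    try:
        return datetime.strptime(text, "%m/%d/%Y").date()
    except ValueError:
        pass
    # Layout described for the remaining records: Month Day Year.
    cleaned = _ORDINAL.sub("", text).replace(",", "")
    try:
        return datetime.strptime(cleaned, "%B %d %Y").date()
    except ValueError:
        return None


print(parse_publish_date("09/14/2008"))
print(parse_publish_date("September 14th 2008"))
print(parse_publish_date("not a date"))
```

Returning None for unparseable values keeps missing or irregular dates visible instead of raising, which fits fields whose completeness is below 100. Note that `%B` matches month names in the current locale, so this sketch assumes an English locale.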

Book cover images can optionally be downloaded from the URL in the 'coverImg' field. Python code for doing so, along with an example, can be found in the GitHub repo.

The 25 fields of the dataset are:

| Attributes | Definition | Completeness (%) |
| ------------- | ------------- | ------------- | 
| bookId | Book Identifier as in goodreads.com | 100 |
| title | Book title | 100 |
| series | Series Name | 45 |
| author | Book's Author | 100 |
| rating | Global goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (e.g. Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Publisher | 93 |
| publishDate | Publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |
