100+ datasets found
  1. Best Books Ever Dataset

    • zenodo.org
    csv
    Updated Nov 10, 2020
    Cite
    Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
    Available download formats
    csv
    Dataset updated
    Nov 10, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lorena Casanova Lozano; Sergio Costa Planells
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset was collected as part of Prac1 of the course Typology and Data Life Cycle in the Master's Degree in Data Science at the Universitat Oberta de Catalunya (UOC).

    The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

    The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset

    The data was retrieved in two sets: the first 30000 books, then the remaining 22478. Dates were not parsed and reformatted in the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30000 records and in Month Day Year format for the rest.
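
    Because the two chunks use different date layouts, a loader has to try more than one format. The following is a minimal sketch using only the Python standard library. The exact strings in the CSV are an assumption here (for example, whether the second chunk writes "September 14th 2008" with an ordinal suffix), and `parse_publish_date` is a hypothetical helper, not part of the authors' code.

```python
import re
from datetime import date, datetime
from typing import Optional

# Candidate layouts: numeric mm/dd/yyyy, then "Month Day Year" with a full
# or abbreviated month name. These are assumptions about the raw strings.
_FORMATS = ("%m/%d/%Y", "%B %d %Y", "%b %d %Y")

# Matches ordinal suffixes that directly follow a digit ("14th" -> "14").
_ORDINAL = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def parse_publish_date(raw: str) -> Optional[date]:
    """Parse a publishDate/firstPublishDate value, or return None if unparseable."""
    cleaned = _ORDINAL.sub("", raw.strip()).replace(",", "")
    cleaned = " ".join(cleaned.split())  # collapse repeated whitespace
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None
```

    Returning `None` instead of raising keeps incomplete fields (firstPublishDate is only 59% complete) from aborting a bulk load.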

    Book cover images can optionally be downloaded from the URL in the 'coverImg' field. Python code for doing so, with an example, can be found in the GitHub repo.

    The 25 fields of the dataset are:

    | Attributes | Definition | Completeness |
    | ------------- | ------------- | ------------- | 
    | bookId | Book Identifier as in goodreads.com | 100 |
    | title | Book title | 100 |
    | series | Series Name | 45 |
    | author | Book's Author | 100 |
    | rating | Global goodreads rating | 100 |
    | description | Book's description | 97 |
    | language | Book's language | 93 |
    | isbn | Book's ISBN | 92 |
    | genres | Book's genres | 91 |
    | characters | Main characters | 26 |
    | bookFormat | Type of binding | 97 |
    | edition | Type of edition (ex. Anniversary Edition) | 9 |
    | pages | Number of pages | 96 |
    | publisher | Publishing house | 93 |
    | publishDate | Publication date | 98 |
    | firstPublishDate | Publication date of first edition | 59 |
    | awards | List of awards | 20 |
    | numRatings | Number of total ratings | 100 |
    | ratingsByStars | Number of ratings by stars | 97 |
    | likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
    | setting | Story setting | 22 |
    | coverImg | URL to cover image | 99 |
    | bbeScore | Score in Best Books Ever list | 100 |
    | bbeVotes | Number of votes in Best Books Ever list | 100 |
    | price | Book's price (extracted from Iberlibro) | 73 |
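
    The Completeness column can be approximated as the percentage of non-empty cells per field. The sketch below assumes that definition (the authors' exact rule is not stated, and list-valued fields stored as "[]" would count as filled here). The `completeness` helper and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterable


def completeness(rows: Iterable[Dict[str, str]], fields: Iterable[str]) -> Dict[str, int]:
    """Return the percentage of non-empty values per field, rounded to an integer."""
    fields = list(fields)
    filled = {name: 0 for name in fields}
    total = 0
    for row in rows:
        total += 1
        for name in fields:
            value = row.get(name)
            # Treat missing, empty, and whitespace-only cells as incomplete.
            if value is not None and value.strip():
                filled[name] += 1
    if total == 0:
        return {name: 0 for name in fields}
    return {name: round(100 * count / total) for name, count in filled.items()}


# Tiny in-memory sample standing in for the real CSV (values are made up).
sample = io.StringIO(
    "bookId,title,series,edition\n"
    "1,First Book,Saga #1,\n"
    "2,Second Book,,\n"
    "3,Third Book,,Anniversary Edition\n"
    "4,Fourth Book,Saga #2,\n"
)
report = completeness(csv.DictReader(sample), ["bookId", "title", "series", "edition"])
```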

  2. BookCorpus Dataset

    • paperswithcode.com
    • opendatalab.com
    Cite
    Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler, BookCorpus Dataset [Dataset]. https://paperswithcode.com/dataset/bookcorpus
    Authors
    Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler
    Description

    BookCorpus is a large collection of free novel books written by unpublished authors, which contains 11,038 books (around 74M sentences and 1G words) of 16 different sub-genres (e.g., Romance, Historical, Adventure, etc.).

  3. blbooks

    • huggingface.co
    Updated Jan 15, 1996
    Cite
    British Library (1996). blbooks [Dataset]. https://huggingface.co/datasets/TheBritishLibrary/blbooks
    Dataset updated
    Jan 15, 1996
    Dataset authored and provided by
    British Library
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    A dataset comprising text created by OCR from 49,455 digitised books, equating to 65,227 volumes (25+ million pages), published between c. 1510 and c. 1900. The books cover a wide range of subject areas including philosophy, history, poetry and literature.

  4. Goodreads Book Reviews

    • kaggle.com
    Updated Oct 30, 2023
    Cite
    Ahmad (2023). Goodreads Book Reviews [Dataset]. https://www.kaggle.com/datasets/pypiahmad/goodreads-book-reviews1/code
    Dataset updated
    Oct 30, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ahmad
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The Goodreads Book Reviews dataset encapsulates a wealth of reviews and various attributes concerning the books listed on the Goodreads platform. A distinguishing feature of this dataset is its capture of multiple tiers of user interaction, ranging from adding a book to a "shelf", to rating and reading it. This dataset is a treasure trove for those interested in understanding user behavior, book recommendations, sentiment analysis, and the interplay between various attributes of books and user interactions.

    Basic Statistics:

    • Items: 1,561,465
    • Users: 808,749
    • Interactions: 225,394,930

    Metadata:

    • Reviews: The text of the reviews provided by users.
    • Add-to-shelf, Read, Review Actions: Various interactions users have with the books.
    • Book Attributes: Attributes describing the books including title, and ISBN.
    • Graph of Similar Books: A graph depicting similarity relations between books.

    Example (interaction data), as JSON:

    {
      "user_id": "8842281e1d1347389f2ab93d60773d4d",
      "book_id": "130580",
      "review_id": "330f9c153c8d3347eb914c06b89c94da",
      "isRead": true,
      "rating": 4,
      "date_added": "Mon Aug 01 13:41:57 -0700 2011",
      "date_updated": "Mon Aug 01 13:42:41 -0700 2011",
      "read_at": "Fri Jan 01 00:00:00 -0800 1988",
      "started_at": ""
    }
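
    The timestamps in the interaction record carry a weekday, a UTC offset, and a trailing year, and empty strings stand in for missing values. As a rough sketch, they can be parsed with the standard library as follows. The format string is inferred from the single example above, and `parse_goodreads_date` is a hypothetical helper name.

```python
import json
from datetime import datetime
from typing import Optional

# Layout inferred from the sample record: "Mon Aug 01 13:41:57 -0700 2011".
GOODREADS_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_goodreads_date(raw: str) -> Optional[datetime]:
    """Return a timezone-aware datetime, or None for an empty field."""
    if not raw:
        return None
    return datetime.strptime(raw, GOODREADS_DATE_FORMAT)


record = json.loads("""
{"user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "130580",
 "review_id": "330f9c153c8d3347eb914c06b89c94da", "isRead": true, "rating": 4,
 "date_added": "Mon Aug 01 13:41:57 -0700 2011",
 "date_updated": "Mon Aug 01 13:42:41 -0700 2011",
 "read_at": "Fri Jan 01 00:00:00 -0800 1988", "started_at": ""}
""")

added = parse_goodreads_date(record["date_added"])
updated = parse_goodreads_date(record["date_updated"])
started = parse_goodreads_date(record["started_at"])  # empty string -> None

# Time between adding the book and the last update, in seconds.
seconds_between = (updated - added).total_seconds()
```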

    Use Cases:

    • Book Recommendations: Creating personalized book recommendations based on user interactions and preferences.
    • Sentiment Analysis: Analyzing sentiment in reviews and understanding how different book attributes influence sentiment.
    • User Behavior Analysis: Understanding user interaction patterns with books and deriving insights to enhance user engagement.
    • Natural Language Processing: Training models to process and analyze user-generated text in reviews.
    • Similarity Analysis: Analyzing the graph of similar books to understand book similarities and clustering.

    Citation: Please cite the following if you use the data: Mengting Wan, Julian McAuley. Item recommendation on monotonic behavior chains. RecSys, 2018. [PDF](https://cseweb.ucsd.edu/~jmcauley/pdfs/recsys18e.pdf)

    Code Samples: A curated set of code samples is provided in the dataset's GitHub repository, aiding in seamless interaction with the datasets. These include:

    • Downloading datasets without GUI: Facilitating dataset download in a non-GUI environment.
    • Displaying Sample Records: Showcasing sample records to get a glimpse of the dataset structure.
    • Calculating Basic Statistics: Computing basic statistics to understand the dataset's distribution and characteristics.
    • Exploring the Interaction Data: Delving into interaction data to grasp user-book interaction patterns.
    • Exploring the Review Data: Analyzing review data to extract valuable insights from user reviews.

    Additional Dataset:

    • Complete book reviews (~15m multilingual reviews about ~2m books and 465k users): This dataset comprises a comprehensive collection of reviews, showcasing a multilingual facet with reviews about around 2 million books from 465,000 users.

    Datasets:

    Meta-Data of Books:

    • Detailed Book Graph (goodreads_books.json.gz): A comprehensive graph detailing around 2.3 million books, acting as a rich source of book attributes and metadata.
    • Detailed Information of Authors (goodreads_book_authors.json.gz): An extensive dataset containing detailed information about book authors, essential for understanding author-centric trends and insights.
    • Detailed Information of Works (goodreads_book_works.json.gz): This dataset provides abstract information about a book disregarding any particular editions, facilitating a high-level understanding of each work.
    • Detailed Information of Book Series (goodreads_book_series.json.gz): A dataset encompassing detailed information about book series, aiding in understanding series-related trends and insights. Note that the series id included here cannot be used for URL hack.
    • Extracted Fuzzy Book Genres (goodreads_book_genres_initial.json....
  5. Book Dataset

    • universe.roboflow.com
    zip
    Updated Oct 9, 2024
    Cite
    kwsr (2024). Book Dataset [Dataset]. https://universe.roboflow.com/kwsr/book-gtby9/dataset/2
    Available download formats
    zip
    Dataset updated
    Oct 9, 2024
    Dataset authored and provided by
    kwsr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Name Bounding Boxes
    Description

    Book

    ## Overview
    
    Book is a dataset for object detection tasks - it contains Name annotations for 4,300 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  6. Books Dataset

    • figshare.com
    txt
    Updated Jan 19, 2016
    Cite
    Giuseppe Mendola (2016). Books Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.1441255.v1
    Available download formats
    txt
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Giuseppe Mendola
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database contains information about books gathered with the help of the Google Books API. It contains 7 tables, 3 of which serve only to relate the other tables to one another.

    Tables: Books contains 1062 records. Authors contains 1595 records. Categories contains 109 records. Metadata contains 37 records.

    MD5 (GBooks_2015-06-09.sql) = bfd09094d0e123e668b2e58332b1a98b
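
    The MD5 digest published for GBooks_2015-06-09.sql makes it possible to check a downloaded copy for corruption. A minimal sketch of that check follows. The function names are hypothetical, and the local file path is whatever the dump was saved as. MD5 is suitable here only as an integrity check, not as a security guarantee.

```python
import hashlib
from pathlib import Path

# Digest quoted in the dataset description for GBooks_2015-06-09.sql.
EXPECTED_MD5 = "bfd09094d0e123e668b2e58332b1a98b"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large dumps need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dump(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Return True when the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected.lower()
```

    For example, `verify_dump(Path("GBooks_2015-06-09.sql"))` would return `True` only for a byte-identical copy of the published file.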

  7. BookSum Dataset

    • paperswithcode.com
    • tensorflow.org
    • +1more
    Updated Apr 9, 2024
    Cite
    Wojciech Kryściński; Nazneen Rajani; Divyansh Agarwal; Caiming Xiong; Dragomir Radev (2024). BookSum Dataset [Dataset]. https://paperswithcode.com/dataset/booksum
    Dataset updated
    Apr 9, 2024
    Authors
    Wojciech Kryściński; Nazneen Rajani; Divyansh Agarwal; Caiming Xiong; Dragomir Radev
    Description

    BookSum is a collection of datasets for long-form narrative summarization. This dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human-written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of this dataset pose a unique set of challenges for summarization systems, which include: processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures.

    BookSum contains summaries for 142,753 paragraphs, 12,293 chapters and 436 books.

  8. opus_books

    • huggingface.co
    Updated Mar 29, 2024
    Cite
    Language Technology Research Group at the University of Helsinki (2024). opus_books [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_books
    Dataset updated
    Mar 29, 2024
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for OPUS Books

      Dataset Summary
    

    This is a collection of copyright-free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php. Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the metadata at the top of the corpus files in XML). The multilingually aligned source is available from the same page.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.

  9. Goodreads Book Datasets With User Rating 2M

    • kaggle.com
    zip
    Updated Jul 9, 2020
    Cite
    Bahram Jannesar (2020). Goodreads Book Datasets With User Rating 2M [Dataset]. https://www.kaggle.com/bahramjannesarr/goodreads-book-datasets-10m
    Available download formats
    zip (368593554 bytes)
    Dataset updated
    Jul 9, 2020
    Authors
    Bahram Jannesar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Best quote ever:

    "Don't ever tell anybody anything. If you do, you start missing everybody." (J.D. Salinger)

    Story

    Everyone knows Goodreads: book lovers who want to buy a book first search for its title on this website and read the reviews and ratings available there. Do you know a better place to scrape book data from? Tell us at ba.jannesar@gmail.com or ghaderi.soroush1995@gmail.com. Goodreads is one of the best places for this job! 💯

    These datasets are well suited to two jobs:

    1. Creating a book recommendation system based on 10 M books 🥇
    2. Using the Description columns for NLP 🥈

    Github repo

    Project link on github or here.

    Content

    Approximately 10,000,000 books are available in the site's archives, and these datasets are collected from them. To send requests to the API, we used the Goodreads Python library. **Datasets will be updated every 2 days.**

    Acknowledgements

    This data was entirely scraped from the Goodreads API.

    Inspiration

    Do you know what NLP is? Download these datasets, then upvote 💯.

    Book Sample

    JSON:

    {
      "Id": "5107",
      "Name": "The Catcher in the Rye",
      "RatingDist1": "1:133165",
      "RatingDist2": "2:224884",
      "RatingDist3": "3:553476",
      "RatingDist4": "4:808278",
      "RatingDist5": "5:891037",
      "pagesNumber": 277,
      "RatingDistTotal": "total:2610840",
      "PublishMonth": 30,
      "PublishDay": 1,
      "Publisher": "Back Bay Books",
      "CountsOfReview": 44046,
      "PublishYear": 2001,
      "Language": "eng",
      "Authors": "J.D. Salinger",
      "Rating": 3.8,
      "ISBN": "0316769177",
      "Count of text reviews": 55539,
      "Description": "The hero-narrator of The Catcher in the Rye is an ancient child of sixteen, a native New Yorker named Holden Caulfield. Through circumstances that tend to preclude adult, secondhand description, he leaves his prep school in Pennsylvania and goes underground in New York City for three days. "
    }

    Or CSV:

    5107,The Catcher in the Rye,1:133165,277,4:808278,total:2610840,30,1,Back Bay Books,44046,2001,eng,J.D. Salinger,3.8,2:224884,5:891037,0316769177,3:553476,55539,"The hero-narrator of The Catcher in the Rye is an ancient child of sixteen, a native New Yorker named Holden Caulfield. Through circumstances that tend to preclude adult, secondhand description, he leaves his prep school in Pennsylvania and goes underground in New York City for three days. "
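
    The RatingDist fields in the sample encode each star count as a "stars:count" string, so they need to be split before any arithmetic. The sketch below, with hypothetical helper names, parses the five fields from the sample record and recomputes the total and mean rating as a consistency check against RatingDistTotal and Rating.

```python
from typing import Dict, Mapping


def rating_distribution(book: Mapping[str, object]) -> Dict[int, int]:
    """Turn RatingDist1..RatingDist5 ("stars:count" strings) into {stars: count}."""
    dist: Dict[int, int] = {}
    for stars in range(1, 6):
        label, count = str(book[f"RatingDist{stars}"]).split(":", 1)
        if int(label) != stars:
            raise ValueError(f"RatingDist{stars} has unexpected label {label!r}")
        dist[stars] = int(count)
    return dist


def mean_rating(dist: Mapping[int, int]) -> float:
    """Weighted average of star values; 0.0 when there are no ratings."""
    total = sum(dist.values())
    if total == 0:
        return 0.0
    return sum(stars * count for stars, count in dist.items()) / total


# Values copied from the sample record above.
book = {
    "RatingDist1": "1:133165",
    "RatingDist2": "2:224884",
    "RatingDist3": "3:553476",
    "RatingDist4": "4:808278",
    "RatingDist5": "5:891037",
    "RatingDistTotal": "total:2610840",
    "Rating": 3.8,
}
dist = rating_distribution(book)
total_ratings = sum(dist.values())
average = mean_rating(dist)
```

    For this record the recomputed total matches RatingDistTotal, and the average rounds to the stored Rating of 3.8.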

  10. Audible Dataset

    • kaggle.com
    Updated Apr 11, 2022
    Cite
    Snehangsu De (2022). Audible Dataset [Dataset]. https://www.kaggle.com/datasets/snehangsude/audible-dataset
    Dataset updated
    Apr 11, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Snehangsu De
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Introduction

    With the trend toward audiobooks growing, I gathered this data to understand how the audiobook market has been growing over the years. From authors of audiobooks to release dates, the data represents the important details of audiobooks from 1998 till 2025 (pre-planned releases).

    I have yet to find a great audiobooks dataset and hence the urge to make a dataset that provides us with information on the basics and the history of audiobooks. I look to improve the dataset with more details in the near future.

    File Information

    The uncleaned data, audible_uncleaned.csv, is exactly the raw data I derived from Audible.in. The cleaned data, audible_cleaned.csv, has had a few basic data cleaning steps applied.

    Libraries used

    The data was collected using web scraping, with:

    • re
    • Beautiful Soup
    • Selenium

    Beautiful Soup and Selenium were used together to gather most of the data. The code can be reused and is available here: https://github.com/snehangsude/audible_scraper

    Column Breakdown

    • name: Name of the audiobook
    • author: Author of the audiobook
    • narrator: Narrator of the audiobook
    • time: Length of the audiobook
    • releasedate: Release date of the audiobook
    • language: Language of the audiobook
    • stars: No. of stars the audiobook received
    • price: Price of the audiobook in INR
    • ratings: No. of reviews received by the audiobook
  11. Chess Book Dataset

    • universe.roboflow.com
    zip
    Updated Oct 27, 2023
    Cite
    Nada (2023). Chess Book Dataset [Dataset]. https://universe.roboflow.com/nada-mpxyo/chess-book
    Available download formats
    zip
    Dataset updated
    Oct 27, 2023
    Dataset authored and provided by
    Nada
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Pgn Board Bounding Boxes
    Description

    Chess Book

    ## Overview
    
    Chess Book is a dataset for object detection tasks - it contains Pgn Board annotations for 478 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  12. BookTest Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Mar 25, 2022
    Cite
    Ondrej Bajgar; Rudolf Kadlec; Jan Kleindienst (2022). BookTest Dataset [Dataset]. https://paperswithcode.com/dataset/booktest
    Dataset updated
    Mar 25, 2022
    Authors
    Ondrej Bajgar; Rudolf Kadlec; Jan Kleindienst
    Description

    BookTest is a new dataset similar to the popular Children’s Book Test (CBT), but more than 60 times larger.

  13. openbookqa

    • huggingface.co
    • paperswithcode.com
    • +1more
    Updated Mar 1, 2024
    Cite
    Ai2 (2024). openbookqa [Dataset]. https://huggingface.co/datasets/allenai/openbookqa
    Dataset updated
    Mar 1, 2024
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for OpenBookQA

      Dataset Summary
    

    OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. OpenBookQA is a new kind of… See the full description on the dataset page: https://huggingface.co/datasets/allenai/openbookqa.

  14. Oriented Books Dataset

    • universe.roboflow.com
    zip
    Updated Jul 4, 2024
    Cite
    koteitan (2024). Oriented Books Dataset [Dataset]. https://universe.roboflow.com/koteitan/oriented-books
    Available download formats
    zip
    Dataset updated
    Jul 4, 2024
    Dataset authored and provided by
    koteitan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Variables measured
    Book Bounding Boxes
    Description

    Oriented Books

    ## Overview
    
    Oriented Books is a dataset for object detection tasks - it contains Book annotations for 661 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [MIT license](https://opensource.org/licenses/MIT).
    
  15. Project Gutenberg Book Corpus

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). Project Gutenberg Book Corpus [Dataset]. https://www.opendatabay.com/data/ai-ml/0979850d-7ed8-4aeb-887d-4ad585d2f661
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Education & Learning Analytics
    Description

    This dataset is a collection of over 15,000 book texts, complete with their authors and titles. It has been compiled by scraping the Project Gutenberg website, specifically parsing its bookshelves. The dataset includes metadata such as titles, authors, categories (bookshelves), and download links for the book texts. Some books from Project Gutenberg are not included if they haven't been categorised. Notably, the dataset also retains audiobooks, offering flexibility for users interested in audio data alongside text.

    Columns

    The dataset primarily includes the following columns:

    • Title: The title of the book.
    • Author: The author of the book.
    • Link: The direct download link for the book's text.
    • Bookshelf: The category or genre assigned to the book on Project Gutenberg.
    • Text Data: The actual text content of the books, which can be downloaded using a provided script.

    Distribution

    The dataset's metadata is initially available in a gutenberg_metadata.csv file. The full text data for each book can be downloaded using a gutenberg_download.py script, which then saves the results into a CSV file. This final CSV file, containing the book texts, authors, titles, and categories, is approximately 5 GB in size. The corpus features more than 15,000 unique book texts.

    Usage

    This dataset is ideal for various applications in education and learning analytics. Specific use cases include:

    • Natural Language Processing (NLP) tasks, such as text analysis, topic modelling, and language understanding.
    • Literature studies and computational humanities research.
    • Developing and training AI and Machine Learning models on large text corpora.
    • Working with audio data, as some books are included as audiobooks.

    Coverage

    The dataset has a global region coverage, reflecting the diverse origins of books within Project Gutenberg. It focuses on books that have been categorised on the Project Gutenberg website; un-categorised books are not included. No specific time range or demographic scope is detailed in the available information.

    License

    CC-BY-SA

    Who Can Use It

    This dataset is suitable for:

    • Researchers and academics focusing on text analysis, literary studies, or digital humanities.
    • Data scientists and machine learning engineers building and testing NLP models.
    • Students undertaking projects in linguistics, computer science, or library science.
    • Developers creating applications that require a large corpus of literary texts.

    Dataset Name Suggestions

    • Project Gutenberg Book Corpus
    • Digital Literature Collection
    • Classic Book Text Dataset
    • Historical Text Library

    Attributes

    Original Data Source: 15000 Gutenberg Books

  16. Book Spines Test Dataset

    • universe.roboflow.com
    zip
    Updated Jul 7, 2023
    Cite
    Art Processors (2023). Book Spines Test Dataset [Dataset]. https://universe.roboflow.com/art-processors/book-spines-test
    Available download formats
    zip
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Art Processors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Book Spines Polygons
    Description

    Book Spines Test

    ## Overview
    
    Book Spines Test is a dataset for instance segmentation tasks - it contains Book Spines annotations for 835 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  17. Book Reading Dataset

    • universe.roboflow.com
    zip
    Updated May 4, 2024
    Cite
    tim (2024). Book Reading Dataset [Dataset]. https://universe.roboflow.com/tim-4ijf0/book-reading
    Available download formats
    zip
    Dataset updated
    May 4, 2024
    Dataset authored and provided by
    tim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Open Bounding Boxes
    Description

    Book Reading

    ## Overview
    
    Book Reading is a dataset for object detection tasks - it contains Open annotations for 357 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  18. goodbooks-10k

    • kaggle.com
    zip
    Updated Sep 2, 2017
    Cite
    Foxtrot (2017). goodbooks-10k [Dataset]. http://www.kaggle.com/zygmunt/goodbooks-10k?select=ratings.csv
    Available download formats
    zip (12155229 bytes)
    Dataset updated
    Sep 2, 2017
    Authors
    Foxtrot
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This version of the dataset is obsolete. It contains duplicate ratings (same user_id,book_id), as reported by Philipp Spachtholz in his illustrious notebook.

    The current version has duplicates removed, and more ratings (six million), sorted by time. Book and user IDs are the same.

    **It is available at https://github.com/zygmuntz/goodbooks-10k.**

    There have been good datasets for movies (Netflix, Movielens) and music (Million Songs) recommendation, but not for books. That is, until now.

    This dataset contains ratings for ten thousand popular books. As to the source, let's say that these ratings were found on the internet. Generally, there are 100 reviews for each book, although some have less - fewer - ratings. Ratings go from one to five.

    Both book IDs and user IDs are contiguous. For books, they are 1-10000, for users, 1-53424. All users have made at least two ratings. Median number of ratings per user is 8.

    There are also books marked to read by the users, book metadata (author, year, etc.) and tags.

    Contents

    ratings.csv contains ratings and looks like this:

    book_id,user_id,rating
    1,314,5
    1,439,3
    1,588,5
    1,1169,4
    1,1185,4
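
    Since this version of ratings.csv is described as containing duplicate ratings for the same (user_id, book_id) pair, a quick check before modelling may be worthwhile. The sketch below uses a hypothetical `duplicate_pairs` helper. The first five sample rows are the ones shown above, and the last row is made up to trigger a duplicate.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple


def duplicate_pairs(rows: Iterable[Mapping[str, str]]) -> Dict[Tuple[int, int], int]:
    """Return {(user_id, book_id): occurrences} for pairs rated more than once."""
    counts = Counter((int(row["user_id"]), int(row["book_id"])) for row in rows)
    return {pair: n for pair, n in counts.items() if n > 1}


sample = io.StringIO(
    "book_id,user_id,rating\n"
    "1,314,5\n"
    "1,439,3\n"
    "1,588,5\n"
    "1,1169,4\n"
    "1,1185,4\n"
    "1,314,4\n"  # made-up row: user 314 rates book 1 a second time
)
duplicates = duplicate_pairs(csv.DictReader(sample))
```

    How to resolve a duplicate (keep the first rating, the last, or an average) is a modelling decision that the description does not settle.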
    

    to_read.csv provides IDs of the books marked "to read" by each user, as user_id,book_id pairs.

    books.csv has metadata for each book (goodreads IDs, authors, title, average rating, etc.).

    The metadata have been extracted from goodreads XML files, available in the third version of this dataset as books_xml.tar.gz. The archive contains 10000 XML files. One of them is available as sample_book.xml. To make the download smaller, these files are absent from the current version. Download version 3 if you want them.

    book_tags.csv contains tags/shelves/genres assigned by users to books. Tags in this file are represented by their IDs.

    tags.csv translates tag IDs to names.

    See the notebook for some basic stats of the dataset.

    goodreads IDs

    Each book may have many editions. goodreads_book_id and best_book_id generally point to the most popular edition of a given book, while goodreads work_id refers to the book in the abstract sense.

    You can use the goodreads book and work IDs to create URLs as follows:

    https://www.goodreads.com/book/show/2767052
    https://www.goodreads.com/work/editions/2792775
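
    The two URL patterns above can be wrapped in small helpers. This is only an illustration of the patterns as quoted. The function names are hypothetical, and whether these URLs still resolve on goodreads.com is not verified here.

```python
def book_url(goodreads_book_id: int) -> str:
    """URL of a specific edition, built from goodreads_book_id or best_book_id."""
    return f"https://www.goodreads.com/book/show/{int(goodreads_book_id)}"


def work_editions_url(work_id: int) -> str:
    """URL listing all editions of a work, built from the goodreads work_id."""
    return f"https://www.goodreads.com/work/editions/{int(work_id)}"


# IDs taken from the example URLs above.
example_book = book_url(2767052)
example_work = work_editions_url(2792775)
```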

  19. A dataset containing the table of contents of 56K ebook titles extracted...

    • ieee-dataport.org
    Updated May 18, 2022
    Eleni Giannopoulou (2022). A dataset containing the table of contents of 56K ebook titles extracted from Springer [Dataset]. https://ieee-dataport.org/open-access/dataset-containing-table-contents-56k-ebook-titles-extracted-springer
    Explore at:
    Dataset updated
    May 18, 2022
    Authors
    Eleni Giannopoulou
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    title

  20. Amazon review data 2018

    • cseweb.ucsd.edu
    • nijianmo.github.io
    • +1 more
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    UCSD CSE Research Project, Amazon review data 2018 [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/
    Explore at:
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    Context

    This Dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:

    • More reviews:

      • The total number of reviews is 233.1 million (142.8 million in 2014).
    • New reviews:

      • Current data includes reviews in the range May 1996 - Oct 2018.
    • Metadata:

      • We have added transaction metadata for each review shown on the review page.
      • Added more detailed metadata of the product landing page.

    Acknowledgements

    If you publish articles based on this dataset, please cite the following paper:

    • Jianmo Ni, Jiacheng Li, Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. EMNLP, 2019.
Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096

Best Books Ever Dataset

Explore at:
3 scholarly articles cite this dataset (View in Google Scholar)
csv (available download formats)
Dataset updated
Nov 10, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Lorena Casanova Lozano; Sergio Costa Planells
License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

The dataset was collected as part of Prac1 for the course Typology and Data Life Cycle in the Master's Degree in Data Science at the Universitat Oberta de Catalunya (UOC).

The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset

The data was retrieved in two sets: the first 30000 books, then the remaining 22478. Dates were not parsed and reformatted in the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30000 records and in Month Day Year format for the rest.
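
Because the two chunks use different date formats, a loader may want to try both. The sketch below is one possible approach, not the authors' code: it tries mm/dd/yyyy first, then a Month Day Year form. Stripping ordinal suffixes ("14th") and commas is an assumption about how the second chunk might be written; `parse_publish_date` is a hypothetical helper name, and `%B` expects English month names under the default locale.

```python
import re
from datetime import date, datetime

# Assumption: days in the second chunk may carry ordinal suffixes such as "14th".
_ORDINAL = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def parse_publish_date(raw):
    """Return a date for either documented format, or None if neither fits."""
    text = raw.strip()
    if not text:
        return None
    try:
        # First 30000 records: mm/dd/yyyy
        return datetime.strptime(text, "%m/%d/%Y").date()
    except ValueError:
        pass
    # Remaining records: Month Day Year, after dropping suffixes and commas.
    cleaned = _ORDINAL.sub("", text).replace(",", "")
    try:
        return datetime.strptime(cleaned, "%B %d %Y").date()
    except ValueError:
        return None


print(parse_publish_date("09/14/2008"))
print(parse_publish_date("September 14th 2008"))
```

Returning `None` for unparseable values keeps the incomplete firstPublishDate field from raising errors during a bulk load.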

Book cover images can optionally be downloaded from the URL in the 'coverImg' field. Python code for doing so, along with an example, can be found in the GitHub repo.

The 25 fields of the dataset are:

| Attributes | Definition | Completeness |
| ------------- | ------------- | ------------- | 
| bookId | Book Identifier as in goodreads.com | 100 |
| title | Book title | 100 |
| series | Series Name | 45 |
| author | Book's Author | 100 |
| rating | Global goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (ex. Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Publisher | 93 |
| publishDate | Publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |
