100+ datasets found

Best Books Ever Dataset
zenodo.org
csv
Updated Nov 10, 2020
Cite
Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.4265096
Dataset updated
Nov 10, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Lorena Casanova Lozano; Sergio Costa Planells
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The dataset was collected as part of Prac1 of the course Typology and Data Life Cycle in the Master's Degree in Data Science at the Universitat Oberta de Catalunya (UOC).

The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset

The data was retrieved in two sets: the first 30000 books and then the remaining 22478. Dates were not parsed and reformatted in the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30000 records and in Month Day Year format for the rest.
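
Because the two chunks use different date representations, a loader has to try both. The following is a minimal sketch, not the authors' code: `parse_publish_date` is a hypothetical helper, the second format is assumed to look like "September 14 2008", and the handling of ordinal suffixes such as "14th" is an extra assumption in case the file contains them.

```python
import re
from datetime import date, datetime
from typing import Optional

# Formats named in the description; their exact spelling in the file is an assumption.
_FORMATS = ("%m/%d/%Y", "%B %d %Y")


def parse_publish_date(raw: str) -> Optional[date]:
    """Return a date for either representation, or None if neither matches."""
    # Drop ordinal suffixes such as "14th" in case they occur (assumption).
    text = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", raw.strip())
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None
```

Returning None rather than raising keeps the many records without a firstPublishDate value from aborting a bulk load.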

Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.

The 25 fields of the dataset are:

| Attributes | Definition | Completeness (%) |
| ------------- | ------------- | ------------- |
| bookId | Book identifier as in goodreads.com | 100 |
| title | Book title | 100 |
| series | Series name | 45 |
| author | Book's author | 100 |
| rating | Global Goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (e.g. Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Publisher | 93 |
| publishDate | Publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |
BookCorpus Dataset
paperswithcode.com
opendatalab.com
Cite
Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler, BookCorpus Dataset [Dataset]. https://paperswithcode.com/dataset/bookcorpus
Authors
Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler
Description
BookCorpus is a large collection of free novel books written by unpublished authors, which contains 11,038 books (around 74M sentences and 1G words) of 16 different sub-genres (e.g., Romance, Historical, Adventure, etc.).
blbooks
huggingface.co
Updated Jan 15, 1996
Cite
British Library (1996). blbooks [Dataset]. https://huggingface.co/datasets/TheBritishLibrary/blbooks
Dataset updated
Jan 15, 1996
Dataset authored and provided by
British Library
License
https://choosealicense.com/licenses/cc0-1.0/
Description
A dataset comprising text created by OCR from 49,455 digitised books, equating to 65,227 volumes (25+ million pages), published between c. 1510 and c. 1900. The books cover a wide range of subject areas including philosophy, history, poetry and literature.
Goodreads Book Reviews
kaggle.com
Updated Oct 30, 2023
Cite
Ahmad (2023). Goodreads Book Reviews [Dataset]. https://www.kaggle.com/datasets/pypiahmad/goodreads-book-reviews1/code
Dataset updated
Oct 30, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ahmad
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
The Goodreads Book Reviews dataset encapsulates a wealth of reviews and various attributes concerning the books listed on the Goodreads platform. A distinguishing feature of this dataset is its capture of multiple tiers of user interaction, ranging from adding a book to a "shelf", to rating and reading it. This dataset is a treasure trove for those interested in understanding user behavior, book recommendations, sentiment analysis, and the interplay between various attributes of books and user interactions.

Basic Statistics:
- Items: 1,561,465
- Users: 808,749
- Interactions: 225,394,930

Metadata:
- Reviews: The text of the reviews provided by users.
- Add-to-shelf, Read, Review Actions: Various interactions users have with the books.
- Book Attributes: Attributes describing the books, including title and ISBN.
- Graph of Similar Books: A graph depicting similarity relations between books.

Example (interaction data):

{
  "user_id": "8842281e1d1347389f2ab93d60773d4d",
  "book_id": "130580",
  "review_id": "330f9c153c8d3347eb914c06b89c94da",
  "isRead": true,
  "rating": 4,
  "date_added": "Mon Aug 01 13:41:57 -0700 2011",
  "date_updated": "Mon Aug 01 13:42:41 -0700 2011",
  "read_at": "Fri Jan 01 00:00:00 -0800 1988",
  "started_at": ""
}
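
The timestamps in the interaction record are strings, and unset ones are empty. As a rough sketch of how such a record might be decoded, the snippet below converts them to timezone-aware datetimes; `parse_interaction` and the format string `GOODREADS_TS` are hypothetical names, and the layout is inferred only from the sample record above.

```python
import json
from datetime import datetime

# Layout inferred from the sample record, e.g. "Mon Aug 01 13:41:57 -0700 2011".
GOODREADS_TS = "%a %b %d %H:%M:%S %z %Y"


def parse_interaction(line: str) -> dict:
    """Decode one JSON record and convert non-empty timestamp fields to datetimes."""
    record = json.loads(line)
    for key in ("date_added", "date_updated", "read_at", "started_at"):
        value = record.get(key, "")
        # Empty strings mean "not set" in the sample, so map them to None.
        record[key] = datetime.strptime(value, GOODREADS_TS) if value else None
    return record


sample = (
    '{"user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "130580", '
    '"isRead": true, "rating": 4, '
    '"date_added": "Mon Aug 01 13:41:57 -0700 2011", '
    '"date_updated": "Mon Aug 01 13:42:41 -0700 2011", '
    '"read_at": "Fri Jan 01 00:00:00 -0800 1988", "started_at": ""}'
)
record = parse_interaction(sample)
```

With the fields parsed, intervals such as `record["date_updated"] - record["date_added"]` can be computed directly.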

Use Cases:
- Book Recommendations: Creating personalized book recommendations based on user interactions and preferences.
- Sentiment Analysis: Analyzing sentiment in reviews and understanding how different book attributes influence sentiment.
- User Behavior Analysis: Understanding user interaction patterns with books and deriving insights to enhance user engagement.
- Natural Language Processing: Training models to process and analyze user-generated text in reviews.
- Similarity Analysis: Analyzing the graph of similar books to understand book similarities and clustering.

Citation: Please cite the following if you use the data: Mengting Wan, Julian McAuley. Item recommendation on monotonic behavior chains. RecSys, 2018. PDF: https://cseweb.ucsd.edu/~jmcauley/pdfs/recsys18e.pdf

Code Samples: A curated set of code samples is provided in the dataset's GitHub repository. These include:
- Downloading datasets without GUI: Facilitating dataset download in a non-GUI environment.
- Displaying Sample Records: Showcasing sample records to get a glimpse of the dataset structure.
- Calculating Basic Statistics: Computing basic statistics to understand the dataset's distribution and characteristics.
- Exploring the Interaction Data: Delving into interaction data to grasp user-book interaction patterns.
- Exploring the Review Data: Analyzing review data to extract insights from user reviews.

Additional Dataset:
- Complete book reviews (~15m multilingual reviews about ~2m books and 465k users): A comprehensive, multilingual collection of reviews about around 2 million books from 465,000 users.

Datasets:

Meta-Data of Books:

Detailed Book Graph (goodreads_books.json.gz): A comprehensive graph detailing around 2.3 million books, acting as a rich source of book attributes and metadata.

Detailed Information of Authors (goodreads_book_authors.json.gz):

An extensive dataset containing detailed information about book authors, essential for understanding author-centric trends and insights.

Detailed Information of Works (goodreads_book_works.json.gz):

This dataset provides abstract information about a book disregarding any particular editions, facilitating a high-level understanding of each work.

Detailed Information of Book Series (goodreads_book_series.json.gz):

A dataset encompassing detailed information about book series, aiding in understanding series-related trends and insights. Note that the series id included here cannot be used for URL hack.

Extracted Fuzzy Book Genres (goodreads_book_genres_initial.json....
Book Dataset
universe.roboflow.com
zip
Updated Oct 9, 2024
Cite
kwsr (2024). Book Dataset [Dataset]. https://universe.roboflow.com/kwsr/book-gtby9/dataset/2
Available download formats: zip
Dataset updated
Oct 9, 2024
Dataset authored and provided by
kwsr
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Name Bounding Boxes
Description
Book

Overview

Book is a dataset for object detection tasks. It contains Name annotations for 4,300 images.

Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

License

This dataset is available under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
Books Dataset
figshare.com
txt
Updated Jan 19, 2016
Cite
Giuseppe Mendola (2016). Books Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.1441255.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.1441255.v1
Dataset updated
Jan 19, 2016
Dataset provided by
Figshare (http://figshare.com/)
Authors
Giuseppe Mendola
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This database contains information about books gathered with the help of the Google Books API. It contains 7 tables, 3 of which exist only to relate the other tables to each other. Tables:

- Books: 1062 records
- Authors: 1595 records
- Categories: 109 records
- Metadata: 37 records

MD5 (GBooks_2015-06-09.sql) = bfd09094d0e123e668b2e58332b1a98b
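
The published MD5 value lets a downloaded copy of the SQL dump be checked for corruption. The snippet below is a small sketch of such a check using Python's standard `hashlib`; the helper names `md5_of_file` and `is_intact` are hypothetical, and the default path assumes the file keeps the name quoted in the description.

```python
import hashlib

# Checksum quoted in the dataset description.
EXPECTED_MD5 = "bfd09094d0e123e668b2e58332b1a98b"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large dumps need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str = "GBooks_2015-06-09.sql") -> bool:
    """True when the local file matches the published checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

MD5 is adequate here for detecting accidental corruption; it is not a defence against deliberate tampering.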
BookSum Dataset
paperswithcode.com
tensorflow.org
Updated Apr 9, 2024
Cite
Wojciech Kryściński; Nazneen Rajani; Divyansh Agarwal; Caiming Xiong; Dragomir Radev (2024). BookSum Dataset [Dataset]. https://paperswithcode.com/dataset/booksum
Dataset updated
Apr 9, 2024
Authors
Wojciech Kryściński; Nazneen Rajani; Divyansh Agarwal; Caiming Xiong; Dragomir Radev
Description
BookSum is a collection of datasets for long-form narrative summarization. This dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human-written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of this dataset pose a unique set of challenges for summarization systems, which include processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures.

BookSum contains summaries for 142,753 paragraphs, 12,293 chapters and 436 books.
opus_books
huggingface.co
Updated Mar 29, 2024
Cite
Language Technology Research Group at the University of Helsinki (2024). opus_books [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_books
Dataset updated
Mar 29, 2024
Dataset authored and provided by
Language Technology Research Group at the University of Helsinki
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for OPUS Books

Dataset Summary

This is a collection of copyright-free books aligned by Andras Farkas, available from http://www.farkastranslations.com/bilingual_books.php. Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the metadata at the top of the corpus files in XML). The source is multilingually aligned.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.
Goodreads Book Datasets With User Rating 2M
kaggle.com
zip
Updated Jul 9, 2020
Cite
Bahram Jannesar (2020). Goodreads Book Datasets With User Rating 2M [Dataset]. https://www.kaggle.com/bahramjannesarr/goodreads-book-datasets-10m
Available download formats: zip (368593554 bytes)
Dataset updated
Jul 9, 2020
Authors
Bahram Jannesar
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Best quote ever:

Don't ever tell anybody anything. If you do, you start missing everybody. (J.D. Salinger)

Story

Everyone knows Goodreads. When book lovers want to buy a book, they first search for its title on the site and read the reviews and ratings available there. Do you know a better place to scrape this kind of data? Tell us: ba.jannesar@gmail.com or ghaderi.soroush1995@gmail.com. Goodreads is one of the best places for this job! 💯

These datasets are well suited to two jobs:

1. Creating a book recommendation system based on 10 M books 🥇
2. Using the Description columns for NLP 🥈

Github repo

Project link on github or here.

Content

Approximately 10,000,000 books are available in the site's archives, and these datasets are collected from them. To make requests to the API, we used the Goodreads Python library. Datasets will be updated every 2 days.

Acknowledgements

This data was scraped entirely from the Goodreads API.

Inspiration

Do you know what NLP is? Download these datasets, then upvote 💯.

Book Sample

JSON:

{
  "Id": "5107",
  "Name": "The Catcher in the Rye",
  "RatingDist1": "1:133165",
  "RatingDist2": "2:224884",
  "RatingDist3": "3:553476",
  "RatingDist4": "4:808278",
  "RatingDist5": "5:891037",
  "pagesNumber": 277,
  "RatingDistTotal": "total:2610840",
  "PublishMonth": 30,
  "PublishDay": 1,
  "Publisher": "Back Bay Books",
  "CountsOfReview": 44046,
  "PublishYear": 2001,
  "Language": "eng",
  "Authors": "J.D. Salinger",
  "Rating": 3.8,
  "ISBN": "0316769177",
  "Count of text reviews": 55539,
  "Description": "The hero-narrator of The Catcher in the Rye is an ancient child of sixteen, a native New Yorker named Holden Caulfield. Through circumstances that tend to preclude adult, secondhand description, he leaves his prep school in Pennsylvania and goes underground in New York City for three days. "
}

Or CSV:

5107,The Catcher in the Rye,1:133165,277,4:808278,total:2610840,30,1,Back Bay Books,44046,2001,eng,J.D. Salinger,3.8,2:224884,5:891037,0316769177,3:553476,55539,"The hero-narrator of The Catcher in the Rye is an ancient child of sixteen, a native New Yorker named Holden Caulfield. Through circumstances that tend to preclude adult, secondhand description, he leaves his prep school in Pennsylvania and goes underground in New York City for three days. "
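
The RatingDist fields in the sample pack a star value and a count into one string ("5:891037"), so they need splitting before any arithmetic. A minimal sketch, with hypothetical helper names `parse_rating_dist` and `mean_rating` and only the rating fields of the sample record:

```python
def parse_rating_dist(record: dict) -> dict:
    """Map each star value (1-5) to its count, from fields like "RatingDist5": "5:891037"."""
    counts = {}
    for stars in range(1, 6):
        label, count = record[f"RatingDist{stars}"].split(":")
        counts[int(label)] = int(count)
    return counts


def mean_rating(counts: dict) -> float:
    """Weighted average of the star values."""
    total = sum(counts.values())
    return sum(stars * n for stars, n in counts.items()) / total


sample = {
    "RatingDist1": "1:133165",
    "RatingDist2": "2:224884",
    "RatingDist3": "3:553476",
    "RatingDist4": "4:808278",
    "RatingDist5": "5:891037",
    "RatingDistTotal": "total:2610840",
}
counts = parse_rating_dist(sample)
```

For this record the five counts add up to the value in RatingDistTotal, and the weighted mean rounds to the 3.8 stored in the Rating field, which makes the pair a convenient consistency check on other rows.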
Audible Dataset
kaggle.com
Updated Apr 11, 2022
Cite
Snehangsu De (2022). Audible Dataset [Dataset]. https://www.kaggle.com/datasets/snehangsude/audible-dataset
Dataset updated
Apr 11, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Snehangsu De
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Introduction

With the trend toward audiobooks growing, I gathered this data to understand how the audiobook market has been growing over the years. From authors of audiobooks to release dates, the data represents the important details of audiobooks from 1998 till 2025 (pre-planned releases).

I have yet to find a great audiobooks dataset and hence the urge to make a dataset that provides us with information on the basics and the history of audiobooks. I look to improve the dataset with more details in the near future.

File Information

The uncleaned data, audible_uncleaned.csv, is exactly the raw data I derived from Audible.in. The cleaned file, audible_cleaned.csv, has had a few basic data-cleaning steps applied.

Libraries used

The data was collected using web scraping, with the following libraries:

- re
- Beautiful Soup
- Selenium

Beautiful Soup and Selenium were used in unison to mainly gather the data. The code can be re-used and you can find the code here: https://github.com/snehangsude/audible_scraper

Column Breakdown

name: Name of the audiobook

author: Author of the audiobook

narrator: Narrator of the audiobook

time: Length of the audiobook

releasedate: Release date of the audiobook

language: Language of the audiobook

stars: No. of stars the audiobook received

price: Price of the audiobook in INR

ratings: No. of reviews received by the audiobook
Chess Book Dataset
universe.roboflow.com
zip
Updated Oct 27, 2023
Cite
Nada (2023). Chess Book Dataset [Dataset]. https://universe.roboflow.com/nada-mpxyo/chess-book
Available download formats: zip
Dataset updated
Oct 27, 2023
Dataset authored and provided by
Nada
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Pgn Board Bounding Boxes
Description
Chess Book

Overview

Chess Book is a dataset for object detection tasks. It contains Pgn Board annotations for 478 images.

Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

License

This dataset is available under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
BookTest Dataset
paperswithcode.com
opendatalab.com
Updated Mar 25, 2022
Cite
Ondrej Bajgar; Rudolf Kadlec; Jan Kleindienst (2022). BookTest Dataset [Dataset]. https://paperswithcode.com/dataset/booktest
Dataset updated
Mar 25, 2022
Authors
Ondrej Bajgar; Rudolf Kadlec; Jan Kleindienst
Description
BookTest is a new dataset similar to the popular Children’s Book Test (CBT), however more than 60 times larger.
openbookqa
huggingface.co
paperswithcode.com
Updated Mar 1, 2024
Cite
Ai2 (2024). openbookqa [Dataset]. https://huggingface.co/datasets/allenai/openbookqa
Dataset updated
Mar 1, 2024
Dataset provided by
Allen Institute for AI (http://allenai.org/)
Authors
Ai2
License
https://choosealicense.com/licenses/unknown/
Description
Dataset Card for OpenBookQA

Dataset Summary

OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. OpenBookQA is a new kind of… See the full description on the dataset page: https://huggingface.co/datasets/allenai/openbookqa.
Oriented Books Dataset
universe.roboflow.com
zip
Updated Jul 4, 2024
Cite
koteitan (2024). Oriented Books Dataset [Dataset]. https://universe.roboflow.com/koteitan/oriented-books
Available download formats: zip
Dataset updated
Jul 4, 2024
Dataset authored and provided by
koteitan
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Variables measured
Book Bounding Boxes
Description
Oriented Books

Overview

Oriented Books is a dataset for object detection tasks. It contains Book annotations for 661 images.

Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

License

This dataset is available under the MIT license (https://opensource.org/licenses/MIT).
Project Gutenberg Book Corpus
opendatabay.com
Updated Jul 3, 2025
Cite
Datasimple (2025). Project Gutenberg Book Corpus [Dataset]. https://www.opendatabay.com/data/ai-ml/0979850d-7ed8-4aeb-887d-4ad585d2f661
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Datasimple
Area covered
Education & Learning Analytics
Description
This dataset is a collection of over 15,000 book texts, complete with their authors and titles. It has been compiled by scraping the Project Gutenberg website, specifically parsing its bookshelves. The dataset includes metadata such as titles, authors, categories (bookshelves), and download links for the book texts. Some books from Project Gutenberg are not included if they haven't been categorised. Notably, the dataset also retains audiobooks, offering flexibility for users interested in audio data alongside text.

Columns

The dataset primarily includes the following columns:

Title: The title of the book.

Author: The author of the book.

Link: The direct download link for the book's text.

Bookshelf: The category or genre assigned to the book on Project Gutenberg.

Text Data: The actual text content of the books, which can be downloaded using a provided script.

Distribution

The dataset's metadata is initially available in a gutenberg_metadata.csv file. The full text data for each book can be downloaded using a gutenberg_download.py script, which then saves the results into a CSV file. This final CSV file, containing the book texts, authors, titles, and categories, is approximately 5 GB in size. The corpus features more than 15,000 unique book texts.

Usage

This dataset is ideal for various applications in education and learning analytics. Specific use cases include:

Natural Language Processing (NLP) tasks, such as text analysis, topic modelling, and language understanding.

Literature studies and computational humanities research.

Developing and training AI and Machine Learning models on large text corpora.

Working with audio data, as some books are included as audiobooks.

Coverage

The dataset has a global region coverage, reflecting the diverse origins of books within Project Gutenberg. It focuses on books that have been categorised on the Project Gutenberg website; un-categorised books are not included. No specific time range or demographic scope is detailed in the available information.

License

CC-BY-SA

Who Can Use It

This dataset is suitable for:

Researchers and academics focusing on text analysis, literary studies, or digital humanities.

Data scientists and machine learning engineers building and testing NLP models.

Students undertaking projects in linguistics, computer science, or library science.

Developers creating applications that require a large corpus of literary texts.

Dataset Name Suggestions

Project Gutenberg Book Corpus

Digital Literature Collection

Classic Book Text Dataset

Historical Text Library

Attributes

Original Data Source: 15000 Gutenberg Books
Book Spines Test Dataset
universe.roboflow.com
zip
Updated Jul 7, 2023
Cite
Art Processors (2023). Book Spines Test Dataset [Dataset]. https://universe.roboflow.com/art-processors/book-spines-test
Available download formats: zip
Dataset updated
Jul 7, 2023
Dataset authored and provided by
Art Processors
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Book Spines Polygons
Description
Book Spines Test

Overview

Book Spines Test is a dataset for instance segmentation tasks. It contains Book Spines annotations for 835 images.

Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

License

This dataset is available under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
Book Reading Dataset
universe.roboflow.com
zip
Updated May 4, 2024
Cite
tim (2024). Book Reading Dataset [Dataset]. https://universe.roboflow.com/tim-4ijf0/book-reading
Available download formats: zip
Dataset updated
May 4, 2024
Dataset authored and provided by
tim
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Open Bounding Boxes
Description
Book Reading

Overview

Book Reading is a dataset for object detection tasks. It contains Open annotations for 357 images.

Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

License

This dataset is available under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
goodbooks-10k
kaggle.com
zip
Updated Sep 2, 2017
Cite
Foxtrot (2017). goodbooks-10k [Dataset]. http://www.kaggle.com/zygmunt/goodbooks-10k?select=ratings.csv
Available download formats: zip (12155229 bytes)
Dataset updated
Sep 2, 2017
Authors
Foxtrot
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This version of the dataset is obsolete. It contains duplicate ratings (same user_id,book_id), as reported by Philipp Spachtholz in his illustrious notebook.

The current version has duplicates removed, and more ratings (six million), sorted by time. Book and user IDs are the same.

It is available at https://github.com/zygmuntz/goodbooks-10k.

There have been good datasets for movies (Netflix, Movielens) and music (Million Songs) recommendation, but not for books. That is, until now.

This dataset contains ratings for ten thousand popular books. As to the source, let's say that these ratings were found on the internet. Generally, there are 100 reviews for each book, although some have less - fewer - ratings. Ratings go from one to five.

Both book IDs and user IDs are contiguous. For books, they are 1-10000, for users, 1-53424. All users have made at least two ratings. Median number of ratings per user is 8.

There are also books marked to read by the users, book metadata (author, year, etc.) and tags.

Contents

ratings.csv contains ratings and looks like that:

book_id,user_id,rating
1,314,5
1,439,3
1,588,5
1,1169,4
1,1185,4
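
Since this version is reported to contain duplicate ratings for the same (user_id, book_id) pair, a quick check before use is worthwhile. The sketch below uses only the standard library and the column names shown in the sample; `duplicate_pairs` is a hypothetical helper, not part of the dataset's own tooling.

```python
import csv
import io
from collections import Counter


def duplicate_pairs(csv_text: str) -> list:
    """Return the (user_id, book_id) pairs that occur more than once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = Counter((row["user_id"], row["book_id"]) for row in reader)
    return [pair for pair, n in seen.items() if n > 1]


sample = """book_id,user_id,rating
1,314,5
1,439,3
1,588,5
1,1169,4
1,1185,4
"""
duplicates = duplicate_pairs(sample)  # the five sample rows contain no duplicates
```

For the full file, the same function can be fed the contents of ratings.csv, or the reader can be pointed at an open file handle instead of a string.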

to_read.csv provides IDs of the books marked "to read" by each user, as user_id,book_id pairs.

books.csv has metadata for each book (goodreads IDs, authors, title, average rating, etc.).

The metadata have been extracted from goodreads XML files, available in the third version of this dataset as books_xml.tar.gz. The archive contains 10000 XML files. One of them is available as sample_book.xml. To make the download smaller, these files are absent from the current version. Download version 3 if you want them.

book_tags.csv contains tags/shelves/genres assigned by users to books. Tags in this file are represented by their IDs.

tags.csv translates tag IDs to names.

See the notebook for some basic stats of the dataset.

goodreads IDs

Each book may have many editions. goodreads_book_id and best_book_id generally point to the most popular edition of a given book, while goodreads work_id refers to the book in the abstract sense.

You can use the goodreads book and work IDs to create URLs as follows:

https://www.goodreads.com/book/show/2767052
https://www.goodreads.com/work/editions/2792775
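
The two URL patterns above can be generated from the ID columns in books.csv. A tiny sketch, with hypothetical function names:

```python
BOOK_URL = "https://www.goodreads.com/book/show/{}"
WORK_URL = "https://www.goodreads.com/work/editions/{}"


def book_url(goodreads_book_id: int) -> str:
    """URL of a specific edition, from goodreads_book_id or best_book_id."""
    return BOOK_URL.format(goodreads_book_id)


def work_url(work_id: int) -> str:
    """URL listing the editions of a work, from the goodreads work_id."""
    return WORK_URL.format(work_id)
```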
A dataset containing the table of contents of 56K ebook titles extracted...
ieee-dataport.org
Updated May 18, 2022
Cite
Eleni Giannopoulou (2022). A dataset containing the table of contents of 56K ebook titles extracted from Springer [Dataset]. https://ieee-dataport.org/open-access/dataset-containing-table-contents-56k-ebook-titles-extracted-springer
Explore at:
Dataset updated
May 18, 2022
Authors
Eleni Giannopoulou
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
title
Amazon review data 2018
cseweb.ucsd.edu
nijianmo.github.io
Cite
UCSD CSE Research Project, Amazon review data 2018 [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/
Dataset authored and provided by
UCSD CSE Research Project
Description
Context

This Dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:

- More reviews: The total number of reviews is 233.1 million (142.8 million in 2014).
- New reviews: Current data includes reviews in the range May 1996 - Oct 2018.
- Metadata: We have added transaction metadata for each review shown on the review page, and more detailed metadata of the product landing page.

Acknowledgements

If you publish articles based on this dataset, please cite the following paper:

Jianmo Ni, Jiacheng Li, Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. EMNLP, 2019.
