Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The dataset has been collected in the frame of the Prac1 of the subject Tipology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).
The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the larges list on the site).
Original code used to retrieve the dataset can be found on github repository: github.com/scostap/goodreads_bbe_dataset
The data was retrieved in two sets, the first 30000 books and then the remainig 22478. Dates were not parsed and reformated on the second chunk so publishDate and firstPublishDate are representet in a mm/dd/yyyy format for the first 30000 records and Month Day Year for the rest.
Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.
The 25 fields of the dataset are:
| Attributes | Definition | Completeness |
| ------------- | ------------- | ------------- |
| bookId | Book Identifier as in goodreads.com | 100 |
| title | Book title | 100 |
| series | Series Name | 45 |
| author | Book's Author | 100 |
| rating | Global goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (ex. Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Editorial | 93 |
| publishDate | publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 starts (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |
BookCorpus is a large collection of free novel books written by unpublished authors, which contains 11,038 books (around 74M sentences and 1G words) of 16 different sub-genres (e.g., Romance, Historical, Adventure, etc.).
https://choosealicense.com/licenses/cc0-1.0/https://choosealicense.com/licenses/cc0-1.0/
A dataset comprising of text created by OCR from the 49,455 digitised books, equating to 65,227 volumes (25+ million pages), published between c. 1510 - c. 1900. The books cover a wide range of subject areas including philosophy, history, poetry and literature.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Goodreads Book Reviews dataset encapsulates a wealth of reviews and various attributes concerning the books listed on the Goodreads platform. A distinguishing feature of this dataset is its capture of multiple tiers of user interaction, ranging from adding a book to a "shelf", to rating and reading it. This dataset is a treasure trove for those interested in understanding user behavior, book recommendations, sentiment analysis, and the interplay between various attributes of books and user interactions.
Basic Statistics: - Items: 1,561,465 - Users: 808,749 - Interactions: 225,394,930
Metadata: - Reviews: The text of the reviews provided by users. - Add-to-shelf, Read, Review Actions: Various interactions users have with the books. - Book Attributes: Attributes describing the books including title, and ISBN. - Graph of Similar Books: A graph depicting similarity relations between books.
Example (interaction data):
json
{
"user_id": "8842281e1d1347389f2ab93d60773d4d",
"book_id": "130580",
"review_id": "330f9c153c8d3347eb914c06b89c94da",
"isRead": true,
"rating": 4,
"date_added": "Mon Aug 01 13:41:57 -0700 2011",
"date_updated": "Mon Aug 01 13:42:41 -0700 2011",
"read_at": "Fri Jan 01 00:00:00 -0800 1988",
"started_at": ""
}
Use Cases: - Book Recommendations: Creating personalized book recommendations based on user interactions and preferences. - Sentiment Analysis: Analyzing sentiment in reviews and understanding how different book attributes influence sentiment. - User Behavior Analysis: Understanding user interaction patterns with books and deriving insights to enhance user engagement. - Natural Language Processing: Training models to process and analyze user-generated text in reviews. - Similarity Analysis: Analyzing the graph of similar books to understand book similarities and clustering.
Citation:
Please cite the following if you use the data:
Item recommendation on monotonic behavior chains
Mengting Wan, Julian McAuley
RecSys, 2018
[PDF](https://cseweb.ucsd.edu/~jmcauley/pdfs/recsys18e.pdf)
Code Samples: A curated set of code samples is provided in the dataset's Github repository, aiding in seamless interaction with the datasets. These include: - Downloading datasets without GUI: Facilitating dataset download in a non-GUI environment. - Displaying Sample Records: Showcasing sample records to get a glimpse of the dataset structure. - Calculating Basic Statistics: Computing basic statistics to understand the dataset's distribution and characteristics. - Exploring the Interaction Data: Delving into interaction data to grasp user-book interaction patterns. - Exploring the Review Data: Analyzing review data to extract valuable insights from user reviews.
Additional Dataset: - Complete book reviews (~15m multilingual reviews about ~2m books and 465k users): This dataset comprises a comprehensive collection of reviews, showcasing a multilingual facet with reviews about around 2 million books from 465,000 users.
Datasets:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Book is a dataset for object detection tasks - it contains Name annotations for 4,300 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database contains information about books gathered with help of Google Books API. The database contains 7 different tables where 3 of them are only to relate the other tables together. Tables: Books contains 1062 records. Authors contains 1595 records. Categories 109 records. Metadata 37 records. MD5 (GBooks_2015-06-09.sql) = bfd09094d0e123e668b2e58332b1a98b
BookSum is a collection of datasets for long-form narrative summarization. This dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of this dataset poses a unique set of challenges for summarization systems, which include: processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures.
BookSum contains summaries for 142,753 paragraphs, 12,293 chapters and 436 books.
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Dataset Card for OPUS Books
Dataset Summary
This is a collection of copyright free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the meta-data at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Don't ever tell anybody anything, if you do, you start missing everybody J.D. Salinger
Every one of us knows the Goodreads, and every book lovers when want to buy a book, firstly search the title of the book on this website and read all of that reviews and ratings are available there for that book. do you know the better place for scraping data from there? tell us ba.jannesar@gmail.com or ghaderi.soroush1995@gmail.com Goodreads one the best place for this job! 💯
These datasets are very good for two jobs :
1 . Creating book recommendation system based on 10 M books 🥇 2 . Using the Description columns for NLP 🥈
Project link on github or here.
Approximately 10,000,000 books are available on the site's archives, and these datasets are collecting from them. for requesting on the API, we used Goodreads python library, ****Datasets will be updated every 2 days.****
This data was entirely scrapped from the Goodreads API.
Do you know what is NLP? , download these datasets then upvote 💯.
JSON :
{
"Id": "5107",
"Name": "The Catcher in the Rye",
"RatingDist1": "1:133165",
"RatingDist2": "2:224884",
"RatingDist3": "3:553476",
"RatingDist4": "4:808278",
"RatingDist5": "5:891037",
"pagesNumber": 277,
"RatingDistTotal": "total:2610840",
"PublishMonth": 30,
"PublishDay": 1,
"Publisher": "Back Bay Books",
"CountsOfReview": 44046,
"PublishYear": 2001,
"Language": "eng",
"Authors": "J.D. Salinger",
"Rating": 3.8,
"ISBN": "0316769177",
"Count of text reviews": 55539,
"Description": "The hero-narrator of The Catcher in the Rye is an ancient child of sixteen, a native New Yorker named Holden Caulfield. Through
circumstances that tend to preclude adult, secondhand description, he leaves his prep school in Pennsylvania and goes underground in New York City for
three days. "
}
Or CSV :
5107,The Catcher in the Rye,1:133165,277,4:808278,total:2610840,30,1,Back Bay Books,44046,2001,eng,J.D. Salinger,3.8,2:224884,5:891037,0316769177,3:553476,55539,"The hero-narrator of The Catcher in the Rye is an ancient child of sixteen, a native New Yorker named Holden Caulfield. Through circumstances that tend to preclude adult, secondhand description, he leaves his prep school in Pennsylvania and goes underground in New York City for three days. "
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
With the trend toward audiobooks growing, I gathered this data to understand how the audiobook market has been growing over the years. From authors of audiobooks to release dates, the data represents the important details of audiobooks from 1998 till 2025 (pre-planned releases).
I have yet to find a great audiobooks dataset and hence the urge to make a dataset that provides us with information on the basics and the history of audiobooks. I look to improve the dataset with more details in the near future.
The Uncleaned data or audible_uncleaned.csv is exactly the raw data I derived from Audible.in The Cleaned one or audible_cleaned.csv consists of a few basic data cleaning steps.
The data was collected using webs-scraping.
- re
- Beautiful Soup
- Selenium
Beautiful Soup
and Selenium
were used in unison to mainly gather the data. The code can be re-used and you can find the code here: https://github.com/snehangsude/audible_scraper
name
: Name of the audiobook author
: Author of the audiobook narrator
: Narrator of the audiobooktime
: Length of the audiobookreleasedate
: Release date of the audiobooklanguage
: Language of the audiobookstars
: No. of stars the audiobook received price
: Price of the audiobook in INRratings
: No. of reviews received by the audiobookAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Chess Book is a dataset for object detection tasks - it contains Pgn Board annotations for 478 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
BookTest is a new dataset similar to the popular Children’s Book Test (CBT), however more than 60 times larger.
https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
Dataset Card for OpenBookQA
Dataset Summary
OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. OpenBookQA is a new kind of… See the full description on the dataset page: https://huggingface.co/datasets/allenai/openbookqa.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
## Overview
Oriented Books is a dataset for object detection tasks - it contains Book annotations for 661 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [MIT license](https://creativecommons.org/licenses/MIT).
This dataset is a collection of over 15,000 book texts, complete with their authors and titles. It has been compiled by scraping the Project Gutenberg website, specifically parsing its bookshelves. The dataset includes metadata such as titles, authors, categories (bookshelves), and download links for the book texts. Some books from Project Gutenberg are not included if they haven't been categorised. Notably, the dataset also retains audiobooks, offering flexibility for users interested in audio data alongside text.
The dataset primarily includes the following columns:
The dataset's metadata is initially available in a gutenberg_metadata.csv
file. The full text data for each book can be downloaded using a gutenberg_download.py
script, which then saves the results into a CSV file. This final CSV file, containing the book texts, authors, titles, and categories, is approximately 5 GB in size. The corpus features more than 15,000 unique book texts.
This dataset is ideal for various applications in education and learning analytics. Specific use cases include:
The dataset has a global region coverage, reflecting the diverse origins of books within Project Gutenberg. It focuses on books that have been categorised on the Project Gutenberg website; un-categorised books are not included. No specific time range or demographic scope is detailed in the available information.
CC-BY-SA
This dataset is suitable for:
Original Data Source: 15000 Gutenberg Books
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Book Spines Test is a dataset for instance segmentation tasks - it contains Book Spines annotations for 835 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Book Reading is a dataset for object detection tasks - it contains Open annotations for 357 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This version of the dataset is obsolete. It contains duplicate ratings (same user_id,book_id), as reported by Philipp Spachtholz in his illustrious notebook.
The current version has duplicates removed, and more ratings (six million), sorted by time. Book and user IDs are the same.
**It is available at https://github.com/zygmuntz/goodbooks-10k. **
There have been good datasets for movies (Netflix, Movielens) and music (Million Songs) recommendation, but not for books. That is, until now.
This dataset contains ratings for ten thousand popular books. As to the source, let's say that these ratings were found on the internet. Generally, there are 100 reviews for each book, although some have less - fewer - ratings. Ratings go from one to five.
Both book IDs and user IDs are contiguous. For books, they are 1-10000, for users, 1-53424. All users have made at least two ratings. Median number of ratings per user is 8.
There are also books marked to read by the users, book metadata (author, year, etc.) and tags.
ratings.csv contains ratings and looks like that:
book_id,user_id,rating
1,314,5
1,439,3
1,588,5
1,1169,4
1,1185,4
to_read.csv provides IDs of the books marked "to read" by each user, as user_id,book_id pairs.
books.csv has metadata for each book (goodreads IDs, authors, title, average rating, etc.).
The metadata have been extracted from goodreads XML files, available in the third version of this dataset as books_xml.tar.gz. The archive contains 10000 XML files. One of them is available as sample_book.xml. To make the download smaller, these files are absent from the current version. Download version 3 if you want them.
book_tags.csv contains tags/shelves/genres assigned by users to books. Tags in this file are represented by their IDs.
tags.csv translates tag IDs to names.
See the notebook for some basic stats of the dataset.
Each book may have many editions. goodreads_book_id and best_book_id generally point to the most popular edition of a given book, while goodreads work_id refers to the book in the abstract sense.
You can use the goodreads book and work IDs to create URLs as follows:
https://www.goodreads.com/book/show/2767052
https://www.goodreads.com/work/editions/2792775
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
title
This Dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:
More reviews:
New reviews:
Metadata: - We have added transaction metadata for each review shown on the review page.
If you publish articles based on this dataset, please cite the following paper:
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The dataset has been collected in the frame of the Prac1 of the subject Tipology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).
The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the larges list on the site).
Original code used to retrieve the dataset can be found on github repository: github.com/scostap/goodreads_bbe_dataset
The data was retrieved in two sets, the first 30000 books and then the remainig 22478. Dates were not parsed and reformated on the second chunk so publishDate and firstPublishDate are representet in a mm/dd/yyyy format for the first 30000 records and Month Day Year for the rest.
Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.
The 25 fields of the dataset are:
| Attributes | Definition | Completeness |
| ------------- | ------------- | ------------- |
| bookId | Book Identifier as in goodreads.com | 100 |
| title | Book title | 100 |
| series | Series Name | 45 |
| author | Book's Author | 100 |
| rating | Global goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (ex. Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Editorial | 93 |
| publishDate | publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 starts (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |