Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The dataset was collected as part of Prac1 of the subject Typology and Data Life Cycle of the Master's Degree in Data Science at the Universitat Oberta de Catalunya (UOC).
The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).
The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset
The data was retrieved in two sets: the first 30000 books and then the remaining 22478. Dates were not parsed and reformatted in the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30000 records and in Month Day Year format for the rest.
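Because the two chunks use different date layouts, a consumer will likely want to normalise both columns before analysis. The following is a minimal sketch, not the authors' code: the function name `parse_publish_date` is hypothetical, the two format strings are assumed from the description above, and the handling of ordinal suffixes and commas (for example "14th" or "14,") is a guess about what the raw values may look like.

```python
import re
from datetime import date, datetime
from typing import Optional

# Formats assumed from the dataset description; the real file may contain others.
_FORMATS = ("%m/%d/%Y", "%B %d %Y")
# Strips ordinal suffixes such as "1st" or "14th" (an assumption about the raw values).
_ORDINAL = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def parse_publish_date(raw: Optional[str]) -> Optional[date]:
    """Return a date for either layout, or None when the value is missing or unparseable."""
    if raw is None or not raw.strip():
        return None
    cleaned = _ORDINAL.sub("", raw.strip()).replace(",", "")
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None
```

Returning `None` instead of raising keeps incomplete columns such as firstPublishDate easy to process in bulk; unparseable values can then be counted and inspected separately.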
Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.
The 25 fields of the dataset are:
| Attribute | Definition | Completeness (%) |
| ------------- | ------------- | ------------- |
| bookId | Book Identifier as in goodreads.com | 100 |
| title | Book title | 100 |
| series | Series Name | 45 |
| author | Book's Author | 100 |
| rating | Global goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (e.g. Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Publisher | 93 |
| publishDate | Publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |
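
The likedPercent definition in the table (share of ratings above 2 stars) can be reproduced from ratingsByStars. This is a hedged sketch only: the function name `liked_percent` is hypothetical, and the ordering of the counts (5-star first, 1-star last) is an assumption that should be checked against the actual file.

```python
from typing import Optional, Sequence


def liked_percent(ratings_by_stars: Sequence[int]) -> Optional[float]:
    """Percent of ratings with 3, 4 or 5 stars.

    Assumes the counts are ordered [5-star, 4-star, 3-star, 2-star, 1-star].
    """
    if len(ratings_by_stars) != 5:
        raise ValueError("expected exactly five star counts")
    total = sum(ratings_by_stars)
    if total == 0:
        return None  # no ratings, so the percentage is undefined
    liked = sum(ratings_by_stars[:3])  # 5-, 4- and 3-star counts
    return 100.0 * liked / total
```

For example, counts of 50, 30, 10, 6 and 4 give 90 liked ratings out of 100.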
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Goodreads Book Reviews dataset encapsulates a wealth of reviews and various attributes concerning the books listed on the Goodreads platform. A distinguishing feature of this dataset is its capture of multiple tiers of user interaction, ranging from adding a book to a "shelf", to rating and reading it. This dataset is a treasure trove for those interested in understanding user behavior, book recommendations, sentiment analysis, and the interplay between various attributes of books and user interactions.
Basic Statistics:
- Items: 1,561,465
- Users: 808,749
- Interactions: 225,394,930
Metadata:
- Reviews: The text of the reviews provided by users.
- Add-to-shelf, Read, Review Actions: Various interactions users have with the books.
- Book Attributes: Attributes describing the books, including title and ISBN.
- Graph of Similar Books: A graph depicting similarity relations between books.
Example (interaction data):
json
{
"user_id": "8842281e1d1347389f2ab93d60773d4d",
"book_id": "130580",
"review_id": "330f9c153c8d3347eb914c06b89c94da",
"isRead": true,
"rating": 4,
"date_added": "Mon Aug 01 13:41:57 -0700 2011",
"date_updated": "Mon Aug 01 13:42:41 -0700 2011",
"read_at": "Fri Jan 01 00:00:00 -0800 1988",
"started_at": ""
}
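The timestamps in the record above are plain strings, and empty strings mark missing values. A minimal sketch of turning them into timezone-aware datetimes with the standard library follows; the helper name `parse_ts` and the constant `GOODREADS_TS` are hypothetical, and the format string is inferred from this one example rather than from any official schema.

```python
import json
from datetime import datetime
from typing import Optional

# Format inferred from the sample record, e.g. "Mon Aug 01 13:41:57 -0700 2011".
GOODREADS_TS = "%a %b %d %H:%M:%S %z %Y"


def parse_ts(raw: str) -> Optional[datetime]:
    """Parse a timestamp string; empty strings (missing values) become None."""
    return datetime.strptime(raw, GOODREADS_TS) if raw else None


record = json.loads("""
{
  "user_id": "8842281e1d1347389f2ab93d60773d4d",
  "book_id": "130580",
  "date_added": "Mon Aug 01 13:41:57 -0700 2011",
  "date_updated": "Mon Aug 01 13:42:41 -0700 2011",
  "started_at": ""
}
""")

added = parse_ts(record["date_added"])
updated = parse_ts(record["date_updated"])
started = parse_ts(record["started_at"])  # None, because the field is empty
seconds_between = (updated - added).total_seconds()
```

Keeping the UTC offset (`%z`) matters when comparing events from users in different time zones.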
Use Cases:
- Book Recommendations: Creating personalized book recommendations based on user interactions and preferences.
- Sentiment Analysis: Analyzing sentiment in reviews and understanding how different book attributes influence sentiment.
- User Behavior Analysis: Understanding user interaction patterns with books and deriving insights to enhance user engagement.
- Natural Language Processing: Training models to process and analyze user-generated text in reviews.
- Similarity Analysis: Analyzing the graph of similar books to understand book similarities and clustering.
Citation:
Please cite the following if you use the data:
Item recommendation on monotonic behavior chains
Mengting Wan, Julian McAuley
RecSys, 2018
[PDF](https://cseweb.ucsd.edu/~jmcauley/pdfs/recsys18e.pdf)
Code Samples: A curated set of code samples is provided in the dataset's GitHub repository. These include:
- Downloading datasets without GUI: Facilitating dataset download in a non-GUI environment.
- Displaying Sample Records: Showcasing sample records to get a glimpse of the dataset structure.
- Calculating Basic Statistics: Computing basic statistics to understand the dataset's distribution and characteristics.
- Exploring the Interaction Data: Delving into interaction data to grasp user-book interaction patterns.
- Exploring the Review Data: Analyzing review data to extract insights from user reviews.
Additional Dataset:
- Complete book reviews (~15m multilingual reviews about ~2m books and 465k users): A multilingual collection of reviews covering around 2 million books from 465,000 users.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The books dataset includes information about all published books from Springer Nature. See also: https://scigraph.springernature.com/explorer/datasets/data_at_a_glance/
A book record usually includes information about the chapters it contains, external identifiers, authors, editors and affiliation information, links to related grants, subjects, and an abstract when available.
Version info:
* http://scigraph.downloads.uberresearch.com/archives/current/TIMESTAMP.txt
* http://scigraph.downloads.uberresearch.com/archives/current/LICENSE.txt
https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for Breakdown of Revenue by Media Type: Books - Print Books for Book Publishers, All Establishments, Employer Firms (RPCMPBEF51113ALLEST) from 2013 to 2022 about book, printing, employer firms, accounting, revenue, establishments, services, and USA.
The latest titles in large-print format at LAPL, updated weekly.
The statistic presents data on the average number of pirated e-books downloaded per user in the past 12 months in the United States in 2017. Illegal downloaders obtained an average of 3.14 illegal e-books from a friend in the past 12 months.
This dataset is a collection of over 15,000 book texts, complete with their authors and titles. It has been compiled by scraping the Project Gutenberg website, specifically parsing its bookshelves. The dataset includes metadata such as titles, authors, categories (bookshelves), and download links for the book texts. Some books from Project Gutenberg are not included if they haven't been categorised. Notably, the dataset also retains audiobooks, offering flexibility for users interested in audio data alongside text.
The dataset primarily includes the following columns:
The dataset's metadata is initially available in a gutenberg_metadata.csv file. The full text data for each book can be downloaded using a gutenberg_download.py script, which then saves the results into a CSV file. This final CSV file, containing the book texts, authors, titles, and categories, is approximately 5 GB in size. The corpus features more than 15,000 unique book texts.
This dataset is ideal for various applications in education and learning analytics. Specific use cases include:
The dataset has a global region coverage, reflecting the diverse origins of books within Project Gutenberg. It focuses on books that have been categorised on the Project Gutenberg website; un-categorised books are not included. No specific time range or demographic scope is detailed in the available information.
CC-BY-SA
This dataset is suitable for:
Original Data Source: 15000 Gutenberg Books
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Book Reading is a dataset for object detection tasks - it contains Open annotations for 357 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for Retail Sales: Book Stores (MRTSSM451211USN) from Jan 1992 to Apr 2025 about book, retail trade, sales, retail, and USA.
This dataset provides an in-depth look into Amazon's top 100 bestselling books along with their customer reviews, ratings, and pricing information. It offers a window into the world of popular reading and customer sentiment. The dataset was collected in November 2023, making it suitable for analysing recent literary trends and consumer behaviour.
The dataset includes the following fields:
* Book Rank: The ranking of the book among the top 100 bestselling books on Amazon.
* Book Title: The title of the book. Examples include "The Ballad of Songbirds and Snakes" and "Iron Flame".
* Price: The price of the book in USD.
* Rating: The overall rating of the book, on a scale of 1 to 5.
* Author: The author of the book. Notable authors include Sarah J. Maas and Adam Wallace.
* Year of Publication: The year in which the book was published.
* Genre: The category to which the book belongs. Popular genres include Nonfiction and Childrens, literature.
* URL: The direct URL link to the book on Amazon's platform.
* Review Title: The title of the customer review.
* Reviewer: The name of the person who wrote the review.
* Reviewer Rating: The rating given by the reviewer for the book, on a scale of 1 to 5.
* Review Description: The textual content of the review.
* Is_verified: Indicates whether the review is a verified customer purchase.
* Date: The date when the review was posted.
* Timestamp: The timestamp indicating when the review was posted.
* ASIN: Amazon Standard Identification Number assigned to products on Amazon.
The dataset focuses on the top 100 bestselling books.
* Price: Book prices range from 1.00 USD to 100.00 USD. There are approximately 10 books within each 9.90 USD price band across this range.
* Rating: Overall book ratings are generally high, ranging from 4.10 to 5.00. A notable number of books have ratings between 4.73 and 4.82.
* Year of Publication: Books in the dataset were published between 1947 and 2024. A significant portion, 64 books, were published between 2016 and 2024, indicating a strong presence of recent titles.
* Genre: While diverse, Nonfiction and Childrens, literature are among the more prominent genres.
* Authors/Titles: "The Ballad of Songbirds and Snakes" and "Iron Flame" are among the top-ranked titles. Sarah J. Maas and Adam Wallace are featured authors.
The dataset covers review data for each of the top 100 books, though the exact number of reviews per book is not specified.
This dataset is ideal for:
* Market analysis: Identifying bestselling trends, pricing strategies, and popular authors.
* Sentiment analysis: Analysing customer reviews to understand public perception and extract insights.
* Recommender systems: Building or improving book recommendation engines.
* Natural Language Processing (NLP): Training models for text classification, entity recognition, or summarisation based on review content.
* Data visualisation: Creating visualisations of literary trends, rating distributions, or reviewer behaviour.
CC-BY
Original Data Source: Top 100 Bestselling Book Reviews on Amazon
Data on e-book downloading among internet users in the United Kingdom found that in 2022, a total of 18 percent of respondents had downloaded an e-book in the three months running to the survey, the same as in the previous year. Despite this, the most popular way of accessing e-books remains purchasing rather than downloading or sharing.
https://www.gnu.org/licenses/gpl-3.0.html
Weblog dataset from a scholarly shadow library. Comma-separated file, zipped.
Fields:
* date - Timestamp when the book was downloaded
* lat - Latitude redacted to 3 decimals
* long - Longitude redacted to 4 decimals
* city - City of download
* country - Country of download
* isbn - ISBN number of the book downloaded
* title - Title of the book downloaded
BookCorpus is a large collection of free novel books written by unpublished authors, which contains 11,038 books (around 74M sentences and 1G words) of 16 different sub-genres (e.g., Romance, Historical, Adventure, etc.).
Comprehensive dataset of 5,599 Book publishers in United States as of July, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
Get Nasdaq real-time and historical data with support for fast market replay at over 19 million book updates per second. Test our data for free with only 4 lines of code.
Nasdaq TotalView-ITCH is a proprietary data feed that disseminates full order book depth and last sale data from the Nasdaq stock market (XNAS). It delivers every quote and order at each price level, along with any event that updates the order book after an order is placed, such as trade executions, modifications, or cancellations. Nasdaq is the most active US equity exchange by volume and represented 13.03% of the average daily volume (ADV) as of January 2025.
With its L3 granularity, Nasdaq TotalView-ITCH captures information beyond the L1, top-of-book data available through SIP feeds and enables more accurate modeling of book imbalances, trade directionality, quote lifetimes, and more. This includes explicit trade aggressor side, odd lots, auction imbalance data, and the Net Order Imbalance Indicator (NOII) for the Nasdaq Opening and Closing Crosses and Nasdaq IPO/Halt Cross—the best predictor of Nasdaq opening and closing prices available. Other key advantages of Nasdaq TotalView-ITCH over SIP data include faster real-time dissemination and precise exchange-side timestamping directly from Nasdaq.
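To illustrate the difference between L3 (order-by-order) data and the L1 top of book, here is a toy sketch that keeps every resting order individually and derives the best bid, best ask and a simple size imbalance from them. It is not the ITCH or DBN message format: the class `OrderBook`, its methods, the "B"/"A" side codes and the imbalance formula (bid size minus ask size, divided by their sum) are all illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Order:
    side: str   # "B" for bid, "A" for ask (hypothetical encoding)
    price: int  # price in integer ticks
    size: int


class OrderBook:
    """Toy L3 book: every resting order is tracked by its id."""

    def __init__(self) -> None:
        self.orders: Dict[int, Order] = {}

    def add(self, order_id: int, side: str, price: int, size: int) -> None:
        self.orders[order_id] = Order(side, price, size)

    def cancel(self, order_id: int, size: Optional[int] = None) -> None:
        """Remove an order fully, or reduce it when a smaller size is given."""
        order = self.orders.get(order_id)
        if order is None:
            return
        if size is None or size >= order.size:
            del self.orders[order_id]
        else:
            order.size -= size

    def top_of_book(self):
        """Aggregate individual orders into the L1 view: (price, size) per side."""
        bids, asks = defaultdict(int), defaultdict(int)
        for o in self.orders.values():
            (bids if o.side == "B" else asks)[o.price] += o.size
        best_bid = max(bids) if bids else None
        best_ask = min(asks) if asks else None
        return (best_bid, bids.get(best_bid, 0)), (best_ask, asks.get(best_ask, 0))

    def imbalance(self) -> Optional[float]:
        """(bid size - ask size) / (bid size + ask size) at the top of book."""
        (_, bid_size), (_, ask_size) = self.top_of_book()
        total = bid_size + ask_size
        return None if total == 0 else (bid_size - ask_size) / total
```

The L1 view can always be rebuilt from the L3 orders, but not the other way round, which is why per-order data supports measures such as quote lifetimes that top-of-book feeds cannot.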
Real-time Nasdaq TotalView-ITCH data is included with a Plus or Unlimited subscription through our Databento US Equities service. Historical data is available for usage-based rates or with any subscription. Visit our pricing page for more details or to upgrade your plan.
Breadth of coverage: 20,329 products
Asset class(es): Equities
Origin: Directly captured at Equinix NY4 (Secaucus, NJ) with an FPGA-based network card and hardware timestamping. Synchronized to UTC with PTP.
Supported data encodings: DBN, CSV, JSON
Supported market data schemas: MBO, MBP-1, MBP-10, BBO-1s, BBO-1m, TBBO, Trades, OHLCV-1s, OHLCV-1m, OHLCV-1h, OHLCV-1d, Definition, Statistics, Status, Imbalance
Resolution: Immediate publication, nanosecond-resolution timestamps
https://choosealicense.com/licenses/other/
Dataset Card for OPUS Books
Dataset Summary
This is a collection of copyright free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the meta-data at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.
https://creativecommons.org/publicdomain/zero/1.0/
With the trend toward audiobooks growing, I gathered this data to understand how the audiobook market has been growing over the years. From authors of audiobooks to release dates, the data represents the important details of audiobooks from 1998 till 2025 (pre-planned releases).
I have yet to find a great audiobooks dataset and hence the urge to make a dataset that provides us with information on the basics and the history of audiobooks. I look to improve the dataset with more details in the near future.
The uncleaned file, audible_uncleaned.csv, is exactly the raw data I derived from Audible.in. The cleaned file, audible_cleaned.csv, has had a few basic data cleaning steps applied.
The data was collected using web scraping.
- re
- Beautiful Soup
- Selenium
Beautiful Soup and Selenium were used together to gather most of the data. The code can be reused and is available here: https://github.com/snehangsude/audible_scraper
- name: Name of the audiobook
- author: Author of the audiobook
- narrator: Narrator of the audiobook
- time: Length of the audiobook
- releasedate: Release date of the audiobook
- language: Language of the audiobook
- stars: No. of stars the audiobook received
- price: Price of the audiobook in INR
- ratings: No. of reviews received by the audiobook
Comprehensive dataset of 1 Books in Thailand as of July, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Measuring scholarly impact and societal relevance in the humanities and social sciences can be done in several ways. Here we will look at a collection of e-books from the FWF-E-Book-Library, which is made available through the OAPEN Library. In 2014, 146 books of the FWF-E-Book-Library collection were made available via the OAPEN Library. The analysis is based on COUNTER-compliant download data. This means that downloads by automated systems ('bots') and other types of suspicious download behaviour are discarded from the reports. The data of the 28,139 downloads used for this analysis originated from 23,652 IP addresses. It is clear that many providers use several IP addresses: the IP addresses were linked to 2,839 provider names. Where no information about a provider could be found, the download data was omitted.
https://fred.stlouisfed.org/legal/#copyright-citation-required
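The figures quoted above imply some simple averages, worked out below as a quick check. Only the three counts come from the text; the variable names are illustrative, and the ratios are plain averages that say nothing about how downloads are actually distributed.

```python
# Counts quoted in the description above.
downloads = 28_139
ip_addresses = 23_652
providers = 2_839

# Average number of downloads per IP address.
downloads_per_ip = downloads / ip_addresses
# Average number of IP addresses linked to each provider name.
ips_per_provider = ip_addresses / providers

print(f"{downloads_per_ip:.2f} downloads per IP address")
print(f"{ips_per_provider:.2f} IP addresses per provider")
```

An average of roughly eight IP addresses per provider name is consistent with the remark that many providers use several addresses.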
Graph and download economic data for Book Publication, Editions for United States (M0106AUSM234NNBR) from Jan 1913 to Dec 1928 about book and USA.