100+ datasets found

Best Books Ever Dataset
zenodo.org
csv
Updated Nov 10, 2020
Cite
Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
Available download formats
csv
Unique identifier
https://doi.org/10.5281/zenodo.4265096
Dataset updated
Nov 10, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Lorena Casanova Lozano; Sergio Costa Planells
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
The dataset was collected as part of Prac1 of the subject Typology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).

The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset

The data was retrieved in two sets, the first 30000 books and then the remaining 22478. Dates were not parsed and reformatted in the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30000 records and in Month Day Year format for the rest.
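
Because the two chunks use different date layouts, a reader will probably want to normalize them before any analysis. The following is a minimal Python sketch, not the authors' code. It assumes the numeric layout matches `%m/%d/%Y` and that the textual layout looks like `September 14th 2008` (full English month name, optional ordinal suffix); the exact textual layout is not specified in the description, and the function name `parse_publish_date` is hypothetical.

```python
import re
from datetime import date, datetime

# Assumed layouts: the description only says "mm/dd/yyyy" and "Month Day Year".
NUMERIC_FORMAT = "%m/%d/%Y"
TEXT_FORMAT = "%B %d %Y"  # relies on English month names (default C locale)

# Matches an ordinal suffix that directly follows a digit, e.g. the "th" in "14th".
_ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def parse_publish_date(raw):
    """Return a date for either assumed layout, or None if parsing fails."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    # Dropping ordinal suffixes is an assumption about the raw text.
    cleaned = _ORDINAL_SUFFIX.sub("", text)
    for fmt in (NUMERIC_FORMAT, TEXT_FORMAT):
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


first_chunk_value = parse_publish_date("09/14/2008")
second_chunk_value = parse_publish_date("September 14th 2008")
```

Values that match neither layout come back as `None` rather than raising, so unparsed rows can be counted and inspected separately.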

Book cover images can optionally be downloaded from the URL in the 'coverImg' field. Python code for doing so, with an example, can be found in the GitHub repo.

The 25 fields of the dataset are:

| Attributes | Definition | Completeness (%) |
| ------------- | ------------- | ------------- |
| bookId | Book Identifier as in goodreads.com | 100 |
| title | Book title | 100 |
| series | Series Name | 45 |
| author | Book's Author | 100 |
| rating | Global goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (ex. Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Publisher | 93 |
| publishDate | Publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |
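
The completeness figures above can be rechecked against a downloaded copy of the CSV. The sketch below shows one plausible way to do that in Python; it treats a field as present when it is a non-empty string, which is an assumption, since the description does not say how the authors computed completeness (for example, whether an empty list counts as missing). The `completeness` helper and the sample rows are hypothetical.

```python
import csv
import io


def completeness(rows, fields):
    """Percentage of rows holding a non-empty value, per field."""
    counts = {name: 0 for name in fields}
    total = 0
    for row in rows:
        total += 1
        for name in fields:
            value = row.get(name)
            if value is not None and value.strip():
                counts[name] += 1
    if total == 0:
        return {name: 0.0 for name in fields}
    return {name: 100.0 * counts[name] / total for name in fields}


# Tiny made-up sample standing in for the real file.
sample = io.StringIO(
    "bookId,title,series\n"
    "1,First Book,Some Series\n"
    "2,Second Book,\n"
)
reader = csv.DictReader(sample)
result = completeness(reader, reader.fieldnames)
```

With the real data, `sample` would be replaced by a file opened with `newline=""` and an explicit encoding.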
Goodreads Book Reviews
kaggle.com
Updated Oct 30, 2023
Cite
Ahmad (2023). Goodreads Book Reviews [Dataset]. https://www.kaggle.com/datasets/pypiahmad/goodreads-book-reviews1/code
Dataset updated
Oct 30, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ahmad
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
The Goodreads Book Reviews dataset encapsulates a wealth of reviews and various attributes concerning the books listed on the Goodreads platform. A distinguishing feature of this dataset is its capture of multiple tiers of user interaction, ranging from adding a book to a "shelf", to rating and reading it. This dataset is a treasure trove for those interested in understanding user behavior, book recommendations, sentiment analysis, and the interplay between various attributes of books and user interactions.

Basic Statistics:
- Items: 1,561,465
- Users: 808,749
- Interactions: 225,394,930

Metadata:
- Reviews: The text of the reviews provided by users.
- Add-to-shelf, Read, Review Actions: Various interactions users have with the books.
- Book Attributes: Attributes describing the books including title, and ISBN.
- Graph of Similar Books: A graph depicting similarity relations between books.

Example (interaction data):

    {
      "user_id": "8842281e1d1347389f2ab93d60773d4d",
      "book_id": "130580",
      "review_id": "330f9c153c8d3347eb914c06b89c94da",
      "isRead": true,
      "rating": 4,
      "date_added": "Mon Aug 01 13:41:57 -0700 2011",
      "date_updated": "Mon Aug 01 13:42:41 -0700 2011",
      "read_at": "Fri Jan 01 00:00:00 -0800 1988",
      "started_at": ""
    }
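
The timestamps in the example record are strings with a weekday, month name, time, UTC offset and year, and some fields can be empty. As a rough illustration, the Python sketch below parses one such record with the standard library. The `strptime` format string is inferred from the single example above, and the helper name `parse_interaction` is hypothetical; other records may need additional handling.

```python
import json
from datetime import datetime

# Format inferred from the example record, e.g. "Mon Aug 01 13:41:57 -0700 2011".
GOODREADS_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"
DATE_FIELDS = ("date_added", "date_updated", "read_at", "started_at")


def parse_interaction(line):
    """Decode one JSON record and convert its date fields to datetimes."""
    record = json.loads(line)
    for field in DATE_FIELDS:
        raw = record.get(field) or ""
        # Empty strings (as in "started_at" above) become None.
        record[field] = (
            datetime.strptime(raw, GOODREADS_TIME_FORMAT) if raw else None
        )
    return record


example = (
    '{"user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "130580", '
    '"review_id": "330f9c153c8d3347eb914c06b89c94da", "isRead": true, '
    '"rating": 4, "date_added": "Mon Aug 01 13:41:57 -0700 2011", '
    '"date_updated": "Mon Aug 01 13:42:41 -0700 2011", '
    '"read_at": "Fri Jan 01 00:00:00 -0800 1988", "started_at": ""}'
)
parsed = parse_interaction(example)
```

The resulting datetimes are timezone-aware, so differences such as `date_updated - date_added` stay correct across UTC offsets.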

Use Cases:
- Book Recommendations: Creating personalized book recommendations based on user interactions and preferences.
- Sentiment Analysis: Analyzing sentiment in reviews and understanding how different book attributes influence sentiment.
- User Behavior Analysis: Understanding user interaction patterns with books and deriving insights to enhance user engagement.
- Natural Language Processing: Training models to process and analyze user-generated text in reviews.
- Similarity Analysis: Analyzing the graph of similar books to understand book similarities and clustering.

Citation: Please cite the following if you use the data: Mengting Wan, Julian McAuley. Item recommendation on monotonic behavior chains. RecSys, 2018. [PDF](https://cseweb.ucsd.edu/~jmcauley/pdfs/recsys18e.pdf)

Code Samples: A curated set of code samples is provided in the dataset's GitHub repository to help with working with the datasets. These include:
- Downloading datasets without GUI: Facilitating dataset download in a non-GUI environment.
- Displaying Sample Records: Showcasing sample records to get a glimpse of the dataset structure.
- Calculating Basic Statistics: Computing basic statistics to understand the dataset's distribution and characteristics.
- Exploring the Interaction Data: Delving into interaction data to grasp user-book interaction patterns.
- Exploring the Review Data: Analyzing review data to extract valuable insights from user reviews.

Additional Dataset:
- Complete book reviews (~15m multilingual reviews about ~2m books and 465k users): This dataset comprises a comprehensive collection of reviews, showcasing a multilingual facet with reviews about around 2 million books from 465,000 users.

Datasets:

Meta-Data of Books:

Detailed Book Graph (goodreads_books.json.gz): A comprehensive graph detailing around 2.3 million books, acting as a rich source of book attributes and metadata.

Detailed Information of Authors (goodreads_book_authors.json.gz): An extensive dataset containing detailed information about book authors, essential for understanding author-centric trends and insights.

Detailed Information of Works (goodreads_book_works.json.gz): This dataset provides abstract information about a book disregarding any particular editions, facilitating a high-level understanding of each work.

Detailed Information of Book Series (goodreads_book_series.json.gz): A dataset encompassing detailed information about book series, aiding in understanding series-related trends and insights. Note that the series id included here cannot be used for URL hack.

Extracted Fuzzy Book Genres (goodreads_book_genres_initial.json....
Dataset: Books
figshare.com
application/gzip
Updated Jan 31, 2023
Cite
SN SciGraph Team (2023). Dataset: Books [Dataset]. http://doi.org/10.6084/m9.figshare.7739084.v2
Available download formats
application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.7739084.v2
Dataset updated
Jan 31, 2023
Dataset provided by
SN SciGraph
Authors
SN SciGraph Team
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
The books dataset includes information about all published books from Springer Nature.

See also: https://scigraph.springernature.com/explorer/datasets/data_at_a_glance/

A book record usually includes information about the chapters it contains, external identifiers, authors, editors and affiliations information, links to related grants, subjects and abstract when available.

Version info:
* http://scigraph.downloads.uberresearch.com/archives/current/TIMESTAMP.txt
* http://scigraph.downloads.uberresearch.com/archives/current/LICENSE.txt
Breakdown of Revenue by Media Type: Books - Print Books for Book Publishers,...
fred.stlouisfed.org
json
Updated Jan 31, 2024
Cite
(2024). Breakdown of Revenue by Media Type: Books - Print Books for Book Publishers, All Establishments, Employer Firms [Dataset]. https://fred.stlouisfed.org/series/RPCMPBEF51113ALLEST
Available download formats
json
Dataset updated
Jan 31, 2024
License
https://fred.stlouisfed.org/legal/#copyright-public-domain
Description
Graph and download economic data for Breakdown of Revenue by Media Type: Books - Print Books for Book Publishers, All Establishments, Employer Firms (RPCMPBEF51113ALLEST) from 2013 to 2022 about book, printing, employer firms, accounting, revenue, establishments, services, and USA.
Library New Titles - Large Print Books
catalog.data.gov
data.lacity.org
Updated Dec 2, 2020
Cite
data.lacity.org (2020). Library New Titles - Large Print Books [Dataset]. https://catalog.data.gov/dataset/library-new-titles-large-print-books
Dataset updated
Dec 2, 2020
Dataset provided by
data.lacity.org
Description
The latest titles in large-print format at LAPL, updated weekly.
Number of book piracy downloads in the U.S. 2017, by method
statista.com
Updated Mar 22, 2021
Cite
Statista (2021). Number of book piracy downloads in the U.S. 2017, by method [Dataset]. https://www.statista.com/statistics/688228/book-piracy-download-number-method/
Dataset updated
Mar 22, 2021
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2017
Area covered
United States
Description
The statistic presents data on the average number of pirated e-books downloaded per user in the past 12 months in the United States in 2017. Illegal downloaders obtained an average of 3.14 illegal e-books from a friend in the past 12 months.
Project Gutenberg Book Corpus
opendatabay.com
Updated Jul 3, 2025
Cite
Datasimple (2025). Project Gutenberg Book Corpus [Dataset]. https://www.opendatabay.com/data/ai-ml/0979850d-7ed8-4aeb-887d-4ad585d2f661
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Datasimple
Area covered
Education & Learning Analytics
Description
This dataset is a collection of over 15,000 book texts, complete with their authors and titles. It has been compiled by scraping the Project Gutenberg website, specifically parsing its bookshelves. The dataset includes metadata such as titles, authors, categories (bookshelves), and download links for the book texts. Some books from Project Gutenberg are not included if they haven't been categorised. Notably, the dataset also retains audiobooks, offering flexibility for users interested in audio data alongside text.

Columns

The dataset primarily includes the following columns:

Title: The title of the book.

Author: The author of the book.

Link: The direct download link for the book's text.

Bookshelf: The category or genre assigned to the book on Project Gutenberg.

Text Data: The actual text content of the books, which can be downloaded using a provided script.

Distribution

The dataset's metadata is initially available in a gutenberg_metadata.csv file. The full text data for each book can be downloaded using a gutenberg_download.py script, which then saves the results into a CSV file. This final CSV file, containing the book texts, authors, titles, and categories, is approximately 5 GB in size. The corpus features more than 15,000 unique book texts.
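
As an illustration of working with the metadata file, the Python sketch below counts books per bookshelf using only the standard library. It assumes the CSV header uses exactly the column names listed above (`Title`, `Author`, `Link`, `Bookshelf`), which the description does not confirm; the sample rows and the `count_by_bookshelf` helper are made up for the example.

```python
import csv
import io
from collections import Counter


def count_by_bookshelf(csv_file):
    """Count rows per 'Bookshelf' value, reading the file row by row."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Bookshelf"] for row in reader)


# Made-up rows standing in for gutenberg_metadata.csv (column names assumed).
sample = io.StringIO(
    "Title,Author,Link,Bookshelf\n"
    "Book A,Author One,http://example.invalid/a.txt,Adventure\n"
    "Book B,Author Two,http://example.invalid/b.txt,Poetry\n"
    "Book C,Author One,http://example.invalid/c.txt,Adventure\n"
)
shelf_counts = count_by_bookshelf(sample)
```

Reading row by row keeps memory use low. For the roughly 5 GB file that holds full texts, individual fields may exceed the `csv` module's default field size limit, which `csv.field_size_limit` can raise.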

Usage

This dataset is ideal for various applications in education and learning analytics. Specific use cases include:

Natural Language Processing (NLP) tasks, such as text analysis, topic modelling, and language understanding.

Literature studies and computational humanities research.

Developing and training AI and Machine Learning models on large text corpora.

Working with audio data, as some books are included as audiobooks.

Coverage

The dataset has a global region coverage, reflecting the diverse origins of books within Project Gutenberg. It focuses on books that have been categorised on the Project Gutenberg website; un-categorised books are not included. No specific time range or demographic scope is detailed in the available information.

License

CC-BY-SA

Who Can Use It

This dataset is suitable for:

Researchers and academics focusing on text analysis, literary studies, or digital humanities.

Data scientists and machine learning engineers building and testing NLP models.

Students undertaking projects in linguistics, computer science, or library science.

Developers creating applications that require a large corpus of literary texts.

Dataset Name Suggestions

Project Gutenberg Book Corpus

Digital Literature Collection

Classic Book Text Dataset

Historical Text Library

Attributes

Original Data Source: 15000 Gutenberg Books
Data from: Book Reading Dataset
universe.roboflow.com
zip
Updated May 4, 2024
Cite
tim (2024). Book Reading Dataset [Dataset]. https://universe.roboflow.com/tim-4ijf0/book-reading
Available download formats
zip
Dataset updated
May 4, 2024
Dataset authored and provided by
tim
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Variables measured
Open Bounding Boxes
Description
Book Reading

## Overview
Book Reading is a dataset for object detection tasks - it contains Open annotations for 357 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Retail Sales: Book Stores
fred.stlouisfed.org
json
Updated Jun 17, 2025
Cite
(2025). Retail Sales: Book Stores [Dataset]. https://fred.stlouisfed.org/series/MRTSSM451211USN
Available download formats
json
Dataset updated
Jun 17, 2025
License
https://fred.stlouisfed.org/legal/#copyright-public-domain
Description
Graph and download economic data for Retail Sales: Book Stores (MRTSSM451211USN) from Jan 1992 to Apr 2025 about book, retail trade, sales, retail, and USA.
Amazon Bestselling Books & Customer Reviews
opendatabay.com
Updated Jul 2, 2025
Cite
Datasimple (2025). Amazon Bestselling Books & Customer Reviews [Dataset]. https://www.opendatabay.com/data/ai-ml/1639fb85-1580-4646-8216-326b2fac3437
Dataset updated
Jul 2, 2025
Dataset authored and provided by
Datasimple
Area covered
Reviews & Ratings
Description
This dataset provides an in-depth look into Amazon's top 100 bestselling books along with their customer reviews, ratings, and pricing information. It offers a window into the world of popular reading and customer sentiment. The dataset was collected in November 2023, making it suitable for analysing recent literary trends and consumer behaviour.

Columns

The dataset includes the following fields:
* Book Rank: The ranking of the book among the top 100 bestselling books on Amazon.
* Book Title: The title of the book. Examples include "The Ballad of Songbirds and Snakes" and "Iron Flame".
* Price: The price of the book in USD.
* Rating: The overall rating of the book, on a scale of 1 to 5.
* Author: The author of the book. Notable authors include Sarah J. Maas and Adam Wallace.
* Year of Publication: The year in which the book was published.
* Genre: The category to which the book belongs. Popular genres include Nonfiction and Childrens, literature.
* URL: The direct URL link to the book on Amazon's platform.
* Review Title: The title of the customer review.
* Reviewer: The name of the person who wrote the review.
* Reviewer Rating: The rating given by the reviewer for the book, on a scale of 1 to 5.
* Review Description: The textual content of the review.
* Is_verified: Indicates whether the review is a verified customer purchase.
* Date: The date when the review was posted.
* Timestamp: The timestamp indicating when the review was posted.
* ASIN: Amazon Standard Identification Number assigned to products on Amazon.

Distribution

The dataset focuses on the top 100 bestselling books.
* Price: Book prices range from 1.00 USD to 100.00 USD. There are approximately 10 books within each 9.90 USD price band across this range.
* Rating: Overall book ratings are generally high, ranging from 4.10 to 5.00. A notable number of books have ratings between 4.73 and 4.82.
* Year of Publication: Books in the dataset were published between 1947 and 2024. A significant portion, 64 books, were published between 2016 and 2024, indicating a strong presence of recent titles.
* Genre: While diverse, Nonfiction and Childrens, literature are among the more prominent genres.
* Authors/Titles: "The Ballad of Songbirds and Snakes" and "Iron Flame" are among the top-ranked titles. Sarah J. Maas and Adam Wallace are featured authors.

The dataset covers review data for each of the top 100 books, though the exact number of reviews per book is not specified.

Usage

This dataset is ideal for:
* Market analysis: Identifying bestselling trends, pricing strategies, and popular authors.
* Sentiment analysis: Analysing customer reviews to understand public perception and extract insights.
* Recommender systems: Building or improving book recommendation engines.
* Natural Language Processing (NLP): Training models for text classification, entity recognition, or summarisation based on review content.
* Data visualisation: Creating visualisations of literary trends, rating distributions, or reviewer behaviour.

Coverage

Geographic Scope: The data pertains to the global Amazon marketplace.

Time Range: Book publication years span from 1947 to 2024. Review data was collected up to November 2023.

License

CC-BY

Who Can Use It

Data scientists and analysts: For machine learning projects, statistical analysis, and predictive modelling.

Book enthusiasts and literary researchers: To explore popular reading habits and genre trends.

Publishers and authors: To gain insights into market demand and reader feedback.

Students and educators: For academic projects related to data science, literature, or consumer studies.

Dataset Name Suggestions

Amazon Bestselling Books & Customer Reviews

Top 100 Amazon Books Data 2023

Amazon Literary Trends Dataset

Bestselling Book Reviews on Amazon

Attributes

Original Data Source: Top 100 Bestselling Book Reviews on Amazon
Frequency of e-book downloading in the UK 2015-2022
statista.com
ai-chatbox.pro
Updated Dec 10, 2024
Cite
Statista (2024). Frequency of e-book downloading in the UK 2015-2022 [Dataset]. https://www.statista.com/statistics/291124/ebook-downloading-in-the-uk-by-frequency/
Dataset updated
Dec 10, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
United Kingdom
Description
Data on e-book downloading among internet users in the United Kingdom found that in 2022, a total of 18 percent of respondents had downloaded an e-book in the three months running to the survey, the same as in the previous year. Despite this, the most popular way of accessing e-books remains purchasing rather than downloading or sharing.
Shadow library book downloads, time, location, ISBN, title
uvaauas.figshare.com
zip
Updated Dec 4, 2020
Cite
B. Bodó; Daniel Antal; Zoltán Puha (2020). Shadow library book downloads, time, location, ISBN, title [Dataset]. http://doi.org/10.21942/uva.12330959.v1
Available download formats
zip
Unique identifier
https://doi.org/10.21942/uva.12330959.v1
Dataset updated
Dec 4, 2020
Dataset provided by
University of Amsterdam / Amsterdam University of Applied Sciences
Authors
B. Bodó; Daniel Antal; Zoltán Puha
License
https://www.gnu.org/licenses/gpl-3.0.html
Description
Weblog dataset from a scholarly shadow library. Comma-separated file, zipped.

Fields:
date - Timestamp when the book was downloaded
lat - Latitude redacted to 3 decimals
long - Longitude redacted to 4 decimals
city - City of download
country - Country of download
isbn - ISBN number of the book downloaded
title - Title of the book downloaded
BookCorpus Dataset
paperswithcode.com
opendatalab.com
Updated Dec 19, 2021
Cite
Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler (2021). BookCorpus Dataset [Dataset]. https://paperswithcode.com/dataset/bookcorpus
Dataset updated
Dec 19, 2021
Authors
Yukun Zhu; Ryan Kiros; Richard Zemel; Ruslan Salakhutdinov; Raquel Urtasun; Antonio Torralba; Sanja Fidler
Description
BookCorpus is a large collection of free novel books written by unpublished authors, which contains 11,038 books (around 74M sentences and 1G words) of 16 different sub-genres (e.g., Romance, Historical, Adventure, etc.).
Book Publishers in United States - 5,599 Verified Listings Database
poidata.io
csv, excel, json
Updated Jul 7, 2025
Cite
Poidata.io (2025). Book Publishers in United States - 5,599 Verified Listings Database [Dataset]. https://www.poidata.io/report/book-publisher/united-states
Available download formats
json, csv, excel
Dataset updated
Jul 7, 2025
Dataset provided by
Poidata.io
Area covered
United States
Description
Comprehensive dataset of 5,599 Book publishers in United States as of July, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
Nasdaq Stock Market Data (Nasdaq TotalView-ITCH feed)
databento.com
csv, dbn, json
Updated Jan 14, 2025
Cite
Databento (2025). Nasdaq Stock Market Data (Nasdaq TotalView-ITCH feed) [Dataset]. https://databento.com/datasets/XNAS.ITCH
Available download formats
dbn, json, csv
Dataset updated
Jan 14, 2025
Dataset provided by
Databento Inc.
Authors
Databento
Time period covered
May 1, 2018 - Present
Area covered
United States
Description
Get Nasdaq real-time and historical data with support for fast market replay at over 19 million book updates per second. Test our data for free with only 4 lines of code.

Nasdaq TotalView-ITCH is a proprietary data feed that disseminates full order book depth and last sale data from the Nasdaq stock market (XNAS). It delivers every quote and order at each price level, along with any event that updates the order book after an order is placed, such as trade executions, modifications, or cancellations. Nasdaq is the most active US equity exchange by volume and represented 13.03% of the average daily volume (ADV) as of January 2025.

With its L3 granularity, Nasdaq TotalView-ITCH captures information beyond the L1, top-of-book data available through SIP feeds and enables more accurate modeling of book imbalances, trade directionality, quote lifetimes, and more. This includes explicit trade aggressor side, odd lots, auction imbalance data, and the Net Order Imbalance Indicator (NOII) for the Nasdaq Opening and Closing Crosses and Nasdaq IPO/Halt Cross—the best predictor of Nasdaq opening and closing prices available. Other key advantages of Nasdaq TotalView-ITCH over SIP data include faster real-time dissemination and precise exchange-side timestamping directly from Nasdaq.

Real-time Nasdaq TotalView-ITCH data is included with a Plus or Unlimited subscription through our Databento US Equities service. Historical data is available for usage-based rates or with any subscription. Visit our pricing page for more details or to upgrade your plan.

Breadth of coverage: 20,329 products

Asset class(es): Equities

Origin: Directly captured at Equinix NY4 (Secaucus, NJ) with an FPGA-based network card and hardware timestamping. Synchronized to UTC with PTP.

Supported data encodings: DBN, CSV, JSON

Supported market data schemas: MBO, MBP-1, MBP-10, BBO-1s, BBO-1m, TBBO, Trades, OHLCV-1s, OHLCV-1m, OHLCV-1h, OHLCV-1d, Definition, Statistics, Status, Imbalance

Resolution: Immediate publication, nanosecond-resolution timestamps
opus_books
huggingface.co
Updated Mar 29, 2024
Cite
Language Technology Research Group at the University of Helsinki (2024). opus_books [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_books
Dataset updated
Mar 29, 2024
Dataset authored and provided by
Language Technology Research Group at the University of Helsinki
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for OPUS Books

Dataset Summary

This is a collection of copyright free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the meta-data at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.
Audible Dataset
kaggle.com
Updated Apr 11, 2022
Cite
Snehangsu De (2022). Audible Dataset [Dataset]. https://www.kaggle.com/datasets/snehangsude/audible-dataset
Dataset updated
Apr 11, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Snehangsu De
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Introduction

With the trend toward audiobooks growing, I gathered this data to understand how the audiobook market has been growing over the years. From authors of audiobooks to release dates, the data represents the important details of audiobooks from 1998 till 2025 (pre-planned releases).

I have yet to find a great audiobooks dataset and hence the urge to make a dataset that provides us with information on the basics and the history of audiobooks. I look to improve the dataset with more details in the near future.

File Information

The uncleaned data, audible_uncleaned.csv, is exactly the raw data I derived from Audible.in. The cleaned data, audible_cleaned.csv, has had a few basic data cleaning steps applied.

Libraries used

The data was collected using web scraping, with the following libraries:
- re
- Beautiful Soup
- Selenium

Beautiful Soup and Selenium were used in unison to mainly gather the data. The code can be re-used and you can find the code here: https://github.com/snehangsude/audible_scraper

Column Breakdown

name: Name of the audiobook

author: Author of the audiobook

narrator: Narrator of the audiobook

time: Length of the audiobook

releasedate: Release date of the audiobook

language: Language of the audiobook

stars: No. of stars the audiobook received

price: Price of the audiobook in INR

ratings: No. of reviews received by the audiobook
Books in Thailand - 1 Verified Listings Database
poidata.io
csv, excel, json
Updated Jul 4, 2025
Cite
Poidata.io (2025). Books in Thailand - 1 Verified Listings Database [Dataset]. https://www.poidata.io/report/books/thailand
Available download formats
excel, json, csv
Dataset updated
Jul 4, 2025
Dataset provided by
Poidata.io
Area covered
Thailand
Description
Comprehensive dataset of 1 Books in Thailand as of July, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
Evaluating the impact of the FWF-E-Book-Library collection in the OAPEN...
ssh.datastations.nl
ods, pdf, tsv, zip
Updated Mar 25, 2015
Cite
R. Snijder (2015). Evaluating the impact of the FWF-E-Book-Library collection in the OAPEN Library [Dataset]. http://doi.org/10.17026/DANS-ZM7-X6E9
Available download formats
ods(1085463), zip(15133), pdf(1453317), tsv(12818)
Unique identifier
https://doi.org/10.17026/DANS-ZM7-X6E9
Dataset updated
Mar 25, 2015
Dataset provided by
DANS Data Station Social Sciences and Humanities
Authors
R. Snijder
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Measuring scholarly impact and societal relevance in the humanities and social sciences can be done in several ways. Here we will look at a collection of e-books from the FWF-E-Book-Library, which is made available through the OAPEN Library. In 2014, 146 books of the FWF-E-Book-Library collection were made available via the OAPEN Library. The analysis is based on COUNTER compliant download data. This means that downloads by automated systems ('bots') and other types of suspicious download behaviour are discarded from the reports. The data of the 28,139 downloads used for this analysis originated from 23,652 IP addresses. It is clear that many providers use several IP addresses: the IP addresses were linked to 2,839 provider names. Where no information about a provider could be found, the download data was omitted.
Book Publication, Editions for United States
fred.stlouisfed.org
json
Updated Aug 15, 2012
Cite
(2012). Book Publication, Editions for United States [Dataset]. https://fred.stlouisfed.org/series/M0106AUSM234NNBR
Available download formats
json
Dataset updated
Aug 15, 2012
License
https://fred.stlouisfed.org/legal/#copyright-citation-required
Area covered
United States
Description
Graph and download economic data for Book Publication, Editions for United States (M0106AUSM234NNBR) from Jan 1913 to Dec 1928 about book and USA.

Facebook

Twitter

Click to copy link

Link copied

Cite

Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096

Best Books Ever Dataset

Explore at:

3 scholarly articles cite this dataset (View in Google Scholar)

Available download formats: csv

Unique identifier

https://doi.org/10.5281/zenodo.4265096

Dataset updated

Nov 10, 2020

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Lorena Casanova Lozano; Sergio Costa Planells

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

The dataset was collected as part of Prac1 for the subject Typology and Data Life Cycle of the Master's Degree in Data Science at the Universitat Oberta de Catalunya (UOC).

The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset

The data was retrieved in two sets: the first 30000 books and then the remaining 22478. Dates were not parsed and reformatted in the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30000 records and in Month Day Year format for the rest.
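
Because the two chunks store dates differently, anyone loading the file will likely need to normalize publishDate and firstPublishDate first. The following is a minimal Python sketch, not the authors' code: the function name `parse_publish_date`, the exact format strings, and the handling of ordinal suffixes such as "27th" are assumptions based only on the description above, and may need adjusting against the actual file.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumed candidate formats, derived from the dataset description:
#   "%m/%d/%Y" for the first chunk (mm/dd/yyyy)
#   "%B %d %Y" for the second chunk (Month Day Year, English month names)
# Note: %B depends on the active locale; an English/C locale is assumed.
CANDIDATE_FORMATS = ("%m/%d/%Y", "%B %d %Y")


def parse_publish_date(raw: Optional[str]) -> Optional[date]:
    """Return a date for either assumed format, or None if parsing fails."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    # Assumption: days may carry ordinal suffixes ("1st", "27th"); drop them.
    text = re.sub(r"(\d{1,2})(st|nd|rd|th)\b", r"\1", text)
    # Assumption: commas may separate day and year; treat them as spaces.
    text = " ".join(text.replace(",", " ").split())
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next candidate format
    return None  # unrecognized value: leave it to the caller to inspect


if __name__ == "__main__":
    for sample in ("09/14/2008", "April 27 2010", "April 27th 2010", ""):
        print(repr(sample), "->", parse_publish_date(sample))
```

Returning None for unrecognized values, rather than raising, keeps a bulk load running and makes the unparsed rows easy to count afterwards.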

Book cover images can optionally be downloaded from the URL in the 'coverImg' field. Python code for doing so, along with an example, can be found in the GitHub repo.

The 25 fields of the dataset are:

| Attributes | Definition | Completeness (%) |
| ------------- | ------------- | ------------- |
| bookId | Book identifier as in goodreads.com | 100 |
| title | Book title | 100 |
| series | Series name | 45 |
| author | Book's author | 100 |
| rating | Global Goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (e.g. Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Publisher | 93 |
| publishDate | Publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |
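
The description does not say how the completeness figures were computed. One plausible reading is the share of records with a non-empty value in each field, and the Python sketch below illustrates that reading only. The function name `completeness`, the choice to count an empty list literal `[]` as missing, and the four-row sample are all hypothetical and do not come from the dataset.

```python
import csv
import io
from typing import Dict, Iterable, List, Mapping

# Assumption: these string values count as "missing" for a field.
EMPTY_VALUES = ("", "[]")


def completeness(rows: Iterable[Mapping[str, str]], fields: List[str]) -> Dict[str, int]:
    """Return, per field, the rounded percentage of rows with a non-empty value."""
    counts = {field: 0 for field in fields}
    total = 0
    for row in rows:
        total += 1
        for field in fields:
            value = row.get(field)
            if value is not None and value.strip() not in EMPTY_VALUES:
                counts[field] += 1
    if total == 0:
        return {field: 0 for field in fields}
    return {field: round(100 * counts[field] / total) for field in fields}


# Made-up sample rows, for illustration only (not taken from the dataset).
SAMPLE_CSV = """bookId,title,series,awards
1,Book A,Series X,[]
2,Book B,,[]
3,Book C,,['Some Award']
4,Book D,Series Y,[]
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
result = completeness(reader, ["bookId", "title", "series", "awards"])
print(result)
```

For the real file, the same function could be fed a `csv.DictReader` opened on the downloaded CSV, with the 25 attribute names from the table as `fields`.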
