Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 2 rows and is filtered where the book is Learning Data Mining with Python. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The dataset was collected as part of Prac1 of the subject Typology and Data Life Cycle of the Master's Degree in Data Science at the Universitat Oberta de Catalunya (UOC).
The dataset contains 25 variables and 52,478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).
The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset
The data was retrieved in two sets: the first 30,000 books and then the remaining 22,478. Dates were not parsed and reformatted for the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30,000 records and in "Month Day Year" format for the rest.
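Because the two chunks use different date formats, the date columns need normalizing before any time-based analysis. Below is a minimal pandas sketch of one way to do this; the file name and the exact "Month Day Year" pattern are assumptions and may need adjusting to the actual CSV.

```python
import pandas as pd

# Hypothetical file name; use the CSV shipped with the dataset.
books = pd.read_csv("goodreads_books.csv")

def normalize_date(value):
    """Try mm/dd/yyyy first, then 'Month Day Year' (e.g. 'September 1 2004')."""
    for fmt in ("%m/%d/%Y", "%B %d %Y"):
        try:
            return pd.to_datetime(value, format=fmt)
        except (ValueError, TypeError):
            continue
    return pd.NaT  # leave missing or unrecognized dates as NaT

for col in ("publishDate", "firstPublishDate"):
    books[col] = books[col].apply(normalize_date)
```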
Book cover images can optionally be downloaded from the URL in the 'coverImg' field. Python code for doing so, along with an example, can be found in the GitHub repo.
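A rough sketch of the download step, assuming the requests library, is shown here; it is not the repo's code, and the URL is a placeholder for a value taken from a record's 'coverImg' field.

```python
import requests

# Placeholder URL; in practice, take the value from a record's 'coverImg' field.
cover_url = "https://images.gr-assets.com/books/example-cover.jpg"

resp = requests.get(cover_url, timeout=30)
resp.raise_for_status()          # fail loudly on broken links
with open("cover.jpg", "wb") as fh:
    fh.write(resp.content)       # save the image bytes to disk
```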
The 25 fields of the dataset are:
| Attributes | Definition | Completeness (%) |
| ------------- | ------------- | ------------- |
| bookId | Book Identifier as in goodreads.com | 100 |
| title | Book title | 100 |
| series | Series Name | 45 |
| author | Book's Author | 100 |
| rating | Global goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (e.g., Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Publishing house | 93 |
| publishDate | Publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 11 rows and is filtered where the book subjects is Data mining-Social aspects. It features 9 columns including author, publication date, language, and book publisher.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Large-scale International Study shows comparative availability and terms for a much larger sample of almost 100,000 books across those same five jurisdictions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Data mining techniques in CRM : inside customer segmentation. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 4 rows and is filtered where the book is Spatial data mining : theory and application. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
This data was gathered as part of the data mining project for the General Assembly Data Science Immersive course.
The data was acquired from the Google Books store using the Google Books API. Nine features were gathered for each book in the data set. The column names are mostly self-explanatory; nevertheless, they are explained below.
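As a rough illustration of the acquisition step (not the original project code), the public Google Books volumes endpoint can be queried as follows; the search term and the fields printed are placeholders.

```python
import requests

# Query the public Google Books volumes endpoint; "data mining" is a placeholder search term.
resp = requests.get(
    "https://www.googleapis.com/books/v1/volumes",
    params={"q": "data mining", "maxResults": 5},
    timeout=30,
)
resp.raise_for_status()

for item in resp.json().get("items", []):
    info = item.get("volumeInfo", {})
    # Print a couple of per-book fields as an example of what can be gathered.
    print(info.get("title"), "|", ", ".join(info.get("authors", [])))
```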
I would like to thank Google for making a freely available API for their services and websites. I would also like to acknowledge the effort of the web scraper extension developer; it is a really nice and powerful tool for web scraping.
©2019 Google
Here is a story: you love reading books, and recently you bought a book that you thought you would like. However, after reading half the book, you still don't feel the enthusiasm and joy you expected. I think that machine learning algorithms might help solve such problems.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Post, mine, repeat : social media data mining becomes ordinary. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This research explored what happens when social media data mining becomes ordinary and is carried out by organisations that might be seen as the pillars of everyday life. The interviews on which the transcripts are based are discussed in Chapter 6 of the book. The referenced book contains a description of the methods. No other publications resulted from working with these transcripts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch. It features 7 columns including author, publication date, language, and book publisher.
Many existing complex space systems have a significant number of historical maintenance and problem databases that are stored in unstructured text form. The problem that we address in this paper is the discovery of recurring anomalies and relationships between problem reports that may indicate larger systemic problems. We will illustrate our techniques on data from discrepancy reports regarding software anomalies in the Space Shuttle. These free-text reports are written by a number of different people, so the emphasis and wording vary considerably. With Mehran Sahami from Stanford University, I'm putting together a book on text mining called "Text Mining: Theory and Applications", to be published by Taylor and Francis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Industrial Energy Data Book (IEDB) aggregates and synthesizes information on the trends in industrial energy use, energy prices, economic activity, and water use. The IEDB also estimates county-level industrial energy use and combustion energy use of large energy-using facilities (i.e., facilities required to report greenhouse gas emissions under the EPA's Greenhouse Gas Reporting Program). These estimates are derived from publicly available sources from EPA, Energy Information Administration, Census Bureau, USDA, and USGS. The estimation methodology is meant to be improved over time with input from the energy analysis and developer communities. Please refer to https://github.com/NREL/Industry-energy-data-book.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Open access platforms and retail websites have one thing in common: they are trying to present the most relevant offerings possible to their patrons. Retail websites – such as Amazon.com – deploy recommender systems based on data collected about their customers. These systems improve with the amount of data available: the more is known about the customers, the better they can predict what other merchandise will appeal.

Recommender systems are successful, but using open access platforms to track people is not acceptable. Therefore, a different solution is needed. Compared to retail websites, open access platforms have a unique advantage: they are able to use the complete contents of the publications they host. So, the question arises whether it is possible to create a recommender system based on the contents of freely available documents, instead of personal data.

The solution described in this paper is based on standard open source software. It is built using a combination of DSpace 6 and the R programming language. The open access platform – based on DSpace 6 – is the OAPEN Library; the data set used consists of nearly 11,000 open access books and chapters. The OAPEN Library enables data extraction through an API (application programming interface). A text mining algorithm written in the R programming language uses the full text of the publications and filters out the most common combinations of three words (trigrams). The next step is finding the publications that have one or more trigrams in common. The more trigrams two books or chapters share, the more closely they are 'connected'. This allows us not just to find related titles for each publication, but also to quantify how closely they are connected.

Date: 2021-02-24. Date Submitted: 2021-02-24.
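The OAPEN implementation is written in R; the following is only a small Python sketch of the trigram-overlap idea described above, using two toy documents in place of the API-extracted full texts.

```python
from collections import Counter
from itertools import combinations
import re

# Toy stand-ins for full texts extracted through the OAPEN Library API.
docs = {
    "book_a": "open access books are freely available and open access books help readers",
    "book_b": "open access books are indexed by libraries and freely available to readers",
}

def top_trigrams(text, n=100):
    """Return the n most common word trigrams in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(zip(words, words[1:], words[2:]))
    return {gram for gram, _ in counts.most_common(n)}

trigrams = {doc_id: top_trigrams(text) for doc_id, text in docs.items()}

# The more trigrams two publications share, the more closely they are 'connected'.
for a, b in combinations(trigrams, 2):
    print(a, b, "shared trigrams:", len(trigrams[a] & trigrams[b]))
```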
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Instant Weka how-to : implement cutting-edge data mining aspects in Weka to your applications. It features 7 columns including author, publication date, language, and book publisher.
GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0.html
Quantum Tunnel Tweets. The data set contains tweets sourced from @quantum_tunnel and @dt_science as a demo for classifying text using Naive Bayes. The demo is detailed in the book Data Science and Analytics with Python by Dr J Rogel-Salazar.

Data contents:

Train_QuantumTunnel_Tweets.csv: labelled tweets for text related to "Data Science", with three features:
- DataScience: [0/1] indicating whether the text is about "Data Science" or not.
- Date: date when the tweet was published.
- Tweet: text of the tweet.

Test_QuantumTunnel_Tweets.csv: testing data with Twitter utterances without labels:
- id: a unique identifier for tweets.
- Date: date when the tweet was published.
- Tweet: text of the tweet.

For further information, please get in touch with Dr J Rogel-Salazar.
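The file and column names above come from the dataset description; a minimal scikit-learn sketch of the kind of Naive Bayes classifier the demo describes (not the book's own code) could look like this:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Load the labelled training tweets and the unlabelled test tweets.
train = pd.read_csv("Train_QuantumTunnel_Tweets.csv")
test = pd.read_csv("Test_QuantumTunnel_Tweets.csv")

# Bag-of-words features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train["Tweet"], train["DataScience"])

# Predict whether each test tweet is about "Data Science" (1) or not (0).
test["DataScience_pred"] = model.predict(test["Tweet"])
print(test[["id", "DataScience_pred"]].head())
```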
https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for Electric Power Use: Manufacturing and Mining: Manufacturing: Nondurable Goods: Newspaper, Periodical, Book, and Directory Publishers (NAICS = 5111) (DISCONTINUED) (KWG5111SQ) from Q1 1972 to Q3 2005 about periodicals, book, used, printing, information, electricity, NAICS, mining, manufacturing, and USA.
https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for Electric Power Use: Manufacturing and Mining: Manufacturing: Nondurable Goods: Newspaper, Periodical, Book, and Directory Publishers (NAICS = 5111) (DISCONTINUED) (KWG5111A) from 1972 to 2004 about periodicals, book, used, printing, information, electricity, NAICS, mining, manufacturing, and USA.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Two datasets, works_published and works_cited, for the year 2020 from the OpenAlex database.

Check the license at https://github.com/ourresearch/openalex-docs/blob/main/license.md: "OpenAlex data is made available under the CC0 license. That means it's in the public domain, and free to use in any way you like. We appreciate attribution where it's convenient, but it's not at all necessary. There is one exception: the MAG Format snapshot is released under ODC-BY, as per the original MAG license applied by Microsoft (it reuses their schema). See the LICENSE.txt file in the MAG format snapshot distribution for attribution requirement details."

Data Quality Considerations: OpenAlex has improved the accuracy of the data with help from algorithms and institutions. Our current data quality assessment showed precision and recall of 95%+.

The first dataset, works_published, as constructed in the provided sources, refers to the publications authored by individuals affiliated with the University of Arizona (UArizona). The data is retrieved using the openalexR package by querying the OpenAlex database with UArizona's Research Organization Registry (ROR) ID (03m2x1q45) and specific publication date ranges. Key aspects of this dataset:
- Scope: It contains records of scholarly works associated with UArizona authors, including various publication types such as journals, repositories (like PubMed and arXiv), and others. It is also possible to filter the results to include only "journal" type publications using the primary_location.source.type = "journal" parameter in the oa_fetch function.
- Temporal Coverage: The sources demonstrate fetching data for specific years (e.g., 2019, 2020, 2021, 2022, 2023).
- Data Retrieval: The process involves using the oa_fetch function from the openalexR package with the entity = "works" parameter and specifying institutions.ror.
- Data Structure: Each record in this dataset represents a publication and includes various fields. Certain fields are data frames.
- Usage: This dataset is used as a starting point for various data analyses and data mining.

The second dataset, works_cited, refers to scholarly works cited by the publications within the works_published dataset. It is created by extracting the OpenAlex IDs from the $referenced_works field of the works_published data and then using the oa_fetch function to retrieve the full metadata for these cited works. Key aspects of this dataset:
- Scope: It includes metadata for a wide range of scholarly works that have been cited by UArizona-affiliated publications. This can encompass articles, books, preprints, book chapters, and other types of scholarly outputs.
- Data Derivation: The dataset is derived from the referenced_works field of the works_published dataset.
- Data Structure: Each record in this dataset represents a cited work and contains various fields retrieved by the OpenAlex API.

The third file (institution_publications.r) is the source code used to produce the above datasets. Note that the code retrieves additional years beyond 2020.

Usage: Both datasets are crucial for performing publication and citation analysis and mining, including:
- Identifying the most frequently cited works and journals.
- Analyzing the journal usage and publisher distribution of cited works.
- Understanding the scholarly landscape influencing UArizona research.
- Identifying potential resources for library collections based on citation frequency.
- Investigating the presence and frequency of citations from specific publishers or to specific works.

For inquiries regarding the contents of this dataset, please contact the Corresponding Author listed in the README.txt file. Administrative inquiries (e.g., removal requests, trouble downloading, etc.) can be directed to data-management@arizona.edu. This item is part of University of Arizona authors' scholarly works published and cited works.
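The retrieval itself is done with the R openalexR package (institution_publications.r); purely as a hedged illustration, the same query can be expressed against the OpenAlex REST API in Python roughly as follows, with the contact address being a placeholder.

```python
import requests

# Fetch 2020 works affiliated with UArizona (ROR 03m2x1q45) via cursor paging.
params = {
    "filter": (
        "institutions.ror:03m2x1q45,"
        "from_publication_date:2020-01-01,"
        "to_publication_date:2020-12-31"
    ),
    "per-page": 200,
    "cursor": "*",                  # start of cursor-based paging
    "mailto": "you@example.org",    # placeholder contact address for polite API use
}

works, referenced = [], set()
while True:
    page = requests.get("https://api.openalex.org/works", params=params, timeout=60).json()
    works.extend(page["results"])
    for work in page["results"]:
        # Collect cited-work IDs; these feed the works_cited retrieval step.
        referenced.update(work.get("referenced_works", []))
    cursor = page["meta"].get("next_cursor")
    if not cursor:
        break
    params["cursor"] = cursor

print(len(works), "works published;", len(referenced), "distinct referenced works")
```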
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In accordance with the Mining Act (Official Gazette RS, No. 14/14 – official consolidated text and 61/17-GZ), the Geological Survey of Slovenia, in its role as the Public Mining Service, supports the ministry responsible for mining (Ministry of Infrastructure) in terms of sustainable mineral management and mineral policy. The Public Mining Service is authorized to maintain a Mining Register and Mining Cadastre at the national level, including a chronology of mining rights granting (the "Mining Registry Book" web application and database). All official data, even on production and reserves/resources, are recorded for all mining and exploration areas in the country.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
China Listed Company: BVPS: Mining data was reported at 7.552 RMB in 2023. This records an increase from the previous number of 7.355 RMB for 2022. China Listed Company: BVPS: Mining data is updated yearly, averaging 5.570 RMB from Dec 2001 (Median) to 2023, with 21 observations. The data reached an all-time high of 7.552 RMB in 2023 and a record low of 1.690 RMB in 2001. China Listed Company: BVPS: Mining data remains active status in CEIC and is reported by China Securities Regulatory Commission. The data is categorized under China Premium Database’s Business and Economic Survey – Table CN.OZ: Financial Data of Listed Company: Book Value of Equity per Share (BVPS).