Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
Quotebank is a dataset of 235 million unique, speaker-attributed quotations that were extracted from 196 million English news articles (127 million containing quotations) crawled from over 377 thousand web domains (15 thousand root domains) between September 2008 and April 2020. The quotations were extracted and attributed using Quobert, a distantly and minimally supervised end-to-end, language-agnostic framework for quotation attribution.
For further details, please refer to the description below and to the original paper:
Timoté Vaucher, Andreas Spitz, Michele Catasta, and Robert West
"Quotebank: A Corpus of Quotations from a Decade of News"
Proceedings of the 14th International ACM Conference on Web Search and Data Mining (WSDM), 2021.
https://doi.org/10.1145/3437963.3441760
When using the dataset, please cite the above paper (Note that the above numbers differ from those listed in the paper, as the updated data in this repository has been computed from an expanded set of input news articles).
Dataset summary
The dataset consists of two versions: quotation-centric data (Version 1) and article-centric data (Version 2).
Both versions are split into 13 files (one per year) for ease of downloading and handling.
Dataset details
The following formatting applies to both versions of the dataset:
Version 1: Quotation-centric data
In this version of the dataset, the quotations are aggregated across all their occurrences in the news article data, and assigned a probability for each speaker candidate. We consider two quotations to be equivalent and suitable for aggregation if they are identical after lower-casing and removing punctuation.
Quotation-centric data
|-- quoteID: Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
|-- quotation: Text of the longest encountered original form of the quotation
|-- date: Earliest occurrence date of any version of the quotation
|-- phase: Corresponding phase of the data in which the quotation first occurred (A-E)
|-- probas: Array representing the probabilities of each speaker having uttered the quotation. The probabilities across different occurrences of the same quotation are summed for each distinct candidate speaker and then normalized
|   |-- proba: Probability for a given speaker
|   |-- speaker: Most frequent surface form for a given speaker in the articles where the quotation occurred
|-- speaker: Selected most likely speaker. This matches the first speaker entry in `probas`
|-- qids: Wikidata IDs of all aliases that match the selected speaker
|-- numOccurrences: Number of times this quotation occurs in the articles
|-- urls: List of links to the original articles containing the quotation
Note that for some speakers there can be more than one Wikidata ID in the `qids` field. To access Wikidata information about those speakers it is necessary to disambiguate them, i.e., select one of the listed Wikidata IDs that most likely corresponds to the respective speaker. Speaker disambiguation can be done using scripts available in the quotebank-toolkit repository. Additionally, the repository contains useful scripts for cleaning and enriching Quotebank.
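The aggregation rule described for Version 1 (merge quotations that are identical after lower-casing and removing punctuation, sum the per-speaker probabilities, then normalize) can be sketched as follows. This is only an illustration under stated assumptions, not the Quobert implementation: the function names, the input shape (a list of text/probability pairs), and the restriction to ASCII punctuation are all hypothetical.

```python
import string
from collections import defaultdict

# Assumption: only ASCII punctuation is stripped; the real pipeline may differ.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_quotation(text):
    """Lower-case the text and remove punctuation to build a grouping key."""
    return text.lower().translate(_PUNCT_TABLE)


def aggregate_probas(occurrences):
    """Aggregate (quotation_text, {speaker: proba}) pairs.

    Returns {normalized_text: [(speaker, proba), ...]} with the candidate
    speakers sorted by descending normalized probability.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for text, local_probas in occurrences:
        key = normalize_quotation(text)
        for speaker, proba in local_probas.items():
            sums[key][speaker] += proba

    aggregated = {}
    for key, per_speaker in sums.items():
        total = sum(per_speaker.values())
        ranked = sorted(
            ((speaker, proba / total) for speaker, proba in per_speaker.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )
        aggregated[key] = ranked
    return aggregated


# Hypothetical example: two surface forms of the same quotation.
example = aggregate_probas([
    ("We will win!", {"Alice": 0.8, "None": 0.2}),
    ("we will win", {"Alice": 0.6, "Bob": 0.4}),
])
```

In this sketch the first entry of each ranked list plays the role of the `speaker` field, mirroring the statement that `speaker` matches the first entry in `probas`.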
Version 2: Article-centric data
In this data set, individual quotations are not aggregated. For each article, one JSON entry contains all speakers that appear in the news article, the (attributed) quotations, and the text within a context window surrounding each of the quotations.
Article-centric data
|-- articleID: Primary key
|-- articleLength: Length of the article in PTB tokens
|-- date: Publication date of the article
|-- phase: Corresponding phase in which the article appeared (A-E)
|-- title: Title of the article
|-- url: Link to the original article
|-- names: List of all extracted speakers that occur in the article
|   |-- name: Surface form of the first occurrence of each speaker in the article
|   |-- ids: List of Wikidata IDs that have `name` as a possible alias
|   |-- offsets: List of pairs of start/end offset, signifying positions at which the speaker occurs in the article (full and partial mention of the speaker)
|-- quotations: List of all the quotations that appear in the article
|   |-- quoteID: Foreign key of the quotation (from the quotation-centric dataset)
|   |-- quotation: Text of the quotation as it occurs in this article
|   |-- quotationOffset: Index where the quotation starts in the article
|   |-- leftContext: Text in the left context window of the quotation (used for the attribution)
|   |-- rightContext: Text in the right context window (used for the attribution)
|   |-- globalProbas: Array representing the probabilities of each speaker having uttered the quote *at the aggregated level*. Same as `probas` for a given `quoteID`
|   |-- globalTopSpeaker: Most probable speaker *at the aggregated level*. Same as `speaker` for a given `quoteID`
|   |-- localProbas: Array representing the probabilities of each speaker having said the quote *given this article context*.
|   |   |-- proba: Probability for a given speaker
|   |   |-- speaker: Name of the speaker as it first occurs in this article
|   |-- localTopSpeaker: Selected speaker. Same name as the first entry in `localProbas`
|   |-- numOccurrences: Number of times this quotation occurs in any article
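As a rough illustration of how the article-centric fields relate, the sketch below scans article records and reports quotations whose article-level attribution (`localTopSpeaker`) differs from the aggregated one (`globalTopSpeaker`). It assumes one JSON object per line, which the description above does not confirm; the function name and the sample record are hypothetical.

```python
import json


def disagreeing_quotations(lines):
    """Yield (articleID, quoteID, localTopSpeaker, globalTopSpeaker) tuples
    for quotations where the local and aggregated attributions differ.

    Assumption: `lines` is an iterable of strings, one JSON article per line.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        article = json.loads(line)
        for quote in article.get("quotations", []):
            local = quote.get("localTopSpeaker")
            global_ = quote.get("globalTopSpeaker")
            if local != global_:
                yield article["articleID"], quote["quoteID"], local, global_


# Hypothetical record using only field names from the schema above.
sample_line = json.dumps({
    "articleID": "a1",
    "quotations": [
        {"quoteID": "2019-01-01-000001",
         "localTopSpeaker": "Alice", "globalTopSpeaker": "Alice"},
        {"quoteID": "2019-01-01-000002",
         "localTopSpeaker": "None", "globalTopSpeaker": "Bob"},
    ],
})
mismatches = list(disagreeing_quotations([sample_line]))
```

Because the function takes any iterable of lines, it can be fed an open file handle directly, so a yearly file does not need to be loaded into memory at once.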
Code repository
The code of Quobert that was used for the extraction and attribution of this dataset is available and maintained in a GitHub repository.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database offers 4564 references dealing with computer-assisted language comparison in a broad sense. In addition, the database offers 8364 distinct quotes collected from 5063 references. The majority of the references in the quote database overlap with those in the bibliographic database. The quotes are organized by keywords and can be browsed with a full-text and a keyword search.
The data (references and quotes) underlying each new release are provided here; the data can be browsed at https://evobib.digling.org/.
If you use the database, I would appreciate it if you could cite this in your research:
List, Johann-Mattis (2024): EvoBib: A bibliographic database and quote collection [Database, Version 1.8.0]. Passau: Chair for Multilingual Computational Linguistics. URL: https://evobib.digling.org/
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Explore a diverse and inspiring collection of quotes from the Goodreads website with our Goodreads Quotes Dataset. This dataset features a wide range of motivational, thought-provoking, and insightful quotes from various authors, thinkers, and personalities.
Dataset Details:
Format: JSON (JavaScript Object Notation), CSV (Comma-Separated Values)
Columns:
quote: The text of the quote.
author: The author of the quote.
tags: A list of tags or categories associated with the quote (e.g., ["inspiration", "motivation", "life"]).
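A minimal sketch of working with the three columns listed above: counting how often each tag appears. It assumes the JSON form is a list of objects with `quote`, `author`, and `tags` keys; the function name and the sample records are hypothetical and not taken from the dataset.

```python
import json
from collections import Counter


def top_tags(records, n=3):
    """Return the n most frequent tags across all quote records."""
    counts = Counter(tag for record in records for tag in record.get("tags", []))
    return counts.most_common(n)


# Hypothetical records in the documented column layout.
sample_records = json.loads("""
[
  {"quote": "First sample quote.", "author": "Author A", "tags": ["life", "inspiration"]},
  {"quote": "Second sample quote.", "author": "Author B", "tags": ["life"]}
]
""")
most_common = top_tags(sample_records, n=1)
```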
Data Preprocessing:
The dataset has been scraped from the Goodreads website, ensuring the collection of accurate and attributed quotes. The authors' names and tags have been cleaned to remove any unnecessary characters or formatting. Tags are provided as a list of keywords for easy categorization.
Use Cases:
Natural Language Processing (NLP) tasks such as sentiment analysis, text generation, and language modeling. Content creation for websites, social media, and inspirational content. Analyzing trends in quotes, authors, and popular tags. Exploring the wisdom shared by authors and thinkers throughout history.
Acknowledgments:
The dataset was collected and curated by DWS Studio for educational and research purposes. We acknowledge Goodreads for hosting the quotes and providing valuable literary content.
License:
This dataset is provided under the terms of the Apache License 2.0, ensuring that it can be used for research, analysis, and educational purposes while respecting the rights and attribution requirements of the original authors.
Disclaimer:
This dataset and its contents are intended for educational and research purposes only. Users are responsible for complying with the terms of service and policies of websites when using this data.
Start your journey with our Goodreads Quotes Dataset and let the power of words inspire your projects and analyses. If you have any questions or feedback, please feel free to contact us.
Dataset Card for English quotes
I-Dataset Summary
english_quotes is a dataset of all the quotes retrieved from goodreads quotes. This dataset can be used for multi-label text classification and text generation. The content of each quote is in English and concerns the domain of datasets for NLP and beyond.
II-Supported Tasks and Leaderboards
Multi-label text classification : The dataset can be used to train a model for text-classification, which consists of… See the full description on the dataset page: https://huggingface.co/datasets/Abirate/english_quotes.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The data has been scraped from goodreads.com. Quotes from categories like death, inspiration, wisdom, love and 6 other categories have been scraped. Each category has 3000 quotes, with the author of each quote available in the dataset. Other tags for each quote are also included in the dataset.
The data has been combined and shuffled from the 10 different categories. The total number of quotes present is 30,000.
CC0
Original Data Source: Quotes From Goodread
The dataset consists of 375 extracted quotes from 31 community reports relevant to the development of a materials data strategy for the NIST Materials Measurement Laboratory (MML). The dataset is used in the NIST internal report "A Materials Data Strategy." In the past decade, numerous public and private sector documents have highlighted the need for materials data to facilitate advanced technologies in myriad industrial and economic sectors. These documents have been analyzed to identify prevalent gaps in the establishment of an interconnected materials data infrastructure akin to that envisioned in the federal agency-wide Materials Genome Initiative. The internal report uses a uniform schematic format to portray these gaps, illustrate progress in addressing the gaps, and propose an MML roadmap of action items to further address the gaps.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Price quote data (for locally collected data only) and consumption segment indices that underpin consumer price inflation statistics, giving users access to the detailed data that are used in the construction of the UK’s inflation figures. The data are being made available for research purposes only and are not an accredited official statistic. From October 2024, private school fees and part-time education classes have been included in the consumption segment indices file. For more information on the introduction of consumption segments, please see the Consumer Prices Indices Technical Manual, 2019. Note that this dataset was previously called the consumer price inflation item indices and price quotes dataset.
algoseek Trade and Quote (TAQ) data contain all trades and top-of-book intraday quotes for all listed stocks, ETNs, ETFs, ADRs, and funds from 15+ US exchanges and marketplaces. TAQ data files are organized into a single-format feed where events are ordered by the time received, with nanosecond timestamps starting from 2016 and millisecond timestamps before. The entire trading session includes early and late hours from 04:00 to 20:00 EST.
This statistic shows the willingness to share recent driving data for personalized insurance quotes in the United States in 2017, by generation. Millennials were the most likely to share their recent driving data with 93 percent of those respondents saying that they would be willing to do that.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Context: The data was created to build a Content Based Recommendation System using Text Data.
Ideas: Create a content-based recommendation engine based on user preference. Data preprocessing using NLP Methods. Analyze Textual Dataset.
CC0
Original Data Source: Quotes Dataset
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comprehensive reference table of 33 powerful mindfulness quotes organized by category and author, featuring wisdom from spiritual teachers like Thich Nhat Hanh, Buddha, Jon Kabat-Zinn, and others. Each quote is categorized by practical application including Present Moment Mastery, Inner Peace & Self-Compassion, Understanding & Managing Emotions, Awakening & Awareness, Life Philosophy & Wisdom, and Inner Wisdom & Intuition.
https://creativecommons.org/publicdomain/zero/1.0/
📚 About This Dataset: Stoic Wisdom — Quotes from Prominent Stoic Philosophers
🌟 Overview
This dataset is a comprehensive collection of Stoic quotes sourced from some of the most prominent Stoic philosophers and writers. It includes timeless wisdom from the likes of Marcus Aurelius, Seneca, Epictetus, Zeno of Citium, Musonius Rufus, and others. These quotes cover a wide range of Stoic themes such as resilience, discipline, mindfulness, control, and virtue—offering practical guidance for living a fulfilling and rational life.
The data was scraped from Goodreads using Python.
📄 Dataset Details
• Quote: The Stoic quote text.
• Author: The philosopher or writer who authored the quote.
• Book: The source/book where the quote is found (if available).
• Tags: The main themes or topics related to the quote (e.g., “attitude,” “pain,” “stoicism,” “freedom”).
🔍 Potential Use Cases
• Sentiment Analysis: Analyze Stoic sentiments related to topics like death, fate, and personal control.
• Topic Modeling: Identify core Stoic themes using unsupervised NLP techniques.
• Philosophical Comparisons: Compare Stoic quotes to other schools of thought (e.g., Epicureanism, Buddhism).
• Machine Learning: Build classifiers to predict the author or theme of a quote based on textual features.
• Personal Development Tools: Power applications that deliver daily Stoic reflections and insights.
📝 Acknowledgments
• The quotes were sourced from Goodreads.
• The dataset is intended for educational and research purposes, respecting the content’s original attribution.
algoseek Futures Trade and Quote data include trades and quotes with condition codes (including Aggressor Side). Both the processed TAQ dataset and the unprocessed raw files are available. The processed TAQ dataset has millisecond timestamps. The data come from CME, CBOT, NYMEX, and COMEX, and go as far back as January 2010.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mexico Avg Daily Quote Salaries: Expanded data was reported at 373.600 MXN in Feb 2019. This records an increase from the previous number of 372.275 MXN for Jan 2019. Mexico Avg Daily Quote Salaries: Expanded data is updated monthly, averaging 241.011 MXN (median) from Jan 2000 to Feb 2019, with 230 observations. The data reached an all-time high of 373.600 MXN in Feb 2019 and a record low of 129.283 MXN in Feb 2000. Mexico Avg Daily Quote Salaries: Expanded data remains in active status in CEIC and is reported by the Secretary of Labor and Social Security. The data is categorized under Global Database’s Mexico – Table MX.G042: Average Daily Quote Salaries: Expanded.
CoinAPI offers digital asset data with crypto quotes from both CEX and DEX sources. Access real-time and historical market information including bid prices, ask prices, trading volumes, and precise timestamps. Our complete crypto data enables informed decisions through accurate market insights.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mexico Avg Daily Quote Salaries: Expanded: Social Services data was reported at 502.300 MXN in Feb 2019. This records an increase from the previous number of 502.200 MXN for Jan 2019. Mexico Avg Daily Quote Salaries: Expanded: Social Services data is updated monthly, averaging 312.028 MXN (median) from Jan 2000 to Feb 2019, with 230 observations. The data reached an all-time high of 502.300 MXN in Feb 2019 and a record low of 155.822 MXN in Jan 2000. Mexico Avg Daily Quote Salaries: Expanded: Social Services data remains in active status in CEIC and is reported by the Secretary of Labor and Social Security. The data is categorized under Global Database’s Mexico – Table MX.G042: Average Daily Quote Salaries: Expanded.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a combination of four years of Apple ($AAPL) options end of day quotes ranging from 01-2016 to 03-2023. Each row represents the information associated with one contract's strike price and a given expiration date.
Quote dates are given in Unix and in "YYYY-MM-DD HH:MM" formats. Quote frequency is daily at 4:00 pm EST, which corresponds to the end-of-day market close.
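The two date formats mentioned above can be related with a small conversion. The sketch below turns a Unix timestamp into the "YYYY-MM-DD HH:MM" form, assuming a fixed UTC-5 offset for EST as stated in the description; it deliberately ignores daylight saving time, and the function name and sample timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-5 offset ("EST"); daylight saving time is not handled.
EST = timezone(timedelta(hours=-5), "EST")


def unix_to_quote_time(unix_seconds):
    """Format a Unix timestamp as 'YYYY-MM-DD HH:MM' in EST."""
    return datetime.fromtimestamp(unix_seconds, tz=EST).strftime("%Y-%m-%d %H:%M")


# Hypothetical timestamp: 2016-01-04 21:00 UTC.
formatted = unix_to_quote_time(1451941200)  # "2016-01-04 16:00"
```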
REMEMBER: Apple stock split on August 28, 2020. This will be reflected in the data. Keep this in mind!
What is an option chain?
An option chain is the listing of all option contracts for an underlying asset. It has two sections: calls and puts. A call option is a contract that gives you the right, but not the obligation, to buy the underlying asset at a particular price on or before the option's expiration date. A put option gives you the right, but not the obligation, to sell it. This dataset contains the entire option chain (all available option contracts for all expirations) for each business day between Q1 2016 and Q1 2023.
This dataset contains data for American options, which can be exercised on or before expiration date. This is unlike European options contracts, which can only be exercised on the expiration date.
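To make the call/put distinction concrete, the sketch below computes the intrinsic value of each contract type, i.e., what exercising it immediately would be worth. This is the standard textbook payoff, not a pricing model and not something computed in the dataset; the function names and numbers are illustrative.

```python
def call_intrinsic_value(spot, strike):
    """Value of exercising a call now: buy at strike, asset is worth spot."""
    return max(spot - strike, 0.0)


def put_intrinsic_value(spot, strike):
    """Value of exercising a put now: sell at strike, asset is worth spot."""
    return max(strike - spot, 0.0)


# Hypothetical numbers: underlying at 150, contract strike at 140.
call_value = call_intrinsic_value(150.0, 140.0)  # 10.0
put_value = put_intrinsic_value(150.0, 140.0)    # 0.0
```

Because American options can be exercised at any time up to expiration, their market price should not fall below this intrinsic value.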
I am also continuously working on the associated notebook to give a basic idea of how to load and explore the data. Stay tuned!
Similar Datasets:
- $TSLA Option Chains
- $SPY Option Chains
- $NVDA Option Chains
- $QQQ Option Chains
Futures data can be ordered as full month ranges. To control costs, it is possible to order Nearest to Expiry (NTE) data with an overlap between the expiring future and the next future in the month of expiry, or with an overlap of more than one month if needed. You can also select all active expiries if required.
The data is available at tick level with millisecond resolution as well as at regular intervals of 1 Min, 5 Min and so on.
Data is priced separately for Trades (Tx) and Quotes (Qt).
Tick level Tx data consists of a millisecond timestamp and trade price Tx with an option to include the Volume field. Tick level Qt data consists of millisecond timestamp and quote Qt with a flag to indicate whether it is a Bid or an Ask and optionally the Qt size field can be added.
Regular interval data is usually supplied as one of these sets:
CloseTx
CloseBid, CloseAsk
OpenTx, HighTx, LowTx, CloseTx
OpenBid, HighBid, LowBid, CloseBid
OpenAsk, HighAsk, LowAsk, CloseAsk
Additional Fields: IntervalTxVolume, CloseBidSize, CloseAskSize and some others are available if required.
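The relationship between tick-level trades and the regular-interval sets can be sketched as a simple bucketing step. The example below derives OpenTx, HighTx, LowTx, and CloseTx per interval from (millisecond timestamp, price) pairs. It is an illustration only, not the vendor's method: the function name, input shape, and the choice to label each bar by its interval start are assumptions.

```python
def trade_bars(ticks, interval_ms=60_000):
    """Aggregate (timestamp_ms, price) trade ticks into OHLC bars.

    Returns {interval_start_ms: {"OpenTx", "HighTx", "LowTx", "CloseTx"}}.
    Intervals without trades produce no bar.
    """
    bars = {}
    # Stable sort by timestamp keeps the arrival order of same-millisecond ticks.
    for timestamp_ms, price in sorted(ticks, key=lambda tick: tick[0]):
        start = timestamp_ms - timestamp_ms % interval_ms
        bar = bars.get(start)
        if bar is None:
            bars[start] = {"OpenTx": price, "HighTx": price,
                           "LowTx": price, "CloseTx": price}
        else:
            bar["HighTx"] = max(bar["HighTx"], price)
            bar["LowTx"] = min(bar["LowTx"], price)
            bar["CloseTx"] = price
    return bars


# Hypothetical ticks: three trades in the first minute, one in the second.
bars = trade_bars([(0, 10.0), (30_000, 12.0), (59_999, 9.0), (60_000, 11.0)])
```

The same bucketing applies to quotes if the Bid and Ask ticks are first separated by their flag and aggregated independently.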
Timestamps are by default in GMT but data can be in any Time Zone requested.
Pricing depends on frequency and number of fields.
Hundreds of papers in finance and economics have been written using our data since 1986, and several reputable banks and hedge funds use our data for backtesting and risk management.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database offers 3524 references dealing with computer-assisted language comparison in a broad sense. In addition, the database offers 5298 distinct quotes collected from 2835 references. The majority of the references in the quote database overlap with those in the bibliographic database. The quotes are organized by keywords and can be browsed with a full-text and a keyword search.
The data (references and quotes) underlying each new release are provided here; the data can be browsed at https://digling.org/evobib/.
If you use the database, I would appreciate it if you could cite this in your research:
List, Johann-Mattis (2020): EvoBib: A bibliographical database and quote collection for historical linguistics. Version 1.1.0. Jena: Max Planck Institute for the Science of Human History. URL: https://digling.org/evobib/ DOI: 10.5281/zenodo.3699172
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Data set of an embedded study at an airport organization about the transformation towards a multimodal transport hub. The data is the result of (in)formal conversations, diary notes, meeting notes and observations. This is a condensed data set, representing quotes and notes that were deemed relevant to the study's topic. The data was collected over 16 months, and the data set includes (anonymous) quotations and paraphrases, and also condensed meaning units (the interpretation of the researchers).