100+ datasets found
  1. Data from: Quotebank: A Corpus of Quotations from a Decade of News

    • zenodo.org
    bz2
    Updated Jun 18, 2023
    Cite
    Timoté Vaucher; Andreas Spitz; Michele Catasta; Robert West (2023). Quotebank: A Corpus of Quotations from a Decade of News [Dataset]. http://doi.org/10.5281/zenodo.4277311
    Available download formats: bz2
    Dataset updated
    Jun 18, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Timoté Vaucher; Andreas Spitz; Michele Catasta; Robert West
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    Quotebank is a dataset of 235 million unique, speaker-attributed quotations that were extracted from 196 million English news articles (127 million containing quotations) crawled from over 377 thousand web domains (15 thousand root domains) between September 2008 and April 2020. The quotations were extracted and attributed using Quobert, a distantly and minimally supervised end-to-end, language-agnostic framework for quotation attribution.

    For further details, please refer to the description below and to the original paper:

    Timoté Vaucher, Andreas Spitz, Michele Catasta, and Robert West
    "Quotebank: A Corpus of Quotations from a Decade of News"
    Proceedings of the 14th International ACM Conference on Web Search and Data Mining (WSDM), 2021.
    https://doi.org/10.1145/3437963.3441760

    When using the dataset, please cite the above paper (Note that the above numbers differ from those listed in the paper, as the updated data in this repository has been computed from an expanded set of input news articles).

    Dataset summary

    The dataset consists of two versions:

    • Quotation-centric version (quotes-YYYY.json.bz2)
      An aggregated set of unique quotations with the most likely speaker. Each unique quotation occurs only once in this version of the data and the probabilities of the candidate speakers to which the quotation can be attributed are aggregated over all occurrences of the quotation. This version of the data is a minimal - but complete - list of attributed quotations that is aimed at users who only require quotation-speaker attributions, but no individual contexts for these quotations from the original articles.
    • Article-centric version (quotebank-YYYY.json.bz2)
      A complete set of all individual quotation mentions with associated speaker as well as the article context in which they are mentioned. This larger version contains one entry per article in the news data. Each entry contains all speakers that appear in the news article as well as the (attributed) quotations, alongside a context window surrounding the quotations.

    Both versions are split into 13 files (one per year) for ease of downloading and handling.

    Dataset details

    The following formatting applies to both versions of the dataset:

    • All data is made available in JSON format that has been compressed using bzip2.
    • The data is split per year (i.e., there is one file for each calendar year).
    • The offsets of quotations, contexts, and speaker annotations are given in units of Penn TreeBank Tokenizer tokens.
    • Offsets are zero-based and are computed from the start of the article.
    • When pairs of offsets are provided, the end offset is non-inclusive (e.g. in Python you can call tokens[start:end] without having to do end+1).
    • The Spinn3r data from which Quotebank was extracted had been collected over the course of over a decade. During this time, the client-side code used for collecting the data changed several times, and various character-encoding-related issues led to different representations of the original text at different times. We thus divide the 12 years spanned by the Spinn3r corpus into five phases (Phases A through E). A detailed description is available on GitHub; the key takeaways are that (1) text was lowercased in Phases A, B, and C, whereas the original capitalization was maintained in Phases D and E, and that (2) non-ASCII characters are properly represented only in Phase E.
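
    The offset conventions above (zero-based, end-exclusive, counted in tokens) can be illustrated with a minimal sketch. The token list is hand-written and hypothetical; it only mimics Penn Treebank tokenization and is not taken from the dataset.

```python
# Hypothetical, hand-tokenized article; real offsets refer to Penn Treebank tokens.
tokens = ["``", "We", "will", "win", ",", "''", "said", "Jane", "Doe", "."]

def span_text(tokens, start, end):
    """Return the tokens in [start, end): zero-based start, non-inclusive end."""
    return tokens[start:end]

quote = span_text(tokens, 1, 4)    # covers token positions 1, 2, 3
speaker = span_text(tokens, 7, 9)  # covers token positions 7, 8
print(" ".join(quote), "|", " ".join(speaker))
```

    Because the end offset is non-inclusive, the length of a span is simply end - start.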


    Version 1: Quotation-centric data

    In this version of the dataset, the quotations are aggregated across all their occurrences in the news article data, and assigned a probability for each speaker candidate. We consider two quotations to be equivalent and suitable for aggregation if they are identical after lower-casing and removing punctuation.
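
    The equivalence rule can be sketched as a normalization function. This is an illustration only: the exact punctuation set used by the authors is not stated, so stripping ASCII punctuation and collapsing whitespace are assumptions, and normalize_quotation is a hypothetical name.

```python
import string

def normalize_quotation(text: str) -> str:
    """Lower-case, drop ASCII punctuation, and collapse whitespace (assumed rules)."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

a = "Yes, we can!"
b = "yes we can"
# Two surface forms count as the same quotation if their normalized forms match.
same_quotation = normalize_quotation(a) == normalize_quotation(b)
print(same_quotation)
```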

    Quotation-centric data
     |-- quoteID: Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
     |-- quotation: Text of the longest encountered original form of the quotation
     |-- date: Earliest occurrence date of any version of the quotation
     |-- phase: Corresponding phase of the data in which the quotation first occurred (A-E)
     |-- probas: Array representing the probabilities of each speaker having uttered the quotation.
       The probabilities across different occurrences of the same quotation are summed for
       each distinct candidate speaker and then normalized
       |-- proba: Probability for a given speaker
       |-- speaker: Most frequent surface form for a given speaker in the articles where the quotation occurred
     |-- speaker: Selected most likely speaker. This matches the first speaker entry in `probas`
     |-- qids: Wikidata IDs of all aliases that match the selected speaker
     |-- numOccurrences: Number of times this quotation occurs in the articles
     |-- urls: List of links to the original articles containing the quotation 

    Note that for some speakers there can be more than one Wikidata ID in the `qids` field. To access Wikidata information about those speakers it is necessary to disambiguate them, i.e., select one of the listed Wikidata IDs that most likely corresponds to the respective speaker. Speaker disambiguation can be done using scripts available in the quotebank-toolkit repository. Additionally, the repository contains useful scripts for cleaning and enriching Quotebank.
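
    The aggregation described for `probas` (sum per candidate speaker across occurrences, then normalize) can be sketched as follows. The input layout (a list of dicts with `speaker` and `proba` keys per occurrence), the sample values, and the function name are assumptions made for illustration; the sketch also matches speakers by exact name, which is simpler than whatever alias handling the real pipeline uses.

```python
from collections import defaultdict

def aggregate_probas(occurrences):
    """Sum per-speaker probabilities over all occurrences, then normalize to sum to 1."""
    totals = defaultdict(float)
    for local_probas in occurrences:
        for entry in local_probas:
            totals[entry["speaker"]] += entry["proba"]
    norm = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [{"speaker": name, "proba": value / norm} for name, value in ranked]

# Two made-up occurrences of the same quotation in different articles.
occurrences = [
    [{"speaker": "Jane Doe", "proba": 0.75}, {"speaker": "Alex Poe", "proba": 0.25}],
    [{"speaker": "Jane Doe", "proba": 0.5}, {"speaker": "John Roe", "proba": 0.5}],
]
probas = aggregate_probas(occurrences)
top_speaker = probas[0]["speaker"]  # analogous to the `speaker` field
print(top_speaker, probas)
```

    Sorting in descending order mirrors the statement that the selected speaker is the first entry in `probas`.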

    Version 2: Article-centric data

    In this data set, individual quotations are not aggregated. For each article, one JSON entry contains all speakers that appear in the news article, the (attributed) quotations, and the text within a context window surrounding each of the quotations.

    Article-centric data
     |-- articleID: Primary key
     |-- articleLength: Length of the article in PTB tokens
     |-- date: Publication date of the article
     |-- phase: Corresponding phase in which the article appeared (A-E)
     |-- title: Title of the article
     |-- url: Link to the original article
     |-- names: List of all extracted speakers that occur in the article
       |-- name: Surface form of the first occurrence of each speaker in the article
       |-- ids: List of Wikidata IDs that have `name` as a possible alias
       |-- offsets: List of pairs of start/end offset, signifying positions at which the speaker occurs in the article (full and partial mention of the speaker)
     |-- quotations: List of all the quotations that appear in the article
       |-- quoteID: Foreign key of the quotation (from the quotation-centric dataset)
       |-- quotation: Text of the quotation as it occurs in this article
       |-- quotationOffset: Index where the quotation starts in the article
       |-- leftContext: Text in the left context window of the quotation (used for the attribution)
       |-- rightContext: Text in the right context window (used for the attribution)
       |-- globalProbas: Array representing the probabilities of each speaker having uttered the quote *at the aggregated level*. Same as `probas` for a given `quoteID`
       |-- globalTopSpeaker: Most probable speaker *at the aggregated level*. Same as `speaker` for a given `quoteID` 
       |-- localProbas: Array representing the probabilities of each speaker having said the quote *given this article context*.
          |-- proba: Probability for a given speaker
          |-- speaker: Name of the speaker as it first occurs in this article
       |-- localTopSpeaker: Selected speaker. Same name as the first entry in `localProbas`
       |-- numOccurrences: Number of times this quotation occurs in any article 
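
    As a rough sketch of how the article-centric files might be consumed, the code below streams a bzip2-compressed file and lists quotations whose article-level speaker differs from the aggregated one. It assumes one JSON object per line, which the description does not state explicitly; the sample entry, the file name, and the helper names are made up so that the example runs without the real data.

```python
import bz2
import json
import os
import tempfile

def iter_entries(path):
    """Yield one parsed JSON object per non-empty line of a bzip2-compressed file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)

def disagreements(article):
    """Return quoteIDs whose article-level and aggregated top speakers differ."""
    return [
        q["quoteID"]
        for q in article["quotations"]
        if q["localTopSpeaker"] != q["globalTopSpeaker"]
    ]

# Write a tiny, made-up article entry (only a subset of the documented fields).
sample = {
    "articleID": "example-article",
    "quotations": [
        {"quoteID": "2019-01-01-000001", "localTopSpeaker": "Jane Doe", "globalTopSpeaker": "Jane Doe"},
        {"quoteID": "2019-01-01-000002", "localTopSpeaker": "John Roe", "globalTopSpeaker": "Jane Doe"},
    ],
}
path = os.path.join(tempfile.mkdtemp(), "quotebank-sample.json.bz2")
with bz2.open(path, mode="wt", encoding="utf-8") as handle:
    handle.write(json.dumps(sample) + "\n")

mismatched = [qid for article in iter_entries(path) for qid in disagreements(article)]
print(mismatched)
```

    Streaming line by line keeps memory use flat, which matters for yearly files of this size.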

    Code repository

    The code of Quobert that was used for the extraction and attribution of this data set is available and managed in a GitHub repository.

  2. EvoBib: A Bibliographic Database and Quote Collection for Historical...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 22, 2024
    Cite
    Johann-Mattis List (2024). EvoBib: A Bibliographic Database and Quote Collection for Historical Linguistics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1181952
    Dataset updated
    Aug 22, 2024
    Dataset authored and provided by
    Johann-Mattis List
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database offers 4564 references dealing with computer-assisted language comparison in a broad sense. In addition, the database offers 8364 distinct quotes collected from 5063 references. The majority of the references in the quote database overlaps with those in the bibliographic database. The quotes are organized by keywords and can be browsed with a full-text and a keyword search.

    The data (references and quotes) underlying each new release are provided here; the data can be browsed at https://evobib.digling.org/.

    If you use the database, I would appreciate it if you could cite this in your research:

    List, Johann-Mattis (2024): EvoBib: A bibliographic database and quote collection [Database, Version 1.8.0]. Passau: Chair for Multilingual Computational Linguistics. URL: https://evobib.digling.org/

  3. Goodreads Quotes Dataset

    • kaggle.com
    Updated Oct 3, 2023
    Cite
    Dakidarts (2023). Goodreads Quotes Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/6605524
    Dataset updated
    Oct 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dakidarts
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description


    Explore a diverse and inspiring collection of quotes from the Goodreads website with our Goodreads Quotes Dataset. This dataset features a wide range of motivational, thought-provoking, and insightful quotes from various authors, thinkers, and personalities.

    Dataset Details:

    Format: JSON (JavaScript Object Notation), CSV (Comma-Separated Values)
    Columns:
    • quote: The text of the quote.
    • author: The author of the quote.
    • tags: A list of tags or categories associated with the quote (e.g., ["inspiration", "motivation", "life"]).
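
    A minimal sketch of working with the three documented columns, here counting how often each tag appears. The records are made up for illustration, and the assumption that the JSON file is a list of objects with exactly these keys is not confirmed by the description.

```python
import json
from collections import Counter

# Made-up records following the documented columns: quote, author, tags.
raw = json.dumps([
    {"quote": "Example quote one.", "author": "Author A", "tags": ["life", "inspiration"]},
    {"quote": "Example quote two.", "author": "Author B", "tags": ["life"]},
])

records = json.loads(raw)
# Flatten every record's tag list and count occurrences of each tag.
tag_counts = Counter(tag for record in records for tag in record["tags"])
print(tag_counts.most_common(1))
```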

    Data Preprocessing:

    The dataset has been scraped from the Goodreads website, ensuring the collection of accurate and attributed quotes. The authors' names and tags have been cleaned to remove any unnecessary characters or formatting. Tags are provided as a list of keywords for easy categorization.

    Use Cases:

    • Natural Language Processing (NLP) tasks such as sentiment analysis, text generation, and language modeling.
    • Content creation for websites, social media, and inspirational content.
    • Analyzing trends in quotes, authors, and popular tags.
    • Exploring the wisdom shared by authors and thinkers throughout history.

    Acknowledgments:

    The dataset was collected and curated by DWS Studio for educational and research purposes. We acknowledge Goodreads for hosting the quotes and providing valuable literary content.

    License:

    This dataset is provided under the terms of the Apache 2.0 license, ensuring that it can be used for research, analysis, and educational purposes while respecting the rights and attribution requirements of the original authors.

    Disclaimer:

    This dataset and its contents are intended for educational and research purposes only. Users are responsible for complying with the terms of service and policies of websites when using this data.

    Start your journey with our Goodreads Quotes Dataset and let the power of words inspire your projects and analyses. If you have any questions or feedback, please feel free to contact us.

  4. english_quotes

    • huggingface.co
    • opendatalab.com
    Updated Dec 19, 2021
    Cite
    Abir ELTAIEF (2021). english_quotes [Dataset]. http://doi.org/10.57967/hf/1053
    Dataset updated
    Dec 19, 2021
    Authors
    Abir ELTAIEF
    Description

    Dataset Card for English quotes

      I-Dataset Summary
    

    english_quotes is a dataset of all the quotes retrieved from goodreads quotes. This dataset can be used for multi-label text classification and text generation. The content of each quote is in English and concerns the domain of datasets for NLP and beyond.

      II-Supported Tasks and Leaderboards
    

    Multi-label text classification : The dataset can be used to train a model for text-classification, which consists of… See the full description on the dataset page: https://huggingface.co/datasets/Abirate/english_quotes.

  5. Quotes From Goodread

    • opendatabay.com
    Updated Jun 20, 2025
    Cite
    Datasimple (2025). Quotes From Goodread [Dataset]. https://www.opendatabay.com/data/ai-ml/f0d86cd4-fc04-46ae-8ef0-860d44d0a3bc
    Dataset updated
    Jun 20, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    The data has been scraped from goodreads.com. Quotes from categories like death, inspiration, wisdom, love and 6 other categories have been scraped. Each category has 3000 quotes, with the authors who wrote them available in the dataset. Other tags from the quote are also mentioned in the dataset.

    The data has been combined and shuffled from the 10 different categories. The total number of quotes present is 30,000.

    License

    CC0

    Original Data Source: Quotes From Goodread

  6. Dataset: Extracted Quotes from Community Reports Relevant to the Development...

    • catalog.data.gov
    • data.nist.gov
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). Dataset: Extracted Quotes from Community Reports Relevant to the Development of a NIST-MML Materials Data Strategy [Dataset]. https://catalog.data.gov/dataset/dataset-extracted-quotes-from-community-reports-relevant-to-the-development-of-a-nist-mml-
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    The dataset consists of 375 extracted quotes from 31 community reports relevant to the development of a materials data strategy for the NIST Materials Measurement Laboratory (MML). The dataset is used in the NIST internal report "A Materials Data Strategy." In the past decade, numerous public and private sector documents have highlighted the need for materials data to facilitate advanced technologies in myriad industrial and economic sectors. These documents have been analyzed to identify prevalent gaps in the establishment of an interconnected materials data infrastructure akin to that envisioned in the federal agency-wide Materials Genome Initiative. The internal report uses a uniform schematic format to portray these gaps, illustrate progress in addressing the gaps, and propose an MML roadmap of action items to further address the gaps.

  7. Consumer price inflation consumption segment indices and price quotes

    • ons.gov.uk
    • cy.ons.gov.uk
    csv
    Updated Jun 18, 2025
    Cite
    Office for National Statistics (2025). Consumer price inflation consumption segment indices and price quotes [Dataset]. https://www.ons.gov.uk/economy/inflationandpriceindices/datasets/consumerpriceindicescpiandretailpricesindexrpiitemindicesandpricequotes
    Available download formats: csv
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    Office for National Statistics (http://www.ons.gov.uk/)
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    Price quote data (for locally collected data only) and consumption segment indices that underpin consumer price inflation statistics, giving users access to the detailed data that are used in the construction of the UK’s inflation figures. The data are being made available for research purposes only and are not an accredited official statistic. From October 2024, private school fees and part-time education classes have been included in the consumption segment indices file. For more information on the introduction of consumption segments, please see the Consumer Prices Indices Technical Manual, 2019. Note that this dataset was previously called the consumer price inflation item indices and price quotes dataset.

  8. AlgoSeek Equity Trade and Quote Data US coverage - nanosecond timestamps...

    • datarade.ai
    Updated Feb 3, 2021
    Cite
    AlgoSeek (2021). AlgoSeek Equity Trade and Quote Data US coverage - nanosecond timestamps since 2016 [Dataset]. https://datarade.ai/data-products/algoseek-equity-trade-and-quote-data-algoseek
    Dataset updated
    Feb 3, 2021
    Dataset authored and provided by
    AlgoSeek
    Area covered
    United States
    Description

    algoseek Trade and Quote (TAQ) data contain all trades and top-of-book intraday quotes for all listed stocks, ETNs, ETFs, ADRs, and funds from 15+ US exchanges and marketplaces. TAQ data files are organized into a single format feed where events are ordered by the time received with nanosecond timestamps starting from 2016, and millisecond timestamps before. The entire trading session includes early and late hours from 04:00 to 20:00 EST

  9. Willingness to share driving data for personalized insurance quotes U.S....

    • statista.com
    Updated May 19, 2025
    Cite
    Statista (2025). Willingness to share driving data for personalized insurance quotes U.S. 2017, by age [Dataset]. https://www.statista.com/statistics/719039/willingness-to-share-driving-data-for-personalized-insurance-quotes-usa-by-age/
    Dataset updated
    May 19, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Feb 2017
    Area covered
    United States
    Description

    This statistic shows the willingness to share recent driving data for personalized insurance quotes in the United States in 2017, by generation. Millennials were the most likely to share their recent driving data with 93 percent of those respondents saying that they would be willing to do that.

  10. Quotes Dataset

    • opendatabay.com
    Updated Jun 24, 2025
    Cite
    Datasimple (2025). Quotes Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/5998ae79-6192-483f-b5a5-8075cf335b18
    Dataset updated
    Jun 24, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    Context: The data was created to build a Content Based Recommendation System using Text Data.

    Ideas: Create a content-based recommendation engine based on user preference. Data preprocessing using NLP Methods. Analyze Textual Dataset.

    License

    CC0

    Original Data Source: Quotes Dataset

  11. 33 Mindfulness Quotes Reference Table

    • 7chakracolors.com
    Updated 2024
    Cite
    Mindfulness Wisdom Collection (2024). 33 Mindfulness Quotes Reference Table [Dataset]. https://www.7chakracolors.com/blog/33-powerful-mindfulness-quotes/
    Dataset updated
    2024
    Dataset provided by
    7 Chakra Colors
    Authors
    Mindfulness Wisdom Collection
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comprehensive reference table of 33 powerful mindfulness quotes organized by category and author, featuring wisdom from spiritual teachers like Thich Nhat Hanh, Buddha, Jon Kabat-Zinn, and others. Each quote is categorized by practical application including Present Moment Mastery, Inner Peace & Self-Compassion, Understanding & Managing Emotions, Awakening & Awareness, Life Philosophy & Wisdom, and Inner Wisdom & Intuition.

  12. Stoic quotes

    • kaggle.com
    Updated Feb 19, 2025
    Cite
    Tejas Nisar (2025). Stoic quotes [Dataset]. https://www.kaggle.com/datasets/tejasnisar/stoic-quotes
    Dataset updated
    Feb 19, 2025
    Dataset provided by
    Kaggle
    Authors
    Tejas Nisar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📚 About This Dataset: Stoic Wisdom — Quotes from Prominent Stoic Philosophers

    🌟 Overview

    This dataset is a comprehensive collection of Stoic quotes sourced from some of the most prominent Stoic philosophers and writers. It includes timeless wisdom from the likes of Marcus Aurelius, Seneca, Epictetus, Zeno of Citium, Musonius Rufus, and others. These quotes cover a wide range of Stoic themes such as resilience, discipline, mindfulness, control, and virtue—offering practical guidance for living a fulfilling and rational life.

    The data was scraped from Goodreads using Python.

    📄 Dataset Details
    • Quote: The Stoic quote text.
    • Author: The philosopher or writer who authored the quote.
    • Book: The source/book where the quote is found (if available).
    • Tags: The main themes or topics related to the quote (e.g., “attitude,” “pain,” “stoicism,” “freedom”).

    🔍 Potential Use Cases
    • Sentiment Analysis: Analyze Stoic sentiments related to topics like death, fate, and personal control.
    • Topic Modeling: Identify core Stoic themes using unsupervised NLP techniques.
    • Philosophical Comparisons: Compare Stoic quotes to other schools of thought (e.g., Epicureanism, Buddhism).
    • Machine Learning: Build classifiers to predict the author or theme of a quote based on textual features.
    • Personal Development Tools: Power applications that deliver daily Stoic reflections and insights.

    📝 Acknowledgments
    • The quotes were sourced from Goodreads.
    • The dataset is intended for educational and research purposes, respecting the content’s original attribution.

  13. AlgoSeek Futures Trade and Quote data US coverage - historic data till 2010

    • datarade.ai
    Updated Jan 15, 2010
    Cite
    AlgoSeek (2010). AlgoSeek Futures Trade and Quote data US coverage - historic data till 2010 [Dataset]. https://datarade.ai/data-products/algoseek-futures-trade-and-quote-data-algoseek
    Dataset updated
    Jan 15, 2010
    Dataset authored and provided by
    AlgoSeek
    Area covered
    United States of America
    Description

    algoseek Futures Trade and Quote data include trades and quotes with condition codes (including Aggressor Side). Both processed TAQ and unprocessed raw files are available. The processed TAQ dataset has millisecond timestamps. The data is from CME, CBOT, NYMEX, and Comex. Data is available as far back as January 2010.

  14. Mexico Avg Daily Quote Salaries: Expanded

    • ceicdata.com
    Updated Jan 15, 2025
    Cite
    CEICdata.com (2025). Mexico Avg Daily Quote Salaries: Expanded [Dataset]. https://www.ceicdata.com/en/indicator/mexico/data/avg-daily-quote-salaries-expanded
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    CEIC Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 1, 2018 - Feb 1, 2019
    Area covered
    Mexico
    Description

    Mexico Avg Daily Quote Salaries: Expanded data was reported at 373.600 MXN in Feb 2019. This records an increase from the previous number of 372.275 MXN for Jan 2019. Mexico Avg Daily Quote Salaries: Expanded data is updated monthly, averaging 241.011 MXN from Jan 2000 (Median) to Feb 2019, with 230 observations. The data reached an all-time high of 373.600 MXN in Feb 2019 and a record low of 129.283 MXN in Feb 2000. Mexico Avg Daily Quote Salaries: Expanded data remains active status in CEIC and is reported by Secretary of Labor and Social Security. The data is categorized under Global Database’s Mexico – Table MX.G042: Average Daily Quote Salaries: Expanded.
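
    The month-over-month increase quoted above can be expressed as a percentage with a short calculation. This is only a worked arithmetic illustration using the two figures from the description; the function name is hypothetical.

```python
def percent_change(previous: float, current: float) -> float:
    """Relative change from previous to current, in percent."""
    return (current - previous) / previous * 100.0

jan_2019 = 372.275  # MXN, from the description
feb_2019 = 373.600  # MXN, from the description
change = percent_change(jan_2019, feb_2019)
print(f"{change:.2f}%")  # roughly a third of a percent
```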

  15. Crypto Quotes: Real-Time & Historical CEX/DEX Data | Crypto Data | Bid Price...

    • dataproducts.coinapi.io
    Updated Oct 10, 2018
    Cite
    CoinAPI (2018). Crypto Quotes: Real-Time & Historical CEX/DEX Data | Crypto Data | Bid Price | Ask Price [Dataset]. https://dataproducts.coinapi.io/products/coinapi-crypto-quotes-data-real-time-historical-quotes-coinapi
    Dataset updated
    Oct 10, 2018
    Dataset provided by
    Coinapi Ltd
    Authors
    CoinAPI
    Area covered
    Western Sahara, Kenya, Paraguay, Comoros, Niue, Djibouti, Slovakia, Benin, France, Saint Pierre and Miquelon
    Description

    CoinAPI offers digital asset data with crypto quotes from both CEX and DEX sources. Access real-time and historical market information including bid prices, ask prices, trading volumes, and precise timestamps. Our complete crypto data enables informed decisions through accurate market insights.

  16. Mexico Avg Daily Quote Salaries: Expanded: Social Services

    • ceicdata.com
    Cite
    CEICdata.com, Mexico Avg Daily Quote Salaries: Expanded: Social Services [Dataset]. https://www.ceicdata.com/en/mexico/average-daily-quote-salaries-expanded/avg-daily-quote-salaries-expanded-social-services
    Dataset provided by
    CEIC Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 1, 2018 - Feb 1, 2019
    Area covered
    Mexico
    Description

    Mexico Avg Daily Quote Salaries: Expanded: Social Services data was reported at 502.300 MXN in Feb 2019. This records an increase from the previous number of 502.200 MXN for Jan 2019. Mexico Avg Daily Quote Salaries: Expanded: Social Services data is updated monthly, averaging 312.028 MXN from Jan 2000 (Median) to Feb 2019, with 230 observations. The data reached an all-time high of 502.300 MXN in Feb 2019 and a record low of 155.822 MXN in Jan 2000. Mexico Avg Daily Quote Salaries: Expanded: Social Services data remains active status in CEIC and is reported by Secretary of Labor and Social Security. The data is categorized under Global Database’s Mexico – Table MX.G042: Average Daily Quote Salaries: Expanded.

  17. $AAPL Option Chains - Q1 2016 to Q1 2023

    • kaggle.com
    Updated Apr 7, 2023
    Cite
    Kyle Graupe (2023). $AAPL Option Chains - Q1 2016 to Q1 2023 [Dataset]. https://www.kaggle.com/datasets/kylegraupe/aapl-options-data-2016-2020
    Dataset updated
    Apr 7, 2023
    Dataset provided by
    Kaggle
    Authors
    Kyle Graupe
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    IF YOU FIND THIS CONTENT USEFUL, PLEASE LEAVE AN UPVOTE, COMMENT, AND/OR FOLLOW!

    This dataset is a combination of four years of Apple ($AAPL) options end of day quotes ranging from 01-2016 to 03-2023. Each row represents the information associated with one contract's strike price and a given expiration date.

    Quote dates are given in Unix and in "YYYY-MM-DD HH:MM" formats. Quote frequency is daily at 4:00 pm EST, which corresponds with end-of-day market closure.
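
    Converting between the two date representations might look like the sketch below. The description does not say which time zone the text column uses, so UTC is assumed here, and the function names and sample value are illustrative only.

```python
from datetime import datetime, timezone

TEXT_FORMAT = "%Y-%m-%d %H:%M"

def unix_to_text(seconds: int) -> str:
    """Format a Unix timestamp (seconds) as "YYYY-MM-DD HH:MM" in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(TEXT_FORMAT)

def text_to_unix(text: str) -> int:
    """Parse "YYYY-MM-DD HH:MM", interpreted as UTC, into a Unix timestamp."""
    parsed = datetime.strptime(text, TEXT_FORMAT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

# 21:00 UTC corresponds to 4:00 pm EST (UTC-5).
stamp = text_to_unix("2016-01-04 21:00")
print(stamp, unix_to_text(stamp))
```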

    REMEMBER: Apple stock split on August 28, 2020. This will be reflected in the data. Keep this in mind!

    What is an option chain?

    An option chain can be defined as the listing of all option contracts. It comes with two different sections: call and put. A call option means a contract that gives you the right but does not give you the obligation to buy an underlying asset at a particular price and within the option's expiration date. This means that in this dataset, there will be the entire option chain (all available option contracts for all expirations) for each business day between Q1 2016 and Q1 2023.

    This dataset contains data for American options, which can be exercised on or before expiration date. This is unlike European options contracts, which can only be exercised on the expiration date.
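
    The right-but-not-obligation idea can be made concrete with the standard exercise-value formulas. This is a generic sketch, not something derived from the dataset's columns: it ignores the premium paid for the contract, and the put formula is the standard counterpart that the description above does not spell out.

```python
def call_exercise_value(spot: float, strike: float) -> float:
    """Value of exercising a call: buy at the strike an asset worth the spot price."""
    return max(spot - strike, 0.0)

def put_exercise_value(spot: float, strike: float) -> float:
    """Value of exercising a put: sell at the strike an asset worth the spot price."""
    return max(strike - spot, 0.0)

# Hypothetical prices: the holder exercises only when doing so is worth something.
print(call_exercise_value(150.0, 140.0), put_exercise_value(150.0, 140.0))
```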

    I am also continuously working on the associated notebook to give a basic idea of how to load and explore the data. Stay tuned!

    Similar Datasets:
    - $TSLA Option Chains
    - $SPY Option Chains
    - $NVDA Option Chains
    - $QQQ Option Chains

  18. Historical Futures Trade and Quote Data (Europe, China, USA & Canada...

    • datarade.ai
    Updated May 1, 2021
    Cite
    Olsen Data (2021). Historical Futures Trade and Quote Data (Europe, China, USA & Canada covered)⎢Olsen Data [Dataset]. https://datarade.ai/data-products/historical-futures-trade-and-quote-data-olsen-data
    Dataset updated
    May 1, 2021
    Dataset provided by
    Olsen Ltd.
    Authors
    Olsen Data
    Area covered
    United Kingdom, China, Canada, Japan, United States
    Description

    Futures data can be ordered in full-month ranges. To control costs, it is possible to order Nearest to Expiry (NTE) data with an overlap between the expiring future and the next future in the month of expiry, or with an overlap of more than one month if needed. You can also select all active expiries if required.

    The data is available at tick level with millisecond resolution as well as at regular intervals of 1 Min, 5 Min and so on.

    Data is priced separately for Trades (Tx) and Quotes (Qt).

    Tick-level Tx data consists of a millisecond timestamp and the trade price Tx, with an option to include the Volume field. Tick-level Qt data consists of a millisecond timestamp and the quote Qt, with a flag indicating whether it is a Bid or an Ask; the Qt size field can optionally be added.

    Regular interval data is usually supplied as one of these sets:
    • CloseTx
    • CloseBid, CloseAsk
    • OpenTx, HighTx, LowTx, CloseTx
    • OpenBid, HighBid, LowBid, CloseBid
    • OpenAsk, HighAsk, LowAsk, CloseAsk

    Additional Fields: IntervalTxVolume, CloseBidSize, CloseAskSize and some others are available if required.
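
To illustrate how tick-level trades relate to the regular-interval fields named above, here is a minimal sketch that groups (timestamp, price) ticks into fixed intervals. The function name, the input shape, and the choice to label each bar by its interval start are assumptions; the vendor's own bar construction may differ.

```python
def ohlc_bars(ticks, interval_ms=60_000):
    """Aggregate trade ticks into interval bars.

    ticks: iterable of (timestamp_ms, price) pairs, sorted by time.
    Returns a dict mapping each interval's start time (ms) to a dict
    with OpenTx, HighTx, LowTx, and CloseTx.
    """
    bars = {}
    for ts, price in ticks:
        start = ts - ts % interval_ms  # start of the interval holding this tick
        bar = bars.get(start)
        if bar is None:
            # First tick of the interval sets all four fields.
            bars[start] = {"OpenTx": price, "HighTx": price,
                           "LowTx": price, "CloseTx": price}
        else:
            bar["HighTx"] = max(bar["HighTx"], price)
            bar["LowTx"] = min(bar["LowTx"], price)
            bar["CloseTx"] = price  # last tick seen so far
    return bars
```

Intervals without any trade simply produce no bar here; whether such gaps should be filled forward is a separate decision.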

    Timestamps are by default in GMT but data can be in any Time Zone requested.

    Pricing depends on frequency and number of fields.

    Hundreds of papers in finance and economics have been written using our data since 1986, and several reputable banks and hedge funds use our data for backtesting and risk management.

  19. EvoBib: A Bibliographic Database and Quote Collection for Historical...

    • zenodo.org
    bin, csv
    Updated Feb 21, 2022
    Cite
    Johann-Mattis List (2022). EvoBib: A Bibliographic Database and Quote Collection for Historical Linguistics [Dataset]. http://doi.org/10.5281/zenodo.3699172
    Available download formats: bin, csv
    Dataset updated
    Feb 21, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johann-Mattis List
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database offers 3524 references dealing with computer-assisted language comparison in a broad sense. In addition, the database offers 5298 distinct quotes collected from 2835 references. The majority of the references in the quote database overlap with those in the bibliographic database. The quotes are organized by keywords and can be browsed with a full-text and a keyword search.

    The data (references and quotes) underlying each new release are provided here; the data can be browsed at https://digling.org/evobib/.

    If you use the database, I would appreciate it if you could cite this in your research:

    > List, Johann-Mattis (2020): EvoBib: A bibliographical database and quote collection for historical linguistics. Version 1.1.0. Jena: Max Planck Institute for the Science of Human History. URL: https://digling.org/evobib/ DOI: 10.5281/zenodo.3699172

  20. Quotation data from an embedded study at an airport organization about the...

    • data.4tu.nl
    zip
    Updated Jun 26, 2025
    Cite
    Aniek Toet (2025). Quotation data from an embedded study at an airport organization about the transformation to a multimodal transport hub [Dataset]. http://doi.org/10.4121/42897af5-71dd-4896-b083-d5862ef0f7d1.v1
    Available download formats: zip
    Dataset updated
    Jun 26, 2025
    Dataset provided by
    4TU.ResearchData
    Authors
    Aniek Toet
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Time period covered
    2021 - 2023
    Description

    Data set from an embedded study at an airport organization about its transformation towards a multimodal transport hub. The data result from (in)formal conversations, diary notes, meeting notes, and observations. This is a condensed data set, representing quotes and notes that were deemed relevant to the study's topic. The data were collected over 16 months, and the data set includes (anonymous) quotations and paraphrases, as well as condensed meaning units (the researchers' interpretation).

Cite
Timoté Vaucher; Andreas Spitz; Michele Catasta; Robert West (2023). Quotebank: A Corpus of Quotations from a Decade of News [Dataset]. http://doi.org/10.5281/zenodo.4277311

Data from: Quotebank: A Corpus of Quotations from a Decade of News

2 scholarly articles cite this dataset
Available download formats: bz2
Dataset updated
Jun 18, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Timoté Vaucher; Andreas Spitz; Michele Catasta; Robert West
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Introduction

Quotebank is a dataset of 235 million unique, speaker-attributed quotations that were extracted from 196 million English news articles (127 million containing quotations) crawled from over 377 thousand web domains (15 thousand root domains) between September 2008 and April 2020. The quotations were extracted and attributed using Quobert, a distantly and minimally supervised end-to-end, language-agnostic framework for quotation attribution.

For further details, please refer to the description below and to the original paper:

Timoté Vaucher, Andreas Spitz, Michele Catasta, and Robert West
"Quotebank: A Corpus of Quotations from a Decade of News"
Proceedings of the 14th International ACM Conference on Web Search and Data Mining (WSDM), 2021.
https://doi.org/10.1145/3437963.3441760

When using the dataset, please cite the above paper (Note that the above numbers differ from those listed in the paper, as the updated data in this repository has been computed from an expanded set of input news articles).

Dataset summary

The dataset consists of two versions:

  • Quotation-centric version (quotes-YYYY.json.bz2)
    An aggregated set of unique quotations with the most likely speaker. Each unique quotation occurs only once in this version of the data and the probabilities of the candidate speakers to which the quotation can be attributed are aggregated over all occurrences of the quotation. This version of the data is a minimal - but complete - list of attributed quotations that is aimed at users who only require quotation-speaker attributions, but no individual contexts for these quotations from the original articles.
  • Article-centric version (quotebank-YYYY.json.bz2)
    A complete set of all individual quotation mentions with associated speaker as well as the article context in which they are mentioned. This larger version contains one entry per article in the news data. Each entry contains all speakers that appear in the news article as well as the (attributed) quotations, alongside a context window surrounding the quotations.

Both versions are split into 13 files (one per year) for ease of downloading and handling.

Dataset details

The following formatting applies to both versions of the dataset:

  • All data is made available in JSON format that has been compressed using bzip2.
  • The data is split per year (i.e., there is one file for each calendar year).
  • The offsets of quotations, contexts, and speaker annotations are given in units of Penn TreeBank Tokenizer tokens.
  • Offsets are zero-based and are computed from the start of the article.
  • When pairs of offsets are provided, the end offset is non-inclusive (e.g. in Python you can call tokens[start:end] without having to do end+1).
  • The Spinn3r data from which Quotebank was extracted had been collected over the course of over a decade. During this time, the client-side code used for collecting the data changed several times, and various character-encoding-related issues led to different representations of the original text at different times. We thus divide the 12 years spanned by the Spinn3r corpus into five phases (Phases A through E). A detailed description is available on GitHub; the key takeaways are that (1) text was lowercased in Phases A, B, and C, whereas the original capitalization was maintained in Phases D and E, and that (2) non-ASCII characters are properly represented only in Phase E.
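
The formatting rules above can be sketched in a few lines of Python. This is a minimal sketch that assumes each file stores one JSON record per line (the description only states JSON compressed with bzip2, so check a file before relying on this); `iter_records` and `span_tokens` are hypothetical helper names, not part of any Quotebank tooling.

```python
import bz2
import json


def iter_records(path):
    """Stream records from a bzip2-compressed file without loading it
    fully into memory. Assumes one JSON object per line."""
    with bz2.open(path, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def span_tokens(tokens, start, end):
    """Return the tokens covered by a zero-based (start, end) offset pair.
    The end offset is non-inclusive, so a plain slice is enough."""
    return tokens[start:end]
```

Streaming matters here because the yearly files are large; decompressing a whole file into memory is rarely necessary for a single pass.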


Version 1: Quotation-centric data

In this version of the dataset, the quotations are aggregated across all their occurrences in the news article data, and assigned a probability for each speaker candidate. We consider two quotations to be equivalent and suitable for aggregation if they are identical after lower-casing and removing punctuation.
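
The equivalence rule can be approximated as a normalization key, as in the minimal sketch below. It strips only ASCII punctuation and also collapses whitespace; both choices are assumptions, since the exact rule used to build Quotebank is not spelled out here, and `quotation_key` is a hypothetical name.

```python
import string

# Translation table that deletes ASCII punctuation (an assumption: the
# original pipeline may treat non-ASCII punctuation differently).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def quotation_key(text: str) -> str:
    """Lower-case, drop punctuation, and collapse whitespace so that
    equivalent quotations map to the same key."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())
```

Two quotations would then be grouped together whenever their keys are equal.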

Quotation-centric data
 |-- quoteID: Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
 |-- quotation: Text of the longest encountered original form of the quotation
 |-- date: Earliest occurrence date of any version of the quotation
 |-- phase: Corresponding phase of the data in which the quotation first occurred (A-E)
 |-- probas: Array representing the probabilities of each speaker having uttered the quotation.
   The probabilities across different occurrences of the same quotation are summed for
   each distinct candidate speaker and then normalized
   |-- proba: Probability for a given speaker
   |-- speaker: Most frequent surface form for a given speaker in the articles where the quotation occurred
 |-- speaker: Selected most likely speaker. This matches the first speaker entry in `probas`
 |-- qids: Wikidata IDs of all aliases that match the selected speaker
 |-- numOccurrences: Number of times this quotation occurs in the articles
 |-- urls: List of links to the original articles containing the quotation 
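
The sum-then-normalize step described for `probas` can be sketched as follows. The input shape (a list of per-occurrence lists of `(speaker, proba)` pairs) and the function name are assumptions made for illustration; this is not the code used to build the dataset.

```python
from collections import defaultdict


def aggregate_probas(occurrences):
    """Sum each candidate speaker's probability over all occurrences of a
    quotation, normalize the sums to 1, and rank by probability."""
    totals = defaultdict(float)
    for probas in occurrences:
        for speaker, proba in probas:
            totals[speaker] += proba
    norm = sum(totals.values())
    if not norm:
        return []
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(speaker, total / norm) for speaker, total in ranked]
```

The first entry of the returned list plays the role of the `speaker` field, that is, the selected most likely speaker.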

Note that for some speakers there can be more than one Wikidata ID in the `qids` field. To access Wikidata information about those speakers it is necessary to disambiguate them, i.e., select one of the listed Wikidata IDs that most likely corresponds to the respective speaker. Speaker disambiguation can be done using scripts available in the quotebank-toolkit repository. Additionally, the repository contains useful scripts for cleaning and enriching Quotebank.

Version 2: Article-centric data

In this data set, individual quotations are not aggregated. For each article, one JSON entry contains all speakers that appear in the news article, the (attributed) quotations, and the text within a context window surrounding each of the quotations.

Article-centric data
 |-- articleID: Primary key
 |-- articleLength: Length of the article in PTB tokens
 |-- date: Publication date of the article
 |-- phase: Corresponding phase in which the article appeared (A-E)
 |-- title: Title of the article
 |-- url: Link to the original article
 |-- names: List of all extracted speakers that occur in the article
   |-- name: Surface form of the first occurrence of each speaker in the article
   |-- ids: List of Wikidata IDs that have `name` as a possible alias
   |-- offsets: List of pairs of start/end offset, signifying positions at which the speaker occurs in the article (full and partial mention of the speaker)
 |-- quotations: List of all the quotations that appear in the article
   |-- quoteID: Foreign key of the quotation (from the quotation-centric dataset)
   |-- quotation: Text of the quotation as it occurs in this article
   |-- quotationOffset: Index where the quotation starts in the article
   |-- leftContext: Text in the left context window of the quotation (used for the attribution)
   |-- rightContext: Text in the right context window (used for the attribution)
   |-- globalProbas: Array representing the probabilities of each speaker having uttered the quote *at the aggregated level*. Same as `probas` for a given `quoteID`
   |-- globalTopSpeaker: Most probable speaker *at the aggregated level*. Same as `speaker` for a given `quoteID` 
   |-- localProbas: Array representing the probabilities of each speaker having said the quote *given this article context*.
      |-- proba: Probability for a given speaker
      |-- speaker: Name of the speaker as it first occurs in this article
   |-- localTopSpeaker: Selected speaker. Same name as the first entry in `localProbas`
   |-- numOccurrences: Number of times this quotation occurs in any article 
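
One use of the article-centric fields is to find quotations whose attribution in a single article differs from the aggregated attribution. The sketch below assumes an article record has already been parsed into a Python dict with the field names listed above; the function name is hypothetical. The comparison is a naive case-insensitive string match, so two surface forms of the same person can still be reported as a disagreement.

```python
def attribution_disagreements(article):
    """Return (quoteID, localTopSpeaker, globalTopSpeaker) for each
    quotation whose local and aggregated top speakers differ."""
    disagreements = []
    for quote in article.get("quotations", []):
        local = quote.get("localTopSpeaker")
        global_ = quote.get("globalTopSpeaker")
        # casefold() evens out capitalization differences between phases.
        if (local or "").casefold() != (global_ or "").casefold():
            disagreements.append((quote.get("quoteID"), local, global_))
    return disagreements
```

A more careful comparison would resolve both names to Wikidata IDs first, for example via the `ids` lists under `names`.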

Code repository

The code of Quobert that was used for the extraction and attribution of this data set is available and managed in a GitHub repository.
