14 datasets found
  1. Data from: SEC Filings

    • kaggle.com
    zip
    Updated Jun 5, 2020
    Google BigQuery (2020). SEC Filings [Dataset]. https://www.kaggle.com/datasets/bigquery/sec-filings
    Available download formats
    zip (0 bytes)
    Dataset updated
    Jun 5, 2020
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    Description

    In the U.S., public companies, certain insiders, and broker-dealers are required to file regularly with the SEC. The SEC makes this data available online for anybody to view and use via its Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database. The data is updated every quarter and goes back to January 2009. For more information please see this site.

    To aid analysis a quick summary view of the data has been created that is not available in the original dataset. The quick summary view pulls together signals into a single table that otherwise would have to be joined from multiple tables and enables a more streamlined user experience.

    DISCLAIMER: The Financial Statement and Notes Data Sets contain information derived from structured data filed with the Commission by individual registrants as well as Commission-generated filing identifiers. Because the data sets are derived from information provided by individual registrants, we cannot guarantee the accuracy of the data sets. In addition, it is possible inaccuracies or other errors were introduced into the data sets during the process of extracting the data and compiling the data sets. Finally, the data sets do not reflect all available information, including certain metadata associated with Commission filings. The data sets are intended to assist the public in analyzing data contained in Commission filings; however, they are not a substitute for such filings. Investors should review the full Commission filings before making any investment decision.

  2. EDGAR Filings

    • redivis.com
    application/jsonl +7
    Updated Jun 29, 2025
    Stanford Graduate School of Business Library (2025). EDGAR Filings [Dataset]. https://redivis.com/datasets/dq12-4q4st0kjt
    Available download formats
    arrow, spss, stata, parquet, csv, sas, avro, application/jsonl
    Dataset updated
    Jun 29, 2025
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Graduate School of Business Library
    Time period covered
    Feb 11, 1993 - Jun 27, 2025
    Description

    Abstract

    This dataset reflects the current (updated weekly) set of EDGAR filings available on the Yens at /zfs/data/NODR/EDGAR_HTTPS/edgar/.

    Methodology

    A script is run on a weekly basis that pulls the most recent indices of EDGAR filings from this link, downloads new filings to /zfs/data/NODR/EDGAR_HTTPS/edgar/ on the Yens, and then updates the table in this dataset with those filings. You can use the filepath column to access a specific filing on the Yens.

    Note that in order to use filings on the Yens, you will need to have access to the Yens either as a member of the Stanford GSB research community or as a sponsored collaborator.

    Usage

    You may use this dataset to filter through the universe of EDGAR filings by CIK, company name, filing date, etc. and then compile a list of filings that you would like to use on the Yens.

  3. Layline institutional holding reports

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Sep 25, 2024
    Balogh, Attila (2024). Layline institutional holding reports [Dataset]. http://doi.org/10.7910/DVN/TZM1QT
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Balogh, Attila
    Time period covered
    Jan 1, 2013 - Sep 25, 2024
    Description

    This dataset captures the quarterly investment holdings of institutional investment managers and maps the ownership structure of public firms. These Schedule 13F reports are submitted to the Securities and Exchange Commission quarterly by all institutional investment managers with at least $100 million in assets under management. Most academic research examining the common ownership of corporations and the portfolio holdings of large investment managers is based on proprietary commercial databases. This hinders the replication of prior work due to unequal access to these subscriptions and because the data manipulation steps in commercial databases are often opaque. To overcome these limitations, the presented dataset is created from the original regulatory filings; it is updated daily and includes all information reported by investment managers without alteration. Daily updates: https://dx.doi.org/10.34740/kaggle/ds/2973565

  4. US Company Filings Database

    • lseg.com
    Updated Feb 3, 2025
    LSEG (2025). US Company Filings Database [Dataset]. https://www.lseg.com/en/data-analytics/financial-data/filings/company-filings-database
    Available download formats
    csv, html, json, pdf, python, text, user interface, xml
    Dataset updated
    Feb 3, 2025
    Dataset provided by
    London Stock Exchange Group (http://www.londonstockexchangegroup.com/)
    Authors
    LSEG
    License

    https://www.lseg.com/en/policies/website-disclaimer

    Area covered
    United States
    Description

    Browse LSEG's US Company Filings Database, and find a range of filings content and history including annual reports, municipal bonds, and more.

  5. PUDL Raw U.S. Securities and Exchange Commission Form 10-K

    • zenodo.org
    • data.niaid.nih.gov
    bin, json
    Updated Apr 6, 2025
    Catalyst Cooperative (2025). PUDL Raw U.S. Securities and Exchange Commission Form 10-K [Dataset]. http://doi.org/10.5281/zenodo.15161694
    Available download formats
    bin, json
    Dataset updated
    Apr 6, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Catalyst Cooperative
    Area covered
    United States
    Description

    The SEC Form 10-K is an annual report required by the U.S. Securities and Exchange Commission (SEC) that gives a comprehensive summary of a company's financial performance. The full contents of the SEC 10-K are available through the SEC's EDGAR database. PUDL integrates only some of the 10-K metadata and data extracted from the unstructured Exhibit 21 attachment, which describes the ownership relationships between the parent company and its subsidiaries. This data is used to create a linkage between EIA utilities and SEC reporting companies, to better understand the relationships between utilities and their affiliates, and the resulting economic and political impacts. This data was originally downloaded from the SEC and processed using a machine learning pipeline found here: https://github.com/catalyst-cooperative/mozilla-sec-eia

    Archived from https://www.sec.gov/search-filings/edgar-application-programming-interfaces

    This archive contains raw input data for the Public Utility Data Liberation (PUDL) software developed by Catalyst Cooperative. It is organized into Frictionless Data Packages (https://specs.frictionlessdata.io/data-package/). For additional information about this data and PUDL, see the following resources:

  6. Embeddings of Item 1 from 10-K Filings (1994-2022) using MPNet

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Sep 25, 2024
    Majzoubi, Majid (2024). Embeddings of Item 1 from 10-K Filings (1994-2022) using MPNet [Dataset]. http://doi.org/10.7910/DVN/Y79OCK
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Majzoubi, Majid
    Description

    This dataset contains sentence embeddings generated from Item 1 (Business Description) of 10-K filings submitted to the SEC between 1994 and 2022. The filings' headers were obtained from the SEC EDGAR database via WRDS (Wharton Research Data Services). The text from each Item 1 was extracted, cleaned to remove tables, images, headers, footers and comments, and then tokenized into sentences using NLTK's Punkt tokenizer. The sentences were then embedded using the all-MPNet-base-v2 model from Sentence Transformers, and the embeddings for each filing were averaged to create a single embedding vector per filing.

    The dataset includes the following fields for each filing:

    • file_id: A unique identifier for the filing
    • embedding: The averaged sentence embedding for the Item 1 text
    • n_words: The number of words in the Item 1 text

    This dataset was created for a research paper examining the impact of business description similarity on analyst coverage and investment recommendations. It allows similarity comparisons between companies based on the semantic content of their business descriptions.

    An additional index file in CSV format is provided to map the file_id to the following filing metadata:

    • fdate: Filing date
    • cik: Central Index Key (CIK) of the filing company
    • url: URL of the filing on the SEC EDGAR database
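The averaging step and the similarity comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy three-dimensional vectors stand in for real all-MPNet-base-v2 sentence embeddings, the function names are hypothetical, and cosine similarity is assumed as the comparison metric (the description does not say which metric the authors used).

```python
import math


def average_embedding(sentence_vectors):
    """Average equal-length sentence vectors into one vector per filing."""
    if not sentence_vectors:
        raise ValueError("need at least one sentence vector")
    dim = len(sentence_vectors[0])
    count = len(sentence_vectors)
    return [sum(vec[i] for vec in sentence_vectors) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy stand-ins for sentence embeddings (assumption: real vectors are much longer).
filing_a = average_embedding([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
filing_b = average_embedding([[1.0, 1.0, 0.0]])
similarity = cosine_similarity(filing_a, filing_b)
```

With the published data, the `embedding` field would replace the toy vectors, and the index file would supply the `cik` and `fdate` needed to label each pair being compared.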

  7. Filings

    • lseg.com
    Updated Mar 27, 2020
    LSEG (2020). Filings [Dataset]. https://www.lseg.com/en/data-analytics/financial-data/filings
    Dataset updated
    Mar 27, 2020
    Dataset provided by
    London Stock Exchange Group (http://www.londonstockexchangegroup.com/)
    Authors
    LSEG
    License

    https://www.lseg.com/en/policies/website-disclaimer

    Description

    LSEG global Filings offers extensive coverage of developed and emerging markets, updated in real time. Discover the data.

  8. Insider Transactions Data Sets

    • catalog.data.gov
    Updated Apr 8, 2025
    Structured Disclosure (2025). Insider Transactions Data Sets [Dataset]. https://catalog.data.gov/dataset/insider-transactions-data-sets
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    Structured Disclosure
    Description

    Under Section 16 of the Securities Exchange Act of 1934, senior executives, directors, and large-block shareholders are required to make ongoing filings about their company stock holdings to report any changes. These filings are made on Form 3, Form 4, and Form 5 and submitted to the SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system.

  9. E-NER Dataset

    • paperswithcode.com
    Ting Wai Terence Au; Ingemar J. Cox; Vasileios Lampos, E-NER Dataset [Dataset]. https://paperswithcode.com/dataset/e-ner
    Authors
    Ting Wai Terence Au; Ingemar J. Cox; Vasileios Lampos
    Description

    E-NER is a publicly available legal Named Entity Recognition (NER) data set. It contains 52 filings from the US SEC EDGAR database. The named entity tags are hand annotated.

  10. Common Ownership Data: Scraped SEC form 13F filings for 1999-2017

    • dataverse.harvard.edu
    Updated Aug 17, 2020
    Matthew Backus; Christopher T Conlon; Michael Sinkinson (2020). Common Ownership Data: Scraped SEC form 13F filings for 1999-2017 [Dataset]. http://doi.org/10.7910/DVN/ZRH3EU
    Dataset updated
    Aug 17, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Matthew Backus; Christopher T Conlon; Michael Sinkinson
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Time period covered
    Jan 1, 1999 - Dec 31, 2017
    Description

    Introduction

    In the course of researching the common ownership hypothesis, we found a number of issues with the Thomson Reuters (TR) "S34" dataset used by many researchers and frequently accessed via Wharton Research Data Services (WRDS). WRDS has done extensive work to improve the database, working with other researchers that have uncovered problems, specifically fixing a lack of records of BlackRock holdings. However, even with the updated dataset posted in the summer of 2018, we discovered a number of discrepancies when accessing data for constituent firms of the S&P 500 Index. We therefore set out to separately create a dataset of 13(f) holdings from the source documents, which are all public and available electronically from the Securities and Exchange Commission (SEC) website. Coverage is good starting in 1999, when electronic filing became mandatory. However, the SEC's Inspector General issued a critical report in 2010 about the information contained in 13(f) filings.

    The process

    We gathered all 13(f) filings from 1999-2017 here. The corpus is over 318,000 filings and occupies ~25GB of space if unzipped. (We do not include the raw filings here as they can be downloaded from EDGAR). We wrote code to parse the filings to extract holding information using regular expressions in Perl. Our target list of holdings was all public firms with a market capitalization of at least $10M. From the header of the file, we first extract the filing date, reporting date, and reporting entity (Central Index Key, or CIK, and CIKNAME). Beginning with the September 30 2013 filing date, all filings were in XML format, which made parsing fairly straightforward, as all values are contained in tags. Prior to that date, the filings are remarkable for the heterogeneity in formatting. Several examples are linked to below.

    Our approach was to look for any lines containing a CUSIP code that we were interested in, and then attempting to determine the "number of shares" field and the "value" field. To help validate the values we extracted, we downloaded stock price data from CRSP for the filing date, as that allows for a logic check of (price * shares) = value. We do not claim that this will exhaustively extract all holding information. We can provide examples of filings that are formatted in such a way that we are not able to extract the relevant information. In both XML and non-XML filings, we attempt to remove any derivative holdings by looking for phrases such as OPT, CALL, PUT, WARR, etc. We then perform some final data cleaning: in the case of amended filings, we keep an amended level of holdings if the amended report a) occurred within 90 days of the reporting date and b) the initial filing fails our logic check described above.

    The resulting dataset has around 48M reported holdings (CIK-CUSIP) for all 76 quarters and between 4,000 and 7,000 CUSIPs and between 1,000 and 4,000 investors per quarter. We do not claim that our dataset is perfect; there are undoubtedly errors. As documented elsewhere, there are often errors in the actual source documents as well. However, our method seemed to produce more reliable data in several cases than the TR dataset, as shown in Online Appendix B of the related paper linked above.

    Included Files

    • Perl Parsing Code (find_holdings_snp.pl). For reference, only needed if you wish to re-parse original filings.
    • Investor holdings for 1999-2017: lightly cleaned. Each CIK-CUSIP-rdate is unique. Over 47M records. The fields are:
        CIK: the central index key assigned by the SEC for this investor. Mapping to names is available below.
        CUSIP: the identity of the holdings. Consult the SEC's 13(f) listings to identify your CUSIPs of interest.
        shares: the number of shares reportedly held. Merging in CRSP data on shares outstanding at the CUSIP-Month level allows one to construct \beta. We make no distinction for the sole/shared/none voting discretion fields. If a researcher is interested, we did collect that starting in mid-2013, when filings are in XML format.
        rdate: reporting date (end of quarter). 8 digit, YYYYMMDD.
        fdate: filing date. 8 digit, YYYYMMDD.
        ftype: the form name.

    Notes

    We did not consolidate separate BlackRock entities (or any other possibly related entities). If one wants to do so, use the CIK-CIKname mapping file below. We drop any CUSIP-rdate observation where any investor in that CUSIP reports owning greater than 50% of shares outstanding (even though legitimate cases exist - see, for example, Diamond Offshore and Loews Corporation). We also drop any CUSIP-rdate observation where greater than 120% of shares outstanding are reported to be held by 13(f) investors. Cases where the shares held are listed as zero likely mean the investor filing lists a holding for the firm but that our code could not find the number of shares due to the formatting of the file. We leave these in the data so that any researchers that find a zero know to go back to that source filing to manually gather the...
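The (price * shares) = value logic check and the two drop rules (any single investor above 50% of shares outstanding, or total reported holdings above 120%) might look roughly like the sketch below. It is written in Python for illustration, whereas the authors' parser is in Perl. The function names, the tuple layout, and the 5% tolerance are all assumptions; the description does not state how close the match must be, and it assumes price and value are expressed in the same units.

```python
from collections import defaultdict


def passes_logic_check(price, shares, reported_value, tolerance=0.05):
    """True when price * shares is within a relative tolerance of the reported value.

    The tolerance is a hypothetical parameter, not taken from the dataset description.
    """
    implied = price * shares
    if reported_value == 0:
        return implied == 0
    return abs(implied - reported_value) / abs(reported_value) <= tolerance


def drop_suspect_observations(holdings, shares_outstanding):
    """Drop every row of a CUSIP-rdate pair that violates either ownership rule.

    holdings: list of (cik, cusip, rdate, shares) tuples.
    shares_outstanding: dict mapping (cusip, rdate) to shares outstanding.
    """
    totals = defaultdict(float)
    largest = defaultdict(float)
    for _cik, cusip, rdate, shares in holdings:
        key = (cusip, rdate)
        totals[key] += shares
        largest[key] = max(largest[key], shares)

    kept = []
    for row in holdings:
        key = (row[1], row[2])
        outstanding = shares_outstanding[key]
        if largest[key] > 0.5 * outstanding:
            continue  # one investor reports more than 50% of shares outstanding
        if totals[key] > 1.2 * outstanding:
            continue  # 13(f) investors together report more than 120%
        kept.append(row)
    return kept


holdings = [
    ("A", "X", "20171231", 100),
    ("B", "X", "20171231", 200),
    ("A", "Y", "20171231", 600),
    ("B", "Y", "20171231", 100),
]
outstanding = {("X", "20171231"): 1000, ("Y", "20171231"): 1000}
kept = drop_suspect_observations(holdings, outstanding)
```

In this toy input, both rows for CUSIP "Y" are dropped because investor "A" reports 600 of 1,000 shares, which mirrors the observation-level (not row-level) drop described in the notes.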

  11. Data from: Deft

    • live.european-language-grid.eu
    txt
    Deft [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/4982
    Available download formats
    txt
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    The dataset contains annotated content from two different data sources: 1) 2,443 sentences from various 2017 SEC contract filings from the publicly available US Securities and Exchange Commission (SEC) EDGAR database, and 2) 21,303 sentences from open source textbooks including topics in biology, history, physics, psychology, economics, sociology, and government.

  12. Summarized_10K-MDA

    • huggingface.co
    Updated Feb 11, 2025
    I-Chan Chiu (2025). Summarized_10K-MDA [Dataset]. https://huggingface.co/datasets/ichanchiu/Summarized_10K-MDA
    Dataset updated
    Feb 11, 2025
    Authors
    I-Chan Chiu
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Summarized 10-K MD&A

      Dataset Description
    

    The Summarized 10-K MD&A dataset provides concise, machine-generated summaries of 10-K filings for publicly traded companies. These filings are sourced from the SEC EDGAR database, and the dataset is designed to facilitate financial text analysis, such as summarization, sentiment analysis, and financial disclosure studies.

      Key Features

    Language: English
    Dataset Size: 98,100 rows
    License: MIT License
    Source: SEC EDGAR…

    See the full description on the dataset page: https://huggingface.co/datasets/ichanchiu/Summarized_10K-MDA.

  13. 21st Century Corporate Financial Fraud, United States, 2005-2010

    • catalog.data.gov
    • icpsr.umich.edu
    Updated Mar 12, 2025
    National Institute of Justice (2025). 21st Century Corporate Financial Fraud, United States, 2005-2010 [Dataset]. https://catalog.data.gov/dataset/21st-century-corporate-financial-fraud-united-states-2005-2010-22a9e
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    National Institute of Justice (http://nij.ojp.gov/)
    Area covered
    United States
    Description

    The Corporate Financial Fraud project is a study of company and top-executive characteristics of firms that ultimately violated Securities and Exchange Commission (SEC) financial accounting and securities fraud provisions compared to a sample of public companies that did not. The fraud firm sample was identified through systematic review of SEC accounting enforcement releases from 2005-2010, which included administrative and civil actions, and referrals for criminal prosecution that were identified through mentions in enforcement release, indictments, and news searches. The non-fraud firms were randomly selected from among nearly 10,000 US public companies censused and active during at least one year between 2005-2010 in Standard and Poor's Compustat data. The Company and Top-Executive (CEO) databases combine information from numerous publicly available sources, many in raw form that were hand-coded (e.g., for fraud firms: Accounting and Auditing Enforcement Releases (AAER) enforcement releases, investigation summaries, SEC-filed complaints, litigation proceedings and case outcomes). Financial and structural information on companies for the year leading up to the financial fraud (or around year 2000 for non-fraud firms) was collected from Compustat financial statement data on Form 10-Ks, and supplemented by hand-collected data from original company 10-Ks, proxy statements, or other financial reports accessed via Electronic Data Gathering, Analysis, and Retrieval (EDGAR), SEC's data-gathering search tool. For CEOs, data on personal background characteristics were collected from Execucomp and BoardEx databases, supplemented by hand-collection from proxy-statement biographies.

  14. MarkupMnA

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 15, 2023
    Sukrit Rao; Pranab Islam; Rohith Bollineni; Shaan Khosla; Tingyi Fei; Qian Wu; Kyunghyun Cho; Vladimir Kobzar (2023). MarkupMnA [Dataset]. http://doi.org/10.5281/zenodo.8034853
    Available download formats
    zip
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sukrit Rao; Pranab Islam; Rohith Bollineni; Shaan Khosla; Tingyi Fei; Qian Wu; Kyunghyun Cho; Vladimir Kobzar
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The MarkupMnA dataset is a corpus of 151 merger and acquisition agreements with annotated section titles, section numbers, page numbers, and more, based on HTML filings by US public companies retrieved from the SEC EDGAR database. We consider the task of section title annotation as a sequence labeling task, and to that end, use the BEIOS tagging scheme when generating our annotations. There are over 70,000 labels in the entire dataset excluding outside labels and over 465,000 labels including outside labels.

    We add annotations to the contracts in an already widely used dataset, MAUD, which is an expert-annotated reading comprehension dataset. The broad objective of our work is to make progress toward developing computationally efficient hierarchical representations of long documents, specifically for legal contracts. We hope that our annotations can be used in conjunction with MAUD to advance legal NLP research.
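For readers unfamiliar with BEIOS tagging, the sketch below shows how labeled token spans could be turned into per-token tags: B for the first token of a multi-token span, I for interior tokens, E for the last token, S for a single-token span, and O for tokens outside any span. The function name, the sample tokens, and the bare one-letter tags are assumptions for illustration; the dataset's actual label names and tokenization are not given in the description.

```python
def beios_tags(num_tokens, spans):
    """Return one BEIOS tag per token.

    spans: non-overlapping (start, end) token index pairs, end exclusive.
    """
    tags = ["O"] * num_tokens
    for start, end in spans:
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"invalid span: ({start}, {end})")
        if end - start == 1:
            tags[start] = "S"  # single-token span
        else:
            tags[start] = "B"  # beginning of span
            for i in range(start + 1, end - 1):
                tags[i] = "I"  # inside span
            tags[end - 1] = "E"  # end of span
    return tags


# Hypothetical tokens: a four-token section title, then a one-token span.
tokens = ["Section", "1.1", "Certain", "Definitions", ".", "Closing", "means"]
tags = beios_tags(len(tokens), [(0, 4), (5, 6)])
```

Counting tags other than "O" in such output corresponds to the "excluding outside labels" figure quoted in the description.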

