17 datasets found

Data from: SEC Filings
kaggle.com
zip
Updated Jun 5, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2020). SEC Filings [Dataset]. https://www.kaggle.com/datasets/bigquery/sec-filings
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Jun 5, 2020
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
Description
In the U.S. public companies, certain insiders and broker-dealers are required to regularly file with the SEC. The SEC makes this data available online for anybody to view and use via their Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database. The SEC updates this data every quarter going back to January, 2009. For more information please see this site.

To aid analysis a quick summary view of the data has been created that is not available in the original dataset. The quick summary view pulls together signals into a single table that otherwise would have to be joined from multiple tables and enables a more streamlined user experience.

DISCLAIMER: The Financial Statement and Notes Data Sets contain information derived from structured data filed with the Commission by individual registrants as well as Commission-generated filing identifiers. Because the data sets are derived from information provided by individual registrants, we cannot guarantee the accuracy of the data sets. In addition, it is possible inaccuracies or other errors were introduced into the data sets during the process of extracting the data and compiling the data sets. Finally, the data sets do not reflect all available information, including certain metadata associated with Commission filings. The data sets are intended to assist the public in analyzing data contained in Commission filings; however, they are not a substitute for such filings. Investors should review the full Commission filings before making any investment decision.
d
Layline institutional holding reports
search.dataone.org
dataverse.harvard.edu
+2more
Updated Oct 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Balogh, Attila (2025). Layline institutional holding reports [Dataset]. http://doi.org/10.7910/DVN/TZM1QT
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/TZM1QT
Dataset updated
Oct 29, 2025
Dataset provided by
Harvard Dataverse
Authors
Balogh, Attila
Time period covered
Jan 1, 2013 - Sep 25, 2024
Description
This dataset captures the quarterly investment holdings of institutional investment managers and maps the ownership structure of public firms. These Schedule 13F reports are submitted to the Securities and Exchange Commission quarterly by all institutional investment managers with at least $100 million in assets under management. Most academic research examining the common ownership of corporations and the portfolio holdings of large investment managers is based on proprietary commercial databases. This hinders the replication of prior work due to unequal access to these subscriptions and because the data manipulation steps in commercial databases are often opaque. To overcome these limitations, the presented dataset is created from the original regulatory filings; it is updated daily and includes all information reported by investment managers without alteration. Daily updates: https://dx.doi.org/10.34740/kaggle/ds/2973565
SEC (EDGAR) Company Names & CIK Keys
kaggle.com
zip
Updated Nov 5, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Caesar Lupum (2019). SEC (EDGAR) Company Names & CIK Keys [Dataset]. https://www.kaggle.com/caesarlupum/gov-names
Explore at:
zip(108272 bytes)Available download formats
Dataset updated
Nov 5, 2019
Authors
Caesar Lupum
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

The 󾓦 U.S. Securities and Exchange Commission's HTTPS file system allows comprehensive access to the SEC's EDGAR (Electronic Data Gathering, Analysis, and Retrieval system) filings by corporations, funds, and individuals. These filings are disseminated to the public through the EDGAR Dissemination Service, currently operated under contract to Attain, LLC, which markets data directly to subscribers. The dissemination stream also populates the EDGAR public database on sec.gov, which can be researched through a variety of EDGAR public searches.

After early testing in 1992-93, beginning with 450 voluntary filers, companies began filing through EDGAR in 1994-95 with various phase-in periods for different form types. See Electronic Filing and the EDGAR System: A Regulatory Overview for more historical details.

Content

All companies in the SEC EDGAR database. Companies are listed with company name and their unique CIK key. Data-set had 663000 companies listed, which includes all companies in the EDGAR database.

The EDGAR indexes list the following information for each filing: Company Name, Form Type, CIK (Central Index Key), Date Filed, and File Name (including folder path).

Four types of indexes are available. The company, form, and master indexes contain the same information sorted differently.

company — sorted by company name

form — sorted by form type

master — sorted by CIK number

XBRL — list of submissions containing XBRL financial files, sorted by CIK number; these include Voluntary Filer Program submissions

Initial Kernel

Acknowledgements

Contact Information Filer Support Contact List EDGAR Dissemination System: Contact Information — for subscribing to the EDGAR dissemination feed For general SEC Contact information, including electronic mailboxes, telephone numbers, and mailing addresses, see Contact Information. webmaster@sec.gov — connection issues and general questions; see also Webmaster FAQ structureddata@sec.gov — questions about taxonomies, structured data (e.g., XBRL; XML; FpML; FIX), and the DERA Data Library https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm

Inspiration

The mission of the U.S. Securities and Exchange Commission is to protect investors, maintain fair, orderly, and efficient markets, and facilitate capital formation.
Financial Statement Data Sets
kaggle.com
zip
Updated Nov 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vadim Vanak (2025). Financial Statement Data Sets [Dataset]. https://www.kaggle.com/datasets/vadimvanak/company-facts-2/suggestions
Explore at:
zip(287789225 bytes)Available download formats
Dataset updated
Nov 14, 2025
Authors
Vadim Vanak
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset offers a detailed collection of US-GAAP financial data extracted from the financial statements of exchange-listed U.S. companies, as submitted to the U.S. Securities and Exchange Commission (SEC) via the EDGAR database. Covering filings from January 2009 onwards, this dataset provides key financial figures reported by companies in accordance with U.S. Generally Accepted Accounting Principles (GAAP).

Dataset Features:

Data Scope: The dataset is restricted to figures reported under US-GAAP standards, with the exception of EntityCommonStockSharesOutstanding and EntityPublicFloat.

Currency and Units: The dataset exclusively includes figures reported in USD or shares, ensuring uniformity and comparability. It excludes ratios and non-financial metrics to maintain focus on financial data.

Company Selection: The dataset is limited to companies with U.S. exchange tickers, providing a concentrated analysis of publicly traded firms within the United States.

Submission Types: The dataset only incorporates data from 10-Q, 10-K, 10-Q/A, and 10-K/A filings, ensuring consistency in the type of financial reports analyzed.

Data Sources and Extraction:

This dataset primarily relies on the SEC's Financial Statement Data Sets and EDGAR APIs: - SEC Financial Statement Data Sets - EDGAR Application Programming Interfaces

In instances where specific figures were missing from these sources, data was directly extracted from the companies' financial statements to ensure completeness.

Please note that the dataset presents financial figures exactly as reported by the companies, which may occasionally include errors. A common issue involves incorrect reporting of scaling factors in the XBRL format. XBRL supports two tag attributes related to scaling: 'decimals' and 'scale.' The 'decimals' attribute indicates the number of significant decimal places but does not affect the actual value of the figure, while the 'scale' attribute adjusts the value by a specific factor.

However, there are several instances, numbering in the thousands, where companies have incorrectly used the 'decimals' attribute (e.g., 'decimals="-6"') under the mistaken assumption that it controls scaling. This is not correct, and as a result, some figures may be inaccurately scaled. This dataset does not attempt to detect or correct such errors; it aims to reflect the data precisely as reported by the companies. A future version of the dataset may be introduced to address and correct these issues.

The source code for data extraction is available here
US Company Filings Database
lseg.com
Updated Oct 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
LSEG (2025). US Company Filings Database [Dataset]. https://www.lseg.com/en/data-analytics/financial-data/filings/company-filings-database
Explore at:
csv,html,json,pdf,python,text,user interface,xmlAvailable download formats
Dataset updated
Oct 13, 2025
Dataset provided by
London Stock Exchange Grouphttp://www.londonstockexchangegroup.com/
Authors
LSEG
License
https://www.lseg.com/en/policies/website-disclaimerhttps://www.lseg.com/en/policies/website-disclaimer
Area covered
United States
Description
Browse LSEG's US Company Filings Database, and find a range of filings content and history including annual reports, municipal bonds, and more.
Z
PUDL Raw U.S. Securities and Exchange Commission Form 10-K
data.niaid.nih.gov
zenodo.org
Updated Apr 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Catalyst Cooperative (2025). PUDL Raw U.S. Securities and Exchange Commission Form 10-K [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_15161693
Explore at:
Dataset updated
Apr 6, 2025
Dataset authored and provided by
Catalyst Cooperative
Area covered
United States
Description
The SEC Form 10-K is an annual report required by the U.S. Securities and Exchange Commission (SEC), that gives a comprehensive summary of a company's financial performance.

The full contents of the SEC 10-K are available through the SEC's EDGAR database. PUDL integrates only some of the 10-K metadata and data extracted from the unstructured Exhibit 21 attachement, which describes the ownershp relationships between the parent company and its subsidiaries. This data is used to create a linkage between EIA utilities and SEC reporting companies, to better understand the relationships between utlities and their affiliates, and the resulting economic and political impacts.

This data was originally downloaded from the SEC and processed using a machine learning pipeline found here: https://github.com/catalyst-cooperative/mozilla-sec-eia Archived from https://www.sec.gov/search-filings/edgar-application-programming-interfaces

This archive contains raw input data for the Public Utility Data Liberation (PUDL) software developed by Catalyst Cooperative. It is organized into Frictionless Data Packages. For additional information about this data and PUDL, see the following resources:

The PUDL Repository on GitHub

PUDL Documentation

Other Catalyst Cooperative data archives
SEC Form 4 Filings
kaggle.com
Updated Sep 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SecFilingApi (2025). SEC Form 4 Filings [Dataset]. https://www.kaggle.com/datasets/secfilingapi/sec-form-4-filings
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 30, 2025
Dataset provided by
Kaggle
Authors
SecFilingApi
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Historical SEC dataset containing all insider transactions (Form 4 filings). The data is public, and sourced from the SEC's EDGAR database out of their XML filings. Lightly processed for easier consumption. All Form 4 filings from Jan/20 to Jun/25

Why this exists

Form 4s are noisy to work with: amended filings, multiple insiders per transaction, and inconsistent tables. This dataset provides clean, normalized insider-transaction data with a stable schema so you can backtest signals and monitor insider activity without scraping.

What you get:

Company (name, ticker if known), CIK

Insider(s) with role (Director/Officer/10% Owner, etc.)

Trade details: transaction date, type, number of shares, price, total consideration

Holdings before/after the transaction

Update cadence: monthly (moving to daily as we scale). Source: U.S. SEC EDGAR Form 4 filings.

We're building a real-time API for new filings with clean JSON endpoints and low latency. If interested, sign up to our waiting list: 👉 https://secfilingapi.com/?utm_source=kaggle&utm_medium=dataset&utm_campaign=form4

Ideas to try with the data:

Event study: top-decile insider buys vs sector ETF over 30/60/90 days

Cluster insiders by role + transaction size; test persistence

Filter for CEO buys after 20% drawdowns (momentum/contrarian mix)

Find unusual clusters (multiple insiders buying within 10 days)

Limitations & notes:

There are a few missing filings in the historical data (< 1000 total, all very old filings, before the SEC launched the XML format).

Forms 4B (amendments to form 4 - rare - are not included)

Mapping from CIK→ticker can lag for recent IPOs/SPACs.

Always verify edge cases against the EDGAR link when publishing results.

Feedback & requests welcome in the Discussion tab.

DISCLAIMER: It is possible that inaccuracies or other errors were introduced into the data sets during the process of extracting the data and compiling the data sets. The data set is intended to assist the public in analyzing data contained in Commission filings; however, they are not a substitute for such filings. Investors should review the full Commission filings before making any investment decision.
Filings
lseg.com
Updated Mar 27, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
LSEG (2020). Filings [Dataset]. https://www.lseg.com/en/data-analytics/financial-data/filings
Explore at:
Dataset updated
Mar 27, 2020
Dataset provided by
London Stock Exchange Grouphttp://www.londonstockexchangegroup.com/
Authors
LSEG
License
https://www.lseg.com/en/policies/website-disclaimerhttps://www.lseg.com/en/policies/website-disclaimer
Description
LSEG global Filings offers extensive coverage of developed and emerging markets, updated in real time. Discover the data.
d
Embeddings of Item 1 from 10-K Filings (1994-2022) using MPNet
search.dataone.org
Updated Sep 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Majzoubi, Majid (2024). Embeddings of Item 1 from 10-K Filings (1994-2022) using MPNet [Dataset]. http://doi.org/10.7910/DVN/Y79OCK
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/Y79OCK
Dataset updated
Sep 25, 2024
Dataset provided by
Harvard Dataverse
Authors
Majzoubi, Majid
Description
This dataset contains sentence embeddings generated from Item 1 (Business Description) of 10-K filings submitted to the SEC between 1994 and 2022. The filings' headers were obtained from the SEC EDGAR database via WRDS (Wharton Research Data Services). The text from each Item 1 was extracted, cleaned to remove tables, images, headers, footers and comments, and then tokenized into sentences using NLTK's Punkt tokenizer. The sentences were then embedded using the all-MPNet-base-v2 model from Sentence Transformers, and the embeddings for each filing were averaged to create a single embedding vector per filing. The dataset includes the following fields for each filing: file_id: A unique identifier for the filing embedding: The averaged sentence embedding for the Item 1 text n_words: The number of words in the Item 1 text This dataset was created for a research paper examining the impact of business description similarity on analyst coverage and investment recommendations. It allows similarity comparisons between companies based on the semantic content of their business descriptions. An additional index file in CSV format is provided to map the file_id to the following filing metadata: fdate: Filing date cik: Central Index Key (CIK) of the filing company url: URL of the filing on the SEC EDGAR database
H
Common Ownership Data: Scraped SEC form 13F filings for 1999-2017
dataverse.harvard.edu
bin, csv +3
Updated Aug 17, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Harvard Dataverse (2020). Common Ownership Data: Scraped SEC form 13F filings for 1999-2017 [Dataset]. http://doi.org/10.7910/DVN/ZRH3EU
Explore at:
txt(25964), bin(323182551), txt(14847), bin(2934960), text/x-perl-script(21999), csv(2363718396), bin(271859768), txt(3008286), txt(110929), bin(4653090), txt(303881), tsv(11192545), txt(156950), txt(196510)Available download formats
Unique identifier
https://doi.org/10.7910/DVN/ZRH3EU
Dataset updated
Aug 17, 2020
Dataset provided by
Harvard Dataverse
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time period covered
Jan 1, 1999 - Dec 31, 2017
Description
Introduction In the course of researching the common ownership hypothesis, we found a number of issues with the Thomson Reuters (TR) "S34" dataset used by many researchers and frequently accessed via Wharton Research Data Services (WRDS). WRDS has done extensive work to improve the database, working with other researchers that have uncovered problems, specifically fixing a lack of records of BlackRock holdings. However, even with the updated dataset posted in the summer of 2018, we discovered a number of discrepancies when accessing data for constituent firms of the S&P 500 Index. We therefore set out to separately create a dataset of 13(f) holdings from the source documents, which are all public and available electronically from the Securities and Exchange Commission (SEC) website. Coverage is good starting in 1999, when electronic filing became mandatory. However, the SEC's Inspector General issued a critical report in 2010 about the information contained in 13(f) filings. The process: We gathered all 13(f) filings from 1999-2017 here. The corpus is over 318,000 filings and occupies ~25GB of space if unzipped. (We do not include the raw filings here as they can be downloaded from EDGAR). We wrote code to parse the filings to extract holding information using regular expressions in Perl. Our target list of holdings was all public firms with a market capitalization of at least $10M. From the header of the file, we first extract the filing date, reporting date, and reporting entity (Central Index Key, or CIK, and CIKNAME). Beginning with the September 30 2013 filing date, all filings were in XML format, which made parsing fairly straightforward, as all values are contained in tags. Prior to that date, the filings are remarkable for the heterogeneity in formatting. Several examples are linked to below. Our approach was to look for any lines containing a CUSIP code that we were interested in, and then attempting to determine the "number of shares" field and the "value" field. To help validate the values we extracted, we downloaded stock price data from CRSP for the filing date, as that allows for a logic check of (price * shares) = value. We do not claim that this will exhaustively extract all holding information. We can provide examples of filings that are formatted in such a way that we are not able to extract the relevant information. In both XML and non-XML filings, we attempt to remove any derivative holdings by looking for phrases such as OPT, CALL, PUT, WARR, etc. We then perform some final data cleaning: in the case of amended filings, we keep an amended level of holdings if the amended report a) occurred within 90 days of the reporting date and b) the initial filing fails our logic check described above. The resulting dataset has around 48M reported holdings (CIK-CUSIP) for all 76 quarters and between 4,000 and 7,000 CUSIPs and between 1,000 and 4,000 investors per quarter. We do not claim that our dataset is perfect; there are undoubtedly errors. As documented elsewhere, there are often errors in the actual source documents as well. However, our method seemed to produce more reliable data in several cases than the TR dataset, as shown in Online Appendix B of the related paper linked above. Included Files Perl Parsing Code (find_holdings_snp.pl). For reference, only needed if you wish to re-parse original filings. Investor holdings for 1999-2017: lightly cleaned. Each CIK-CUSIP-rdate is unique. Over 47M records. The fields are CIK: the central index key assigned by the SEC for this investor. Mapping to names is available below. CUSIP: the identity of the holdings. Consult the SEC's 13(f) listings to identify your CUSIPs of interest. shares: the number of shares reportedly held. Merging in CRSP data on shares outstanding at the CUSIP-Month level allows one to construct \beta. We make no distinction for the sole/shared/none voting discretion fields. If a researcher is interested, we did collect that starting in mid-2013, when filings are in XML format. rdate: reporting date (end of quarter). 8 digit, YYYYMMDD. fdate: filing date. 8 digit, YYYYMMDD. ftype: the form name. Notes: we did not consolidate separate BlackRock entities (or any other possibly related entities). If one wants to do so, use the CIK-CIKname mapping file below. We drop any CUSIP-rdate observation where any investor in that CUSIP reports owning greater than 50% of shares outstanding (even though legitimate cases exist - see, for example, Diamond Offshore and Loews Corporation). We also drop any CUSIP-rdate observation where greater than 120% of shares outstanding are reported to be held by 13(f) investors. Cases where the shares held are listed as zero likely mean the investor filing lists a holding for the firm but that our code could not find the number of shares due to the formatting of the file. We leave these in the data so that any researchers that find a zero know to go back to that source filing to manually gather the holdings for the securities they are interested in. Processed 13f holdings (airlines.parquet, cereal.parquet, out_scrape.parquet). These are used in our related AEJ:Microeconomics paper. The files contain all firms within the airline industry, RTE cereal industry, and all large cap firms (a superset of the S&P 500) respectively. These are a merged version of the scrape_parsed.csv file described above, that include the shares outstanding and percent ownership used to calculate measures of common ownership. These are distributed as brotli compressed Apache Parquet (binary) files. This preserves date information correctly. mgrno: manager number (which is actually CIK in the scraped data) rdate: reporting date ncusip: cusip rrdate: reportaing date in stata format mgrname: manager name shares: shares sole: shares with sole authority shared: shares with shared authority none: shares with no authority isbr/isfi/iss/isba/isvg: is this blackrock, statestreet, vanguard, barclay, fidelity numowners: how many owners prc: price at reporting date shares_out: shares outstanding at reporting date value: reported value in 13(f) beta: shares/shares_out permno: permno Profit weight values (i.e. \kappa) for all firms in the sample. (public_scrape_kappas_XXXX.parquet). Each file represents one year of data and is around 200MB and distributed as a compressed (brotli) parquet file. Fields are simply CUSIP_FROM, CUSIP_TO, KAPPA, QUARTER. Note that these have not been adjusted for multi-class share firms, insider holdings, etc. If looking at a particular market, some additional data cleaning on the investor holdings (above) followed by recomputing profit weights is recommended. For this, we did merge the separate BlackRock entities prior to computing \kappa. CIKmap.csv (~250K observations) Mapping is from CIK-rdate to CIKname. Use this if you want to consolidate holdings across reporting entities or explore the identities of reporting firms. In the case of amended filings that use different names than original ones, we keep the earliest name. Example of Parsing Challenge Prior to the XML era, filings were far from uniform, which creates a notable challenge for parsing them for holdings. In the examples directory we include several example text files of raw 13f filings. Example 1 is a "well behaved" filing, with CUSIP, followed by value, followed by number of shares, as recommended by the SEC. Example 2 shows a case where the ordering is changed: CUSIP, then shares, then value. The column headers show "item 5" coming before "item 4". Example 3 shows a case of a fixed width table, which in principle could be parsed very easily using the tags at the top, although not all filings consistently use these tags. Example 4 shows a case with a fixed width table, with no tag for the CUSIP column. Also, notice that if the firm holds more than 10M shares of a firm, that number occupies the entire width of the column and there is no longer a column separator (i.e. Cisco Systems on line 374). Example 5 shows a comma-separated table format. Example 6 shows a case of changing the column ordering, but also adding an (unrequired) column for share price. Example 7 shows a case where the table is split across subsequent pages, and so the CUSIP appears on a different line than the number of shares.
E
Data from: Deft
live.european-language-grid.eu
txt
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Deft [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/4982
Explore at:
txtAvailable download formats
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Dataset contains annotated content from two different data sources: 1) 2,443 sentences from various 2017 SEC contract filings from the publicly available US Securities and Exchange Commission EDGAR (SEC) database, and 2) 21,303 sentences from open source textbooks including topics in biology, history, physics, psychology, economics, sociology, and government.
SEC 13F-HR Institutional Investment Data
kaggle.com
zip
Updated Jan 13, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aneesh (2020). SEC 13F-HR Institutional Investment Data [Dataset]. https://www.kaggle.com/datasets/aneeshpanoli/sec-13fhr-institutional-investment-data/code
Explore at:
zip(58909779 bytes)Available download formats
Dataset updated
Jan 13, 2020
Authors
Aneesh
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The dataset consists of quarterly filings of institutions scraped from the SEC EDGAR database.

Content

The Insititutions.csv contains the names of institutions and their unique ids. The stock_names.csv contains the names of companies and their unique ids.

13Fdata.csv Includes three years of quarterly filings from 2015 - 2017 for all the institutions in the SEC database.

Acknowledgements

Data courtesy: SEC

Inspiration

Interesting questions to ask: Can you pick the most promising securities to invest in based on how institutions are trading it? How about some cool dataviz?
MarkupMnA
zenodo.org
data.niaid.nih.gov
zip
Updated Jun 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sukrit Rao; Pranab Islam; Rohith Bollineni; Shaan Khosla; Tingyi Fei; Qian Wu; Kyunghyun Cho; Vladimir Kobzar; Sukrit Rao; Pranab Islam; Rohith Bollineni; Shaan Khosla; Tingyi Fei; Qian Wu; Kyunghyun Cho; Vladimir Kobzar (2023). MarkupMnA [Dataset]. http://doi.org/10.5281/zenodo.8034853
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8034853
Dataset updated
Jun 15, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sukrit Rao; Pranab Islam; Rohith Bollineni; Shaan Khosla; Tingyi Fei; Qian Wu; Kyunghyun Cho; Vladimir Kobzar; Sukrit Rao; Pranab Islam; Rohith Bollineni; Shaan Khosla; Tingyi Fei; Qian Wu; Kyunghyun Cho; Vladimir Kobzar
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The MarkupMnA dataset is a corpus of 151 merger and acquisition agreements with annotated sections titles, section numbers, page numbers, and more, based on HTML filings by US public companies retrieved from the SEC EDGAR database. We consider the task of section title annotation as a sequence labeling task, and to that end, use the BEIOS tagging scheme when generating our annotations. There are over 70,000 labels in the entire dataset excluding outside labels and over 465,000 labels including outside labels.

We add annotations to the contracts in an already widely used dataset, MAUD, which is an expert-annotated reading comprehension dataset. The broad objective of our work is to make progress toward developing computationally efficient hierarchical representations of long documents, specifically for legal contracts. We hope that our annotations can be used in conjunction with MAUD to advance legal NLP research.
H
A database for blockholders in US-listed firms including all Form 13D and...
dataverse.harvard.edu
Updated Aug 17, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jan Philipp Harries (2021). A database for blockholders in US-listed firms including all Form 13D and Form 13G filings. [Dataset]. http://doi.org/10.7910/DVN/61Z64Q
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/61Z64Q
Dataset updated
Aug 17, 2021
Dataset provided by
Harvard Dataverse
Authors
Jan Philipp Harries
License
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/61Z64Qhttps://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/61Z64Q
Description
The Dataset contains structured, parsed data from 758,666 Form 13D and Form 13G blockholder filings from November 1993 to May 2021, downloaded from SEC Edgar. The data is made available as a RAR-compressed CSV file, in which each row represents a single filing and the 76 columns contain parsed information for each filing. Please see the accompanying paper "Determinants of Blockholdership - A new Dataset for Blockholder Analysis" for more information and cite it when using the data for your research. As stated by the SEC, "Information presented on www.sec.gov is considered public information and may be copied or further distributed by users of the web site without the SEC’s permission." (see https://www.sec.gov/privacy.htm#dissemination).
21st Century Corporate Financial Fraud, United States, 2005-2010
catalog.data.gov
icpsr.umich.edu
Updated Nov 14, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Justice (2025). 21st Century Corporate Financial Fraud, United States, 2005-2010 [Dataset]. https://catalog.data.gov/dataset/21st-century-corporate-financial-fraud-united-states-2005-2010-22a9e
Explore at:
Dataset updated
Nov 14, 2025
Dataset provided by
National Institute of Justicehttp://nij.ojp.gov/
Area covered
United States
Description
The Corporate Financial Fraud project is a study of company and top-executive characteristics of firms that ultimately violated Securities and Exchange Commission (SEC) financial accounting and securities fraud provisions compared to a sample of public companies that did not. The fraud firm sample was identified through systematic review of SEC accounting enforcement releases from 2005-2010, which included administrative and civil actions, and referrals for criminal prosecution that were identified through mentions in enforcement release, indictments, and news searches. The non-fraud firms were randomly selected from among nearly 10,000 US public companies censused and active during at least one year between 2005-2010 in Standard and Poor's Compustat data. The Company and Top-Executive (CEO) databases combine information from numerous publicly available sources, many in raw form that were hand-coded (e.g., for fraud firms: Accounting and Auditing Enforcement Releases (AAER) enforcement releases, investigation summaries, SEC-filed complaints, litigation proceedings and case outcomes). Financial and structural information on companies for the year leading up to the financial fraud (or around year 2000 for non-fraud firms) was collected from Compustat financial statement data on Form 10-Ks, and supplemented by hand-collected data from original company 10-Ks, proxy statements, or other financial reports accessed via Electronic Data Gathering, Analysis, and Retrieval (EDGAR), SEC's data-gathering search tool. For CEOs, data on personal background characteristics were collected from Execucomp and BoardEx databases, supplemented by hand-collection from proxy-statement biographies.
u
Board Leadership Database (U.S. Public Firms) + ML Script for Scaling Human...
iro.uiowa.edu
zenodo.org
Updated Dec 7, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joseph S Harrison; Matthew Josefy; Matias Kalm; Ryan Krause (2022). Board Leadership Database (U.S. Public Firms) + ML Script for Scaling Human Coded Data [Dataset]. https://iro.uiowa.edu/esploro/outputs/dataset/Board-Leadership-Database-US-Public-Firms/9984937786602771
Explore at:
Dataset updated
Dec 7, 2022
Dataset provided by
Zenodo
Authors
Joseph S Harrison; Matthew Josefy; Matias Kalm; Ryan Krause
Time period covered
Dec 7, 2022
Area covered
United States
Description
Files include: (1) an open sourced database of CEO duality and board chair orientations developed by scaling human coded data using supervised machine learning techniques (in both .dta and .csv formats), as well as (2) the accompanying training and scoring scripts to scale human coded data. Users may apply the scoring script to score the same variables from company proxy statements, or may adapt the training/scoring scripts and retrain models to scale human coded data of other constructs or measures. We note that early steps in the process to develop our database and script required web-scraping of company filings from SEC Edgar and text extraction from collected filings. We relied on other publicly available scripts to develop our own fetcher and extraction scripts. Users seeking to duplicate those parts of the process may benefit from the following resources from Kai Chen and pipy.org: For resources from Kai Chen: see https://www.kaichen.work/?p=681 and https://www.kaichen.work/?p=946 For resources from pipy.org, see sec-edgar-downloader and sec-api
O
DEFT Corpus
opendatalab.com
zip
Updated Mar 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
University of California (2023). DEFT Corpus [Dataset]. https://opendatalab.com/OpenDataLab/DEFT_Corpus
Explore at:
zip(54370260 bytes)Available download formats
Dataset updated
Mar 24, 2023
Dataset provided by
Adobe
University of California
Adobe Research
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The DEFT corpus1 consists of annotated content from two different data sources: 1) 2,443 sentences (5,324,430 tokens) from various 2017 SEC contract filings from the publicly available US Securities and Exchange Commission EDGAR (SEC) database, and 2) 21,303 sentences (409,253 tokens) from the https://cnx.org/ open source textbooks (by various authors, licensed under CC BY 4.0) including topics in biology, history, physics, psychology, economics, sociology, and government. 22% of SEC sentences contain definitions and 28% of textbook sentences contain definitions. Our entire corpus, including both datasets, is significantly larger and more complex than any existing definition extraction dataset (see Table 1).
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Google BigQuery (2020). SEC Filings [Dataset]. https://www.kaggle.com/datasets/bigquery/sec-filings

Data from: SEC Filings

BigQuery dataset of all SEC filings

Explore at:

zip(0 bytes)Available download formats

Dataset updated

Jun 5, 2020

Dataset provided by

BigQueryhttps://cloud.google.com/bigquery

Authors

Google BigQuery

Description

In the U.S. public companies, certain insiders and broker-dealers are required to regularly file with the SEC. The SEC makes this data available online for anybody to view and use via their Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database. The SEC updates this data every quarter going back to January, 2009. For more information please see this site.

To aid analysis a quick summary view of the data has been created that is not available in the original dataset. The quick summary view pulls together signals into a single table that otherwise would have to be joined from multiple tables and enables a more streamlined user experience.

DISCLAIMER: The Financial Statement and Notes Data Sets contain information derived from structured data filed with the Commission by individual registrants as well as Commission-generated filing identifiers. Because the data sets are derived from information provided by individual registrants, we cannot guarantee the accuracy of the data sets. In addition, it is possible inaccuracies or other errors were introduced into the data sets during the process of extracting the data and compiling the data sets. Finally, the data sets do not reflect all available information, including certain metadata associated with Commission filings. The data sets are intended to assist the public in analyzing data contained in Commission filings; however, they are not a substitute for such filings. Investors should review the full Commission filings before making any investment decision.

Clear search

Close search

Google apps

Main menu

Data from: SEC Filings

Layline institutional holding reports

SEC (EDGAR) Company Names & CIK Keys

Context

Content

Acknowledgements

Inspiration

Financial Statement Data Sets

Dataset Features:

Data Sources and Extraction:

US Company Filings Database

PUDL Raw U.S. Securities and Exchange Commission Form 10-K

SEC Form 4 Filings

Why this exists

What you get:

Ideas to try with the data:

Limitations & notes:

Filings

Embeddings of Item 1 from 10-K Filings (1994-2022) using MPNet

Common Ownership Data: Scraped SEC form 13F filings for 1999-2017

Data from: Deft

SEC 13F-HR Institutional Investment Data

Context

Content

Acknowledgements

Inspiration

MarkupMnA

A database for blockholders in US-listed firms including all Form 13D and...

21st Century Corporate Financial Fraud, United States, 2005-2010

Board Leadership Database (U.S. Public Firms) + ML Script for Scaling Human...

DEFT Corpus

Data from: SEC Filings

BigQuery dataset of all SEC filings