EDGAR-CORPUS: Billions of Tokens Make The World Go Round
In the Proceedings of the Workshop on Economics and Natural Language Processing (ECONLP) - co-located with EMNLP 2021
We release EDGAR-CORPUS, a novel corpus comprising annual reports from all the publicly traded companies in the US spanning a period of more than 25 years.
All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format.
https://www.usa.gov/government-works/
Data from 2010 Q1 to 2025 Q1
The data is created with this Jupyter Notebook:
The data format is documented in the Readme. The SEC data documentation can be found here.
JSON structure:
{"quarter": "Q1", "country": "Italy", "data": {"cf": [{"value": 0, "concept": "A", "unit": "USD", "label": "B", "info": "C"}], "bs": [{"value": 0, "concept": "A", "unit": "USD", "label": "B", "info": "C"}], "ic": [{"value": 0, "concept": "A", "unit": "USD", "label": "B", "info": "C"}]}, "year": 0, "name": "B", "startDate": "2009-12-31", "endDate": "2010-12-30", "symbol": "GM", "city": "York"}
An example JSON:
{"year": 2023, "data": {"cf": [{"value": -1834000000, "concept": "NetCashProvidedByUsedInFinancingActivities", "unit": "USD", "label": "Amount of cash inflow (outflow) from financing …", "info": "Net cash used in financing activities"}], "ic": [{"value": 1000000, "concept": "IncreaseDecreaseInDueFromRelatedParties", "unit": "USD", "label": "The increase (decrease) during the reporting pe…", "info": "Receivables from related parties"}], "bs": [{"value": 2779000000, "concept": "AccountsPayableCurrent", "unit": "USD", "label": "Carrying value as of the balance sheet date of …", "info": "Accounts payable"}]}, "quarter": "Q2", "city": "SANTA CLARA", "startDate": "2023-06-30", "name": "ADVANCED MICRO DEVICES INC", "endDate": "2023-09-29", "country": "US", "symbol": "AMD"}
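As a rough illustration of how a record in this shape might be consumed, the sketch below parses a trimmed copy of the example and flattens the nested "data" object into one row per reported item. The helper name flatten_record and the shortened "label" values are made up for this example, and the reading of "cf", "ic" and "bs" as cash flow, income statement and balance sheet is an assumption based on the key names, not something the dataset description states.

```python
import json

# Trimmed copy of the example record; "label" values are shortened here.
record_text = """
{"year": 2023, "quarter": "Q2", "symbol": "AMD",
 "name": "ADVANCED MICRO DEVICES INC",
 "data": {
   "cf": [{"value": -1834000000,
           "concept": "NetCashProvidedByUsedInFinancingActivities",
           "unit": "USD", "label": "...",
           "info": "Net cash used in financing activities"}],
   "ic": [{"value": 1000000,
           "concept": "IncreaseDecreaseInDueFromRelatedParties",
           "unit": "USD", "label": "...",
           "info": "Receivables from related parties"}],
   "bs": [{"value": 2779000000,
           "concept": "AccountsPayableCurrent",
           "unit": "USD", "label": "...",
           "info": "Accounts payable"}]
 }}
"""


def flatten_record(record):
    """Return one flat dict per item under record["data"] (hypothetical helper)."""
    rows = []
    # Keys of "data" are the statement codes ("cf", "ic", "bs" in the example).
    for statement, items in record["data"].items():
        for item in items:
            rows.append({
                "symbol": record["symbol"],
                "year": record["year"],
                "quarter": record["quarter"],
                "statement": statement,
                "concept": item["concept"],
                "value": item["value"],
                "unit": item["unit"],
            })
    return rows


record = json.loads(record_text)
rows = flatten_record(record)
for row in rows:
    print(row["symbol"], row["statement"], row["concept"], row["value"], row["unit"])
```

Flat rows like these are easier to load into a table or data frame than the nested form, at the cost of repeating the company-level fields on every row.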
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This data consists of 10-Q and 10-K reports downloaded as JSON files, which I then converted to Parquet files for efficiency. Every data frame includes a market cap column that gives a measure of the market cap on that day. You can always get the data from the SEC website; the latest update should be available at https://www.sec.gov/Archives/edgar/daily-index/xbrl/companyfacts.zip. A.csv is just an example of what the rest of the data looks like. Enjoy it if you can.