Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PUDL v2025.2.0 Data Release
This is our regular quarterly release for 2025Q1. It includes updates to all the datasets that are published with quarterly or higher frequency, plus initial versions of a few new data sources that have been in the works for a while.
One major change this quarter is that we are now publishing all processed PUDL data as Apache Parquet files, alongside our existing SQLite databases. See Data Access for more on how to access these outputs.
Some potentially breaking changes to be aware of:
In the EIA Form 930 – Hourly and Daily Balancing Authority Operations Report a number of new energy sources have been added, and some old energy sources have been split into more granular categories. See Changes in energy source granularity over time.
We are now running the EPA’s CAMD to EIA unit crosswalk code for each individual year starting from 2018, rather than just 2018 and 2021, resulting in more connections between these two datasets and changes to some sub-plant IDs. See the note below for more details.
Many thanks to the organizations who make these regular updates possible! Especially GridLab, RMI, and the ZERO Lab at Princeton University. If you rely on PUDL and would like to help ensure that the data keeps flowing, please consider joining them as a PUDL Sustainer, as we are still fundraising for 2025.
New Data
EIA 176
Added a couple of semi-transformed interim EIA-176 (natural gas sources and dispositions) tables. They aren’t yet being written to the database, but are one step closer. See #3555 and PRs #3590, #3978. Thanks to @davidmudrauskas for moving this dataset forward.
Extracted these interim tables up through the latest 2023 data release. See #4002 and #4004.
EIA 860
Added EIA 860 Multifuel table. See #3438 and #3946.
FERC 1
Added three new output tables containing granular utility accounting data. See #4057, #3642 and the table descriptions in the data dictionary:
out_ferc1_yearly_detailed_income_statements
out_ferc1_yearly_detailed_balance_sheet_assets
out_ferc1_yearly_detailed_balance_sheet_liabilities
SEC Form 10-K Parent-Subsidiary Ownership
We have added some new tables describing the parent-subsidiary company ownership relationships reported in the SEC’s Form 10-K, Exhibit 21 “Subsidiaries of the Registrant”. Where possible these tables link the SEC filers or their subsidiary companies to the corresponding EIA utilities. This work was funded by a grant from the Mozilla Foundation. Most of the ML models and data preparation took place in the mozilla-sec-eia repository separate from the main PUDL ETL, as it requires processing hundreds of thousands of PDFs and the deployment of some ML experiment tracking infrastructure. The new tables are handed off as nearly finished products to the PUDL ETL pipeline. Note that these are preliminary, experimental data products and are known to be incomplete and to contain errors. Extracting data tables from unstructured PDFs and the SEC to EIA record linkage are necessarily probabilistic processes.
See PRs #4026, #4031, #4035, #4046, #4048, #4050 and check out the table descriptions in the PUDL data dictionary:
out_sec10k_parents_and_subsidiaries
core_sec10k_quarterly_filings
core_sec10k_quarterly_exhibit_21_company_ownership
core_sec10k_quarterly_company_information
Expanded Data Coverage
EPA CEMS
Added 2024 Q4 of CEMS data. See #4041 and #4052.
EPA CAMD EIA Crosswalk
In the past, the crosswalk in PUDL has used the EPA’s published crosswalk (run with 2018 data), and an additional crosswalk we ran with 2021 EIA 860 data. To ensure that the crosswalk reflects updates in both EIA and EPA data, we re-ran the EPA R code which generates the EPA CAMD EIA crosswalk with 4 new years of data: 2019, 2020, 2022 and 2023. Re-running the crosswalk pulls the latest data from the CAMD FACT API, which results in some changes to the generator and unit IDs reported on the EPA side of the crosswalk, which feeds into the creation of core_epa_assn_eia_epacamd.
The changes only result in the addition of new units and generators in the EPA data, with no changes to matches at the plant level. However, the updates to generator and unit IDs have resulted in changes to the subplant IDs: some EIA boilers and generators which previously had no matches to EPA data have now been matched to EPA unit data, resulting in an overall reduction in the number of rows in the core_epa_assn_eia_epacamd_subplant_ids table. See issue #4039 and PR #4056 for a discussion of the changes observed in the course of this update.
EIA 860M
Added EIA 860m through December 2024. See #4038 and #4047.
EIA 923
Added EIA 923 monthly data through September 2024. See #4038 and #4047.
EIA Bulk Electricity Data
Updated the EIA Bulk Electricity data to include data published up through 2024-11-01. See #4042 and PR #4051.
EIA 930
Updated the EIA 930 data to include data published up through the beginning of February 2025. See #4040 and PR #4054. 10 new energy sources were added and 3 were retired; see Changes in energy source granularity over time for more information.
Bug Fixes
Fixed an accidentally swapped set of starting balance / ending balance column rename parameters in the pre-2021 DBF derived data that feeds into core_ferc1_yearly_other_regulatory_liabilities_sched278. See issue #3952 and PRs #3969, #3979. Thanks to @yolandazzz13 for making this fix.
Added preliminary data validation checks for several FERC 1 tables that were missing them. See #3860.
Fixed the spelling of Lake Huron and Lake Saint Clair in out_vcerare_hourly_available_capacity_factor and related tables. See issue #4007 and PR #4029.
Quality of Life Improvements
We added a sources parameter to pudl.metadata.classes.DataSource.from_id() in order to make it possible to use the pudl-archiver repository to archive datasets that won’t necessarily be ingested into PUDL. See this PUDL archiver issue and PRs #4003 and #4013.
Other PUDL v2025.2.0 Resources
PUDL v2025.2.0 Data Dictionary
PUDL v2025.2.0 Documentation
PUDL in the AWS Open Data Registry
PUDL v2025.2.0 in a free, public AWS S3 bucket: s3://pudl.catalyst.coop/v2025.2.0/
PUDL v2025.2.0 in a requester-pays GCS bucket: gs://pudl.catalyst.coop/v2025.2.0/
Zenodo archive of the PUDL GitHub repo for this release
PUDL v2025.2.0 release on GitHub
PUDL v2025.2.0 package in the Python Package Index (PyPI)
Contact Us
If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data. Here's a bunch of different ways to get in touch:
Follow us on GitHub
Use the PUDL GitHub issue tracker to let us know about any bugs or data issues you encounter.
GitHub Discussions is where we provide user support.
Watch our GitHub Project to see what we're working on.
Email us at hello@catalyst.coop for private communications.
On Mastodon: @CatalystCoop@mastodon.energy
On BlueSky: @catalyst.coop
On Twitter: @CatalystCoop
Connect with us on LinkedIn
Play with our data and notebooks on Kaggle
Combine our data with ML models on HuggingFace
Learn more about us on our website: https://catalyst.coop
Subscribe to our announcements list for email updates.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is our regular quarterly release for 2024Q3. It includes quarterly updates to all datasets that are updated with quarterly or higher frequency by their publishers, including EIA-860M, EIA-923 (YTD data), EIA-930, the EIA’s bulk electricity API data (used to fill in missing fuel prices), and the EPA CEMS hourly emissions data.
Annual datasets which have been published since our last quarterly release have also been integrated. These include FERC Forms 1, 2, 6, 60, and 714, and the NREL ATB.
This release also includes provisional versions of the annual 2023 EIA-860 and EIA-923 datasets, whose final release will not happen until the fall.
Integrated FERC Form 1 data from 2023 into the main PUDL SQLite DB. See issue #3700 and PR #3701. This required updating to a new version of the catalystcoop.ferc_xbrl_extractor package because there are now multiple XBRL taxonomies in use by FERC in different years, or even within the same year. See this PR for more details, as well as issue #3544 and PR #3710.
Updated the ferc_to_sqlite settings to extract 2023 XBRL data for FERC Forms 2, 6, 60, and 714 and add them to their respective SQLite databases. Note that this data is not yet being processed beyond the conversion from XBRL to SQLite. See PR #3710.
Added new tables from EIA AEO table 54:
core_eiaaeo_yearly_projected_fuel_cost_in_electric_sector_by_type contains fuel costs for the electric power sector. These are broken out by fuel type, and include both nominal USD per MMBtu as well as real 2022 USD per MMBtu. See issue #3649 and PR #3656.
Added EIA 860 early release data from 2023. This included adding a new tab with proposed energy storage generators as well as adding a number of new columns regarding energy storage and solar generators. See issue #3676 and PR #3681.
Added EIA 860m data through June 2024. See issue #3759 and PR #3767.
Added EIA 923 early release data from 2023. See #3719 and PR #3721.
Added EIA 923 monthly data through May as part of the Q2 quarterly release. See #3760 and #3768.
Added EIA 930 hourly data through the end of July as part of the Q2 quarterly release. See #3761 and #3789.
Updated the EIA Bulk Electricity data archive to include data that was available as of 2024-08-01, which covers up through 2024-05-01 (3 months more than the previously used archive). See #3763 and PR #3785.
Added core_ferc714_yearly_planning_area_demand_forecast based on FERC Form 714, Part III, Schedule 2b. Data includes forecasted demand and net energy load. See issue #3519 and PR #3670.
Added 2024 NREL ATB data. This includes adding a new tax credit case, model_tax_credit_case_nrelatb, a breakout of capex_grid_connection_per_kw for all technologies, and more detailed nuclear breakdowns of fuel_cost_per_mwh. Simultaneously, updated the docs.dev.existing_data_updates documentation to make it easier to add future years of data. See #3706 and #3719.
Updated NREL ATB data to include error corrections in the 2024 data. See #3777 and PR #3778.
When generator_operating_date values are too inconsistent to be harvested successfully, we now take the last reported date in EIA 860 and 860M. See #423 and PR #3967.
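The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the PUDL implementation: the function name, the record layout, and the 70% agreement threshold are all assumptions made for the example.

```python
from collections import Counter
from datetime import date


def harvest_operating_date(reports, min_agreement=0.7):
    """Pick one operating date for a generator from many reported values.

    `reports` is a list of (report_date, generator_operating_date) tuples.
    `min_agreement` is a hypothetical consistency threshold, not a PUDL setting.
    """
    dated = [(rep, op) for rep, op in reports if op is not None]
    if not dated:
        return None
    values = [op for _, op in dated]
    most_common, count = Counter(values).most_common(1)[0]
    # If enough reports agree, use the value they agree on.
    if count / len(values) >= min_agreement:
        return most_common
    # Otherwise fall back to the value from the most recent report.
    return max(dated, key=lambda pair: pair[0])[1]
```

With three of four reports agreeing, the agreed value is returned; with three different values, the one from the latest report wins.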
Added the generator_operating_date field into core_eia860m_changelog_generators, adding 860M reported generator operating dates into the changelog table. This table is not harvested, and thus does not affect the generator_operating_date values reported in other core EIA tables. See #3722 and PR #3751.
Disabled filling of missing values using rolling averages for the fuel_cost_per_mmbtu column in the out_eia923_fuel_receipts_costs table, as it was resulting in some anomalously high fuel prices. See #3716. This results in about 2% more records in the table being left NA after filling with the average prices for that fuel type for the state and month found in the bulk EIA API data.
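As a rough sketch of the remaining fill step, the snippet below looks up an average price by fuel type, state, and month and leaves a record empty when no average exists. Only fuel_cost_per_mmbtu is a real column name from the note above; the other keys, the function name, and the lookup structure are assumptions for illustration.

```python
def fill_fuel_costs(records, bulk_avg_prices):
    """Fill missing fuel_cost_per_mmbtu values from average prices.

    records: list of dicts with (hypothetical) keys fuel_type, state, month,
        plus fuel_cost_per_mmbtu, which may be None.
    bulk_avg_prices: dict mapping (fuel_type, state, month) -> average price.
    Returns the filled copies and the number of records still missing a price.
    """
    filled = []
    still_missing = 0
    for rec in records:
        rec = dict(rec)  # copy so the caller's data is not modified
        if rec["fuel_cost_per_mmbtu"] is None:
            key = (rec["fuel_type"], rec["state"], rec["month"])
            # No rolling-average fallback: a missing average leaves the value None.
            rec["fuel_cost_per_mmbtu"] = bulk_avg_prices.get(key)
            if rec["fuel_cost_per_mmbtu"] is None:
                still_missing += 1
        filled.append(rec)
    return filled, still_missing
```

Records that already report a price are passed through unchanged, so only gaps are ever touched.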
The full ETL settings are now read directly from etl_full.yml instead of using default values defined in the settings classes. This also results in the settings showing up in the Dagster UI Launchpad, which previously they didn’t, leading to confusion when trying to re-run the FERC to SQLite conversions. See #3710.
mlflow experiment tracking has been disabled by default when running the DAG, since it is only really helpful during development of new record linkage or other ML workflows. See #3710.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a regular quarterly release of PUDL. It includes new 2024 annual updates for a number of datasets (FERC Forms 2, 6, 60, & 714), and a minor update to the 2024 FERC Form 1 data that includes late filings & revisions. It also includes year-to-date updates for the monthly and quarterly datasets, including EIA-860M, EIA-923, EIA-930, and the EPA CEMS hourly emissions. There were also a number of data processing bug fixes and data usability improvements. See the full notes below for details.
Thanks to contributions from @alexclippinger, we’ve added cleaned EIA923 Schedule 8A Byproduct Disposition to the PUDL database as _core_eia923_yearly_byproduct_disposition. Once harvested, this table will be replaced with a well-normalized version of the same data, but it is being published in this form until then. See #4100, #2448, and #4502.
Updated EIA-860M monthly generator report with newly published data for May and June of 2025. See issue #4379 and PR #4536.
Updated the EIA Bulk Electricity data to include data published up through the beginning of August 2025. See #4519 and PR #4523.
Updated our extraction of FERC Forms 2, 6, and 60 to raw SQLite databases to include 2024 data. See #4418 and #4433.
Extracted 2023 and 2024 PHMSA distribution and transmission data to raw assets. This data is not currently published to the PUDL database. See #4449 and #4470.
Extracted 1970 through 1989 PHMSA transmission data to raw assets. This data is not currently published to the PUDL database. See #3290 and #4500.
The output of dbt_helper update-tables now conforms to the format that our pre-commit hooks expect, reducing annoying back-and-forth and diffs. See #4119 and #4401.
Improved behavior of dbt_helper when interacting with row count test definitions as well as updating the row counts stored in dbt seed tables: the logic for writing a new table dbt schema no longer includes automatically adding a row count test. Also, the logic for updating row counts now depends on whether a test has been defined in the dbt schema, whether any existing row counts for that table are present in the seed table, as well as user provided settings such as --clobber.
Stopped running code checks in CI when only the documentation has changed. See issue #4410 and PR #4429.
Added utility_id_ferc1_dbf and utility_id_ferc1_xbrl columns into all ferc1 output tables. See #4365 and PR #4528.
Fixed bug in how we were labeling the data_maturity of EIA 923. See issue #4328 and PR #4392.
Fixed bug in how we were repairing a misfiled EIA code in core_ferc714_respondent_id. See issue #4439 and PR #4497.
Fixed bug in how we were removing duplicates in core_eia923_monthly_generation, resulting in ~400 more records in this table over several years. See details in PR #4538.
Migrated table description metadata into new format; see epic #4358 for issues & PRs for all source groups.
This included renaming two of the preliminarily published _core tables to better conform with our table naming conventions. Table _core_eia923_cooling_system_information is now _core_eia923_monthly_cooling_system_information and _core_eia923_fgd_operation_maintenance is now _core_eia923_yearly_fgd_operation_maintenance. See #4422.
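If you have code that refers to the old names, a small lookup like the one below may help during migration. The two mappings come straight from the note above; the helper function itself is only an illustrative sketch.

```python
# Old name -> new name, as listed in the release note above.
RENAMED_TABLES = {
    "_core_eia923_cooling_system_information":
        "_core_eia923_monthly_cooling_system_information",
    "_core_eia923_fgd_operation_maintenance":
        "_core_eia923_yearly_fgd_operation_maintenance",
}


def current_table_name(name):
    """Return the new name for a renamed table, or the name unchanged."""
    return RENAMED_TABLES.get(name, name)
```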
Added data source pages for:
EPA CAMD to EIA Power Sector Data Crosswalk; see issue #4376 and PR #4403
Added checks which ensure that only hourly electricity demand values which are flagged for imputation change significantly from their reported values before and after the imputation. Added checks that the missingness of various columns in the hourly reported demand and imputed demand is within expected ranges. Years which are dropped due to insufficient data for meaningful imputation are now explicitly flagged with BAD_YEAR. Affected tables include out_eia930_hourly_operations, out_eia930_hourly_subregion_demand, and out_ferc714_hourly_planning_area_demand. See PR #4334.
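The idea behind these checks can be sketched in a few lines of plain Python. This is not the validation code used in PUDL: the function names, the list-based inputs, and the tolerance are assumptions chosen to keep the example self-contained.

```python
import math


def unflagged_changes(reported, imputed, flagged, rel_tol=1e-6):
    """Return indices where a value changed even though it was not flagged.

    reported, imputed: equal-length lists of floats or None.
    flagged: equal-length list of booleans marking values chosen for imputation.
    rel_tol is a hypothetical tolerance for "changed significantly".
    """
    bad = []
    for i, (before, after, is_flagged) in enumerate(zip(reported, imputed, flagged)):
        if is_flagged:
            continue  # flagged values are allowed to change
        if before is None or after is None:
            # A missing value must stay missing unless it was flagged.
            if before is not after:
                bad.append(i)
        elif not math.isclose(before, after, rel_tol=rel_tol):
            bad.append(i)
    return bad


def null_fraction(values):
    """Fraction of missing values, for comparison with an expected range."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)
```

An empty result from `unflagged_changes` means the imputation only touched values it was supposed to touch.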
Previously we had a data validation check that ensured there were no entirely null columns applied to a handful of tables. Such columns were typically the result of typos or failures to update column names, or application
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We've just completed our quarterly integration of EIA data sources for 2024Q2 (in support of RMI's Utility Transition Hub) and have also added a bunch of new tables over the last few months in an effort to better support energy system modelers (with support from GridLab).
gridpathratoolkit data source containing hourly wind and solar generation profiles from the GridPath Resource Adequacy Toolkit. See our documentation and the new Zenodo archive, PR #3489 and this PUDL archiver issue.
When generator_operating_date values are too inconsistent to be harvested successfully, we now take the max date within a year and attempt to harvest again, to rescue records lost because of inconsistent month reporting in EIA 860 and 860M. See issue #3340 and PR #3419. This change also fixed a bug that was preventing other columns harvested with a special process from being saved.
We merged in a refactor of the EIA plant parts to FERC1 plants record linkage model, which was generously supported by a CCAI Innovation Grant. This replaced the linear regression model with a model built
https://qdr.syr.edu/policies/qdr-standard-access-conditions
This is an Annotation for Transparent Inquiry (ATI) data project. The annotated article can be viewed on the publisher's website. The overarching empirical research question of the paper is “why did states recognize Bangladesh as a state?” and, more specifically, “why did (most of) the international community first condemn and then accept Bangladesh as a state?”. The goal of the empirical section of the paper was to do theory-building process-tracing of the decisions to recognize Bangladesh, that is, to build a theoretical explanation from the empirical evidence of a particular case, and then to infer that an analytically general mechanism exists.
Data generation
After immersing myself in the secondary literature and the archival material that I had collected for the prior doctoral project, I had an idea for a skeleton causal mechanism, i.e. that the withdrawal of Indian troops from Bangladesh had somehow changed the status of recognition, that is, legitimated recognition. In order to assess this idea, I then consulted some theory from international relations, psychology, sociology, and cognitive science, on how decisions are made and how arguments work, in order to hypothesize a causal mechanism. This causal mechanism, elucidated in the paper, was rhetorical adduction; basically that states try to win arguments (thus changing the behavior of relatively uncommitted audiences relative to some policy) by linking some empirical state of affairs with their argument and then bringing that empirical state of affairs about. In this Bangladesh case, this meant that some actors argued that although India’s invasion and occupation of East Pakistan made recognition of Bangladesh problematic, the withdrawal of Indian troops from Bangladesh would dismiss or undercut the critique.
At this point, I formulated some observable implications of this idea, such as that if this is what had actually been going on, the states making the argument (e.g. Bangladesh and India) would have to actually have made the argument, and states would have explicitly conditioned their recognition policy decision on the withdrawal of Indian troops. In order to find out whether there was any evidence for these observable implications, I consulted three main types of evidence: 1) public statements by state representatives in the press and at the UN (using the UN verbatim meeting records), 2) UK political and diplomatic archives and 3) US political and diplomatic archives. As it happens, the UK was heavily involved in discussions surrounding recognition and the US was not (US President Richard Nixon and National Security Adviser Henry Kissinger were more concerned with other issues, like supporting West Pakistan and also organizing the historic visit to the People’s Republic of China), so that almost all of the relevant evidence came from UK archives.
A clear limitation of this sampling frame is that it relies on 3rd party evaluations of internal deliberations of most of the states involved. This is less of a problem than it might otherwise be because there seems little reason to explicitly condition recognition on troop withdrawal in private and secret/confidential bilateral communication with the UK if it is irrelevant to internal deliberations. If there had been some clear self-interest in misrepresenting, in this type of communication, then it would affect the plausibility of the causal claims.
I collected most of the documents used in the paper from the National Archives at Kew in the UK during two visits, one in January 2011 and another in July 2013. The first visit was to collect data for my doctoral dissertation, which was a prior, separate project from this paper. While I was finishing the Bangladesh case for my dissertation, I began to have another idea about the material. That is, I started to think that a slightly different type of conceptual/theoretical argument was relevant to a different empirical aspect of the Bangladesh case. However, as I had not had that in mind when initially collecting archival documents, I arranged a second visit to search for more information more directly relevant to this second puzzle.
The documents primarily come from a series of folders from the Foreign and Commonwealth Offices’ archives and the Premiers’ Archives that I found via two methods. First, I used the citations in Musson 2008 to identify potentially important or relevant material and then made a list of all the folders that that material was contained in. Second, I performed keyword searches for recognition and Bangladesh in the National Archives database search engine. While I was in the archives, I made copies of almost every single document in the folders that I had previously identified. I excluded documents that were obvious duplicates or that had no readable text.
Data analysis
Data analysis for this paper involved reading through all of the documents, constructing a detailed timeline of who said what when and who did what when, and then...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PUDL Data Release 4.0.0
This is a data release from the Public Utility Data Liberation (PUDL) project.
Using This Data
The data in this archive is stored in a combination of SQLite database files, and Apache Parquet datasets. It can be used as a standalone resource, or in conjunction with the PUDL software. The PUDL documentation contains data dictionaries for many of the data tables.
If you want to use the data in conjunction with the PUDL software, we've included a Docker image within the archive that will run a Jupyter Notebook Server containing examples of use based on our PUDL Examples repository. This Docker image contains all of the required software, and can access the associated archived data.
Make sure that you've got Docker installed and running, and also have docker-compose. You'll want to allocate at least 8GB of memory to Docker.
To use the Docker container to access and work with the data, download and extract the compressed tar archive on your computer.
Inside the directory that is created when you extract the archive, you will find a Docker image. Load that image into your Docker environment locally with:
docker load -i pudl-jupyter.tar
Then within that same directory, run:
docker-compose up
This should start a Jupyter Notebook Server, and provide you with a link to connect to the server running on your local computer, beginning with https://127.0.0.1:48512 or https://localhost:48512
You can select the tutorial notebooks from within the notebook interface. The README file contained in the archive and the PUDL Examples repository both provide more details on how to access and work with the data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PUDL Data Release 3.0.0
This is a data release from the Public Utility Data Liberation (PUDL) project.
Using This Data
The data in this archive is stored in a combination of SQLite database files, and Apache Parquet datasets. It can be used as a standalone resource, or in conjunction with the PUDL software. The PUDL documentation contains data dictionaries for many of the data tables.
If you want to use the data in conjunction with the PUDL software, we've included a Docker image within the archive that will run a Jupyter Notebook Server containing examples of use based on our PUDL Examples repository. This Docker image contains all of the required software, and can access the associated archived data.
Make sure that you've got Docker installed and running, and also have docker-compose. You'll want to allocate at least 8GB of memory to Docker.
To use the Docker container to access and work with the data, download and extract the compressed tar archive on your computer.
Inside the directory that is created when you extract the archive, you will find a Docker image. Load that image into your Docker environment locally with:
docker load -i pudl-jupyter.tar
Then within that same directory, run:
docker-compose up
This should start a Jupyter Notebook Server, and provide you with a link to connect to the server running on your local computer, beginning with https://127.0.0.1:48512 or https://localhost:48512
You can select the tutorial notebooks from within the notebook interface. The README file contained in the archive and the PUDL Examples repository both provide more details on how to access and work with the data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PUDL Data Release 2.0.0
This is a data release from the Public Utility Data Liberation (PUDL) project.
Using This Data
The data in this archive is stored in a combination of SQLite database files, and Apache Parquet datasets. It can be used as a standalone resource, or in conjunction with the PUDL software. The PUDL documentation contains data dictionaries for many of the data tables.
If you want to use the data in conjunction with the PUDL software, we've included a Docker image within the archive that will run a Jupyter Notebook Server containing examples of use based on our PUDL Examples repository. This Docker image contains all of the required software, and can access the associated archived data.
Make sure that you've got Docker installed and running, and also have docker-compose. You'll want to allocate at least 8GB of memory to Docker.
To use the Docker container to access and work with the data, download and extract the compressed tar archive on your computer.
Inside the directory that is created when you extract the archive, you will find a Docker image. Load that image into your Docker environment locally with:
docker load -i pudl-jupyter.tar
Then within that same directory, run:
docker-compose up
This should start a Jupyter Notebook Server, and provide you with a link to connect to the server running on your local computer, beginning with https://127.0.0.1:48512 or https://localhost:48512
You can select the tutorial notebooks from within the notebook interface. The README file contained in the archive and the PUDL Examples repository both provide more details on how to access and work with the data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a data release from the Public Utility Data Liberation (PUDL) project. It's the first data-only release we've published. All of the tables which were previously only available by using the PUDL software package to process the data we previously published in the PUDL SQLite database are now being written into the database itself. This should make it easier for people to access with minimal setup, using a variety of different tools: Python, R, DuckDB, and many others! We are still committed to keeping the data processing pipeline behind this data free and open and transparent; we just don't want everyone to have to install and work with that software if all they want is the output data!
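For example, a table can be inspected from Python with nothing but the standard library. The sketch below is a minimal illustration: the helper name is made up for the example, and the database file name and table name you pass in depend on the release you downloaded.

```python
import sqlite3
from contextlib import closing


def read_table(db_path, table, limit=5):
    """Return the column names and the first few rows of one table."""
    with closing(sqlite3.connect(db_path)) as conn:
        # Look up the real table names so an unknown name fails clearly
        # and is never interpolated into the query text.
        names = {
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        }
        if table not in names:
            raise KeyError(f"No table named {table!r} in {db_path}")
        cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
        columns = [description[0] for description in cursor.description]
        return columns, cursor.fetchall()
```

Calling `read_table("pudl.sqlite", "<table name>")` would return the column names and up to five rows, assuming the extracted database file is called `pudl.sqlite` and the table exists in it.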
We are about to do a major reorganization of the database, renaming almost every table and a number of columns. This data release is a snapshot of the database before all that change happens, and is meant to provide continuity for users who are already working with the database, so that they can access all the final 2022 data and migrate to the new database structure at a time of their own choosing over the coming months. We will do another data release soon, containing data through 2022, but with the new table and column names.
This is the software that was used to produce the data release. It is not necessary to work with the data, but it's linked here to provide transparency and provenance:
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PUDL v2025.2.0 Data Release
This is our regular quarterly release for 2025Q1. It includes updates to all the datasets that are published with quarterly or higher frequency, plus initial verisons of a few new data sources that have been in the works for a while.
One major change this quarter is that we are now publishing all processed PUDL data as Apache Parquet files, alongside our existing SQLite databases. See Data Access for more on how to access these outputs.
Some potentially breaking changes to be aware of:
In the EIA Form 930 – Hourly and Daily Balancing Authority Operations Report a number of new energy sources have been added, and some old energy sources have been split into more granular categories. See Changes in energy source granularity over time.
We are now running the EPA’s CAMD to EIA unit crosswalk code for each individual year starting from 2018, rather than just 2018 and 2021, resulting in more connections between these two datasets and changes to some sub-plant IDs. See the note below for more details.
Many thanks to the organizations who make these regular updates possible! Especially GridLab, RMI, and the ZERO Lab at Princeton University. If you rely on PUDL and would like to help ensure that the data keeps flowing, please consider joining them as a PUDL Sustainer, as we are still fundraising for 2025.
New Data
EIA 176
Add a couple of semi-transformed interim EIA-176 (natural gas sources and dispositions) tables. They aren’t yet being written to the database, but are one step closer. See #3555 and PRs #3590, #3978. Thanks to @davidmudrauskas for moving this dataset forward.
Extracted these interim tables up through the latest 2023 data release. See #4002 and #4004.
EIA 860
Added EIA 860 Multifuel table. See #3438 and #3946.
FERC 1
Added three new output tables containing granular utility accounting data. See #4057, #3642 and the table descriptions in the data dictionary:
out_ferc1_yearly_detailed_income_statements
out_ferc1_yearly_detailed_balance_sheet_assets
out_ferc1_yearly_detailed_balance_sheet_liabilities
SEC Form 10-K Parent-Subsidiary Ownership
We have added some new tables describing the parent-subsidiary company ownership relationships reported in the SEC’s Form 10-K, Exhibit 21 “Subsidiaries of the Registrant”. Where possible, these tables link the SEC filers or their subsidiary companies to the corresponding EIA utilities. This work was funded by a grant from the Mozilla Foundation. Most of the ML models and data preparation took place in the mozilla-sec-eia repository, separate from the main PUDL ETL, as it requires processing hundreds of thousands of PDFs and the deployment of some ML experiment tracking infrastructure. The new tables are handed off as nearly finished products to the PUDL ETL pipeline. Note that these are preliminary, experimental data products and are known to be incomplete and to contain errors. Extracting data tables from unstructured PDFs and the SEC to EIA record linkage are necessarily probabilistic processes.
See PRs #4026, #4031, #4035, #4046, #4048, #4050 and check out the table descriptions in the PUDL data dictionary:
out_sec10k_parents_and_subsidiaries
core_sec10k_quarterly_filings
core_sec10k_quarterly_exhibit_21_company_ownership
core_sec10k_quarterly_company_information
Expanded Data Coverage
EPA CEMS
Added 2024 Q4 of CEMS data. See #4041 and #4052.
EPA CAMD EIA Crosswalk
In the past, the crosswalk in PUDL has used the EPA’s published crosswalk (run with 2018 data), and an additional crosswalk we ran with 2021 EIA 860 data. To ensure that the crosswalk reflects updates in both EIA and EPA data, we re-ran the EPA R code which generates the EPA CAMD EIA crosswalk with 4 new years of data: 2019, 2020, 2022 and 2023. Re-running the crosswalk pulls the latest data from the CAMD FACT API, which results in some changes to the generator and unit IDs reported on the EPA side of the crosswalk, which feeds into the creation of core_epa_assn_eia_epacamd.
The changes only result in the addition of new units and generators in the EPA data, with no changes to matches at the plant level. However, the updates to generator and unit IDs have resulted in changes to the subplant IDs: some EIA boilers and generators which previously had no matches to EPA data have now been matched to EPA unit data, resulting in an overall reduction in the number of rows in the core_epa_assn_eia_epacamd_subplant_ids table. See issue #4039 and PR #4056 for a discussion of the changes observed in the course of this update.
EIA 860M
Added EIA 860M through December 2024. See #4038 and #4047.
EIA 923
Added EIA 923 monthly data through September 2024. See #4038 and #4047.
EIA Bulk Electricity Data
Updated the EIA Bulk Electricity data to include data published up through 2024-11-01. See #4042 and PR #4051.
EIA 930
Updated the EIA 930 data to include data published up through the beginning of February 2025. See #4040 and PR #4054. 10 new energy sources were added and 3 were retired; see Changes in energy source granularity over time for more information.
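If you need a time series that stays comparable across the change in energy source granularity, one option is to map the newer, more granular labels back onto the coarser label they replaced and sum within each hour. The sketch below shows that idea only. Every label in it is a hypothetical placeholder rather than an actual EIA 930 energy source code, and the `coarsen` helper is made up for this example; see Changes in energy source granularity over time for the real categories.

```python
from collections import defaultdict

# HYPOTHETICAL mapping from granular labels to the coarser label they
# replaced. These are placeholders, not real EIA 930 energy source codes.
GRANULAR_TO_COARSE = {
    "example_source_detail_1": "example_source",
    "example_source_detail_2": "example_source",
}


def coarsen(records, mapping=GRANULAR_TO_COARSE):
    """Sum generation by (hour, coarse energy source).

    `records` is an iterable of (hour, energy_source, mwh) tuples.
    Labels missing from `mapping` are passed through unchanged.
    """
    totals = defaultdict(float)
    for hour, source, mwh in records:
        totals[(hour, mapping.get(source, source))] += mwh
    return dict(totals)


records = [
    ("2025-02-01T00", "example_source_detail_1", 10.0),
    ("2025-02-01T00", "example_source_detail_2", 5.0),
    ("2025-02-01T00", "unchanged_source", 7.0),
]
totals = coarsen(records)
print(totals)
```

Aggregating upward like this loses the new detail, so it is best kept as a derived view rather than a replacement for the published table.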
Bug Fixes
Fixed an accidentally swapped set of starting balance / ending balance column rename parameters in the pre-2021 DBF derived data that feeds into core_ferc1_yearly_other_regulatory_liabilities_sched278. See issue #3952 and PRs #3969, #3979. Thanks to @yolandazzz13 for making this fix.
Added preliminary data validation checks for several FERC 1 tables that were missing them. See #3860.
Fixed the spelling of Lake Huron and Lake Saint Clair in out_vcerare_hourly_available_capacity_factor and related tables. See issue #4007 and PR #4029.
Quality of Life Improvements
We added a sources parameter to pudl.metadata.classes.DataSource.from_id() in order to make it possible to use the pudl-archiver repository to archive datasets that won’t necessarily be ingested into PUDL. See this PUDL archiver issue and PRs #4003 and #4013.
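The general pattern here, a lookup classmethod that accepts an alternative registry instead of always reading a built-in one, can be sketched in a few lines. This is a simplified, self-contained stand-in and not PUDL's implementation: `ToyDataSource`, both registries, and their fields are invented for illustration, and the only detail taken from the note above is that `from_id()` takes a `sources` parameter.

```python
from dataclasses import dataclass

# Stand-in for a built-in metadata registry (contents are made up).
DEFAULT_SOURCES = {
    "eia860": {"title": "EIA Form 860"},
}


@dataclass(frozen=True)
class ToyDataSource:
    """Toy illustration only; not the real pudl.metadata.classes.DataSource."""

    name: str
    title: str

    @classmethod
    def from_id(cls, source_id, sources=DEFAULT_SOURCES):
        """Build an instance by looking up `source_id` in `sources`.

        Callers can pass their own registry to describe datasets that the
        default registry does not know about.
        """
        return cls(name=source_id, **sources[source_id])


# HYPOTHETICAL registry kept by an archiving tool for a dataset that is
# archived but never ingested by the main pipeline.
ARCHIVER_SOURCES = {
    "hypothetical_dataset": {"title": "Archived-only example dataset"},
}

built_in = ToyDataSource.from_id("eia860")
archived = ToyDataSource.from_id("hypothetical_dataset", sources=ARCHIVER_SOURCES)
print(built_in, archived)
```

Keeping the default registry as the parameter's default value means existing callers are unaffected, while other tools can supply their own metadata without modifying the built-in registry.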
Other PUDL v2025.2.0 Resources
PUDL v2025.2.0 Data Dictionary
PUDL v2025.2.0 Documentation
PUDL in the AWS Open Data Registry
PUDL v2025.2.0 in a free, public AWS S3 bucket: s3://pudl.catalyst.coop/v2025.2.0/
PUDL v2025.2.0 in a requester-pays GCS bucket: gs://pudl.catalyst.coop/v2025.2.0/
Zenodo archive of the PUDL GitHub repo for this release
PUDL v2025.2.0 release on GitHub
PUDL v2025.2.0 package in the Python Package Index (PyPI)
Contact Us
If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data. Here's a bunch of different ways to get in touch:
Follow us on GitHub
Use the PUDL GitHub issue tracker to let us know about any bugs or data issues you encounter.
GitHub Discussions is where we provide user support.
Watch our GitHub Project to see what we're working on.
Email us at hello@catalyst.coop for private communications.
On Mastodon: @CatalystCoop@mastodon.energy
On BlueSky: @catalyst.coop
On Twitter: @CatalystCoop
Connect with us on LinkedIn
Play with our data and notebooks on Kaggle
Combine our data with ML models on HuggingFace
Learn more about us on our website: https://catalyst.coop
Subscribe to our announcements list for email updates.