Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
What follows is research code. It is by no means optimized for speed, efficiency, or readability.
Data loading, tokenizing and sharding
```python
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from tqdm.notebook import tqdm
from openTSNE import TSNE
import datashader as ds
import colorcet as cc
from dask.distributed import Client
import dask.dataframe as dd
import dask_ml
```
… See the full description on the dataset page: https://huggingface.co/datasets/christopher/roots-tsne-data.
polyOne Data Set
The data set contains 100 million hypothetical polymers, each with 29 properties predicted using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.
Full data set including the properties
The data files are in Apache Parquet format. The files start with `polyOne_*.parquet`.
I recommend using dask (`pip install dask`) to load and process the data set. Pandas also works but is slower.
Load sharded data set with dask
```python
import dask.dataframe as dd
ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```
For example, to compute summary statistics of the data set:
```python
df_describe = ddf.describe().compute()
df_describe
```
PSMILES strings only
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This is the first data release from the Public Utility Data Liberation (PUDL) project. It can be referenced & cited using https://doi.org/10.5281/zenodo.3653159
For more information about the free and open source software used to generate this data release, see Catalyst Cooperative's PUDL repository on GitHub, and the associated documentation on Read The Docs. This data release was generated using v0.3.1 of the catalystcoop.pudl Python package.
Included Data Packages
This release consists of three tabular data packages, conforming to the standards published by Frictionless Data and the Open Knowledge Foundation. The data are stored in CSV files (some of which are compressed using gzip), and the associated metadata is stored as JSON. These tabular data can be used to populate a relational database.
pudl-eia860-eia923:
pudl-eia860-eia923-epacems: Includes all the data from the pudl-eia860-eia923 package above, as well as the Hourly Emissions data from the US Environmental Protection Agency's (EPA's) Continuous Emissions Monitoring System (CEMS) from 1995-2018. The EPA CEMS data covers thousands of power plants at hourly resolution for decades, and contains close to a billion records.
pudl-ferc1:
All of the data packages were produced using the catalystcoop.pudl Python package and the original source data files archived as part of this data release.
Contact Us
If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data.
Using the Data
The data packages are just CSVs (data) and JSON (metadata) files. They can be used with a variety of tools on many platforms. However, the data is organized primarily with the idea that it will be loaded into a relational database, and the PUDL Python package that was used to generate this data release can facilitate that process. Once the data is loaded into a database, you can access that DB however you like.
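If you just want to peek at the tables without setting up a database, you can read the CSVs directly with pandas. A minimal sketch (not part of the original release instructions), assuming a package has been extracted so that its datapackage.json and resources sit at the paths referenced in the metadata:
```python
import json
import pandas as pd

# Inspect the package metadata to see which CSV resources it contains.
pkg_dir = "datapkg/pudl-data-release/pudl-eia860-eia923"
with open(f"{pkg_dir}/datapackage.json") as f:
    metadata = json.load(f)
paths = [resource["path"] for resource in metadata["resources"]]
print(paths)

# Read one resource into a DataFrame; pandas reads gzip-compressed CSVs transparently.
df = pd.read_csv(f"{pkg_dir}/{paths[0]}")
print(df.head())
```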
Make sure conda is installed
None of these commands will work without the conda Python package manager installed, either via Anaconda or miniconda.
Download the data
First download the files from the Zenodo archive into a new empty directory. A couple of them are very large (5-10 GB), and depending on what you're trying to do you may not need them:
pudl-input-data.tgz
pudl-eia860-eia923-epacems.tgz
Load All of PUDL in a Single Line
Use cd to get into your new directory at the terminal (in Linux or Mac OS), or open up an Anaconda terminal in that directory if you're on Windows.
If you have downloaded all of the files from the archive, and you want it all to be accessible locally, you can run a single shell script, called load-pudl.sh:
```bash
bash load-pudl.sh
```
This will do the following:
Load the FERC Form 1 and EIA 860/923 data into an SQLite database at sqlite/pudl.sqlite.
Convert the EPA CEMS hourly emissions data into an Apache Parquet dataset under parquet/epacems.
Clone the raw FERC Form 1 databases into sqlite/ferc1.sqlite.
Selectively Load PUDL Data
If you don't want to download and load all of the PUDL data, you can load each of the above datasets separately.
Create the PUDL conda Environment
This installs the PUDL software locally, and a couple of other useful packages:
```bash
conda create --yes --name pudl --channel conda-forge \
    --strict-channel-priority \
    python=3.7 catalystcoop.pudl=0.3.1 dask jupyter jupyterlab seaborn pip
conda activate pudl
```
Create a PUDL data management workspace
Use the PUDL setup script to create a new data management environment inside this directory. After you run this command you'll see some other directories show up, like parquet, sqlite, data, etc.
```bash
pudl_setup ./
```
Extract and load the FERC Form 1 and EIA 860/923 data
If you just want the FERC Form 1 and EIA 860/923 data that has been integrated into PUDL, you only need to download pudl-ferc1.tgz and pudl-eia860-eia923.tgz. Then extract them in the same directory where you ran pudl_setup:
```bash
tar -xzf pudl-ferc1.tgz
tar -xzf pudl-eia860-eia923.tgz
```
To make use of the FERC Form 1 and EIA 860/923 data, you'll probably want to load them into a local database. The datapkg_to_sqlite script that comes with PUDL will do that for you:
```bash
datapkg_to_sqlite \
    datapkg/pudl-data-release/pudl-ferc1/datapackage.json \
    datapkg/pudl-data-release/pudl-eia860-eia923/datapackage.json \
    -o datapkg/pudl-data-release/pudl-merged/
```
Now you should be able to connect to the database (~300 MB), which is stored in sqlite/pudl.sqlite.
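Any SQLite client will do; as one illustration (not part of the original instructions), here is a minimal Python sketch using the standard library's sqlite3 module to list the tables in the merged database:
```python
import sqlite3

# Open the merged PUDL database produced by datapkg_to_sqlite.
conn = sqlite3.connect("sqlite/pudl.sqlite")

# List the tables so you can see what was loaded.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
).fetchall()
print([name for (name,) in tables])

conn.close()
```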
Extract EPA CEMS and convert to Apache Parquet
If you want to work with the EPA CEMS data, which is much larger, we recommend converting it to an Apache Parquet dataset with the included epacems_to_parquet script. Then you can read those files into dataframes directly; in Python you can use the pandas.read_parquet() function. If you need to work with more data than can fit in memory at one time, we recommend using Dask dataframes. Converting the entire dataset from datapackages into Apache Parquet may take an hour or more:
```bash
tar -xzf pudl-eia860-eia923-epacems.tgz
epacems_to_parquet datapkg/pudl-data-release/pudl-eia860-eia923-epacems/datapackage.json
```
You should find the Parquet dataset (~5 GB) under parquet/epacems, partitioned by year and state for easier querying.
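As an illustrative sketch (not from the original release notes) of reading the partitioned output with Dask: the partition column names year and state follow the layout described above, but verify them against the actual dataset before filtering.
```python
import dask.dataframe as dd

# Lazily open the partitioned EPA CEMS Parquet dataset.
# Filtering on the partition columns (year, state) skips files for other partitions.
cems = dd.read_parquet(
    "parquet/epacems",
    engine="pyarrow",
    filters=[("year", "==", 2018), ("state", "==", "CO")],
)

print(cems.head())  # peek at a few rows
print(len(cems))    # count the selected records (triggers computation)
```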
Clone the raw FERC Form 1 Databases
If you want to access the entire set of original, raw FERC Form 1 data (of which only a small subset has been cleaned and integrated into PUDL), you can extract the original input data that's part of the Zenodo archive and run the ferc1_to_sqlite script using the same settings file that was used to generate the data release:
```bash
tar -xzf pudl-input-data.tgz
ferc1_to_sqlite data-release-settings.yml
```
You'll find the FERC Form 1 database (~820 MB) in sqlite/ferc1.sqlite.
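If you'd rather explore the raw FERC Form 1 tables from Python than through an SQLite client, a small pandas sketch like the following works; the table name f1_steam is only an example, so list the tables first (as in the pudl.sqlite example above) if you're unsure what's available:
```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("sqlite/ferc1.sqlite")

# "f1_steam" is used here as an example table; query sqlite_master to see all tables.
df = pd.read_sql_query("SELECT * FROM f1_steam LIMIT 10;", conn)
print(df)

conn.close()
```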
Data Quality Control
We have performed basic sanity checks on much, but not all, of the data compiled in PUDL to ensure that we identify any major issues we might have introduced through our processing.