Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
What follows is research code. It is by no means optimized for speed, efficiency, or readability.
Data loading, tokenizing and sharding
```python
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from tqdm.notebook import tqdm
from openTSNE import TSNE
import datashader as ds
import colorcet as cc
from dask.distributed import Client
import dask.dataframe as dd
import dask_ml
```
… See the full description on the dataset page: https://huggingface.co/datasets/christopher/roots-tsne-data.
polyOne Data Set
The data set contains 100 million hypothetical polymers, each with 29 properties predicted using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.
Full data set including the properties
The data files are in Apache Parquet format. The files start with `polyOne_*.parquet`.
I recommend using dask (`pip install dask`) to load and process the data set. Pandas also works but is slower.
Load sharded data set with dask
```python
import dask.dataframe as dd
ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```
For example, to compute summary statistics of the data set:
```python
df_describe = ddf.describe().compute()
df_describe
```
PSMILES strings only
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This is the first data release from the Public Utility Data Liberation (PUDL) project. It can be referenced & cited using https://doi.org/10.5281/zenodo.3653159
For more information about the free and open source software used to generate this data release, see Catalyst Cooperative's PUDL repository on GitHub, and the associated documentation on Read The Docs. This data release was generated using v0.3.1 of the catalystcoop.pudl Python package.
Included Data Packages
This release consists of three tabular data packages, conforming to the standards published by Frictionless Data and the Open Knowledge Foundation. The data are stored in CSV files (some of which are compressed using gzip), and the associated metadata is stored as JSON. These tabular data can be used to populate a relational database.
pudl-eia860-eia923:
pudl-eia860-eia923-epacems: Includes all the data from the pudl-eia860-eia923 package above, as well as the Hourly Emissions data from the US Environmental Protection Agency's (EPA's) Continuous Emissions Monitoring System (CEMS) from 1995-2018. The EPA CEMS data covers thousands of power plants at hourly resolution for decades, and contains close to a billion records.
pudl-ferc1:
All of the data packages were produced using the catalystcoop.pudl Python package and the original source data files archived as part of this data release.
Contact Us
If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data.
Using the Data
The data packages are just CSVs (data) and JSON (metadata) files. They can be used with a variety of tools on many platforms. However, the data is organized primarily with the idea that it will be loaded into a relational database, and the PUDL Python package that was used to generate this data release can facilitate that process. Once the data is loaded into a database, you can access that DB however you like.
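If you just want to peek at the tables without setting up a database, you can read the CSVs directly with pandas. A minimal sketch (not part of the original release instructions), assuming a package has been extracted so that its datapackage.json and resources sit at the paths referenced in the metadata:
```python
import json
import pandas as pd

# Inspect the package metadata to see which CSV resources it contains.
pkg_dir = "datapkg/pudl-data-release/pudl-eia860-eia923"
with open(f"{pkg_dir}/datapackage.json") as f:
    metadata = json.load(f)
paths = [resource["path"] for resource in metadata["resources"]]
print(paths)

# Read one resource into a DataFrame; pandas reads gzip-compressed CSVs transparently.
df = pd.read_csv(f"{pkg_dir}/{paths[0]}")
print(df.head())
```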
Make sure conda is installed
None of these commands will work without the conda Python package manager installed, either via Anaconda or miniconda.
Download the data
First download the files from the Zenodo archive into a new empty directory. A couple of them are very large (5-10 GB), and depending on what you're trying to do you may not need them:
pudl-input-data.tgz
pudl-eia860-eia923-epacems.tgz
Load All of PUDL in a Single Line
Use cd to get into your new directory at the terminal (in Linux or Mac OS), or open up an Anaconda terminal in that directory if you're on Windows.
If you have downloaded all of the files from the archive, and you want it all to be accessible locally, you can run a single shell script, called load-pudl.sh:
```bash
bash load-pudl.sh
```
This will do the following:
Load the FERC Form 1 and EIA 860/923 data into an SQLite database at sqlite/pudl.sqlite.
Convert the EPA CEMS hourly emissions data into an Apache Parquet dataset under parquet/epacems.
Clone the raw FERC Form 1 databases into sqlite/ferc1.sqlite.
Selectively Load PUDL Data
If you don't want to download and load all of the PUDL data, you can load each of the above datasets separately.
Create the PUDL conda Environment
This installs the PUDL software locally, and a couple of other useful packages:
```bash
conda create --yes --name pudl --channel conda-forge \
    --strict-channel-priority \
    python=3.7 catalystcoop.pudl=0.3.1 dask jupyter jupyterlab seaborn pip
conda activate pudl
```
Create a PUDL data management workspace
Use the PUDL setup script to create a new data management environment inside this directory. After you run this command you'll see some other directories show up, like parquet, sqlite, data, etc.
```bash
pudl_setup ./
```
Extract and load the FERC Form 1 and EIA 860/923 data
If you just want the FERC Form 1 and EIA 860/923 data that has been integrated into PUDL, you only need to download pudl-ferc1.tgz and pudl-eia860-eia923.tgz. Then extract them in the same directory where you ran pudl_setup:
```bash
tar -xzf pudl-ferc1.tgz
tar -xzf pudl-eia860-eia923.tgz
```
To make use of the FERC Form 1 and EIA 860/923 data, you'll probably want to load them into a local database. The datapkg_to_sqlite script that comes with PUDL will do that for you:
```bash
datapkg_to_sqlite \
    datapkg/pudl-data-release/pudl-ferc1/datapackage.json \
    datapkg/pudl-data-release/pudl-eia860-eia923/datapackage.json \
    -o datapkg/pudl-data-release/pudl-merged/
```
Now you should be able to connect to the database (~300 MB), which is stored in sqlite/pudl.sqlite.
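Any SQLite client will do; as one illustration (not part of the original instructions), here is a minimal Python sketch using the standard library's sqlite3 module to list the tables in the merged database:
```python
import sqlite3

# Open the merged PUDL database produced by datapkg_to_sqlite.
conn = sqlite3.connect("sqlite/pudl.sqlite")

# List the tables so you can see what was loaded.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
).fetchall()
print([name for (name,) in tables])

conn.close()
```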
Extract EPA CEMS and convert to Apache Parquet
If you want to work with the EPA CEMS data, which is much larger, we recommend converting it to an Apache Parquet dataset with the included epacems_to_parquet script. Then you can read those files into dataframes directly; in Python you can use the pandas.read_parquet() function. If you need to work with more data than can fit in memory at one time, we recommend using Dask dataframes. Converting the entire dataset from datapackages into Apache Parquet may take an hour or more:
```bash
tar -xzf pudl-eia860-eia923-epacems.tgz
epacems_to_parquet datapkg/pudl-data-release/pudl-eia860-eia923-epacems/datapackage.json
```
You should find the Parquet dataset (~5 GB) under parquet/epacems, partitioned by year and state for easier querying.
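As an illustrative sketch (not from the original release notes) of reading the partitioned output with Dask: the partition column names year and state follow the layout described above, but verify them against the actual dataset before filtering.
```python
import dask.dataframe as dd

# Lazily open the partitioned EPA CEMS Parquet dataset.
# Filtering on the partition columns (year, state) skips files for other partitions.
cems = dd.read_parquet(
    "parquet/epacems",
    engine="pyarrow",
    filters=[("year", "==", 2018), ("state", "==", "CO")],
)

print(cems.head())  # peek at a few rows
print(len(cems))    # count the selected records (triggers computation)
```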
Clone the raw FERC Form 1 Databases
If you want to access the entire set of original, raw FERC Form 1 data (of which only a small subset has been cleaned and integrated into PUDL), you can extract the original input data that's part of the Zenodo archive and run the ferc1_to_sqlite script using the same settings file that was used to generate the data release:
```bash
tar -xzf pudl-input-data.tgz
ferc1_to_sqlite data-release-settings.yml
```
You'll find the FERC Form 1 database (~820 MB) in sqlite/ferc1.sqlite.
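If you'd rather explore the raw FERC Form 1 tables from Python than through an SQLite client, a small pandas sketch like the following works; the table name f1_steam is only an example, so list the tables first (as in the pudl.sqlite example above) if you're unsure what's available:
```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("sqlite/ferc1.sqlite")

# "f1_steam" is used here as an example table; query sqlite_master to see all tables.
df = pd.read_sql_query("SELECT * FROM f1_steam LIMIT 10;", conn)
print(df)

conn.close()
```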
Data Quality Control
We have performed basic sanity checks on much, but not all, of the data compiled in PUDL to ensure that we identify any major issues we might have introduced through our processing.