3 datasets found
  1. roots-tsne-data

    • huggingface.co
    Updated May 16, 2023
    Cite
    Christopher Akiki (2023). roots-tsne-data [Dataset]. https://huggingface.co/datasets/christopher/roots-tsne-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 16, 2023
    Authors
    Christopher Akiki
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    What follows is research code. It is by no means optimized for speed, efficiency, or readability.

      Data loading, tokenizing and sharding
    

    ```python
    import os
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.decomposition import TruncatedSVD
    from tqdm.notebook import tqdm
    from openTSNE import TSNE
    import datashader as ds
    import colorcet as cc

    from dask.distributed import Client
    import dask.dataframe as dd
    import dask_ml
    # … (truncated)
    ```

    See the full description on the dataset page: https://huggingface.co/datasets/christopher/roots-tsne-data.
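    The imports above suggest a TF-IDF, TruncatedSVD, then t-SNE pipeline, rendered with datashader. Below is a minimal sketch of that flow on a toy corpus; the vectorizer choice and every parameter are illustrative assumptions, not the author's actual settings (those are only on the dataset page).

    ```python
    # Minimal sketch of a TF-IDF -> TruncatedSVD -> t-SNE pipeline, inferred from
    # the imports above. The corpus and all parameters are assumptions.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.decomposition import TruncatedSVD
    from openTSNE import TSNE

    corpus = [f"toy document number {i}" for i in range(8)]

    # TfidfTransformer expects a count matrix, so vectorize first.
    counts = CountVectorizer().fit_transform(corpus)
    tfidf = TfidfTransformer().fit_transform(counts)

    # Reduce the sparse TF-IDF matrix before t-SNE; 2 components only because
    # the toy corpus is tiny (the real code likely kept many more).
    reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

    # openTSNE's fit() returns the 2-D embedding directly.
    embedding = TSNE(perplexity=2, random_state=0).fit(reduced)
    print(np.asarray(embedding).shape)  # (8, 2)
    ```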

  2. polyOne Data Set - 100 million hypothetical polymers including 29 properties...

    • zenodo.org
    bin, txt
    Updated Mar 24, 2023
    Cite
    Christopher Kuenneth; Rampi Ramprasad (2023). polyOne Data Set - 100 million hypothetical polymers including 29 properties [Dataset]. http://doi.org/10.5281/zenodo.7766806
    Explore at:
    Available download formats: bin, txt
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christopher Kuenneth; Rampi Ramprasad
    Description

    polyOne Data Set

    The data set contains 100 million hypothetical polymers, each with 29 properties predicted by machine learning models. We use PSMILES strings to represent polymer structures (see the references linked on the dataset page). The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments; random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers, but most have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.

    Full data set including the properties

    The data files are in Apache Parquet format. The file names follow the pattern `polyOne_*.parquet`.

    I recommend using dask (`pip install dask`) to load and process the data set. Pandas also works but is slower.

    Load sharded data set with dask
    ```python
    import dask.dataframe as dd
    ddf = dd.read_parquet("*.parquet", engine="pyarrow")
    ```

    For example, compute summary statistics for the data set:
    ```python
    df_describe = ddf.describe().compute()
    df_describe
    ```

    PSMILES strings only

    • generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
    • generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
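    A minimal sketch for streaming these line-delimited files without loading everything into memory (the file name comes from the list above):

    ```python
    # Minimal sketch: iterate over PSMILES strings, one per line, lazily.
    def iter_psmiles(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # one PSMILES string per non-empty line
                    yield line

    # Example: count the strings in the training split (80 million expected).
    n = sum(1 for _ in iter_psmiles("generated_polymer_smiles_train.txt"))
    print(n)
    ```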
  3. PUDL Data Release v1.0.0

    • zenodo.org
    • explore.openaire.eu
    application/gzip, bin, sh
    Updated Aug 28, 2023
    Cite
    Zane A. Selvans; Christina M. Gosnell (2023). PUDL Data Release v1.0.0 [Dataset]. http://doi.org/10.5281/zenodo.3653159
    Explore at:
    Available download formats: application/gzip, bin, sh
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zane A. Selvans; Christina M. Gosnell
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is the first data release from the Public Utility Data Liberation (PUDL) project. It can be referenced & cited using https://doi.org/10.5281/zenodo.3653159

    For more information about the free and open source software used to generate this data release, see Catalyst Cooperative's PUDL repository on GitHub and the associated documentation on Read the Docs. This data release was generated using v0.3.1 of the catalystcoop.pudl Python package.

    Included Data Packages

    This release consists of three tabular data packages, conforming to the standards published by Frictionless Data and the Open Knowledge Foundation. The data are stored in CSV files (some of which are compressed using gzip), and the associated metadata is stored as JSON. These tabular data can be used to populate a relational database.
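    As a sketch of what "CSV plus JSON metadata" means in practice: each package's tables (resources) are enumerated in its datapackage.json, and pandas can read the CSVs directly, including the gzipped ones. The specific table file name below is hypothetical; real names come from the metadata.

    ```python
    # Minimal sketch: list a data package's tables from its JSON metadata, then
    # read one CSV with pandas. "some_table.csv.gz" is a hypothetical name.
    import json
    import pandas as pd

    with open("datapkg/pudl-data-release/pudl-ferc1/datapackage.json") as f:
        meta = json.load(f)
    print([resource["name"] for resource in meta["resources"]])

    # pandas transparently decompresses gzip-compressed CSV files.
    df = pd.read_csv("datapkg/pudl-data-release/pudl-ferc1/data/some_table.csv.gz")
    ```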

    • pudl-eia860-eia923:
      Data originally collected and published by the US Energy Information Administration (US EIA). The data from EIA Form 860 covers the years 2011-2018; the Form 923 data covers 2009-2018. The large majority of the data published in the original sources is included, but some parts, such as fuel stocks on hand and EIA 923 schedules 6, 7, & 8, have not yet been integrated.
    • pudl-eia860-eia923-epacems:
      This data package contains all of the same data as the pudl-eia860-eia923 package above, as well as the Hourly Emissions data from the US Environmental Protection Agency's (EPA's) Continuous Emissions Monitoring System (CEMS) from 1995-2018. The EPA CEMS data covers thousands of power plants at hourly resolution for decades, and contains close to a billion records.
    • pudl-ferc1:
      Seven data tables from FERC Form 1 are included, primarily relating to individual power plants, and covering the years 1994-2018 (the entire span of time for which FERC provides this data). These tables are the only ones which have been subjected to any cleaning or organization for programmatic use within PUDL. The complete, raw FERC Form 1 database contains 116 different tables with many thousands of columns of mostly financial data. We will archive a complete copy of the multi-year FERC Form 1 Database as a file-based SQLite database at Zenodo, independent of this data release. It can also be re-generated using the catalystcoop.pudl Python package and the original source data files archived as part of this data release.

    Contact Us

    If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist and how you're using the software or data.

    Using the Data

    The data packages are just CSVs (data) and JSON (metadata) files. They can be used with a variety of tools on many platforms. However, the data is organized primarily with the idea that it will be loaded into a relational database, and the PUDL Python package that was used to generate this data release can facilitate that process. Once the data is loaded into a database, you can access that DB however you like.

    Make sure conda is installed

    None of these commands will work without the conda Python package manager, installed either via Anaconda or Miniconda.

    Download the data

    First download the files from the Zenodo archive into a new empty directory. A couple of them are very large (5-10 GB), and depending on what you're trying to do you may not need them.

    • If you don't want to recreate the data release from scratch by re-running the entire ETL process, and you don't need a full clone of the original FERC Form 1 database (including the data that has not yet been integrated into PUDL), then you don't need to download pudl-input-data.tgz.
    • If you don't need the EPA CEMS Hourly Emissions data, you do not need to download pudl-eia860-eia923-epacems.tgz.

    Load All of PUDL in a Single Line

    Use cd to get into your new directory at the terminal (on Linux or macOS), or open an Anaconda terminal in that directory if you're on Windows.

    If you have downloaded all of the files from the archive, and you want it all to be accessible locally, you can run a single shell script, called load-pudl.sh:

    ```bash
    bash load-pudl.sh
    ```

    This will do the following:

    • Load the FERC Form 1, EIA Form 860, and EIA Form 923 data packages into an SQLite database which can be found at sqlite/pudl.sqlite.
    • Convert the EPA CEMS data package into an Apache Parquet dataset which can be found at parquet/epacems.
    • Clone all of the FERC Form 1 annual databases into a single SQLite database which can be found at sqlite/ferc1.sqlite.

    Selectively Load PUDL Data

    If you don't want to download and load all of the PUDL data, you can load each of the above datasets separately.

    Create the PUDL conda Environment

    This installs the PUDL software locally, and a couple of other useful packages:

    ```bash
    conda create --yes --name pudl --channel conda-forge \
      --strict-channel-priority \
      python=3.7 catalystcoop.pudl=0.3.1 dask jupyter jupyterlab seaborn pip
    conda activate pudl
    ```

    Create a PUDL data management workspace

    Use the PUDL setup script to create a new data management environment inside this directory. After you run this command you'll see some new directories show up, such as parquet, sqlite, and data.

    ```bash
    pudl_setup ./
    ```

    Extract and load the FERC Form 1 and EIA 860/923 data

    If you just want the FERC Form 1 and EIA 860/923 data that has been integrated into PUDL, you only need to download pudl-ferc1.tgz and pudl-eia860-eia923.tgz. Then extract them in the same directory where you ran pudl_setup:

    ```bash
    tar -xzf pudl-ferc1.tgz
    tar -xzf pudl-eia860-eia923.tgz
    ```

    To make use of the FERC Form 1 and EIA 860/923 data, you'll probably want to load them into a local database. The datapkg_to_sqlite script that comes with PUDL will do that for you:

    ```bash
    datapkg_to_sqlite \
      datapkg/pudl-data-release/pudl-ferc1/datapackage.json \
      datapkg/pudl-data-release/pudl-eia860-eia923/datapackage.json \
      -o datapkg/pudl-data-release/pudl-merged/
    ```

    Now you should be able to connect to the database (~300 MB) which is stored in sqlite/pudl.sqlite.
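    As a minimal sketch, you can inspect the result with Python's built-in sqlite3 module (the table-listing query is generic SQLite, so no PUDL table names are assumed):

    ```python
    # Minimal sketch: list the tables in the freshly loaded PUDL database.
    import sqlite3

    conn = sqlite3.connect("sqlite/pudl.sqlite")
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    print([name for (name,) in rows])
    conn.close()
    ```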

    Extract EPA CEMS and convert to Apache Parquet

    If you want to work with the EPA CEMS data, which is much larger, we recommend converting it to an Apache Parquet dataset with the included epacems_to_parquet script. You can then read those files into dataframes directly: in Python, use the pandas.read_parquet() function. If you need to work with more data than fits in memory at one time, we recommend Dask dataframes. Converting the entire dataset from data packages into Apache Parquet may take an hour or more:

    ```bash
    tar -xzf pudl-eia860-eia923-epacems.tgz
    epacems_to_parquet datapkg/pudl-data-release/pudl-eia860-eia923-epacems/datapackage.json
    ```

    You should find the Parquet dataset (~5 GB) under parquet/epacems, partitioned by year and state for easier querying.
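    Because the dataset is partitioned, readers can push filters down and load only a slice. A minimal sketch with Dask, assuming the partition columns are named year and state as the layout above suggests (the column names and filter values are assumptions):

    ```python
    # Minimal sketch: read one year/state slice of the partitioned Parquet data.
    # Partition column names ("year", "state") are assumed from the description.
    import dask.dataframe as dd

    ddf = dd.read_parquet(
        "parquet/epacems",
        engine="pyarrow",
        filters=[("year", "==", 2018), ("state", "==", "CO")],
    )
    print(ddf.head())
    ```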

    Clone the raw FERC Form 1 Databases

    If you want to access the entire set of original, raw FERC Form 1 data (of which only a small subset has been cleaned and integrated into PUDL) you can extract the original input data that's part of the Zenodo archive and run the ferc1_to_sqlite script using the same settings file that was used to generate the data release:

    ```bash
    tar -xzf pudl-input-data.tgz
    ferc1_to_sqlite data-release-settings.yml
    ```

    You'll find the FERC Form 1 database (~820 MB) in sqlite/ferc1.sqlite.

    Data Quality Control

    We have performed basic sanity checks on much, but not all, of the data compiled in PUDL to ensure that we identify any major issues we might have introduced through our processing.

