polyOne Data Set
The data set contains 100 million hypothetical polymers, each with 29 properties predicted by machine learning models. Polymer structures are represented as PSMILES strings (see here and here). The polymers were generated by decomposing previously synthesized polymers into unique chemical fragments; random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers, but most have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.
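For orientation, a PSMILES string is an ordinary SMILES string in which the two endpoints of the polymer repeat unit are marked with `[*]`. The strings below are illustrative sketches only and are not taken from the data set:

```python
# Illustrative PSMILES strings (not drawn from the data set).
# The two [*] atoms mark the connection points of the repeat unit.
example_psmiles = [
    "[*]CC[*]",            # polyethylene-like repeat unit
    "[*]CC([*])c1ccccc1",  # polystyrene-like repeat unit
]

for ps in example_psmiles:
    print(ps)
```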
Full data set including the properties
The data files are in Apache Parquet format; their names match the pattern `polyOne_*.parquet`.
I recommend using dask (`pip install dask`) to load and process the data set. Pandas also works but is slower.
Load sharded data set with dask
```python
import dask.dataframe as dd

# Read all Parquet shards into a single lazy dask dataframe
ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```

For example, compute summary statistics of the data set:

```python
df_describe = ddf.describe().compute()
df_describe
```
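Because dask evaluates lazily, you can also filter the data before materializing anything in memory. The sketch below assumes a placeholder property column named `Egc`; replace it with one of the 29 property columns actually listed by `ddf.columns`:

```python
import dask.dataframe as dd

ddf = dd.read_parquet("polyOne_*.parquet", engine="pyarrow")

# Inspect which property columns are available before filtering.
print(ddf.columns.tolist())

# "Egc" is an assumed placeholder column name; substitute a real
# property column from the listing above.
subset = ddf[ddf["Egc"] > 4.0]

# Materialize only the filtered rows into a pandas DataFrame.
df_subset = subset.compute()
print(len(df_subset))
```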
PSMILES strings only
`generated_polymer_smiles_train.txt` - 80 million PSMILES strings for training polyBERT. One string per line.
`generated_polymer_smiles_dev.txt` - 20 million PSMILES strings for testing polyBERT. One string per line.
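Since both files are line-delimited plain text, they can be streamed without loading all strings into memory. A minimal sketch, assuming the files sit in the working directory:

```python
from itertools import islice

# Stream PSMILES strings one line at a time to avoid loading
# all 80 million training strings into memory at once.
def iter_psmiles(path):
    with open(path, "r") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield line

# Example: count the first 1,000 strings from the training split.
n = sum(1 for _ in islice(iter_psmiles("generated_polymer_smiles_train.txt"), 1000))
print(n)
```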