This dataset was created by Shail_2604
Released under Other (specified in description)
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PandasPlotBench
PandasPlotBench is a benchmark for assessing how well models can write visualization code given a description of a Pandas DataFrame. 🛠️ Task: given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper is available on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
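As a hedged illustration (assuming the benchmark loads with the standard Hugging Face datasets API), it can be inspected as follows:

```python
from datasets import load_dataset

# Load the benchmark from the Hugging Face Hub.
dataset = load_dataset("JetBrains-Research/PandasPlotBench")

# Inspect the available splits and records;
# the exact field names are described on the dataset page.
print(dataset)
```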
polyOne Data Set
The data set contains 100 million hypothetical polymers, each with 29 properties predicted using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers, but most have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.
Full data set including the properties
The data files are in Apache Parquet format. The files start with polyOne_*.parquet. I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.
Load sharded data set with dask
```python
import dask.dataframe as dd
ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```

For example, compute the description of the data set:

```python
df_describe = ddf.describe().compute()
df_describe
```
PSMILES strings only
generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
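As a minimal sketch (assuming the plain-text files are available locally), the PSMILES files can be read one string per line:

```python
# Read PSMILES strings, one per line, into a list.
with open("generated_polymer_smiles_train.txt") as f:
    psmiles_train = [line.strip() for line in f]

print(len(psmiles_train), psmiles_train[:3])
```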
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DNA methylation can regulate gene expression without changing the genome sequence, which helps organisms adapt rapidly to new environments. However, few studies have been reported in non-model mammals. The giant panda (Ailuropoda melanoleuca) is a flagship species for global biodiversity conservation, and wildness training and reintroduction are important components of giant panda conservation. However, it is unclear how wildness training affects the epigenetics of giant pandas, and we lack the means to assess the adaptive capacity of giant pandas undergoing wildness training. We comparatively analyzed genome-level methylation differences between captive giant pandas with and without wildness training to determine whether methylation modification plays a role in the adaptive response of wildness-training pandas. Whole-genome DNA methylation sequencing showed that the genomic cytosine methylation ratio of all samples was 5.35%–5.49%, and the methylation ratio at CpG sites was the highest. Differential methylation analysis identified 544 differentially methylated genes (DMGs). KEGG pathway enrichment of the DMGs showed that VAV3, PLCG2, TEC and PTPRC participate in multiple immune-related pathways and may contribute to the immune response of wildness-training giant pandas by regulating adaptive immune cells. A large number of DMGs enriched in GO terms may also be related to the regulation of immune activation during wildness training. Promoter differential methylation analysis identified 1,199 genes with differential methylation in promoter regions. Genes with low promoter methylation and high expression, such as CCL5, P2Y13, GZMA, ANP32A, VWF, MYOZ1, NME7, MRPS31 and TPM1, are important for environmental adaptation in wildness-training giant pandas. The methylation and expression patterns of these genes indicate that wildness-training giant pandas have strong immunity, blood coagulation, athletic ability and disease resistance. The adaptive response of giant pandas undergoing wildness training may be regulated by negatively correlated promoter methylation. We are the first to describe the DNA methylation profile of giant panda blood tissue, and our results indicate that methylation modification is involved in the adaptation of captive giant pandas undergoing wildness training. Our study also provides potential monitoring indicators for the successful reintroduction of valuable and threatened animals to the wild.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.
Annotation
The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test sets follows the splits of the original datasets.
Installation
pip install pandas pyarrow
Example
import pandas as pd
df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
dataset                                          AudioSet
filename                            train/---2_BBVHAA.mp3
captions_visual       [a man in a black hat and glasses.]
captions_auditory        [a man speaks and dishes clank.]
tags                                             [Speech]
Description
The annotation file consists of the following fields:
filename: Name of the corresponding file (video or audio file)
dataset: Source dataset associated with the data point
captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
captions_auditory: A list of captions related to the auditory content of the video
tags: A list of tags classifying the sound of a file. It can be NaN if no tags are provided
Data files
The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
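Building on the example above, a hedged sketch of how the annotations could be filtered, e.g. keeping only AudioSet rows that have visual captions (column names as documented above):

```python
import pandas as pd

df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')

# Keep only rows from the AudioSet source that have visual captions.
audioset = df[df['dataset'] == 'AudioSet']
with_visual = audioset[audioset['captions_visual'].notna()]
print(len(with_visual))
```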
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a combined math dataset in which every solution is an integer. Code for the first dataset:
```python
# Required libraries
from datasets import load_dataset
import pandas as pd
import numpy as np

# Load the dataset from Hugging Face
dataset = load_dataset("AI-MO/NuminaMath-1.5")

# Convert to pandas DataFrame
df = pd.DataFrame(dataset['train'])

def is_valid_integer(x):
    try:
        # Convert to string and strip whitespace
        val = str(x).strip()
        # Check if it's a…
```

See the full description on the dataset page: https://huggingface.co/datasets/purefalcon/aminox_gpro.
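The snippet above is truncated on the dataset page; a hypothetical completion of such an integer-validity filter (not the dataset author's exact code, and the column name "answer" is an assumption) might look like this:

```python
def is_valid_integer(x):
    try:
        # Convert to string, strip whitespace, and require an exact base-10 integer.
        val = str(x).strip()
        return val == str(int(val))
    except (ValueError, TypeError):
        return False

# Keep only rows whose solution parses as an integer;
# the column name "answer" is illustrative only.
df_int = df[df["answer"].apply(is_valid_integer)]
```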
Wolong National Nature Reserve (hereafter Wolong) is an internationally renowned giant panda (Ailuropoda melanoleuca) reserve. The reserve is also a popular tourist destination within Giant Panda National Park. It encompasses two major towns, Wolong and Gengda, home to approximately 5,000 residents. Agriculture, the tourist economy, and livestock grazing remain important income sources for the residents. Here, we combine ongoing survey data on human activity areas in Wolong to produce two layers of major human activities: human pressure and livestock grazing. The human pressure layer combines road and settlement data. We use this dataset to reflect the extent of human activities in Wolong and as an important indicator for assessing their interference with wildlife.
Human Activities Data in Wolong National Nature Reserve
https://doi.org/10.5061/dryad.kh18932fm
This dataset describes human activities in Wolong National Nature Reserve (hereafter Wolong), including a human pressure layer and a livestock grazing layer. We use these two layers to reflect the human activity areas in Wolong.
The human pressure factor is formed by integrating road points and residential points. Road points are derived by placing points every 1,000 m along major roads. The residential points are derived from GPS data we collected in Wolong in 2016, as well as data we obtained during surveys in Wolong. In addition, we used Google Earth to align the residential points.
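For illustration, a hedged sketch of how points could be placed every 1,000 m along a road line with shapely (not necessarily the tool used by the authors; the road geometry below is hypothetical):

```python
from shapely.geometry import LineString

# A hypothetical road geometry in a metric coordinate system.
road = LineString([(0, 0), (2500, 0), (2500, 2000)])

# Place a point every 1,000 m along the line.
spacing = 1000
road_points = [road.interpolate(d) for d in range(0, int(road.length) + 1, spacing)]
print([(p.x, p.y) for p in road_points])
```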
The livestock grazing points were extracted from the 4th National Giant Panda Survey. Furthermore, we also used data obtained during surveys in Wolong to compare the livestock grazing distribution points to ens...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Benchmark Dataset for Deep Learning-based Methods for 3D Topology Optimization.
One can find a description of the provided dataset partitions in Section 3 of Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.
Every dataset container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and a corresponding binarized SIMP solution. Every file of the form {i}.csv contains all voxel-wise information about the sample i. Every file of the form {i}_info.csv contains scalar parameters of the topology optimization problem, such as material parameters.
This dataset represents topology optimization problems and solutions on the basis of voxels. We define all spatially varying quantities via the voxels' centers -- rather than via the vertices or surfaces of the voxels.
In {i}.csv files, each row corresponds to one voxel in the design space. The columns correspond to ['x', 'y', 'z', 'design_space', 'dirichlet_x', 'dirichlet_y', 'dirichlet_z', 'force_x', 'force_y', 'force_z', 'density'].
Any of these files with the index i can be imported using pandas by executing:
import pandas as pd
directory = ...
file_path = f'{directory}/{i}.csv'
column_names = ['x', 'y', 'z', 'design_space', 'dirichlet_x', 'dirichlet_y', 'dirichlet_z', 'force_x', 'force_y', 'force_z', 'density']
data = pd.read_csv(file_path, names=column_names)
From this pandas dataframe one can extract the torch tensors of forces F, Dirichlet conditions ω_Dirichlet, and design space information ω_design using the following functions:
import torch
def get_shape_and_voxels(data):
    shape = data[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
    vox_x = data['x'].values
    vox_y = data['y'].values
    vox_z = data['z'].values
    voxels = [vox_x, vox_y, vox_z]
    return shape, voxels

def get_forces_boundary_conditions_and_design_space(data, shape, voxels):
    F = torch.zeros(3, *shape, dtype=torch.float32)
    F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_x'].values, dtype=torch.float32)
    F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_y'].values, dtype=torch.float32)
    F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_z'].values, dtype=torch.float32)
    ω_Dirichlet = torch.zeros(3, *shape, dtype=torch.float32)
    ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_x'].values, dtype=torch.float32)
    ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_y'].values, dtype=torch.float32)
    ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_z'].values, dtype=torch.float32)
    ω_design = torch.zeros(1, *shape, dtype=int)
    ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(data['design_space'].values.astype(int))
    return F, ω_Dirichlet, ω_design
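For instance, the two helpers can be chained for one loaded sample (a brief usage sketch):
shape, voxels = get_shape_and_voxels(data)
F, ω_Dirichlet, ω_design = get_forces_boundary_conditions_and_design_space(data, shape, voxels)
print(F.shape, ω_Dirichlet.shape, ω_design.shape)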
The corresponding {i}_info.csv files only have one row with column labels ['E', 'ν', 'σ_ys', 'vox_size', 'p_x', 'p_y', 'p_z'].
Analogously to above, one can import any {i}_info.csv file by executing:
file_path = f'{directory}/{i}_info.csv'
data_info_column_names = ['E', 'ν', 'σ_ys', 'vox_size', 'p_x', 'p_y', 'p_z']
data_info = pd.read_csv(file_path, names=data_info_column_names)
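The scalar parameters can then be read from the single row, for example (a brief usage sketch):
E = data_info['E'].iloc[0]
ν = data_info['ν'].iloc[0]
vox_size = data_info['vox_size'].iloc[0]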
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The relationship between the parameter of interest and the covariate is assumed to be linear (on the logit scale) unless specified otherwise.
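In equation form (a generic sketch; the symbols are illustrative and not taken from the dataset), this default assumption reads:

\[ \operatorname{logit}(p) = \log\frac{p}{1-p} = \beta_0 + \beta_1 x \]

where p is the parameter of interest and x is the covariate.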
This dataset was created by ChienYiChi
Released under Other (specified in description)
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Overview
Original dataset page here and dataset available here.
Dataset curation
Added a new column label with encoded labels using the following mapping: {"entailment": 0, "neutral": 1, "contradiction": 2}. The columns with parse information are dropped as they are not well formatted. Also, the name of the file from which each instance comes is added in the column dtype.
Code to create the dataset
import pandas as pd
from datasets import Dataset
… See the full description on the dataset page: https://huggingface.co/datasets/pietrolesci/stress_tests_nli.
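Since the snippet above is truncated, here is a hedged sketch of how the label encoding described under Dataset curation might be applied; the source column name gold_label is an assumption, not necessarily the one used in the original code:

```python
import pandas as pd

# Hypothetical raw dataframe with a textual gold_label column.
df = pd.DataFrame({"gold_label": ["entailment", "neutral", "contradiction"]})

# Encode labels as described in the curation notes.
mapping = {"entailment": 0, "neutral": 1, "contradiction": 2}
df["label"] = df["gold_label"].map(mapping)
print(df)
```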
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource serves as a template for creating a curve number grid raster file, which can be used to create corresponding maps or for further processing; soil data and reclassified land-use raster files are created along the way. The user has to provide, or connect to, a set of shapefiles including the boundary of the watershed, the soil data and land-use covering this watershed, a land-use reclassification table, and a curve number lookup table. The script contained in this resource mainly uses PyQGIS through a Jupyter Notebook for the majority of the processing, with a touch of Pandas for data manipulation. A detailed description of the procedure is given in the comments of the script.
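As a hedged illustration of the tabular part of this workflow, reclassified land-use classes and soil groups could be joined to a curve number lookup table with Pandas; the column names and values below are assumptions for illustration, not those used in the notebook:

```python
import pandas as pd

# Hypothetical per-cell attributes extracted from the reclassified rasters.
cells = pd.DataFrame({"landuse_class": ["urban", "forest"], "soil_group": ["B", "C"]})

# Hypothetical curve number lookup table.
cn_lookup = pd.DataFrame({
    "landuse_class": ["urban", "urban", "forest", "forest"],
    "soil_group": ["B", "C", "B", "C"],
    "curve_number": [85, 90, 60, 70],
})

# Attach a curve number to every cell combination.
cells_cn = cells.merge(cn_lookup, on=["landuse_class", "soil_group"], how="left")
print(cells_cn)
```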
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all the relevant data for the algorithms described in the paper "Irradiance and cloud optical properties from solar photovoltaic systems", which were developed within the framework of the MetPVNet project.
Input data:
COSMO weather model data (DWD) as NetCDF files (cosmo_d2_2018(9).tar.gz)
COSMO atmospheres for libRadtran (cosmo_atmosphere_libradtran_input.tar.gz)
COSMO surface data for calibration (cosmo_pvcal_output.tar.gz)
Aeronet data as text files (MetPVNet_Aeronet_Input_Data.zip)
Measured data from the MetPVNet measurement campaigns as text files (MetPVNet_Messkampagne_2018(9).tar.gz)
PV power data
Horizontal and tilted irradiance from pyranometers
Longwave irradiance from pyrgeometer
MYSTIC-based lookup table for translating tilted to horizontal irradiance (gti2ghi_lut_v1.nc)
Output data:
Global tilted irradiance (GTI) inferred from PV power plants (with calibration parameters in comments)
Linear temperature model: MetPVNet_gti_cf_inversion_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_gti_cf_inversion_results_faiman.tar.gz
Global horizontal irradiance (GHI) inferred from PV power plants
Linear temperature model: MetPVNet_ghi_inversion_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_ghi_inversion_results_faiman.tar.gz
Combined GHI averaged to 60 minutes and compared with COSMO data
Linear temperature model: MetPVNet_ghi_inversion_combo_60min_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_ghi_inversion_combo_60min_results_faiman.tar.gz
Cloud optical depth inferred from PV power plants
Linear temperature model: MetPVNet_cod_cf_inversion_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_cod_cf_inversion_results_faiman.tar.gz
Combined COD averaged to 60 minutes and compared with COSMO and APOLLO_NG data
Linear temperature model: MetPVNet_cod_inversion_combo_60min_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_cod_inversion_combo_60min_results_faiman.tar.gz
Validation data:
COSMO cloud optical depth (cosmo_cod_output.tar.gz)
APOLLO_NG cloud optical depth (MetPVNet_apng_extract_all_stations_2018(9).tar.gz)
COSMO irradiance data for validation (cosmo_irradiance_output.tar.gz)
CAMS irradiance data for validation (CAMS_irradiation_detailed_MetPVNet_MK_2018(9).zip)
How to import results:
The results files are stored as text files ".dat", using Python multi-index columns. In order to import the data into a Pandas dataframe, use the following lines of code (replace [filename] with the relevant file name):
import pandas as pd
data = pd.read_csv("[filename].dat", comment='#', header=[0,1], delimiter=';', index_col=0, parse_dates=True)
This gives a multi-index DataFrame: the index column is the timestamp, the first column level corresponds to the measured variable, and the second level to the relevant sensor.
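For example, a single variable/sensor column can then be selected from the column MultiIndex; the labels "GHI" and "station_1" below are placeholders, not actual column names from the files:
series = data[("GHI", "station_1")]
series.head()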
Note:
The output data has been updated to match the latest version of the paper, whereas the input and validation data remain the same as in Version 1.0.0.
Citation
@article{DBLP:journals/corr/abs-2101-04775,
  author     = {Bingchen Liu and Yizhe Zhu and Kunpeng Song and Ahmed Elgammal},
  title      = {Towards Faster and Stabilized {GAN} Training for High-fidelity Few-shot Image Synthesis},
  journal    = {CoRR},
  volume     = {abs/2101.04775},
  year       = {2021},
  url        = {https://arxiv.org/abs/2101.04775},
  eprinttype = {arXiv},
  eprint     = {2101.04775}…
See the full description on the dataset page: https://huggingface.co/datasets/huggan/few-shot-panda.
This dataset was created by Shail_2604
Released under Other (specified in description)