Dataset Card for Python-DPO
This dataset is a smaller version of the Python-DPO-Large dataset and was created using Argilla.
Load with datasets
To load this dataset with datasets, you'll just need to install datasets as pip install datasets --upgrade and then use the following code:
from datasets import load_dataset
ds = load_dataset("NextWealth/Python-DPO")
Data Fields
Each data instance contains:
instruction: The problem description/requirements… See the full description on the dataset page: https://huggingface.co/datasets/NextWealth/Python-DPO.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Benchmark Dataset for Deep Learning for 3D Topology Optimization
This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.
The following paper provides full documentation and examples:
Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.
The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.
For the i-th sample, the columns of i_info.csv correspond to the following scalar information:
- E - Young's modulus [Pa]
- ν - Poisson's ratio [-]
- σ_ys - a yield stress [Pa]
- h - discretization size of the voxel grid [m]
The columns of i.csv correspond to the following voxel-wise information:
- x, y, z - the indices that state the location of the voxel within the voxel mesh
- Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel: 0 and 1 indicate that the density is fixed at 0 or 1, respectively; -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized
- Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension
- F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]
- density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem
How to Import the Dataset
with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial, this can be done via:
from dl4to.datasets import SELTODataset
dataset = SELTODataset(root=root, name=name, train=train)
Here, root is the path where the dataset should be saved, name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex", and train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.
without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:
import pandas as pd

root = ...  # path to the unzipped SELTO subset
file_path = f'{root}/{i}.csv'  # i is the index of the sample to load
columns = ['x', 'y', 'z', 'Ω_design', 'Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
df = pd.read_csv(file_path, names=columns)
Similarly, we can import an i_info.csv file via:
file_path = f'{root}/{i}_info.csv'
info_column_names = ['E', 'ν', 'σ_ys', 'h']
df_info = pd.read_csv(file_path, names=info_column_names)
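Since each i_info.csv file holds a single row of scalar values, the individual quantities can then be unpacked directly from the dataframe (a small usage sketch, assuming one row per file):
# Unpack the scalar material and discretization parameters E, ν, σ_ys, h.
E, nu, sigma_ys, h = df_info.iloc[0]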
We can extract PyTorch tensors from the Pandas dataframe df using the following function:
import torch

def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
    # Grid shape: the last row holds the largest (x, y, z) indices.
    shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
    voxels = [df['x'].values, df['y'].values, df['z'].values]

    # Design space information (ternary: -1, 0, 1).
    Ω_design = torch.zeros(1, *shape, dtype=int)
    Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

    # Homogeneous Dirichlet boundary conditions per spatial dimension.
    Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
    Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
    Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
    Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

    # Body forces [N/m^3] per spatial dimension.
    F = torch.zeros(3, *shape, dtype=dtype)
    F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
    F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
    F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

    # Ground truth density of the topology optimization solution.
    density = torch.zeros(1, *shape, dtype=dtype)
    density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

    return Ω_design, Ω_Dirichlet, F, density
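For example, the function can be applied to the dataframe df loaded above (a short usage sketch):
Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)
print(Ω_design.shape, Ω_Dirichlet.shape, F.shape, density.shape)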
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 5/5/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably in this project.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/annual/multi_year_average/average_annual_turnover.nc" contains a global array summarizing five year (2016-2020) averages of annual transit, storage, canopy transpiration, and number of months of data. This is the core dataset for the analysis; however, each folder has much more data, including a dataset for each year of the analysis. Data are also available is separate .csv files for each land cover type. Oterh data can be found for the minimum, monthly, and seasonal transit time found in their respective folders. These data were produced using the python code found in the "supporting_code" folder given the ease of working with .nc and EASE grid in the xarray python module. R was used primarily for data visualization purposes. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here.
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a particular function:
01_start.R: This script loads the R packages used in the analysis, sets the directory, and imports custom functions for the project. You can also load in the main transit time (turnover) datasets here using the source() function.
02_functions.R: This script contains the custom functions for this analysis, primarily to work with importing the seasonal transit data. Load this using the source() function in the 01_start.R script.
03_generate_data.R: This script is not necessary to run and is primarily for documentation. The main role of this code was to import and wrangle the data needed to calculate ground-based estimates of aboveground water storage.
04_annual_turnover_storage_import.R: This script imports the annual turnover and storage data for each landcover type. You load in these data from the 01_start.R script using the source() function.
05_minimum_turnover_storage_import.R: This script imports the minimum turnover and storage data for each landcover type. Minimum is defined as the lowest monthly estimate. You load in these data from the 01_start.R script using the source() function.
06_figures_tables.R: This is the main workhorse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study, which are then saved in the manuscript_figures folder. Note that all maps were produced using Python code found in the "supporting_code" folder.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Open Context (https://opencontext.org) publishes free and open access research data for archaeology and related disciplines. An open source (but bespoke) Django (Python) application supports these data publishing services. The software repository is here: https://github.com/ekansa/open-context-py
The Open Context team runs ETL (extract, transform, load) workflows to import data contributed by researchers from various source relational databases and spreadsheets. Open Context uses a PostgreSQL (https://www.postgresql.org) relational database to manage these imported data in a graph-style schema. The Open Context Python application interacts with the PostgreSQL database via the Django Object-Relational Model (ORM).
This database dump includes all published structured data used by Open Context (table names that start with 'oc_all_'). The binary media files referenced by these structured data records are stored elsewhere. Binary media files for some projects, still in preparation, are not yet archived with long-term digital repositories.
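Once the dump has been restored into a local PostgreSQL instance, the published tables can be listed from Python (a minimal sketch, assuming the psycopg2 driver; the connection parameters below are placeholders, not part of this description):
import psycopg2

# Placeholder connection settings; adjust to your restored database.
conn = psycopg2.connect(dbname="opencontext", user="postgres", password="...", host="localhost")
with conn.cursor() as cur:
    # List all published Open Context tables (names starting with 'oc_all_').
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public' AND table_name LIKE 'oc_all_%'"
    )
    for (table_name,) in cur.fetchall():
        print(table_name)
conn.close()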
These data comprehensively reflect the structured data currently published and publicly available on Open Context. Other data (such as user and group information) used to run the Website are not included.
IMPORTANT
This database dump contains data from roughly 190+ different projects. Each project dataset has its own metadata and citation expectations. If you use these data, you must cite each data contributor appropriately, not just this Zenodo archived database dump.
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('cifar10', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png
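If you prefer working with NumPy arrays, the same split can also be materialized with the tfds.as_numpy helper (a short sketch; as_supervised=True returns (image, label) pairs for CIFAR-10):
import tensorflow_datasets as tfds

# Load CIFAR-10 as (image, label) pairs and convert a few examples to NumPy.
ds = tfds.load('cifar10', split='train', as_supervised=True)
for image, label in tfds.as_numpy(ds.take(1)):
    print(image.shape, label)  # expected: (32, 32, 3) and an integer class id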
This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame; however we put specific emphasis on speed and big data support. As the name suggests, the package is closely related to R's data.table and attempts to mimic its core algorithms and API.
The wheel file for installing datatable v0.11.0
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl > /dev/null
import datatable as dt
data = dt.fread("filename").to_pandas()
https://github.com/h2oai/datatable
https://datatable.readthedocs.io/en/latest/index.html
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('dolma', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
THUDM/webglm-qa in ChatML format. Python code used for conversion:
from datasets import load_dataset
import pandas
import re
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path="Felladrin/Llama-160M-Chat-v1"
)
dataset = load_dataset("THUDM/webglm-qa", split="train")
def format(columns): references = " ".join( [ f"- {columns['references'][i].strip()}" for i in… See the full description on the dataset page: https://huggingface.co/datasets/Felladrin/ChatML-WebGLM-QA.
Dataset Card for "Magicoder-Evol-Instruct-110K-python"
from datasets import load_dataset
dataset = load_dataset("pxyyy/Magicoder-Evol-Instruct-110K", split="train") # Replace with your dataset and split
def contains_python(entry):
    for c in entry["messages"]:
        if "python" in c["content"].lower():
            return True
    return False
    # return "python" in entry["messages"].lower()  # Replace 'column_name' with the column to search
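The function can then be used to keep only the Python-related conversations (a usage sketch, using the standard datasets.Dataset.filter API):
python_only = dataset.filter(contains_python)
print(python_only)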
The MNIST database of handwritten digits.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.
Annotation
The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation, and test sets follows the split of the original datasets.
Installation
pip install pandas pyarrow
Example
import pandas as pd
df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
dataset              AudioSet
filename             train/---2_BBVHAA.mp3
captions_visual      [a man in a black hat and glasses.]
captions_auditory    [a man speaks and dishes clank.]
tags                 [Speech]
Description
The annotation file consists of the following fields:
filename: Name of the corresponding file (video or audio file)
dataset: Source dataset associated with the data point
captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
captions_auditory: A list of captions related to the auditory content of the video
tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided
Data files
The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
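For instance, rows that actually carry visual captions can be selected from the annotation dataframe loaded above (a minimal sketch based on the field descriptions; treating missing captions as NaN is an assumption):
# Keep only annotation rows that have visual captions (drop NaN entries).
df_visual = df[df['captions_visual'].notna()]
print(len(df_visual), "clips with visual captions")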
Dataset Card for Census Income (Adult)
This dataset is a precise version of Adult or Census Income. This dataset from UCI somehow happens to occupy two links, but we checked and confirmed that they are identical. We used the following Python script to create this Hugging Face dataset.
import pandas as pd
from datasets import Dataset, DatasetDict, Features, Value, ClassLabel
url1 = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
url2 =… See the full description on the dataset page: https://huggingface.co/datasets/cestwc/census-income.
Real dataset of 14 long horizon manipulation tasks. A mix of human play data and single robot arm data performing the same tasks.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('mimic_play', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
The StudyAbroadGPT-Dataset is a collection of conversational data focused on university application requirements for various programs, including MBA, MS in Computer Science, Data Science, and Bachelor of Medicine. The dataset includes interactions between humans asking questions about application processes (e.g., "How do I write a strong SOP for MS in Data Science at MIT?") and an assistant providing detailed responses. Covering prestigious institutions such as MIT, Oxford, Cambridge, and Stanford, this dataset serves as a valuable resource for understanding the informational needs of prospective students applying to study abroad.
Dataset Structure
The dataset is organized as a list of JSON objects, where each object represents a single conversation. Each conversation contains an array of turns, structured as follows:
"from": Specifies the speaker, either "human" or "assistant". "value": Contains the text of the query or response.
Example (JSON):
{
  "conversations": [
    {"from": "human", "value": "What documents do I need for applying to MBA?"},
    {"from": "assistant", "value": "## Introduction To embark on your MBA journey, it's crucial to gather the necessary documents..."}
  ]
}
Data Collection
This dataset was synthetically generated to simulate realistic conversations about study abroad applications. It is designed to reflect common queries and provide detailed, informative responses related to university application requirements.
Preprocessing
The dataset is provided in its raw form, consisting of unprocessed conversational text. Depending on their specific use case, such as natural language processing (NLP) tasks, users may need to perform additional preprocessing steps like tokenization or stopword removal.
Usage
You can load the dataset using the Hugging Face Datasets library with the following code:
from datasets import load_dataset
dataset = load_dataset("millat/StudyAbroadGPT-Dataset")
Accessing Conversations
To access and iterate through the conversations, use this example:
for conversation in dataset["train"]:
    for turn in conversation["conversations"]:
        print(f"{turn['from']}: {turn['value']}")
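Building on this loop, and following the preprocessing note above, the assistant responses can be tokenized before model training (a minimal sketch; the bert-base-uncased checkpoint is only an illustrative assumption, not part of the dataset):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative choice

def tokenize_assistant_turns(example):
    # Collect and tokenize the assistant responses of one conversation.
    texts = [turn["value"] for turn in example["conversations"] if turn["from"] == "assistant"]
    return tokenizer(" ".join(texts), truncation=True)

tokenized = dataset["train"].map(tokenize_assistant_turns)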
Examples
Here's a sample conversation excerpt from the dataset:
json { "conversations": [ {"from": "human", "value": "What are the GMAT requirements for Oxford?"}, {"from": "assistant", "value": "## Introduction Oxford University, one of the world's most prestigious institutions, requires..."} ] }
Use Cases
- Training Conversational Agents: Build chatbots to assist with university application queries.
- Analyzing Trends: Study application requirements across different programs and institutions.
- NLP Development: Create natural language understanding models tailored to educational domains.
License
This dataset is licensed under the MIT License.
Citation
If you use this dataset in your research, please cite it as follows:
@misc{StudyAbroadGPT-Dataset,
  author = {MD MILLAT HOSEN},
  title = {StudyAbroadGPT-Dataset},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/millat/StudyAbroadGPT-Dataset}}
}
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('spoc_robot', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
OpenAsp Dataset
OpenAsp is an Open Aspect-based Multi-Document Summarization dataset derived from the DUC and Multi-News summarization datasets.
Dataset Access
To generate OpenAsp, you require access to the DUC dataset from which OpenAsp is derived.
Steps:
- Grant access to the DUC dataset by following the NIST instructions here.
  - You should receive two user-password pairs (for DUC01-02 and DUC06-07).
  - You should receive a file named fwdrequestingducdata.zip.
- Clone this repository by running the following command: git clone https://github.com/liatschiff/OpenAsp.git
- Optionally create a conda or virtualenv environment:
conda create -n openasp 'python>3.10,<3.11'
conda activate openasp
Install the Python requirements; this currently requires Python 3.8-3.10 (later Python versions have issues with spacy)
pip install -r requirements.txt
copy fwdrequestingducdata.zip into the OpenAsp repo directory
run the prepare script command:
python prepare_openasp_dataset.py --nist-duc2001-user '<2001-user>' --nist-duc2001-password '<2001-pwd>' --nist-duc2006-user '<2006-user>' --nist-duc2006-password '<2006-pwd>'
load the dataset using huggingface datasets
from glob import glob
import os
import gzip
import shutil
from datasets import load_dataset
openasp_files = os.path.join('openasp-v1', '*.jsonl.gz')
data_files = {
    os.path.basename(fname).split('.')[0]: fname
    for fname in glob(openasp_files)
}

for ftype, fname in data_files.copy().items():
    with gzip.open(fname, 'rb') as gz_file:
        with open(fname[:-3], 'wb') as output_file:
            shutil.copyfileobj(gz_file, output_file)
    data_files[ftype] = fname[:-3]
# load OpenAsp as huggingface's dataset
openasp = load_dataset('json', data_files=data_files)
# print first sample from every split
for split in ['train', 'valid', 'test']:
    sample = openasp[split][0]

    # print title, aspect_label, summary and documents for the sample
    title = sample['title']
    aspect_label = sample['aspect_label']
    summary = '\n'.join(sample['summary_text'])
    input_docs_text = ['\n'.join(d['text']) for d in sample['documents']]

    print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *')
    print(f'Sample from {split}\nSplit title={title}\nAspect label={aspect_label}')
    print(f'\naspect-based summary:\n{summary}')
    print('\ninput documents:\n')

    for i, doc_txt in enumerate(input_docs_text):
        print(f'---- doc #{i} ----')
        print(doc_txt[:256] + '...')

    print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *\n')
Troubleshooting
1. Dataset failed loading with load_dataset() - you may want to delete the huggingface datasets cache folder.
2. 401 Client Error: Unauthorized - your DUC credentials are incorrect; please verify them (case sensitive, no extra spaces, etc.).
3. Dataset created but prints a warning about content verification - you may be using a different version of NLTK or the spacy model, which affects the sentence tokenization process. You must use the exact versions pinned in requirements.txt.
4. IndexError: list index out of range - similar to (3); try to reinstall the requirements with exact package versions.
Under The Hood
The prepare_openasp_dataset.py script downloads the DUC and Multi-News source files, uses the sacrerouge package to prepare the datasets, and uses the openasp_v1_dataset_metadata.json file to extract the relevant aspect summaries and compile the final OpenAsp dataset.
License
This repository, including openasp_v1_dataset_metadata.json and prepare_openasp_dataset.py, is released under the Apache license.
The OpenAsp dataset summary and source documents for each sample, which are generated by running the script, are licensed under the respective generic summarization dataset licenses - the Multi-News license and the DUC license.