Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is useful for market analysis.
This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open-source GridLAB-D modelling language [3]. In this dataset, each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder's geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1]. The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.
This dataset consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus, and the subsequent 8760 rows give the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load; commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009; for regions 4-5, the data is from 2000.
For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and the corresponding datetime values go in the first column, as shown in the sample file sample_individual_load_file.csv. Only the first value in the time column needs to be written as an absolute time; subsequent times may be written in relative format (e.g. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of watts (W) and volt-amperes reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.
Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242 MB compressed or 475 MB uncompressed. For questions about this dataset, contact andy.hoke@nrel.gov. If you find this dataset useful, please mention NREL and cite [1] in your work.
References:
[1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, "Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders," IEEE Transactions on Sustainable Energy, April 2013. Available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275
[2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, "Modern Grid Initiative Distribution Taxonomy Final Report," PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf
[3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, "Distribution power flow for smart grid technologies," IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.
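As a rough sketch of the per-column preparation described above, a single feeder CSV could be split into headerless per-load files in Python. The feeder file name, the start timestamp, and the 0.95 power factor below are assumptions for illustration only:
import pandas as pd
# Assumptions: feeder file name and start timestamp are illustrative;
# reactive power is derived from an assumed 0.95 power factor.
feeder = pd.read_csv("R1-12.47-1_loads.csv")       # hypothetical feeder CSV
pf = 0.95
q_factor = (1 / pf**2 - 1) ** 0.5                  # Q = P * tan(acos(pf))
for bus in feeder.columns:
    p = feeder[bus].astype(float)
    q = p * q_factor
    with open(f"{bus}.csv", "w") as f:
        # first row uses an absolute timestamp; later rows use relative "+1h" steps
        f.write(f"2009-01-01 00:00:00,{p.iloc[0]:.1f}+{q.iloc[0]:.1f}j\n")
        for pv, qv in zip(p.iloc[1:], q.iloc[1:]):
            f.write(f"+1h,{pv:.1f}+{qv:.1f}j\n")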
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The MNIST dataset in HDF5 format.
Data can be loaded with the h5py package (pip install h5py); see the demo.
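A minimal loading sketch, assuming a hypothetical file name and a train/test group layout; the actual keys may differ, so inspect them first:
import h5py
import numpy as np
# file name and group/dataset names are assumptions, not confirmed by the dataset page
with h5py.File("mnist.h5", "r") as f:
    print(list(f.keys()))                  # inspect the actual layout first
    x_train = np.array(f["train/images"])  # hypothetical keys
    y_train = np.array(f["train/labels"])
print(x_train.shape, y_train.shape)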
Dataset Card for Python-DPO
This dataset is the smaller version of Python-DPO-Large dataset and has been created using Argilla.
Load with datasets
To load this dataset with the datasets library, install it with pip install datasets --upgrade and then use the following code:
from datasets import load_dataset
ds = load_dataset("NextWealth/Python-DPO")
Data Fields
Each data instance contains:
instruction: The problem description/requirements… See the full description on the dataset page: https://huggingface.co/datasets/NextWealth/Python-DPO.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also note that this R project has been updated multiple times as the analysis has been revised.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
Code information:
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. The classes are completely mutually exclusive. There are 50000 training images and 10000 test images.
The batches.meta file contains the label names of each class.
The dataset was originally divided into 5 training batches with 10000 images per batch. The original dataset can be found here: https://www.cs.toronto.edu/~kriz/cifar.html. This dataset contains all the training data and test data in the same CSV file, so it is easier to load.
Here is the list of the 10 classes in the CIFAR-10:
0: airplane
1: automobile
2: bird
3: cat
4: deer
5: dog
6: frog
7: horse
8: ship
9: truck
The function used to open the file:
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
Example of how to read the file:
metadata_path = './cifar-10-python/batches.meta' # change this path
metadata = unpickle(metadata_path)
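Since this version ships the train and test data in a single CSV, here is a hedged reading sketch with pandas; the CSV file name and column layout below are assumptions, so check the actual header first:
import pandas as pd
# file name and column layout are assumptions for illustration
df = pd.read_csv("cifar10.csv")
print(df.columns[:5])        # inspect the real column names
# decode human-readable class names from batches.meta using the unpickle() helper above
label_names = [name.decode("utf-8") for name in metadata[b"label_names"]]
print(label_names)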
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('cifar10', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png
Dataset Card for "Magicoder-Evol-Instruct-110K-python"
from datasets import load_dataset
dataset = load_dataset("pxyyy/Magicoder-Evol-Instruct-110K", split="train") # Replace with your dataset and split
def contains_python(entry):
    # keep the conversation if any message mentions "python"
    for c in entry["messages"]:
        if "python" in c['content'].lower():
            return True
    return False
    # return "python" in entry["messages"].lower()  # Replace 'column_name' with the column to search
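Presumably the function is then passed to the datasets filter method to keep only Python-related conversations, roughly as follows (the result variable name is illustrative):
# keep only conversations whose messages mention "python"
python_only = dataset.filter(contains_python)
print(python_only)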
The StudyAbroadGPT-Dataset is a collection of conversational data focused on university application requirements for various programs, including MBA, MS in Computer Science, Data Science, and Bachelor of Medicine. The dataset includes interactions between humans asking questions about application processes (e.g., "How do I write a strong SOP for MS in Data Science at MIT?") and an assistant providing detailed responses. Covering prestigious institutions such as MIT, Oxford, Cambridge, and Stanford, this dataset serves as a valuable resource for understanding the informational needs of prospective students applying to study abroad.
Dataset Structure The dataset is organized as a list of JSON objects, where each object represents a single conversation. Each conversation contains an array of turns, structured as follows:
"from": Specifies the speaker, either "human" or "assistant". "value": Contains the text of the query or response.
Example json { "conversations": [ {"from": "human", "value": "What documents do I need for applying to MBA?"}, {"from": "assistant", "value": "## Introduction To embark on your MBA journey, it's crucial to gather the necessary documents..."} ] }
Data Collection This dataset was synthetically generated to simulate realistic conversations about study abroad applications. It is designed to reflect common queries and provide detailed, informative responses related to university application requirements.
Preprocessing The dataset is provided in its raw form, consisting of unprocessed conversational text. Depending on their specific use case, such as natural language processing (NLP) tasks, users may need to perform additional preprocessing steps like tokenization or stopword removal.
Usage You can load the dataset using the Hugging Face Datasets library with the following code:
from datasets import load_dataset
dataset = load_dataset("millat/StudyAbroadGPT-Dataset")
Accessing Conversations To access and iterate through the conversations, use this example:
for conversation in dataset["train"]:
    for turn in conversation["conversations"]:
        print(f"{turn['from']}: {turn['value']}")
Examples Here’s a sample conversation excerpt from the dataset:
json { "conversations": [ {"from": "human", "value": "What are the GMAT requirements for Oxford?"}, {"from": "assistant", "value": "## Introduction Oxford University, one of the world's most prestigious institutions, requires..."} ] }
Use Cases
Training Conversational Agents: Build chatbots to assist with university application queries. Analyzing Trends: Study application requirements across different programs and institutions. NLP Development: Create natural language understanding models tailored to educational domains.
License This dataset is licensed under the MIT License.
Citation If you use this dataset in your research, please cite it as follows:
@misc{StudyAbroadGPT-Dataset,
  author = {MD MILLAT HOSEN},
  title = {StudyAbroadGPT-Dataset},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/millat/StudyAbroadGPT-Dataset}}
}
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
pip install datasets
from datasets import load_dataset
dataset = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy")
Compatible Machine Learning Tasks:
- Token classification. Check out Hugging Face's guide on token classification.
- ALBERT, BERT, BigBird, BioGpt, BLOOM, BROS, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, ESM, Falcon, FlauBERT, FNet, Funnel Transformer, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, I-BERT, LayoutLM, LayoutLMv2, LayoutLMv3, LiLT, Longformer, LUKE, MarkupLM, MEGA, Megatron-BERT, MobileBERT, ...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains two hyperspectral and one multispectral anomaly detection images, and their corresponding binary pixel masks. They were initially used for real-time anomaly detection in line-scanning, but they can be used for any anomaly detection task.
They are in .npy file format (will add tiff or geotiff variants in the future), with the image datasets being in the order of (height, width, channels). The SNP dataset was collected using sentinelhub, and the Synthetic dataset was collected from AVIRIS. The Python code used to analyse these datasets can be found at: https://github.com/WiseGamgee/HyperAD
All that is needed to load these datasets is Python (preferably 3.8+) and the NumPy package. Example code for loading the Beach Dataset if you put it in a folder called "data" with the python script is:
import numpy as np
# Load image file
hsi_array = np.load("data/beach_hsi.npy")
n_pixels, n_lines, n_bands = hsi_array.shape
print(f"This dataset has {n_pixels} pixels, {n_lines} lines, and {n_bands}.")
# Load image mask
mask_array = np.load("data/beach_mask.npy")
m_pixels, m_lines = mask_array.shape
print(f"The corresponding anomaly mask is {m_pixels} pixels by {m_lines} lines.")
If you use any of these datasets, please cite the following paper:
@article{garske2024erx,
title={ERX - a Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line-Scanning},
author={Garske, Samuel and Evans, Bradley and Artlett, Christopher and Wong, KC},
journal={arXiv preprint arXiv:2408.14947},
year={2024},
}
If you use the beach dataset please cite the following paper as well (original source):
@article{mao2022openhsi,
title={OpenHSI: A complete open-source hyperspectral imaging solution for everyone},
author={Mao, Yiwei and Betters, Christopher H and Evans, Bradley and Artlett, Christopher P and Leon-Saval, Sergio G and Garske, Samuel and Cairns, Iver H and Cocks, Terry and Winter, Robert and Dell, Timothy},
journal={Remote Sensing},
volume={14},
number={9},
pages={2244},
year={2022},
publisher={MDPI}
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the simulation data of the combinatorial metamaterial as used for the paper 'Machine Learning of Implicit Combinatorial Rules in Mechanical Metamaterials', as published in Physical Review Letters.
In this paper, the data is used to classify each \(k \times k\) unit cell design into one of two classes (C or I) based on the scaling (linear or constant) of the number of zero modes \(M_k(n)\) for metamaterials consisting of an \(n\times n\) tiling of the corresponding unit cell. Additionally, a random walk through the design space starting from class C unit cells was performed to characterize the boundary between class C and I in design space. A more detailed description of the contents of the dataset follows below.
Modescaling_raw_data.zip
This file contains uniformly sampled unit cell designs for metamaterial M2 and \(M_k(n)\) for \(1\leq n\leq 4\), which were used to classify the unit cell designs for the dataset. There is a small subset of designs for \(k=\{3, 4, 5\}\) that do not neatly fall into the class C and I classification and instead require additional simulation for \(4 \leq n \leq 6\) before either saturating to a constant number of zero modes (class I) or increasing linearly (class C). This file contains the simulation data for unit cells of size \(3 \leq k \leq 8\). The data is organized as follows.
Simulation data for \(3 \leq k \leq 5\) and \(1 \leq n \leq 4\) is stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. These files are named "data_new_rrQR_i_n_M_kxk_fixn4.npy", and contain a [Nsim, 1+k*k+4] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:
Note: the unit cell design uses the numbers \(\{0, 1, 2, 3\}\) to refer to each building block orientation. The building block orientations can be characterized through the orientation of the missing diagonal bar (see Fig. 2 in the paper), which can be Left Up (LU), Left Down (LD), Right Up (RU), or Right Down (RD). The numbers correspond to the building block orientation \(\{0, 1, 2, 3\} = \{\mathrm{LU, RU, RD, LD}\}\).
Simulation data for \(3 \leq k \leq 5\) and \(1 \leq n \leq 6\) for unit cells that cannot be classified as class C or I for \(1 \leq n \leq 4\) is stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. These files are named "data_new_rrQR_i_n_M_kxk_fixn4_classX_extend.npy", and contain a [Nsim, 1+k*k+6] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:
Simulation data for \(6 \leq k \leq 8\) unit cells are stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. Note that the number of modes is now calculated for \(n_x \times n_y\) metamaterials, where we calculate \((n_x, n_y) = \{(1,1), (2, 2), (3, 2), (4,2), (2, 3), (2, 4)\}\) rather than \(n_x=n_y=n\) to save computation time. These files are named "data_new_rrQR_i_n_Mx_My_n4_kxk(_extended).npy", and contain a [Nsim, 1+k*k+8] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:
Simulation data of metamaterial M1 for \(k_x \times k_y\) metamaterials are stored in compressed numpy array format (.npz) and can be loaded in Python with the Numpy package using the numpy.load command. These files are named "smiley_cube_x_y_\(k_x\)x\(k_y\).npz", which contain all possible metamaterial designs, and "smiley_cube_uniform_sample_x_y_\(k_x\)x\(k_y\).npz", which contain uniformly sampled metamaterial designs. The configurations are accessed with the keyword argument 'configs'. The classification is accessed with the keyword argument 'compatible'. The configurations array is of shape [Nsim, \(k_x\), \(k_y\)], the classification array is of shape [Nsim]. The building blocks in the configuration are denoted by 0 or 1, which correspond to the red/green and white/dashed building blocks respectively. Classification is 0 or 1, which corresponds to I and C respectively.
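A brief loading sketch for both file types; the concrete file names below are placeholders that follow the naming patterns described above:
import numpy as np
# .npy raw-data file for a k x k unit cell (placeholder name following the pattern above)
data = np.load("data_new_rrQR_i_n_M_3x3_fixn4.npy")
print(data.shape)                   # (Nsim, 1 + k*k + 4) for the 3 <= k <= 5 files
# .npz file for metamaterial M1 (placeholder name following the pattern above)
archive = np.load("smiley_cube_x_y_3x3.npz")
configs = archive["configs"]        # shape (Nsim, k_x, k_y), entries 0 or 1
compatible = archive["compatible"]  # shape (Nsim,), 0 = class I, 1 = class C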
Modescaling_classification_results.zip
This file contains the classification, slope, and offset of the scaling of the number of zero modes \(M_k(n)\) for the unit cells of metamaterial M2 in Modescaling_raw_data.zip. The data is organized as follows.
The results for \(3 \leq k \leq 5\) based on the \(1 \leq n \leq 4\) mode scaling data is stored in "results_analysis_new_rrQR_i_Scen_slope_offset_M1k_kxk_fixn4.txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare to the earlier data). The columns are organized as follows:
col 0: label number to keep track
col 1: the class, where 0 corresponds to class I, 1 to class C, and 2 to class X (neither class I nor C for \(1 \leq n \leq 4\))
col 2: slope from \(n \geq 2\) onward (undefined for class X)
col 3: the offset is defined as \(M_k(2) - 2 \cdot \mathrm{slope}\)
col 4: \(M_k(1)\)
The results for \(3 \leq k \leq 5\) based on the extended \(1 \leq n \leq 6\) mode scaling data is stored in "results_analysis_new_rrQR_i_Scen_slope_offset_M1k_kxk_fixn4_classC_extend.txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare to the earlier data). The columns are organized as follows:
col 0: label number to keep track
col 1: the class, where 0 corresponds to class I, 1 to class C, and 2 to class X (neither class I nor C for \(1 \leq n \leq 6\))
col 2: slope from \(n \geq 2\) onward (undefined for class X)
col 3: the offset is defined as \(M_k(2) - 2 \cdot \mathrm{slope}\)
col 4: \(M_k(1)\)
The results for \(6 \leq k \leq 8\) based on the \(1 \leq n \leq 4\) mode scaling data is stored in "results_analysis_new_rrQR_i_Scenx_Sceny_slopex_slopey_offsetx_offsety_M1k_kxk(_extended).txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare to the earlier data). The columns are organized as follows:
col 0: label number to keep track
col 1: the class_x based on \(M_k(n_x, 2)\), where 0 corresponds to class I, 1 to class C, and 2 to class X (neither class I nor C for \(1 \leq n_x \leq 4\))
col 2: the class_y based on \(M_k(2, n_y)\), where 0 corresponds to class I, 1 to class C, and 2 to class X (neither class I nor C for \(1 \leq n_y \leq 4\))
col 3: slope_x from \(n_x \geq 2\) onward (undefined for class X)
col 4: slope_y from \(n_y \geq 2\) onward (undefined for class X)
col 5: the offset_x is defined as \(M_k(2, 2) - 2 \cdot \mathrm{slope_x}\)
col 6: the offset_y is defined as \(M_k(2, 2) - 2 \cdot \mathrm{slope_y}\)
col 7: \(M_k(1, 1)\)
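As a loading sketch for these classification tables (the file name below is a placeholder following the naming patterns above; np.genfromtxt is used so that undefined class-X slopes do not break parsing):
import numpy as np
# placeholder file name following the pattern described above; comma-delimited text table
results = np.genfromtxt("results_analysis_new_rrQR_i_Scen_slope_offset_M1k_3x3_fixn4.txt",
                        delimiter=",")
labels = results[:, 0].astype(int)   # col 0: label number
classes = results[:, 1].astype(int)  # col 1: 0 = class I, 1 = class C, 2 = class X
slopes = results[:, 2]               # col 2: slope for n >= 2 (undefined for class X)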
OpenAsp Dataset OpenAsp is an Open Aspect-based Multi-Document Summarization dataset derived from DUC and MultiNews summarization datasets.
Dataset Access To generate OpenAsp, you require access to the DUC dataset which OpenAsp is derived from.
Steps:
Grant access to the DUC dataset by following the NIST instructions here. You should receive two user-password pairs (for DUC01-02 and DUC06-07) and a file named fwdrequestingducdata.zip.
Clone this repository by running the following command: git clone https://github.com/liatschiff/OpenAsp.git
Optionally create a conda or virtualenv environment:
conda create -n openasp 'python>3.10,<3.11'
conda activate openasp
Install the Python requirements; this currently requires Python 3.8-3.10 (later Python versions have issues with spacy):
pip install -r requirements.txt
Copy fwdrequestingducdata.zip into the OpenAsp repo directory.
Run the prepare script command:
python prepare_openasp_dataset.py --nist-duc2001-user '<2001-user>' --nist-duc2001-password '<2001-pwd>' --nist-duc2006-user '<2006-user>' --nist-duc2006-password '<2006-pwd>'
Load the dataset using Hugging Face datasets:
from glob import glob
import os
import gzip
import shutil
from datasets import load_dataset
openasp_files = os.path.join('openasp-v1', '*.jsonl.gz')
data_files = {
os.path.basename(fname).split('.')[0]: fname
for fname in glob(openasp_files)
}
for ftype, fname in data_files.copy().items():
    with gzip.open(fname, 'rb') as gz_file:
        with open(fname[:-3], 'wb') as output_file:
            shutil.copyfileobj(gz_file, output_file)
    data_files[ftype] = fname[:-3]
# load OpenAsp as huggingface's dataset
openasp = load_dataset('json', data_files=data_files)
# print first sample from every split
for split in ['train', 'valid', 'test']:
    sample = openasp[split][0]
    # print title, aspect_label, summary and documents for the sample
    title = sample['title']
    aspect_label = sample['aspect_label']
    summary = '\n'.join(sample['summary_text'])
    input_docs_text = ['\n'.join(d['text']) for d in sample['documents']]
    print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *')
    print(f'Sample from {split}\nSplit title={title}\nAspect label={aspect_label}')
    print(f'\naspect-based summary:\n{summary}')
    print('\ninput documents:\n')
    for i, doc_txt in enumerate(input_docs_text):
        print(f'---- doc #{i} ----')
        print(doc_txt[:256] + '...')
    print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *\n')
Troubleshooting
Dataset failed loading with load_dataset() - you may want to delete the huggingface datasets cache folder.
401 Client Error: Unauthorized - your DUC credentials are incorrect; please verify them (case sensitive, no extra spaces, etc.).
Dataset created but prints a warning about content verification - you may be using a different version of NLTK or the spacy model, which affects the sentence tokenization process. You must use the exact versions pinned in requirements.txt.
IndexError: list index out of range - similar to the previous item; try reinstalling the requirements with the exact package versions.
Under The Hood The prepare_openasp_dataset.py script downloads DUC and Multi-News source files, uses sacrerouge package to prepare the datasets and uses the openasp_v1_dataset_metadata.json file to extract the relevant aspect summaries and compile the final OpenAsp dataset.
License This repository, including the openasp_v1_dataset_metadata.json and prepare_openasp_dataset.py, are released under APACHE license.
OpenAsp dataset summary and source document for each sample, which are generated by running the script, are licensed under the respective generic summarization dataset - Multi-News license and DUC license.
This dataset consists of 101 food categories, with 101'000 images. For each class, 250 manually reviewed test images are provided as well as 750 training images. On purpose, the training images were not cleaned, and thus still contain some amount of noise. This comes mostly in the form of intense colors and sometimes wrong labels. All images were rescaled to have a maximum side length of 512 pixels.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('food101', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/food101-2.0.0.png
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Modern research projects incorporate data from several sources, and new insights are increasingly driven by the ability to interpret data in the context of other data. Glue is an interactive environment built on top of the standard Python science stack to visualize relationships within and between datasets. With Glue, users can load and visualize multiple related datasets simultaneously. Users specify the logical connections that exist between data, and Glue transparently uses this information as needed to enable visualization across files. This functionality makes it trivial, for example, to interactively overplot catalogs on top of images. The central philosophy behind Glue is that the structure of research data is highly customized and problem-specific. Glue aims to accommodate this and simplify the "data munging" process, so that researchers can more naturally explore what their data have to say. The result is a cleaner scientific workflow, faster interaction with data, and an easier avenue to insight.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the first data release from the Public Utility Data Liberation (PUDL) project. It can be referenced & cited using https://doi.org/10.5281/zenodo.3653159
For more information about the free and open source software used to generate this data release, see Catalyst Cooperative's PUDL repository on Github, and the associated documentation on Read The Docs. This data release was generated using v0.3.1 of the catalystcoop.pudl Python package.
Included Data Packages
This release consists of three tabular data packages, conforming to the standards published by Frictionless Data and the Open Knowledge Foundation. The data are stored in CSV files (some of which are compressed using gzip), and the associated metadata is stored as JSON. These tabular data can be used to populate a relational database.
pudl-eia860-eia923:
pudl-eia860-eia923-epacems: contains everything in the pudl-eia860-eia923 package above, as well as the Hourly Emissions data from the US Environmental Protection Agency's (EPA's) Continuous Emissions Monitoring System (CEMS) from 1995-2018. The EPA CEMS data covers thousands of power plants at hourly resolution for decades, and contains close to a billion records.
pudl-ferc1:
The packages were generated with the catalystcoop.pudl Python package and the original source data files archived as part of this data release.
Contact Us
If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data. You can also:
Using the Data
The data packages are just CSVs (data) and JSON (metadata) files. They can be used with a variety of tools on many platforms. However, the data is organized primarily with the idea that it will be loaded into a relational database, and the PUDL Python package that was used to generate this data release can facilitate that process. Once the data is loaded into a database, you can access that DB however you like.
Make sure conda is installed
None of these commands will work without the conda Python package manager, installed either via Anaconda or miniconda.
Download the data
First download the files from the Zenodo archive into a new empty directory. A couple of them are very large (5-10 GB), and depending on what you're trying to do you may not need them.
The very large files are pudl-input-data.tgz and pudl-eia860-eia923-epacems.tgz.
Load All of PUDL in a Single Line
Use cd to get into your new directory at the terminal (in Linux or Mac OS), or open up an Anaconda terminal in that directory if you're on Windows.
If you have downloaded all of the files from the archive, and you want it all to be accessible locally, you can run a single shell script, called load-pudl.sh:
bash load-pudl.sh
This will do the following:
Load the cleaned FERC Form 1 and EIA 860/923 data into sqlite/pudl.sqlite.
Convert the EPA CEMS hourly emissions data into an Apache Parquet dataset under parquet/epacems.
Clone the raw FERC Form 1 databases into sqlite/ferc1.sqlite.
Selectively Load PUDL Data
If you don't want to download and load all of the PUDL data, you can load each of the above datasets separately.
Create the PUDL conda Environment
This installs the PUDL software locally, and a couple of other useful packages:
conda create --yes --name pudl --channel conda-forge \
--strict-channel-priority \
python=3.7 catalystcoop.pudl=0.3.1 dask jupyter jupyterlab seaborn pip
conda activate pudl
Create a PUDL data management workspace
Use the PUDL setup script to create a new data management environment inside this directory. After you run this command you'll see some other directories show up, like parquet, sqlite, data, etc.
pudl_setup ./
Extract and load the FERC Form 1 and EIA 860/923 data
If you just want the FERC Form 1 and EIA 860/923 data that has been integrated into PUDL, you only need to download pudl-ferc1.tgz and pudl-eia860-eia923.tgz. Then extract them in the same directory where you ran pudl_setup:
tar -xzf pudl-ferc1.tgz
tar -xzf pudl-eia860-eia923.tgz
To make use of the FERC Form 1 and EIA 860/923 data, you'll probably want to load them into a local database. The datapkg_to_sqlite script that comes with PUDL will do that for you:
datapkg_to_sqlite \
datapkg/pudl-data-release/pudl-ferc1/datapackage.json \
datapkg/pudl-data-release/pudl-eia860-eia923/datapackage.json \
-o datapkg/pudl-data-release/pudl-merged/
Now you should be able to connect to the database (~300 MB), which is stored in sqlite/pudl.sqlite.
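As a small sketch, assuming the workspace layout above, the merged database can be inspected from Python with the standard library's sqlite3 module:
import sqlite3
# connect to the merged PUDL database created by datapkg_to_sqlite
conn = sqlite3.connect("sqlite/pudl.sqlite")
cur = conn.cursor()
# list the available tables; the table names are whatever this release actually contains
cur.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;")
for (table_name,) in cur.fetchall():
    print(table_name)
conn.close()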
Extract EPA CEMS and convert to Apache Parquet
If you want to work with the EPA CEMS data, which is much larger, we recommend converting it to an Apache Parquet dataset with the included epacems_to_parquet script. Then you can read those files into dataframes directly. In Python you can use the pandas.read_parquet() function. If you need to work with more data than can fit in memory at one time, we recommend using Dask dataframes. Converting the entire dataset from datapackages into Apache Parquet may take an hour or more:
tar -xzf pudl-eia860-eia923-epacems.tgz
epacems_to_parquet datapkg/pudl-data-release/pudl-eia860-eia923-epacems/datapackage.json
You should find the Parquet dataset (~5 GB) under parquet/epacems, partitioned by year and state for easier querying.
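A minimal reading sketch; the hive-style partition path below is an assumption about how epacems_to_parquet lays out the files, so adjust it to what you actually see under parquet/epacems:
import pandas as pd
import dask.dataframe as dd
# read one assumed year/state partition directly with pandas
co_2018 = pd.read_parquet("parquet/epacems/year=2018/state=CO")
print(co_2018.head())
# or read the whole dataset lazily with Dask when it does not fit in memory
cems = dd.read_parquet("parquet/epacems")
print(cems.columns)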
Clone the raw FERC Form 1 Databases
If you want to access the entire set of original, raw FERC Form 1 data (of which only a small subset has been cleaned and integrated into PUDL), you can extract the original input data that's part of the Zenodo archive and run the ferc1_to_sqlite script using the same settings file that was used to generate the data release:
tar -xzf pudl-input-data.tgz
ferc1_to_sqlite data-release-settings.yml
You'll find the FERC Form 1 database (~820 MB) in sqlite/ferc1.sqlite.
Data Quality Control
We have performed basic sanity checks on much, but not all, of the data compiled in PUDL to ensure that we identify any major issues we might have introduced through our processing.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for OpenAI HumanEval
Dataset Summary
The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they would not be included in the training set of code generation models.
Supported Tasks and Leaderboards
Languages
The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
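A quick loading sketch with the Hugging Face datasets library; the field names reflect the dataset card, but verify them against the actual features:
from datasets import load_dataset
# HumanEval ships a single "test" split of 164 problems
humaneval = load_dataset("openai/openai_humaneval", split="test")
example = humaneval[0]
print(example["task_id"])   # e.g. "HumanEval/0"
print(example["prompt"])    # function signature + docstring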
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
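Since the network file is plain JSON in PowerModels format, it can be inspected directly. A sketch follows; the top-level keys "bus", "branch", and "gen" are standard in PowerModels exports but are worth verifying against the file itself:
import json
# load the PowerModels-style network description
with open("europe_network.json") as f:
    network = json.load(f)
# top-level component dictionaries are assumed to follow PowerModels conventions
print(len(network["bus"]), "buses")
print(len(network["branch"]), "lines and transformers")
print(len(network["gen"]), "generators")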
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amount to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent a same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with
pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
The MNIST database of handwritten digits.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png
Machine learning approaches are often trained and evaluated with datasets that require a clear separation between positive and negative examples. This approach overly simplifies the natural subjectivity present in many tasks and content items. It also obscures the inherent diversity in human perceptions and opinions. Often tasks that attempt to preserve the variance in content and diversity in humans are quite expensive and laborious. To fill in this gap and facilitate more in-depth model performance analyses we propose the DICES dataset - a unique dataset with diverse perspectives on safety of AI generated conversations. We focus on the task of safety evaluation of conversational AI systems. The DICES dataset contains detailed demographics information about each rater, extremely high replication of unique ratings per conversation to ensure statistical significance of further analyses and encodes rater votes as distributions across different demographics to allow for in-depth explorations of different rating aggregation strategies.
This dataset is well suited to observe and measure variance, ambiguity and diversity in the context of safety of conversational AI. The dataset is accompanied by a paper describing a set of metrics that show how rater diversity influences the safety perception of raters from different geographic regions, ethnicity groups, age groups and genders. The goal of the DICES dataset is to be used as a shared benchmark for safety evaluation of conversational AI systems.
CONTENT WARNING: This dataset contains adversarial examples of conversations that may be offensive.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('dices', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.