Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data input to the simulation were generated by two Python scripts: "GENERATE_SAMPLES.py" and "GENERATE_RESAMPLING_DATA.py".

1. "GENERATE_SAMPLES.py": This Python script generates:

a) "DataSet_n[N]_p[p].pickle", where N is replaced by 500 or 5000 and p is replaced by 2 or 10. This Python object contains:
a1. the explicative variables "X",
a2. the responses "Y",
a3. the knots "knots",
a4. the target tail index parameters "gamma0",
a5. the k different random-state responses "Yk" with k=1,...,100.

To read these data, run the following Python code (take n=5000 and p=10 for example):

import pickle
with open('DataSet_n5000_p10.pickle', 'rb') as handle:
    X = pickle.load(handle)
    Y = pickle.load(handle)
    knots = pickle.load(handle)
    gamma0 = pickle.load(handle)
    Yk = pickle.load(handle)

b) "gridX_p[p].pickle", where p is replaced by 2 or 10. This Python object contains:
b1. the setting points "gridX", which correspond to (x(1)_(m1),...,x(p)_(mp)) in the paper,
b2. "prefactor", which corresponds to \Delta(p)x in the paper,
b3. "gamma0_gridX", which corresponds to gamma0(gridX).

To read these data, run the following Python code (take p=10 for example):

import pickle
with open('gridX_p10.pickle', 'rb') as handle:
    gridX = pickle.load(handle)
    prefactor = pickle.load(handle)
    gamma0_gridX = pickle.load(handle)

2. "GENERATE_RESAMPLING_DATA.py": This Python script generates:

a) "DataSet_Resampling_n[N]_p[p]_w_replacement.pickle", where N is replaced by 500 or 5000 and p is replaced by 2 or 10. This Python object contains:
a1. the resampled explicative variables "X_resample",
a2. the knots "knots",
a3. the resampled k different random-state responses "Y_resample".

To read these data, run the following Python code (take N=5000 and p=10 for example):

import pickle
with open('DataSet_Resampling_n5000_p10_w_replacement.pickle', 'rb') as handle:
    X_resample = pickle.load(handle)
    ignored = pickle.load(handle)  # the knots, not needed here
    Y_resample = pickle.load(handle)
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- **LICENSE** - text of GPL v3, under which this dataset is published
- **INSTALL.md** - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detail levels)
- at least 16Gb of RAM (64 preferable)
- a few hours to a few months of processing time

Step 1 - software
-----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15-30 minutes.

- create a folder `
With system performance on existing reading comprehension benchmarks nearing or surpassing human performance, we need a new, hard dataset that improves systems' capabilities to actually read paragraphs of text. DROP is a crowdsourced, adversarially-created, 96k-question benchmark, in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets.
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('drop', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
This item contains data and code used in experiments that produced the results for Sadler et al. (2022) (see below for full reference). We ran five experiments for the analysis: Experiment A, Experiment B, Experiment C, Experiment D, and Experiment AuxIn. Experiment A tested multi-task learning for predicting streamflow with 25 years of training data and using a different model for each of 101 sites. Experiment B tested multi-task learning for predicting streamflow with 25 years of training data and using a single model for all 101 sites. Experiment C tested multi-task learning for predicting streamflow with just 2 years of training data. Experiment D tested multi-task learning for predicting water temperature with over 25 years of training data. Experiment AuxIn used water temperature as an input variable for predicting streamflow. These experiments and their results are described in detail in the WRR paper. Data from a total of 101 sites across the US were used for the experiments. The model input data and streamflow data were from the Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) dataset (Newman et al. 2014; Addor et al. 2017). The water temperature data were gathered from the National Water Information System (NWIS) (U.S. Geological Survey, 2016). The contents of this item are broken into 13 files or groups of files aggregated into zip files:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
3DHD CityScenes is the most comprehensive, large-scale high-definition (HD) map dataset to date, annotated in the three spatial dimensions of globally referenced, high-density LiDAR point clouds collected in urban domains. Our HD map covers 127 km of road sections of the inner city of Hamburg, Germany, including 467 km of individual lanes. In total, our map comprises 266,762 individual items.
Our corresponding paper (published at ITSC 2022) is available here.
Further, we have applied 3DHD CityScenes to map deviation detection here.
Moreover, we release code to facilitate the application of our dataset and the reproducibility of our research. Specifically, our 3DHD_DevKit comprises:
The DevKit is available here:
https://github.com/volkswagen/3DHD_devkit.
The dataset and DevKit have been created by Christopher Plachetka as project lead during his PhD period at Volkswagen Group, Germany.
When using our dataset, you are welcome to cite:
@INPROCEEDINGS{9921866,
  author={Plachetka, Christopher and Sertolli, Benjamin and Fricke, Jenny and Klingner, Marvin and Fingscheidt, Tim},
  booktitle={2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC)},
  title={3DHD CityScenes: High-Definition Maps in High-Density Point Clouds},
  year={2022},
  pages={627-634}}
Acknowledgements
We thank the following interns for their exceptional contributions to our work.
The European large-scale project Hi-Drive (www.Hi-Drive.eu) supports the publication of 3DHD CityScenes and encourages the general publication of information and databases facilitating the development of automated driving technologies.
The Dataset
After downloading, the 3DHD_CityScenes folder provides five subdirectories, which are explained briefly in the following.
1. Dataset
This directory contains the training, validation, and test set definition (train.json, val.json, test.json) used in our publications. Respective files contain samples that define a geolocation and the orientation of the ego vehicle in global coordinates on the map.
During dataset generation (done by our DevKit), samples are used to take crops from the larger point cloud. Also, map elements in reach of a sample are collected. Both modalities can then be used, e.g., as input to a neural network such as our 3DHDNet.
To read any JSON-encoded data provided by 3DHD CityScenes in Python, you can use the following code snippet as an example.
import json

json_path = r"E:\3DHD_CityScenes\Dataset\train.json"
with open(json_path) as jf:
    data = json.load(jf)
print(data)
2. HD_Map
Map items are stored as lists of items in JSON format. In particular, we provide:
3. HD_Map_MetaData
Our high-density point cloud used as the basis for annotating the HD map is split into 648 tiles. This directory contains the geolocation of each tile as a polygon on the map. You can view the respective tile definitions using QGIS. Alternatively, we also provide the respective polygons as lists of UTM coordinates in JSON.
Files with the ending .dbf, .prj, .qpj, .shp, and .shx belong to the tile definition as “shape file” (commonly used in geodesy) that can be viewed using QGIS. The JSON file contains the same information provided in a different format used in our Python API.
4. HD_PointCloud_Tiles
The high-density point cloud tiles are provided in global UTM32N coordinates and are encoded in a proprietary binary format. The first 4 bytes (integer) encode the number of points contained in that file. Subsequently, all point cloud values are provided as arrays. First all x-values, then all y-values, and so on. Specifically, the arrays are encoded as follows.
After reading, the respective values have to be unnormalized. As an example, you can use the following code snippet to read the point cloud data. For visualization, you can use the pptk package, for instance.
import numpy as np
import pptk
file_path = r"E:\3DHD_CityScenes\HD_PointCloud_Tiles\HH_001.bin"
pc_dict = {}
key_list = ['x', 'y', 'z', 'intensity', 'is_ground']
type_list = ['
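The snippet above is truncated here. As a hedged completion, the following minimal sketch reads the binary layout described above (a 4-byte point count followed by one contiguous array per attribute); the assumption that every array is stored as 32-bit little-endian floats is ours, and the actual dtypes and unnormalization constants must be taken from the DevKit documentation.

import numpy as np

file_path = r"E:\3DHD_CityScenes\HD_PointCloud_Tiles\HH_001.bin"
key_list = ['x', 'y', 'z', 'intensity', 'is_ground']

with open(file_path, 'rb') as f:
    # First 4 bytes: number of points contained in this tile (integer)
    num_points = int(np.frombuffer(f.read(4), dtype='<i4')[0])
    pc_dict = {}
    for key in key_list:
        # One contiguous array per attribute; 32-bit floats are an assumption
        pc_dict[key] = np.frombuffer(f.read(4 * num_points), dtype='<f4')

# The values still have to be unnormalized as described above.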
5. Trajectories
We provide 15 real-world trajectories recorded during a measurement campaign covering the whole HD map. Trajectory samples are provided at approximately 30 Hz and are encoded in JSON.
These trajectories were used to provide the samples in train.json, val.json, and test.json with realistic geolocations and orientations of the ego vehicle.
- OP1 – OP5 cover the majority of the map with 5 trajectories.
- RH1 – RH10 cover the majority of the map with 10 trajectories.
Note that OP5 is split into three separate parts, a-c. RH9 is split into two parts, a-b. Moreover, OP4 mostly equals OP1 (thus, we speak of 14 trajectories in our paper). For completeness, however, we provide all recorded trajectories here.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
redspot replay
from redspot import database
from redspot.notebook import Notebook

nbk = Notebook()
for signal in database.get("path-to-db"):
    time, panel, kind, args = signal
    nbk.apply(kind, args)  # apply change
print(nbk)  # print notebook
redspot record
docker run --rm -it -p8888:8888
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Project Gutenberg Temporal Corpus
Repository Updates
02.09.2025: Fixed the unsafe issue in the retrieved contents files. Added the detailed Genres-Super_Genres mapping in the metadata files.
Usage
To use this dataset, we suggest cloning the repository and accessing the files directly. The dataset is organized into several zip files and CSV files, which can be easily extracted and read using standard data processing libraries in Python or other programming… See the full description on the dataset page: https://huggingface.co/datasets/Texttechnologylab/project-gutenberg-temporal-corpus.
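For illustration, a minimal sketch of that workflow (the clone URL comes from the dataset page above; that the repository stores its large files via git-lfs is an assumption typical of Hugging Face dataset repositories):

git lfs install   # large files in Hugging Face repos are usually tracked with git-lfs
git clone https://huggingface.co/datasets/Texttechnologylab/project-gutenberg-temporal-corpus
cd project-gutenberg-temporal-corpus
unzip '*.zip'     # extract the zipped portions; the CSV files can be read directly, e.g. with pandas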
This Python program calculates the Claims-Based Frailty Index (CFI) for each patient from analytic data files containing information on patient identifiers, ICD-9-CM diagnosis codes (version 32), ICD-10-CM diagnosis codes (version 2020), CPT codes, and HCPCS codes. NOTE: When downloading, store "CFI_ICD9CM_V32.tab" and "CFI_ICD10CM_V2020.tab" as CSV files (these files are originally stored as CSV files, but Dataverse automatically converts them to tab files). Please read "Frailty-Index-PYTHON-code-Guide" before proceeding. Interpretation, validation data, and annotated references are provided in "Research Background - Claims-Based Frailty Index".
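As a hedged illustration of that conversion step, assuming Dataverse's automatic .tab export is tab-separated (which is its usual behavior):

import pandas as pd

# Convert the Dataverse .tab exports back to the CSV files the program expects
for name in ["CFI_ICD9CM_V32", "CFI_ICD10CM_V2020"]:
    pd.read_csv(f"{name}.tab", sep="\t").to_csv(f"{name}.csv", index=False)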
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains full-scale visualizations as well as original data and code (in R and Python) to reproduce the figures and tables for "Critical Search." The data includes full-text data for the Hansard debates, and the code employs keyword search, topic modeling, and KL measurement.
The QA4MRE dataset was created for the CLEF 2011/2012/2013 shared tasks to promote research in question answering and reading comprehension. The dataset contains a supporting passage and a set of questions corresponding to the passage. Multiple answer options are provided for each question, of which only one is correct. The training and test datasets are available for the main track. Additional gold-standard documents are available for two pilot studies: one on Alzheimer's data, and the other on entrance-exams data.
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('qa4mre', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the most recent dataset you'll find information about businesses across 8 metropolitan areas in the USA and Canada.
This dataset contains five JSON files and the user agreement. More information about those files can be found here.
In Python, you can read the JSON files like this (using the json and pandas libraries):
import json
import pandas as pd

data_file = open("yelp_academic_dataset_checkin.json")
data = []
for line in data_file:
    data.append(json.loads(line))
checkin_df = pd.DataFrame(data)
data_file.close()
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets generated for the Physical Review E article with title: "Traveling Bubbles and Vortex Pairs within Symmetric 2D Quantum Droplets" by Paredes, Guerra-Carmenate, Salgueiro, Tommasini and Michinel. In particular, we provide the data needed to generate the figures in the publication, which illustrate the numerical results found during this work.
We also include Python code in the file "plot_from_data_for_repository.py" that generates a version of the figures of the paper from the .pt data sets. The data can be read, and plots can be produced, with a simple modification of the Python code.
Figure 1: Data are in fig1.csv
The csv file has four columns separated by commas. The four columns correspond to values of r (first column) and the function psi(r) for the three cases depicted in the figure (columns 2-4).
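For instance, the file might be loaded as follows; whether it contains a header row is not stated, so header=None and the column names are our own placeholders:

import pandas as pd

# Only the column order is documented: r, then psi(r) for the three cases
df = pd.read_csv("fig1.csv", header=None,
                 names=["r", "psi_case1", "psi_case2", "psi_case3"])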
Figures 2 and 4: Data are in data_figs_2_and_4.pt
This is a data file generated with the torch module of Python. It includes eight torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the six eigenstates depicted in figures 2 and 4 ("psia", "psib", "psic", "psid", "psie", "psif"). Notice that figure 2 shows the square of the modulus and figure 4 shows the argument; both are obtained from the same data sets.
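A minimal sketch for inspecting such a .pt file, assuming it stores the named tensors in a dictionary (only the tensor names are documented above; the container layout is our assumption):

import torch

data = torch.load("data_figs_2_and_4.pt")
x, y = data["x"], data["y"]     # spatial grid
psi = data["psia"]              # complex field of one eigenstate
density = psi.abs() ** 2        # figure 2 shows |psi|^2
phase = torch.angle(psi)        # figure 4 shows arg(psi)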
Figure 3: Data are in fig3.csv
The csv file has three columns separated by commas. The three columns correspond to values of momentum p (first column), energy E (second column) and velocity U (third column).
Figure 5: Data are in fig5.csv
The csv file has three columns separated by commas. The three columns correspond to values of momentum p (first column), the minimum value of |psi|^2 (second column) and the value of |psi|^2 at the center (third column).
Figure 6: Data are in data_fig_6.pt
This is a data file generated with the torch module of Python. It includes six torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the four instants of time depicted in figure 6 ("psia", "psib", "psic", "psid").
Figure 7: Data are in data_fig_7.pt
This is a data file generated with the torch module of Python. It includes six torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the four instants of time depicted in figure 7 ("psia", "psib", "psic", "psid").
Figures 8 and 10: Data are in data_figs_8_and_10.pt
This is a data file generated with the torch module of Python. It includes eight torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the six eigenstates depicted in figures 8 and 10 ("psia", "psib", "psic", "psid", "psie", "psif"). Notice that figure 8 shows the square of the modulus and figure 10 shows the argument; both are obtained from the same data sets.
Figure 9: Data are in fig9.csv
The csv file has two columns separated by commas. The two columns correspond to values of momentum p (first column) and energy (second column).
Figure 11: Data are in data_fig_11.pt
This is a data file generated with the torch module of Python. It includes ten torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the two cases, four instants of time for each case, depicted in figure 11 ("psia", "psib", "psic", "psid", "psie", "psif", "psig", "psih").
Figure 12: Data are in data_fig_12.pt
This is a data file generated with the torch module of Python. It includes eight torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the six instants of time depicted in figure 12 ("psia", "psib", "psic", "psid", "psie", "psif").
Figure 13: Data are in data_fig_13.pt
This is a data file generated with the torch module of Python. It includes ten torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the eight instants of time depicted in figure 13 ("psia", "psib", "psic", "psid", "psie", "psif", "psig", "psih").
https://creativecommons.org/publicdomain/zero/1.0/
2,121,458 records
I used Google Colab to check out this dataset and pull the column names using Pandas.
Sample code example: Python Pandas read csv file compressed with gzip and load into Pandas dataframe https://pastexy.com/106/python-pandas-read-csv-file-compressed-with-gzip-and-load-into-pandas-dataframe
Columns: ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 'Complaint ID']
I did not modify the dataset.
Use it to practice with dataframes - Pandas or PySpark on Google Colab:
!unzip complaints.csv.zip
import pandas as pd
df = pd.read_csv('complaints.csv')
df.columns
df.head()  # etc.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data and code in Python for reproducing the results in "A Computational Approach to Urban Space in Science Fiction". (For more information see: https://github.com/federicabologna/thesis_space_scifi)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599
Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599
The folder named “submission” contains the following:
- ijgis.yml: This file lists all the Python libraries and dependencies required to run the code. Use the ijgis.yml file to create a Python project and environment (a sketch of this step is given after this list). Ensure you activate the environment before running the code.
- The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below:
  - … a .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.
  - … .csv files.
  - overlapping_sliding_window_loop.py: plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can uncomment this line.
  - … .csv files in the results folder. This part contains three main code blocks:
    iii. One for the XGBoost code with correct hyperparameter tuning.
    Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2, Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.
  - … a .csv file containing inferred labels.

The data is licensed under CC-BY; the code is licensed under MIT.
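As a hedged illustration of the environment step mentioned above (assuming ijgis.yml is a conda environment file, which its role suggests; the environment name is defined inside the file itself):

conda env create -f ijgis.yml
conda activate <environment-name-from-ijgis.yml>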
Geostationary Operational Environmental Satellite (GOES) Radar Estimation via Machine Learning to Inform NWP (GREMLIN) is a machine learning model that produces composite radar reflectivity using data from the Advanced Baseline Imager (ABI) and Geostationary Lightning Mapper (GLM). GREMLIN is useful for observing severe weather and providing information during convective initialization, especially over regions without ground-based radars. Previous research found good skill compared to ground-based radar products; however, the analysis was done over a dataset with similar climatic and precipitation characteristics as the training dataset: warm season Eastern CONUS in 2019. This study expands the analysis to the entire contiguous United States, during all seasons, and covering the period 2020-2022. Several validation metrics including root-mean-square difference (RMSD), probability of detection (POD), and false alarm ratio (FAR) are plotted over CONUS by season, day-of-year, and time-of-day...

The methodology is described in detail by Hilburn et al. (2021). The ABI, GLM, and MRMS data sets were resampled to a common 3 km grid. A cloud height of 10 km was used for removing parallax displacements. Satellite and radar samples were matched in time with a maximum time difference of 2.5 minutes. GLM lightning groups were accumulated over 15-minute time periods.

Code for reading the data: The data can be read using the provided Python code ("read_conus3_file.py" and "test_read.py") or the provided Fortran code ("gzmodule.f90", "test_read.f90", "compile.sh"). Note that this code reads the data without having to unzip the files. The "test_read.py" and "test_read.f90" use one sample file (ABI_C13_202001010000.bin.gz) to verify the code is reading correctly by checking against seven points within the image and against the minimum and maximum over the full image. This file has been included in the software.tar package. To use the Python code, at the command line simply invoke:

python test_read.py

The function that actually reads a data file, in "read_conus3_file.py", makes use of the gzip module, which is part of the Python Standard Library. If your code is reading correctly, you should see this output:

testfile= ABI_C13_202001010000.bin.gz
min data, expected, isclose= -1e+30 -1e+30 True
max data, expected, isclose= 296.77463 296.77463 True
data[  0, ...
Question generation using the SQuAD dataset, with the data splits described in 'Neural Question Generation from Text: A Preliminary Study' (Zhou et al., 2017) and 'Learning to Ask: Neural Question Generation for Reading Comprehension' (Du et al., 2017).
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('squad_question_generation', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains two datasets for our recent work "Learning Properties of Ordered and Disordered Materials from Multi-fidelity Data". The first dataset is multi-fidelity band gap data for crystals, and the second is a molecular energy dataset for molecules.

1. Multi-fidelity band gap data for crystals

The full band gap data used in the paper is located at band_gap_no_structs.gz. Users can use the following code to extract it:

import gzip
import json

with gzip.open("band_gap_no_structs.gz", "rb") as f:
    data = json.loads(f.read())

data is a dictionary with the following format:

{"pbe": {mp_id: PBE band gap, mp_id: PBE band gap, ...},
 "hse": {mp_id: HSE band gap, mp_id: HSE band gap, ...},
 "gllb-sc": {mp_id: GLLB-SC band gap, mp_id: GLLB-SC band gap, ...},
 "scan": {mp_id: SCAN band gap, mp_id: SCAN band gap, ...},
 "ordered_exp": {icsd_id: Exp band gap, icsd_id: Exp band gap, ...},
 "disordered_exp": {icsd_id: Exp band gap, icsd_id: Exp band gap, ...}}

where mp_id is the Materials Project materials ID for the material, and icsd_id is the ICSD materials ID. For example, the PBE band gap of NaCl (mp-22862, band gap 5.003 eV) can be accessed by data['pbe']['mp-22862']. Note that the Materials Project database is evolving with time; it is possible that a certain ID has been removed in the latest release, and the band gap value for the same material may also have changed. To get the structure that corresponds to a specific material ID in Materials Project, users can use the pymatgen REST API:

1.1. Register at Materials Project (https://www.materialsproject.org) and get an API key.

1.2. In Python, do the following to get the corresponding computational structure:

from pymatgen import MPRester

mpr = MPRester(#Your API Key)
structure = mpr.get_structure_by_material_id(#mp_id)

A dump of all the material IDs and structures for the 2019.04.01 MP version is provided here: https://ndownloader.figshare.com/files/15108200. Users can download the file and extract the material_id and structure from this file for all materials. The structure in this case is a cif string. Users can again use pymatgen to read the cif string and get the structure:

from pymatgen.core import Structure

structure = Structure.from_str(#cif_string, fmt='cif')

For the ICSD structures, users are required to have commercial ICSD access. Hence, the structures will not be provided here.

2. Multi-fidelity molecular energy data

The molecule_data.zip contains two datasets in json format.

2.1. G4MP2.json contains calculation results on QM9 molecules at two fidelities, G4MP2 (6095) and B3LYP (130831):

{"G4MP2": {"U0": {ID: G4MP2 energy (eV), ...},
           "molecules": {ID: Pymatgen molecule dict, ...}},
 "B3LYP": {"U0": {ID: B3LYP energy (eV), ...},
           "molecules": {ID: Pymatgen molecule dict, ...}}}

2.2. qm7b.json contains the molecule energy calculation results for 7211 molecules using HF, MP2 and CCSD(T) methods with 6-31g, sto-3g and cc-pvdz bases:

{"molecules": {ID: Pymatgen molecule dict, ...},
 "targets": {ID: {"HF": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)},
                  "MP2": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)},
                  "CCSD(T)": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)}},
             ...}}
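For example, a single atomization energy can be pulled out of qm7b.json as follows (the molecule ID below is picked arbitrarily, since the actual IDs are not listed here):

import json

with open("qm7b.json") as f:
    qm7b = json.load(f)

some_id = next(iter(qm7b["targets"]))  # an arbitrary molecule ID
# CCSD(T)/cc-pvdz atomization energy (kcal/mol) for that molecule
energy = qm7b["targets"][some_id]["CCSD(T)"]["cc-pvdz"]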
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

Annotation

The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow libraries. The split into train, validation and test set follows the split of the original datasets.

Installation

pip install pandas pyarrow

Example

import pandas as pd

df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])

dataset                                 AudioSet
filename                   train/---2_BBVHAA.mp3
captions_visual      [a man in a black hat and glasses.]
captions_auditory    [a man speaks and dishes clank.]
tags                                    [Speech]

Description

The annotation file consists of the following fields:
- filename: Name of the corresponding file (video or audio file)
- dataset: Source dataset associated with the data point
- captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
- captions_auditory: A list of captions related to the auditory content of the video
- tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

Data files

The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de