Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data input to the simulation were generated by two Python scripts: "GENERATE_SAMPLES.py" and "GENERATE_RESAMPLING_DATA.py".

1. "GENERATE_SAMPLES.py": This Python script generates:

a) "DataSet_n[N]_p[p].pickle", where N is replaced by 500 or 5000 and p is replaced by 2 or 10. This Python object contains:
a1. the explicative variables "X",
a2. the responses "Y",
a3. the knots "knots",
a4. the target tail index parameters "gamma0",
a5. the k different random-state responses "Yk" with k=1,...,100.

To read these data, run the following Python code (take n=5000 and p=10 for example):

import pickle
with open('DataSet_n5000_p10.pickle', 'rb') as handle:
    X = pickle.load(handle)
    Y = pickle.load(handle)
    knots = pickle.load(handle)
    gamma0 = pickle.load(handle)
    Yk = pickle.load(handle)

b) "gridX_p[p].pickle", where p is replaced by 2 or 10. This Python object contains:
b1. the setting points "gridX", which correspond to (x(1)_(m1),...,x(p)_(mp)) in the paper,
b2. "prefactor", which corresponds to \Delta(p)x in the paper,
b3. "gamma0_gridX", which corresponds to gamma0(gridX).

To read these data, run the following Python code (take p=10 for example):

import pickle
with open('gridX_p10.pickle', 'rb') as handle:
    gridX = pickle.load(handle)
    prefactor = pickle.load(handle)
    gamma0_gridX = pickle.load(handle)

2. "GENERATE_RESAMPLING_DATA.py": This Python script generates:

a) "DataSet_Resampling_n[N]_p[p]_w_replacement.pickle", where N is replaced by 500 or 5000 and p is replaced by 2 or 10. This Python object contains:
a1. the resampled explicative variables "X_resample",
a2. the knots "knots",
a3. the resampled k different random-state responses "Y_resample".

To read these data, run the following Python code (take N=5000 and p=10 for example):

import pickle
with open('DataSet_Resampling_n5000_p10_w_replacement.pickle', 'rb') as handle:
    X_resample = pickle.load(handle)
    ignored = pickle.load(handle)  # the knots, not needed here
    Y_resample = pickle.load(handle)
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- **LICENSE** - text of GPL v3, under which this dataset is published
- **INSTALL.md** - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detail levels)
- at least 16Gb of RAM (64 preferable)
- a few hours to a few months of processing time

Step 1 - software
-----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15-30 minutes.

- create a folder `
With system performance on existing reading comprehension benchmarks nearing or surpassing human performance, we need a new, hard dataset that improves systems' capabilities to actually read paragraphs of text. DROP is a crowdsourced, adversarially-created, 96k-question benchmark, in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets.
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('drop', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
This item contains data and code used in experiments that produced the results for Sadler et al. (2022) (see below for full reference). We ran five experiments for the analysis: Experiment A, Experiment B, Experiment C, Experiment D, and Experiment AuxIn. Experiment A tested multi-task learning for predicting streamflow with 25 years of training data and using a different model for each of 101 sites. Experiment B tested multi-task learning for predicting streamflow with 25 years of training data and using a single model for all 101 sites. Experiment C tested multi-task learning for predicting streamflow with just 2 years of training data. Experiment D tested multi-task learning for predicting water temperature with over 25 years of training data. Experiment AuxIn used water temperature as an input variable for predicting streamflow. These experiments and their results are described in detail in the WRR paper. Data from a total of 101 sites across the US were used for the experiments. The model input data and streamflow data were from the Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) dataset (Newman et al. 2014; Addor et al. 2017). The water temperature data were gathered from the National Water Information System (NWIS) (U.S. Geological Survey, 2016). The contents of this item are broken into 13 files or groups of files aggregated into zip files:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
3DHD CityScenes is the most comprehensive, large-scale high-definition (HD) map dataset to date, annotated in the three spatial dimensions of globally referenced, high-density LiDAR point clouds collected in urban domains. Our HD map covers 127 km of road sections of the inner city of Hamburg, Germany, including 467 km of individual lanes. In total, our map comprises 266,762 individual items.
Our corresponding paper (published at ITSC 2022) is available here.
Further, we have applied 3DHD CityScenes to map deviation detection here.
Moreover, we release code to facilitate the application of our dataset and the reproducibility of our research. Specifically, our 3DHD_DevKit comprises:
The DevKit is available here:
https://github.com/volkswagen/3DHD_devkit.
The dataset and DevKit have been created by Christopher Plachetka as project lead during his PhD period at Volkswagen Group, Germany.
When using our dataset, you are welcome to cite:
@INPROCEEDINGS{9921866,
  author={Plachetka, Christopher and Sertolli, Benjamin and Fricke, Jenny and Klingner, Marvin and Fingscheidt, Tim},
  booktitle={2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC)},
  title={3DHD CityScenes: High-Definition Maps in High-Density Point Clouds},
  year={2022},
  pages={627-634}}
Acknowledgements
We thank the following interns for their exceptional contributions to our work.
The European large-scale project Hi-Drive (www.Hi-Drive.eu) supports the publication of 3DHD CityScenes and encourages the general publication of information and databases facilitating the development of automated driving technologies.
The Dataset
After downloading, the 3DHD_CityScenes folder provides five subdirectories, which are explained briefly in the following.
1. Dataset
This directory contains the training, validation, and test set definition (train.json, val.json, test.json) used in our publications. Respective files contain samples that define a geolocation and the orientation of the ego vehicle in global coordinates on the map.
During dataset generation (done by our DevKit), samples are used to take crops from the larger point cloud. Also, map elements in reach of a sample are collected. Both modalities can then be used, e.g., as input to a neural network such as our 3DHDNet.
To read any JSON-encoded data provided by 3DHD CityScenes in Python, you can use the following code snippet as an example.
import json

json_path = r"E:\3DHD_CityScenes\Dataset\train.json"
with open(json_path) as jf:
    data = json.load(jf)
print(data)
2. HD_Map
Map items are stored as lists of items in JSON format. In particular, we provide:
3. HD_Map_MetaData
Our high-density point cloud used as the basis for annotating the HD map is split into 648 tiles. This directory contains the geolocation of each tile as a polygon on the map. You can view the respective tile definitions using QGIS. Alternatively, we also provide the respective polygons as lists of UTM coordinates in JSON.
Files with the ending .dbf, .prj, .qpj, .shp, and .shx belong to the tile definition as “shape file” (commonly used in geodesy) that can be viewed using QGIS. The JSON file contains the same information provided in a different format used in our Python API.
4. HD_PointCloud_Tiles
The high-density point cloud tiles are provided in global UTM32N coordinates and are encoded in a proprietary binary format. The first 4 bytes (integer) encode the number of points contained in that file. Subsequently, all point cloud values are provided as arrays. First all x-values, then all y-values, and so on. Specifically, the arrays are encoded as follows.
After reading, the respective values have to be unnormalized. As an example, you can use the following code snippet to read the point cloud data. For visualization, you can use the pptk package, for instance.
import numpy as np
import pptk
file_path = r"E:\3DHD_CityScenes\HD_PointCloud_Tiles\HH_001.bin"
pc_dict = {}
key_list = ['x', 'y', 'z', 'intensity', 'is_ground']
type_list = ['
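The snippet above is truncated here. As a hedged completion, the following minimal sketch reads the binary layout described above (a 4-byte point count followed by one contiguous array per attribute); the assumption that every array is stored as 32-bit little-endian floats is ours, and the actual dtypes and unnormalization constants must be taken from the DevKit documentation.

import numpy as np

file_path = r"E:\3DHD_CityScenes\HD_PointCloud_Tiles\HH_001.bin"
key_list = ['x', 'y', 'z', 'intensity', 'is_ground']

with open(file_path, 'rb') as f:
    # First 4 bytes: number of points contained in this tile (integer)
    num_points = int(np.frombuffer(f.read(4), dtype='<i4')[0])
    pc_dict = {}
    for key in key_list:
        # One contiguous array per attribute; 32-bit floats are an assumption
        pc_dict[key] = np.frombuffer(f.read(4 * num_points), dtype='<f4')

# The values still have to be unnormalized as described above.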
5. Trajectories
We provide 15 real-world trajectories recorded during a measurement campaign covering the whole HD map. Trajectory samples are provided at approximately 30 Hz and are encoded in JSON.
These trajectories were used to provide the samples in train.json, val.json, and test.json with realistic geolocations and orientations of the ego vehicle.
- OP1 – OP5 cover the majority of the map with 5 trajectories.
- RH1 – RH10 cover the majority of the map with 10 trajectories.
Note that OP5 is split into three separate parts, a-c. RH9 is split into two parts, a-b. Moreover, OP4 mostly equals OP1 (thus, we speak of 14 trajectories in our paper). For completeness, however, we provide all recorded trajectories here.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
redspot replay
from redspot import database
from redspot.notebook import Notebook

nbk = Notebook()
for signal in database.get("path-to-db"):
    time, panel, kind, args = signal
    nbk.apply(kind, args)  # apply change
print(nbk)  # print notebook
redspot record
docker run --rm -it -p8888:8888
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Project Gutenberg Temporal Corpus
Repository Updates
02.09.2025: Fixed the unsafe issue in the retrieved contents files. Added the detailed Genres-Super_Genres mapping in the metadata files.
Usage
To use this dataset, we suggest cloning the repository and accessing the files directly. The dataset is organized into several zip files and CSV files, which can be easily extracted and read using standard data processing libraries in Python or other programming… See the full description on the dataset page: https://huggingface.co/datasets/Texttechnologylab/project-gutenberg-temporal-corpus.
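For illustration, a minimal sketch of that workflow (the clone URL comes from the dataset page above; that the repository stores its large files via git-lfs is an assumption typical of Hugging Face dataset repositories):

git lfs install   # large files in Hugging Face repos are usually tracked with git-lfs
git clone https://huggingface.co/datasets/Texttechnologylab/project-gutenberg-temporal-corpus
cd project-gutenberg-temporal-corpus
unzip '*.zip'     # extract the zipped portions; the CSV files can be read directly, e.g. with pandas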
This Python program calculates the Claims-Based Frailty Index (CFI) for each patient from analytic data files containing information on patient identifiers, ICD-9-CM diagnosis codes (version 32), ICD-10-CM diagnosis codes (version 2020), CPT codes, and HCPCS codes. NOTE: When downloading, store "CFI_ICD9CM_V32.tab" and "CFI_ICD10CM_V2020.tab" as CSV files (these files are originally stored as CSV files, but Dataverse automatically converts them to tab files). Please read "Frailty-Index-PYTHON-code-Guide" before proceeding. Interpretation, validation data, and annotated references are provided in "Research Background - Claims-Based Frailty Index".
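As a hedged illustration of that conversion step, assuming Dataverse's automatic .tab export is tab-separated (which is its usual behavior):

import pandas as pd

# Convert the Dataverse .tab exports back to the CSV files the program expects
for name in ["CFI_ICD9CM_V32", "CFI_ICD10CM_V2020"]:
    pd.read_csv(f"{name}.tab", sep="\t").to_csv(f"{name}.csv", index=False)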
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains full-scale visualizations as well as original data and code (in R and Python) to reproduce the figures and tables for "Critical Search." The data includes full-text data for the Hansard debates, and the code employs keyword search, topic modeling, and KL measurement.
The QA4MRE dataset was created for the CLEF 2011/2012/2013 shared tasks to promote research in question answering and reading comprehension. The dataset contains a supporting passage and a set of questions corresponding to the passage. Multiple answer options are provided for each question, of which only one is correct. The training and test datasets are available for the main track. Additional gold-standard documents are available for two pilot studies: one on Alzheimer's data, and the other on entrance-exams data.
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('qa4mre', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the most recent dataset you'll find information about businesses across 8 metropolitan areas in the USA and Canada.
This dataset contains five JSON files and the user agreement. More information about those files can be found here.
In Python, you can read the JSON files like this (using the json and pandas libraries):
import json
import pandas as pd

data_file = open("yelp_academic_dataset_checkin.json")
data = []
for line in data_file:
    data.append(json.loads(line))
checkin_df = pd.DataFrame(data)
data_file.close()
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets generated for the Physical Review E article with title: "Traveling Bubbles and Vortex Pairs within Symmetric 2D Quantum Droplets" by Paredes, Guerra-Carmenate, Salgueiro, Tommasini and Michinel. In particular, we provide the data needed to generate the figures in the publication, which illustrate the numerical results found during this work.
We also include Python code in the file "plot_from_data_for_repository.py" that generates a version of the figures of the paper from the .pt data sets. The data can be read, and plots can be produced, with a simple modification of the Python code.
Figure 1: Data are in fig1.csv
The csv file has four columns separated by commas. The four columns correspond to values of r (first column) and the function psi(r) for the three cases depicted in the figure (columns 2-4).
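For instance, the file might be loaded as follows; whether it contains a header row is not stated, so header=None and the column names are our own placeholders:

import pandas as pd

# Only the column order is documented: r, then psi(r) for the three cases
df = pd.read_csv("fig1.csv", header=None,
                 names=["r", "psi_case1", "psi_case2", "psi_case3"])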
Figures 2 and 4: Data are in data_figs_2_and_4.pt
This is a data file generated with the torch module of Python. It includes eight torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the six eigenstates depicted in figures 2 and 4 ("psia", "psib", "psic", "psid", "psie", "psif"). Notice that figure 2 shows the square of the modulus and figure 4 shows the argument; both are obtained from the same data sets.
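A minimal sketch for inspecting such a .pt file, assuming it stores the named tensors in a dictionary (only the tensor names are documented above; the container layout is our assumption):

import torch

data = torch.load("data_figs_2_and_4.pt")
x, y = data["x"], data["y"]     # spatial grid
psi = data["psia"]              # complex field of one eigenstate
density = psi.abs() ** 2        # figure 2 shows |psi|^2
phase = torch.angle(psi)        # figure 4 shows arg(psi)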
Figure 3: Data are in fig3.csv
The csv file has three columns separated by commas. The three columns correspond to values of momentum p (first column), energy E (second column) and velocity U (third column).
Figure 5: Data are in fig5.csv
The csv file has three columns separated by commas. The three columns correspond to values of momentum p (first column), the minimum value of |psi|^2 (second column) and the value of |psi|^2 at the center (third column).
Figure 6: Data are in data_fig_6.pt
This is a data file generated with the torch module of Python. It includes six torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the four instants of time depicted in figure 6 ("psia", "psib", "psic", "psid").
Figure 7: Data are in data_fig_7.pt
This is a data file generated with the torch module of Python. It includes six torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the four instants of time depicted in figure 7 ("psia", "psib", "psic", "psid").
Figures 8 and 10: Data are in data_figs_8_and_10.pt
This is a data file generated with the torch module of Python. It includes eight torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the six eigenstates depicted in figures 8 and 10 ("psia", "psib", "psic", "psid", "psie", "psif"). Notice that figure 8 shows the square of the modulus and figure 10 shows the argument; both are obtained from the same data sets.
Figure 9: Data are in fig9.csv
The csv file has two columns separated by commas. The two columns correspond to values of momentum p (first column) and energy (second column).
Figure 11: Data are in data_fig_11.pt
This is a data file generated with the torch module of Python. It includes ten torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the two cases, four instants of time for each case, depicted in figure 11 ("psia", "psib", "psic", "psid", "psie", "psif", "psig", "psih").
Figure 12: Data are in data_fig_12.pt
This is a data file generated with the torch module of Python. It includes eight torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the six instants of time depicted in figure 12 ("psia", "psib", "psic", "psid", "psie", "psif").
Figure 13: Data are in data_fig_13.pt
This is a data file generated with the torch module of Python. It includes ten torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the eight instants of time depicted in figure 13 ("psia", "psib", "psic", "psid", "psie", "psif", "psig", "psih").
https://creativecommons.org/publicdomain/zero/1.0/
2,121,458 records
I used Google Colab to check out this dataset and pull the column names using Pandas.
Sample code example: Python Pandas read csv file compressed with gzip and load into Pandas dataframe https://pastexy.com/106/python-pandas-read-csv-file-compressed-with-gzip-and-load-into-pandas-dataframe
Columns: ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 'Complaint ID']
I did not modify the dataset.
Use it to practice with dataframes - Pandas or PySpark on Google Colab:
!unzip complaints.csv.zip
import pandas as pd
df = pd.read_csv('complaints.csv')
df.columns
df.head()  # etc.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data and code in Python for reproducing the results in "A Computational Approach to Urban Space in Science Fiction". (For more information see: https://github.com/federicabologna/thesis_space_scifi)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599
Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599
The folder named “submission” contains the following:
- ijgis.yml: This file lists all the Python libraries and dependencies required to run the code. Use the ijgis.yml file to create a Python project and environment (a sketch of this step is given after this list). Ensure you activate the environment before running the code.
- The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below:
  - … a .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.
  - … .csv files.
  - overlapping_sliding_window_loop.py: plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can uncomment this line.
  - … .csv files in the results folder. This part contains three main code blocks:
    iii. One for the XGBoost code with correct hyperparameter tuning.
    Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2, Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.
  - … a .csv file containing inferred labels.

The data is licensed under CC-BY; the code is licensed under MIT.
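As a hedged illustration of the environment step mentioned above (assuming ijgis.yml is a conda environment file, which its role suggests; the environment name is defined inside the file itself):

conda env create -f ijgis.yml
conda activate <environment-name-from-ijgis.yml>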
Geostationary Operational Environmental Satellite (GOES) Radar Estimation via Machine Learning to Inform NWP (GREMLIN) is a machine learning model that produces composite radar reflectivity using data from the Advanced Baseline Imager (ABI) and Geostationary Lightning Mapper (GLM). GREMLIN is useful for observing severe weather and providing information during convective initialization, especially over regions without ground-based radars. Previous research found good skill compared to ground-based radar products; however, the analysis was done over a dataset with similar climatic and precipitation characteristics as the training dataset: warm season Eastern CONUS in 2019. This study expands the analysis to the entire contiguous United States, during all seasons, and covering the period 2020-2022. Several validation metrics including root-mean-square difference (RMSD), probability of detection (POD), and false alarm ratio (FAR) are plotted over CONUS by season, day-of-year, and time-of-day...

The methodology is described in detail by Hilburn et al. (2021). The ABI, GLM, and MRMS data sets were resampled to a common 3 km grid. A cloud height of 10 km was used for removing parallax displacements. Satellite and radar samples were matched in time with a maximum time difference of 2.5 minutes. GLM lightning groups were accumulated over 15-minute time periods.

Code for reading the data: The data can be read using the provided Python code ("read_conus3_file.py" and "test_read.py") or the provided Fortran code ("gzmodule.f90", "test_read.f90", "compile.sh"). Note that this code reads the data without having to unzip the files. The "test_read.py" and "test_read.f90" use one sample file (ABI_C13_202001010000.bin.gz) to verify the code is reading correctly by checking against seven points within the image and against the minimum and maximum over the full image. This file has been included in the software.tar package. To use the Python code, at the command line simply invoke:

python test_read.py

The function that actually reads a data file, in "read_conus3_file.py", makes use of the gzip module, which is part of the Python Standard Library. If your code is reading correctly, you should see this output:

testfile= ABI_C13_202001010000.bin.gz
min data, expected, isclose= -1e+30 -1e+30 True
max data, expected, isclose= 296.77463 296.77463 True
data[  0, ...
Question generation using the SQuAD dataset, with the data splits described in 'Neural Question Generation from Text: A Preliminary Study' (Zhou et al., 2017) and 'Learning to Ask: Neural Question Generation for Reading Comprehension' (Du et al., 2017).
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('squad_question_generation', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains two datasets for our recent work "Learning Properties of Ordered and Disordered Materials from Multi-fidelity Data". The first dataset is multi-fidelity band gap data for crystals, and the second is a molecular energy dataset for molecules.

1. Multi-fidelity band gap data for crystals

The full band gap data used in the paper is located at band_gap_no_structs.gz. Users can use the following code to extract it:

import gzip
import json

with gzip.open("band_gap_no_structs.gz", "rb") as f:
    data = json.loads(f.read())

data is a dictionary with the following format:

{"pbe": {mp_id: PBE band gap, mp_id: PBE band gap, ...},
 "hse": {mp_id: HSE band gap, mp_id: HSE band gap, ...},
 "gllb-sc": {mp_id: GLLB-SC band gap, mp_id: GLLB-SC band gap, ...},
 "scan": {mp_id: SCAN band gap, mp_id: SCAN band gap, ...},
 "ordered_exp": {icsd_id: Exp band gap, icsd_id: Exp band gap, ...},
 "disordered_exp": {icsd_id: Exp band gap, icsd_id: Exp band gap, ...}}

where mp_id is the Materials Project materials ID for the material, and icsd_id is the ICSD materials ID. For example, the PBE band gap of NaCl (mp-22862, band gap 5.003 eV) can be accessed by data['pbe']['mp-22862']. Note that the Materials Project database is evolving with time; it is possible that a certain ID has been removed in the latest release, and the band gap value for the same material may also have changed. To get the structure that corresponds to a specific material ID in Materials Project, users can use the pymatgen REST API:

1.1. Register at Materials Project (https://www.materialsproject.org) and get an API key.

1.2. In Python, do the following to get the corresponding computational structure:

from pymatgen import MPRester

mpr = MPRester(#Your API Key)
structure = mpr.get_structure_by_material_id(#mp_id)

A dump of all the material IDs and structures for the 2019.04.01 MP version is provided here: https://ndownloader.figshare.com/files/15108200. Users can download the file and extract the material_id and structure from this file for all materials. The structure in this case is a cif string. Users can again use pymatgen to read the cif string and get the structure:

from pymatgen.core import Structure

structure = Structure.from_str(#cif_string, fmt='cif')

For the ICSD structures, users are required to have commercial ICSD access. Hence, the structures will not be provided here.

2. Multi-fidelity molecular energy data

The molecule_data.zip contains two datasets in json format.

2.1. G4MP2.json contains calculation results on QM9 molecules at two fidelities, G4MP2 (6095) and B3LYP (130831):

{"G4MP2": {"U0": {ID: G4MP2 energy (eV), ...},
           "molecules": {ID: Pymatgen molecule dict, ...}},
 "B3LYP": {"U0": {ID: B3LYP energy (eV), ...},
           "molecules": {ID: Pymatgen molecule dict, ...}}}

2.2. qm7b.json contains the molecule energy calculation results for 7211 molecules using HF, MP2 and CCSD(T) methods with 6-31g, sto-3g and cc-pvdz bases:

{"molecules": {ID: Pymatgen molecule dict, ...},
 "targets": {ID: {"HF": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)},
                  "MP2": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)},
                  "CCSD(T)": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)}},
             ...}}
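For example, a single atomization energy can be pulled out of qm7b.json as follows (the molecule ID below is picked arbitrarily, since the actual IDs are not listed here):

import json

with open("qm7b.json") as f:
    qm7b = json.load(f)

some_id = next(iter(qm7b["targets"]))  # an arbitrary molecule ID
# CCSD(T)/cc-pvdz atomization energy (kcal/mol) for that molecule
energy = qm7b["targets"][some_id]["CCSD(T)"]["cc-pvdz"]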
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

Annotation

The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow libraries. The split into train, validation and test set follows the split of the original datasets.

Installation

pip install pandas pyarrow

Example

import pandas as pd

df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])

dataset                                 AudioSet
filename                   train/---2_BBVHAA.mp3
captions_visual      [a man in a black hat and glasses.]
captions_auditory    [a man speaks and dishes clank.]
tags                                    [Speech]

Description

The annotation file consists of the following fields:
- filename: Name of the corresponding file (video or audio file)
- dataset: Source dataset associated with the data point
- captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
- captions_auditory: A list of captions related to the auditory content of the video
- tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

Data files

The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de