100+ datasets found
  1. All_files_dataset

    • figshare.com
    bin
    Updated Apr 21, 2020
    Cite
    Quang Dien Duong (2020). All_files_dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12164295.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Apr 21, 2020
    Dataset provided by
    figshare
    Authors
    Quang Dien Duong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data inputted in the simulation were generated by two Python scripts: "GENERATE_SAMPLES.py" and "GENERATE_RESAMPLING_DATA.py".

    1. "GENERATE_SAMPLES.py": this script generates:

    a) "DataSet_n[N]_p[p].pickle", where N is replaced by 500 or 5000 and p is replaced by 2 or 10. This Python object contains:
      a1. the explicative variables "X",
      a2. the responses "Y",
      a3. the knots "knots",
      a4. the target tail index parameters "gamma0",
      a5. the k different random-state responses "Yk" with k=1,...,100.

    To read these data, run the following Python code (taking n=5000 and p=10 for example):

    import pickle

    with open('DataSet_n5000_p10.pickle', 'rb') as handle:
        X = pickle.load(handle)
        Y = pickle.load(handle)
        knots = pickle.load(handle)
        gamma0 = pickle.load(handle)
        Yk = pickle.load(handle)

    b) "gridX_p[p].pickle", where p is replaced by 2 or 10. This Python object contains:
      b1. the setting points "gridX", which correspond to (x(1)_(m1),...,x(p)_(mp)) in the paper,
      b2. "prefactor", which corresponds to \Delta(p)x in the paper,
      b3. "gamma0_gridX", which corresponds to gamma0(gridX).

    To read these data, run the following Python code (taking p=10 for example):

    import pickle

    with open('gridX_p10.pickle', 'rb') as handle:
        gridX = pickle.load(handle)
        prefactor = pickle.load(handle)
        gamma0_gridX = pickle.load(handle)

    2. "GENERATE_RESAMPLING_DATA.py": this script generates:

    a) "DataSet_Resampling_n[N]_p[p]_w_replacement.pickle", where N is replaced by 500 or 5000 and p is replaced by 2 or 10. This Python object contains:
      a1. the resampling explicative variables "X_resample",
      a2. the knots "knots",
      a3. the resampling k-different random-state responses "Y_resample".

    To read these data, run the following Python code (taking N=5000 and p=10 for example):

    import pickle

    with open('DataSet_Resampling_n5000_p10_w_replacement.pickle', 'rb') as handle:
        X_resample = pickle.load(handle)
        ignored = pickle.load(handle)
        Y_resample = pickle.load(handle)

  2. Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem - the dataset [Dataset]. http://doi.org/10.5281/zenodo.1297925
    Explore at:
    Available download formats: text/x-python, zip, bin, application/gzip
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2TB of disk space (see Step 2 detail levels)
    - at least 16GB of RAM (64GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
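
    A quick way to sanity-check the extracted file before moving on (a minimal sketch; only pandas is assumed, and no particular column names are relied upon):

      import pandas as pd

      # load the precomputed survival table shipped with the minimal dataset
      survival_data = pd.read_csv('survival_data.csv')
      print(survival_data.shape)   # rows x columns
      print(survival_data.columns)  # inspect available features before the R modeling step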
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  3. drop

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). drop [Dataset]. https://www.tensorflow.org/datasets/catalog/drop
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    With system performance on existing reading comprehension benchmarks nearing or surpassing human performance, we need a new, hard dataset that improves systems' capabilities to actually read paragraphs of text. DROP is a crowdsourced, adversarially-created, 96k-question benchmark, in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('drop', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  4. Multi-task Deep Learning for Water Temperature and Streamflow Prediction...

    • catalog.data.gov
    Updated Sep 30, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Multi-task Deep Learning for Water Temperature and Streamflow Prediction (ver. 1.1, June 2022) [Dataset]. https://catalog.data.gov/dataset/multi-task-deep-learning-for-water-temperature-and-streamflow-prediction-ver-1-1-june-2022
    Explore at:
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This item contains data and code used in experiments that produced the results for Sadler et al. (2022) (see below for full reference). We ran five experiments for the analysis: Experiment A, Experiment B, Experiment C, Experiment D, and Experiment AuxIn. Experiment A tested multi-task learning for predicting streamflow with 25 years of training data and using a different model for each of 101 sites. Experiment B tested multi-task learning for predicting streamflow with 25 years of training data and using a single model for all 101 sites. Experiment C tested multi-task learning for predicting streamflow with just 2 years of training data. Experiment D tested multi-task learning for predicting water temperature with over 25 years of training data. Experiment AuxIn used water temperature as an input variable for predicting streamflow. These experiments and their results are described in detail in the WRR paper. Data from a total of 101 sites across the US were used for the experiments. The model input data and streamflow data were from the Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) dataset (Newman et al. 2014; Addor et al. 2017). The water temperature data were gathered from the National Water Information System (NWIS) (U.S. Geological Survey, 2016). The contents of this item are broken into 13 files or groups of files aggregated into zip files:

    1. input_data_processing.zip: A zip file containing the scripts used to collate the observations, input weather drivers, and catchment attributes for the multi-task modeling experiments
    2. flow_observations.zip: A zip file containing collated daily streamflow data for the sites used in multi-task modeling experiments. The streamflow data were originally accessed from the CAMELs dataset. The data are stored in csv and Zarr formats.
    3. temperature_observations.zip: A zip file containing collated daily water temperature data for the sites used in multi-task modeling experiments. The data were originally accessed via NWIS. The data are stored in csv and Zarr formats.
    4. temperature_sites.geojson: Geojson file of the locations of the water temperature and streamflow sites used in the analysis.
    5. model_drivers.zip: A zip file containing the daily input weather driver data for the multi-task deep learning models. These data are from the Daymet drivers and were collated from the CAMELS dataset. The data are stored in csv and Zarr formats.
    6. catchment_attrs.csv: Catchment attributes collated from the CAMELS dataset. These data are used for the Random Forest modeling. For full metadata regarding these data, see the CAMELS dataset.
    7. experiment_workflow_files.zip: A zip file containing workflow definitions used to run multi-task deep learning experiments. These are Snakemake workflows. To run a given experiment, one would run (for experiment A) 'snakemake -s expA_Snakefile --configfile expA_config.yml'
    8. river-dl-paper_v0.zip: A zip file containing python code used to run multi-task deep learning experiments. This code was called by the Snakemake workflows contained in 'experiment_workflow_files.zip'.
    9. random_forest_scripts.zip: A zip file containing Python code and a Python Jupyter Notebook used to prepare data for, train, and visualize feature importance of a Random Forest model.
    10. plotting_code.zip: A zip file containing python code and Snakemake workflow used to produce figures showing the results of multi-task deep learning experiments.
    11. results.zip: A zip file containing results of multi-task deep learning experiments. The results are stored in csv and netcdf formats. The netcdf files were used by the plotting libraries in 'plotting_code.zip'. These files are for five experiments, 'A', 'B', 'C', 'D', and 'AuxIn'. These experiment names are shown in the file name.
    12. sample_scripts.zip: A zip file containing scripts for creating sample output to demonstrate how the modeling workflow was executed.
    13. sample_output.zip: A zip file containing sample output data. Similar files are created by running the sample scripts provided.
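
    The collated observation and driver files (items 2, 3, and 5 above) are stored in csv and Zarr formats. As an illustration, here is a minimal sketch of opening both forms after extracting a zip archive; the file and store names below are placeholders, so check the actual names inside the extracted folders:

      import pandas as pd
      import xarray as xr  # requires the zarr package for open_zarr

      # csv form of the collated daily streamflow observations (placeholder path)
      flow_df = pd.read_csv("flow_observations/flow_observations.csv")
      print(flow_df.head())

      # the same observations in Zarr form can be opened lazily with xarray (placeholder store)
      flow_ds = xr.open_zarr("flow_observations/flow_observations.zarr")
      print(flow_ds)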
    A. Newman; K. Sampson; M. P. Clark; A. Bock; R. J. Viger; D. Blodgett, 2014. A large-sample watershed-scale hydrometeorological dataset for the contiguous USA. Boulder, CO: UCAR/NCAR. https://dx.doi.org/10.5065/D6MW2F4D

    N. Addor, A. Newman, M. Mizukami, and M. P. Clark, 2017. Catchment attributes for large-sample studies. Boulder, CO: UCAR/NCAR. https://doi.org/10.5065/D6G73C3Q

    Sadler, J. M., Appling, A. P., Read, J. S., Oliver, S. K., Jia, X., Zwart, J. A., & Kumar, V. (2022). Multi-Task Deep Learning of Daily Streamflow and Water Temperature. Water Resources Research, 58(4), e2021WR030138. https://doi.org/10.1029/2021WR030138

    U.S. Geological Survey, 2016, National Water Information System data available on the World Wide Web (USGS Water Data for the Nation), accessed Dec. 2020.

  5. Data from: 3DHD CityScenes: High-Definition Maps in High-Density Point...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    bin, pdf
    Updated Jul 16, 2024
    + more versions
    Cite
    Christopher Plachetka; Benjamin Sertolli; Jenny Fricke; Marvin Klingner; Tim Fingscheidt; Christopher Plachetka; Benjamin Sertolli; Jenny Fricke; Marvin Klingner; Tim Fingscheidt (2024). 3DHD CityScenes: High-Definition Maps in High-Density Point Clouds [Dataset]. http://doi.org/10.5281/zenodo.7085090
    Explore at:
    Available download formats: bin, pdf
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christopher Plachetka; Benjamin Sertolli; Jenny Fricke; Marvin Klingner; Tim Fingscheidt; Christopher Plachetka; Benjamin Sertolli; Jenny Fricke; Marvin Klingner; Tim Fingscheidt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    3DHD CityScenes is the most comprehensive, large-scale high-definition (HD) map dataset to date, annotated in the three spatial dimensions of globally referenced, high-density LiDAR point clouds collected in urban domains. Our HD map covers 127 km of road sections of the inner city of Hamburg, Germany including 467 km of individual lanes. In total, our map comprises 266,762 individual items.

    Our corresponding paper (published at ITSC 2022) is available here.
    Further, we have applied 3DHD CityScenes to map deviation detection here.

    Moreover, we release code to facilitate the application of our dataset and the reproducibility of our research. Specifically, our 3DHD_DevKit comprises:

    • Python tools to read, generate, and visualize the dataset,
    • 3DHDNet deep learning pipeline (training, inference, evaluation) for
      map deviation detection and 3D object detection.

    The DevKit is available here:

    https://github.com/volkswagen/3DHD_devkit.

    The dataset and DevKit have been created by Christopher Plachetka as project lead during his PhD period at Volkswagen Group, Germany.

    When using our dataset, you are welcome to cite:

    @INPROCEEDINGS{9921866,
      author={Plachetka, Christopher and Sertolli, Benjamin and Fricke, Jenny and Klingner, Marvin and 
      Fingscheidt, Tim},
      booktitle={2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC)}, 
      title={3DHD CityScenes: High-Definition Maps in High-Density Point Clouds}, 
      year={2022},
      pages={627-634}}

    Acknowledgements

    We thank the following interns for their exceptional contributions to our work.

    • Benjamin Sertolli: Major contributions to our DevKit during his master thesis
    • Niels Maier: Measurement campaign for data collection and data preparation

    The European large-scale project Hi-Drive (www.Hi-Drive.eu) supports the publication of 3DHD CityScenes and encourages the general publication of information and databases facilitating the development of automated driving technologies.

    The Dataset

    After downloading, the 3DHD_CityScenes folder provides five subdirectories, which are explained briefly in the following.

    1. Dataset

    This directory contains the training, validation, and test set definition (train.json, val.json, test.json) used in our publications. Respective files contain samples that define a geolocation and the orientation of the ego vehicle in global coordinates on the map.

    During dataset generation (done by our DevKit), samples are used to take crops from the larger point cloud. Also, map elements in reach of a sample are collected. Both modalities can then be used, e.g., as input to a neural network such as our 3DHDNet.

    To read any JSON-encoded data provided by 3DHD CityScenes in Python, you can use the following code snippet as an example.

    import json
    
    json_path = r"E:\3DHD_CityScenes\Dataset\train.json"
    with open(json_path) as jf:
      data = json.load(jf)
    print(data)

    2. HD_Map

    Map items are stored as lists of items in JSON format. In particular, we provide:

    • traffic signs,
    • traffic lights,
    • pole-like objects,
    • construction site locations,
    • construction site obstacles (point-like such as cones, and line-like such as fences),
    • line-shaped markings (solid, dashed, etc.),
    • polygon-shaped markings (arrows, stop lines, symbols, etc.),
    • lanes (ordinary and temporary),
    • relations between elements (only for construction sites, e.g., sign to lane association).

    3. HD_Map_MetaData

    Our high-density point cloud used as the basis for annotating the HD map is split into 648 tiles. This directory contains the geolocation of each tile as a polygon on the map. You can view the respective tile definitions using QGIS. Alternatively, we also provide the respective polygons as lists of UTM coordinates in JSON.

    Files with the ending .dbf, .prj, .qpj, .shp, and .shx belong to the tile definition as “shape file” (commonly used in geodesy) that can be viewed using QGIS. The JSON file contains the same information provided in a different format used in our Python API.

    4. HD_PointCloud_Tiles

    The high-density point cloud tiles are provided in global UTM32N coordinates and are encoded in a proprietary binary format. The first 4 bytes (integer) encode the number of points contained in that file. Subsequently, all point cloud values are provided as arrays. First all x-values, then all y-values, and so on. Specifically, the arrays are encoded as follows.

    • x-coordinates: 4 byte integer
    • y-coordinates: 4 byte integer
    • z-coordinates: 4 byte integer
    • intensity of reflected beams: 2 byte unsigned integer
    • ground classification flag: 1 byte unsigned integer

    After reading, the respective values have to be unnormalized. As an example, you can use the following code snippet to read the point cloud data. For visualization, you can use the pptk package, for instance.

    import numpy as np
    import pptk

    file_path = r"E:\3DHD_CityScenes\HD_PointCloud_Tiles\HH_001.bin"
    pc_dict = {}
    key_list = ['x', 'y', 'z', 'intensity', 'is_ground']
    # dtypes follow the format description above: 4-byte ints for x/y/z,
    # 2-byte unsigned int for intensity, 1-byte unsigned int for the ground flag
    type_list = [np.int32, np.int32, np.int32, np.uint16, np.uint8]
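
    Building on the snippet above (it reuses file_path, key_list, type_list, and pc_dict), a minimal reading sketch that follows the format description; the per-attribute unnormalization factors are not specified here, so the values remain the raw stored integers, and the DevKit should be consulted for the exact scaling:

    with open(file_path, 'rb') as f:
        # first 4 bytes: number of points in this tile
        num_points = int(np.frombuffer(f.read(4), dtype=np.int32)[0])
        for key, dtype in zip(key_list, type_list):
            n_bytes = num_points * np.dtype(dtype).itemsize
            # one full array per attribute: all x, then all y, and so on
            pc_dict[key] = np.frombuffer(f.read(n_bytes), dtype=dtype)

    # visualize the (still normalized) coordinates with pptk
    xyz = np.stack([pc_dict['x'], pc_dict['y'], pc_dict['z']], axis=1).astype(np.float64)
    viewer = pptk.viewer(xyz)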

    5. Trajectories

    We provide 15 real-world trajectories recorded during a measurement campaign covering the whole HD map. Trajectory samples are provided at approx. 30 Hz and are encoded in JSON.

    These trajectories were used to provide the samples in train.json, val.json, and test.json with realistic geolocations and orientations of the ego vehicle.

    • OP1 – OP5 cover the majority of the map with 5 trajectories.
    • RH1 – RH10 cover the majority of the map with 10 trajectories.

    Note that OP5 is split into three separate parts, a-c. RH9 is split into two parts, a-b. Moreover, OP4 mostly equals OP1 (thus, we speak of 14 trajectories in our paper). For completeness, however, we provide all recorded trajectories here.

  6. Jupyter Notebook Activity Dataset (rsds-20241113)

    • zenodo.org
    application/gzip, zip
    Updated Jan 18, 2025
    Cite
    Tomoki Nakamaru; Tomoki Nakamaru; Tomomasa Matsunaga; Tetsuro Yamazaki; Tomomasa Matsunaga; Tetsuro Yamazaki (2025). Jupyter Notebook Activity Dataset (rsds-20241113) [Dataset]. http://doi.org/10.5281/zenodo.13357570
    Explore at:
    Available download formats: zip, application/gzip
    Dataset updated
    Jan 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tomoki Nakamaru; Tomoki Nakamaru; Tomomasa Matsunaga; Tetsuro Yamazaki; Tomomasa Matsunaga; Tetsuro Yamazaki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    List of data

    • rsds-20241113.zip: Collection of SQLite database files
    • image.tar.gz: Docker image provided in our data collection experiment
    • redspot-341ffa5.zip: Redspot source code (redspot@341ffa5)

    Extended version of Section 2D of our paper

    Redspot is a Jupyter extension (i.e., Python package) that records activity signals. However, it also offers interfaces to read recorded signals. The following shows the most basic usage of its command-line interface:
    redspot replay


    This command generates snapshots (.ipynb files) restored from the signal records. Note that this command does not produce a snapshot for every signal. Since the change represented by a single signal is typically minimal (e.g., one keystroke), generating a snapshot for each signal results in a meaninglessly large number of snapshots. However, we want to obtain signal-level snapshots for some analyses. In such cases, one can analyze them using the application programming interfaces:

    from redspot import database
    from redspot.notebook import Notebook

    nbk = Notebook()
    for signal in database.get("path-to-db"):
        time, panel, kind, args = signal
        nbk.apply(kind, args)  # apply change
        print(nbk)  # print notebook

    To record activities, one needs to run the Redspot command in the recording mode as follows:

    redspot record

    This command launches Jupyter Notebook with Redspot enabled. Activities made in the launched environment are stored in an SQLite file named "redspot.db" under the current path.

    To launch the environment we provided to the participants, one first needs to download and import the image (image.tar.gz). One can then run the image with the following command:

    docker run --rm -it -p8888:8888

    Note that the SQLite file is generated in the running container. The file can be downloaded into the host machine via the file viewer of Jupyter Notebook.

  7. project-gutenberg-temporal-corpus

    • huggingface.co
    Updated Sep 2, 2025
    Cite
    Texttechnologylab (2025). project-gutenberg-temporal-corpus [Dataset]. https://huggingface.co/datasets/Texttechnologylab/project-gutenberg-temporal-corpus
    Explore at:
    Dataset updated
    Sep 2, 2025
    Dataset authored and provided by
    Texttechnologylab
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Project Gutenberg Temporal Corpus

      Repository Updates
    

    02.09.2025: Fixed the unsafe issue in the retrieved contents files. Added the detailed Generes-Super_Generes mapping to the metadata files.

      Usage
    

    To use this dataset, we suggest cloning the repository and accessing the files directly. The dataset is organized into several zip files and CSV files, which can be easily extracted and read using standard data processing libraries in Python or other programming… See the full description on the dataset page: https://huggingface.co/datasets/Texttechnologylab/project-gutenberg-temporal-corpus.
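
    As a starting point, here is a minimal sketch for pulling a local copy of the repository from Python with the huggingface_hub library (any specific file names inside the repository are not assumed here):

    from huggingface_hub import snapshot_download

    # download a local copy of the dataset repository (repo_type="dataset" is required for dataset repos)
    local_dir = snapshot_download(
        repo_id="Texttechnologylab/project-gutenberg-temporal-corpus",
        repo_type="dataset",
    )
    print(local_dir)  # the zip and CSV files can then be extracted and read from this folder, e.g. with pandas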

  8. Python Programs - Claims-Based Frailty Index

    • search.dataone.org
    • dataverse.harvard.edu
    • +1more
    Updated Sep 25, 2024
    Cite
    Bedell, Douglas; Tambellini, Vincent (2024). Python Programs - Claims-Based Frailty Index [Dataset]. http://doi.org/10.7910/DVN/JSKVTB
    Explore at:
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Bedell, Douglas; Tambellini, Vincent
    Description

    This Python program calculates CFI for each patient from analytic data files containing information on patient identifiers, ICD-9-CM diagnosis codes (version 32), ICD-10-CM Diagnosis Codes (version 2020), CPT codes, and HCPCS codes. NOTE: When downloading, store "CFI_ICD9CM_V32.tab" and "CFI_ICD10CM_V2020.tab" as csv files (these files are originally stored as csv files, but Dataverse automatically converts them to tab files). Please read "Frailty-Index-PYTHON-code-Guide" before proceeding. Interpretation, validation data, and annotated references are provided in "Research Background - Claims-Based Frailty Index".
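
    A minimal sketch of that conversion step with pandas (this assumes the downloaded .tab files are tab-delimited, which is how Dataverse typically exports tabular files):

    import pandas as pd

    for name in ["CFI_ICD9CM_V32", "CFI_ICD10CM_V2020"]:
        # re-save the Dataverse .tab export as the csv file the CFI program expects
        df = pd.read_csv(f"{name}.tab", sep="\t")
        df.to_csv(f"{name}.csv", index=False)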

  9. Data from: Critical Search: A procedure for guided reading in large-scale...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jan 4, 2019
    Cite
    Jo Guldi (2019). Critical Search: A procedure for guided reading in large-scale textual corpora [Dataset]. http://doi.org/10.7910/DVN/BJNAPD
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 4, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Jo Guldi
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains full-scale visualizations as well as original data and code (in R and Python) to reproduce the figures and tables for "Critical Search." The data includes full-text data for the Hansard debates, and the code employs keyword search, topic modeling, and KL measurement.

  10. qa4mre

    • tensorflow.org
    Updated Dec 20, 2022
    Cite
    (2022). qa4mre [Dataset]. https://www.tensorflow.org/datasets/catalog/qa4mre
    Explore at:
    Dataset updated
    Dec 20, 2022
    Description

    The QA4MRE dataset was created for the CLEF 2011/2012/2013 shared tasks to promote research in question answering and reading comprehension. The dataset contains a supporting passage and a set of questions corresponding to the passage. Multiple answer options are provided for each question, of which only one is correct. The training and test datasets are available for the main track. Additional gold standard documents are available for two pilot studies: one on Alzheimer's data, and the other on entrance exams data.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('qa4mre', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  11. Yelp Dataset

    • kaggle.com
    zip
    Updated Mar 17, 2022
    + more versions
    Cite
    Yelp, Inc. (2022). Yelp Dataset [Dataset]. https://www.kaggle.com/yelp-dataset/yelp-dataset
    Explore at:
    Available download formats: zip (4374983563 bytes)
    Dataset updated
    Mar 17, 2022
    Dataset provided by
    Yelp (http://yelp.com/)
    Authors
    Yelp, Inc.
    Description

    Context

    This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the most recent dataset you'll find information about businesses across 8 metropolitan areas in the USA and Canada.

    Content

    This dataset contains five JSON files and the user agreement. More information about those files can be found here.

    Code snippet to read the files

    In Python, you can read the JSON files like this (using the json and pandas libraries):

    import json
    import pandas as pd
    data_file = open("yelp_academic_dataset_checkin.json")
    data = []
    for line in data_file:
     data.append(json.loads(line))
    checkin_df = pd.DataFrame(data)
    data_file.close()
    
    
  12. Data from: Data corresponding to the paper "Traveling Bubbles and Vortex...

    • zenodo.org
    • portalcientifico.uvigo.gal
    bin, csv +1
    Updated May 8, 2025
    Cite
    Humberto Michinel; Humberto Michinel (2025). Data corresponding to the paper "Traveling Bubbles and Vortex Pairs within Symmetric 2D Quantum Droplets" [Dataset]. http://doi.org/10.5281/zenodo.15362872
    Explore at:
    Available download formats: text/x-python, bin, csv
    Dataset updated
    May 8, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Humberto Michinel; Humberto Michinel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets generated for the Physical Review E article with title: "Traveling Bubbles and Vortex Pairs within Symmetric 2D Quantum Droplets" by Paredes, Guerra-Carmenate, Salgueiro, Tommasini and Michinel. In particular, we provide the data needed to generate the figures in the publication, which illustrate the numerical results found during this work.

    We also include python code in the file "plot_from_data_for_repository.py" that generates a version of the figures of the paper from .pt data sets. Data can be read and plots can be produced with a simple modification of the python code.

    Figure 1: Data are in fig1.csv

    The csv file has four columns separated by commas. The four columns correspond to values of r (first column) and the function psi(r) for the three cases depicted in the figure (columns 2-4).

    Figures 2 and 4: Data are in data_figs_2_and_4.pt

    This is a data file generated with the torch module of Python. It includes eight torch tensors: the spatial grid "x" and "y", and the complex values of psi for the six eigenstates depicted in figures 2 and 4 ("psia", "psib", "psic", "psid", "psie", "psif"). Notice that figure 2 shows the square of the modulus and figure 4 shows the argument; both are obtained from the same data sets.
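
    A minimal loading-and-plotting sketch for these .pt files, assuming the tensors are stored as a dictionary keyed by the names listed above (if torch.save was used on a different container, adjust the indexing accordingly):

    import torch
    import matplotlib.pyplot as plt

    data = torch.load("data_figs_2_and_4.pt")
    x, y, psia = data["x"], data["y"], data["psia"]

    # |psi|^2 (figure 2) and the argument (figure 4) come from the same complex field
    plt.pcolormesh(x.numpy(), y.numpy(), psia.abs().pow(2).numpy(), shading="auto")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()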

    Figure 3: Data are in fig3.csv

    The csv file has three columns separated by commas. The three columns correspond to values of momentum p (first column), energy E (second column) and velocity U (third column).

    Figure 5: Data are in fig5.csv

    The csv file has three columns separated by commas. The three columns correspond to values of momentum p (first column), the minimum value of |psi|^2 (second column) and the value of |psi|^2 at the center (third column).

    Figure 6: Data are in data_fig_6.pt

    This is a data file generated with the torch module of python. It includes six torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the four instants of time depicted in figure 6 ("psia", "psib", "psic", "psid").

    Figure 7: Data are in data_fig_7.pt

    This is a data file generated with the torch module of python. It includes six torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the four instants of time depicted in figure 7 ("psia", "psib", "psic", "psid").

    Figures 8 and 10: Data are in data_figs_8_and_10.pt

    This is a data file generated with the torch module of Python. It includes eight torch tensors: the spatial grid "x" and "y", and the complex values of psi for the six eigenstates depicted in figures 8 and 10 ("psia", "psib", "psic", "psid", "psie", "psif"). Notice that figure 8 shows the square of the modulus and figure 10 shows the argument; both are obtained from the same data sets.

    Figure 9: Data are in fig9.csv

    The csv file has two columns separated by commas. The two columns correspond to values of momentum p (first column) and energy (second column).

    Figure 11: Data are in data_fig_11.pt

    This is a data file generated with the torch module of python. It includes ten torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the two cases, four instants of time for each case, depicted in figure 11 ("psia", "psib", "psic", "psid", "psie", "psif", "psig", "psih").

    Figure 12: Data are in data_fig_12.pt

    This is a data file generated with the torch module of python. It includes eight torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the six instants of time depicted in figure 12 ("psia", "psib", "psic", "psid", "psie", "psif").

    Figure 13: Data are in data_fig_13.pt

    This is a data file generated with the torch module of python. It includes ten torch tensors for the spatial grid "x" and "y" and for the complex values of psi for the eight instants of time depicted in figure 13 ("psia", "psib", "psic", "psid", "psie", "psif", "psig", "psih").

  13. US Consumer Complaints Against Businesses

    • kaggle.com
    Updated Oct 9, 2022
    Cite
    Jeffery Mandrake (2022). US Consumer Complaints Against Businesses [Dataset]. https://www.kaggle.com/jefferymandrake/us-consumer-complaints-dataset-through-2019/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 9, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jeffery Mandrake
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    2,121,458 records

    I used Google Colab to check out this dataset and pull the column names using Pandas.

    Sample code example: Python Pandas read csv file compressed with gzip and load into Pandas dataframe https://pastexy.com/106/python-pandas-read-csv-file-compressed-with-gzip-and-load-into-pandas-dataframe
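
    For reference, reading a gzip-compressed csv into a dataframe looks like this (a minimal sketch; the file name complaints.csv.gz is illustrative, since the Kaggle download for this dataset is a zip archive, as shown further below):

    import pandas as pd

    # pandas reads gzip-compressed csv files directly; compression is also inferred from the .gz suffix
    df = pd.read_csv('complaints.csv.gz', compression='gzip')
    print(df.columns)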

    Columns: ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 'Complaint ID']

    I did not modify the dataset.

    Use it to practice with dataframes - Pandas or PySpark on Google Colab:

    !unzip complaints.csv.zip

    import pandas as pd

    df = pd.read_csv('complaints.csv')
    df.columns

    df.head() etc.

  14. Replication Data for: "A Computational Approach to Urban Space in Science...

    • dataverse.harvard.edu
    Updated Nov 25, 2020
    Cite
    Federica Bologna (2020). Replication Data for: "A Computational Approach to Urban Space in Science Fiction" [Dataset]. http://doi.org/10.7910/DVN/VXEK7A
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 25, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Federica Bologna
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data and code in Python for reproducing the results in "A Computational Approach to Urban Space in Science Fiction". (For more information see: https://github.com/federicabologna/thesis_space_scifi)

  15. Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...

    • researchdata.tuwien.at
    • b2find.eudat.eu
    html, pdf, zip
    Updated Mar 19, 2025
    Cite
    Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92
    Explore at:
    Available download formats: html, zip, pdf
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    TU Wien
    Authors
    Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    How To Cite?

    Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599

    Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599

    Folder Structure

    The folder named “submission” contains the following:

    1. “pythonProject”: This folder contains all the Python files and subfolders needed for analysis.
    2. ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

    Setting Up the Environment

    1. Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.
    2. The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

    Subfolders

    1. Data_4_IJGIS

    • This folder contains the data used for the results reported in the paper.
    • Note: The data analysis that we explain in this paper already begins with the synchronization and cleaning of the recorded raw data. The published data is already synchronized and cleaned. Both the cleaned files and the merged files with features extracted for them are given in this directory. If you want to perform the segmentation and feature extraction yourself, you should run the respective Python files yourself. If not, you can use the “merged_…csv” files as input for the training.

    2. results_[DateTime] (e.g., results_20240906_15_00_13)

    • This folder will be generated when you run the code and will store the output of each step.
    • The current folder contains results created during code debugging for the submission.
    • When you run the code, a new folder with fresh results will be generated.

    Python Files

    1. helper_functions.py

    • Contains reusable functions used throughout the analysis.
    • Each function includes a description of its purpose and the input parameters required.

    2. create_sanity_plots.py

    • Generates scatter plots like those in Figure 3 of the paper.
    • Although the code has been run for all 309 trials, it can be used to check the sample data provided.
    • Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.
    • Usage: Run this file to create visualizations similar to Figure 3.

    3. overlapping_sliding_window_loop.py

    • Implements overlapping sliding window segmentation and generates plots like those in Figure 4.
    • Output:
      • Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.
      • Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.
      • A visualization of the segments, similar to Figure 4, will be automatically generated.

    4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

    • These files compute features as explained in Tables 1 and 2 of the paper, respectively.
    • They process the segmented recordings generated by the overlapping_sliding_window_loop.py.
    • Usage: To see how the features are calculated, run these files after the sliding-window segmentation to compute the features from the segmented data.

    5. training_prediction.py

    • This file contains the main machine learning analysis of the paper: all the code for training the model, evaluating it, and using it for inference on the “monitoring part”. It covers the following steps:
    a. Data Preparation (corresponding to Section 5.1.1 of the paper)
    • Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.
    • A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can comment out this line.
    b. Training/Validation/Test Split
    • Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).
    • Make sure that you follow the instructions in the comments to the code exactly.
    • Output: The split data is saved as .csv files in the results folder.
    c. Machine and Deep Learning Experiments

    This part contains three main code blocks:

    • MLP Network (Commented Out): This code was used for classification with the MLP network, and the results shown in Table 3 are from this code. If you wish to use this model, please comment out the following blocks accordingly.
    • XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.
    • XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6.

    Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.

    d. Inference (Monitoring Part)
    • Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.
    • Figure 8 in the paper is generated using this part of the code.

    6. sequence_analysis.py

    • Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.
    • This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.

    Licenses

    The data is licensed under CC-BY, the code is licensed under MIT.

  16. GREMLIN CONUS3 Dataset for 2022

    • dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jul 21, 2025
    + more versions
    Cite
    Kyle Hilburn (2025). GREMLIN CONUS3 Dataset for 2022 [Dataset]. http://doi.org/10.5061/dryad.2jm63xstt
    Explore at:
    Dataset updated
    Jul 21, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Kyle Hilburn
    Time period covered
    Apr 4, 2023
    Description

    Geostationary Operational Environmental Satellite (GOES) Radar Estimation via Machine Learning to Inform NWP (GREMLIN) is a machine learning model that produces composite radar reflectivity using data from the Advanced Baseline Imager (ABI) and Geostationary Lightning Mapper (GLM). GREMLIN is useful for observing severe weather and providing information during convective initialization, especially over regions without ground-based radars. Previous research found good skill compared to ground-based radar products; however, the analysis was done over a dataset with similar climatic and precipitation characteristics as the training dataset: warm season Eastern CONUS in 2019. This study expands the analysis to the entire contiguous United States, during all seasons, and covering the period 2020-2022. Several validation metrics including root-mean-square difference (RMSD), probability of detection (POD), and false alarm ratio (FAR) are plotted over CONUS by season, day-of-year, and time-of-da...

    The methodology is described in detail by Hilburn et al. (2021). The ABI, GLM, and MRMS data sets were resampled to a common 3 km grid. A cloud height of 10 km was used for removing parallax displacements. Satellite and radar samples were matched in time with a maximum time difference of 2.5 minutes. GLM lightning groups were accumulated over 15-minute time periods.

    Code for reading the data: The data can be read using the provided Python code ("read_conus3_file.py" and "test_read.py") or the provided Fortran code ("gzmodule.f90", "test_read.f90", "compile.sh"). Note that this code reads the data without having to unzip the files. The "test_read.py" and "test_read.f90" use one sample file (ABI_C13_202001010000.bin.gz) to verify the code is reading correctly by checking against seven points within the image and against the minimum and maximum over the full image. This file has been included in the software.tar package. To use the Python code, at the command line simply invoke:

    python test_read.py

    The function that actually reads a data file, in "read_conus3_file.py", makes use of the gzip module, which is part of the Python Standard Library. If your code is reading correctly, you should see this output:

    testfile= ABI_C13_202001010000.bin.gz
    min data, expected, isclose= -1e+30 -1e+30 True
    max data, expected, isclose= 296.77463 296.77463 True
    data[  0,...
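
    For orientation, here is a rough stand-alone sketch of what reading one gzipped binary tile with the standard-library gzip module can look like. The dtype (32-bit float) and the flat, unshaped array are assumptions made for this sketch only; the provided read_conus3_file.py is the authoritative reader and also restores the grid shape:

    import gzip
    import numpy as np

    with gzip.open("ABI_C13_202001010000.bin.gz", "rb") as f:
        # decompress in memory and interpret the bytes as float32 values (assumed dtype)
        data = np.frombuffer(f.read(), dtype=np.float32)

    # compare against the expected minimum and maximum quoted above
    print(data.min(), data.max())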

  17. squad_question_generation

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). squad_question_generation [Dataset]. http://doi.org/10.18653/v1/P17-1123
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    Question generation using squad dataset using data splits described in 'Neural Question Generation from Text: A Preliminary Study' (Zhou et al, 2017) and 'Learning to Ask: Neural Question Generation for Reading Comprehension' (Du et al, 2017).

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('squad_question_generation', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  18. Data from: Learning Properties of Ordered and Disordered Materials from...

    • figshare.com
    application/gzip
    Updated Oct 1, 2020
    Cite
    Chi Chen (2020). Learning Properties of Ordered and Disordered Materials from Multi-fidelity Data [Dataset]. http://doi.org/10.6084/m9.figshare.13040330.v1
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Oct 1, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Chi Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains two datasets for our recent work "Learning Properties of Ordered and Disordered Materials from Multi-fidelity Data". The first data set is multi-fidelity band gap data for crystals, and the second data set is the molecular energy data set for molecules.

    1. Multi-fidelity band gap data for crystals

    The full band gap data used in the paper is located at band_gap_no_structs.gz. Users can use the following code to extract it:

    import gzip
    import json

    with gzip.open("band_gap_no_structs.gz", "rb") as f:
        data = json.loads(f.read())

    data is a dictionary with the following format:

    {"pbe": {mp_id: PBE band gap, mp_id: PBE band gap, ...},
     "hse": {mp_id: HSE band gap, mp_id: HSE band gap, ...},
     "gllb-sc": {mp_id: GLLB-SC band gap, mp_id: GLLB-SC band gap, ...},
     "scan": {mp_id: SCAN band gap, mp_id: SCAN band gap, ...},
     "ordered_exp": {icsd_id: Exp band gap, icsd_id: Exp band gap, ...},
     "disordered_exp": {icsd_id: Exp band gap, icsd_id: Exp band gap, ...}}

    where mp_id is the Materials Project materials ID for the material, and icsd_id is the ICSD materials ID. For example, the PBE band gap of NaCl (mp-22862, band gap 5.003 eV) can be accessed by data['pbe']['mp-22862']. Note that the Materials Project database is evolving with time: certain IDs may be removed in the latest release, and band gap values may change for the same material. To get the structure that corresponds to a specific material ID in Materials Project, users can use the pymatgen REST API:

    1.1. Register at Materials Project https://www.materialsproject.org and get an API key.

    1.2. In Python, do the following to get the corresponding computational structure:

    from pymatgen import MPRester

    mpr = MPRester(#Your API Key)
    structure = mpr.get_structure_by_material_id(#mp_id)

    A dump of all the material IDs and structures for the 2019.04.01 MP version is provided here: https://ndownloader.figshare.com/files/15108200. Users can download the file and extract the material_id and structure from this file for all materials. The structure in this case is a cif file. Users can again use pymatgen to read the cif string and get the structure:

    from pymatgen.core import Structure

    structure = Structure.from_str(#cif_string, fmt='cif')

    For the ICSD structures, users are required to have commercial ICSD access. Hence those structures are not provided here.

    2. Multi-fidelity molecular energy data

    The molecule_data.zip contains two datasets in json format.

    2.1. G4MP2.json contains two-fidelity calculation results on QM9 molecules, G4MP2 (6095) and B3LYP (130831):

    {"G4MP2": {"U0": {ID: G4MP2 energy (eV), ...},
          "molecules": {ID: Pymatgen molecule dict, ...}},
     "B3LYP": {"U0": {ID: B3LYP energy (eV), ...},
          "molecules": {ID: Pymatgen molecule dict, ...}}}

    2.2. qm7b.json contains the molecule energy calculation results for 7211 molecules using HF, MP2 and CCSD(T) methods with 6-31g, sto-3g and cc-pvdz bases:

    {"molecules": {ID: Pymatgen molecule dict, ...},
     "targets": {ID: {"HF": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)},
              "MP2": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)},
              "CCSD(T)": {"sto3g": Atomization energy (kcal/mol), "631g": Atomization energy (kcal/mol), "cc-pvdz": Atomization energy (kcal/mol)}, ...}}}

  19. GREMLIN CONUS3 Dataset for 2021

    • dataone.org
    • search.dataone.org
    • +2more
    Updated Jul 22, 2025
    Cite
    Kyle Hilburn (2025). GREMLIN CONUS3 Dataset for 2021 [Dataset]. http://doi.org/10.5061/dryad.zs7h44jf2
    Explore at:
    Dataset updated
    Jul 22, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Kyle Hilburn
    Time period covered
    Apr 4, 2023
    Description

    Geostationary Operational Environmental Satellite (GOES) Radar Estimation via Machine Learning to Inform NWP (GREMLIN) is a machine learning model that produces composite radar reflectivity using data from the Advanced Baseline Imager (ABI) and Geostationary Lightning Mapper (GLM). GREMLIN is useful for observing severe weather and providing information during convective initialization, especially over regions without ground-based radars. Previous research found good skill compared to ground-based radar products; however, the analysis was done over a dataset with similar climatic and precipitation characteristics as the training dataset: warm season Eastern CONUS in 2019. This study expands the analysis to the entire contiguous United States, during all seasons, and covering the period 2020-2022. Several validation metrics including root-mean-square difference (RMSD), probability of detection (POD), and false alarm ratio (FAR) are plotted over CONUS by season, day-of-year, and time-of-day...

    The methodology is described in detail by Hilburn et al. (2021). The ABI, GLM, and MRMS data sets were resampled to a common 3 km grid. A cloud height of 10 km was used for removing parallax displacements. Satellite and radar samples were matched in time with a maximum time difference of 2.5 minutes. GLM lightning groups were accumulated over 15-minute time periods.

    Code for reading the data: The data can be read using the provided Python code ("read_conus3_file.py" and "test_read.py") or the provided Fortran code ("gzmodule.f90", "test_read.f90", "compile.sh"). Note that this code reads the data without having to unzip the files. The "test_read.py" and "test_read.f90" use one sample file (ABI_C13_202001010000.bin.gz) to verify the code is reading correctly by checking against seven points within the image and against the minimum and maximum over the full image. This file has been included in the software.tar package. To use the Python code, at the command line simply invoke:

    python test_read.py

    The function that actually reads a data file, in "read_conus3_file.py", makes use of the gzip module, which is part of the Python Standard Library. If your code is reading correctly, you should see this output:

    testfile= ABI_C13_202001010000.bin.gz
    min data, expected, isclose= -1e+30 -1e+30 True
    max data, expected, isclose= 296.77463 296.77463 True
    data[  0,  ...

  20. Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    Updated Jul 11, 2024
    Cite
    Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Roig, Gemma
    Choksi, Bhavin
    Schaumlöffel, Timothy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow libraries. The split into train, validation and test set follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd

    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    Example output (first row):

    dataset            AudioSet
    filename           train/---2_BBVHAA.mp3
    captions_visual    [a man in a black hat and glasses.]
    captions_auditory  [a man speaks and dishes clank.]
    tags               [Speech]

    Description

    The annotation file consists of the following fields:

    • filename: Name of the corresponding file (video or audio file)
    • dataset: Source dataset associated with the data point
    • captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    • captions_auditory: A list of captions related to the auditory content of the video
    • tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
