This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here.
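For illustration only, the hedged sketch below shows one way such an API could be queried from Python, assuming the hosting portal is a CKAN instance (as on data.ca.gov) and using a placeholder resource ID; the real resource ID for a given year can be copied from that year's resource page.

# Hedged sketch, not official documentation: CKAN datastore query with a
# placeholder resource ID and a small row limit.
import requests

CKAN_DATASTORE = "https://data.ca.gov/api/3/action/datastore_search"
params = {"resource_id": "<resource-id-for-a-given-year>", "limit": 5}
records = requests.get(CKAN_DATASTORE, params=params).json()["result"]["records"]
for rec in records:
    print(rec.get("DataQuality"), rec.get("DataQualityIndicator"))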
Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
https://academictorrents.com/nolicensespecified
Trip record data from the New York City Taxi and Limousine Commission (TLC) from January 2009 through December 2016 was consolidated and brought into a consistent Parquet format by Ravi Shekhar.
This dataset is based on the Criteo subset dataset and the Criteo Display Advertising Challenge. The notebook Preprocess Criteo to Parquet converts the .txt files to .parquet files. Parquet is a column-oriented, compressed data format that requires less storage; in short, data can be read from a Parquet file faster.
This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.
Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
This dataset gathers data in `.parquet` format. Instead of having one `.csv.gz` file per department per period, all departments are grouped into a single file per period. When possible (depending on the size), several periods are grouped in the same file.

### Data origin
The data come from:
- Basic climatological data - monthly
- Basic climatological data - daily
- Basic climatological data - hourly
- Basic climatological data - 6 minutes

### Data preparation
The files ending in `.prepared` have undergone light preparation steps:
- removal of spaces in column names
- (flexible) typing

The data are typed as follows:
- date columns (`YYYYMM`, `YYYYMMDD`, `YYYYMMDDHH`, `YYYYMMDDHHMN`): integer
- `NUM_POSTE`: string
- `USUAL_NAME`: string
- `LAT`: float
- `LON`: float
- `ALTI`: integer
- columns beginning with `Q` ("quality") or `NB` ("number"): integer

### Update
The data are updated at least once a week (depending on my availability) for the "latest-2023-2024" period. If you have specific needs, feel free to reach out to me.

### Re-use: Meteo Squad
These files are used in the Meteo Squad web application: https://www.meteosquad.com

### Contact
If you have specific requests, please do not hesitate to contact me: contact@mistermeteo.com

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on Supercomputer Fugaku over the three years of public system usage (March 2021-April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption and performance metrics (e.g. #flops, memory bandwidth, operational intensity and memory/compute-bound label), which allows for the prediction of a multitude of job characteristics. The full list of features can be found in the file feature_list.csv.
The sensitive data appears in both anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes, without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.
F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.
The files of F-DATA are saved as .parquet files. It is possible to load such files as dataframes by leveraging the pandas APIs, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:
import pandas as pd
df = pd.read_parquet("21_01.parquet")
df.head()
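To work with more than one month at a time, a minimal sketch (not taken from the F-DATA repository) could concatenate several monthly files, relying on the YY_MM.parquet naming described above:

# Minimal sketch: load several monthly files at once and concatenate them.
import glob
import pandas as pd

files = sorted(glob.glob("2?_??.parquet"))  # e.g. 21_01.parquet ... 24_04.parquet
jobs = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
print(f"{len(jobs)} jobs loaded from {len(files)} files")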
https://cdla.io/sharing-1-0/
These data files are outputs from my public notebook at the link below: https://www.kaggle.com/code/ravi20076/optiver-memoryreduction. This is my first tryst with the Optiver stock prediction challenge. I curate these datasets to ensure memory reduction by assigning a suitable datatype to the relevant columns based on their min-max values. I curate 2 versions of the data: one with only integer columns compressed, and another with both integer and float columns compressed. I remove the row id column in both versions and save the results as a parquet file to facilitate ease of usage.
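For illustration, the sketch below shows the general min-max downcasting idea (not the exact notebook code); the "row_id" column name and file names are assumptions.

# Minimal sketch of min-max based downcasting: shrink integer and (optionally)
# float columns to the smallest dtype that can hold their value range.
import numpy as np
import pandas as pd

def reduce_memory(df: pd.DataFrame, compress_floats: bool = True) -> pd.DataFrame:
    for col in df.columns:
        col_type = df[col].dtype
        if np.issubdtype(col_type, np.integer):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif compress_floats and np.issubdtype(col_type, np.floating):
            df[col] = pd.to_numeric(df[col], downcast="float")
    return df

# Hypothetical usage:
# df = pd.read_csv("train.csv").drop(columns=["row_id"])
# reduce_memory(df).to_parquet("train_compressed.parquet")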
Column descriptions are provided on the competition data page, linked below: https://www.kaggle.com/competitions/optiver-trading-at-the-close/data
A very good introduction kernel is also provided by the host: https://www.kaggle.com/code/tomforbes/optiver-trading-at-the-close-introduction
Image source - https://www.investopedia.com/stock-trading-4689660
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Pashto Synthetic Speech Dataset Parquet (20k)
This dataset contains 40000 synthetic speech recordings in the Pashto language, with 20000 male voice recordings and 20000 female voice recordings, stored in Parquet format.
Dataset Information
Dataset Size: 20000 sentences
Total Recordings: 40000 audio files (20000 male + 20000 female)
Audio Format: WAV, 24kHz, 16-bit PCM, embedded directly in Parquet files
Dataset Format: Parquet with 500MB shards
Sampling Rate: 24kHz
… See the full description on the dataset page: https://huggingface.co/datasets/ihanif/pashto_speech_20k.
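A minimal sketch for loading the Parquet shards through the Hugging Face datasets library is shown below; the split name "train" is an assumption, and the actual split names are documented on the dataset page.

# Minimal sketch: stream the Parquet shards via the `datasets` library.
from datasets import load_dataset

ds = load_dataset("ihanif/pashto_speech_20k", split="train")
print(ds[0])  # one record with its sentence and the embedded 24 kHz audio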
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the two semantically enriched trajectory datasets introduced in the CIKM Resource Paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions", by Chiara Pugliese (CNR-IIT), Francesco Lettich (CNR-ISTI), Guido Rocchietti (CNR-ISTI), Chiara Renso (CNR-ISTI), and Fabio Pinelli (IMT Lucca, CNR-ISTI).
The two datasets were generated with an open source pipeline based on the Jupyter notebooks published in the GitHub repository behind our resource paper, and our MAT-Builder system. Overall, our pipeline first generates the files that we provide in the [paris|nyc]_input_matbuilder.zip archives; the files are then passed as input to the MAT-Builder system, which ultimately generates the two semantically enriched trajectory datasets for Paris and New York City, both in tabular and RDF formats. For more details on the input and output data, please see the sections below.
The [paris|nyc]_input_matbuilder.zip archives contain the data sources we used with the MAT-Builder system to semantically enrich raw preprocessed trajectories. More specifically, the archives contain the following files:
The [paris|nyc]_output_tabular.zip zip archives contain the output files generated by MAT-Builder that express the semantically enriched Paris and New York City datasets in tabular format. More specifically, they contain the following files:
There is then a second set of columns which represents the characteristics of the POI that has been associated with a stop. The relevant ones are:
Data collected for marine benthic infauna, freshwater benthic macroinvertebrate (BMI), algae, bacteria and diatom taxonomic analyses, from the California Environmental Data Exchange Network (CEDEN). Note that single-species bacteria concentrations are stored within the chemistry template, whereas bacteria abundance data are stored within this data set. Each record represents a result from a specific event location for a single organism in a single sample.
The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Zip files are provided for bulk data downloads (in csv or parquet file format), and developers can use the API associated with the "CEDEN Benthic Data" (csv) resource to access the data.
Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.
Dataset description
Parquet file, with:
35694 rows
154 columns
The file is indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.
Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.
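As a quick illustration of the indexing scheme described above, the following hedged sketch (with an assumed file name) loads the feature matrix and splits the index back into participant and month.

# Minimal sketch; "discover_features.parquet" is a hypothetical file name.
import pandas as pd

df = pd.read_parquet("discover_features.parquet")
parts = df.index.to_series().str.split("_", expand=True)
df["participant_id"] = parts[0].astype(int)  # randomly generated participant integer
df["month"] = parts[1].astype(int)           # month number within the study
print(df.shape)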
The file contains an aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). This matrix focuses on individual changes in depression status over time, as measured by PHQ-9.
The DiSCover Project is a 1-year long longitudinal study consisting of 10,036 individuals in the United States, who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes, between January 2018 and January 2020.
The data subset used in this work comprises the following:
Wearable PGHD: step and sleep data from the participants’ consumer-grade wearable devices (Fitbit) worn throughout the study
Screener survey: prior to the study, participants self-reported socio-demographic information, as well as comorbidities
Lifestyle and medication changes (LMC) survey: every month, participants were requested to complete a brief survey reporting changes in their lifestyle and medication over the past month
Patient Health Questionnaire (PHQ-9) score: every 3 months, participants were requested to complete the PHQ-9, a 9-item questionnaire that has proven to be reliable and valid to measure depression severity
From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).
The dataset contains a total of 35,694 rows, one for each month of data collection from the participants. We can generate 3-month long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation ‘SM0’ (sample month 0), ‘SM1’, ‘SM2’ and ‘SM3’ to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1, SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Data (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported - one for each data tile. The deepest point values will be extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

Potential methods for addressing limitations of this dataset:
- Identifying and removing unrealistic temperature estimates:
  - Calculate the total percentage of cloud pixels over a given waterbody as: percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
  - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
  - Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles:
  - These waterbodies can be identified using the "site_id_tile_hv_crosswalk.csv" file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

The data release includes the following files:
- "year_byscene=XXXX.zip" – includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by the scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the _byscene datasets may only include one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data is extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" – includes the summary statistics for individual waterbodies and the deepest points within each waterbody by the year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables that aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" – This script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" – This cross-walk file identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" – This cross-walk file identifies the site_id (nhdhr_{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" – a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
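Because the .parquet files sit in nested Hive-style directories (e.g. year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet), they can also be read from Python as an alternative to the provided R script; the hedged sketch below assumes the "year=2023.zip" archive has been extracted into a local "year=2023" directory.

# Minimal sketch: read the Hive-partitioned .parquet files with pyarrow and
# keep only the annual lake summaries of one tile.
import pyarrow.dataset as ds

lakes = ds.dataset("year=2023", format="parquet", partitioning="hive")
table = lakes.to_table(
    filter=(ds.field("dataset") == "lakes_annual") & (ds.field("tile_hv") == "002-001")
)
print(table.to_pandas().head())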
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset overview
This dataset provides data and images of snowflakes in free fall collected with a Multi-Angle Snowflake Camera (MASC). The dataset includes, for each recorded snowflake:
A triplet of gray-scale images corresponding to the three cameras of the MASC
A large set of geometrical and textural descriptors, the pre-compiled output of published retrieval algorithms, as well as basic environmental information at the location and time of each measurement.
The pre-computed descriptors and retrievals are available either individually for each camera view or, for some of them, as descriptors of the triplet as a whole. A non-exhaustive list of pre-computed quantities includes, for example:
Textural and geometrical descriptors as in Praz et al 2017
Hydrometeor classification, riming degree estimation, melting identification, as in Praz et al 2017
Blowing snow identification, as in Schaer et al 2020
Mass, volume, gyration estimation, as in Leinonen et al 2021
Data format and structure
The dataset is divided into four .parquet files (for scalar descriptors) and a Zarr database (for the images). A detailed description of the data content and of the data records is available here.
Supporting code
A python-based API is available to manipulate, display and organize the data of our dataset. It can be found on GitHub. See also the code documentation on ReadTheDocs.
Download notes
All files available here for download should be stored in the same folder, if the python-based API is used
MASCdb.zarr.zip must be unzipped after download
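As a complement to the python-based API, a minimal sketch for opening the files directly is given below; the descriptor file name is hypothetical (the real names are listed in the data record description).

# Minimal sketch: read one descriptor .parquet with pandas and open the
# unzipped Zarr database of image triplets.
import pandas as pd
import zarr

cam0 = pd.read_parquet("MASCdb_cam0.parquet")  # hypothetical descriptor file name
print(cam0.columns)

images = zarr.open("MASCdb.zarr", mode="r")    # gray-scale image triplets
print(list(images.keys()))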
Field campaigns
A list of campaigns included in the dataset, with a minimal description, is given in the following table.
| Campaign_name | Information | Shielded / Not shielded |
| --- | --- | --- |
| APRES3-2016 & APRES3-2017 | Instrument installed in Antarctica in the context of the APRES3 project. See for example Genthon et al, 2018 or Grazioli et al 2017 | Not shielded |
| Davos-2015 | Instrument installed in the Swiss Alps within the context of SPICE (Solid Precipitation InterComparison Experiment) | Shielded (DFIR) |
| Davos-2019 | Instrument installed in the Swiss Alps within the context of RACLETS (Role of Aerosols and CLouds Enhanced by Topography on Snow) | Not shielded |
| ICEGENESIS-2021 | Instrument installed in the Swiss Jura in a MeteoSwiss ground measurement site, within the context of ICE-GENESIS. See for example Billault-Roux et al, 2023 | Not shielded |
| ICEPOP-2018 | Instrument installed in Korea, in the context of ICEPOP. See for example Gehring et al 2021 | Shielded (DFIR) |
| Jura-2019 & Jura-2023 | Instrument installed in the Swiss Jura within a MeteoSwiss measurement site | Not shielded |
| Norway-2016 | Instrument installed in Norway during the High-Latitude Measurement of Snowfall (HiLaMS) campaign. See for example Cooper et al, 2022 | Not shielded |
| PLATO-2019 | Instrument installed in the "Davis" Antarctic base during the PLATO field campaign | Not shielded |
| POPE-2020 | Instrument installed in the "Princess Elizabeth Antarctica" base during the POPE campaign. See for example Ferrone et al, 2023 | Not shielded |
| Remoray-2022 | Instrument installed in the French Jura | Not shielded |
| Valais-2016 | Instrument installed in the Swiss Alps in a ski resort | Not shielded |

DFIR = Double Fence Intercomparison Reference
Version
1.0 - Two new campaigns ("Jura-2023", "Norway-2016") added. Added references and list of campaigns.
0.3 - a new campaign is added to the dataset ("Remoray-2022")
0.2 - rename of variables. Variable precision (digits) standardized
0.1 - first upload
https://creativecommons.org/publicdomain/zero/1.0/
This dataset brings to you Iris Dataset in several data formats (see more details in the next sections).
You can use it to test the ingestion of data in all these formats using Python or R libraries. We also prepared a Python Jupyter Notebook and an R Markdown report that read all these formats:
Iris Dataset was created by R. A. Fisher and donated by Michael Marshall.
Repository on UCI site: https://archive.ics.uci.edu/ml/datasets/iris
Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/
The downloaded file is iris.data and is formatted as a comma-delimited file.
This small data collection was created to help you test your skills with ingesting various data formats.
This file was processed to convert the data into the following formats (a short reading sketch follows the list):
* csv - comma separated values format
* tsv - tab separated values format
* parquet - parquet format
* feather - feather format
* parquet.gzip - compressed parquet format
* h5 - hdf5 format
* pickle - Python binary object file - pickle format
* xlsx - Excel format
* npy - Numpy (Python library) binary format
* npz - Numpy (Python library) binary compressed format
* rds - Rds (R specific data format) binary format
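As referenced above, here is a short hedged sketch for reading a few of these formats back with pandas; the file names (iris.csv, iris.parquet, iris.feather) are assumptions and may differ slightly from the names in this dataset.

# Minimal sketch: load the same table from three formats and compare shapes.
import pandas as pd

iris_csv = pd.read_csv("iris.csv")
iris_parquet = pd.read_parquet("iris.parquet")   # needs pyarrow or fastparquet
iris_feather = pd.read_feather("iris.feather")   # needs pyarrow

assert iris_csv.shape == iris_parquet.shape == iris_feather.shape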
I would like to acknowledge the work of the creator of the dataset - R. A. Fisher and of the donor - Michael Marshall.
Use these data formats to test your skills in ingesting data in various formats.
Train data of the Riiid competition is a large dataset of over 100 million rows and 10 columns that does not fit into a Kaggle Notebook's RAM using the default pandas read_csv, resulting in a search for alternative approaches and formats.
Train data of the Riiid competition in different formats.
Reading the .csv file for the Riiid competition took a huge amount of time and memory. This inspired me to convert the .csv into different file formats so that they can be loaded easily into a Kaggle kernel.
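A minimal sketch of this conversion idea, with hypothetical file names, is shown below.

# Minimal sketch: read the competition CSV once, then save it in formats
# that load much faster in a notebook.
import pandas as pd

train = pd.read_csv("train.csv", low_memory=False)
train.to_parquet("train.parquet")   # needs pyarrow or fastparquet
train.to_feather("train.feather")   # needs pyarrow
train.to_pickle("train.pkl")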
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ENTSO-E Pan-European Climatic Database (PECD 2021.3) in Parquet format
TL;DR: this is a tidy and friendly version of a subset of the PECD 2021.3 data by ENTSO-E: hourly capacity factors for wind onshore, offshore, solar PV, hourly electricity demand, weekly inflow for reservoir and pumping and daily generation for run-of-river. All the data is provided for >30 climatic years (1982-2019 for wind and solar, 1982-2016 for demand, 1982-2017 for hydropower) and at national and sub-national (>140 zones) level.
UPDATE (19/10/2022): updated the demand files after fixing a bug in the processing code (the file for 2030 was the same as the one for 2025) and solving an issue caused by a malformed header in the ENTSO-E Excel files.
Together with the latest European Resource Adequacy Assessment (ERAA 2021), ENTSO-E has released all the inputs used in the study. Those inputs include:
- Demand dataset: https://eepublicdownloads.azureedge.net/clean-documents/sdc-documents/ERAA/Demand%20Dataset.7z
- Climate data: https://eepublicdownloads.entsoe.eu/clean-documents/sdc-documents/ERAA/Climate%20Data.7z
The data files and the methodology are available on the official webpage.
As done for the previous releases (see https://zenodo.org/record/3702418#.YbmhR23MKMo and https://zenodo.org/record/3985078#.Ybmhem3MKMo), the original data - stored in large Excel spreadsheets - have been tidied and formatted in open and friendly formats (CSV for the small tables and Parquet for the large files).
Furthermore, we have carried out a simple country-level aggregation of the original data, which instead uses >140 zones.
DISCLAIMER: the content of this dataset has been created with the greatest possible care. However, we invite users to rely on the original data for critical applications and studies.
Description
This dataset includes the following files:
Note
I would like to thank Laurens Stoop for sharing the onshore wind data for the 2030 scenario, which was corrupted in the original archive.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Fuτure dataset is intended for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons. The dataset is generated with Pythia 8, with the full detector simulation performed by Geant4 using the CLIC-like detector setup CLICdet (CLIC_o3_v14). Events are reconstructed using the Marlin reconstruction framework and interfaced with Key4HEP. Particle candidates in the reconstructed events are reconstructed using the PandoraPF algorithm.
In this version of the dataset no γγ -> hadrons background is included.
This dataset contains e+e- samples with Z->ττ, ZH (H->ττ) and Z->qq events, with approximately 2 million events simulated in each category.
The following e+e- processes were simulated with Pythia 8 at sqrt(s) = 380 GeV:
The .root files from the MC simulation chain are eventually processed by the software found on GitHub in order to create flat ntuples as the final product.
The basis of the ntuples are the particle flow (PF) candidates from PandoraPF. Each PF candidate has a four-momentum, charge and particle label (electron / muon / photon / charged hadron / neutral hadron). The PF candidates in a given event are clustered into jets using the generalized kt algorithm for ee collisions, with parameters p=-1 and R=0.4. The minimum pT is set to 0 GeV for both generator-level jets and reconstructed jets. The dataset contains the four-momenta of the jets, along with the PF candidates in the jets with the above listed properties.
Additionally, a set of variables describing the tau lifetime are calculated using the software on GitHub. As the tau lifetime is very short, these variables are sensitive to true tau decays. In the calculation of these lifetime variables, we use a linear approximation.
In summary, the features found in the flat ntuples are:
| Name | Description |
| --- | --- |
| reco_cand_p4s | 4-momenta per particle in the reco jet. |
| reco_cand_charge | Charge per particle in the jet. |
| reco_cand_pdg | PDGid per particle in the jet. |
| reco_jet_p4s | RecoJet 4-momenta. |
| reco_cand_dz | Longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
| reco_cand_dz_err | Uncertainty of the longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
| reco_cand_dxy | Transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
| reco_cand_dxy_err | Uncertainty of the transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
| gen_jet_p4s | GenJet 4-momenta. Matched with RecoJet within a cone of radius dR < 0.3. |
| gen_jet_tau_decaymode | Decay mode of the associated genTau. Jets that have associated leptonically decaying taus are removed, so there are no DM=16 jets. If no GenTau can be matched to GenJet within dR < 0.4, a fill value is used. |
| gen_jet_tau_p4s | Visible 4-momenta of the genTau. If no GenTau can be matched to GenJet within dR < 0.4, a fill value is used. |
The ground truth is based on stable particles at the generator level, before detector simulation. These particles are clustered into generator-level jets and are matched to generator-level τ leptons as well as reconstructed jets. In order for a generator-level jet to be matched to a generator-level τ lepton, the τ lepton needs to be inside a cone of dR = 0.4. The same applies for the reconstructed jet, with the requirement on dR being set to dR = 0.3. For each reconstructed jet, we define three target values related to τ lepton reconstruction:
| File | # Jets | Size |
| --- | --- | --- |
| z_test.parquet | 870 843 | 171 MB |
| z_train.parquet | 3 483 369 | 681 MB |
| zh_test.parquet | 1 068 606 | 213 MB |
| zh_train.parquet | 4 274 423 | 851 MB |
| qq_test.parquet | 6 366 715 | 1.4 GB |
| qq_train.parquet | 25 466 858 | 5.6 GB |
The dataset consists of 6 files of 8.9 GB in total.
The .parquet files can be directly loaded with the Awkward Array Python library.
An example of how one might use the dataset and the features is given in data_intro.ipynb.
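For a quick start, the hedged sketch below loads one of the files with Awkward Array and inspects a few of the fields listed above (requires awkward >= 2 and pyarrow).

# Minimal sketch: load one ntuple file and count PF candidates per jet.
import awkward as ak

jets = ak.from_parquet("z_test.parquet")
print(jets.fields)                 # reco_cand_p4s, reco_cand_charge, ...
print(ak.num(jets.reco_cand_pdg))  # number of PF candidates in each jet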
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LAION-400M: the world's largest openly available image-text-pair dataset, with 400 million samples.

# Concept and Content
The LAION-400M dataset is completely openly and freely accessible. All images and texts in the LAION-400M dataset have been filtered with OpenAI's CLIP by calculating the cosine similarity between the text and image embeddings and dropping those with a similarity below 0.3. The threshold of 0.3 had been determined through human evaluations and seems to be a good heuristic for estimating semantic image-text-content matching. The image-text pairs have been extracted from the Common Crawl web data dump and are from random web pages crawled between 2014 and 2021.

# Download Information
You can find:
- The CLIP image embeddings (NumPy files)
- The parquet files
- KNN index of image embeddings

# LAION-400M Dataset Statistics
The LAION-400M and future even bigger ones are in fact datasets of datasets. For instance, it can be filtered out by image sizes into smaller datasets like th
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Files and datasets in Parquet format related to molecular dynamics and retrieved from the Zenodo, Figshare and OSF data repositories. The file 'data_model_parquet.md' is a codebook that contains data models for the Parquet files.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities.
Details can be found in the attached report.
The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries.
The split into train, validation and test set follows the split of the original datasets.
pip install pandas pyarrow
import pandas as pd
df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
dataset AudioSet
filename train/---2_BBVHAA.mp3
captions_visual [a man in a black hat and glasses.]
captions_auditory [a man speaks and dishes clank.]
tags [Speech]
The annotation file consists of the following fields (see the sketch after this list):
filename: Name of the corresponding file (video or audio file)
dataset: Source dataset associated with the data point
captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
captions_auditory: A list of captions related to the auditory content of the video
tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided
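As referenced above, a small hedged sketch that builds on the earlier example and filters on these fields:

# Keep only annotation rows that have visual captions
# (captions_visual can be NaN for audio-only sources).
import pandas as pd

df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
with_visual = df[df['captions_visual'].notna()]
print(len(with_visual), "of", len(df), "clips have visual captions")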
The raw data files for most datasets are not released due to licensing issues and must be downloaded from the source. However, if files are missing from the source, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de