This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here.
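For illustration only, the hedged sketch below shows one way such an API could be queried from Python, assuming the hosting portal is a CKAN instance (as on data.ca.gov) and using a placeholder resource ID; the real resource ID for a given year can be copied from that year's resource page.

# Hedged sketch, not official documentation: CKAN datastore query with a
# placeholder resource ID and a small row limit.
import requests

CKAN_DATASTORE = "https://data.ca.gov/api/3/action/datastore_search"
params = {"resource_id": "<resource-id-for-a-given-year>", "limit": 5}
records = requests.get(CKAN_DATASTORE, params=params).json()["result"]["records"]
for rec in records:
    print(rec.get("DataQuality"), rec.get("DataQualityIndicator"))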
Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
https://academictorrents.com/nolicensespecified
Trip record data from the New York City Taxi and Limousine Commission (TLC) from January 2009 through December 2016 was consolidated and brought into a consistent Parquet format by Ravi Shekhar.
This dataset is based on the Criteo subset dataset and the Criteo Display Advertising Challenge. The notebook Preprocess Criteo to Parquet converts the .txt files to .parquet files. Parquet is a column-oriented, compressed data format that requires less storage; in short, data can be read from a Parquet file faster.
This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.
Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
This dataset gathers data in `.parquet` format. Instead of having one `.csv.gz` file per department per period, all departments are grouped into a single file per period. When possible (depending on the size), several periods are grouped in the same file.

### Data origin
The data come from:
- Basic climatological data - monthly
- Basic climatological data - daily
- Basic climatological data - hourly
- Basic climatological data - 6 minutes

### Data preparation
The files ending in `.prepared` have undergone light preparation steps:
- removal of spaces in column names
- (flexible) typing

The data are typed as follows:
- date columns (`YYYYMM`, `YYYYMMDD`, `YYYYMMDDHH`, `YYYYMMDDHHMN`): integer
- `NUM_POSTE`: string
- `USUAL_NAME`: string
- `LAT`: float
- `LON`: float
- `ALTI`: integer
- columns beginning with `Q` ("quality") or `NB` ("number"): integer

### Update
The data are updated at least once a week (depending on my availability) for the "latest-2023-2024" period. If you have specific needs, feel free to reach out to me.

### Re-use: Meteo Squad
These files are used in the Meteo Squad web application: https://www.meteosquad.com

### Contact
If you have specific requests, please do not hesitate to contact me: contact@mistermeteo.com

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on Supercomputer Fugaku over the three years of public system usage (March 2021-April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption and performance metrics (e.g. #flops, memory bandwidth, operational intensity and memory/compute-bound label), which allows for the prediction of a multitude of job characteristics. The full list of features can be found in the file feature_list.csv.
The sensitive data appears in both anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes, without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.
F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.
The files of F-DATA are saved as .parquet files. It is possible to load such files as dataframes by leveraging the pandas APIs, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:
import pandas as pd
df = pd.read_parquet("21_01.parquet")
df.head()
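To work with more than one month at a time, a minimal sketch (not taken from the F-DATA repository) could concatenate several monthly files, relying on the YY_MM.parquet naming described above:

# Minimal sketch: load several monthly files at once and concatenate them.
import glob
import pandas as pd

files = sorted(glob.glob("2?_??.parquet"))  # e.g. 21_01.parquet ... 24_04.parquet
jobs = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
print(f"{len(jobs)} jobs loaded from {len(files)} files")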
https://cdla.io/sharing-1-0/
These data files are outputs from my public notebook at the link below: https://www.kaggle.com/code/ravi20076/optiver-memoryreduction. This is my first tryst with the Optiver stock prediction challenge. I curate these datasets to ensure memory reduction by assigning a suitable datatype to the relevant columns based on their min-max values. I curate 2 versions of the data: one with only integer columns compressed, and another with both integer and float columns compressed. I remove the row id column in both versions and save the results as a parquet file to facilitate ease of usage.
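For illustration, the sketch below shows the general min-max downcasting idea (not the exact notebook code); the "row_id" column name and file names are assumptions.

# Minimal sketch of min-max based downcasting: shrink integer and (optionally)
# float columns to the smallest dtype that can hold their value range.
import numpy as np
import pandas as pd

def reduce_memory(df: pd.DataFrame, compress_floats: bool = True) -> pd.DataFrame:
    for col in df.columns:
        col_type = df[col].dtype
        if np.issubdtype(col_type, np.integer):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif compress_floats and np.issubdtype(col_type, np.floating):
            df[col] = pd.to_numeric(df[col], downcast="float")
    return df

# Hypothetical usage:
# df = pd.read_csv("train.csv").drop(columns=["row_id"])
# reduce_memory(df).to_parquet("train_compressed.parquet")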
Column descriptions are provided on the competition data page, linked below: https://www.kaggle.com/competitions/optiver-trading-at-the-close/data
A very good introduction kernel is also provided by the host: https://www.kaggle.com/code/tomforbes/optiver-trading-at-the-close-introduction
Image source - https://www.investopedia.com/stock-trading-4689660
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Pashto Synthetic Speech Dataset Parquet (20k)
This dataset contains 40000 synthetic speech recordings in the Pashto language, with 20000 male voice recordings and 20000 female voice recordings, stored in Parquet format.
Dataset Information
Dataset Size: 20000 sentences
Total Recordings: 40000 audio files (20000 male + 20000 female)
Audio Format: WAV, 24kHz, 16-bit PCM, embedded directly in Parquet files
Dataset Format: Parquet with 500MB shards
Sampling Rate: 24kHz
… See the full description on the dataset page: https://huggingface.co/datasets/ihanif/pashto_speech_20k.
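A minimal sketch for loading the Parquet shards through the Hugging Face datasets library is shown below; the split name "train" is an assumption, and the actual split names are documented on the dataset page.

# Minimal sketch: stream the Parquet shards via the `datasets` library.
from datasets import load_dataset

ds = load_dataset("ihanif/pashto_speech_20k", split="train")
print(ds[0])  # one record with its sentence and the embedded 24 kHz audio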
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the two semantically enriched trajectory datasets introduced in the CIKM Resource Paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions", by Chiara Pugliese (CNR-IIT), Francesco Lettich (CNR-ISTI), Guido Rocchietti (CNR-ISTI), Chiara Renso (CNR-ISTI), and Fabio Pinelli (IMT Lucca, CNR-ISTI).
The two datasets were generated with an open source pipeline based on the Jupyter notebooks published in the GitHub repository behind our resource paper, and our MAT-Builder system. Overall, our pipeline first generates the files that we provide in the [paris|nyc]_input_matbuilder.zip archives; the files are then passed as input to the MAT-Builder system, which ultimately generates the two semantically enriched trajectory datasets for Paris and New York City, both in tabular and RDF formats. For more details on the input and output data, please see the sections below.
The [paris|nyc]_input_matbuilder.zip archives contain the data sources we used with the MAT-Builder system to semantically enrich raw preprocessed trajectories. More specifically, the archives contain the following files:
The [paris|nyc]_output_tabular.zip zip archives contain the output files generated by MAT-Builder that express the semantically enriched Paris and New York City datasets in tabular format. More specifically, they contain the following files:
There is then a second set of columns which represents the characteristics of the POI that has been associated with a stop. The relevant ones are:
Data collected for marine benthic infauna, freshwater benthic macroinvertebrate (BMI), algae, bacteria and diatom taxonomic analyses, from the California Environmental Data Exchange Network (CEDEN). Note that single-species bacteria concentrations are stored within the chemistry template, whereas bacteria abundance data are stored within this data set. Each record represents a result from a specific event location for a single organism in a single sample.
The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Zip files are provided for bulk data downloads (in csv or parquet file format), and developers can use the API associated with the "CEDEN Benthic Data" (csv) resource to access the data.
Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.
Dataset description
Parquet file, with:
35694 rows
154 columns
The file is indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.
Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.
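As a quick illustration of the indexing scheme described above, the following hedged sketch (with an assumed file name) loads the feature matrix and splits the index back into participant and month.

# Minimal sketch; "discover_features.parquet" is a hypothetical file name.
import pandas as pd

df = pd.read_parquet("discover_features.parquet")
parts = df.index.to_series().str.split("_", expand=True)
df["participant_id"] = parts[0].astype(int)  # randomly generated participant integer
df["month"] = parts[1].astype(int)           # month number within the study
print(df.shape)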
The file contains an aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). This matrix focuses on individual changes in depression status over time, as measured by PHQ-9.
The DiSCover Project is a 1-year long longitudinal study consisting of 10,036 individuals in the United States, who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes, between January 2018 and January 2020.
The data subset used in this work comprises the following:
Wearable PGHD: step and sleep data from the participants’ consumer-grade wearable devices (Fitbit) worn throughout the study
Screener survey: prior to the study, participants self-reported socio-demographic information, as well as comorbidities
Lifestyle and medication changes (LMC) survey: every month, participants were requested to complete a brief survey reporting changes in their lifestyle and medication over the past month
Patient Health Questionnaire (PHQ-9) score: every 3 months, participants were requested to complete the PHQ-9, a 9-item questionnaire that has proven to be reliable and valid to measure depression severity
From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).
The dataset contains a total of 35,694 rows, one for each month of data collection from the participants. We can generate 3-month long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation ‘SM0’ (sample month 0), ‘SM1’, ‘SM2’ and ‘SM3’ to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1, SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Data (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported - one for each data tile. The deepest point values will be extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

Potential methods for addressing limitations of this dataset:
- Identifying and removing unrealistic temperature estimates:
  - Calculate the total percentage of cloud pixels over a given waterbody as: percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
  - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
  - Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles:
  - These waterbodies can be identified using the "site_id_tile_hv_crosswalk.csv" file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

The data release includes the following files:
- "year_byscene=XXXX.zip" – includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by the scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the _byscene datasets may only include one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data is extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" – includes the summary statistics for individual waterbodies and the deepest points within each waterbody by the year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables that aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" – This script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" – This cross-walk file identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" – This cross-walk file identifies the site_id (nhdhr_{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" – a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
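Because the .parquet files sit in nested Hive-style directories (e.g. year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet), they can also be read from Python as an alternative to the provided R script; the hedged sketch below assumes the "year=2023.zip" archive has been extracted into a local "year=2023" directory.

# Minimal sketch: read the Hive-partitioned .parquet files with pyarrow and
# keep only the annual lake summaries of one tile.
import pyarrow.dataset as ds

lakes = ds.dataset("year=2023", format="parquet", partitioning="hive")
table = lakes.to_table(
    filter=(ds.field("dataset") == "lakes_annual") & (ds.field("tile_hv") == "002-001")
)
print(table.to_pandas().head())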
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset overview
This dataset provides data and images of snowflakes in free fall collected with a Multi-Angle Snowflake Camera (MASC). The dataset includes, for each recorded snowflake:
A triplet of gray-scale images corresponding to the three cameras of the MASC
A large set of geometrical and textural descriptors, the pre-compiled output of published retrieval algorithms, as well as basic environmental information at the location and time of each measurement.
The pre-computed descriptors and retrievals are available either individually for each camera view or, for some of them, as descriptors of the triplet as a whole. A non-exhaustive list of pre-computed quantities includes, for example:
Textural and geometrical descriptors as in Praz et al 2017
Hydrometeor classification, riming degree estimation, melting identification, as in Praz et al 2017
Blowing snow identification, as in Schaer et al 2020
Mass, volume, gyration estimation, as in Leinonen et al 2021
Data format and structure
The dataset is divided into four .parquet files (for scalar descriptors) and a Zarr database (for the images). A detailed description of the data content and of the data records is available here.
Supporting code
A python-based API is available to manipulate, display and organize the data of our dataset. It can be found on GitHub. See also the code documentation on ReadTheDocs.
Download notes
All files available here for download should be stored in the same folder, if the python-based API is used
MASCdb.zarr.zip must be unzipped after download
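As a complement to the python-based API, a minimal sketch for opening the files directly is given below; the descriptor file name is hypothetical (the real names are listed in the data record description).

# Minimal sketch: read one descriptor .parquet with pandas and open the
# unzipped Zarr database of image triplets.
import pandas as pd
import zarr

cam0 = pd.read_parquet("MASCdb_cam0.parquet")  # hypothetical descriptor file name
print(cam0.columns)

images = zarr.open("MASCdb.zarr", mode="r")    # gray-scale image triplets
print(list(images.keys()))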
Field campaigns
A list of campaigns included in the dataset, with a minimal description, is given in the following table.
| Campaign_name | Information | Shielded / Not shielded |
| --- | --- | --- |
| APRES3-2016 & APRES3-2017 | Instrument installed in Antarctica in the context of the APRES3 project. See for example Genthon et al, 2018 or Grazioli et al 2017 | Not shielded |
| Davos-2015 | Instrument installed in the Swiss Alps within the context of SPICE (Solid Precipitation InterComparison Experiment) | Shielded (DFIR) |
| Davos-2019 | Instrument installed in the Swiss Alps within the context of RACLETS (Role of Aerosols and CLouds Enhanced by Topography on Snow) | Not shielded |
| ICEGENESIS-2021 | Instrument installed in the Swiss Jura in a MeteoSwiss ground measurement site, within the context of ICE-GENESIS. See for example Billault-Roux et al, 2023 | Not shielded |
| ICEPOP-2018 | Instrument installed in Korea, in the context of ICEPOP. See for example Gehring et al 2021 | Shielded (DFIR) |
| Jura-2019 & Jura-2023 | Instrument installed in the Swiss Jura within a MeteoSwiss measurement site | Not shielded |
| Norway-2016 | Instrument installed in Norway during the High-Latitude Measurement of Snowfall (HiLaMS) campaign. See for example Cooper et al, 2022 | Not shielded |
| PLATO-2019 | Instrument installed in the "Davis" Antarctic base during the PLATO field campaign | Not shielded |
| POPE-2020 | Instrument installed in the "Princess Elizabeth Antarctica" base during the POPE campaign. See for example Ferrone et al, 2023 | Not shielded |
| Remoray-2022 | Instrument installed in the French Jura | Not shielded |
| Valais-2016 | Instrument installed in the Swiss Alps in a ski resort | Not shielded |

DFIR = Double Fence Intercomparison Reference
Version
1.0 - Two new campaigns ("Jura-2023", "Norway-2016") added. Added references and list of campaigns.
0.3 - a new campaign is added to the dataset ("Remoray-2022")
0.2 - rename of variables. Variable precision (digits) standardized
0.1 - first upload
https://creativecommons.org/publicdomain/zero/1.0/
This dataset brings to you Iris Dataset in several data formats (see more details in the next sections).
You can use it to test the ingestion of data in all these formats using Python or R libraries. We also prepared a Python Jupyter Notebook and an R Markdown report that read all these formats:
Iris Dataset was created by R. A. Fisher and donated by Michael Marshall.
Repository on UCI site: https://archive.ics.uci.edu/ml/datasets/iris
Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/
The downloaded file is iris.data and is formatted as a comma-delimited file.
This small data collection was created to help you test your skills with ingesting various data formats.
This file was processed to convert the data into the following formats (a short reading sketch follows the list):
* csv - comma separated values format
* tsv - tab separated values format
* parquet - parquet format
* feather - feather format
* parquet.gzip - compressed parquet format
* h5 - hdf5 format
* pickle - Python binary object file - pickle format
* xlsx - Excel format
* npy - Numpy (Python library) binary format
* npz - Numpy (Python library) binary compressed format
* rds - Rds (R specific data format) binary format
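As referenced above, here is a short hedged sketch for reading a few of these formats back with pandas; the file names (iris.csv, iris.parquet, iris.feather) are assumptions and may differ slightly from the names in this dataset.

# Minimal sketch: load the same table from three formats and compare shapes.
import pandas as pd

iris_csv = pd.read_csv("iris.csv")
iris_parquet = pd.read_parquet("iris.parquet")   # needs pyarrow or fastparquet
iris_feather = pd.read_feather("iris.feather")   # needs pyarrow

assert iris_csv.shape == iris_parquet.shape == iris_feather.shape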
I would like to acknowledge the work of the creator of the dataset - R. A. Fisher and of the donor - Michael Marshall.
Use these data formats to test your skills in ingesting data in various formats.
Train data of the Riiid competition is a large dataset of over 100 million rows and 10 columns that does not fit into a Kaggle Notebook's RAM using the default pandas read_csv, resulting in a search for alternative approaches and formats.
Train data of the Riiid competition in different formats.
Reading the .csv file for the Riiid competition took a huge amount of time and memory. This inspired me to convert the .csv into different file formats so that they can be loaded easily into a Kaggle kernel.
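A minimal sketch of this conversion idea, with hypothetical file names, is shown below.

# Minimal sketch: read the competition CSV once, then save it in formats
# that load much faster in a notebook.
import pandas as pd

train = pd.read_csv("train.csv", low_memory=False)
train.to_parquet("train.parquet")   # needs pyarrow or fastparquet
train.to_feather("train.feather")   # needs pyarrow
train.to_pickle("train.pkl")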
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ENTSO-E Pan-European Climatic Database (PECD 2021.3) in Parquet format
TL;DR: this is a tidy and friendly version of a subset of the PECD 2021.3 data by ENTSO-E: hourly capacity factors for wind onshore, offshore, solar PV, hourly electricity demand, weekly inflow for reservoir and pumping and daily generation for run-of-river. All the data is provided for >30 climatic years (1982-2019 for wind and solar, 1982-2016 for demand, 1982-2017 for hydropower) and at national and sub-national (>140 zones) level.
UPDATE (19/10/2022): updated the demand files after fixing a bug in the processing code (the file for 2030 was the same as the one for 2025) and solving an issue caused by a malformed header in the ENTSO-E Excel files.
Together with the latest European Resource Adequacy Assessment (ERAA 2021), ENTSO-E has released all the inputs used in the study. Those inputs include:
- Demand dataset: https://eepublicdownloads.azureedge.net/clean-documents/sdc-documents/ERAA/Demand%20Dataset.7z
- Climate data: https://eepublicdownloads.entsoe.eu/clean-documents/sdc-documents/ERAA/Climate%20Data.7z
The data files and the methodology are available on the official webpage.
As done for the previous releases (see https://zenodo.org/record/3702418#.YbmhR23MKMo and https://zenodo.org/record/3985078#.Ybmhem3MKMo), the original data - stored in large Excel spreadsheets - have been tidied and formatted in open and friendly formats (CSV for the small tables and Parquet for the large files).
Furthermore, we have carried out a simple country-level aggregation of the original data, which instead uses >140 zones.
DISCLAIMER: the content of this dataset has been created with the greatest possible care. However, we invite users to rely on the original data for critical applications and studies.
Description
This dataset includes the following files:
Note
I would like to thank Laurens Stoop for sharing the onshore wind data for the 2030 scenario, which was corrupted in the original archive.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Fuτure dataset is intended for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons. The dataset is generated with Pythia 8, with the full detector simulation performed by Geant4 using the CLIC-like detector setup CLICdet (CLIC_o3_v14). Events are reconstructed using the Marlin reconstruction framework and interfaced with Key4HEP. Particle candidates in the reconstructed events are reconstructed using the PandoraPF algorithm.
In this version of the dataset no γγ -> hadrons background is included.
This dataset contains e+e- samples with Z->ττ, ZH (H->ττ) and Z->qq events, with approximately 2 million events simulated in each category.
The following e+e- processes were simulated with Pythia 8 at sqrt(s) = 380 GeV:
The .root files from the MC simulation chain are eventually processed by the software found on GitHub in order to create flat ntuples as the final product.
The basis of the ntuples are the particle flow (PF) candidates from PandoraPF. Each PF candidate has a four-momentum, charge and particle label (electron / muon / photon / charged hadron / neutral hadron). The PF candidates in a given event are clustered into jets using the generalized kt algorithm for ee collisions, with parameters p=-1 and R=0.4. The minimum pT is set to 0 GeV for both generator-level jets and reconstructed jets. The dataset contains the four-momenta of the jets, along with the PF candidates in the jets with the above listed properties.
Additionally, a set of variables describing the tau lifetime are calculated using the software on GitHub. As the tau lifetime is very short, these variables are sensitive to true tau decays. In the calculation of these lifetime variables, we use a linear approximation.
In summary, the features found in the flat ntuples are:
| Name | Description |
| --- | --- |
| reco_cand_p4s | 4-momenta per particle in the reco jet. |
| reco_cand_charge | Charge per particle in the jet. |
| reco_cand_pdg | PDGid per particle in the jet. |
| reco_jet_p4s | RecoJet 4-momenta. |
| reco_cand_dz | Longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
| reco_cand_dz_err | Uncertainty of the longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
| reco_cand_dxy | Transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
| reco_cand_dxy_err | Uncertainty of the transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
| gen_jet_p4s | GenJet 4-momenta. Matched with RecoJet within a cone of radius dR < 0.3. |
| gen_jet_tau_decaymode | Decay mode of the associated genTau. Jets that have associated leptonically decaying taus are removed, so there are no DM=16 jets. If no GenTau can be matched to GenJet within dR < 0.4, a fill value is used. |
| gen_jet_tau_p4s | Visible 4-momenta of the genTau. If no GenTau can be matched to GenJet within dR < 0.4, a fill value is used. |
The ground truth is based on stable particles at the generator level, before detector simulation. These particles are clustered into generator-level jets and are matched to generator-level τ leptons as well as reconstructed jets. In order for a generator-level jet to be matched to a generator-level τ lepton, the τ lepton needs to be inside a cone of dR = 0.4. The same applies for the reconstructed jet, with the requirement on dR being set to dR = 0.3. For each reconstructed jet, we define three target values related to τ lepton reconstruction:
| File | # Jets | Size |
| --- | --- | --- |
| z_test.parquet | 870 843 | 171 MB |
| z_train.parquet | 3 483 369 | 681 MB |
| zh_test.parquet | 1 068 606 | 213 MB |
| zh_train.parquet | 4 274 423 | 851 MB |
| qq_test.parquet | 6 366 715 | 1.4 GB |
| qq_train.parquet | 25 466 858 | 5.6 GB |
The dataset consists of 6 files of 8.9 GB in total.
The .parquet files can be directly loaded with the Awkward Array Python library.
An example of how one might use the dataset and the features is given in data_intro.ipynb.
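For a quick start, the hedged sketch below loads one of the files with Awkward Array and inspects a few of the fields listed above (requires awkward >= 2 and pyarrow).

# Minimal sketch: load one ntuple file and count PF candidates per jet.
import awkward as ak

jets = ak.from_parquet("z_test.parquet")
print(jets.fields)                 # reco_cand_p4s, reco_cand_charge, ...
print(ak.num(jets.reco_cand_pdg))  # number of PF candidates in each jet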
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LAION-400M: the world's largest openly available image-text-pair dataset, with 400 million samples.

# Concept and Content
The LAION-400M dataset is completely openly and freely accessible. All images and texts in the LAION-400M dataset have been filtered with OpenAI's CLIP by calculating the cosine similarity between the text and image embeddings and dropping those with a similarity below 0.3. The threshold of 0.3 had been determined through human evaluations and seems to be a good heuristic for estimating semantic image-text-content matching. The image-text pairs have been extracted from the Common Crawl web data dump and are from random web pages crawled between 2014 and 2021.

# Download Information
You can find:
- The CLIP image embeddings (NumPy files)
- The parquet files
- KNN index of image embeddings

# LAION-400M Dataset Statistics
The LAION-400M and future even bigger ones are in fact datasets of datasets. For instance, it can be filtered out by image sizes into smaller datasets like th
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Files and datasets in Parquet format related to molecular dynamics and retrieved from the Zenodo, Figshare and OSF data repositories. The file 'data_model_parquet.md' is a codebook that contains data models for the Parquet files.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities.
Details can be found in the attached report.
The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries.
The split into train, validation and test set follows the split of the original datasets.
pip install pandas pyarrow
import pandas as pd
df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
dataset AudioSet
filename train/---2_BBVHAA.mp3
captions_visual [a man in a black hat and glasses.]
captions_auditory [a man speaks and dishes clank.]
tags [Speech]
The annotation file consists of the following fields (see the sketch after this list):
filename: Name of the corresponding file (video or audio file)
dataset: Source dataset associated with the data point
captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
captions_auditory: A list of captions related to the auditory content of the video
tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided
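As referenced above, a small hedged sketch that builds on the earlier example and filters on these fields:

# Keep only annotation rows that have visual captions
# (captions_visual can be NaN for audio-only sources).
import pandas as pd

df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
with_visual = df[df['captions_visual'].notna()]
print(len(with_visual), "of", len(df), "clips have visual captions")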
The raw data files for most datasets are not released due to licensing issues and must be downloaded from the source. However, if files are missing from the source, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de