100+ datasets found
  1. Surface Water - Habitat Results

    • datasets.ai
    • catalog.data.gov
    33, 57, 8
    Updated Jul 23, 2021
    + more versions
    Cite
    State of California (2021). Surface Water - Habitat Results [Dataset]. https://datasets.ai/datasets/surface-water-habitat-results
    Explore at:
    57, 8, 33
    Available download formats
    Dataset updated
    Jul 23, 2021
    Dataset authored and provided by
    State of California
    Description

    This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

    Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here.

    Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
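
    As an illustration only (not the R example referenced above), one way to query a single year's resource programmatically, assuming the hosting portal exposes the standard CKAN datastore_search action and substituting a real resource ID from the resource page:

    import requests

    # Placeholder resource ID; the real one is listed on each year's resource page
    api_url = "https://data.ca.gov/api/3/action/datastore_search"
    params = {"resource_id": "REPLACE-WITH-RESOURCE-ID", "limit": 100}

    response = requests.get(api_url, params=params, timeout=60)
    response.raise_for_status()
    records = response.json()["result"]["records"]
    print(len(records), "records retrieved")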

  2. New York Taxi Data 2009-2016 in Parquet Format

    • academictorrents.com
    bittorrent
    Updated Jul 1, 2017
    Cite
    New York Taxi and Limousine Commission (2017). New York Taxi Data 2009-2016 in Parquet Format [Dataset]. https://academictorrents.com/details/4f465810b86c6b793d1c7556fe3936441081992e
    Explore at:
    bittorrent (35078948106)
    Available download formats
    Dataset updated
    Jul 1, 2017
    Dataset provided by
    New York City Taxi and Limousine Commission (http://www.nyc.gov/tlc)
    Authors
    New York Taxi and Limousine Commission
    License

    https://academictorrents.com/nolicensespecified

    Area covered
    New York
    Description

    Trip record data from the Taxi and Limousine Commission (TLC), covering January 2009 through December 2016, was consolidated and converted into a consistent Parquet format by Ravi Shekhar.

  3. Criteo_dataset_parquet

    • kaggle.com
    zip
    Updated Dec 1, 2020
    Cite
    BenediktSchifferer (2020). Criteo_dataset_parquet [Dataset]. https://www.kaggle.com/benediktschifferer/criteo-dataset-parquet
    Explore at:
    zip (2445976289 bytes)
    Available download formats
    Dataset updated
    Dec 1, 2020
    Authors
    BenediktSchifferer
    Description

    This dataset is based on the Criteo subset dataset and the Criteo Display Advertising Challenge. The notebook Preprocess Criteo to Parquet converts the .txt files to .parquet files. Parquet is a column-oriented, compressed data format that requires less storage; put simply, it is faster to read data from a Parquet file.
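
    As a rough illustration of that kind of conversion (not the notebook's actual code), a tab-separated .txt file could be rewritten as Parquet with pandas, assuming pyarrow is installed and using placeholder file names:

    import pandas as pd

    # Placeholder file name; the Criteo text files are tab-separated with no header row
    df = pd.read_csv("train.txt", sep="\t", header=None)

    # Write a compressed, column-oriented Parquet file
    df.to_parquet("train.parquet", engine="pyarrow", compression="snappy", index=False)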

  4. Surface Water - Habitat Results

    • data.cnra.ca.gov
    • data.ca.gov
    csv, pdf, zip
    Updated Jul 3, 2025
    Cite
    California State Water Resources Control Board (2025). Surface Water - Habitat Results [Dataset]. https://data.cnra.ca.gov/dataset/surface-water-habitat-results
    Explore at:
    pdf, csv, zip
    Available download formats
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    California State Water Resources Control Board
    Description

    This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

    Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.

    Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.

  5. PARQUET - Basic climatological data - monthly - daily - hourly - 6 minutes...

    • gimi9.com
    + more versions
    Cite
    PARQUET - Basic climatological data - monthly - daily - hourly - 6 minutes (parquet format) [Dataset]. https://gimi9.com/dataset/eu_66159f1bf0686eb4806508e1
    Explore at:
    Description

    Format: .parquet. This dataset gathers data in .parquet format. Instead of having one .csv.gz per department per period, all departments are grouped into a single file per period. When possible (depending on the size), several periods are grouped in the same file.

    Data origin. The data come from:
    • Basic climatological data - monthly
    • Basic climatological data - daily
    • Basic climatological data - hourly
    • Basic climatological data - 6 minutes

    Data preparation. The files ending with .prepared have undergone light preparation steps: removal of spaces in column names and (flexible) typing. The data are typed as follows:
    • date columns (YYYYMM, YYYYMMDD, YYYYMMDDHH, YYYYMMDDHHMN): integer
    • NUM_POSTE: string
    • USUAL_NAME: string
    • LAT: float
    • LON: float
    • ALTI: integer
    • columns beginning with Q ("quality") or NB ("number"): integer

    Update. The data are updated at least once a week (depending on my availability) for the "latest-2023-2024" period. If you have specific needs, feel free to contact me.

    Re-use: Meteo Squad. These files are used in the Meteo Squad web application: https://www.meteosquad.com

    Contact. If you have specific requests, please do not hesitate to contact me: contact@mistermeteo.com
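
    To check how a given file is typed, the Parquet schema can be inspected without loading the data; a minimal sketch with pyarrow (the file name is a placeholder for one of the downloaded .prepared files):

    import pyarrow.parquet as pq

    # Placeholder file name: substitute one of the downloaded .prepared Parquet files
    schema = pq.read_schema("some_period.prepared.parquet")
    for field in schema:
        print(field.name, field.type)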

  6. Data from: F-DATA: A Fugaku Workload Dataset for Job-centric Predictive...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 10, 2024
    Cite
    Yamamoto, Keiji (2024). F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in HPC Systems [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11467482
    Explore at:
    Dataset updated
    Jun 10, 2024
    Dataset provided by
    Domke, Jens
    Yamamoto, Keiji
    Antici, Francesco
    Kiziltan, Zeynep
    Bartolini, Andrea
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on Supercomputer Fugaku over the three years of public system usage (March 2021-April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption and performance metrics (e.g. #flops, memory bandwidth, operational intensity and memory/compute bound label), which allows for the prediction of a multitude of job characteristics. The full list of features can be found in the file feature_list.csv.

    The sensitive data appears both in anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes, without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.

    F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.

    The files of F-DATA are saved as .parquet files. It is possible to load such files as dataframes by leveraging the pandas APIs, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:

    # Import the pandas library
    import pandas as pd

    # Read the 21_01.parquet file into a dataframe
    df = pd.read_parquet("21_01.parquet")
    df.head()
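
    To go beyond a single month, one possible way to combine several monthly files into one dataframe (a sketch, assuming the YY_MM.parquet files sit in the working directory and fit in memory):

    import glob

    import pandas as pd

    # Collect all monthly files and concatenate them into one dataframe
    files = sorted(glob.glob("*_*.parquet"))
    df_all = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
    print(df_all.shape)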

  7. Optiver-MemoryReducedDatasets

    • kaggle.com
    Updated Oct 2, 2023
    Cite
    Ravi Ramakrishnan (2023). Optiver-MemoryReducedDatasets [Dataset]. https://www.kaggle.com/datasets/ravi20076/optiver-memoryreduceddatasets/versions/8
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 2, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ravi Ramakrishnan
    License

    https://cdla.io/sharing-1-0/

    Description

    These data files are outputs from my public notebook at the link below: https://www.kaggle.com/code/ravi20076/optiver-memoryreduction. This is my first tryst with the Optiver stock prediction challenge. I curate these datasets to ensure memory reduction by assigning a suitable datatype to the relevant columns in the dataset based on their min-max values. I curate 2 versions of the data, one with only integer columns compressed, and another with integer and float columns compressed. I remove row id in both versions and save the results as a parquet file to facilitate ease of usage.
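
    A minimal sketch of this kind of dtype-based memory reduction (an illustration of the general idea, not the notebook's exact code; the input file and row-id column names are placeholders):

    import pandas as pd

    def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
        # Downcast integer and float columns to the smallest dtype that fits their min-max range
        for col in df.select_dtypes(include="integer").columns:
            df[col] = pd.to_numeric(df[col], downcast="integer")
        for col in df.select_dtypes(include="float").columns:
            df[col] = pd.to_numeric(df[col], downcast="float")
        return df

    train = pd.read_csv("train.csv")
    train = downcast_numeric(train).drop(columns=["row_id"], errors="ignore")
    train.to_parquet("train_reduced.parquet", index=False)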

    Column descriptions are provided on the competition data page: https://www.kaggle.com/competitions/optiver-trading-at-the-close/data

    A very good introductory kernel is also provided by the host: https://www.kaggle.com/code/tomforbes/optiver-trading-at-the-close-introduction

    Image source - https://www.investopedia.com/stock-trading-4689660

  8. pashto_speech_20k

    • huggingface.co
    Updated Apr 24, 2025
    + more versions
    Cite
    Hanif Rahman (2025). pashto_speech_20k [Dataset]. https://huggingface.co/datasets/ihanif/pashto_speech_20k
    Explore at:
    Dataset updated
    Apr 24, 2025
    Authors
    Hanif Rahman
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Pashto Synthetic Speech Dataset Parquet (20k)

    This dataset contains 40000 synthetic speech recordings in the Pashto language, with 20000 male voice recordings and 20000 female voice recordings, stored in Parquet format.

      Dataset Information
    

    Dataset Size: 20000 sentences
    Total Recordings: 40000 audio files (20000 male + 20000 female)
    Audio Format: WAV, 24kHz, 16-bit PCM, embedded directly in Parquet files
    Dataset Format: Parquet with 500MB shards
    Sampling Rate: 24kHz
    … See the full description on the dataset page: https://huggingface.co/datasets/ihanif/pashto_speech_20k.
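
    Because the shards are hosted on the Hugging Face Hub, one straightforward way to load them is via the datasets library (a sketch; split names and features should be checked on the dataset page):

    from datasets import load_dataset

    # Download the Parquet shards from the Hugging Face Hub
    ds = load_dataset("ihanif/pashto_speech_20k")
    print(ds)  # shows the available splits and their features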

  9. Datasets of the CIKM resource paper "A Semantically Enriched Mobility...

    • zenodo.org
    zip
    Updated Jun 16, 2025
    Cite
    Francesco Lettich; Chiara Pugliese; Guido Rocchietti; Chiara Renso; Fabio Pinelli (2025). Datasets of the CIKM resource paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions" [Dataset]. http://doi.org/10.5281/zenodo.15658129
    Explore at:
    zip
    Available download formats
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francesco Lettich; Chiara Pugliese; Guido Rocchietti; Chiara Renso; Fabio Pinelli
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the two semantically enriched trajectory datasets introduced in the CIKM Resource Paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions", by Chiara Pugliese (CNR-IIT), Francesco Lettich (CNR-ISTI), Guido Rocchietti (CNR-ISTI), Chiara Renso (CNR-ISTI), and Fabio Pinelli (IMT Lucca, CNR-ISTI).

    The two datasets were generated with an open source pipeline based on the Jupyter notebooks published in the GitHub repository behind our resource paper, and our MAT-Builder system. Overall, our pipeline first generates the files that we provide in the [paris|nyc]_input_matbuilder.zip archives; the files are then passed as input to the MAT-Builder system, which ultimately generates the two semantically enriched trajectory datasets for Paris and New York City, both in tabular and RDF formats. For more details on the input and output data, please see the sections below.

    Input data

    The [paris|nyc]_input_matbuilder.zip archives contain the data sources we used with the MAT-Builder system to semantically enrich raw preprocessed trajectories. More specifically, the archives contain the following files:

    • raw_trajectories_[paris|nyc]_matbuilder.parquet: these are the datasets of raw preprocessed trajectories, ready for ingestion by the MAT-Builder system, as outputted by the notebook 5 - Ensure MAT-Builder compatibility.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents the sample of some trajectory, and the dataframe has the following columns:
      • traj_id: trajectory identifier;
      • user: user identifier;
      • lat: latitude of a trajectory sample;
      • lon: longitude of a trajectory sample;
      • time: timestamp of a sample;

    • pois.parqet: these are the POI datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 6 - Generate dataset POI from OpenStreetMap.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a POI, and the dataframe has the following columns:
      • osmid: POI OSM identifier
      • element_type: POI OSM element type
      • name: POI native name;
      • name:en: POI English name;
      • wikidata: POI WikiData identifier;
      • geometry: geometry associated with the POI;
      • category: POI category.

    • social_[paris|ny].parquet: these are the social media post datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 9 - Prepare social media dataset for MAT-Builder.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a single social media post, and the dataframe has the following columns:
      • tweet_ID: identifier of the post;
      • text: post's text;
      • tweet_created: post's timestamp;
      • uid: identifier of the user who posted.

    • weather_conditions.parquet: these are the weather conditions datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 7 - Meteostat daily data downloader.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents the weather conditions recorded on a single day (see the loading sketch after this list), and the dataframe has the following columns:
      • DATE: date on which the weather observation was recorded;
      • TAVG_C: average temperature in Celsius;
      • DESCRIPTION: weather conditions.
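
    As a small illustration, one way to load and inspect one of these inputs with pandas (a sketch, assuming pyarrow is installed and the file is in the working directory):

    import pandas as pd

    # Load the daily weather conditions and parse the DATE column
    weather = pd.read_parquet("weather_conditions.parquet")
    weather["DATE"] = pd.to_datetime(weather["DATE"])
    print(weather[["DATE", "TAVG_C", "DESCRIPTION"]].head())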

    Output data: the semantically enriched Paris and New York City datasets

    Tabular Representation

    The [paris|nyc]_output_tabular.zip zip archives contain the output files generated by MAT-Builder that express the semantically enriched Paris and New York City datasets in tabular format. More specifically, they contain the following files:

    • traj_cleaned.parquet: parquet file storing the dataframe containing the raw preprocessed trajectories after applying the MAT-Builder's preprocessing step on raw_trajectories_[paris|nyc]_matbuilder.parquet. The dataframe contains the same columns found in raw_trajectories_[paris|nyc]_matbuilder.parquet, except for time which in this dataframe has been renamed to datetime. The operations performed in the MAT-Builder's preprocessing step were:
      • (1) we filtered out trajectories having less than 2 samples;
      • (2) we filtered out noisy samples inducing velocities above 300 km/h;
      • (3) finally, we compressed the trajectories such that all points within a radius of 20 meters from a given initial point are compressed into a single point that has the median coordinates of all points and the time of the initial point.

    • stops.parquet: parquet file storing the dataframe containing the stop segments detected from the trajectories by the MAT-Builder's segmentation step. Each row in the dataframe represents a specific stop segment from some trajectory. The columns are:
      • datetime, which indicates when a stop segments starts;
      • leaving_datetime, which indicates when a stop segment ends;
      • uid, the trajectory user's identifier;
      • tid, the trajectory's identifier;
      • lat, the stop segment's centroid latitude;
      • lng, the stop segment's centroid longitude.
        NOTE: to uniquely identify a stop segment, you need the triple (stop segment's index in the dataframe, uid, tid).
    • moves.parquet: parquet file storing the dataframe containing the samples associated with the move segments detected from the trajectories by the MAT-Builder's segmentation step. Each row in the dataframe represents a specific sample belonging to some move segment of some trajectory. The columns are:
      • datetime, the sample's timestamp;
      • uid, the samples' trajectory user's identifier;
      • tid, the sample's trajectory's identifier;
      • lat, the sample's latitude;
      • lng, the sample's longitude;
      • move_id, the identifier of a move segment.
        NOTE: to uniquely identify a move segment, you need the triple (uid, tid, move_id).

    • enriched_occasional.parquet: parquet file storing the dataframe containing pairs representing associations between stop segments that have been deemed occasional and POIs found to be close to their centroids. As such, in this dataframe an occasional stop can appear multiple times, i.e., when there are multiple POIs located near a stop's centroid. The columns found in this dataframe are the same as in stops.parquet, plus two sets of columns.

      The first set of columns concerns a stop's characteristics:
      • stop_id, which represents the unique identifier of a stop segment and corresponds to the index of said stop in stops.parquet;
      • geometry_stop, which is a Shapely Point representing a stop's centroid;
      • geometry, which is the aforementioned Shapely Point plus a 50-meter buffer around it.

    There is then a second set of columns which represents the characteristics of the POI that has been associated with a stop. The relevant ones are:

      • index_poi, which is the index of the associated POI in the pois.parqet file;
      • osmid, which is the identifier given by OpenStreetMap to the POI;
      • name, the POI's name;
      • wikidata, the POI identifier on wikidata;
      • category, the POI's category;
      • geometry_poi, a Shapely (multi)polygon describing the POI's geometry;
      • distance, the distance between the stop segment's centroid and the POI.

    • enriched_systematic.parquet: parquet file storing the dataframe containing pairs representing associations between stop segments that have been deemed systematic and POIs found to be close to their centroids. This dataframe has exactly the same characteristics of enriched_occasional.parquet, plus the following columns:
      • systematic_id, the identifier of the cluster of systematic stops a systematic stop belongs to;
      • frequency, the number of systematic stops within a systematic stop's cluster;
      • home, the probability that the systematic stop's cluster represents the home of the associated user;
      • work, the probability that the systematic stop's cluster represents the workplace of the associated user;
      • other,

  10. Surface Water - Benthic Macroinvertebrate Results

    • data.cnra.ca.gov
    • data.ca.gov
    csv, pdf, zip
    Updated Jun 3, 2025
    + more versions
    Cite
    California State Water Resources Control Board (2025). Surface Water - Benthic Macroinvertebrate Results [Dataset]. https://data.cnra.ca.gov/dataset/surface-water-benthic-macroinvertebrate-results
    Explore at:
    pdf, zip, csv
    Available download formats
    Dataset updated
    Jun 3, 2025
    Dataset authored and provided by
    California State Water Resources Control Board
    Description

    Data collected for marine benthic infauna, freshwater benthic macroinvertebrate (BMI), algae, bacteria and diatom taxonomic analyses, from the California Environmental Data Exchange Network (CEDEN). Note that single-species bacteria concentrations are stored within the chemistry template, whereas bacteria abundance data are stored within this set. Each record represents a result from a specific event location for a single organism in a single sample.

    The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

    Zip files are provided for bulk data downloads (in csv or parquet file format), and developers can use the API associated with the "CEDEN Benthic Data" (csv) resource to access the data.

    Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.

  11. PSYCHE-D: predicting change in depression severity using person-generated...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 18, 2024
    Cite
    Martin Jaggi (2024). PSYCHE-D: predicting change in depression severity using person-generated health data (DATASET) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5085145
    Explore at:
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Martin Jaggi
    Marta Ferreira
    Raghu Kainkaryam
    Jae Min
    Ieuan Clay
    Mariko Makhmutova
    Description

    This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.

    Dataset description

    Parquet file, with:

    35694 rows

    154 columns

    The file is indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.

    Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.

    The file contains an aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). This matrix focuses on individual changes in depression status over time, as measured by PHQ-9.
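
    For illustration, one way to load the file and split its [participant]_[month] index (a sketch; the Parquet file name is a placeholder for the uploaded file):

    import pandas as pd

    # Placeholder file name for the uploaded Parquet file
    df = pd.read_parquet("psyche_d_features.parquet")

    # Split the "participant_month" index (e.g. "34_12") into its two parts
    parts = df.index.to_series().str.rsplit("_", n=1, expand=True)
    df["participant"] = parts[0].astype(int)
    df["month"] = parts[1].astype(int)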

    The DiSCover Project is a 1-year long longitudinal study consisting of 10,036 individuals in the United States, who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes, between January 2018 and January 2020.

    The data subset used in this work comprises the following:

    Wearable PGHD: step and sleep data from the participants’ consumer-grade wearable devices (Fitbit) worn throughout the study

    Screener survey: prior to the study, participants self-reported socio-demographic information, as well as comorbidities

    Lifestyle and medication changes (LMC) survey: every month, participants were requested to complete a brief survey reporting changes in their lifestyle and medication over the past month

    Patient Health Questionnaire (PHQ-9) score: every 3 months, participants were requested to complete the PHQ-9, a 9-item questionnaire that has proven to be reliable and valid to measure depression severity

    From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).

    The dataset contains a total of 35,694 rows for each month of data collection from the participants. We can generate 3-month long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation ‘SM0’ (sample month 0), ‘SM1’, ‘SM2’ and ‘SM3’ to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1, SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.

  12. Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8...

    • gimi9.com
    • data.usgs.gov
    • +1more
    Updated Feb 22, 2025
    + more versions
    Cite
    (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. https://gimi9.com/dataset/data-gov_water-temperature-of-lakes-in-the-conterminous-u-s-using-the-landsat-8-analysis-ready-2013
    Explore at:
    Dataset updated
    Feb 22, 2025
    Area covered
    Contiguous United States
    Description

    This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

    Limitations of this dataset include:
    • All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
    • Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values will be extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
    • Temperature data were not extracted from satellite images with more than 90% cloud cover.
    • Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

    Potential methods for addressing limitations of this dataset:
    • Identifying and removing unrealistic temperature estimates:
      • Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
      • Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
      • Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
    • Handling waterbodies split between multiple tiles:
      • These waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

    The data release includes the following files (a Python loading sketch follows this list):
    • "year_byscene=XXXX.zip" – includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by the scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the _byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data is extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
    • "year=XXXX.zip" – includes the summary statistics for individual waterbodies and the deepest points within each waterbody by the year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables that aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
    • "example_script_for_using_parquet.R" – This script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualizing, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
    • "nhd_HUC04s_ingrid.csv" – This cross-walk file identifies the HUC04 watersheds within each Landsat ARD Tile grid.
    • "site_id_tile_hv_crosswalk.csv" – This cross-walk file identifies the site_id (nhdhr_{permanent_identifier}) within each Landsat ARD Tile grid. This file also includes a column (multiple_tiles) to identify site_id's that fall within multiple Landsat ARD Tile grids.
    • "lst_grid.png" – A map of the Landsat grid tiles labelled by the horizontal-vertical ID.
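
    The release documents an R arrow workflow; for readers working in Python, an analogous sketch with pyarrow (assuming one of the year=XXXX zip files has been extracted locally, so the Hive-style partition directories shown above are on disk):

    import pyarrow.dataset as ds

    # Read the nested, Hive-partitioned .parquet directories (e.g. year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet)
    lakes = ds.dataset("year=2023", format="parquet", partitioning="hive")
    annual = lakes.to_table(filter=ds.field("dataset") == "lakes_annual")
    print(annual.num_rows)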

  13. MASCDB, a database of images, descriptors and microphysical properties of...

    • data.niaid.nih.gov
    Updated Jul 5, 2023
    + more versions
    Cite
    Grazioli, Jacopo (2023). MASCDB, a database of images, descriptors and microphysical properties of individual snowflakes in free fall [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_5578920
    Explore at:
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    Ghiggi, Gionata
    Berne, Alexis
    Grazioli, Jacopo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset overview

    This dataset provides data and images of snowflakes in free fall collected with a Multi-Angle Snowflake Camera (MASC). The dataset includes, for each recorded snowflake:

    A triplet of gray-scale images corresponding to the three cameras of the MASC

    A large set of geometrical and textural descriptors, the pre-compiled output of published retrieval algorithms, as well as basic environmental information at the location and time of each measurement.

    The pre-computed descriptors and retrievals are available either individually for each camera view or, for some of them, as descriptors of the triplet as a whole. A non-exhaustive list of precomputed quantities includes, for example:

    Textural and geometrical descriptors as in Praz et al 2017

    Hydrometeor classification, riming degree estimation, melting identification, as in Praz et al 2017

    Blowing snow identification, as in Schaer et al 2020

    Mass, volume, gyration estimation, as in Leinonen et al 2021

    Data format and structure

    The dataset is divided into four .parquet files (for scalar descriptors) and a Zarr database (for the images). A detailed description of the data content and of the data records is available here.

    Supporting code

    A python-based API is available to manipulate, display and organize the data of our dataset. It can be found on GitHub. See also the code documentation on ReadTheDocs.

    Download notes

    All files available here for download should be stored in the same folder, if the python-based API is used

    MASCdb.zarr.zip must be unzipped after download

    Field campaigns

    A list of campaigns included in the dataset, with a minimal description, is given in the following table (columns: Campaign_name, Information, Shielded / Not shielded; DFIR = Double Fence Intercomparison Reference).

    • APRES3-2016 & APRES3-2017: Instrument installed in Antarctica in the context of the APRES3 project. See for example Genthon et al, 2018 or Grazioli et al, 2017. Not shielded.
    • Davos-2015: Instrument installed in the Swiss Alps within the context of SPICE (Solid Precipitation InterComparison Experiment). Shielded (DFIR).
    • Davos-2019: Instrument installed in the Swiss Alps within the context of RACLETS (Role of Aerosols and CLouds Enhanced by Topography on Snow). Not shielded.
    • ICEGENESIS-2021: Instrument installed in the Swiss Jura in a MeteoSwiss ground measurement site, within the context of ICE-GENESIS. See for example Billault-Roux et al, 2023. Not shielded.
    • ICEPOP-2018: Instrument installed in Korea, in the context of ICEPOP. See for example Gehring et al, 2021. Shielded (DFIR).
    • Jura-2019 & Jura-2023: Instrument installed in the Swiss Jura within a MeteoSwiss measurement site. Not shielded.
    • Norway-2016: Instrument installed in Norway during the High-Latitude Measurement of Snowfall (HiLaMS). See for example Cooper et al, 2022. Not shielded.
    • PLATO-2019: Instrument installed in the "Davis" Antarctic base during the PLATO field campaign. Not shielded.
    • POPE-2020: Instrument installed in the "Princess Elizabeth Antarctica" base during the POPE campaign. See for example Ferrone et al, 2023. Not shielded.
    • Remoray-2022: Instrument installed in the French Jura. Not shielded.
    • Valais-2016: Instrument installed in the Swiss Alps in a ski resort. Not shielded.

    Version

    1.0 - Two new campaigns ("Jura-2023", "Norway-2016") added. Added references and list of campaigns.

    0.3 - a new campaign is added to the dataset ("Remoray-2022")

    0.2 - rename of variables. Variable precision (digits) standardized

    0.1 - first upload

  14. Explore data formats and ingestion methods

    • kaggle.com
    Updated Feb 12, 2021
    Cite
    Gabriel Preda (2021). Explore data formats and ingestion methods [Dataset]. https://www.kaggle.com/datasets/gpreda/iris-dataset/discussion?sort=undefined
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gabriel Preda
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Why this Dataset

    This dataset brings you the Iris Dataset in several data formats (see more details in the next sections).

    You can use it to test the ingestion of data in all these formats using Python or R libraries. We also prepared a Python Jupyter Notebook and an R Markdown report that read all these formats.

    Iris Dataset

    Iris Dataset was created by R. A. Fisher and donated by Michael Marshall.

    Repository on UCI site: https://archive.ics.uci.edu/ml/datasets/iris

    Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/

    The file downloaded is iris.data and is formatted as a comma delimited file.

    This small data collection was created to help you test your skills with ingesting various data formats.

    Content

    This file was processed to convert the data into the following formats (see the ingestion sketch after this list):
    • csv - comma separated values format
    • tsv - tab separated values format
    • parquet - parquet format
    • feather - feather format
    • parquet.gzip - compressed parquet format
    • h5 - hdf5 format
    • pickle - Python binary object file - pickle format
    • xlsx - Excel format
    • npy - Numpy (Python library) binary format
    • npz - Numpy (Python library) binary compressed format
    • rds - Rds (R specific data format) binary format
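
    A brief Python sketch of ingesting a few of these formats with pandas (file names are placeholders for the files in this dataset; the Parquet and Feather readers assume pyarrow is installed):

    import pandas as pd

    # Each format has a matching pandas reader; file names here are placeholders
    iris_csv = pd.read_csv("iris.csv")
    iris_parquet = pd.read_parquet("iris.parquet")
    iris_feather = pd.read_feather("iris.feather")
    iris_pickle = pd.read_pickle("iris.pickle")

    assert iris_csv.shape == iris_parquet.shape == iris_feather.shape == iris_pickle.shape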

    Acknowledgements

    I would like to acknowledge the work of the creator of the dataset - R. A. Fisher and of the donor - Michael Marshall.

    Inspiration

    Use these data formats to test your skills in ingesting data in various formats.

  15. riiid_train_converted to Multiple Formats

    • kaggle.com
    Updated Jun 2, 2021
    Cite
    Santh Raul (2021). riiid_train_converted to Multiple Formats [Dataset]. https://www.kaggle.com/santhraul/riiid-train-converted-to-multiple-formats/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 2, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Santh Raul
    Description

    Context

    The train data of the Riiid competition is a large dataset of over 100 million rows and 10 columns that does not fit into a Kaggle Notebook's RAM using the default pandas read_csv, prompting a search for alternative approaches and formats.

    Content

    Train data of Riiid competition in different formats.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Reading the .csv file for the Riiid competition took a huge amount of time and memory. This inspired me to convert the .csv into different file formats so that they can be loaded easily in a Kaggle kernel.
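
    A sketch of the general approach (not the author's exact code): read the large CSV in chunks, write it once to Parquet, then reload it quickly. It assumes the competition file name train.csv, a consistent schema across chunks, and pyarrow installed:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    writer = None
    for chunk in pd.read_csv("train.csv", chunksize=5_000_000):
        table = pa.Table.from_pandas(chunk, preserve_index=False)
        if writer is None:
            writer = pq.ParquetWriter("riiid_train.parquet", table.schema)
        writer.write_table(table)
    writer.close()

    # Subsequent loads are much faster than re-parsing the CSV
    train = pd.read_parquet("riiid_train.parquet")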

  16. ENTSO-E Pan-European Climatic Database (PECD 2021.3) in Parquet format

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 19, 2022
    Cite
    De Felice, Matteo (2022). ENTSO-E Pan-European Climatic Database (PECD 2021.3) in Parquet format [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5780184
    Explore at:
    Dataset updated
    Oct 19, 2022
    Dataset authored and provided by
    De Felice, Matteo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ENTSO-E Pan-European Climatic Database (PECD 2021.3) in Parquet format

    TL;DR: this is a tidy and friendly version of a subset of the PECD 2021.3 data by ENTSO-E: hourly capacity factors for wind onshore, offshore, solar PV, hourly electricity demand, weekly inflow for reservoir and pumping and daily generation for run-of-river. All the data is provided for >30 climatic years (1982-2019 for wind and solar, 1982-2016 for demand, 1982-2017 for hydropower) and at national and sub-national (>140 zones) level.

    UPDATE (19/10/2022): updated the demand files after fixing a bug in the processing code (the file for 2030 was the same as the one for 2025) and solving an issue caused by a malformed header in the ENTSO-E Excel files.

    ENTSO-E has released with the latest European Resource Adequacy Assessment (ERAA 2021) all the inputs used in the study. Those inputs include:
    • Demand dataset: https://eepublicdownloads.azureedge.net/clean-documents/sdc-documents/ERAA/Demand%20Dataset.7z
    • Climate data: https://eepublicdownloads.entsoe.eu/clean-documents/sdc-documents/ERAA/Climate%20Data.7z

    The data files and the methodology are available on the official webpage.

    As done for the previous releases (see https://zenodo.org/record/3702418#.YbmhR23MKMo and https://zenodo.org/record/3985078#.Ybmhem3MKMo), the original data - stored in large Excel spreadsheets - have been tidied and formatted in open and friendly formats (CSV for the small tables and Parquet for the large files)

    Furthermore, we have carried out a simple country aggregation of the original data, which instead uses >140 zones.

    DISCLAIMER: the content of this dataset has been created with the greatest possible care. However, we invite users to rely on the original data for critical applications and studies.

    Description

    This dataset includes the following files:

    • capacities-national-estimates.csv: installed capacity in MW per zone, technology and the two scenarios (2025 and 2030). The files include also the total capacity for each technology per country (sum of all the zones within a country)
    • PECD-2021.3-wide-LFSolarPV-2025 and PECD-2021.3-wide-LFSolarPV-2030: tables in Parquet format storing in each row the capacity factor for solar PV for an hour of the year and all the climatic years (1982-2019) for a specific zone (see the loading sketch after this list). The two files contain the capacity factors for the scenarios "National Estimates 2025" and "National Estimates 2030"
    • PECD-2021.3-wide-Onshore-2025 and PECD-2021.3-wide-Onshore-2030: same as above but for wind onshore
    • PECD-2021.3-wide-Offshore-2025 and PECD-2021.3-wide-Offshore-2030: same as above but for wind offshore
    • PECD-wide-demand_national_estimates-2025 and PECD-wide-demand_national_estimates-2030: hourly electricity demand for all the climatic years for a specific zone. The two files contain the load for the scenarios "National Estimates 2025" and "National Estimates 2030"
    • PECD-2021.3-country-LFSolarPV-2025 and PECD-2021.3-country-LFSolarPV-2030: tables in Parquet format storing in each row the capacity factor for country/climatic year and hour of the year. The two files contain the capacity factors for the scenarios "National Estimates 2025" and "National Estimates 2030"
    • PECD-2021.3-country-Onshore-2025 and PECD-2021.3-country-Onshore-2030: same as above but for wind onshore
    • PECD-2021.3-country-Offshore-2025 and PECD-2021.3-country-Offshore-2030: same as above but for wind offshore
    • PECD-country-demand_national_estimates-2025 and PECD-country-demand_national_estimates-2030: same as above but for electricity demand
    • PECD_EERA2021_reservoir_pumping.zip: archive with four files per each scenario: 1. table.csv with generation and storage capacities per zone/technology, 2. zone weekly inflow (GWh), 3. table.csv with generation and storage per country/technology and 4. country weekly inflow (GWh)
    • PECD_EERA2021_ROR.zip: as for the previous file but the inflow is daily
    • plots.zip: archive with 182 png figures with the weekly climatology for all the variables (daily for the electricity demand)
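
    For illustration, a minimal pandas sketch for one of the wide Parquet tables (the exact file name should be checked against the downloaded archive; the .parquet extension below is assumed):

    import pandas as pd

    # Placeholder name for one of the wide capacity-factor tables
    solar_2025 = pd.read_parquet("PECD-2021.3-wide-LFSolarPV-2025.parquet")
    print(solar_2025.shape)
    print(list(solar_2025.columns)[:10])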

    Note

    I would like to thank Laurens Stoop for sharing the onshore wind data for the scenario 2030, which was corrupted in the original archive.

  17. Fuτure - dataset for studies, development, and training of algorithms for...

    • zenodo.org
    bin
    Updated Oct 3, 2024
    Cite
    Laurits Tani; Joosep Pata (2024). Fuτure - dataset for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons [Dataset]. http://doi.org/10.5281/zenodo.13881061
    Explore at:
    bin
    Available download formats
    Dataset updated
    Oct 3, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Laurits Tani; Joosep Pata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data description

    MC Simulation


    The Fuτure dataset is intended for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons. The dataset is generated with Pythia 8, with the full detector simulation performed by Geant4 with the CLIC-like detector setup CLICdet (CLIC_o3_v14). Events are reconstructed using the Marlin reconstruction framework and interfaced with Key4HEP. Particle candidates in the reconstructed events are reconstructed using the PandoraPF algorithm.

    In this version of the dataset no γγ -> hadrons background is included.

    Samples


    This dataset contains e+e- samples with Z->ττ, ZH (H->ττ) and Z->qq events, with approximately 2 million events simulated in each category.

    The following e+e- processes were simulated with Pythia 8 at sqrt(s) = 380 GeV:

    • p8_ee_qq_ecm380 [Z -> qq events]
    • p8_ee_ZH_Htautau [ZH, H -> ττ events]
    • p8_ee_Z_Ztautau_ecm380 [Z -> ττ events]

    The .root files from the MC simulation chain are eventually processed by the software found on GitHub in order to create flat ntuples as the final product.


    Features


    The basis of the ntuples are the particle flow (PF) candidates from PandoraPF. Each PF candidate has four momenta, charge and particle label (electron / muon / photon / charged hadron / neutral hadron). The PF candidates in a given event are clustered into jets using generalized kt algorithm for ee collisions, with parameters p=-1 and R=0.4. The minimum pT is set to be 0 GeV for both generator level jets and reconstructed jets. The dataset contains the four momenta of the jets, with the PF candidates in the jets with the above listed properties.

    Additionally, a set of variables describing the tau lifetime are calculated using the software in Github. As tau lifetime is very short, these variables are sensitive to true tau decays. In the calculation of these lifetime variables, we use a linear approximation.

    In summary, the features found in the flat ntuples are:

    Name: Description
    reco_cand_p4s: 4-momenta per particle in the reco jet.
    reco_cand_charge: Charge per particle in the jet.
    reco_cand_pdg: PDG id per particle in the jet.
    reco_jet_p4s: RecoJet 4-momenta.
    reco_cand_dz: Longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
    reco_cand_dz_err: Uncertainty of the longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
    reco_cand_dxy: Transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
    reco_cand_dxy_err: Uncertainty of the transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
    gen_jet_p4s: GenJet 4-momenta. Matched with RecoJet within a cone of radius dR < 0.3.
    gen_jet_tau_decaymode: Decay mode of the associated genTau. Jets that have associated leptonically decaying taus are removed, so there are no DM=16 jets. If no GenTau can be matched to GenJet within dR < 0.4, a fill value is used.
    gen_jet_tau_p4s: Visible 4-momenta of the genTau. If no GenTau can be matched to GenJet within dR < 0.4, a fill value is used.

    The ground truth is based on stable particles at the generator level, before detector simulation. These particles are clustered into generator-level jets and are matched to generator-level τ leptons as well as reconstructed jets. In order for a generator-level jet to be matched to generator-level τ lepton, the τ lepton needs to be inside a cone of dR = 0.4. The same applies for the reconstructed jet, with the requirement on dR being set to dR = 0.3. For each reconstructed jet, we define three target values related to τ lepton reconstruction:

    • a binary flag isTau if it was matched to a generator-level hadronically decaying τ lepton. gen_jet_tau_decaymode of value -1 indicates no match to generator-level hadronically decaying τ.
    • the categorical decay mode of the τ, gen_jet_tau_decaymode, in terms of the number of generator-level charged and neutral hadrons. Possible gen_jet_tau_decaymode values are {0, 1, ..., 15}.
    • if matched, the visible (neglecting neutrinos), reconstructable pT of the τ lepton. This is inferred from gen_jet_tau_p4s.

    Contents:

    • qq_test.parquet
    • qq_train.parquet
    • zh_test.parquet
    • zh_train.parquet
    • z_test.parquet
    • z_train.parquet
    • data_intro.ipynb

    Dataset characteristics

    File: # Jets / Size
    z_test.parquet: 870 843 / 171 MB
    z_train.parquet: 3 483 369 / 681 MB
    zh_test.parquet: 1 068 606 / 213 MB
    zh_train.parquet: 4 274 423 / 851 MB
    qq_test.parquet: 6 366 715 / 1.4 GB
    qq_train.parquet: 25 466 858 / 5.6 GB

    The dataset consists of 6 files of 8.9 GB in total.

    How can you use these data?

    The .parquet files can be directly loaded with the Awkward Array Python library.
    An example of how one might use the dataset and the features is given in data_intro.ipynb.
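
    For instance, a minimal loading sketch (assuming Awkward Array version 2 and pyarrow are installed):

    import awkward as ak

    # Load one sample file; each record carries the jet, its PF candidates and the gen-level targets
    jets = ak.from_parquet("z_test.parquet")
    print(len(jets))
    print(jets.fields)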

  18. LAION-400-MILLION OPEN DATASET

    • academictorrents.com
    bittorrent
    Updated Sep 14, 2021
    Cite
    None (2021). LAION-400-MILLION OPEN DATASET [Dataset]. https://academictorrents.com/details/34b94abbcefef5a240358b9acd7920c8b675aacc
    Explore at:
    bittorrent (1211103363514)
    Available download formats
    Dataset updated
    Sep 14, 2021
    Authors
    None
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LAION-400M: the world's largest openly available image-text-pair dataset, with 400 million samples.

    Concept and Content

    The LAION-400M dataset is completely openly, freely accessible. All images and texts in the LAION-400M dataset have been filtered with OpenAI's CLIP by calculating the cosine similarity between the text and image embeddings and dropping those with a similarity below 0.3. The threshold of 0.3 had been determined through human evaluations and seems to be a good heuristic for estimating semantic image-text-content matching. The image-text pairs have been extracted from the Common Crawl web data dump and are from random web pages crawled between 2014 and 2021.

    Download Information

    You can find: the CLIP image embeddings (NumPy files), the parquet files, and a KNN index of the image embeddings.

    LAION-400M Dataset Statistics

    The LAION-400M and future even bigger ones are in fact datasets of datasets. For instance, it can be filtered out by image sizes into smaller datasets like th

  19. MDverse datasets

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 23, 2023
    + more versions
    Cite
    Chavent, Mathieu (2023). MDverse datasets [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7856523
    Explore at:
    Dataset updated
    Apr 23, 2023
    Dataset provided by
    Chavent, Mathieu
    Tiemann, Johanna K. S.
    Poulain, Pierre
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Files and datasets in Parquet format related to molecular dynamics and retrieved from the Zenodo, Figshare and OSF data repositories. The file 'data_model_parquet.md' is a codebook that contains data models for the Parquet files.

  20. Multimodal Vision-Audio-Language Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jul 11, 2024
    Cite
    Timothy Schaumlöffel; Gemma Roig; Bhavin Choksi (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.10060785
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Timothy Schaumlöffel; Gemma Roig; Bhavin Choksi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities.

    Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries.

    The split into train, validation and test set follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset              AudioSet
    filename             train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory    [a man speaks and dishes clank.]
    tags                 [Speech]

    Description

    The annotation file consists of the following fields:

    filename: Name of the corresponding file (video or audio file)
    dataset: Source dataset associated with the data point
    captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    captions_auditory: A list of captions related to the auditory content of the video
    tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided
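
    Building on the example above, a small sketch of filtering the annotations by these fields (assuming missing captions are stored as NaN, as described):

    import pandas as pd

    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')

    # Keep only clips that have visual captions, then count clips per source dataset
    with_visual = df[df['captions_visual'].notna()]
    print(with_visual['dataset'].value_counts())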

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
