53 datasets found
  1. Surface Water - Habitat Results

    • datasets.ai
    Updated Jul 23, 2021
    + more versions
    Cite
    State of California (2021). Surface Water - Habitat Results [Dataset]. https://datasets.ai/datasets/surface-water-habitat-results
    Explore at:
    Dataset updated
    Jul 23, 2021
    Dataset authored and provided by
    State of California
    Description

    This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

    Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here.
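
    A minimal Python sketch of working with the bulk download, assuming the parquet zip from this page has been downloaded and extracted (the file name below is hypothetical; substitute an actual file from the archive):

    import pandas as pd

    # Hypothetical file name; use a file from the extracted bulk parquet archive.
    habitat = pd.read_parquet("ceden_habitat_results_2020.parquet")

    # The provisional data-quality fields described above can be used to screen results.
    print(habitat["DataQuality"].value_counts())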

    Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool

  2. Surface Water - Benthic Macroinvertebrate Results

    • catalog.data.gov
    Updated Jul 23, 2025
    + more versions
    Cite
    California State Water Resources Control Board (2025). Surface Water - Benthic Macroinvertebrate Results [Dataset]. https://catalog.data.gov/dataset/surface-water-benthic-macroinvertebrate-results
    Explore at:
    Dataset updated
    Jul 23, 2025
    Dataset provided by
    California State Water Resources Control Board
    Description

    Data collected for marine benthic infauna, freshwater benthic macroinvertebrate (BMI), algae, bacteria and diatom taxonomic analyses, from the California Environmental Data Exchange Network (CEDEN). Note that single-species bacteria concentrations are stored within the chemistry template, whereas bacteria abundance data are stored within this set. Each record represents a result from a specific event location for a single organism in a single sample. The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result. Zip files are provided for bulk data downloads (in csv or parquet file format), and developers can use the API associated with the "CEDEN Benthic Data" (csv) resource to access the data. Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.

  3. CKW Smart Meter Data

    • data.niaid.nih.gov
    Updated Sep 22, 2024
    Cite
    Barahona Garzon, Braulio (2024). CKW Smart Meter Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13304498
    Explore at:
    Dataset updated
    Sep 22, 2024
    Dataset authored and provided by
    Barahona Garzon, Braulio
    Description

    Overview

    The CKW Group is a distribution system operator that supplies more than 200,000 end customers in Central Switzerland. Since October 2022, CKW has published anonymised and aggregated data from smart meters that measure electricity consumption in the canton of Lucerne. This unique dataset is accessible on the ckw.ch/opendata platform.

    Data set A - anonymised smart meter data

    Data set B - aggregated smart meter data

    Contents of this data set

    This data set contains a small sample of CKW data set A, sorted per smart meter ID and stored as parquet files named after the id field of the corresponding anonymised smart meter data. Example: 027ceb7b8fd77a4b11b3b497e9f0b174.parquet

    The original CKW data is available for download at https://open.data.axpo.com/%24web/index.html#dataset-a as gzip-compressed CSV files, which are split into one file per calendar month. The columns in the CSV files are:

    id: the anonymized counter ID (text)

    timestamp: the UTC time at the beginning of a 15-minute time window to which the consumption refers (ISO-8601 timestamp)

    value_kwh: the consumption in kWh in the time window under consideration (float)

    In this archive, data from:

    | File size | Export date | Period | File name |
    | --------- | ----------- | ------ | --------- |
    | 4.2GiB | 2024-04-20 | 202402 | ckw_opendata_smartmeter_dataset_a_202402.csv.gz |
    | 4.5GiB | 2024-03-21 | 202401 | ckw_opendata_smartmeter_dataset_a_202401.csv.gz |
    | 4.5GiB | 2024-02-20 | 202312 | ckw_opendata_smartmeter_dataset_a_202312.csv.gz |
    | 4.4GiB | 2024-01-20 | 202311 | ckw_opendata_smartmeter_dataset_a_202311.csv.gz |
    | 4.5GiB | 2023-12-20 | 202310 | ckw_opendata_smartmeter_dataset_a_202310.csv.gz |
    | 4.4GiB | 2023-11-20 | 202309 | ckw_opendata_smartmeter_dataset_a_202309.csv.gz |
    | 4.5GiB | 2023-10-20 | 202308 | ckw_opendata_smartmeter_dataset_a_202308.csv.gz |
    | 4.6GiB | 2023-09-20 | 202307 | ckw_opendata_smartmeter_dataset_a_202307.csv.gz |
    | 4.4GiB | 2023-08-20 | 202306 | ckw_opendata_smartmeter_dataset_a_202306.csv.gz |
    | 4.6GiB | 2023-07-20 | 202305 | ckw_opendata_smartmeter_dataset_a_202305.csv.gz |
    | 3.3GiB | 2023-06-20 | 202304 | ckw_opendata_smartmeter_dataset_a_202304.csv.gz |
    | 4.6GiB | 2023-05-24 | 202303 | ckw_opendata_smartmeter_dataset_a_202303.csv.gz |
    | 4.2GiB | 2023-04-20 | 202302 | ckw_opendata_smartmeter_dataset_a_202302.csv.gz |
    | 4.7GiB | 2023-03-20 | 202301 | ckw_opendata_smartmeter_dataset_a_202301.csv.gz |
    | 4.6GiB | 2023-03-15 | 202212 | ckw_opendata_smartmeter_dataset_a_202212.csv.gz |
    | 4.3GiB | 2023-03-15 | 202211 | ckw_opendata_smartmeter_dataset_a_202211.csv.gz |
    | 4.4GiB | 2023-03-15 | 202210 | ckw_opendata_smartmeter_dataset_a_202210.csv.gz |
    | 4.3GiB | 2023-03-15 | 202209 | ckw_opendata_smartmeter_dataset_a_202209.csv.gz |
    | 4.4GiB | 2023-03-15 | 202208 | ckw_opendata_smartmeter_dataset_a_202208.csv.gz |
    | 4.4GiB | 2023-03-15 | 202207 | ckw_opendata_smartmeter_dataset_a_202207.csv.gz |
    | 4.2GiB | 2023-03-15 | 202206 | ckw_opendata_smartmeter_dataset_a_202206.csv.gz |
    | 4.3GiB | 2023-03-15 | 202205 | ckw_opendata_smartmeter_dataset_a_202205.csv.gz |
    | 4.2GiB | 2023-03-15 | 202204 | ckw_opendata_smartmeter_dataset_a_202204.csv.gz |
    | 4.1GiB | 2023-03-15 | 202203 | ckw_opendata_smartmeter_dataset_a_202203.csv.gz |
    | 3.5GiB | 2023-03-15 | 202202 | ckw_opendata_smartmeter_dataset_a_202202.csv.gz |
    | 3.7GiB | 2023-03-15 | 202201 | ckw_opendata_smartmeter_dataset_a_202201.csv.gz |
    | 3.5GiB | 2023-03-15 | 202112 | ckw_opendata_smartmeter_dataset_a_202112.csv.gz |
    | 3.1GiB | 2023-03-15 | 202111 | ckw_opendata_smartmeter_dataset_a_202111.csv.gz |
    | 3.0GiB | 2023-03-15 | 202110 | ckw_opendata_smartmeter_dataset_a_202110.csv.gz |
    | 2.7GiB | 2023-03-15 | 202109 | ckw_opendata_smartmeter_dataset_a_202109.csv.gz |
    | 2.6GiB | 2023-03-15 | 202108 | ckw_opendata_smartmeter_dataset_a_202108.csv.gz |
    | 2.4GiB | 2023-03-15 | 202107 | ckw_opendata_smartmeter_dataset_a_202107.csv.gz |
    | 2.1GiB | 2023-03-15 | 202106 | ckw_opendata_smartmeter_dataset_a_202106.csv.gz |
    | 2.0GiB | 2023-03-15 | 202105 | ckw_opendata_smartmeter_dataset_a_202105.csv.gz |
    | 1.7GiB | 2023-03-15 | 202104 | ckw_opendata_smartmeter_dataset_a_202104.csv.gz |
    | 1.6GiB | 2023-03-15 | 202103 | ckw_opendata_smartmeter_dataset_a_202103.csv.gz |
    | 1.3GiB | 2023-03-15 | 202102 | ckw_opendata_smartmeter_dataset_a_202102.csv.gz |
    | 1.3GiB | 2023-03-15 | 202101 | ckw_opendata_smartmeter_dataset_a_202101.csv.gz |

    was processed into partitioned parquet files, and then organised by id into parquet files with data from single smart meters.

    A small sample of the smart meter data described above is archived in the public cloud space of the AISOP project at https://os.zhdk.cloud.switch.ch/swift/v1/aisop_public/ckw/ts/batch_0424/batch_0424.zip and also in this public record. For access to the complete data, contact the authors of this archive.

    It consists of the following parquet files:

    | Size | Date | Name |
    |------|------|------|
    | 1.0M | Mar 4 12:18 | 027ceb7b8fd77a4b11b3b497e9f0b174.parquet |
    | 979K | Mar 4 12:18 | 03a4af696ff6a5c049736e9614f18b1b.parquet |
    | 1.0M | Mar 4 12:18 | 03654abddf9a1b26f5fbbeea362a96ed.parquet |
    | 1.0M | Mar 4 12:18 | 03acebcc4e7d39b6df5c72e01a3c35a6.parquet |
    | 1.0M | Mar 4 12:18 | 039e60e1d03c2afd071085bdbd84bb69.parquet |
    | 931K | Mar 4 12:18 | 036877a1563f01e6e830298c193071a6.parquet |
    | 1.0M | Mar 4 12:18 | 02e45872f30f5a6a33972e8c3ba9c2e5.parquet |
    | 662K | Mar 4 12:18 | 03a25f298431549a6bc0b1a58eca1f34.parquet |
    | 635K | Mar 4 12:18 | 029a46275625a3cefc1f56b985067d15.parquet |
    | 1.0M | Mar 4 12:18 | 0301309d6d1e06c60b4899061deb7abd.parquet |
    | 1.0M | Mar 4 12:18 | 0291e323d7b1eb76bf680f6e800c2594.parquet |
    | 1.0M | Mar 4 12:18 | 0298e58930c24010bbe2777c01b7644a.parquet |
    | 1.0M | Mar 4 12:18 | 0362c5f3685febf367ebea62fbc88590.parquet |
    | 1.0M | Mar 4 12:18 | 0390835d05372cb66f6cd4ca662399e8.parquet |
    | 1.0M | Mar 4 12:18 | 02f670f059e1f834dfb8ba809c13a210.parquet |
    | 987K | Mar 4 12:18 | 02af749aaf8feb59df7e78d5e5d550e0.parquet |
    | 996K | Mar 4 12:18 | 0311d3c1d08ee0af3edda4dc260421d1.parquet |
    | 1.0M | Mar 4 12:18 | 030a707019326e90b0ee3f35bde666e0.parquet |
    | 955K | Mar 4 12:18 | 033441231b277b283191e0e1194d81e2.parquet |
    | 995K | Mar 4 12:18 | 0317b0417d1ec91b5c243be854da8a86.parquet |
    | 1.0M | Mar 4 12:18 | 02ef4e49b6fb50f62a043fb79118d980.parquet |
    | 1.0M | Mar 4 12:18 | 0340ad82e9946be45b5401fc6a215bf3.parquet |
    | 974K | Mar 4 12:18 | 03764b3b9a65886c3aacdbc85d952b19.parquet |
    | 1.0M | Mar 4 12:18 | 039723cb9e421c5cbe5cff66d06cb4b6.parquet |
    | 1.0M | Mar 4 12:18 | 0282f16ed6ef0035dc2313b853ff3f68.parquet |
    | 1.0M | Mar 4 12:18 | 032495d70369c6e64ab0c4086583bee2.parquet |
    | 900K | Mar 4 12:18 | 02c56641571fc9bc37448ce707c80d3d.parquet |
    | 1.0M | Mar 4 12:18 | 027b7b950689c337d311094755697a8f.parquet |
    | 1.0M | Mar 4 12:18 | 02af272adccf45b6cdd4a7050c979f9f.parquet |
    | 927K | Mar 4 12:18 | 02fc9a3b2b0871d3b6a1e4f8fe415186.parquet |
    | 1.0M | Mar 4 12:18 | 03872674e2a78371ce4dfa5921561a8c.parquet |
    | 881K | Mar 4 12:18 | 0344a09d90dbfa77481c5140bb376992.parquet |
    | 1.0M | Mar 4 12:18 | 0351503e2b529f53bdae15c7fbd56fc0.parquet |
    | 1.0M | Mar 4 12:18 | 033fe9c3a9ca39001af68366da98257c.parquet |
    | 1.0M | Mar 4 12:18 | 02e70a1c64bd2da7eb0d62be870ae0d6.parquet |
    | 1.0M | Mar 4 12:18 | 0296385692c9de5d2320326eaa000453.parquet |
    | 962K | Mar 4 12:18 | 035254738f1cc8a31075d9fbe3ec2132.parquet |
    | 991K | Mar 4 12:18 | 02e78f0d6a8fb96050053e188bf0f07c.parquet |
    | 1.0M | Mar 4 12:18 | 039e4f37ed301110f506f551482d0337.parquet |
    | 961K | Mar 4 12:18 | 039e2581430703b39c359dc62924a4eb.parquet |
    | 999K | Mar 4 12:18 | 02c6f7e4b559a25d05b595cbb5626270.parquet |
    | 1.0M | Mar 4 12:18 | 02dd91468360700a5b9514b109afb504.parquet |
    | 938K | Mar 4 12:18 | 02e99c6bb9d3ca833adec796a232bac0.parquet |
    | 589K | Mar 4 12:18 | 03aef63e26a0bdbce4a45d7cf6f0c6f8.parquet |
    | 1.0M | Mar 4 12:18 | 02d1ca48a66a57b8625754d6a31f53c7.parquet |
    | 1.0M | Mar 4 12:18 | 03af9ebf0457e1d451b83fa123f20a12.parquet |
    | 1.0M | Mar 4 12:18 | 0289efb0e712486f00f52078d6c64a5b.parquet |
    | 1.0M | Mar 4 12:18 | 03466ed913455c281ffeeaa80abdfff6.parquet |
    | 1.0M | Mar 4 12:18 | 032d6f4b34da58dba02afdf5dab3e016.parquet |
    | 1.0M | Mar 4 12:18 | 03406854f35a4181f4b0778bb5fc010c.parquet |
    | 1.0M | Mar 4 12:18 | 0345fc286238bcea5b2b9849738c53a2.parquet |
    | 1.0M | Mar 4 12:18 | 029ff5169155b57140821a920ad67c7e.parquet |
    | 985K | Mar 4 12:18 | 02e4c9f3518f079ec4e5133acccb2635.parquet |
    | 1.0M | Mar 4 12:18 | 03917c4f2aef487dc20238777ac5fdae.parquet |
    | 969K | Mar 4 12:18 | 03aae0ab38cebcb160e389b2138f50da.parquet |
    | 914K | Mar 4 12:18 | 02bf87b07b64fb5be54f9385880b9dc1.parquet |
    | 1.0M | Mar 4 12:18 | 02776685a085c4b785a3885ef81d427a.parquet |
    | 947K | Mar 4 12:18 | 02f5a82af5a5ffac2fe7551bf4a0a1aa.parquet |
    | 992K | Mar 4 12:18 | 039670174dbc12e1ae217764c96bbeb3.parquet |
    | 1.0M | Mar 4 12:18 | 037700bf3e272245329d9385bb458bac.parquet |
    | 602K | Mar 4 12:18 | 0388916cdb86b12507548b1366554e16.parquet |
    | 939K | Mar 4 12:18 | 02ccbadea8d2d897e0d4af9fb3ed9a8e.parquet |
    | 1.0M | Mar 4 12:18 | 02dc3f4fb7aec02ba689ad437d8bc459.parquet |
    | 1.0M | Mar 4 12:18 | 02cf12e01cd20d38f51b4223e53d3355.parquet |
    | 993K | Mar 4 12:18 | 0371f79d154c00f9e3e39c27bab2b426.parquet |

    where each file contains data from a single smart meter.
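
    A minimal Python sketch for loading one of the per-meter parquet files above, assuming its columns mirror the original CSVs (id, timestamp, value_kwh):

    import pandas as pd

    # One of the per-meter files listed above.
    df = pd.read_parquet("027ceb7b8fd77a4b11b3b497e9f0b174.parquet")

    # Index by the 15-minute UTC timestamps and aggregate to daily consumption in kWh.
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    daily_kwh = df.set_index("timestamp")["value_kwh"].resample("1D").sum()
    print(daily_kwh.head())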

    Acknowledgement

    The AISOP project (https://aisopproject.com/) received funding in the framework of the Joint Programming Platform Smart Energy Systems from the European Union's Horizon 2020 research and innovation programme under grant agreement No 883973 (ERA-Net Smart Energy Systems joint call on digital transformation for the green energy transition).

  4. riiid_train_converted to Multiple Formats

    • kaggle.com
    Updated Jun 2, 2021
    Cite
    Santh Raul (2021). riiid_train_converted to Multiple Formats [Dataset]. https://www.kaggle.com/santhraul/riiid-train-converted-to-multiple-formats/discussion
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 2, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Santh Raul
    Description

    Context

    The train data of the Riiid competition is a large dataset of over 100 million rows and 10 columns that does not fit into a Kaggle Notebook's RAM when loaded with the default pandas read_csv, which prompted a search for alternative approaches and formats.

    Content

    Train data of Riiid competition in different formats.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Reading the .CSV file for the Riiid competition took a huge amount of time and memory. This inspired me to convert the .CSV into different file formats that can be loaded more easily in a Kaggle kernel.
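
    As an illustration of one such conversion, a minimal Python sketch that rewrites the competition CSV as Parquet in chunks, so the full table never has to fit in RAM at once (file names are illustrative):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    writer = None
    for chunk in pd.read_csv("train.csv", chunksize=1_000_000):
        table = pa.Table.from_pandas(chunk, preserve_index=False)
        if writer is None:
            # Create the parquet writer from the schema of the first chunk.
            writer = pq.ParquetWriter("train.parquet", table.schema)
        writer.write_table(table)
    if writer is not None:
        writer.close()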

  5. Data from: F-DATA: A Fugaku Workload Dataset for Job-centric Predictive...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 10, 2024
    Cite
    Yamamoto, Keiji (2024). F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in HPC Systems [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11467482
    Explore at:
    Dataset updated
    Jun 10, 2024
    Dataset provided by
    Yamamoto, Keiji
    Antici, Francesco
    Kiziltan, Zeynep
    Bartolini, Andrea
    Domke, Jens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on Supercomputer Fugaku over three years of public system usage (March 2021 to April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption and performance metrics (e.g. #flops, memory bandwidth, operational intensity and memory/compute-bound label), which allows for the prediction of a multitude of job characteristics. The full list of features can be found in the file feature_list.csv.

    The sensitive data appears both in anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes, without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.

    F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.

    The files of F-DATA are saved as .parquet files. It is possible to load such files as dataframes by leveraging the pandas APIs, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:

    # Import the pandas library
    import pandas as pd

    # Read the 21_01.parquet file into a dataframe
    df = pd.read_parquet("21_01.parquet")
    df.head()
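
    To combine several months, the monthly files can be read in a loop and concatenated. A minimal sketch, assuming the YY_MM.parquet files are in the working directory:

    import glob
    import pandas as pd

    files = sorted(glob.glob("*_*.parquet"))  # e.g. 21_01.parquet ... 24_04.parquet
    df_all = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
    print(len(df_all))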

  6. HERO WEC 2024 Hydraulic Configuration Deployment Data

    • mhkdr.openei.org
    • data.openei.org
    • +1 more
    archive, code, data
    Updated Mar 14, 2024
    + more versions
    Cite
    Scott Jenne; Andrew Simms; Justin Panzarella; Rob Raye; Casey Nichols; Aidan Bharath; Mark Murphy; Kyle Swartz; Charles Candon (2024). HERO WEC 2024 Hydraulic Configuration Deployment Data [Dataset]. http://doi.org/10.15473/2479748
    Explore at:
    Available download formats: archive, code, data
    Dataset updated
    Mar 14, 2024
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    Marine and Hydrokinetic Data Repository
    National Renewable Energy Laboratory
    Authors
    Scott Jenne; Andrew Simms; Justin Panzarella; Rob Raye; Casey Nichols; Aidan Bharath; Mark Murphy; Kyle Swartz; Charles Candon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The following submission includes raw and processed data from the in-water deployment of NREL's Hydraulic and Electric Reverse Osmosis Wave Energy Converter (HERO WEC), in the form of parquet files, TDMS files, CSV files, bag files and MATLAB workspaces. This dataset was collected in March 2024 at the Jennette's pier test site in North Carolina.

    This submission includes the following:

    • Data description document (HERO WEC FY24 Hydraulic Deployment Data Descriptions.doc) - This document includes detailed descriptions of the type of data and how it was processed and/or calculated.

    • Processed MATLAB workspace - The processed data is provided in the form of a single MATLAB workspace containing data from the full deployment. This workspace contains data from all sensors down sampled to 10 Hz along with all array Value Added Products (VAPs).

    • MATLAB visualization scripts - The MATLAB workspaces can be visualized using the file "HERO_WEC_2024_Hydraulic_Config_Data_Viewer.m/mlx". The user simply needs to download the processed MATLAB workspaces, specify the desired start and end times and run this file. Both the .m and .mlx file formats have been provided, depending on the user's preference.

    • Summary Data - The fully processed data was used to create a summary data set with averages and important calculations performed on 30-minute intervals to align with the intervals of wave resource data reported from nearby CDIP ocean observing buoys located 20 km east and 40 km northeast of Jennette's pier. The wave resource data provided in this data set is to be used for reference only, due to the difference in water depth and proximity to shore between the Jennette's pier test site and the locations of the ocean observing buoys. This data is provided in the Summary Data zip folder, which includes this data set in the form of a MATLAB workspace, parquet file, and Excel spreadsheet.

    • Processed Parquet File - The processed data is provided in the form of a single parquet file containing data from all HERO WEC sensors collected during the full deployment. Data in these files has been down sampled to 10 Hz and all array VAPs are included.

    • Interim Filtered Data - Raw data from each sensor group partitioned into 30-minute parquet files. These files are outputs from an intermediate stage of data processing and contain the raw data with no Quality Control (QC) or calculations performed in a format that is easier to use than the raw data.

    • Raw Data - Raw, unprocessed data from this deployment can be found in the Raw Data zip folder. This data is provided in the form of TDMS, CSV, and bag files in the original format output by the MODAQ system.

    • Python Data Processing Script - This links to an NREL public github repository containing the python script used to go from raw data to fully processed parquet files. Additional documentation on how to use this script is included in the github repository.

    This data set has been developed by the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DE-AC36-08GO28308. Funding provided by the U.S. Department of Energy Office of Energy Efficiency and Renewable Energy Water Power Technologies Office.

  7. PARQUET - Basic climatological data - monthly - daily - hourly - 6 minutes...

    • gimi9.com
    + more versions
    Cite
    PARQUET - Basic climatological data - monthly - daily - hourly - 6 minutes (parquet format) [Dataset]. https://gimi9.com/dataset/eu_66159f1bf0686eb4806508e1
    Explore at:
    Description

    Format .parquet. This dataset gathers data in .parquet format. Instead of having one .csv.gz file per department per period, all departments are grouped into a single file per period. When possible (depending on the size), several periods are grouped in the same file.

    ### Data origin

    The data come from:
    - Basic climatological data - monthly
    - Basic climatological data - daily
    - Basic climatological data - hourly
    - Basic climatological data - 6 minutes

    ### Data preparation

    The files ending with .prepared have undergone slight preparation steps:
    - deleting spaces in the names of columns
    - (flexible) typing

    The data are typed as follows:
    - date (YYYYMM, YYYYMMDD, YYYYMMDDHH, YYYYMMDDHHMN): integer
    - NUM_POST: string
    - USUAL_NAME: string
    - LAT: float
    - LON: float
    - ALTI: integer
    - if the column begins with Q ('quality') or NB ('number'): integer

    ### Update

    The data are updated at least once a week (depending on my availability) for the period 'latest-2023-2024'. If you have specific needs, feel free to reach out to me.

    ### Re-use: Meteo Squad

    These files are used in the Meteo Squad web application: https://www.meteosquad.com

    ### Contact

    If you have specific requests, please do not hesitate to contact me: contact@mistermeteo.com
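
    As a quick illustration of the typing conventions described under "Data preparation", a minimal Python sketch (the file and column names are illustrative; check the actual resource names and columns on the dataset page):

    import pandas as pd

    df = pd.read_parquet("MENS_prepared.parquet")  # illustrative file name

    # Date columns are stored as integers (e.g. YYYYMM for monthly data); convert to a
    # datetime for time-series work. "AAAAMM" is an assumed column name, not confirmed here.
    df["date"] = pd.to_datetime(df["AAAAMM"].astype(str), format="%Y%m")
    print(df.dtypes)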

  8. GitTables 1M

    • explore.openaire.eu
    Updated May 3, 2022
    Cite
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth (2022). GitTables 1M [Dataset]. http://doi.org/10.5281/zenodo.6517052
    Explore at:
    Dataset updated
    May 3, 2022
    Authors
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth
    Description

    Summary

    GitTables 1M (https://gittables.github.io) is a corpus of currently 1M relational tables extracted from CSV files in GitHub repositories that are associated with a license that allows distribution. We aim to grow this to at least 10M tables. Each parquet file in this corpus represents a table with the original content (e.g. values and header) as extracted from the corresponding CSV file. Table columns are enriched with annotations corresponding to >2K semantic types from Schema.org and DBpedia (provided as metadata of the parquet file). These column annotations consist of, for example, semantic types, hierarchical relations to other types, and descriptions. We believe GitTables can facilitate many use-cases, among which:
    - Data integration, search and validation.
    - Data visualization and analysis recommendation.
    - Schema analysis and completion for e.g. database or knowledge base design.

    If you have questions, the paper, documentation, and contact details are provided on the website: https://gittables.github.io. We recommend using Zenodo's API to easily download the full dataset (i.e. all zipped topic subsets).

    Dataset contents

    The data is provided in subsets of tables stored in parquet files; each subset corresponds to a term that was used to query GitHub. The column annotations and other metadata (e.g. URL and repository license) are attached to the metadata of the parquet file. This version corresponds to this version of the paper: https://arxiv.org/abs/2106.07258v4. In summary, this dataset can be characterized as follows:

    | Statistic | Value |
    |-----------|-------|
    | # tables | 1M |
    | average # columns | 12 |
    | average # rows | 142 |
    | # annotated tables (at least 1 column annotation) | 723K+ (DBpedia), 738K+ (Schema.org) |
    | # unique semantic types | 835 (DBpedia), 677 (Schema.org) |

    How to download

    The dataset can be downloaded through Zenodo's interface directly, or using Zenodo's API (recommended for full download).

    Future releases

    Future releases will include an increased number of tables (expected at least 10M).

    Associated datasets
    - GitTables benchmark - column type detection: https://zenodo.org/record/5706316
    - GitTables 1M - CSV files: https://zenodo.org/record/6515973
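
    A minimal Python sketch for inspecting the annotations attached to a parquet file's metadata, as described under "Dataset contents" above (the file name is illustrative, and the exact metadata keys are documented on the GitTables website):

    import pyarrow.parquet as pq

    table = pq.read_table("example_table.parquet")  # any table from a topic subset

    # Table-level key/value metadata (column annotations, repository URL, license, ...).
    metadata = table.schema.metadata or {}
    for key, value in metadata.items():
        print(key.decode(), "->", value.decode()[:80])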

  9. Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8...

    • catalog.data.gov
    • data.usgs.gov
    • +2 more
    Updated Feb 22, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. https://catalog.data.gov/dataset/water-temperature-of-lakes-in-the-conterminous-u-s-using-the-landsat-8-analysis-ready-2013
    Explore at:
    Dataset updated
    Feb 22, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Contiguous United States
    Description

    This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

    Limitations of this dataset include:
    - All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
    - Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = “yes” column of site_id_tile_hv_crosswalk.csv).
    - Temperature data were not extracted from satellite images with more than 90% cloud cover.
    - Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

    Potential methods for addressing limitations of this dataset:
    - Identifying and removing unrealistic temperature estimates:
      - Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
      - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
      - Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
    - Handling waterbodies split between multiple tiles:
      - These waterbodies can be identified using the "site_id_tile_hv_crosswalk.csv" file (column multiple_tiles = “yes”). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

    Contents of this data release:
    - "year_byscene=XXXX.zip" – includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by the scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the byscene datasets may only include one dummy row of data (identified by tile_hv="000-000"); this happens when no tabular data is extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
    - "year=XXXX.zip" – includes the summary statistics for individual waterbodies and the deepest points within each waterbody by the year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data is used as input for generating these summary tables that aggregate temperature data by year, month, and year-month. Aggregated data is not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
    - "example_script_for_using_parquet.R" – This script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualizing, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
    - "nhd_HUC04s_ingrid.csv" – This cross-walk file identifies the HUC04 watersheds within each Landsat ARD tile grid.
    - "site_id_tile_hv_crosswalk.csv" – This cross-walk file identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
    - "lst_grid.png" – a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
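
    A rough Python equivalent of the R arrow workflow described above, assuming one of the "year_byscene=XXXX.zip" archives has been extracted into the working directory:

    import pyarrow.dataset as ds

    # The hive-style directory names (tile_hv=...) become columns of the dataset.
    lakes = ds.dataset("year_byscene=2023", format="parquet", partitioning="hive")
    df = lakes.to_table(filter=ds.field("tile_hv") == "002-001").to_pandas()
    print(df.head())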

  10. Surface Water - Chemistry Results

    • catalog.data.gov
    Updated Jul 23, 2025
    + more versions
    Cite
    California State Water Resources Control Board (2025). Surface Water - Chemistry Results [Dataset]. https://catalog.data.gov/dataset/surface-water-chemistry-results
    Explore at:
    Dataset updated
    Jul 23, 2025
    Dataset provided by
    California State Water Resources Control Board
    Description

    This data provides results from the California Environmental Data Exchange Network (CEDEN) for field and lab chemistry analyses. The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result. Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering. NOTE: Some of the field and lab chemistry data that has been submitted to CEDEN since 2020 has not been loaded into the CEDEN database. That data is not included in this data set (and is also not available via the CEDEN query tool described above), but is available as a supplemental data set available here: Surface Water - Chemistry Results - CEDEN Augmentation. For consistency, many of the conditions applied to the data in this dataset and in the CEDEN query tool are also applied to that supplemental dataset (e.g., no rejected data or replicates are included), but that supplemental data is provisional and may not reflect all of the QA/QC controls applied to the regular CEDEN data available here.

  11. Scrambled text: training Language Models to correct OCR errors using...

    • b2find.eudat.eu
    Updated Oct 27, 2024
    Cite
    (2024). Scrambled text: training Language Models to correct OCR errors using synthetic data - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/1ea0205e-de3a-54e7-a918-fde36ad3156f
    Explore at:
    Dataset updated
    Oct 27, 2024
    Description

    This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data". In addition, it contains the 10,000 synthetic 19th-century articles generated using GPT4o. These articles are available both as a csv with the prompt parameters as columns and as individual text files. The files in the repository are as follows:
    - ncse_hf_dataset: A huggingface dictionary dataset containing 91 articles from the Nineteenth Century Serials Edition (NCSE) with original OCR and the transcribed ground truth. This dataset is used as the test set in the paper.
    - synth_gt.zip: A zip file containing 5 parquet files of training data from the 10,000 synthetic articles. Each parquet file is made up of observations of a fixed length of tokens, for a total of 2 million tokens. The observation lengths are 200, 100, 50, 25, and 10.
    - synthetic_articles.zip: A zip file containing the csv of all the synthetic articles and the prompts used to generate them.
    - synthetic_articles_text.zip: A zip file containing the text files of all the synthetic articles. The file names are the prompt parameters and the id reference from the synthetic article csv.
    The data in this repo is used by the code repositories associated with the project: https://github.com/JonnoB/scrambledtext_analysis and https://github.com/JonnoB/training_lms_with_synthetic_data

  12. [Dataset] One year of high-precision operational data including measurement...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, json +1
    Updated Oct 18, 2024
    Cite
    Daniel Tschopp; Philip Ohnewein; Roman Stelzer; Lukas Feierl; Marnoch Hamilton-Jones; Maria Moser; Christian Holter (2024). [Dataset] One year of high-precision operational data including measurement uncertainties from a large-scale solar thermal collector array with flat plate collectors, located in Graz, Austria [Dataset]. http://doi.org/10.5281/zenodo.7741084
    Explore at:
    Available download formats: csv, text/x-python, json, bin
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Tschopp; Philip Ohnewein; Roman Stelzer; Lukas Feierl; Marnoch Hamilton-Jones; Maria Moser; Christian Holter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Graz, Austria
    Description

    Highlights:

    • High-precision measurement data acquired within a scientific research project, using high-quality measurement equipment and implementing extensive data quality assurance measures.
    • The dataset includes data from one full operational year at a 1-minute sampling rate, covering all seasons.
    • Measured data channels include global, beam and diffuse irradiances in horizontal and collector plane. Heat transfer fluid properties were determined in a dedicated laboratory test.
    • In addition to the measured data channels, calculated data channels, such as thermal power output, mass flow, fluid properties, solar incidence angle and shadowing masks are provided to facilitate further analysis.
    • Uncertainties of data channels are provided based on data sheet specifications and GUM error propagation.
    • The dataset refers to a real-scale application which is representative of typical large-scale solar thermal plant designs (flat plate collectors, common hydraulic layout).
    • Additional information is provided in a "Data in Brief" journal article: https://doi.org/10.1016/j.dib.2023.109224

    Collector array description: The data is from a flat plate collector array with a total gross collector area of 516 m2 (361 kW nominal thermal power). The array consists of four parallel collector rows with a common inlet and outlet manifold. Large-area flat-plate collectors from Arcon-Sunmark A/S are used in the plant. Collectors are all oriented towards the south (180°), have a tilt angle of 30° and a row spacing of 3.1 m. The collector array is part of a large-scale solar thermal plant located at Fernheizwerk Graz, Austria (latitude: 47.047294 N, longitude: 15.436366 E). The plant feeds into the local district heating network and is one of the largest Solar District Heating installations in Central Europe.

    Data files:

    • FHW_ArcS_main_2017.csv – This is the main dataset. It is advised to use this file for further analysis. The file contains the full time series of all measured and all calculated data channels and their (propagated) measurement uncertainty (53 data channels in total). Calculated data channels are derived from measured channels (see script make_data.py below) and have the suffix _calc in their channel names. Uncertainty information is given in terms of standard deviation of a normal distribution (suffix _std); some data channels are assumed to have no uncertainty (e.g., sun azimuth or shadowing).
    • FHW_ArcS_main_2017.parquet – Same as FHW_ArcS_main_2017.csv, but in parquet file format for smaller file size and improved performance when loading the dataset in software.
    • FHW_ArcS_parameters.json – Contains various metadata about the dataset, in both human and machine-readable format. Includes plant parameters, data channel descriptions, physical units, etc.
    • FHW_ArcS_raw_2017.csv – Dataset with time series of all measured data channels and their measurement uncertainty. The main dataset FHW_ArcS_main_2017.csv, which includes all calculated data channels, is a superset of this file.
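
    A minimal Python sketch for loading the main dataset from the parquet file and locating the uncertainty channels (identified by the _std suffix described above):

    import pandas as pd

    df = pd.read_parquet("FHW_ArcS_main_2017.parquet")
    std_cols = [c for c in df.columns if c.endswith("_std")]
    print(len(df.columns), "columns in total,", len(std_cols), "of them uncertainty (_std) channels")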

    Scripts:

    • make_data.py – This Python script exposes the calculation process of the calculated data channels (suffix _calc), including error propagation. The main calculations are defined as functions in the module utils_data.py.
    • make_plots.py – This Python script, together with utils_plots.py, generates several figures based on the main dataset.

    Data collection and preparation: AEE — Institute for Sustainable Technologies (AEE INTEC), Feldgasse 19, 8200 Gleisdorf, Austria; and SOLID Solar Energy Systems GmbH (SOLID), Am Pfangberg 117, 8045 Graz, Austria

    Data owner: solar.nahwaerme.at Energiecontracting GmbH, Puchstrasse 85, 8020 Graz, Austria

    Additional information is provided in a journal article in "Data in Brief", titled "One year of high-precision operational data including measurement uncertainties from a large-scale solar thermal collector array with flat plate collectors in Graz, Austria".

    Note: A Gitlab repository is associated with this dataset, intended as a companion to facilitate maintenance of the Python code that is provided along with the data. If you want to use or contribute to the code, please do so using the Gitlab project: https://gitlab.com/sunpeek/zenodo-fhw-arconsouth-dataset-2017

  13. origo

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 6, 2025
    Cite
    Sobotkova, Adela (2025). origo [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14604221
    Explore at:
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    Heřmánková, Petra
    Sobotkova, Adela
    Kaše, Vojtěch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Origo is a geospatial spreadsheet dataset documenting ancient migrants attested in the Epigraphic Database Heidelberg (EDH). It is derived from a subset of individuals in the EDH People dataset (available at: https://edh.ub.uni-heidelberg.de/data/download/edh_data_pers.csv) who explicitly declare their geographic origin in the inscriptions. Based on the data curated by the EDH team, we have geocoded the stated places of origin and further enriched the dataset with additional metadata, prioritizing machine readability. We have developed the dataset for the purpose of a quantitative study of migration trends in the Roman Empire as part of the Social Dynamics in the Ancient Mediterranean Project (SDAM, http://sdam.au.dk). The scripts used for producing the dataset and for our related publications are available from here: https://github.com/sdam-au/LI_origo/tree/master.

    The dataset includes two point geometries per individual:

    • Geographic origin (origo_geometry) – representing the individual’s place of origin or birth.

    • Findspot (findspot_geometry) – indicating the location where the inscription was discovered, which often approximates the place of death, as approximately 70% of the inscriptions are funerary.

    Scope and Structure:

    The dataset covers 2,313 individuals, described through 36 attributes. For a detailed explanation of these attributes, please refer to the accompanying file origo_variable_dictionary.csv.

    File Formats:

    We provide the dataset in two formats for download and analysis:

    1. CSV – for general spreadsheet use.

    2. GeoParquet (v1.0.0) – optimized for geospatial data handling.

    In the GeoParquet version, the default geometry is defined by the origo_line attribute, a linestring connecting the origo_geometry (place of origin) and the findspot_geometry (findspot of the inscription). This allows for immediate visualization and analysis of migration patterns in GIS environments.

    Getting Started with Python:

    To load and explore the GeoParquet dataset in Python, you can use the following code:

    import geopandas as gpd
    import fsspec

    origo = gpd.read_parquet(fsspec.open("https://zenodo.org/records/14604222/files/origo_geo.parquet?download=1").open())

  14. Data from: Scrambled text: training Language Models to correct OCR errors...

    • rdr.ucl.ac.uk
    zip
    Updated Sep 27, 2024
    Cite
    Jonno Bourne (2024). Scrambled text: training Language Models to correct OCR errors using synthetic data [Dataset]. http://doi.org/10.5522/04/27108334.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 27, 2024
    Dataset provided by
    University College London
    Authors
    Jonno Bourne
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data". In addition, it contains the 10,000 synthetic 19th-century articles generated using GPT4o. These articles are available both as a csv with the prompt parameters as columns and as individual text files. The files in the repository are as follows:
    - ncse_hf_dataset: A huggingface dictionary dataset containing 91 articles from the Nineteenth Century Serials Edition (NCSE) with original OCR and the transcribed ground truth. This dataset is used as the test set in the paper.
    - synth_gt.zip: A zip file containing 5 parquet files of training data from the 10,000 synthetic articles. Each parquet file is made up of observations of a fixed length of tokens, for a total of 2 million tokens. The observation lengths are 200, 100, 50, 25, and 10.
    - synthetic_articles.zip: A zip file containing the csv of all the synthetic articles and the prompts used to generate them.
    - synthetic_articles_text.zip: A zip file containing the text files of all the synthetic articles. The file names are the prompt parameters and the id reference from the synthetic article csv.
    The data in this repo is used by the code repositories associated with the project: https://github.com/JonnoB/scrambledtext_analysis and https://github.com/JonnoB/training_lms_with_synthetic_data

  15. feature-factory-datasets

    • huggingface.co
    Cite
    Hassan Abedi, feature-factory-datasets [Dataset]. https://huggingface.co/datasets/habedi/feature-factory-datasets
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Hassan Abedi
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Tabular Datasets

    These datasets are used in the Feature Factory project:

    | Index | Dataset Name | File Name | Data Type | Records (Approx.) | Format | Source |
    |-------|--------------|-----------|-----------|-------------------|--------|--------|
    | 1 | Wine Quality (Red Wine) | winequality-red.csv | Tabular | 1,599 | CSV | Link |
    | 2 | NYC Yellow Taxi Trip (Jan 2019) | yellow_tripdata_2019.parquet | Taxi Trip Data | ~7M | Parquet | Link |
    | 3 | NYC Green Taxi Trip (Jan 2019) | green_tripdata_2019.parquet | Taxi Trip Data | ~1M | Parquet | Link |
    | 4 | California Housing Prices | california_housing.csv | Real Estate Prices… | | | |

    See the full description on the dataset page: https://huggingface.co/datasets/habedi/feature-factory-datasets.

  16. Data from: BuildingsBench: A Large-Scale Dataset of 900K Buildings and...

    • catalog.data.gov
    Updated Jan 11, 2024
    + more versions
    Cite
    National Renewable Energy Laboratory (2024). BuildingsBench: A Large-Scale Dataset of 900K Buildings and Benchmark for Short-Term Load Forecasting [Dataset]. https://catalog.data.gov/dataset/buildingsbench-a-large-scale-dataset-of-900k-buildings-and-benchmark-for-short-term-load-f
    Explore at:
    Dataset updated
    Jan 11, 2024
    Dataset provided by
    National Renewable Energy Laboratory
    Description

    The BuildingsBench datasets consist of:
    - Buildings-900K: A large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock.
    - 7 real residential and commercial building datasets for benchmarking two downstream tasks evaluating generalization: zero-shot STLF and transfer learning for STLF.

    Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see link to this database in the links further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).

    BuildingsBench also provides an evaluation benchmark that is a collection of various open source residential and commercial real building energy consumption datasets. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files which contain annual energy consumption. The size of the evaluation datasets altogether is less than 1 GB, and they are listed out below:
    - ElectricityLoadDiagrams20112014
    - Building Data Genome Project-2
    - Individual household electric power consumption (Sceaux)
    - Borealis
    - SMART
    - IDEAL
    - Low Carbon London

    A README file providing details about how the data is stored and describing the organization of the datasets can be found within each data lake version under BuildingsBench.

  17. ORBITAAL: cOmpRehensive BItcoin daTaset for temorAl grAph anaLysis - Dataset...

    • cryptodata.center
    Updated Dec 4, 2024
    + more versions
    Cite
    cryptodata.center (2024). ORBITAAL: cOmpRehensive BItcoin daTaset for temorAl grAph anaLysis - Dataset - CryptoData Hub [Dataset]. https://cryptodata.center/dataset/orbitaal-comprehensive-bitcoin-dataset-for-temoral-graph-analysis
    Explore at:
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    CryptoDATA
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Construction

    This dataset captures the temporal network of Bitcoin (BTC) flow exchanged between entities at the finest time resolution, in UNIX timestamps. Its construction is based on the blockchain covering the period from January 3rd, 2009 to January 25th, 2021. The blockchain extraction was made using the bitcoin-etl (https://github.com/blockchain-etl/bitcoin-etl) Python package. The entity-entity network is built by aggregating Bitcoin addresses using the common-input heuristic [1] as well as popular Bitcoin users' addresses provided by https://www.walletexplorer.com/

    [1] M. Harrigan and C. Fretter, "The Unreasonable Effectiveness of Address Clustering," 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), Toulouse, France, 2016, pp. 368-373, doi: 10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0071.

    Dataset Description

    Bitcoin activity temporal coverage: from 03 January 2009 to 25 January 2021.

    Overview: This dataset provides a comprehensive representation of Bitcoin exchanges between entities over a significant temporal span, spanning from the inception of Bitcoin to recent years. It encompasses various temporal resolutions and representations to facilitate Bitcoin transaction network analysis in the context of temporal graphs. All dates have been retrieved from block UNIX timestamps in the GMT timezone.

    Contents: The dataset is distributed across several compressed archives. All data are stored in the Apache Parquet file format, a columnar storage format optimized for analytical queries; it can be used with the pyspark Python package.
    - orbitaal-stream_graph.tar.gz: The root directory is STREAM_GRAPH/. Contains a stream graph representation of Bitcoin exchanges at the finest temporal scale, corresponding to the validation time of each block (averaging approximately 10 minutes). The stream graph is divided into 13 files, one for each year. Files are in parquet format, named orbitaal-stream_graph-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] stands for the corresponding year and [ID] is an integer from 1 to N (number of files) such that sorting by increasing [ID] is the same as sorting by increasing year. These files are in the subdirectory STREAM_GRAPH/EDGES/.
    - orbitaal-snapshot-all.tar.gz: The root directory is SNAPSHOT/. Contains the snapshot network representing all transactions aggregated over the whole dataset period (from Jan. 2009 to Jan. 2021). The file is in parquet format, named orbitaal-snapshot-all.snappy.parquet, and is in the subdirectory SNAPSHOT/EDGES/ALL/.
    - orbitaal-snapshot-year.tar.gz: The root directory is SNAPSHOT/. Contains the yearly resolution snapshot networks. Files are in parquet format, named orbitaal-snapshot-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] stands for the corresponding year and [ID] is an integer from 1 to N such that sorting by increasing [ID] is the same as sorting by increasing year. These files are in the subdirectory SNAPSHOT/EDGES/year/.
    - orbitaal-snapshot-month.tar.gz: The root directory is SNAPSHOT/. Contains the monthly resolution snapshot networks. Files are in parquet format, named orbitaal-snapshot-date-[YYYY]-[MM]-file-id-[ID].snappy.parquet, where [YYYY] and [MM] stand for the corresponding year and month, and [ID] is an integer from 1 to N such that sorting by increasing [ID] is the same as sorting by increasing year and month. These files are in the subdirectory SNAPSHOT/EDGES/month/.
    - orbitaal-snapshot-day.tar.gz: The root directory is SNAPSHOT/. Contains the daily resolution snapshot networks. Files are in parquet format, named orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-file-id-[ID].snappy.parquet, where [YYYY], [MM], and [DD] stand for the corresponding year, month, and day, and [ID] is an integer from 1 to N such that sorting by increasing [ID] is the same as sorting by increasing year, month, and day. These files are in the subdirectory SNAPSHOT/EDGES/day/.
    - orbitaal-snapshot-hour.tar.gz: The root directory is SNAPSHOT/. Contains the hourly resolution snapshot networks. Files are in parquet format, named orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-[hh]-file-id-[ID].snappy.parquet, where [YYYY], [MM], [DD], and [hh] stand for the corresponding year, month, day, and hour, and [ID] is an integer from 1 to N such that sorting by increasing [ID] is the same as sorting by increasing year, month, day and hour. These files are in the subdirectory SNAPSHOT/EDGES/hour/.
    - orbitaal-nodetable.tar.gz: The root directory is NODE_TABLE/. Contains two files in parquet format: the first gives information related to nodes present in stream graphs and snapshots, such as period of activity and associated global Bitcoin balance, and the other contains the list of all associated Bitcoin addresses.

    Small samples in CSV format: orbitaal-stream_graph-2016_07_08.csv and orbitaal-stream_graph-2016_07_09.csv. These two CSV files are related to stream graph representations of a halving event in 2016.
  18. Reference datasets for in-flight emergency situations

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, csv +1
    Updated Jul 10, 2020
    Cite
    Xavier Olive; Axel Tanner; Martin Strohmeier; Matthias Schäfer; Metin Feridun; Allan Tart; Ivan Martinovic; Vincent Lenders (2020). Reference datasets for in-flight emergency situations [Dataset]. http://doi.org/10.5281/zenodo.3937483
    Explore at:
    txt, csv, application/gzip (available download formats)
    Dataset updated
    Jul 10, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xavier Olive; Axel Tanner; Martin Strohmeier; Matthias Schäfer; Metin Feridun; Allan Tart; Ivan Martinovic; Vincent Lenders
    Description

    Motivation

    The data in this dataset is derived and cleaned from the full OpenSky dataset in order to illustrate in-flight emergency situations triggering the 7700 transponder code. It spans flights seen by the network's more than 2500 members between 1 January 2018 and 29 January 2020.

    The dataset complements the following publication:

    Xavier Olive, Axel Tanner, Martin Strohmeier, Matthias Schäfer, Metin Feridun, Allan Tart, Ivan Martinovic and Vincent Lenders.
    "OpenSky Report 2020: Analysing in-flight emergencies using big data".
    In 2020 IEEE/AIAA 39th Digital Avionics Systems Conference (DASC), October 2020

    License

    See LICENSE.txt

    Disclaimer

    The data in these files is provided as is. Despite our best efforts at filtering out potential issues, some information could be erroneous.

    Most aircraft information comes from the OpenSky aircraft database and has been complemented by manual research from various sources on the Internet. Most information about flight plans was automatically fetched and processed using open APIs; some manual processing was required to cross-check entries, correct erroneous information, and fill in missing information.

    Description of the dataset

    Two files are provided in the dataset:

    • one compressed parquet file with trajectory information;
    • one metadata CSV file with the following features:
      • flight_id: a unique identifier for each trajectory;
      • callsign: ICAO flight callsign information;
      • number: IATA flight number, when available;
      • icao24, registration, typecode: information about the aircraft;
      • origin: the origin airport for the aircraft, when available;
      • landing: the airport where the aircraft actually landed, when available;
      • destination: the intended destination airport, when available;
      • diverted: the diversion airport, if applicable, when available;
      • tweet_problem, tweet_result, tweet_fueldump: information extracted from Twitter accounts, about the nature of the issue, the consequence of the emergency and whether the aircraft is known to have dumped fuel;
      • avh_id, avh_problem, avh_result, avh_fueldump: information extracted from The Aviation Herald, about the nature of the issue, the consequence of the emergency and whether the aircraft is known to have dumped fuel.
        The complete URL for each event is https://avherald.com/h?article={avh_id}&opt=1 (replace avh_id by the actual value)

    Examples

    Additional analyses and visualisations of the data are available at the following page:
    <https://traffic-viz.github.io/paper/squawk7700.html>
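    A minimal loading sketch in Python is shown below; the file names squawk7700_trajectories.parquet and squawk7700_metadata.csv are placeholders (check the actual names in the record), and the presence of a shared flight_id column in the trajectory file is an assumption based on the metadata description.

      # Minimal sketch: join trajectory points with flight-level metadata.
      # File names are hypothetical placeholders; decompress the parquet file first
      # if it is distributed gzipped.
      import pandas as pd

      metadata = pd.read_csv("squawk7700_metadata.csv")                   # one row per flight
      trajectories = pd.read_parquet("squawk7700_trajectories.parquet")   # point-by-point data

      # Assumed shared identifier: flight_id (documented for the metadata file)
      merged = trajectories.merge(metadata, on="flight_id", how="left")
      print(merged.head())

      # Rebuild the Aviation Herald URL for flights that reference an avh_id
      with_avh = metadata.dropna(subset=["avh_id"])
      urls = "https://avherald.com/h?article=" + with_avh["avh_id"].astype(str) + "&opt=1"
      print(urls.head())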

    Credit

    If you use this dataset, please cite the following OpenSky papers:

    Xavier Olive, Axel Tanner, Martin Strohmeier, Matthias Schäfer, Metin Feridun, Allan Tart, Ivan Martinovic and Vincent Lenders.
    "OpenSky Report 2020: Analysing in-flight emergencies using big data".
    In 2020 IEEE/AIAA 39th Digital Avionics Systems Conference (DASC), October 2020

    Matthias Schäfer, Martin Strohmeier, Vincent Lenders, Ivan Martinovic and Matthias Wilhelm.
    "Bringing Up OpenSky: A Large-scale ADS-B Sensor Network for Research".
    In Proceedings of the 13th IEEE/ACM International Symposium on Information Processing in Sensor Networks (IPSN), pages 83-94, April 2014.

    and the traffic library used to derive the data:

    Xavier Olive.
    "traffic, a toolbox for processing and analysing air traffic data."
    Journal of Open Source Software 4(39), July 2019.

  19. EnigmaDataset

    • huggingface.co
    Updated May 31, 2025
    Cite
    Shivendra S (2025). EnigmaDataset [Dataset]. https://huggingface.co/datasets/shivendrra/EnigmaDataset
    Explore at:
    Dataset updated
    May 31, 2025
    Authors
    Shivendra S
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    DNA Sequence Database from NCBI

    Welcome to the curated DNA sequence dataset, automatically gathered from NCBI using the Enigma2 pipeline. This repository provides ready-to-use CSV and Parquet files for downstream machine-learning and bioinformatics tasks.

    📋 Dataset Overview

    Scope

    A collection of topic-specific DNA sequence sets (e.g., BRCA1, TP53, CFTR) sourced directly from NCBI’s Nucleotide database.

    Curation Process

    Query Design

    Predefined Entrez queries (gene… See the full description on the dataset page: https://huggingface.co/datasets/shivendrra/EnigmaDataset.
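    A minimal access sketch in Python, assuming the default Hugging Face `datasets` loader can resolve the repository's CSV/Parquet files; split and column names are not confirmed here and should be inspected after loading.

      # Minimal sketch: load the EnigmaDataset repository with the `datasets` library.
      from datasets import load_dataset

      ds = load_dataset("shivendrra/EnigmaDataset")   # may download several data files
      print(ds)                                       # inspect available splits

      first_split = next(iter(ds.values()))
      print(first_split[0])                           # peek at one record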

  20. Data from: SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting over...

    • figshare.com
    • paperswithcode.com
    bin
    Updated Jun 20, 2024
    Cite
    Jingbo Zhou (2024). SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting over a Large Turbine Array [Dataset]. http://doi.org/10.6084/m9.figshare.24798654.v2
    Explore at:
    bin (available download formats)
    Dataset updated
    Jun 20, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jingbo Zhou
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Paper

    This dataset is associated with the paper published in Scientific Data, titled "SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting over a Large Turbine Array". You can access the paper at: https://www.nature.com/articles/s41597-024-03427-5

    If you find this dataset useful, please consider citing our papers.

    Scientific Data paper:
    @article{zhou2024sdwpf, title={SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting over a Large Turbine Array}, author={Zhou, Jingbo and Lu, Xinjiang and Xiao, Yixiong and Tang, Jian and Su, Jiantao and Li, Yu and Liu, Ji and Lyu, Junfu and Ma, Yanjun and Dou, Dejing}, journal={Scientific Data}, volume={11}, number={1}, pages={649}, year={2024}, url={https://doi.org/10.1038/s41597-024-03427-5}, publisher={Nature Publishing Group}}

    Baidu KDD Cup paper:
    @article{zhou2022sdwpf, title={SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting Challenge at KDD Cup 2022}, author={Zhou, Jingbo and Lu, Xinjiang and Xiao, Yixiong and Su, Jiantao and Lyu, Junfu and Ma, Yanjun and Dou, Dejing}, journal={arXiv preprint arXiv:2208.04360}, url={https://arxiv.org/abs/2208.04360}, year={2022}}

    Background

    The SDWPF dataset, collected over two years from a wind farm with 134 turbines, details the spatial layout of the turbines and dynamic context factors for each. This dataset was used to launch the ACM KDD Cup 2022, attracting registrations from over 2,400 teams worldwide. To facilitate its use, the dataset is released in two parts: sdwpf_kddcup and sdwpf_full. The sdwpf_kddcup part is the original dataset used for the Baidu KDD Cup 2022, comprising both training and test data. The sdwpf_full part is a more comprehensive collection, including additional data not available during the KDD Cup, such as weather conditions, dates, and elevation.

    sdwpf_kddcup

    The sdwpf_kddcup dataset is the original dataset used for the Baidu KDD Cup 2022 challenge. Its folder structure is:

    sdwpf_kddcup
    --- sdwpf_245days_v1.csv
    --- sdwpf_baidukddcup2022_turb_location.csv
    --- final_phase_test
        --- infile
            --- 0001in.csv
            --- 0002in.csv
            --- ...
        --- outfile
            --- 0001out.csv
            --- 0002out.csv
            --- ...

    The contents of the sdwpf_kddcup dataset are as follows:

    • sdwpf_245days_v1.csv: the data released for the KDD Cup 2022 challenge, spanning 245 days.
    • sdwpf_baidukddcup2022_turb_location.csv: the relative positions of all wind turbines in the dataset.
    • final_phase_test: the test data for the final phase of the Baidu KDD Cup, which allows a comparison of methodologies against the award-winning teams from KDD Cup 2022. It includes an 'infile' folder containing input data for the model and an 'outfile' folder holding the corresponding ground truth. In other words, for a model function y = f(x), x represents the files in the 'infile' folder and the ground truth of y corresponds to the files in the 'outfile' folder, e.g. {0001out} = f({0001in}).

    More information about the sdwpf_kddcup data used for the Baidu KDD Cup 2022 can be found in the Baidu KDD Cup paper: "SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting Challenge at KDD Cup 2022".

    sdwpf_full

    The sdwpf_full dataset offers more information than what was released for the KDD Cup 2022. It includes not only SCADA data but also weather data such as relative humidity, wind speed, and wind direction, sourced from the fifth generation of the European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric reanalyses of the global climate (ERA5). The dataset covers two years of data, from January 2020 to December 2021, from a wind farm with 134 wind turbines. Its folder structure is:

    sdwpf_full
    --- sdwpf_turb_location_elevation.csv
    --- sdwpf_2001_2112_full.csv
    --- sdwpf_2001_2112_full.parquet

    The contents of the sdwpf_full dataset are as follows:

    • sdwpf_turb_location_elevation.csv: the relative positions and elevations of all wind turbines in the dataset.
    • sdwpf_2001_2112_full.csv: two years of data (January 2020 to December 2021) from the 134-turbine wind farm. It offers comprehensive enhancements over sdwpf_kddcup/sdwpf_245days_v1.csv:
      • Extended time span: it covers two years, from January 2020 to December 2021, whereas sdwpf_245days_v1.csv covers only 245 days.
      • Enriched weather information: additional data such as relative humidity, wind speed, and wind direction, sourced from ERA5.
      • Expanded temporal details: unlike the KDD Cup challenge data, where timestamp information was withheld to prevent data linkage, this version includes specific timestamps for each data point.
    • sdwpf_2001_2112_full.parquet: identical to sdwpf_2001_2112_full.csv, but in Parquet format.
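    A minimal loading sketch in Python, assuming the sdwpf_full archive has been downloaded and extracted locally; column names are not listed here and should be inspected after loading.

      # Minimal sketch: load the full SDWPF data and the turbine locations.
      import pandas as pd

      df = pd.read_parquet("sdwpf_full/sdwpf_2001_2112_full.parquet")
      print(df.shape)
      print(df.columns.tolist())    # inspect SCADA and ERA5 weather columns

      locations = pd.read_csv("sdwpf_full/sdwpf_turb_location_elevation.csv")
      print(locations.head())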
