This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here.
Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
Data collected for marine benthic infauna, freshwater benthic macroinvertebrate (BMI), algae, bacteria, and diatom taxonomic analyses, from the California Environmental Data Exchange Network (CEDEN). Note that single-species bacteria concentrations are stored within the chemistry template, whereas bacteria abundance data are stored within this set. Each record represents a result from a specific event location for a single organism in a single sample. The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result. Zip files are provided for bulk data downloads (in csv or parquet file format), and developers can use the API associated with the "CEDEN Benthic Data" (csv) resource to access the data. Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
Overview
The CKW Group is a distribution system operator that supplies more than 200,000 end customers in Central Switzerland. Since October 2022, CKW has published anonymised and aggregated data from smart meters that measure electricity consumption in the canton of Lucerne. This unique dataset is accessible on the ckw.ch/opendata platform.
Data set A - anonymised smart meter data
Data set B - aggregated smart meter data
Contents of this data set
This data set contains a small sample of CKW data set A, sorted per smart meter ID and stored as parquet files named after the id field of the corresponding anonymised smart meter data. Example: 027ceb7b8fd77a4b11b3b497e9f0b174.parquet
The original CKW data is available for download at https://open.data.axpo.com/%24web/index.html#dataset-a as gzip-compressed csv files, which are split into one file per calendar month. The columns in the csv files are:
id: the anonymized counter ID (text)
timestamp: the UTC time at the beginning of a 15-minute time window to which the consumption refers (ISO-8601 timestamp)
value_kwh: the consumption in kWh in the time window under consideration (float)
In this archive, data from:
| File size | Export date | Period | File name |
| ----------- | ------------ | -------- | --------- |
| 4.2GiB | 2024-04-20 | 202402 | ckw_opendata_smartmeter_dataset_a_202402.csv.gz |
| 4.5GiB | 2024-03-21 | 202401 | ckw_opendata_smartmeter_dataset_a_202401.csv.gz |
| 4.5GiB | 2024-02-20 | 202312 | ckw_opendata_smartmeter_dataset_a_202312.csv.gz |
| 4.4GiB | 2024-01-20 | 202311 | ckw_opendata_smartmeter_dataset_a_202311.csv.gz |
| 4.5GiB | 2023-12-20 | 202310 | ckw_opendata_smartmeter_dataset_a_202310.csv.gz |
| 4.4GiB | 2023-11-20 | 202309 | ckw_opendata_smartmeter_dataset_a_202309.csv.gz |
| 4.5GiB | 2023-10-20 | 202308 | ckw_opendata_smartmeter_dataset_a_202308.csv.gz |
| 4.6GiB | 2023-09-20 | 202307 | ckw_opendata_smartmeter_dataset_a_202307.csv.gz |
| 4.4GiB | 2023-08-20 | 202306 | ckw_opendata_smartmeter_dataset_a_202306.csv.gz |
| 4.6GiB | 2023-07-20 | 202305 | ckw_opendata_smartmeter_dataset_a_202305.csv.gz |
| 3.3GiB | 2023-06-20 | 202304 | ckw_opendata_smartmeter_dataset_a_202304.csv.gz |
| 4.6GiB | 2023-05-24 | 202303 | ckw_opendata_smartmeter_dataset_a_202303.csv.gz |
| 4.2GiB | 2023-04-20 | 202302 | ckw_opendata_smartmeter_dataset_a_202302.csv.gz |
| 4.7GiB | 2023-03-20 | 202301 | ckw_opendata_smartmeter_dataset_a_202301.csv.gz |
| 4.6GiB | 2023-03-15 | 202212 | ckw_opendata_smartmeter_dataset_a_202212.csv.gz |
| 4.3GiB | 2023-03-15 | 202211 | ckw_opendata_smartmeter_dataset_a_202211.csv.gz |
| 4.4GiB | 2023-03-15 | 202210 | ckw_opendata_smartmeter_dataset_a_202210.csv.gz |
| 4.3GiB | 2023-03-15 | 202209 | ckw_opendata_smartmeter_dataset_a_202209.csv.gz |
| 4.4GiB | 2023-03-15 | 202208 | ckw_opendata_smartmeter_dataset_a_202208.csv.gz |
| 4.4GiB | 2023-03-15 | 202207 | ckw_opendata_smartmeter_dataset_a_202207.csv.gz |
| 4.2GiB | 2023-03-15 | 202206 | ckw_opendata_smartmeter_dataset_a_202206.csv.gz |
| 4.3GiB | 2023-03-15 | 202205 | ckw_opendata_smartmeter_dataset_a_202205.csv.gz |
| 4.2GiB | 2023-03-15 | 202204 | ckw_opendata_smartmeter_dataset_a_202204.csv.gz |
| 4.1GiB | 2023-03-15 | 202203 | ckw_opendata_smartmeter_dataset_a_202203.csv.gz |
| 3.5GiB | 2023-03-15 | 202202 | ckw_opendata_smartmeter_dataset_a_202202.csv.gz |
| 3.7GiB | 2023-03-15 | 202201 | ckw_opendata_smartmeter_dataset_a_202201.csv.gz |
| 3.5GiB | 2023-03-15 | 202112 | ckw_opendata_smartmeter_dataset_a_202112.csv.gz |
| 3.1GiB | 2023-03-15 | 202111 | ckw_opendata_smartmeter_dataset_a_202111.csv.gz |
| 3.0GiB | 2023-03-15 | 202110 | ckw_opendata_smartmeter_dataset_a_202110.csv.gz |
| 2.7GiB | 2023-03-15 | 202109 | ckw_opendata_smartmeter_dataset_a_202109.csv.gz |
| 2.6GiB | 2023-03-15 | 202108 | ckw_opendata_smartmeter_dataset_a_202108.csv.gz |
| 2.4GiB | 2023-03-15 | 202107 | ckw_opendata_smartmeter_dataset_a_202107.csv.gz |
| 2.1GiB | 2023-03-15 | 202106 | ckw_opendata_smartmeter_dataset_a_202106.csv.gz |
| 2.0GiB | 2023-03-15 | 202105 | ckw_opendata_smartmeter_dataset_a_202105.csv.gz |
| 1.7GiB | 2023-03-15 | 202104 | ckw_opendata_smartmeter_dataset_a_202104.csv.gz |
| 1.6GiB | 2023-03-15 | 202103 | ckw_opendata_smartmeter_dataset_a_202103.csv.gz |
| 1.3GiB | 2023-03-15 | 202102 | ckw_opendata_smartmeter_dataset_a_202102.csv.gz |
| 1.3GiB | 2023-03-15 | 202101 | ckw_opendata_smartmeter_dataset_a_202101.csv.gz |
was processed into partitioned parquet files, and then organised by id into parquet files with data from single smart meters.
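For users who want to reproduce this partitioning themselves, a minimal sketch is shown below (assuming pandas and pyarrow are installed; the monthly file name is one of the files listed above, and the output layout is illustrative, not the exact pipeline used for this archive):

```python
import pandas as pd

# Illustrative sketch: split one monthly CKW export (columns: id, timestamp,
# value_kwh) into one parquet file per smart meter id. File names are examples.
df = pd.read_csv(
    "ckw_opendata_smartmeter_dataset_a_202301.csv.gz",
    parse_dates=["timestamp"],
    dtype={"id": "string", "value_kwh": "float64"},
)

for meter_id, group in df.groupby("id"):
    group.sort_values("timestamp").to_parquet(f"{meter_id}.parquet", index=False)
```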
A small sample of the smart meter data described above is archived in the public cloud space of the AISOP project (https://os.zhdk.cloud.switch.ch/swift/v1/aisop_public/ckw/ts/batch_0424/batch_0424.zip) and also in this public record. For access to the complete data, contact the authors of this archive.
It consists of the following parquet files:
| Size | Date | Name |
|------|------|------|
| 1.0M | Mar 4 12:18 | 027ceb7b8fd77a4b11b3b497e9f0b174.parquet |
| 979K | Mar 4 12:18 | 03a4af696ff6a5c049736e9614f18b1b.parquet |
| 1.0M | Mar 4 12:18 | 03654abddf9a1b26f5fbbeea362a96ed.parquet |
| 1.0M | Mar 4 12:18 | 03acebcc4e7d39b6df5c72e01a3c35a6.parquet |
| 1.0M | Mar 4 12:18 | 039e60e1d03c2afd071085bdbd84bb69.parquet |
| 931K | Mar 4 12:18 | 036877a1563f01e6e830298c193071a6.parquet |
| 1.0M | Mar 4 12:18 | 02e45872f30f5a6a33972e8c3ba9c2e5.parquet |
| 662K | Mar 4 12:18 | 03a25f298431549a6bc0b1a58eca1f34.parquet |
| 635K | Mar 4 12:18 | 029a46275625a3cefc1f56b985067d15.parquet |
| 1.0M | Mar 4 12:18 | 0301309d6d1e06c60b4899061deb7abd.parquet |
| 1.0M | Mar 4 12:18 | 0291e323d7b1eb76bf680f6e800c2594.parquet |
| 1.0M | Mar 4 12:18 | 0298e58930c24010bbe2777c01b7644a.parquet |
| 1.0M | Mar 4 12:18 | 0362c5f3685febf367ebea62fbc88590.parquet |
| 1.0M | Mar 4 12:18 | 0390835d05372cb66f6cd4ca662399e8.parquet |
| 1.0M | Mar 4 12:18 | 02f670f059e1f834dfb8ba809c13a210.parquet |
| 987K | Mar 4 12:18 | 02af749aaf8feb59df7e78d5e5d550e0.parquet |
| 996K | Mar 4 12:18 | 0311d3c1d08ee0af3edda4dc260421d1.parquet |
| 1.0M | Mar 4 12:18 | 030a707019326e90b0ee3f35bde666e0.parquet |
| 955K | Mar 4 12:18 | 033441231b277b283191e0e1194d81e2.parquet |
| 995K | Mar 4 12:18 | 0317b0417d1ec91b5c243be854da8a86.parquet |
| 1.0M | Mar 4 12:18 | 02ef4e49b6fb50f62a043fb79118d980.parquet |
| 1.0M | Mar 4 12:18 | 0340ad82e9946be45b5401fc6a215bf3.parquet |
| 974K | Mar 4 12:18 | 03764b3b9a65886c3aacdbc85d952b19.parquet |
| 1.0M | Mar 4 12:18 | 039723cb9e421c5cbe5cff66d06cb4b6.parquet |
| 1.0M | Mar 4 12:18 | 0282f16ed6ef0035dc2313b853ff3f68.parquet |
| 1.0M | Mar 4 12:18 | 032495d70369c6e64ab0c4086583bee2.parquet |
| 900K | Mar 4 12:18 | 02c56641571fc9bc37448ce707c80d3d.parquet |
| 1.0M | Mar 4 12:18 | 027b7b950689c337d311094755697a8f.parquet |
| 1.0M | Mar 4 12:18 | 02af272adccf45b6cdd4a7050c979f9f.parquet |
| 927K | Mar 4 12:18 | 02fc9a3b2b0871d3b6a1e4f8fe415186.parquet |
| 1.0M | Mar 4 12:18 | 03872674e2a78371ce4dfa5921561a8c.parquet |
| 881K | Mar 4 12:18 | 0344a09d90dbfa77481c5140bb376992.parquet |
| 1.0M | Mar 4 12:18 | 0351503e2b529f53bdae15c7fbd56fc0.parquet |
| 1.0M | Mar 4 12:18 | 033fe9c3a9ca39001af68366da98257c.parquet |
| 1.0M | Mar 4 12:18 | 02e70a1c64bd2da7eb0d62be870ae0d6.parquet |
| 1.0M | Mar 4 12:18 | 0296385692c9de5d2320326eaa000453.parquet |
| 962K | Mar 4 12:18 | 035254738f1cc8a31075d9fbe3ec2132.parquet |
| 991K | Mar 4 12:18 | 02e78f0d6a8fb96050053e188bf0f07c.parquet |
| 1.0M | Mar 4 12:18 | 039e4f37ed301110f506f551482d0337.parquet |
| 961K | Mar 4 12:18 | 039e2581430703b39c359dc62924a4eb.parquet |
| 999K | Mar 4 12:18 | 02c6f7e4b559a25d05b595cbb5626270.parquet |
| 1.0M | Mar 4 12:18 | 02dd91468360700a5b9514b109afb504.parquet |
| 938K | Mar 4 12:18 | 02e99c6bb9d3ca833adec796a232bac0.parquet |
| 589K | Mar 4 12:18 | 03aef63e26a0bdbce4a45d7cf6f0c6f8.parquet |
| 1.0M | Mar 4 12:18 | 02d1ca48a66a57b8625754d6a31f53c7.parquet |
| 1.0M | Mar 4 12:18 | 03af9ebf0457e1d451b83fa123f20a12.parquet |
| 1.0M | Mar 4 12:18 | 0289efb0e712486f00f52078d6c64a5b.parquet |
| 1.0M | Mar 4 12:18 | 03466ed913455c281ffeeaa80abdfff6.parquet |
| 1.0M | Mar 4 12:18 | 032d6f4b34da58dba02afdf5dab3e016.parquet |
| 1.0M | Mar 4 12:18 | 03406854f35a4181f4b0778bb5fc010c.parquet |
| 1.0M | Mar 4 12:18 | 0345fc286238bcea5b2b9849738c53a2.parquet |
| 1.0M | Mar 4 12:18 | 029ff5169155b57140821a920ad67c7e.parquet |
| 985K | Mar 4 12:18 | 02e4c9f3518f079ec4e5133acccb2635.parquet |
| 1.0M | Mar 4 12:18 | 03917c4f2aef487dc20238777ac5fdae.parquet |
| 969K | Mar 4 12:18 | 03aae0ab38cebcb160e389b2138f50da.parquet |
| 914K | Mar 4 12:18 | 02bf87b07b64fb5be54f9385880b9dc1.parquet |
| 1.0M | Mar 4 12:18 | 02776685a085c4b785a3885ef81d427a.parquet |
| 947K | Mar 4 12:18 | 02f5a82af5a5ffac2fe7551bf4a0a1aa.parquet |
| 992K | Mar 4 12:18 | 039670174dbc12e1ae217764c96bbeb3.parquet |
| 1.0M | Mar 4 12:18 | 037700bf3e272245329d9385bb458bac.parquet |
| 602K | Mar 4 12:18 | 0388916cdb86b12507548b1366554e16.parquet |
| 939K | Mar 4 12:18 | 02ccbadea8d2d897e0d4af9fb3ed9a8e.parquet |
| 1.0M | Mar 4 12:18 | 02dc3f4fb7aec02ba689ad437d8bc459.parquet |
| 1.0M | Mar 4 12:18 | 02cf12e01cd20d38f51b4223e53d3355.parquet |
| 993K | Mar 4 12:18 | 0371f79d154c00f9e3e39c27bab2b426.parquet |
where each file contains data from a single smart meter.
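A minimal sketch for working with one of these per-meter files, assuming the parquet files keep the id/timestamp/value_kwh columns of the source CSVs (the file name is taken from the list above):

```python
import pandas as pd

# Illustrative sketch: load one anonymised smart meter file and aggregate the
# 15-minute readings (value_kwh) to daily consumption.
meter = pd.read_parquet("027ceb7b8fd77a4b11b3b497e9f0b174.parquet")
meter["timestamp"] = pd.to_datetime(meter["timestamp"], utc=True)
daily_kwh = meter.set_index("timestamp")["value_kwh"].resample("1D").sum()
print(daily_kwh.head())
```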
Acknowledgement
The AISOP project (https://aisopproject.com/) received funding in the framework of the Joint Programming Platform Smart Energy Systems from the European Union's Horizon 2020 research and innovation programme under grant agreement No 883973 (ERA-Net Smart Energy Systems joint call on digital transformation for green energy transition).
The train data of the Riiid competition is a large dataset of over 100 million rows and 10 columns that does not fit into a Kaggle Notebook's RAM when loaded with the default pandas read_csv, which prompted a search for alternative approaches and formats.
Train data of Riiid competition in different formats.
Reading the .csv file for the Riiid competition took a huge amount of time and memory. This inspired me to convert the .csv into different file formats so that they can be loaded easily into a Kaggle kernel.
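As an illustration of that conversion, here is a minimal sketch assuming a local copy of the competition's train.csv; the dtype choices below are assumptions made to reduce memory and should be adjusted to the actual schema:

```python
import pandas as pd

# Illustrative sketch: convert the large train.csv to parquet and feather so it
# loads faster and with less memory. The dtype mapping is an assumption.
dtypes = {"content_id": "int16", "user_answer": "int8", "answered_correctly": "int8"}
train = pd.read_csv("train.csv", dtype=dtypes)
train.to_parquet("train.parquet", index=False)
train.to_feather("train.feather")
```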
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on Supercomputer Fugaku, over the three years of public system usage (March 2021-April 2024). Each job data contains an extensive set of features, such as exit code, duration, power consumption and performance metrics (e.g. #flops, memory bandwidth, operational intensity and memory/compute bound label), which allows for a multitude of job characteristics prediction. The full list of features can be found in the file feature_list.csv.
The sensitive data appears in both anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes, without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.
F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.
The files of F-DATA are saved as .parquet files. It is possible to load such files as dataframes by leveraging the pandas APIs, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:
import pandas as pd
df = pd.read_parquet("21_01.parquet")
df.head()
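Several monthly files can also be combined into a single dataframe; a minimal sketch, assuming the YY_MM.parquet files are in the working directory:

```python
import glob
import pandas as pd

# Illustrative sketch: load several monthly F-DATA files and stack them into
# one dataframe (file names follow the YY_MM.parquet pattern described above).
files = sorted(glob.glob("2[1-4]_*.parquet"))
jobs = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
print(len(jobs), "jobs loaded from", len(files), "files")
```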
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The following submission includes raw and processed data from the in-water deployment of NREL's Hydraulic and Electric Reverse Osmosis Wave Energy Converter (HERO WEC), in the form of parquet files, TDMS files, CSV files, bag files, and MATLAB workspaces. This dataset was collected in March 2024 at the Jennette's pier test site in North Carolina.
This submission includes the following:
Data description document (HERO WEC FY24 Hydraulic Deployment Data Descriptions.doc) - This document includes detailed descriptions of the type of data and how it was processed and/or calculated.
Processed MATLAB workspace - The processed data is provided in the form of a single MATLAB workspace containing data from the full deployment. This workspace contains data from all sensors downsampled to 10 Hz along with all array Value Added Products (VAPs).
MATLAB visualization scripts - The MATLAB workspaces can be visualized using the file "HERO_WEC_2024_Hydraulic_Config_Data_Viewer.m/mlx". The user simply needs to download the processed MATLAB workspaces, specify the desired start and end times, and run this file. Both the .m and .mlx file formats have been provided, depending on the user's preference.
Summary Data - The fully processed data was used to create a summary data set with averages and important calculations performed on 30-minute intervals to align with the intervals of wave resource data reported from nearby CDIP ocean observing buoys located 20 km east of Jennette's pier and 40 km northeast of Jennette's pier. The wave resource data provided in this data set is to be used for reference only due to the difference in water depth and proximity to shore between the Jennette's pier test site and the locations of the ocean observing buoys. This data is provided in the Summary Data zip folder, which includes this data set in the form of a MATLAB workspace, parquet file, and Excel spreadsheet.
Processed Parquet File - The processed data is provided in the form of a single parquet file containing data from all HERO WEC sensors collected during the full deployment. Data in this file has been downsampled to 10 Hz and all array VAPs are included.
Interim Filtered Data - Raw data from each sensor group partitioned into 30-minute parquet files. These files are outputs from an intermediate stage of data processing and contain the raw data with no Quality Control (QC) or calculations performed in a format that is easier to use than the raw data.
Raw Data - Raw, unprocessed data from this deployment can be found in the Raw Data zip folder. This data is provided in the form of TDMS, CSV, and bag files in the original format output by the MODAQ system.
Python Data Processing Script - This links to a public NREL GitHub repository containing the Python script used to go from raw data to fully processed parquet files. Additional documentation on how to use this script is included in the GitHub repository.
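As a rough illustration of working with the processed parquet file, the sketch below loads it and selects a time window; the file name and the "time" column are placeholders, not the actual names used in this submission (see the data description document for the real ones):

```python
import pandas as pd

# Illustrative sketch only: open a processed deployment parquet file and keep a
# time window of interest. File and column names are placeholders.
data = pd.read_parquet("HERO_WEC_2024_hydraulic_processed.parquet")
data["time"] = pd.to_datetime(data["time"])
window = data[(data["time"] >= "2024-03-10") & (data["time"] < "2024-03-11")]
print(window.shape)
```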
This data set has been developed by the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DE-AC36-08GO28308. Funding provided by the U.S. Department of Energy Office of Energy Efficiency and Renewable Energy Water Power Technologies Office.
This dataset gathers data in .parquet format. Instead of having one .csv.gz file per department per period, all departments are grouped into a single file per period. When possible (depending on the size), several periods are grouped in the same file.

### Data origin

The data come from:
- Basic climatological data - monthly
- Basic climatological data - daily
- Basic climatological data - hourly
- Basic climatological data - 6 minutes

### Data preparation

The files ending with .prepared have undergone slight preparation steps:
- deletion of spaces in column names
- (flexible) typing

The data are typed as follows:
- date (YYYYMM, YYYYMMDD, YYYYMMDDHH, YYYYMMDDHHMN): integer
- NUM_POSTE: string
- USUAL_NAME: string
- LAT: float
- LON: float
- ALTI: integer
- columns beginning with Q ("quality") or NB ("number"): integer

### Update

The data are updated at least once a week (depending on my availability) for the "latest-2023-2024" period. If you have specific needs, feel free to reach out to me.

### Re-use: Meteo Squad

These files are used in the Meteo Squad web application: https://www.meteosquad.com

### Contact

If you have specific requests, please do not hesitate to contact me: contact@mistermeteo.com

Summary

GitTables 1M (https://gittables.github.io) is a corpus of currently 1M relational tables extracted from CSV files in GitHub repositories that are associated with a license that allows distribution. We aim to grow this to at least 10M tables. Each parquet file in this corpus represents a table with the original content (e.g. values and header) as extracted from the corresponding CSV file. Table columns are enriched with annotations corresponding to >2K semantic types from Schema.org and DBpedia (provided as metadata of the parquet file). These column annotations consist of, for example, semantic types, hierarchical relations to other types, and descriptions. We believe GitTables can facilitate many use cases, among which:
- Data integration, search and validation.
- Data visualization and analysis recommendation.
- Schema analysis and completion for e.g. database or knowledge base design.

If you have questions, the paper, documentation, and contact details are provided on the website: https://gittables.github.io. We recommend using Zenodo's API to easily download the full dataset (i.e. all zipped topic subsets).

Dataset contents

The data is provided in subsets of tables stored in parquet files; each subset corresponds to a term that was used to query GitHub with. The column annotations and other metadata (e.g. URL and repository license) are attached to the metadata of the parquet file. This version corresponds to this version of the paper: https://arxiv.org/abs/2106.07258v4. In summary, this dataset can be characterized as follows:

| Statistic | Value |
|-----------|-------|
| # tables | 1M |
| average # columns | 12 |
| average # rows | 142 |
| # annotated tables (at least 1 column annotation) | 723K+ (DBpedia), 738K+ (Schema.org) |
| # unique semantic types | 835 (DBpedia), 677 (Schema.org) |

How to download

The dataset can be downloaded through Zenodo's interface directly, or using Zenodo's API (recommended for full download).

Future releases

Future releases will include the following:
- Increased number of tables (expected at least 10M)

Associated datasets
- GitTables benchmark - column type detection: https://zenodo.org/record/5706316
- GitTables 1M - CSV files: https://zenodo.org/record/6515973
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

Limitations with this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

Potential methods for addressing limitations with this dataset:
- Identifying and removing unrealistic temperature estimates:
  - Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
  - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
  - Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles:
  - These waterbodies can be identified using the "site_id_tile_hv_crosswalk.csv" file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

All zip files within this data release contain nested directories using .parquet files to store the data. The example_script_for_using_parquet.R contains example code for using the R arrow package to open and query the nested .parquet files.
- "year_byscene=XXXX.zip" – includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by the scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the byscene datasets may only include one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data is extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" – includes the summary statistics for individual waterbodies and the deepest points within each waterbody by the year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables that aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" – This script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualizing, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" – This cross-walk file identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" – This cross-walk file identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" – a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
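Because the nested directories use Hive-style partitioning (e.g. year_byscene=2023/tile_hv=002-001/part-0.parquet), they can also be read in Python with pyarrow in addition to the provided R script; a minimal sketch, assuming one of the zip files has been extracted locally:

```python
import pyarrow.dataset as ds

# Illustrative sketch: read the extracted "year_byscene=2023" directory as a
# Hive-partitioned dataset and filter one tile. Non-partition column names are
# not shown here; see the data release documentation for the full schema.
dataset = ds.dataset("year_byscene=2023", format="parquet", partitioning="hive")
tile = dataset.to_table(filter=ds.field("tile_hv") == "002-001").to_pandas()
print(tile.shape)
```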
This data provides results from the California Environmental Data Exchange Network (CEDEN) for field and lab chemistry analyses. The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result. Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering. NOTE: Some of the field and lab chemistry data that has been submitted to CEDEN since 2020 has not been loaded into the CEDEN database. That data is not included in this data set (and is also not available via the CEDEN query tool described above), but is available as a supplemental data set available here: Surface Water - Chemistry Results - CEDEN Augmentation. For consistency, many of the conditions applied to the data in this dataset and in the CEDEN query tool are also applied to that supplemental dataset (e.g., no rejected data or replicates are included), but that supplemental data is provisional and may not reflect all of the QA/QC controls applied to the regular CEDEN data available here.
This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data". In addition, it contains the 10,000 synthetic 19th century articles generated using GPT4o. These articles are available both as a csv with the prompt parameters as columns and as individual text files. The files in the repository are as follows:
- ncse_hf_dataset: A huggingface dictionary dataset containing 91 articles from the Nineteenth Century Serials Edition (NCSE) with original OCR and the transcribed ground truth. This dataset is used as the test set in the paper.
- synth_gt.zip: A zip file containing 5 parquet files of training data from the 10,000 synthetic articles. Each parquet file is made up of observations of a fixed length of tokens, for a total of 2 million tokens. The observation lengths are 200, 100, 50, 25, 10.
- synthetic_articles.zip: A zip file containing the csv of all the synthetic articles and the prompts used to generate them.
- synthetic_articles_text.zip: A zip file containing the text files of all the synthetic articles. The file names are the prompt parameters and the id reference from the synthetic article csv.

The data in this repo is used by the code repositories associated with the project: https://github.com/JonnoB/scrambledtext_analysis and https://github.com/JonnoB/training_lms_with_synthetic_data
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Highlights:
Collector array description: The data is from a flat plate collector array with a total gross collector area of 516 m2 (361 kW nominal thermal power). The array consists of four parallel collector rows with a common inlet and outlet manifold. Large-area flat-plate collectors from Arcon-Sunmark A/S are used in the plant. Collectors are all oriented towards the south (180°), have a tilt angle of 30° and a row spacing of 3.1 m. The collector array is part of a large-scale solar thermal plant located at Fernheizwerk Graz, Austria (latitude: 47.047294 N, longitude: 15.436366 E). The plant feeds into the local district heating network and is one of the largest Solar District Heating installations in Central Europe.
Data files:
Scripts:
Data collection and preparation: AEE — Institute for Sustainable Technologies (AEE INTEC), Feldgasse 19, 8200 Gleisdorf, Austria; and SOLID Solar Energy Systems GmbH (SOLID), Am Pfangberg 117, 8045 Graz, Austria
Data owner: solar.nahwaerme.at Energiecontracting GmbH, Puchstrasse 85, 8020 Graz, Austria
Additional information is provided in a journal article in "Data in Brief", titled "One year of high-precision operational data including measurement uncertainties from a large-scale solar thermal collector array with flat plate collectors in Graz, Austria".
Note: A Gitlab repository is associated with this dataset, intended as a companion to facilitate maintenance of the Python code that is provided along with the data. If you want to use or contribute to the code, please do so using the Gitlab project: https://gitlab.com/sunpeek/zenodo-fhw-arconsouth-dataset-2017
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Origo is a geospatial spreadsheet dataset documenting ancient migrants attested in the Epigraphic Database Heidelberg (EDH). It is derived from a subset of individuals in the EDH People dataset (available at: https://edh.ub.uni-heidelberg.de/data/download/edh_data_pers.csv) who explicitly declare their geographic origin in the inscriptions. Based on the data curated by the EDH team, we have geocoded the stated places of origin and further enriched the dataset with additional metadata, prioritizing machine readability. We have developed the dataset for the purpose of a quantitative study of migration trends in the Roman Empire as part of the Social Dynamics in the Ancient Mediterranean Project (SDAM, http://sdam.au.dk). The scripts used for producing the dataset and for our related publications are available from here: https://github.com/sdam-au/LI_origo/tree/master.
The dataset includes two point geometries per individual:
• Geographic origin (origo_geometry) – representing the individual’s place of origin or birth.
• Findspot (findspot_geometry) – indicating the location where the inscription was discovered, which often approximates the place of death, as approximately 70% of the inscriptions are funerary.
Scope and Structure:
The dataset covers 2,313 individuals, described through 36 attributes. For a detailed explanation of these attributes, please refer to the accompanying file origo_variable_dictionary.csv.
File Formats:
We provide the dataset in two formats for download and analysis:
CSV – for general spreadsheet use.
GeoParquet (v1.0.0) – optimized for geospatial data handling.
In the GeoParquet version, the default geometry is defined by the origo_line attribute, a linestring connecting the origo_geometry (place of origin) and the findspot_geometry (findspot of the inscription). This allows for immediate visualization and analysis of migration patterns in GIS environments.
Getting Started with Python:
To load and explore the GeoParquet dataset in Python, you can use the following code:
import geopandas as gpd
import fsspec

origo = gpd.read_parquet(fsspec.open("https://zenodo.org/records/14604222/files/origo_geo.parquet?download=1").open())
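A short follow-up sketch for inspecting and plotting the loaded GeoDataFrame (assuming matplotlib is available); since the default geometry is origo_line, the origin-to-findspot links can be plotted directly:

```python
import matplotlib.pyplot as plt

# Quick inspection of the loaded GeoDataFrame (default geometry: origo_line,
# the line connecting place of origin and findspot for each individual).
print(origo.shape)
print(origo.geometry.name)

origo.plot(linewidth=0.3)
plt.title("Origo: origin-to-findspot links")
plt.show()
```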
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data". In addition, it contains the 10,000 synthetic 19th century articles generated using GPT4o. These articles are available both as a csv with the prompt parameters as columns and as individual text files. The files in the repository are as follows:
- ncse_hf_dataset: A huggingface dictionary dataset containing 91 articles from the Nineteenth Century Serials Edition (NCSE) with original OCR and the transcribed ground truth. This dataset is used as the test set in the paper.
- synth_gt.zip: A zip file containing 5 parquet files of training data from the 10,000 synthetic articles. Each parquet file is made up of observations of a fixed length of tokens, for a total of 2 million tokens. The observation lengths are 200, 100, 50, 25, 10.
- synthetic_articles.zip: A zip file containing the csv of all the synthetic articles and the prompts used to generate them.
- synthetic_articles_text.zip: A zip file containing the text files of all the synthetic articles. The file names are the prompt parameters and the id reference from the synthetic article csv.

The data in this repo is used by the code repositories associated with the project: https://github.com/JonnoB/scrambledtext_analysis and https://github.com/JonnoB/training_lms_with_synthetic_data
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Tabular Datasets
These datasets are used in the Feature Factory project:
| Index | Dataset Name | File Name | Data Type | Size | Format | Source |
|-------|--------------|-----------|-----------|------|--------|--------|
| 1 | Wine Quality (Red Wine) | winequality-red.csv | Tabular | 1,599 | CSV | Link |
| 2 | NYC Yellow Taxi Trip (Jan 2019) | yellow_tripdata_2019.parquet | Taxi Trip Data | ~7M | Parquet | Link |
| 3 | NYC Green Taxi Trip (Jan 2019) | green_tripdata_2019.parquet | Taxi Trip Data | ~1M | Parquet | Link |
| 4 | California Housing Prices | california_housing.csv | Real Estate Prices… | | | |

See the full description on the dataset page: https://huggingface.co/datasets/habedi/feature-factory-datasets.
The BuildingsBench datasets consist of:
- Buildings-900K: A large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock.
- 7 real residential and commercial building datasets for benchmarking two downstream tasks evaluating generalization: zero-shot STLF and transfer learning for STLF.

Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see link to this database in the links further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).

BuildingsBench also provides an evaluation benchmark that is a collection of various open-source residential and commercial real building energy consumption datasets. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files which contain annual energy consumption. The size of the evaluation datasets altogether is less than 1 GB, and they are listed out below:
- ElectricityLoadDiagrams20112014
- Building Data Genome Project-2
- Individual household electric power consumption (Sceaux)
- Borealis
- SMART
- IDEAL
- Low Carbon London

A README file providing details about how the data is stored and describing the organization of the datasets can be found within each data lake version under BuildingsBench.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Construction

This dataset captures the temporal network of Bitcoin (BTC) flows exchanged between entities at the finest time resolution (UNIX timestamp). Its construction is based on the blockchain covering the period from January 3rd, 2009 to January 25th, 2021. The blockchain extraction was made using the bitcoin-etl Python package (https://github.com/blockchain-etl/bitcoin-etl). The entity-entity network is built by aggregating Bitcoin addresses using the common-input heuristic [1] as well as popular Bitcoin users' addresses provided by https://www.walletexplorer.com/

[1] M. Harrigan and C. Fretter, "The Unreasonable Effectiveness of Address Clustering," 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), Toulouse, France, 2016, pp. 368-373, doi: 10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0071. Keywords: Online banking; Merging; Protocols; Upper bound; Bipartite graph; Electronic mail; Size measurement; bitcoin; cryptocurrency; blockchain.

Dataset Description

Bitcoin Activity Temporal Coverage: From 03 January 2009 to 25 January 2021

Overview: This dataset provides a comprehensive representation of Bitcoin exchanges between entities over a significant temporal span, spanning from the inception of Bitcoin to recent years. It encompasses various temporal resolutions and representations to facilitate Bitcoin transaction network analysis in the context of temporal graphs. All dates have been derived from block UNIX timestamps in the GMT timezone.

Contents: The dataset is distributed across several compressed archives. All data are stored in the Apache Parquet file format, a columnar storage format optimized for analytical queries. It can be used with the pyspark Python package.
- orbitaal-stream_graph.tar.gz: The root directory is STREAM_GRAPH/. Contains a stream graph representation of Bitcoin exchanges at the finest temporal scale, corresponding to the validation time of each block (averaging approximately 10 minutes). The stream graph is divided into 13 files, one for each year. File format is parquet. Name format is orbitaal-stream_graph-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] stands for the corresponding year and [ID] is an integer from 1 to N (number of files) such that sorting by increasing [ID] is the same as sorting by increasing year. These files are in the subdirectory STREAM_GRAPH/EDGES/.
- orbitaal-snapshot-all.tar.gz: The root directory is SNAPSHOT/. Contains the snapshot network representing all transactions aggregated over the whole dataset period (from Jan. 2009 to Jan. 2021). File format is parquet. Name format is orbitaal-snapshot-all.snappy.parquet. These files are in the subdirectory SNAPSHOT/EDGES/ALL/.
- orbitaal-snapshot-year.tar.gz: The root directory is SNAPSHOT/. Contains the yearly resolution snapshot networks. File format is parquet. Name format is orbitaal-snapshot-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] stands for the corresponding year and [ID] is an integer from 1 to N such that sorting by increasing [ID] is the same as sorting by increasing year. These files are in the subdirectory SNAPSHOT/EDGES/year/.
- orbitaal-snapshot-month.tar.gz: The root directory is SNAPSHOT/. Contains the monthly resolution snapshot networks. File format is parquet. Name format is orbitaal-snapshot-date-[YYYY]-[MM]-file-id-[ID].snappy.parquet, where [YYYY] and [MM] stand for the corresponding year and month, and [ID] is an integer from 1 to N such that sorting by increasing [ID] is the same as sorting by increasing year and month. These files are in the subdirectory SNAPSHOT/EDGES/month/.
- orbitaal-snapshot-day.tar.gz: The root directory is SNAPSHOT/. Contains the daily resolution snapshot networks. File format is parquet. Name format is orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-file-id-[ID].snappy.parquet, where [YYYY], [MM], and [DD] stand for the corresponding year, month, and day, and [ID] is an integer from 1 to N such that sorting by increasing [ID] is the same as sorting by increasing year, month, and day. These files are in the subdirectory SNAPSHOT/EDGES/day/.
- orbitaal-snapshot-hour.tar.gz: The root directory is SNAPSHOT/. Contains the hourly resolution snapshot networks. File format is parquet. Name format is orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-[hh]-file-id-[ID].snappy.parquet, where [YYYY], [MM], [DD], and [hh] stand for the corresponding year, month, day, and hour, and [ID] is an integer from 1 to N such that sorting by increasing [ID] is the same as sorting by increasing year, month, day, and hour. These files are in the subdirectory SNAPSHOT/EDGES/hour/.
- orbitaal-nodetable.tar.gz: The root directory is NODE_TABLE/. Contains two files in parquet format; the first gives information related to nodes present in stream graphs and snapshots, such as period of activity and associated global Bitcoin balance, and the other contains the list of all associated Bitcoin addresses.
- Small samples in CSV format: orbitaal-stream_graph-2016_07_08.csv and orbitaal-stream_graph-2016_07_09.csv. These two CSV files are related to stream graph representations of a halvening happening in 2016.
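Since all files are standard parquet, they can be read with pandas or, as noted above, with the pyspark Python package; a minimal sketch with an illustrative file path following the naming scheme described above (the exact file id depends on the archive contents):

```python
from pyspark.sql import SparkSession

# Illustrative sketch: load one yearly snapshot file with pyspark. The path is
# an example following the naming scheme described above, not a guaranteed name.
spark = SparkSession.builder.appName("orbitaal").getOrCreate()
edges = spark.read.parquet("SNAPSHOT/EDGES/year/orbitaal-snapshot-date-2016-file-id-8.snappy.parquet")
edges.printSchema()
print(edges.count(), "edges")
```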
Motivation
The data in this dataset is derived and cleaned from the full OpenSky dataset in order to illustrate in-flight emergency situations triggering the 7700 transponder code. It spans flights seen by the network's more than 2500 members between 1 January 2018 and 29 January 2020.
The dataset complements the following publication:
Xavier Olive, Axel Tanner, Martin Strohmeier, Matthias Schäfer, Metin Feridun, Allan Tart, Ivan Martinovic and Vincent Lenders.
"OpenSky Report 2020: Analysing in-flight emergencies using big data".
In 2020 IEEE/AIAA 39th Digital Avionics Systems Conference (DASC), October 2020
License
See LICENSE.txt
Disclaimer
The data provided in the files is provided as is. Despite our best efforts at filtering out potential issues, some information could be erroneous.
Most aircraft information comes from the OpenSky aircraft database and has been filled in with manual research from various sources on the Internet. Most information about flight plans has been automatically fetched and processed using open APIs; some manual processing was required to cross-check, correct erroneous information, and fill in missing information.
Description of the dataset
Two files are provided in the dataset:
Examples
Additional analyses and visualisations of the data are available at the following page:
<https://traffic-viz.github.io/paper/squawk7700.html>
Credit
If you use this dataset, please cite the original OpenSky paper:
Xavier Olive, Axel Tanner, Martin Strohmeier, Matthias Schäfer, Metin Feridun, Allan Tart, Ivan Martinovic and Vincent Lenders.
"OpenSky Report 2020: Analysing in-flight emergencies using big data".
In 2020 IEEE/AIAA 39th Digital Avionics Systems Conference (DASC), October 2020
Matthias Schäfer, Martin Strohmeier, Vincent Lenders, Ivan Martinovic and Matthias Wilhelm.
"Bringing Up OpenSky: A Large-scale ADS-B Sensor Network for Research".
In Proceedings of the 13th IEEE/ACM International Symposium on Information Processing in Sensor Networks (IPSN), pages 83-94, April 2014.
and the traffic library used to derive the data:
Xavier Olive.
"traffic, a toolbox for processing and analysing air traffic data."
Journal of Open Source Software 4(39), July 2019.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
DNA Sequence Database from NCBI
Welcome to the curated DNA sequence dataset, automatically gathered from NCBI using the Enigma2 pipeline. This repository provides ready-to-use CSV and Parquet files for downstream machine-learning and bioinformatics tasks.
📋 Dataset Overview
Scope
A collection of topic-specific DNA sequence sets (e.g., BRCA1, TP53, CFTR) sourced directly from NCBI’s Nucleotide database.
Curation Process
Query Design
Predefined Entrez queries (gene… See the full description on the dataset page: https://huggingface.co/datasets/shivendrra/EnigmaDataset.
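As a generic illustration (not the Enigma2 pipeline itself) of how gene-specific sequences can be fetched from NCBI's Nucleotide database with Biopython's Entrez module; the e-mail address, query term, and record count are placeholders:

```python
from Bio import Entrez, SeqIO

# Generic illustration only: fetch a few BRCA1 nucleotide records from NCBI via
# Entrez. Email and query term are placeholders, not the dataset's actual queries.
Entrez.email = "you@example.com"
handle = Entrez.esearch(db="nucleotide", term="BRCA1[Gene] AND Homo sapiens[Organism]", retmax=5)
ids = Entrez.read(handle)["IdList"]

handle = Entrez.efetch(db="nucleotide", id=ids, rettype="fasta", retmode="text")
for record in SeqIO.parse(handle, "fasta"):
    print(record.id, len(record.seq))
```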
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Paper

This dataset is associated with the paper published in Scientific Data, titled "SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting over a Large Turbine Array." You can access the paper at: https://www.nature.com/articles/s41597-024-03427-5. If you find this dataset useful, please consider citing our paper:

Scientific Data Paper
@article{zhou2024sdwpf, title={SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting over a Large Turbine Array}, author={Zhou, Jingbo and Lu, Xinjiang and Xiao, Yixiong and Tang, Jian and Su, Jiantao and Li, Yu and Liu, Ji and Lyu, Junfu and Ma, Yanjun and Dou, Dejing}, journal={Scientific Data}, volume={11}, number={1}, pages={649}, year={2024}, url={https://doi.org/10.1038/s41597-024-03427-5}, publisher={Nature Publishing Group}}

Baidu KDD Cup Paper
@article{zhou2022sdwpf, title={SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting Challenge at KDD Cup 2022}, author={Zhou, Jingbo and Lu, Xinjiang and Xiao, Yixiong and Su, Jiantao and Lyu, Junfu and Ma, Yanjun and Dou, Dejing}, journal={arXiv preprint arXiv:2208.04360}, url={https://arxiv.org/abs/2208.04360}, year={2022}}

Background

The SDWPF dataset, collected over two years from a wind farm with 134 turbines, details the spatial layout of the turbines and dynamic context factors for each. This dataset was utilized to launch the ACM KDD Cup 2022, attracting registrations from over 2,400 teams worldwide. To facilitate its use, we have released the dataset in two parts: sdwpf_kddcup and sdwpf_full. The sdwpf_kddcup is the original dataset used for the Baidu KDD Cup 2022, comprising both training and test datasets. The sdwpf_full offers a more comprehensive collection, including additional data not available during the KDD Cup, such as weather conditions, dates, and elevation.

sdwpf_kddcup

The sdwpf_kddcup dataset is the original dataset used for the Baidu KDD Cup 2022 challenge. The folder structure of sdwpf_kddcup is:

sdwpf_kddcup
--- sdwpf_245days_v1.csv
--- sdwpf_baidukddcup2022_turb_location.csv
--- final_phase_test
    --- infile
        --- 0001in.csv
        --- 0002in.csv
        --- ...
    --- outfile
        --- 0001out.csv
        --- 0002out.csv
        --- ...

The descriptions of each item in the sdwpf_kddcup dataset are as follows:
- sdwpf_245days_v1.csv: This dataset, released for the KDD Cup 2022 challenge, includes data spanning 245 days.
- sdwpf_baidukddcup2022_turb_location.csv: This file provides the relative positions of all wind turbines within the dataset.
- final_phase_test: This dataset serves as the test data for the final phase of the Baidu KDD Cup. It allows for a comparison of methodologies against those of the award-winning teams from KDD Cup 2022. It includes an 'infile' folder containing input data for the model, and an 'outfile' folder which holds the ground truth for the corresponding output. In other words, for a model function y = f(x), x represents the files in the 'infile' folder, and the ground truth of y corresponds to files in the 'outfile' folder, such as {001out} = f({001in}).

More information about the sdwpf_kddcup dataset used for Baidu KDD Cup 2022 can be found in the Baidu KDD Cup paper: SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting Challenge at KDD Cup 2022.

sdwpf_full

The sdwpf_full dataset offers more information than what was released for the KDD Cup 2022. It includes not only SCADA data but also weather data such as relative humidity, wind speed, and wind direction, sourced from the Fifth Generation of the European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric reanalyses of the global climate (ERA5). The dataset encompasses data collected over two years from a wind farm with 134 wind turbines, covering the period from January 2020 to December 2021. The folder structure of sdwpf_full is:

sdwpf_full
--- sdwpf_turb_location_elevation.csv
--- sdwpf_2001_2112_full.csv
--- sdwpf_2001_2112_full.parquet

The descriptions of each file in the sdwpf_full dataset are as follows:
- sdwpf_turb_location_elevation.csv: This file details the relative positions and elevations of all wind turbines within the dataset.
- sdwpf_2001_2112_full.csv: This dataset includes data collected over two years from a wind farm containing 134 wind turbines, spanning from Jan. 2020 to Dec. 2021. It offers comprehensive enhancements over sdwpf_kddcup/sdwpf_245days_v1.csv, including:
  - Extended time span: It spans two years, from January 2020 to December 2021, whereas sdwpf_245days_v1.csv covers only 245 days.
  - Enriched weather information: This includes additional data such as relative humidity, wind speed, and wind direction, sourced from the Fifth Generation of the European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric reanalyses of the global climate (ERA5).
  - Expanded temporal details: Unlike during the KDD Cup challenge, where timestamp information was withheld to prevent data linkage, this version includes specific timestamps for each data point.
- sdwpf_2001_2112_full.parquet: This dataset is identical to sdwpf_2001_2112_full.csv, but in a different data format.
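A minimal sketch for loading the full two-year dataset from the parquet file with pandas (requires pyarrow or fastparquet to be installed); no column names are assumed here:

```python
import pandas as pd

# Illustrative sketch: load the full two-year SDWPF dataset from the parquet file
# and list its columns.
sdwpf = pd.read_parquet("sdwpf_2001_2112_full.parquet")
print(sdwpf.shape)
print(sdwpf.columns.tolist())
```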
This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here.
Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool