This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here.
Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
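For illustration, a minimal Python sketch for loading the bulk parquet download with pandas. The local file names are hypothetical (adjust them to the resources actually downloaded from this page), and "DataQuality" is assumed to be the column carrying the provisionally assigned value described above:

```python
import glob
import pandas as pd

# Hypothetical file names for the unzipped per-year parquet resources.
paths = sorted(glob.glob("ceden_field_results_*.parquet"))
df = pd.concat((pd.read_parquet(p) for p in paths), ignore_index=True)

# "DataQuality" is assumed here to be the column holding the provisionally
# assigned data-quality value described above.
print(df["DataQuality"].value_counts(dropna=False))
```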
This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.
Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
This dataset gathers data in `.parquet` format. Instead of having one `.csv.gz` file per department per period, all departments are grouped into a single file per period. When possible (depending on size), several periods are grouped in the same file.
### Data origin
The data come from:
- Basic climatological data - monthly
- Basic climatological data - daily
- Basic climatological data - hourly
- Basic climatological data - 6 minutes
### Data preparation
The files ending with `.prepared` have undergone slight preparation steps:
- deletion of spaces in column names
- (flexible) typing
The data are typed as follows:
- date (`YYYYMM`, `YYYYMMDD`, `YYYYMMDDHH`, `YYYYMMDDHHMN`): integer
- `NUM_POSTE`: string
- `USUAL_NAME`: string
- `LAT`: float
- `LON`: float
- `ALTI`: integer
- if the column begins with `Q` ("quality") or `NB` ("number"): integer
### Update
The data are updated at least once a week (depending on my availability) for the "latest-2023-2024" period. If you have specific needs, feel free to reach out to me.
### Re-use: Meteo Squad
These files are used in the Meteo Squad web application: https://www.meteosquad.com
### Contact
If you have specific requests, please do not hesitate to contact me: contact@mistermeteo.com

Data collected for marine benthic infauna, freshwater benthic macroinvertebrate (BMI), algae, bacteria and diatom taxonomic analyses, from the California Environmental Data Exchange Network (CEDEN). Note that bacteria single-species concentrations are stored within the chemistry template, whereas bacteria abundance data are stored within this set. Each record represents a result from a specific event location for a single organism in a single sample.
The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Zip files are provided for bulk data downloads (in csv or parquet file format), and developers can use the API associated with the "CEDEN Benthic Data" (csv) resource to access the data.
Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
This dataset was created by VK
https://creativecommons.org/publicdomain/zero/1.0/
Converted ~50 GB of CSV data into ~5.4 GB of Feather files and ~20 GB of Parquet files for the American Express: Default Prediction competition.
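A rough pandas sketch of such a conversion (the file names are illustrative, and in practice a CSV this large would be processed in chunks or with out-of-core tools):

```python
import pandas as pd

# Illustrative file names; the competition CSVs are tens of GB,
# so chunked or out-of-core processing would be needed in practice.
df = pd.read_csv("train_data.csv")
df.to_feather("train_data.feather")   # compact, fast-loading Feather copy
df.to_parquet("train_data.parquet")   # columnar Parquet copy
```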
This clean dataset is a refined version of our company datasets, consisting of 35M+ data records.
It’s an excellent data solution for companies with limited data engineering capabilities and those who want to reduce their time to value. You get filtered, cleaned, unified, and standardized B2B data. After cleaning, this data is also enriched by leveraging a carefully instructed large language model (LLM).
AI-powered data enrichment offers more accurate information in key data fields, such as company descriptions. It also produces over 20 additional data points that are very valuable to B2B businesses. Enhancing and highlighting the most important information in web data contributes to quicker time to value, making data processing much faster and easier.
For your convenience, you can choose from multiple data formats (Parquet, JSON, JSONL, or CSV) and select a suitable delivery frequency (quarterly, monthly, or weekly).
Coresignal is a leading public business data provider in the web data sphere with an extensive focus on firmographic data and public employee profiles. More than 3B data records in different categories enable companies to build data-driven products and generate actionable insights. Coresignal is exceptional in terms of data freshness, with 890M+ records updated monthly for unprecedented accuracy and relevance.
This data provides results from the California Environmental Data Exchange Network (CEDEN) for field and lab chemistry analyses. The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.
Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
NOTE: Some of the field and lab chemistry data that has been submitted to CEDEN since 2020 has not been loaded into the CEDEN database. That data is not included in this data set (and is also not available via the CEDEN query tool described above), but is available as a supplemental data set here: Surface Water - Chemistry Results - CEDEN Augmentation. For consistency, many of the conditions applied to the data in this dataset and in the CEDEN query tool are also applied to that supplemental dataset (e.g., no rejected data or replicates are included), but the supplemental data is provisional and may not reflect all of the QA/QC controls applied to the regular CEDEN data available here.
Overview
The CKW Group is a distribution system operator that supplies more than 200,000 end customers in Central Switzerland. Since October 2022, CKW has published anonymised and aggregated data from smart meters that measure electricity consumption in the canton of Lucerne. This unique dataset is accessible on the ckw.ch/opendata platform.
Data set A - anonymised smart meter data
Data set B - aggregated smart meter data
Contents of this data set
This data set contains a small sample of CKW data set A, sorted per smart meter ID and stored as parquet files named after the id field of the corresponding anonymised smart meter data. Example: 027ceb7b8fd77a4b11b3b497e9f0b174.parquet
The original CKW data is available for download at https://open.data.axpo.com/%24web/index.html#dataset-a as (gzip-compressed) CSV files, which are split into one file per calendar month. The columns in the CSV files are as follows (a short reading sketch follows the list):
id: the anonymized counter ID (text)
timestamp: the UTC time at the beginning of a 15-minute time window to which the consumption refers (ISO-8601 timestamp)
value_kwh: the consumption in kWh in the time window under consideration (float)
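A minimal Python sketch for working with one of the per-meter parquet files in this sample, assuming pandas with pyarrow is installed, that the example file above has been downloaded locally, and that the three columns listed above are present; the resampling step simply illustrates aggregating the 15-minute values:

```python
import pandas as pd

# Read one per-meter file from this sample (file name taken from the example above).
df = pd.read_parquet("027ceb7b8fd77a4b11b3b497e9f0b174.parquet")

# Columns per the description: id (text), timestamp (ISO-8601, UTC), value_kwh (float).
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

# Aggregate the 15-minute consumption values to daily totals in kWh.
daily_kwh = df.set_index("timestamp")["value_kwh"].resample("1D").sum()
print(daily_kwh.head())
```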
In this archive, data from:
| File size | Export date | Period | File name |
|------|------|------|------|
| 4.2GiB | 2024-04-20 | 202402 | ckw_opendata_smartmeter_dataset_a_202402.csv.gz |
| 4.5GiB | 2024-03-21 | 202401 | ckw_opendata_smartmeter_dataset_a_202401.csv.gz |
| 4.5GiB | 2024-02-20 | 202312 | ckw_opendata_smartmeter_dataset_a_202312.csv.gz |
| 4.4GiB | 2024-01-20 | 202311 | ckw_opendata_smartmeter_dataset_a_202311.csv.gz |
| 4.5GiB | 2023-12-20 | 202310 | ckw_opendata_smartmeter_dataset_a_202310.csv.gz |
| 4.4GiB | 2023-11-20 | 202309 | ckw_opendata_smartmeter_dataset_a_202309.csv.gz |
| 4.5GiB | 2023-10-20 | 202308 | ckw_opendata_smartmeter_dataset_a_202308.csv.gz |
| 4.6GiB | 2023-09-20 | 202307 | ckw_opendata_smartmeter_dataset_a_202307.csv.gz |
| 4.4GiB | 2023-08-20 | 202306 | ckw_opendata_smartmeter_dataset_a_202306.csv.gz |
| 4.6GiB | 2023-07-20 | 202305 | ckw_opendata_smartmeter_dataset_a_202305.csv.gz |
| 3.3GiB | 2023-06-20 | 202304 | ckw_opendata_smartmeter_dataset_a_202304.csv.gz |
| 4.6GiB | 2023-05-24 | 202303 | ckw_opendata_smartmeter_dataset_a_202303.csv.gz |
| 4.2GiB | 2023-04-20 | 202302 | ckw_opendata_smartmeter_dataset_a_202302.csv.gz |
| 4.7GiB | 2023-03-20 | 202301 | ckw_opendata_smartmeter_dataset_a_202301.csv.gz |
| 4.6GiB | 2023-03-15 | 202212 | ckw_opendata_smartmeter_dataset_a_202212.csv.gz |
| 4.3GiB | 2023-03-15 | 202211 | ckw_opendata_smartmeter_dataset_a_202211.csv.gz |
| 4.4GiB | 2023-03-15 | 202210 | ckw_opendata_smartmeter_dataset_a_202210.csv.gz |
| 4.3GiB | 2023-03-15 | 202209 | ckw_opendata_smartmeter_dataset_a_202209.csv.gz |
| 4.4GiB | 2023-03-15 | 202208 | ckw_opendata_smartmeter_dataset_a_202208.csv.gz |
| 4.4GiB | 2023-03-15 | 202207 | ckw_opendata_smartmeter_dataset_a_202207.csv.gz |
| 4.2GiB | 2023-03-15 | 202206 | ckw_opendata_smartmeter_dataset_a_202206.csv.gz |
| 4.3GiB | 2023-03-15 | 202205 | ckw_opendata_smartmeter_dataset_a_202205.csv.gz |
| 4.2GiB | 2023-03-15 | 202204 | ckw_opendata_smartmeter_dataset_a_202204.csv.gz |
| 4.1GiB | 2023-03-15 | 202203 | ckw_opendata_smartmeter_dataset_a_202203.csv.gz |
| 3.5GiB | 2023-03-15 | 202202 | ckw_opendata_smartmeter_dataset_a_202202.csv.gz |
| 3.7GiB | 2023-03-15 | 202201 | ckw_opendata_smartmeter_dataset_a_202201.csv.gz |
| 3.5GiB | 2023-03-15 | 202112 | ckw_opendata_smartmeter_dataset_a_202112.csv.gz |
| 3.1GiB | 2023-03-15 | 202111 | ckw_opendata_smartmeter_dataset_a_202111.csv.gz |
| 3.0GiB | 2023-03-15 | 202110 | ckw_opendata_smartmeter_dataset_a_202110.csv.gz |
| 2.7GiB | 2023-03-15 | 202109 | ckw_opendata_smartmeter_dataset_a_202109.csv.gz |
| 2.6GiB | 2023-03-15 | 202108 | ckw_opendata_smartmeter_dataset_a_202108.csv.gz |
| 2.4GiB | 2023-03-15 | 202107 | ckw_opendata_smartmeter_dataset_a_202107.csv.gz |
| 2.1GiB | 2023-03-15 | 202106 | ckw_opendata_smartmeter_dataset_a_202106.csv.gz |
| 2.0GiB | 2023-03-15 | 202105 | ckw_opendata_smartmeter_dataset_a_202105.csv.gz |
| 1.7GiB | 2023-03-15 | 202104 | ckw_opendata_smartmeter_dataset_a_202104.csv.gz |
| 1.6GiB | 2023-03-15 | 202103 | ckw_opendata_smartmeter_dataset_a_202103.csv.gz |
| 1.3GiB | 2023-03-15 | 202102 | ckw_opendata_smartmeter_dataset_a_202102.csv.gz |
| 1.3GiB | 2023-03-15 | 202101 | ckw_opendata_smartmeter_dataset_a_202101.csv.gz |
was processed into partitioned parquet files, and then organised by id into parquet files with data from single smart meters.
A small sample of the smart meter data above is archived in the public cloud space of the AISOP project (https://os.zhdk.cloud.switch.ch/swift/v1/aisop_public/ckw/ts/batch_0424/batch_0424.zip) and also in this public record. For access to the complete data, contact the authors of this archive.
It consists of the following parquet files:
| Size | Date | Name |
|------|------|------|
| 1.0M | Mar 4 12:18 | 027ceb7b8fd77a4b11b3b497e9f0b174.parquet |
| 979K | Mar 4 12:18 | 03a4af696ff6a5c049736e9614f18b1b.parquet |
| 1.0M | Mar 4 12:18 | 03654abddf9a1b26f5fbbeea362a96ed.parquet |
| 1.0M | Mar 4 12:18 | 03acebcc4e7d39b6df5c72e01a3c35a6.parquet |
| 1.0M | Mar 4 12:18 | 039e60e1d03c2afd071085bdbd84bb69.parquet |
| 931K | Mar 4 12:18 | 036877a1563f01e6e830298c193071a6.parquet |
| 1.0M | Mar 4 12:18 | 02e45872f30f5a6a33972e8c3ba9c2e5.parquet |
| 662K | Mar 4 12:18 | 03a25f298431549a6bc0b1a58eca1f34.parquet |
| 635K | Mar 4 12:18 | 029a46275625a3cefc1f56b985067d15.parquet |
| 1.0M | Mar 4 12:18 | 0301309d6d1e06c60b4899061deb7abd.parquet |
| 1.0M | Mar 4 12:18 | 0291e323d7b1eb76bf680f6e800c2594.parquet |
| 1.0M | Mar 4 12:18 | 0298e58930c24010bbe2777c01b7644a.parquet |
| 1.0M | Mar 4 12:18 | 0362c5f3685febf367ebea62fbc88590.parquet |
| 1.0M | Mar 4 12:18 | 0390835d05372cb66f6cd4ca662399e8.parquet |
| 1.0M | Mar 4 12:18 | 02f670f059e1f834dfb8ba809c13a210.parquet |
| 987K | Mar 4 12:18 | 02af749aaf8feb59df7e78d5e5d550e0.parquet |
| 996K | Mar 4 12:18 | 0311d3c1d08ee0af3edda4dc260421d1.parquet |
| 1.0M | Mar 4 12:18 | 030a707019326e90b0ee3f35bde666e0.parquet |
| 955K | Mar 4 12:18 | 033441231b277b283191e0e1194d81e2.parquet |
| 995K | Mar 4 12:18 | 0317b0417d1ec91b5c243be854da8a86.parquet |
| 1.0M | Mar 4 12:18 | 02ef4e49b6fb50f62a043fb79118d980.parquet |
| 1.0M | Mar 4 12:18 | 0340ad82e9946be45b5401fc6a215bf3.parquet |
| 974K | Mar 4 12:18 | 03764b3b9a65886c3aacdbc85d952b19.parquet |
| 1.0M | Mar 4 12:18 | 039723cb9e421c5cbe5cff66d06cb4b6.parquet |
| 1.0M | Mar 4 12:18 | 0282f16ed6ef0035dc2313b853ff3f68.parquet |
| 1.0M | Mar 4 12:18 | 032495d70369c6e64ab0c4086583bee2.parquet |
| 900K | Mar 4 12:18 | 02c56641571fc9bc37448ce707c80d3d.parquet |
| 1.0M | Mar 4 12:18 | 027b7b950689c337d311094755697a8f.parquet |
| 1.0M | Mar 4 12:18 | 02af272adccf45b6cdd4a7050c979f9f.parquet |
| 927K | Mar 4 12:18 | 02fc9a3b2b0871d3b6a1e4f8fe415186.parquet |
| 1.0M | Mar 4 12:18 | 03872674e2a78371ce4dfa5921561a8c.parquet |
| 881K | Mar 4 12:18 | 0344a09d90dbfa77481c5140bb376992.parquet |
| 1.0M | Mar 4 12:18 | 0351503e2b529f53bdae15c7fbd56fc0.parquet |
| 1.0M | Mar 4 12:18 | 033fe9c3a9ca39001af68366da98257c.parquet |
| 1.0M | Mar 4 12:18 | 02e70a1c64bd2da7eb0d62be870ae0d6.parquet |
| 1.0M | Mar 4 12:18 | 0296385692c9de5d2320326eaa000453.parquet |
| 962K | Mar 4 12:18 | 035254738f1cc8a31075d9fbe3ec2132.parquet |
| 991K | Mar 4 12:18 | 02e78f0d6a8fb96050053e188bf0f07c.parquet |
| 1.0M | Mar 4 12:18 | 039e4f37ed301110f506f551482d0337.parquet |
| 961K | Mar 4 12:18 | 039e2581430703b39c359dc62924a4eb.parquet |
| 999K | Mar 4 12:18 | 02c6f7e4b559a25d05b595cbb5626270.parquet |
| 1.0M | Mar 4 12:18 | 02dd91468360700a5b9514b109afb504.parquet |
| 938K | Mar 4 12:18 | 02e99c6bb9d3ca833adec796a232bac0.parquet |
| 589K | Mar 4 12:18 | 03aef63e26a0bdbce4a45d7cf6f0c6f8.parquet |
| 1.0M | Mar 4 12:18 | 02d1ca48a66a57b8625754d6a31f53c7.parquet |
| 1.0M | Mar 4 12:18 | 03af9ebf0457e1d451b83fa123f20a12.parquet |
| 1.0M | Mar 4 12:18 | 0289efb0e712486f00f52078d6c64a5b.parquet |
| 1.0M | Mar 4 12:18 | 03466ed913455c281ffeeaa80abdfff6.parquet |
| 1.0M | Mar 4 12:18 | 032d6f4b34da58dba02afdf5dab3e016.parquet |
| 1.0M | Mar 4 12:18 | 03406854f35a4181f4b0778bb5fc010c.parquet |
| 1.0M | Mar 4 12:18 | 0345fc286238bcea5b2b9849738c53a2.parquet |
| 1.0M | Mar 4 12:18 | 029ff5169155b57140821a920ad67c7e.parquet |
| 985K | Mar 4 12:18 | 02e4c9f3518f079ec4e5133acccb2635.parquet |
| 1.0M | Mar 4 12:18 | 03917c4f2aef487dc20238777ac5fdae.parquet |
| 969K | Mar 4 12:18 | 03aae0ab38cebcb160e389b2138f50da.parquet |
| 914K | Mar 4 12:18 | 02bf87b07b64fb5be54f9385880b9dc1.parquet |
| 1.0M | Mar 4 12:18 | 02776685a085c4b785a3885ef81d427a.parquet |
| 947K | Mar 4 12:18 | 02f5a82af5a5ffac2fe7551bf4a0a1aa.parquet |
| 992K | Mar 4 12:18 | 039670174dbc12e1ae217764c96bbeb3.parquet |
| 1.0M | Mar 4 12:18 | 037700bf3e272245329d9385bb458bac.parquet |
| 602K | Mar 4 12:18 | 0388916cdb86b12507548b1366554e16.parquet |
| 939K | Mar 4 12:18 | 02ccbadea8d2d897e0d4af9fb3ed9a8e.parquet |
| 1.0M | Mar 4 12:18 | 02dc3f4fb7aec02ba689ad437d8bc459.parquet |
| 1.0M | Mar 4 12:18 | 02cf12e01cd20d38f51b4223e53d3355.parquet |
| 993K | Mar 4 12:18 | 0371f79d154c00f9e3e39c27bab2b426.parquet |
where each file contains data from a single smart meter.
Acknowledgement
The AISOP project (https://aisopproject.com/) received funding in the framework of the Joint Programming Platform Smart Energy Systems from the European Union's Horizon 2020 research and innovation programme under grant agreement No 883973 (ERA-Net Smart Energy Systems joint call on digital transformation for green energy transition).
https://brightdata.com/license
Unlock powerful insights with the Amazon Prime dataset, offering access to millions of records from any Amazon domain. This dataset provides comprehensive data points such as product titles, descriptions, exclusive Prime discounts, brand details, pricing (initial and discounted), availability, customer ratings, reviews, and product categories. Additionally, it includes unique identifiers like ASINs, images, and seller information, allowing you to analyze Prime offerings, trends, and customer preferences with precision. Use this dataset to optimize your eCommerce strategies by analyzing Prime-exclusive pricing strategies, identifying top-performing brands and products, and tracking customer sentiment through reviews and ratings. Gain valuable insights into consumer demand, seasonal trends, and the impact of Prime discounts to make data-driven decisions that enhance your inventory management, marketing campaigns, and pricing strategies. Whether you’re a retailer, marketer, data analyst, or researcher, the Amazon Prime dataset empowers you with the data needed to stay competitive in the dynamic eCommerce landscape. Available in various formats such as JSON, CSV, and Parquet, and delivered via flexible options like API, S3, or email, this dataset ensures seamless integration into your workflows.
Summary
GitTables 1M (https://gittables.github.io) is a corpus of currently 1M relational tables extracted from CSV files in GitHub repositories that are associated with a license that allows distribution. We aim to grow this to at least 10M tables.
Each parquet file in this corpus represents a table with the original content (e.g. values and header) as extracted from the corresponding CSV file. Table columns are enriched with annotations corresponding to >2K semantic types from Schema.org and DBpedia (provided as metadata of the parquet file). These column annotations consist of, for example, semantic types, hierarchical relations to other types, and descriptions.
We believe GitTables can facilitate many use-cases, among which:
- Data integration, search and validation.
- Data visualization and analysis recommendation.
- Schema analysis and completion for e.g. database or knowledge base design.
If you have questions, the paper, documentation, and contact details are provided on the website: https://gittables.github.io. We recommend using Zenodo's API to easily download the full dataset (i.e. all zipped topic subsets).
Dataset contents
The data is provided in subsets of tables stored in parquet files; each subset corresponds to a term that was used to query GitHub. The column annotations and other metadata (e.g. URL and repository license) are attached to the metadata of the parquet file. This version corresponds to this version of the paper: https://arxiv.org/abs/2106.07258v4.
In summary, this dataset can be characterized as follows:
| Statistic | Value |
|------|------|
| # tables | 1M |
| average # columns | 12 |
| average # rows | 142 |
| # annotated tables (at least 1 column annotation) | 723K+ (DBpedia), 738K+ (Schema.org) |
| # unique semantic types | 835 (DBpedia), 677 (Schema.org) |
How to download
The dataset can be downloaded through Zenodo's interface directly, or using Zenodo's API (recommended for a full download).
Future releases
Future releases will include the following:
- Increased number of tables (expected at least 10M)
Associated datasets
- GitTables benchmark - column type detection: https://zenodo.org/record/5706316
- GitTables 1M - CSV files: https://zenodo.org/record/6515973
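As a minimal sketch of how the per-table annotations could be inspected with pyarrow. The file name is illustrative and the exact metadata keys depend on the release, so treat this as an assumption rather than the documented layout of the corpus:

```python
import pyarrow.parquet as pq

# Illustrative file name for one extracted GitTables table.
table = pq.read_table("some_table.parquet")

# Table-level metadata (column annotations, source URL, repository license, ...)
# is stored as key/value byte strings on the Arrow schema.
for key, value in (table.schema.metadata or {}).items():
    print(key.decode(), "->", value.decode()[:80])

df = table.to_pandas()
print(df.head())
```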
This data provides results from chemistry and field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result. Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here. Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Tabular Datasets
The following datasets are used in this project: Feature Factory

| Index | Dataset Name | File Name | Data Type | Records | Format | Source |
|------|------|------|------|------|------|------|
| 1 | Wine Quality (Red Wine) | winequality-red.csv | Tabular | 1,599 | CSV | Link |
| 2 | NYC Yellow Taxi Trip (Jan 2019) | yellow_tripdata_2019.parquet | Taxi Trip Data | ~7M | Parquet | Link |
| 3 | NYC Green Taxi Trip (Jan 2019) | green_tripdata_2019.parquet | Taxi Trip Data | ~1M | Parquet | Link |
| 4 | California Housing Prices | california_housing.csv | Real Estate Prices | … | … | … |

See the full description on the dataset page: https://huggingface.co/datasets/habedi/feature-factory-datasets.
The BuildingsBench datasets consist of:
- Buildings-900K: A large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock.
- 7 real residential and commercial building datasets for benchmarking two downstream tasks evaluating generalization: zero-shot STLF and transfer learning for STLF.
Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models.
Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see the link to this database further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation."
Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).
BuildingsBench also provides an evaluation benchmark that is a collection of various open source residential and commercial real building energy consumption datasets. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files which contain annual energy consumption. The size of the evaluation datasets altogether is less than 1 GB, and they are listed below:
- ElectricityLoadDiagrams20112014
- Building Data Genome Project-2
- Individual household electric power consumption (Sceaux)
- Borealis
- SMART
- IDEAL
- Low Carbon London
A README file providing details about how the data is stored and describing the organization of the datasets can be found within each data lake version under BuildingsBench.
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.
Limitations with this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values will be extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.
Potential methods for addressing limitations with this dataset:
- Identifying and removing unrealistic temperature estimates:
  - Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
  - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
  - Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles:
  - These waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.
All zip files within this data release contain nested directories using .parquet files to store the data. The example_script_for_using_parquet.R contains example code for using the R arrow package to open and query the nested .parquet files.
- "year_byscene=XXXX.zip" - includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by the scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files in the _byscene datasets may only include one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data is extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" - includes the summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data is not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" - This script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualizing, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" - This cross-walk file identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" - This cross-walk file identifies the site_id (nhdhr_{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" - a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
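While the release ships an R example script, a rough Python equivalent using pyarrow is sketched below, assuming a "year_byscene=2023" zip has been extracted locally with its nested directory names intact and that the column names match the description above (the cloud threshold is an example, not a prescribed value):

```python
import pyarrow.dataset as ds

# Read the hive-style partitioned parquet tree (tile_hv=XXX-XXX subdirectories).
scenes = ds.dataset("year_byscene=2023", format="parquet", partitioning="hive")
df = scenes.to_table().to_pandas()

# Screen scenes as suggested above: cloudiness, minimum water pixels,
# and a deepest point classified as water.
df["percent_cloud_pixels"] = df["wb_dswe9_pixels"] / (
    df["wb_dswe9_pixels"] + df["wb_dswe1_pixels"]
)
clean = df[
    (df["percent_cloud_pixels"] <= 0.5)   # example threshold, not prescribed
    & (df["wb_dswe1_pixels"] >= 10)
    & (df["dp_dswe"] == 1)
]
print(clean.head())
```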
The following submission includes raw and processed electrical configuration deployment data from the in-water deployment of NREL's Hydraulic and Electric Reverse Osmosis Wave Energy Converter (HERO WEC), in the form of parquet files, TDMS files, CSV files, bag files, and MATLAB workspaces. This dataset was collected in April 2024 at the Jennette's Pier test site in North Carolina. Raw data as TDMS, CSV, and bag files are provided here alongside processed data in the form of MATLAB workspaces and Parquet files. This dataset includes the Python code used to process the data and MATLAB scripts to visualize the processed data. All data types, calculations, and processing steps are described in the included "Data Descriptions" document. All files in this dataset are described in detail in the included README. This data set has been developed by the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DE-AC36-08GO28308. Funding provided by the U.S. Department of Energy Office of Energy Efficiency and Renewable Energy Water Power Technologies Office.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes scripts, text files, and cached CSV/Parquet or raw TXT data files used to generate all analysis and results from the paper. A README.md file is included in replication-pkg.zip for details on using the scripts.
If you only want to inspect the figures, you do not need a data ZIP.
If you want to simply re-generate the figures without changes, download data-cached.zip. If you want to make any sort of change to the analyses, you will want to download data-raw.zip.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Latin Inscriptions in Space and Time (LIST) dataset is an aggregate of the Epigraphic Database Heidelberg (https://edh.ub.uni-heidelberg.de/; aggregated EDH on Zenodo) and the Epigraphic Database Clauss Slaby (http://www.manfredclauss.de/; aggregated EDCS on Zenodo), epigraphic datasets created by the Social Dynamics in the Ancient Mediterranean Project (SDAM), 2019-2023, funded by the Aarhus University Forskningsfond Starting grant no. AUFF-E-2018-7-2. The LIST dataset consists of 525,870 inscriptions, enriched by 65 attributes. 77,091 inscriptions overlap between the two source datasets (i.e. EDH and EDCS); 3,316 inscriptions are exclusively from EDH; 445,463 inscriptions are exclusively from EDCS. 511,973 inscriptions have valid geospatial coordinates (the geometry attribute). This information is also used to determine the urban context of each inscription, i.e. whether it is in the neighbourhood (within a 5000 m buffer) of a large, medium, or small city, or rural (>5000 m to any type of city); see the attributes urban_context, urban_context_city, and urban_context_pop. 206,570 inscriptions have a numerical date of origin expressed as an interval or a single year using the attributes not_before and not_after. The dataset also employs a machine learning model to classify the inscriptions covered exclusively by EDCS in terms of the 22 categories employed by EDH; see Kaše, Heřmánková, Sobotkova 2021.
Formats
We publish the dataset in the parquet and geojson file formats. A description of individual attributes is available in the Metadata.csv. Using the geopandas library, you can load the data directly from Zenodo into your Python environment with the following command: LIST = gpd.read_parquet("https://zenodo.org/record/8431323/files/LIST_v1-0.parquet?download=1"). In R, the sfarrow and sf libraries provide tools (st_read_parquet(), read_sf()) to load the parquet and geojson files respectively after you have downloaded the datasets locally. The scripts used to generate the dataset are available via GitHub: https://github.com/sdam-au/LI_ETL
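A slightly fuller Python sketch of the same load, assuming geopandas with pyarrow support is installed (if reading straight from the URL fails in your environment, download the parquet file first and pass the local path):

```python
import geopandas as gpd

LIST = gpd.read_parquet(
    "https://zenodo.org/record/8431323/files/LIST_v1-0.parquet?download=1"
)

# A few of the attributes described above.
print(LIST.shape)
print(LIST[["not_before", "not_after", "urban_context_city"]].head())
```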
The origin of existing attributes is further described in columns ‘dataset_source’, ‘source’, and ‘description’ in the attached Metadata.csv.
Further reading on the dataset creation and methodology:
Reading on applications of the datasets in research:
Notes on spatial attributes
Machine-readable spatial point geometries are provided within the geojson and parquet formats, along with 'Latitude' and 'Longitude' columns, which contain decimal geospatial coordinates where these are known. Additional attributes contain textual references to the original location at different scales. The most reliable attribute with textual information on place of origin is urban_context_city, which contains the ancient toponym of the largest city within a 5 km distance from the inscription findspot, using cities from Hanson's 2016 list. Beyond these universal attributes, the remaining columns are source-dependent and exist only for either the EDH or the EDCS subset. The 'pleiades_id' column, for example, cross-references the inscription findspot to a geospatial location in the Pleiades gazetteer, but only in the EDH subset. The 'place' attribute exists for data from EDCS (Ort) and contains ancient as well as modern place names referring to the findspot or region of provenance, separated by "/"; this column requires additional cleaning before computational analysis. Attributes with the _clean suffix indicate that the text string has been stripped of symbols (such as ?), and most refer to aspects of provenance in the EDH subset of inscriptions.
List of all spatial attributes:
Disclaimer
The original data is provided by the third party indicated as the data source (see the 'data_source' column in the Metadata.csv). SDAM did not create the original data, vouch for its accuracy, or guarantee that it is the most recent data available from the data provider. Much or all of the data is by its nature approximate and will contain some inaccuracies or missing values. The data may contain errors introduced by the data provider(s) and/or by SDAM. We always recommend checking the accuracy directly in the primary source, i.e. the editio princeps of the inscription in question.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
About the data
These are partial results from The Geography of Human Flourishing Project analysis for the years 2010-2023. This project is one of the 10 national projects awarded within the Spatial AI-Challenge 2024, an international initiative at the crossroads of geospatial science and artificial intelligence. At present, only a subset of the data, covering 2010-2012, is included. Data are provided as CSV or parquet. In the datasets, FIPS is the FIPS code for a US state, county is the US… See the full description on the dataset page: https://huggingface.co/datasets/siacus/flourishing.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
VTuber 1B is a dataset for large-scale academic research, collecting over a billion live chats, superchats, and moderation events (bans/deletions) from virtual YouTubers' live streams.
See GitHub and join the #livechat-dataset channel on the SIGVT Discord for discussions.
We also offer ❤️🩹 Sensai, a live chat dataset specifically made for building ML models for spam detection / toxic chat classification.
See public notebooks built on VTuber 1B and VTuber 1B Elements for ideas.
We employed the Honeybee cluster to collect real-time live chat events across major VTubers' live streams. All sensitive data such as author names and author profile images are omitted from the dataset, and the author channel id is anonymized with a salted SHA-1 hash.
Kaggle Datasets (2 MB)
VTuber 1B Elements is most suitable for statistical visualizations and exploratory data analysis.
filename | summary |
---|---|
channels.csv | Channel index |
chat_stats.csv | Chat statistics |
superchat_stats.csv | Super Chat statistics |
VTuber 1B is most suitable for frequency analysis. This edition includes only the essential columns in order to reduce dataset size and make it faster for Kaggle Kernels to load the data.
filename | summary |
---|---|
chats_%Y-%m.parquet | Live chat events (> 1,000,000,000) |
superchats_%Y-%m.parquet | Super chat events (> 4,000,000) |
deletion_events.parquet | Deletion events |
ban_events.parquet | Ban events |
Ban and deletion are equivalent to markChatItemsByAuthorAsDeletedAction and markChatItemAsDeletedAction, respectively.
Chats (chats_%Y-%m.csv)

column | type | description | in standard version |
---|---|---|---|
timestamp | string | ISO 8601 UTC timestamp | limited accuracy |
id | string | chat id | N/A |
authorName | string | author name | N/A |
authorChannelId | string | author channel id | anonymized |
body | string | chat message | N/A |
bodyLength | number | chat message length | standard version only |
membership | string | membership status | N/A |
isMember | nullable boolean | is member (null if unknown) | standard version only |
isModerator | boolean | is channel moderator | N/A |
isVerified | boolean | is verified account | N/A |
videoId | string | source video id | |
channelId | string | source channel id |
value | duration |
---|---|
unknown | Indistinguishable |
non-member | 0 |
new | < 1 month |
1 month | >= 1 month, < 2 months |
2 months | >= 2 months, < 6 months |
6 months | >= 6 months, < 12 months |
1 year | >= 12 months, < 24 months |
2 years | >= 24 months |
Set keep_default_na to False and na_values to '' in read_csv. Otherwise, chat messages like NA would incorrectly be treated as NaN values.
chats = ...
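A minimal pandas sketch of the above, assuming a monthly chat file such as chats_2021-01.csv has been downloaded locally (the file name is illustrative, following the chats_%Y-%m.csv pattern):

```python
import pandas as pd

# keep_default_na=False plus na_values='' keeps literal chat messages such as
# "NA" or "null" from being coerced into NaN.
chats = pd.read_csv(
    "chats_2021-01.csv",   # illustrative file name following chats_%Y-%m.csv
    keep_default_na=False,
    na_values="",
)
print(chats.dtypes)
```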