Randomized Hourly Load Data for use with Taxonomy Distribution Feeders
data.wu.ac.at
application/unknown
Updated Aug 29, 2017
Cite
Department of Energy (2017). Randomized Hourly Load Data for use with Taxonomy Distribution Feeders [Dataset]. https://data.wu.ac.at/schema/data_gov/NWYwYmFmYTItOWRkMC00OWM0LTk3OGYtZDcyYzZiOWY5N2Ez
Dataset provided by
Department of Energy
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder’s geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1].

The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.

This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000.

For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and the corresponding datetime values go in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to be written as an absolute time; subsequent times may be written in relative format (e.g. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of watts (W) and volt-amperes reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.
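The per-column split described above can be sketched in Python as follows. This is a minimal illustration, not the script used to build the dataset: the column names (`rl_1`, `cl_1_A`), the 0.95 default power factor, the absolute timestamp format, and the assumption that the input values are already in watts are all hypothetical and should be checked against sample_individual_load_file.csv.

```python
import csv
import io
import math
from datetime import datetime


def to_player_rows(loads_w, start, power_factor=0.95):
    """Build (time, "P+Qj") rows for one load column.

    Reactive power is derived from real power with an assumed power factor:
    Q = P * tan(acos(pf)).
    """
    q_per_p = math.tan(math.acos(power_factor))
    rows = []
    for i, p in enumerate(loads_w):
        # Only the first row carries an absolute time; the rest are relative.
        stamp = start.strftime("%Y-%m-%d %H:%M:%S") if i == 0 else "+1h"
        rows.append((stamp, f"{p:.1f}{p * q_per_p:+.1f}j"))
    return rows


def split_feeder_csv(text, start, power_factor=0.95):
    """Split one feeder CSV (header row + hourly rows) into per-bus row lists."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = [[] for _ in header]
    for row in reader:
        for column, value in zip(columns, row):
            column.append(float(value))
    return {
        name: to_player_rows(column, start, power_factor)
        for name, column in zip(header, columns)
    }


# Hypothetical two-bus, two-hour excerpt of a feeder file.
sample = "rl_1,cl_1_A\n1000,2000\n1500,2500\n"
players = split_feeder_csv(sample, datetime(2009, 1, 1))
for bus, rows in players.items():
    print(bus, rows)
```

Each value in `players` can then be written to its own header-less CSV file with `csv.writer`, one file per load bus.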

Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed.

For questions about this dataset, contact andy.hoke@nrel.gov.

If you find this dataset useful, please mention NREL and cite [1] in your work.

References:

[1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, “Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders,” IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275 .

[2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, “Modern Grid Initiative Distribution Taxonomy Final Report”, PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf

[3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, “Distribution power flow for smart grid technologies”, IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.

Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

data.niaid.nih.gov

Updated Oct 20, 2022


Cite

Efstathiou, Stefanos (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6826682


Dataset provided by

Yfantidou, Sofia
Karagianni, Christina
Marchioro, Thomas
Palotti, Joao
Giakatos, Dimitrios Panteleimon
Efstathiou, Stefanos
Girdzijauskas, Šarūnas
Kazlouski, Andrei
Vakali, Athena
Ferrari, Elena

Description

LifeSnaps Dataset Documentation

Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

Data Import: Reading CSV

For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.

Data Import: Setting up a MongoDB (Recommended)

To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

To do so, make sure the MongoDB Database Tools are installed, then open a terminal/command prompt and run the following command for each collection in the DB.

For the Fitbit data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c fitbit

For the SEMA data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c sema

For surveys data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c surveys

If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
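The three commands differ only in the collection name, so they can be generated rather than typed. The sketch below builds the argument lists in Python without running them. The `bson_path` argument is an assumption: the commands as printed above do not show a dump file, and the file names used here (`fitbit.bson` and so on) are hypothetical placeholders for wherever the downloaded dump is stored.

```python
import shlex

COLLECTIONS = ("fitbit", "sema", "surveys")


def mongorestore_args(collection, bson_path, host="localhost:27017",
                      db="rais_anonymized", username=None, password=None):
    """Return the mongorestore argument list for one collection."""
    args = ["mongorestore", "--host", host, "-d", db, "-c", collection]
    # Credentials are only needed when access control is enabled.
    if username is not None:
        args += ["--username", username]
    if password is not None:
        args += ["--password", password]
    args.append(bson_path)  # hypothetical path to the collection's dump file
    return args


for name in COLLECTIONS:
    print(shlex.join(mongorestore_args(name, f"{name}.bson")))
```

To execute a command instead of printing it, the list can be passed to `subprocess.run(args, check=True)`.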

Data Availability

The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

{
    _id:
    id (or user_id):
    type:
    data:
}

Each document consists of four fields: _id, id (also found as user_id in the sema and surveys collections), type, and data. The _id field is the MongoDB-defined primary key and can be ignored. The id field refers to a user-specific ID used to uniquely identify each user across all collections. The type field refers to the specific data type within the collection, e.g., steps, heart rate, calories, etc. The data field contains the actual information about the document, e.g., the step count at a specific timestamp for the steps type, in the form of an embedded object. The contents of the data object are type-dependent, meaning that the fields within the data object differ between data types. As mentioned previously, all times are stored in local time, and user IDs are common across different collections. For more information on the available data types, see the related publication.
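As a rough illustration of working with this document layout, the sketch below groups documents by user and type while tolerating both `id` and `user_id`. The sample documents, the `example_type` name, and the fields inside `data` (`dateTime`, `value`, `answer`) are invented for the example and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical documents shaped like the format described above.
docs = [
    {"_id": 1, "id": "u01", "type": "steps",
     "data": {"dateTime": "2021-05-24T10:00:00", "value": 12}},
    {"_id": 2, "id": "u01", "type": "steps",
     "data": {"dateTime": "2021-05-24T10:01:00", "value": 30}},
    {"_id": 3, "user_id": "u01", "type": "example_type",
     "data": {"answer": "placeholder"}},
]


def user_key(doc):
    """Return the user identifier, whichever field name the collection uses."""
    return doc.get("id", doc.get("user_id"))


def group_by_user_and_type(documents):
    """Collect the embedded data objects per (user, type) pair."""
    grouped = defaultdict(list)
    for doc in documents:
        grouped[(user_key(doc), doc["type"])].append(doc["data"])
    return dict(grouped)


grouped = group_by_user_and_type(docs)
print(sorted(grouped))
```

The same grouping applies to documents fetched from MongoDB, since a driver such as pymongo returns each document as a dictionary.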

Surveys Encoding

BREQ2

Why do you engage in exercise?

Code	Text
engage[SQ001]	I exercise because other people say I should
engage[SQ002]	I feel guilty when I don’t exercise
engage[SQ003]	I value the benefits of exercise
engage[SQ004]	I exercise because it’s fun
engage[SQ005]	I don’t see why I should have to exercise
engage[SQ006]	I take part in exercise because my friends/family/partner say I should
engage[SQ007]	I feel ashamed when I miss an exercise session
engage[SQ008]	It’s important to me to exercise regularly
engage[SQ009]	I can’t see why I should bother exercising
engage[SQ010]	I enjoy my exercise sessions
engage[SQ011]	I exercise because others will not be pleased with me if I don’t
engage[SQ012]	I don’t see the point in exercising
engage[SQ013]	I feel like a failure when I haven’t exercised in a while
engage[SQ014]	I think it is important to make the effort to exercise regularly
engage[SQ015]	I find exercise a pleasurable activity
engage[SQ016]	I feel under pressure from my friends/family to exercise
engage[SQ017]	I get restless if I don’t exercise regularly
engage[SQ018]	I get pleasure and satisfaction from participating in exercise
engage[SQ019]	I think exercising is a waste of time

PANAS

Indicate the extent you have felt this way over the past week

Code	Text
P1[SQ001]	Interested
P1[SQ002]	Distressed
P1[SQ003]	Excited
P1[SQ004]	Upset
P1[SQ005]	Strong
P1[SQ006]	Guilty
P1[SQ007]	Scared
P1[SQ008]	Hostile
P1[SQ009]	Enthusiastic
P1[SQ010]	Proud
P1[SQ011]	Irritable
P1[SQ012]	Alert
P1[SQ013]	Ashamed
P1[SQ014]	Inspired
P1[SQ015]	Nervous
P1[SQ016]	Determined
P1[SQ017]	Attentive
P1[SQ018]	Jittery
P1[SQ019]	Active
P1[SQ020]	Afraid

Personality

How Accurately Can You Describe Yourself?

Code	Text
ipip[SQ001]	Am the life of the party.
ipip[SQ002]	Feel little concern for others.
ipip[SQ003]	Am always prepared.
ipip[SQ004]	Get stressed out easily.
ipip[SQ005]	Have a rich vocabulary.
ipip[SQ006]	Don't talk a lot.
ipip[SQ007]	Am interested in people.
ipip[SQ008]	Leave my belongings around.
ipip[SQ009]	Am relaxed most of the time.
ipip[SQ010]	Have difficulty understanding abstract ideas.
ipip[SQ011]	Feel comfortable around people.
ipip[SQ012]	Insult people.
ipip[SQ013]	Pay attention to details.
ipip[SQ014]	Worry about things.
ipip[SQ015]	Have a vivid imagination.
ipip[SQ016]	Keep in the background.
ipip[SQ017]	Sympathize with others' feelings.
ipip[SQ018]	Make a mess of things.
ipip[SQ019]	Seldom feel blue.
ipip[SQ020]	Am not interested in abstract ideas.
ipip[SQ021]	Start conversations.
ipip[SQ022]	Am not interested in other people's problems.
ipip[SQ023]	Get chores done right away.
ipip[SQ024]	Am easily disturbed.
ipip[SQ025]	Have excellent ideas.
ipip[SQ026]	Have little to say.
ipip[SQ027]	Have a soft heart.
ipip[SQ028]	Often forget to put things back in their proper place.
ipip[SQ029]	Get upset easily.
ipip[SQ030]	Do not have a good imagination.
ipip[SQ031]	Talk to a lot of different people at parties.
ipip[SQ032]	Am not really interested in others.
ipip[SQ033]	Like order.
ipip[SQ034]	Change my mood a lot.
ipip[SQ035]	Am quick to understand things.
ipip[SQ036]	Don't like to draw attention to myself.
ipip[SQ037]	Take time out for others.
ipip[SQ038]	Shirk my duties.
ipip[SQ039]	Have frequent mood swings.
ipip[SQ040]	Use difficult words.
ipip[SQ041]	Don't mind being the centre of attention.
ipip[SQ042]	Feel others' emotions.
ipip[SQ043]	Follow a schedule.
ipip[SQ044]	Get irritated easily.
ipip[SQ045]	Spend time reflecting on things.
ipip[SQ046]	Am quiet around strangers.
ipip[SQ047]	Make people feel at ease.
ipip[SQ048]	Am exacting in my work.
ipip[SQ049]	Often feel blue.
ipip[SQ050]	Am full of ideas.

STAI

Indicate how you feel right now

Code	Text
STAI[SQ001]	I feel calm
STAI[SQ002]	I feel secure
STAI[SQ003]	I am tense
STAI[SQ004]	I feel strained
STAI[SQ005]	I feel at ease
STAI[SQ006]	I feel upset
STAI[SQ007]	I am presently worrying over possible misfortunes
STAI[SQ008]	I feel satisfied
STAI[SQ009]	I feel frightened
STAI[SQ010]	I feel comfortable
STAI[SQ011]	I feel self-confident
STAI[SQ012]	I feel nervous
STAI[SQ013]	I am jittery
STAI[SQ014]	I feel indecisive
STAI[SQ015]	I am relaxed
STAI[SQ016]	I feel content
STAI[SQ017]	I am worried
STAI[SQ018]	I feel confused
STAI[SQ019]	I feel steady
STAI[SQ020]	I feel pleasant

TTM

Do you engage in regular physical activity according to the definition above? How frequently did each event or experience occur in the past month?

    Code
    Text


    processes[SQ002]
    I read articles to learn more about physical

The Canada Trademarks Dataset
zenodo.org
pdf, zip
Updated Jul 19, 2024
Cite
Jeremy Sheff (2024). The Canada Trademarks Dataset [Dataset]. http://doi.org/10.5281/zenodo.4999655
Unique identifier
https://doi.org/10.5281/zenodo.4999655
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Jeremy Sheff
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Canada
Description
The Canada Trademarks Dataset

18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303

Dataset Selection and Arrangement (c) 2021 Jeremy Sheff

Python and Stata Scripts (c) 2021 Jeremy Sheff

Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.

This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.

Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.

Terms of Use:

As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.

The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:

The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.

Details of Repository Contents:

This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:

/csv: contains the .csv versions of the data files

/do: contains Stata do-files used to convert the .csv files to .dta format and perform the statistical analyses set forth in the paper reporting this dataset

/dta: contains the .dta versions of the data files

/py: contains the python scripts used to download CIPO’s historical trademarks data via SFTP and generate the .csv data files

If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.

The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
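The "iterparse" prefix suggests that the scripts stream through the XML archives rather than loading them whole. The sketch below shows that general technique with Python's standard library; it is not the author's code, and the element names (`Trademark`, `ApplicationNumber`, `FilingDate`) are hypothetical stand-ins for whatever the CIPO schema actually uses (real files may also use XML namespaces, which this sketch ignores).

```python
import csv
import io
import xml.etree.ElementTree as ET


def stream_records(xml_source, record_tag, fields):
    """Yield one dict per record element without keeping the whole tree."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == record_tag:
            yield {name: elem.findtext(name, default="") for name in fields}
            elem.clear()  # release the record's children once it is emitted


def records_to_csv(xml_source, record_tag, fields):
    """Convert the streamed records to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(stream_records(xml_source, record_tag, fields))
    return out.getvalue()


# Hypothetical miniature input standing in for one raw XML file.
sample = io.BytesIO(
    b"<Root>"
    b"<Trademark><ApplicationNumber>0000001</ApplicationNumber>"
    b"<FilingDate>1980-01-02</FilingDate></Trademark>"
    b"<Trademark><ApplicationNumber>0000002</ApplicationNumber></Trademark>"
    b"</Root>"
)
print(records_to_csv(sample, "Trademark", ["ApplicationNumber", "FilingDate"]))
```

Clearing each record after it is written keeps memory use roughly constant, which matters for archives of the size described above.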

With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format, and uses Stata’s labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.

The python and Stata scripts included in this repository are separately maintained and updated on Github at https://github.com/jnsheff/CanadaTM.

This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.
Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...
zenodo.org
data.europa.eu
zip
Updated Aug 24, 2021
Cite
Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4571228
Unique identifier
https://doi.org/10.5281/zenodo.4571228
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset was gathered on Sep. 17th, 2020. It contains more than 5.4K Python repositories hosted on GitHub. See the file ManyTypes4PyDataset.spec for the repository URLs and their commit SHAs.

The dataset is also de-duplicated using the CD4Py tool. The list of duplicate files is provided in the duplicate_files.txt file.

All of its Python projects are processed into JSON-formatted files. They contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file.

The dataset is split into train, validation, and test sets by source code file. The list of files and their corresponding sets is provided in the dataset_split.csv file.
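A split file like this is typically consumed as a lookup from file path to set name. The sketch below assumes, without confirmation from the source, that each row of dataset_split.csv has two columns (set name, then file path) and no header; the set labels and paths in the sample are invented, so check the real file before relying on this layout.

```python
import csv
import io
from collections import Counter


def load_split(text):
    """Map each source file path to the set it belongs to."""
    split = {}
    for set_name, path in csv.reader(io.StringIO(text)):
        split[path] = set_name
    return split


# Hypothetical excerpt; real paths and set labels may differ.
sample = (
    "train,repos/a/x.py\n"
    "valid,repos/a/y.py\n"
    "test,repos/b/z.py\n"
    "train,repos/b/w.py\n"
)
split = load_split(sample)
counts = Counter(split.values())
print(counts)
```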

Notable changes to each version of the dataset are documented in CHANGELOG.md.
Network traffic and code for machine learning classification
data.mendeley.com
Updated Feb 20, 2020
Cite
Víctor Labayen (2020). Network traffic and code for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.2
Unique identifier
https://doi.org/10.17632/5pmnkshffm.2
Authors
Víctor Labayen
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified in 5 different activities (Video, Bulk, Idle, Web, and Interactive) and the label is shown in the filename. There is also a file (mapping.csv) with the mapping of the host's IP address, the csv/pcap filename and the activity label.

Activities:

Interactive: applications that perform real-time interactions in order to provide a suitable user experience, such as editing a file in Google Docs and remote CLI sessions over SSH.

Bulk data transfer: applications that transfer large files over the network. Some examples are SCP/FTP applications and direct downloads of large files from web servers such as Mediafire, Dropbox, or the university repository, among others.

Web browsing: contains all the traffic generated while searching and consuming different web pages. Examples of those pages are several blogs and news sites and the university's Moodle.

Video playback: contains traffic from applications that consume video in streaming or pseudo-streaming. The best-known servers used are Twitch and YouTube, but the university online classroom has also been used.

Idle behaviour: is composed of the background traffic generated by the user's computer when the user is idle. This traffic has been captured with every application closed and with some pages open, such as Google Docs, YouTube, and several web pages, but always without user interaction.

The capture is performed in a network probe, attached to the router that forwards the user network traffic, using a SPAN port. The traffic is stored in pcap format with all the packet payload. In the csv file, every non TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): Timestamp, protocol, payload size, IP address source and destination, UDP/TCP port source and destination. The fields are also included as a header in every csv file.
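A per-packet CSV of this kind can be summarised with a few lines of standard-library Python. The sketch below totals payload bytes per protocol and measures the trace duration. The header names (`timestamp`, `protocol`, `payload_size`, and so on) and the sample rows are assumptions; the real files carry their own header, which should be used instead.

```python
import csv
import io
from collections import defaultdict

# Hypothetical three-packet trace using assumed header names.
SAMPLE = (
    "timestamp,protocol,payload_size,ip_src,ip_dst,port_src,port_dst\n"
    "0.0,TCP,1200,10.0.0.2,192.0.2.10,50000,443\n"
    "0.5,UDP,300,10.0.0.2,192.0.2.53,50001,53\n"
    "1.5,TCP,500,192.0.2.10,10.0.0.2,443,50000\n"
)


def summarize(text):
    """Return (payload bytes per protocol, trace duration in seconds)."""
    totals = defaultdict(int)
    first = last = None
    for row in csv.DictReader(io.StringIO(text)):
        ts = float(row["timestamp"])
        first = ts if first is None else min(first, ts)
        last = ts if last is None else max(last, ts)
        totals[row["protocol"]] += int(row["payload_size"])
    duration = 0.0 if first is None else last - first
    return dict(totals), duration


totals, duration = summarize(SAMPLE)
print(totals, duration)
```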

The amount of data is stated as follows:

Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files
Video: 23 traces, 4496 s, 1405 MBytes
Web: 23 traces, 4203 s, 148 MBytes
Interactive: 42 traces, 8934 s, 30.5 MBytes
Idle: 52 traces, 6341 s, 0.69 MBytes

The code of our machine learning approach is also included. There is a README.txt file with the documentation of how to use the code.
Data from: BuildingsBench: A Large-Scale Dataset of 900K Buildings and...
catalog.data.gov
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
Updated Jan 11, 2024
Cite
National Renewable Energy Laboratory (2024). BuildingsBench: A Large-Scale Dataset of 900K Buildings and Benchmark for Short-Term Load Forecasting [Dataset]. https://catalog.data.gov/dataset/buildingsbench-a-large-scale-dataset-of-900k-buildings-and-benchmark-for-short-term-load-f
Dataset provided by
National Renewable Energy Laboratory
Description
The BuildingsBench datasets consist of:

Buildings-900K: A large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock.

7 real residential and commercial building datasets for benchmarking two downstream tasks evaluating generalization: zero-shot STLF and transfer learning for STLF.

Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see link to this database in the links further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).

BuildingsBench also provides an evaluation benchmark that is a collection of various open source residential and commercial real building energy consumption datasets. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files which contain annual energy consumption. The size of the evaluation datasets altogether is less than 1GB, and they are listed out below:

ElectricityLoadDiagrams20112014
Building Data Genome Project-2
Individual household electric power consumption (Sceaux)
Borealis
SMART
IDEAL
Low Carbon London

A README file providing details about how the data is stored and describing the organization of the datasets can be found within each data lake version under BuildingsBench.
Dataset of synthetic clinical notes in European Portuguese generated using...
rdm.inesctec.pt
Updated Jun 26, 2025
(2025). Dataset of synthetic clinical notes in European Portuguese generated using an open-source large language model, along with prompting and evaluation data [Dataset]. https://rdm.inesctec.pt/dataset/cs-2025-005
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This dataset was generated using an open-source large language model and carefully curated prompts, simulating realistic clinical narratives while ensuring no real patient data is included. The primary purpose of this dataset is to support the development, evaluation, and benchmarking of Artificial Intelligence tools for clinical and biomedical applications in the Portuguese language, especially European Portuguese. It is particularly valuable for information extraction (IE) tasks such as named entity recognition, clinical note classification, summarization, and synthetic data generation in low-resource language settings. The dataset promotes research on the responsible use of synthetic data in healthcare and aims to serve as a foundation for training or fine-tuning domain-specific Portuguese language models in clinical IE and other natural language processing tasks.

About the dataset

XML files comprising 98,571 fully synthetic clinical notes in European Portuguese, divided into 4 types: 24,759 admission notes, 24,411 ambulatory notes, 24,639 discharge summaries, and 24,762 nursing notes
CSV file with prompts and responses from prompt engineering
CSV files with prompts and responses from synthetic dataset generation
CSV file with results from human evaluation
TXT files containing 1,000 clinical notes (250 of each type) taken from the synthetic dataset and used during automatic evaluation

Data from: Large-Scale Dataset for Radio Frequency based Device-Free Crowd...

data.niaid.nih.gov
repository.uantwerpen.be

Updated Apr 28, 2022


Cite

Denis, Stijn (2022). Large-Scale Dataset for Radio Frequency based Device-Free Crowd Estimation [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_3813449


Dataset provided by

Bellekens, Ben
Kaya, Abdil
Weyn, Maarten
Denis, Stijn
Berkvens, Rafael

License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset serves to estimate the status, in particular the size, of a crowd given the impact on radio frequency communication links within a wireless sensor network. To quantify this relation, signal strengths across sub-GHz communication links are collected at the premises of the Tomorrowland music festival. The communication links are formed between the network nodes of wireless sensor networks deployed in three of the festival's stage environments.

The table below lists the eighteen dataset files. They are collected at the music festival's 2017 and 2018 editions. There are three environments, labeled: ‘Freedom Stage 2017’, ‘Freedom Stage 2018’, and ‘Main Comfort 2018’. Each environment has both 433 MHz and 868 MHz data. The measurements at each environment were collected over a period of three festival days. The dataset files are formatted as Comma-Separated Values (CSV).

Dataset file	Reference file	Number of messages
free17_433_fri.csv	None	393 852
free17_868_fri.csv	None	472 202
free17_433_sat.csv	free17_transactions.csv	996 033
free17_868_sat.csv	free17_transactions.csv	1 023 059
free17_433_sun.csv	free17_transactions.csv	1 007 066
free17_868_sun.csv	free17_transactions.csv	1 036 456
free18_433_fri.csv	None	765 024
free18_868_fri.csv	None	757 657
free18_433_sat.csv	free18_transactions.csv	711 438
free18_868_sat.csv	free18_transactions.csv	714 390
free18_433_sun.csv	free18_transactions.csv	648 329
free18_868_sun.csv	free18_transactions.csv	656 290
main18_433_fri.csv	None	791 462
main18_868_fri.csv	None	908 407
main18_433_sat.csv	main18_counts.csv	863 666
main18_868_sat.csv	main18_counts.csv	884 682
main18_433_sun.csv	main18_counts.csv	903 862
main18_868_sun.csv	main18_counts.csv	894 496
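
The file names in the table appear to follow an environment, year, frequency, day pattern. The sketch below splits a name into those parts. It assumes the pattern holds for all eighteen files; the function name and the returned keys are illustrative and not part of the dataset.

```python
import re

# Pattern inferred from the listed names, e.g. "free17_433_fri.csv".
# This is an assumption based on the table, not a documented specification.
_NAME = re.compile(r"^(free|main)(\d{2})_(433|868)_(fri|sat|sun)\.csv$")

# Environment labels as they are written in the description.
ENVIRONMENTS = {"free": "Freedom Stage", "main": "Main Comfort"}


def parse_dataset_name(name):
    """Split a dataset file name into environment, frequency band and day."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    env, yy, mhz, day = match.groups()
    return {
        "environment": f"{ENVIRONMENTS[env]} 20{yy}",
        "frequency_mhz": int(mhz),
        "day": day,
    }


info = parse_dataset_name("main18_868_sat.csv")
print(info)
```

Grouping files by the parsed environment and band is one way to pair each 433 MHz file with its 868 MHz counterpart for the same day.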

In addition to the datasets and reference files, a software example is provided to illustrate the data use and visualise the initial findings and relation between crowd size and network signal strength impact.

In order to use the software, please retain the following file structure:

.
├── data
├── data_reference
├── graphs
└── software

The peer-reviewed data descriptor for this dataset has now been published in MDPI Data - an open access journal aiming at enhancing data transparency and reusability, and can be accessed here: https://doi.org/10.3390/data5020052. Please cite this when using the dataset.

Film Circulation dataset
zenodo.org
data.niaid.nih.gov
bin, csv, png
Updated Jul 12, 2024
Cite
Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
Available download formats
csv, png, bin
Unique identifier
https://doi.org/10.5281/zenodo.7887672
Dataset updated
Jul 12, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Skadi Loist; Evgenia (Zhenya) Samoilova
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

Please cite this when using the dataset.

Detailed description of the dataset:

1 Film Dataset: Festival Programs

The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
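
The reduction from the long to the wide format described above can be sketched as follows. This is only an illustration of the "first sample festival" rule: the field names `film_id`, `fest` and `fest_date` and the sample dates are hypothetical, and the actual variable names are defined in the codebook.

```python
def long_to_wide(rows):
    """Keep one row per film: the earliest sample festival appearance."""
    first = {}
    for row in rows:
        key = row["film_id"]
        # ISO 8601 dates compare correctly as plain strings.
        if key not in first or row["fest_date"] < first[key]["fest_date"]:
            first[key] = row
    return [first[key] for key in sorted(first)]


# Made-up sample rows in long format (one row per film and festival).
long_rows = [
    {"film_id": 1, "fest": "Frameline", "fest_date": "2017-06-15"},
    {"film_id": 1, "fest": "Berlinale", "fest_date": "2017-02-09"},
    {"film_id": 2, "fest": "Frameline", "fest_date": "2017-06-16"},
]

wide_rows = long_to_wide(long_rows)
for row in wide_rows:
    print(row["film_id"], row["fest"])
```

Film 1 appears twice in the long rows but only once in the wide result, attributed to the festival where it was shown first.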

2 Survey Dataset

The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

The dataset includes 8 text files containing the scripts for web scraping. They were written using R version 3.6.3 for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching used data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
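
To make the two title-matching measures concrete, here is a minimal Python sketch of an optimal string alignment (OSA) distance and a cosine similarity over character bigrams. It is not the authors' R code: the bigram size and the function names are assumptions chosen for illustration, and the original scripts may use different parameters.

```python
from collections import Counter
from math import sqrt


def osa_distance(a, b):
    """Optimal string alignment distance: edits plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def cosine_similarity(a, b, q=2):
    """Cosine similarity between the character q-gram counts of two strings."""
    pa = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    pb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    dot = sum(pa[g] * pb[g] for g in pa)
    norm = sqrt(sum(v * v for v in pa.values())) * sqrt(sum(v * v for v in pb.values()))
    return dot / norm if norm else 0.0


print(osa_distance("the piano", "the paino"))
print(cosine_similarity("the piano", "piano, the"))
```

The OSA distance stays small for a swapped pair of letters, while the q-gram cosine is largely insensitive to word order, which is why the two measures complement each other for title matching.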

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and identifies them for a manual check.

The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films only, to check whether everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
Metadata of a Large Sonar and Stereo Camera Dataset Suitable for...
data.niaid.nih.gov
Updated Jul 8, 2024
Cite
Cesar, Diego (2024). Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10373153
Dataset updated
Jul 8, 2024
Dataset provided by
Cesar, Diego
Bande, Miguel
Wehbe, Bilal
Shah, Nimish
Backe, Christian
Pribbernow, Max
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation

Introduction

This is a set of metadata describing a large dataset of synchronized sonar and stereo camera recordings, that were captured between August 2021 and September 2023 during the project DeeperSense (https://robotik.dfki-bremen.de/en/research/projects/deepersense/), as training data for Sonar-to-RGB image translation. Parts of the sensor data have been published (https://zenodo.org/records/7728089, https://zenodo.org/records/10220989). Due to the size of the sensor data corpus, it is currently impractical to make the entire corpus accessible online. Instead, this metadatabase serves as a relatively compact representation, allowing interested researchers to inspect the data, and select relevant portions for their particular use case, which will be made available on demand. This is an effort to comply with the FAIR principle A2 (https://www.go-fair.org/fair-principles/) that metadata shall be accessible, even when the base data is not immediately.

Locations and sensors

The sensor data was captured at four different locations, including one laboratory (Maritime Exploration Hall at DFKI RIC Bremen) and three field locations (Chalk Lake Hemmoor, Tank Wash Basin Neu-Ulm, Lake Starnberg). At all locations, a ZED camera and a Blueprint Oculus M1200d sonar were used. Additionally, a SeaVision camera was used at the Maritime Exploration Hall at DFKI RIC Bremen and at the Chalk Lake Hemmoor. The examples/ directory holds a typical output image for each sensor at each available location.

Data volume per session

Six data collection sessions were conducted. The table below presents an overview of the amount of data captured in each session:

Session dates	Location	Number of datasets	Total duration of datasets [h]	Total logfile size [GB]	Number of images	Total image size [GB]
2021-08-09 - 2021-08-12	Maritime Exploration Hall at DFKI RIC Bremen	52	10.8	28.8	389’047	88.1
2022-02-07 - 2022-02-08	Maritime Exploration Hall at DFKI RIC Bremen	35	4.4	54.1	629’626	62.3
2022-04-26 - 2022-04-28	Chalk Lake Hemmoor	52	8.1	133.6	1’114’281	97.8
2022-06-28 - 2022-06-29	Tank Wash Basin Neu-Ulm	42	6.7	144.2	824’969	26.9
2023-04-26 - 2023-04-27	Maritime Exploration Hall at DFKI RIC Bremen	55	7.4	141.9	739’613	9.6
2023-09-01 - 2023-09-02	Lake Starnberg	19	2.9	40.1	217’385	2.3
Total		255	40.3	542.7	3’914’921	287.0

Data and metadata structure

Sensor data corpus

The sensor data corpus comprises two processing stages:

raw data streams stored in ROS bagfiles (aka logfiles),

camera and sonar images (aka datafiles) extracted from the logfiles.

The files are stored in a file tree hierarchy which groups them by session, dataset, and modality:

${session_key}/
    ${dataset_key}/
        ${logfile_name}
        ${modality_key}/
            ${datafile_name}

A typical logfile path has this form:

2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/stereo_camera-zed-2023-09-02-15-06-07.bag

A typical datafile path has this form:

2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/zed_right/1693660038_368077993.jpg
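
A datafile path can be split back into its identifiers with a few lines of Python. The sketch below assumes the path has exactly four components (session, dataset, modality, file) and that the file stem encodes seconds and nanoseconds since the UNIX epoch. The second point is a guess from the example name; the authoritative format is documented in entities.json.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def parse_datafile_path(path):
    """Split a datafile path into session, dataset, modality and capture time."""
    session_key, dataset_key, modality_key, datafile_name = PurePosixPath(path).parts
    # Assumption: the stem is "<seconds>_<nanoseconds>" since the UNIX epoch.
    seconds, nanoseconds = PurePosixPath(datafile_name).stem.split("_")
    return {
        "session_key": session_key,
        "dataset_key": dataset_key,
        "modality_key": modality_key,
        "captured_utc": datetime.fromtimestamp(int(seconds), tz=timezone.utc),
        "nanoseconds": int(nanoseconds),
    }


example = parse_datafile_path(
    "2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/"
    "zed_right/1693660038_368077993.jpg"
)
print(example["modality_key"], example["captured_utc"].isoformat())
```

Under that assumption, the example file was captured at 13:07:18 UTC on 2 September 2023, which is consistent with the 15-06 local-time stamp in the dataset key.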

All directory and file names, and their particles, are designed to serve as identifiers in the metadatabase. Their formatting, as well as the definitions of all terms, are documented in the file entities.json.

Metadatabase

The metadatabase is provided in two equivalent forms:

as a standalone SQLite (https://www.sqlite.org/index.html) database file metadata.sqlite for users familiar with SQLite,

as a collection of CSV files in the csv/ directory for users who prefer other tools.

The database file has been generated from the CSV files, so each database table holds the same information as the corresponding CSV file. In addition, the metadatabase contains a series of convenience views that facilitate access to certain aggregate information.

An entity relationship diagram of the metadatabase tables is stored in the file entity_relationship_diagram.png. Each entity, its attributes, and relations are documented in detail in the file entities.json.
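
Before writing queries, it can help to list which tables and convenience views a SQLite file actually contains. The sketch below does this with Python's built-in sqlite3 module. The demo runs against an in-memory database whose tiny schema is hypothetical (only the table names `session` and `dataset` are taken from the description, the view name is invented); for the real file, open `metadata.sqlite` with `sqlite3.connect` instead.

```python
import sqlite3


def list_tables_and_views(conn):
    """Return (type, name) pairs for all tables and views in the database."""
    return conn.execute(
        "SELECT type, name FROM sqlite_master "
        "WHERE type IN ('table', 'view') "
        "ORDER BY type, name"
    ).fetchall()


# Stand-in schema for demonstration only; it is not the real metadatabase.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE session (session_id INTEGER PRIMARY KEY, location_id INTEGER);
    CREATE TABLE dataset (dataset_id INTEGER PRIMARY KEY, session_id INTEGER);
    CREATE VIEW view_datasets_per_session AS
        SELECT session_id, COUNT(dataset_id) AS number_of_datasets
        FROM dataset
        GROUP BY session_id;
    """
)

for kind, name in list_tables_and_views(conn):
    print(kind, name)
```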

Some general design remarks:

For convenience, timestamps are always given in both a human-readable form (ISO 8601 formatted datetime strings with explicit local time zone), and as seconds since the UNIX epoch.

In practice, each logfile always contains a single stream, and each stream is stored always in a single logfile. Per database schema however, the entities stream and logfile are modeled separately, with a “many-streams-to-one-logfile” relationship. This design was chosen to be compatible with, and open for, data collections where a single logfile contains multiple streams.

A modality is not an attribute of a sensor alone, but of a datafile, because a sensor is an attribute of a stream, and a single stream may be the source of multiple modalities (e.g. RGB vs. grayscale images from the same camera, or cartesian vs. polar projection of the same sonar output). Conversely, the same modality may originate from different sensors.

As a usage example, the data volume per session, which is tabulated at the top of this document, can be extracted from the metadatabase with the following SQL query:

SELECT
    PRINTF('%s - %s', SUBSTR(session_start, 1, 10), SUBSTR(session_end, 1, 10)) AS 'Session dates',
    location_name_english AS Location,
    number_of_datasets AS 'Number of datasets',
    total_duration_of_datasets_h AS 'Total duration of datasets [h]',
    total_logfile_size_gb AS 'Total logfile size [GB]',
    number_of_images AS 'Number of images',
    total_image_size_gb AS 'Total image size [GB]'
FROM location
JOIN session USING (location_id)
JOIN (
    SELECT
        session_id,
        COUNT(dataset_id) AS number_of_datasets,
        ROUND(SUM(dataset_duration) / 3600, 1) AS total_duration_of_datasets_h,
        ROUND(SUM(total_logfile_size) / 10e9, 1) AS total_logfile_size_gb
    FROM location
    JOIN session USING (location_id)
    JOIN dataset USING (session_id)
    JOIN view_dataset_total_logfile_size USING (dataset_id)
    GROUP BY session_id
) USING (session_id)
JOIN (
    SELECT
        session_id,
        COUNT(datafile_id) AS number_of_images,
        ROUND(SUM(datafile_size) / 10e9, 1) AS total_image_size_gb
    FROM session
    JOIN dataset USING (session_id)
    JOIN stream USING (dataset_id)
    JOIN datafile USING (stream_id)
    GROUP BY session_id
) USING (session_id)
ORDER BY session_id;

Caravan - A global community dataset for large-sample hydrology (csv...
zenodo.org
application/gzip, zip
Updated May 27, 2025
+ more versions
Cite
Frederik Kratzert; Grey Nearing; Nans Addor; Tyler Erickson; Martin Gauch; Oren Gilon; Lukas Gudmundsson; Avinatan Hassidim; Daniel Klotz; Sella Nevo; Guy Shalev; Yossi Matias (2025). Caravan - A global community dataset for large-sample hydrology (csv version) [Dataset]. http://doi.org/10.5281/zenodo.15530022
Available download formats
zip, application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.15530022
Dataset updated
May 27, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Frederik Kratzert; Grey Nearing; Nans Addor; Tyler Erickson; Martin Gauch; Oren Gilon; Lukas Gudmundsson; Avinatan Hassidim; Daniel Klotz; Sella Nevo; Guy Shalev; Yossi Matias
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the accompanying dataset to the following paper https://www.nature.com/articles/s41597-023-01975-w

Caravan is an open community dataset of meteorological forcing data, catchment attributes, and discharge data for catchments around the world. Additionally, Caravan provides code to derive meteorological forcing data and catchment attributes from the same data sources in the cloud, making it easy for anyone to extend Caravan to new catchments. The vision of Caravan is to provide the foundation for a truly global open source community resource that will grow over time.

If you use Caravan in your research, it would be appreciated to not only cite Caravan itself, but also the source datasets, to pay respect to the amount of work that was put into the creation of these datasets and that made Caravan possible in the first place.

All current development and additional community extensions can be found at https://github.com/kratzert/Caravan

IMPORTANT: Due to size limitations for individual repositories, the netCDF version and the CSV version of Caravan (since Version 1.6) are split into two different repositories. You can find the netCDF version at https://zenodo.org/records/14673536

Changelog:

23 May 2022: Version 0.2 - Resolved a bug when renaming the LamaH gauge ids from the LamaH ids to the official gauge ids provided as "govnr" in the LamaH dataset attribute files.

24 May 2022: Version 0.3 - Fixed gaps in forcing data in some "camels" (US) basins.

15 June 2022: Version 0.4 - Fixed replacing negative CAMELS US values with NaN (-999 in CAMELS indicates missing observation).

1 December 2022: Version 0.4 - Added 4298 basins in the US, Canada and Mexico (part of HYSETS), now totalling 6830 basins. Fixed a bug in the computation of catchment attributes that are defined as pour point properties, where sometimes the wrong HydroATLAS polygon was picked. Restructured the attribute files and added some more metadata (station name and country).

16 January 2023: Version 1.0 - Version of the official paper release. No changes in the data but added a static copy of the accompanying code of the paper. For the most up to date version, please check https://github.com/kratzert/Caravan

10 May 2023: Version 1.1 - No data change, only updated the data description.

17 May 2023: Version 1.2 - Updated a handful of attribute values that were affected by a bug in their derivation. See https://github.com/kratzert/Caravan/issues/22 for details.

16 April 2024: Version 1.4 - Added 9130 gauges from the original source dataset that were initially not included because of the area thresholds (i.e. basins smaller than 100sqkm or larger than 2000sqkm). Also extended the forcing period for all gauges (including the original ones) to 1950-2023. Added two different download options that include timeseries data only as either csv files (Caravan-csv.tar.xz) or netcdf files (Caravan-nc.tar.xz). Including the large basins also required an update in the Earth Engine code.

16 Jan 2025: Version 1.5 - Added FAO Penman-Monteith PET (potential_evaporation_sum_FAO_PENMAN_MONTEITH) and renamed the ERA5-LAND potential_evaporation band to potential_evaporation_sum_ERA5_LAND. Also added all PET-related climate indices derived with the Penman-Monteith PET band (suffix "_FAO_PM") and renamed the old PET-related indices accordingly (suffix "_ERA5_LAND").

27 May 2025: Version 1.6

Updated the CAMELS-AUS data to source from CAMELS-AUS v2. This means more basins (561 compared to 222) and more recent streamflow data (2022 compared to 2014). Note that the gauge id for four basins changed between the original CAMELS-AUS version and v2. Those gauges are ['camelsaus_224213A', 'camelsaus_224214A', 'camelsaus_227225A', 'camelsaus_403213A'], which all lost their trailing "A". To stay synced with CAMELS-AUS (v2), we also adopted the new naming.

Added VERSION file to the root directory that contains the current version number.

Updated the code to the most recent GitHub snapshot (commit 6eab036).

Due to the 50GB repository limit, we had to split the netCDF version and the CSV version into two separate repositories. The CSV version can be found under https://zenodo.org/records/15530021
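
For code that still refers to the old CAMELS-AUS gauge ids, the Version 1.6 renaming can be handled with a small lookup. The following is a minimal sketch based only on the four ids listed in the changelog; the function name is illustrative and not part of the Caravan code base.

```python
# The four gauge ids that lost their trailing "A" in Version 1.6 (from the changelog).
RENAMED_GAUGE_IDS = (
    "camelsaus_224213A",
    "camelsaus_224214A",
    "camelsaus_227225A",
    "camelsaus_403213A",
)


def to_current_gauge_id(gauge_id):
    """Map a pre-1.6 CAMELS-AUS gauge id to its current name."""
    if gauge_id in RENAMED_GAUGE_IDS:
        return gauge_id[:-1]  # drop the trailing "A"
    return gauge_id  # all other ids are left untouched


old_ids = ["camelsaus_224213A", "camelsaus_403213A"]
print([to_current_gauge_id(g) for g in old_ids])
```

An explicit list is used instead of stripping every trailing "A", since the changelog only states that these four ids changed.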
Reference datasets for in-flight emergency situations
data.niaid.nih.gov
zenodo.org
Updated Jul 10, 2020
Cite
Vincent Lenders (2020). Reference datasets for in-flight emergency situations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3937482
Dataset updated
Jul 10, 2020
Dataset provided by
Allan Tart
Ivan Martinovic
Martin Strohmeier
Matthias Schäfer
Xavier Olive
Axel Tanner
Metin Feridun
Vincent Lenders
Description
Motivation

The data in this dataset is derived and cleaned from the full OpenSky dataset in order to illustrate in-flight emergency situations triggering the 7700 transponder code. It spans flights seen by the network's more than 2500 members between 1 January 2018 and 29 January 2020.

The dataset complements the following publication:

Xavier Olive, Axel Tanner, Martin Strohmeier, Matthias Schäfer, Metin Feridun, Allan Tart, Ivan Martinovic and Vincent Lenders. "OpenSky Report 2020: Analysing in-flight emergencies using big data". In 2020 IEEE/AIAA 39th Digital Avionics Systems Conference (DASC), October 2020

License

See LICENSE.txt

Disclaimer

The data provided in the files is provided as is. Despite our best efforts at filtering out potential issues, some information could be erroneous.

Most aircraft information comes from the OpenSky aircraft database and has been completed with manual research from various sources on the Internet. Most information about flight plans has been automatically fetched and processed using open APIs; some manual processing was required to cross-check, correct erroneous information and fill in missing information.

Description of the dataset

Two files are provided in the dataset:

one compressed parquet file with trajectory information;

one metadata CSV file with the following features:

flight_id: a unique identifier for each trajectory;

callsign: ICAO flight callsign information;

number: IATA flight number, when available;

icao24, registration, typecode: information about the aircraft;

origin: the origin airport for the aircraft, when available;

landing: the airport where the aircraft actually landed, when available;

destination: the intended destination airport, when available;

diverted: the diversion airport, if applicable, when available;

tweet_problem, tweet_result, tweet_fueldump: information extracted from Twitter accounts, about the nature of the issue, the consequence of the emergency and whether the aircraft is known to have dumped fuel;

avh_id, avh_problem, avh_result, avh_fueldump: information extracted from The Aviation Herald, about the nature of the issue, the consequence of the emergency and whether the aircraft is known to have dumped fuel. The complete URL for each event is https://avherald.com/h?article={avh_id}&opt=1 (replace avh_id by the actual value)
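
The URL template given for The Aviation Herald can be applied to a metadata row as sketched below. The sample identifier is a made-up placeholder, and treating an empty value as "no article" is an assumption about how missing entries appear in the CSV.

```python
def avherald_url(avh_id):
    """Build the Aviation Herald article URL for an avh_id, or None if missing."""
    if not avh_id:
        # Assumption: rows without an Aviation Herald entry have an empty avh_id.
        return None
    return f"https://avherald.com/h?article={avh_id}&opt=1"


# "0123abcd" is a placeholder, not a real article identifier.
print(avherald_url("0123abcd"))
print(avherald_url(""))
```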

Examples

Additional analyses and visualisations of the data are available at the following page:

Credit

If you use this dataset, please cite the original OpenSky paper:

Xavier Olive, Axel Tanner, Martin Strohmeier, Matthias Schäfer, Metin Feridun, Allan Tart, Ivan Martinovic and Vincent Lenders. "OpenSky Report 2020: Analysing in-flight emergencies using big data". In 2020 IEEE/AIAA 39th Digital Avionics Systems Conference (DASC), October 2020

Matthias Schäfer, Martin Strohmeier, Vincent Lenders, Ivan Martinovic and Matthias Wilhelm. "Bringing Up OpenSky: A Large-scale ADS-B Sensor Network for Research". In Proceedings of the 13th IEEE/ACM International Symposium on Information Processing in Sensor Networks (IPSN), pages 83-94, April 2014.

and the traffic library used to derive the data:

Xavier Olive. "traffic, a toolbox for processing and analysing air traffic data." Journal of Open Source Software 4(39), July 2019.
Network Traffic Dataset
kaggle.com
Updated Oct 31, 2023
Cite
Ravikumar Gattu (2023). Network Traffic Dataset [Dataset]. https://www.kaggle.com/datasets/ravikumargattu/network-traffic-dataset
Dataset updated
Oct 31, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ravikumar Gattu
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The data presented here was obtained on a Kali machine at the University of Cincinnati, Cincinnati, Ohio, by carrying out packet captures for 1 hour during the evening of Oct 9th, 2023 using Wireshark. The dataset consists of 394137 instances stored in a CSV (Comma Separated Values) file. This large dataset could be utilised for different machine learning applications, for instance classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection and anomaly detection.

The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.

Content :

This network traffic dataset consists of 7 features. Each instance contains the source and destination IP addresses. The majority of the properties are numeric in nature; however, there are also nominal and date types due to the Timestamp.

The network traffic flow statistics (No. Time Source Destination Protocol Length Info) were obtained using Wireshark (https://www.wireshark.org/).

Dataset Columns:

No: Number of the instance

Timestamp: Timestamp of the instance of network traffic

Source IP: IP address of the source

Destination IP: IP address of the destination

Protocol: Protocol used by the instance

Length: Length of the instance

Info: Information about the traffic instance
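
As a quick first look at a capture with these columns, the number of packets per protocol can be counted with the standard library. The sketch assumes the CSV uses Wireshark's default export header (No., Time, Source, Destination, Protocol, Length, Info); the header of the published file may differ, and the sample rows below are invented.

```python
import csv
import io
from collections import Counter


def packets_per_protocol(csv_file):
    """Count rows per value of the Protocol column in a Wireshark CSV export."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Protocol"] for row in reader)


# Invented sample rows using Wireshark's default column names.
sample = io.StringIO(
    '"No.","Time","Source","Destination","Protocol","Length","Info"\n'
    '"1","0.000000","10.0.0.5","10.0.0.1","DNS","74","Standard query"\n'
    '"2","0.010000","10.0.0.5","192.0.2.10","TCP","66","SYN"\n'
    '"3","0.020000","192.0.2.10","10.0.0.5","TCP","66","SYN, ACK"\n'
)

counts = packets_per_protocol(sample)
print(counts.most_common())
```

For the real file, pass an open file handle instead, for example one obtained from `open(path, newline="")`.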

Acknowledgements :

I would like to thank the University of Cincinnati for providing the infrastructure for the generation of this network traffic dataset.

Ravikumar Gattu , Susmitha Choppadandi

Inspiration: This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP, ARP, RARP) that an IP flow contains. Instead, it supports machine learning models that can identify specific applications (like TikTok, Wikipedia, Instagram, YouTube, websites, blogs etc.) from IP flow statistics (there are currently 25 applications in total).

Dataset License: CC0: Public Domain

Dataset Usages: This dataset can be used for different machine learning applications in the field of cybersecurity, such as classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection and anomaly detection.

ML techniques that benefit from this dataset:

This dataset is highly useful because it consists of 394137 instances of network traffic data obtained by using the 25 applications on public, private and enterprise networks. The dataset also contains important features that can be used for most applications of machine learning in cybersecurity. A few of the potential machine learning applications that could benefit from this dataset are:

1. Network Performance Monitoring: This large network traffic dataset can be utilised for analysing network traffic to identify patterns in the network. This helps in designing network security algorithms that minimise network problems.

2. Anomaly Detection: This large network traffic dataset can be utilised for training machine learning models to find irregularities in the traffic, which could help identify cyber attacks.

3. Network Intrusion Detection: This large dataset could be utilised for training machine learning algorithms and designing models for the detection of traffic issues, malicious traffic, network attacks and DoS attacks.
Wake Vision
dataverse.harvard.edu
tensorflow.org
+1more
Updated May 29, 2024
Colby Banbury; Emil Njor; Matthew Stewart; Pete Warden; Manjunath Kudlur; Nat Jeffries; Andrew Howard; Xenofon Fafoutis; Vijay Reddi (2024). Wake Vision [Dataset]. http://doi.org/10.7910/DVN/1HOPXC
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/1HOPXC
Dataset updated
May 29, 2024
Dataset provided by
Harvard Dataverse
Authors
Colby Banbury; Emil Njor; Matthew Stewart; Pete Warden; Manjunath Kudlur; Nat Jeffries; Andrew Howard; Xenofon Fafoutis; Vijay Reddi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
"Wake Vision" is a large, high-quality dataset featuring over 6 million images, significantly exceeding the scale and diversity of current tinyML datasets (100x). The dataset contains images with annotations of whether each image contains a person. Additionally, the dataset incorporates a comprehensive fine-grained benchmark to assess fairness and robustness, covering perceived gender, perceived age, subject distance, lighting conditions, and depictions. This dataset, hosted on Harvard Dataverse, contains images, CSV files, and code to generate a Wake Vision TensorFlow Dataset. We publish the annotations of this dataset under a CC BY 4.0 license. All images in the dataset are from the Open Images v7 dataset, which sources its images from Flickr, and are listed as having a CC BY 2.0 license.
FHFA Data: Uniform Appraisal Dataset Aggregate Statistics
datalumos.org
openicpsr.org
Updated Feb 18, 2025
+ more versions
Federal Housing Finance Agency (2025). FHFA Data: Uniform Appraisal Dataset Aggregate Statistics [Dataset]. http://doi.org/10.3886/E219961V1
Explore at:
Unique identifier
https://doi.org/10.3886/E219961V1
Dataset updated
Feb 18, 2025
Dataset authored and provided by
Federal Housing Finance Agency (https://www.fhfa.gov/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
2013 - 2024
Area covered
United States of America
Description
The Uniform Appraisal Dataset (UAD) Aggregate Statistics Data File and Dashboards are the nation’s first publicly available datasets of aggregate statistics on appraisal records, giving the public new access to a broad set of data points and trends found in appraisal reports. The UAD Aggregate Statistics for Enterprise Single-Family, Enterprise Condominium, and Federal Housing Administration (FHA) Single-Family appraisals may be grouped by neighborhood characteristics, property characteristics and different geographic levels.

Documentation
Overview (10/28/2024)
Data Dictionary (10/28/2024)
Data File Version History and Suppression Rates (12/18/2024)
Dashboard Guide (2/3/2025)

UAD Aggregate Statistics Dashboards
The UAD Aggregate Statistics Dashboards are the visual front end of the UAD Aggregate Statistics Data File. The Dashboards are designed to provide easy access to customized maps and charts for all levels of users. Access the UAD Aggregate Statistics Dashboards here.

UAD Aggregate Statistics Datasets
Notes:
Some of the data files are relatively large in size and will not open correctly in certain software packages, such as Microsoft Excel. All the files can be opened and used in data analytics software such as SAS, Python, or R.
All CSV files are zipped.
Dataset for Design Ideation Study
dataverse.azure.uit.no
dataverse.no
application/x-h5, pdf +3
Updated Feb 28, 2024
+ more versions
Filip Gornitzka Abelson; Henrikke Dybvik; Martin Steinert (2024). Dataset for Design Ideation Study [Dataset]. http://doi.org/10.18710/PZQC4A
Explore at:
Available download formats: tsv(7501), txt(13093), application/x-h5(25860340), application/x-h5(286920385), zip(581532), tsv(295160), application/x-h5(540715825), tsv(767327), application/x-h5(49209334), application/x-h5(510702725), tsv(1336354), tsv(2010), tsv(1935109), pdf(33267), application/x-h5(272694817)
Unique identifier
https://doi.org/10.18710/PZQC4A
Dataset updated
Feb 28, 2024
Dataset provided by
DataverseNO
Authors
Filip Gornitzka Abelson; Henrikke Dybvik; Martin Steinert
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Study information
Design ideation study (N = 24) using eye tracking technology. Participants solved a total of twelve design problems while receiving inspirational stimuli on a monitor. Their task was to generate as many solutions as possible to each problem and to explain each solution briefly by thinking aloud. The study allows for further insight into how inspirational stimuli improve idea fluency during design ideation. This dataset features processed data from the experiment. Eye tracking data includes gaze data, fixation data, blink data, and pupillometry data for all participants.

The study is based on the following research paper and follows the same experimental setup: Goucher-Lambert, K., Moss, J., & Cagan, J. (2019). A neuroimaging investigation of design ideation with and without inspirational stimuli—understanding the meaning of near and far stimuli. Design Studies, 60, 1-38.

Dataset
Most files in the dataset are saved as CSV files or other human readable file formats. Large files are saved in Hierarchical Data Format (HDF5/H5) to allow for smaller file sizes and higher compression. All data is described thoroughly in 00_ReadMe.txt. The following processed data is included in the dataset:
Concatenated annotations file of experimental flow for all participants (CSV).
All eye tracking raw data in concatenated files, annotated with participant ID only (CSV/HDF5).
Annotated eye tracking data for ideation routines only, a subset of the files above (CSV/HDF5).
Audio transcriptions from Google Cloud Speech-to-Text API of each recording with annotations (CSV).
Raw API response for each transcription. These files include the time offset for each word in a recording (JSON).
Data for questionnaire feedback and ideas generated during the experiment (CSV).
Data for the post-experiment survey, including demographic information (TSV).

Python code used for the open-source experimental setup and dataset construction is hosted at GitHub. The repository also includes code showing how the dataset has been further processed.
Spanish Open Ended Question Answer Text Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
FutureBee AI (2022). Spanish Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/spanish-open-ended-question-answer-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
The Spanish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Spanish language, advancing the field of artificial intelligence.
Dataset Content:
This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Spanish. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Spanish speakers, drawing on diverse sources such as books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:
To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:
To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph answers. Answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:
This fully labeled Spanish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
Quality and Accuracy:
The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the questions and answers are written in grammatically accurate Spanish, free of spelling and grammatical errors. No copyrighted, toxic, or harmful content was used while building this dataset.
Continuous Updates and Customization:
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:
The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Spanish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
McBE
huggingface.co
Updated Aug 9, 2025
Velikaya Scarlet (2025). McBE [Dataset]. https://huggingface.co/datasets/Velikaya/McBE
Explore at:
Dataset updated
Aug 9, 2025
Authors
Velikaya Scarlet
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
ATTENTION: There are two types of data format files here: CSV and XLSX. The CSV files are uploaded for easy browsing of the data on Hugging Face. For actual testing, please use the files in the XLSX folder.


McBE is designed to address the scarcity of Chinese-centric bias evaluation resources for large language models (LLMs). It supports multi-faceted bias assessment across 5 evaluation tasks… See the full description on the dataset page: https://huggingface.co/datasets/Velikaya/McBE.
Expenditure in the Salisbury NHS (V2)
kaggle.com
zip
Updated Sep 27, 2020
Deepak Tejasvi Singh (2020). Expenditure in the Salisbury NHS (V2) [Dataset]. https://www.kaggle.com/deepaktejasvisingh/expenditure-in-the-salisbury-nhs-v2
Explore at:
Available download formats: zip (5364575 bytes)
Dataset updated
Sep 27, 2020
Authors
Deepak Tejasvi Singh
Description
Context

To create this dataset, I first accessed government datasets from https://data.gov.uk/dataset/88c0ff75-0efb-4e9b-b8d5-2282eb03efb8/spend-over-25-000-in-salisbury-nhs-foundation-trust/

These datasets contained monthly records of spending in the Salisbury NHS Foundation Trust. I merged and cleaned all the datasets to produce one large CSV file.

Content

This dataset contains all details of expenditure from 2010 to 2020 by the Salisbury NHS Foundation Trust. There are a few months missing due to issues with data.gov.uk.

Acknowledgements

Our thanks to the Salisbury NHS Foundation Trust for collecting the original data.

License: Open Government License, http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

Inspiration

This is the first open dataset for the Salisbury Open Data Project. We welcome any and every possible exploration of this data.
S&P 500 stock data
kaggle.com
zip
Updated Aug 11, 2017
Cam Nugent (2017). S&P 500 stock data [Dataset]. https://www.kaggle.com/camnugent/sandp500
Explore at:
Available download formats: zip (31994392 bytes)
Dataset updated
Aug 11, 2017
Authors
Cam Nugent
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Stock market data can be interesting to analyze and, as a further incentive, strong predictive models can have a large financial payoff. The amount of financial data on the web is seemingly endless, yet a large and well-structured dataset on a wide array of companies can be hard to come by. Here I provide a dataset with historical stock prices (last 5 years) for all companies currently found on the S&P 500 index.

The script I used to acquire all of these .csv files can be found in this GitHub repository. In the future, if you wish for a more up-to-date dataset, this can be used to acquire new versions of the .csv files.

Content

The data is presented in a couple of formats to suit different individuals' needs or computational limitations. I have included files containing 5 years of stock data (in the all_stocks_5yr.csv and corresponding folder) and a smaller version of the dataset (all_stocks_1yr.csv) with only the past year's stock data for those wishing to use something more manageable in size.

The folder individual_stocks_5yr contains files of data for individual stocks, labelled by their stock ticker name. The all_stocks_5yr.csv and all_stocks_1yr.csv contain this same data, presented in merged .csv files. Depending on the intended use (graphing, modelling etc.) the user may prefer one of these given formats.

All the files have the following columns:
Date - in format: yy-mm-dd
Open - price of the stock at market open (this is NYSE data so all in USD)
High - highest price reached in the day
Low - lowest price reached in the day
Close - price of the stock at market close
Volume - number of shares traded
Name - the stock's ticker name

Acknowledgements

I scraped this data from Google Finance using the Python library 'pandas_datareader'. Special thanks to Kaggle, GitHub and The Market.

Inspiration

This dataset lends itself to some very interesting visualizations. One can look at simple things like how prices change over time, graph and compare multiple stocks at once, or generate and graph new metrics from the data provided. From these data, informative stock stats such as volatility and moving averages can be easily calculated. The million dollar question is: can you develop a model that can beat the market and allow you to make statistically informed trades?
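The two statistics mentioned above can be sketched with the Python standard library alone. This is a minimal illustration, not the author's method: it takes volatility to mean the sample standard deviation of daily log returns (one common definition among several), and the closing prices are invented for the example rather than taken from the dataset.

```python
import math
import statistics


def moving_average(values, window):
    """Simple moving average; yields one value per full window."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def daily_log_returns(closes):
    """Log return between each pair of consecutive closing prices."""
    return [math.log(b / a) for a, b in zip(closes, closes[1:])]


def volatility(closes):
    """Sample standard deviation of daily log returns (assumed definition)."""
    return statistics.stdev(daily_log_returns(closes))


# Invented closing prices, standing in for one ticker's Close column.
closes = [10.0, 11.0, 12.0, 11.0, 13.0]
sma3 = moving_average(closes, 3)
print(sma3)
print(volatility(closes))
```

With real data, the `closes` list would be the Close column for a single ticker, sorted by date.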


Department of Energy (2017). Randomized Hourly Load Data for use with Taxonomy Distribution Feeders [Dataset]. https://data.wu.ac.at/schema/data_gov/NWYwYmFmYTItOWRkMC00OWM0LTk3OGYtZDcyYzZiOWY5N2Ez

Randomized Hourly Load Data for use with Taxonomy Distribution Feeders

Explore at:

4 scholarly articles cite this dataset (View in Google Scholar)

Available download formats: application/unknown

Dataset updated

Aug 29, 2017

Dataset provided by

Department of Energy

License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder’s geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1].

The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.

This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000.

For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and the corresponding datetime values go in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to be written as an absolute time; subsequent times may be written in relative format (i.e. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of watts (W) and volt-amperes reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.
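The conversion described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the script used by the dataset authors: the bus names, the absolute timestamp format, the 0.95 power factor, and the assumption that the feeder CSV values are already in watts are all hypothetical, so check sample_individual_load_file.csv for the exact layout. Reactive power is derived as Q = P * tan(acos(pf)).

```python
import csv
import io
import math


def reactive_power(p_watts, power_factor):
    """Q in VAr for real power P in W at the assumed power factor."""
    return p_watts * math.tan(math.acos(power_factor))


def column_to_rows(feeder_csv_text, bus_name, start_time, power_factor=0.95):
    """Build headerless (time, 'P+Qj') rows for one load bus column."""
    reader = csv.DictReader(io.StringIO(feeder_csv_text))
    rows = []
    for i, record in enumerate(reader):
        p = float(record[bus_name])
        q = reactive_power(p, power_factor)
        # Only the first timestamp is absolute; later ones are relative.
        timestamp = start_time if i == 0 else "+1h"
        rows.append((timestamp, f"{p:.1f}{q:+.1f}j"))
    return rows


def write_rows(rows, path):
    """Write the rows to a CSV file with no header line."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


# Tiny invented stand-in for a feeder CSV (real files have 8760 data rows).
sample = "rl_1,cl_2_A\n1000,2000\n1200,1800\n900,2100\n"
rows = column_to_rows(sample, "rl_1", "2009-01-01 00:00:00")
for row in rows:
    print(row)
```

Calling `write_rows(rows, "rl_1.csv")` would then produce one headerless file for that bus; repeating the call for every column in the header yields one file per load point.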

Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed.

For questions about this dataset, contact andy.hoke@nrel.gov.

If you find this dataset useful, please mention NREL and cite [1] in your work.

References:

[1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, “Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders,” IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275 .

[2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, “Modern Grid Initiative Distribution Taxonomy Final Report”, PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf

[3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, “Distribution power flow for smart grid technologies”, IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.
