36 datasets found

Language Generation Dataset: 200M Samples
kaggle.com
zip
Updated Sep 7, 2019
Cite
Abhishek Chatterjee (2019). Language Generation Dataset: 200M Samples [Dataset]. https://www.kaggle.com/datasets/imdeepmind/language-generation-dataset-200m-samples
Available download formats: zip (3416608411 bytes)
Dataset updated
Sep 7, 2019
Authors
Abhishek Chatterjee
Description
Context

Amazon Customer Reviews Dataset is a dataset of user-generated product reviews on the shopping website Amazon. It contains over 130 million product reviews.

This dataset contains a tiny fraction of that dataset processed and prepared specifically for language generation.

To see how the dataset was prepared, check the GitHub repository for this dataset: https://github.com/imdeepmind/AmazonReview-LanguageGenerationDataset

Content

The dataset is stored in an SQLite database. The database contains one table called reviews. This table contains two columns: sequence and next.

The sequence column contains sequences of characters. In this dataset, each sequence is 40 characters long.

The next column contains the next character after the sequence.

There are about 200 million samples in the dataset.
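
As a rough illustration of the layout described above, the following sketch reads (sequence, next) pairs with Python's built-in sqlite3 module. The database file name is not given in the description, so the demo builds a small in-memory stand-in with the same table and column names; the sample text and the helper names fetch_samples and make_demo_db are hypothetical.

```python
import sqlite3


def fetch_samples(conn, limit=5):
    """Return up to `limit` (sequence, next) pairs from the reviews table."""
    # "next" is quoted defensively so it is always parsed as a column name.
    cur = conn.execute('SELECT sequence, "next" FROM reviews LIMIT ?', (limit,))
    return cur.fetchall()


def make_demo_db():
    """Build a tiny in-memory database with the same table layout (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE reviews (sequence TEXT, "next" TEXT)')
    text = "this product works exactly as described and arrived on time"
    # Slide a 40-character window over the text; the following character is the target.
    rows = [(text[i:i + 40], text[i + 40]) for i in range(len(text) - 40)]
    conn.executemany("INSERT INTO reviews VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    # For the real data, replace this with sqlite3.connect("<path to the database file>").
    conn = make_demo_db()
    for seq, nxt in fetch_samples(conn, limit=3):
        print(repr(seq), "->", repr(nxt))
```

With roughly 200 million rows, iterating over the cursor in batches is likely preferable to calling fetchall() on the whole table.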

Acknowledgements

Thanks to Amazon for making this awesome dataset. Here is the link for the dataset: https://s3.amazonaws.com/amazon-reviews-pds/readme.html

Inspiration

This dataset can be used for Language Generation. As it contains 200 million samples, complex Deep Learning models can be trained on this data.
instruction-dataset-mini-with-generations
huggingface.co
+ more versions
Cite
Abdoulaye Diallo, instruction-dataset-mini-with-generations [Dataset]. https://huggingface.co/datasets/vonewman/instruction-dataset-mini-with-generations
Available format: Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Abdoulaye Diallo
Description
Dataset Card for instruction-dataset-mini-with-generations

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/vonewman/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info… See the full description on the dataset page: https://huggingface.co/datasets/vonewman/instruction-dataset-mini-with-generations.
Dataset of books called The M-factor : how the millennial generation is...
workwithdata.com
Updated Apr 17, 2025
Cite
Work With Data (2025). Dataset of books called The M-factor : how the millennial generation is rocking the workplace [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=The+M-factor+%3A+how+the+millennial+generation+is+rocking+the+workplace
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 2 rows and is filtered where the book is The M-factor : how the millennial generation is rocking the workplace. It features 7 columns including author, publication date, language, and book publisher.
ManimBench v1
kaggle.com
huggingface.co
Updated Jun 5, 2025
Cite
Ravidu Silva (2025). ManimBench v1 [Dataset]. https://www.kaggle.com/datasets/ravidussilva/manim-sft
Available format: Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 5, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ravidu Silva
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
📚 ManimBench Dataset v1

Overview

ManimBench v1 is a curated dataset designed to support fine-tuning and evaluation of models that generate or interpret Manim animations. It pairs Manim code snippets with natural language descriptions, enabling research in code generation, animation synthesis, and multimodal understanding.

🔗 GitHub Repository for Fine-Tuning: SuienS/manim-fine-tune

📄 Research Paper: Coming Soon

The dataset can also be accessed directly from the HuggingFace Hub.

🧠 Use Cases

Fine-tuning LLMs for code generation or animation synthesis

Benchmarking natural language to animation tasks

Studying alignment between code and human-readable descriptions
CMS 2011A Open Data | Jet Primary Dataset | pT > 375 GeV | MOD HDF5 Format
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Cite
Mastandrea, Radha (2020). CMS 2011A Open Data | Jet Primary Dataset | pT > 375 GeV | MOD HDF5 Format [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3340204
Dataset updated
Jan 24, 2020
Dataset provided by
Komiske, Patrick
Naik, Preksha
Thaler, Jesse
Metodiev, Eric
Mastandrea, Radha
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A dataset of 1,785,625 jets from the Jet Primary Dataset of the CMS 2011A Open Data reprocessed into the MOD HDF5 format. Jets are selected from the hardest two anti-kT R=0.5 jets in events passing the Jet300 High Level Trigger and are required to have \(p_T^{\text{jet}} > 375\) GeV, where \(p_T^{\text{jet}}\) includes a jet energy correction factor. Particle Flow Candidates (PFCs) for each jet are provided and include information about the PFC kinematics, PDG ID, and vertex. Additionally, jets have metadata describing their kinematics and provenance in the original CMS AOD files.

For additional details about the dataset, please see the accompanying paper, Exploring the Space of Jets with CMS Open Data. There, jets were further restricted to have \(|\eta^{\text{jet}}| < 1.9\) to ensure tracking coverage and have "medium" quality to reject fake jets.

The supported method for downloading, reading, and using this dataset is through the EnergyFlow Python package, which has additional documentation about how to read and use this and related datasets. Should any problems be encountered, please submit an issue on GitHub.

There are corresponding datasets of simulated jets organized by hard parton \(\hat{p}_T\) also available on Zenodo:

SIM/GEN QCD Jets 170-300 GeV

SIM/GEN QCD Jets 300-470 GeV

SIM/GEN QCD Jets 470-600 GeV

SIM/GEN QCD Jets 600-800 GeV

SIM/GEN QCD Jets 800-1000 GeV

SIM/GEN QCD Jets 1000-1400 GeV

SIM/GEN QCD Jets 1400-1800 GeV

SIM/GEN QCD Jets 1800-\(\infty\) GeV
General Near Surface Ocean Current - Dataset - data.gov.ie
data.gov.ie
Updated Nov 11, 2016
Cite
data.gov.ie (2016). General Near Surface Ocean Current - Dataset - data.gov.ie [Dataset]. https://data.gov.ie/dataset/general-near-surface-ocean-current
Dataset updated
Nov 11, 2016
Dataset provided by
data.gov.ie
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
An ocean current is a continuous, directed movement of seawater generated by forces acting upon this mean flow, such as breaking waves, wind, the Coriolis effect, cabbeling, and temperature and salinity differences, while tides are caused by the gravitational pull of the Sun and Moon. Depth contours, shoreline configurations, and interactions with other currents influence a current's direction and strength. Ocean currents flow for great distances and together create the global conveyor belt, which plays a dominant role in determining the climate of many of the Earth's regions. More specifically, ocean currents influence the temperature of the regions through which they travel. General near surface ocean current data was provided by the Petroleum Affairs Division. Data was created as part of the Irish Offshore Strategic Environmental Assessment (IOSEA).
NLUCat
zenodo.org
huggingface.co
+1 more
zip
Updated Mar 4, 2024
Cite
Zenodo (2024). NLUCat [Dataset]. http://doi.org/10.5281/zenodo.10721193
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.10721193
Dataset updated
Mar 4, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
NLUCat

Dataset Description

Dataset Summary

NLUCat is a dataset of NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is accompanied, in addition, by the instructions received by the annotator who wrote it.

The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IOT, list management, leisure, etc.), but specific ones have also been added to take into account social and healthcare needs for vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.

The examples are not only written in Catalan, but they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.)

This dataset can be used to train models for intent classification, spans identification and examples generation.

This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.

In this repository you'll find the following items:

NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team

NLUCat_dataset.json: the completed NLUCat dataset

NLUCat_stats.tsv: statistics about the NLUCat dataset

dataset: folder with the dataset as published in HuggingFace, split and prepared for training and evaluating intent classifiers

reports: folder with the reports done as feedback to the annotators during the annotation process

This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.

Supported Tasks and Leaderboards

Intent classification, spans identification and examples generation.

Languages

The dataset is in Catalan (ca-ES).

Dataset Structure

Data Instances

Three JSON files, one for each split.

Data Fields

example: `str`. Example

annotation: `dict`. Annotation of the example

intent: `str`. Intent tag

slots: `list`. List of slots

Tag: `str`. Tag of the slot

Text: `str`. Text of the slot

Start_char: `int`. First character of the span

End_char: `int`. End of the span (in the example below, the offset just after the last character)

Example

An example looks as follows:

{
  "example": "Demana una ambulància; la meva dona està de part.",
  "annotation": {
    "intent": "call_emergency",
    "slots": [
      {
        "Tag": "service",
        "Text": "ambulància",
        "Start_char": 11,
        "End_char": 21
      },
      {
        "Tag": "situation",
        "Text": "la meva dona està de part",
        "Start_char": 23,
        "End_char": 48
      }
    ]
  }
},
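
In the example above, the offsets behave like Python slice bounds: slicing the example text from Start_char to End_char yields the slot Text. The sketch below checks that property for one record. The helper name check_slots is hypothetical, and whether every record in the dataset follows the same convention is an assumption that has not been verified here.

```python
def check_slots(record):
    """Return (tag, ok) pairs; ok is True when the offsets reproduce the slot Text."""
    text = record["example"]
    results = []
    for slot in record["annotation"]["slots"]:
        # Treat End_char as an exclusive bound, as the example record suggests.
        span = text[slot["Start_char"]:slot["End_char"]]
        results.append((slot["Tag"], span == slot["Text"]))
    return results


# The record shown in the dataset card, copied as a Python dict.
record = {
    "example": "Demana una ambulància; la meva dona està de part.",
    "annotation": {
        "intent": "call_emergency",
        "slots": [
            {"Tag": "service", "Text": "ambulància",
             "Start_char": 11, "End_char": 21},
            {"Tag": "situation", "Text": "la meva dona està de part",
             "Start_char": 23, "End_char": 48},
        ],
    },
}

print(check_slots(record))
```

A check like this can be run over NLUCat_dataset.json after loading it with the json module to flag any slots whose offsets and text disagree.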

Data Splits

NLUCat.train: 9128 examples

NLUCat.dev: 1441 examples

NLUCat.test: 1441 examples

Dataset Creation

Curation Rationale

We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.

Source Data

Initial Data Collection and Normalization

We commissioned a company to create fictitious examples for the creation of this dataset.

Who are the source language producers?

We commissioned the writing of the examples to the company m47 labs.

Annotations

Annotation process

The elaboration of this dataset has been done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
* First step: translation or elaboration of the instructions given to the annotators to write the examples.
* Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
* Third step: recording the intents and the slots of each example. In this step, some modifications were made to the annotation guides to adjust them to the real situations.

Who are the annotators?

The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

Personal and Sensitive Information

No personal or sensitive information included.

The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.

Considerations for Using the Data

Social Impact of Dataset

We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.

Discussion of Biases

When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

Licensing Information

This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0.
Give appropriate credit, provide a link to the license, and indicate if changes were made.

Citation Information

DOI

Contributions

The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.
PV Generation and Consumption Dataset of an Estonian Residential Dwelling
data.taltech.ee
Updated Mar 22, 2025
Cite
Sayeed Hasan; Andrei Blinov; Andrii Chub; Dmitri Vinnikov (2025). PV Generation and Consumption Dataset of an Estonian Residential Dwelling [Dataset]. http://doi.org/10.48726/6hayh-x0h25
Unique identifier
https://doi.org/10.48726/6hayh-x0h25
Dataset updated
Mar 22, 2025
Dataset provided by
TalTech Data Repository
Authors
Sayeed Hasan; Andrei Blinov; Andrii Chub; Dmitri Vinnikov
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Estonia
Description
This is a residential PV generation and consumption dataset from an Estonian house. At the time of submission, one year (2023) of data was available. The data was logged at a 10-second resolution. The untouched dataset can be found in the raw data folder, which is organized by month. A few missing points in the dataset were filled with a simple KNN algorithm. However, improved data imputation methods based on machine learning are also possible. To carry out the imputation, run the scripts in the script folder one by one in numerical order (SC1..py, SC2..py, etc.).

Data Descriptor (Scientific Data): https://doi.org/10.1038/s41597-025-04747-w

General Information:

Duration: January 2023 – December 2023

Resolution: 10 seconds

Dataset Type: Aggregated consumption and PV generation data

Logging Device: Camille Bauer PQ1000 (×2)

Load/Appliance Information:

5 kW Rooftop PV array connected to AC Bus via 4.2 kW 3-ϕ Inverter

Air conditioner: 0.44 kW (Cooling), 0.62 kW (Heating)

Air to Water (ATW) Heat Pump: 2.5 kW (Cooling), 2.6 kW (Heating)

ATW Cylinder unit: 0.21 kW (Controller), 9 kW (Booster Heater)

Microwave oven: 0.9 kW

Coffee Maker: 1 kW

Cooktop Hot Plate: 4.6 kW

TV: 0.103 kW

Vacuum Cleaner: 1.5 kW

Ventilation: 0.1 kW

Washing Machine: 2.2 kW

Electric Sauna: 10 kW

Lighting: 0.25 kW

EV charger: 2.4 kW 1-ϕ

Measurement Points:

PV converter-side current transformer, potential transformer (Measurement of PV generation).

Utility meter-side current transformer, potential transformer (Measurement of power exchange with the grid).

Measured Parameters:

Per-phase mean power recorded within the sampling period

Per-phase minimum power recorded within the sampling period

Per-phase maximum power recorded within the sampling period

Quadrant-wise mean power recorded within the sampling period (1st + 3rd), (2nd + 4th)

Quadrant-wise minimum power recorded within the sampling period (1st + 3rd), (2nd + 4th)

Quadrant-wise maximum power recorded within the sampling period (1st + 3rd), (2nd + 4th)

Mean power factor recorded within the sampling period

Minimum power factor recorded within the sampling period

Maximum power factor recorded within the sampling period

System Voltage

Minimum system Voltage

Maximum system Voltage

Mean Voltage between phase and neutral

Minimum voltage between phase and neutral

Maximum voltage between phase and neutral

Zero displacement voltage 4-wire systems (mean, min, max)

Script Description:

SC1_PV_auto_sort.py : This fixes timestamp continuity by resampling at the original sampling rate for PV generation data.

SC2_L2_auto_sort.py : This fixes timestamp continuity by resampling at the original sampling rate for meter-side measurement data.

SC3_PV_KNN_impute.py : Filling missing data points by simple KNN for PV generation data.

SC4_L2_KNN_impute.py : Filling missing data points by simple KNN for meter-side measurement data.

SC5_Final_data_gen.py : Merge PV and meter-side measurement data, and calculate load consumption.

The dataset provides all the outcomes (CSV files) from the scripts. All processed variables (PV generation, load, power import, and export) are expressed in kW units.

Update: 'SC1_PV_auto_sort.py' & 'SC2_L2_auto_sort.py' are adequate for cleaning up data and making the missing points visible. 'SC3_PV_KNN_impute.py' & 'SC4_L2_KNN_impute.py' work fine for short runs of missing data points; however, these two scripts won't help much when data is missing over longer periods. They are provided as examples of one method of processing data. Future updates will include proper ML-based forecasting to predict missing data points.
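
To illustrate what making missing points visible at the original sampling rate can mean, here is a minimal standard-library sketch. It is not the authors' SC1/SC2 code: the function names regularize and gap_lengths are hypothetical, and the sketch assumes timestamps fall exactly on a 10-second grid. Measuring the length of each gap helps decide whether a short-range method such as KNN imputation is plausible for it.

```python
from datetime import datetime, timedelta

STEP = timedelta(seconds=10)  # logging resolution stated in the description


def regularize(samples, start, end):
    """Map (timestamp, value) samples onto a 10-second grid from start to end inclusive.

    Grid points with no sample get None, so gaps become explicit.
    """
    by_time = dict(samples)
    grid = []
    t = start
    while t <= end:
        grid.append((t, by_time.get(t)))
        t += STEP
    return grid


def gap_lengths(grid):
    """Return the length, in samples, of each run of consecutive missing values."""
    runs, current = [], 0
    for _, value in grid:
        if value is None:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


# Hypothetical readings in kW with two missing points (at +20 s and +30 s).
start = datetime(2023, 1, 1, 0, 0, 0)
samples = [(start + timedelta(seconds=s), kw)
           for s, kw in [(0, 1.2), (10, 1.3), (40, 1.1), (50, 1.0)]]
grid = regularize(samples, start, start + timedelta(seconds=50))
print(gap_lengths(grid))
```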

Funding Agency and Grant Number:

European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement no. 955614.

Estonian Research Council under Grant PRG1086.

Estonian Centre of Excellence in Energy Efficiency, ENER, funded by the Estonian Ministry of Education and Research under Grant TK230.
IBM Debater® - Recorded Debating Dataset - Release #4 (Full version) +...
research.ibm.com
Updated Sep 25, 2017
+ more versions
Cite
(2017). IBM Debater® - Recorded Debating Dataset - Release #4 (Full version) + Annotated general-purpose claim-rebuttal pairs [Dataset]. https://research.ibm.com/haifa/dept/vst/debating_data.shtml
Dataset updated
Sep 25, 2017
Description
200 speeches recorded by professional debaters discussing 50 controversial topics (with their manual and automatic transcriptions), and 55 general-purpose claim-rebuttal pairs, along with the results of several annotation experiments performed on these data.

The dataset includes:
- Audio files of 200 debating speeches. [first released in IBM Debater® - Recorded Debating Dataset - Release #2]
- Manual and automatic transcripts of the speeches, in both raw and cleaned (processed) versions. [first released in IBM Debater® - Recorded Debating Dataset - Release #2]
- 55 general-purpose claim-rebuttal pairs written by an expert human debater
- The results of several annotation experiments performed using the general-purpose claim-rebuttal pairs and the speeches

Size: 3.2 GB
Data from: International Climate Benchmarks and Input Parameters for a...
catalog.data.gov
agdatacommons.nal.usda.gov
Updated Jun 5, 2025
Cite
Agricultural Research Service (2025). International Climate Benchmarks and Input Parameters for a Stochastic Weather Generator, CLIGEN [Dataset]. https://catalog.data.gov/dataset/international-climate-benchmarks-and-input-parameters-for-a-stochastic-weather-generator-c-74051
Dataset updated
Jun 5, 2025
Dataset provided by
Agricultural Research Service
Description
This dataset represents CLIGEN input parameters for locations in 68 countries. CLIGEN is a point-scale stochastic weather generator that produces long-term weather simulations with daily output. The input parameters are essentially monthly climate statistics that also serve as climate benchmarks. Three unique input parameter sets are differentiated by having been produced from 30-year, 20-year and 10-year minimum record lengths that correspond to 7673, 2336, and 2694 stations, respectively. The primary source of data is the NOAA GHCN-Daily dataset, and due to data gaps, records longer than the three minimum record lengths were often queried to produce the needed number of complete monthly records. The vast majority of stations used at least some data from the 2000s, and temporal coverages are shown in the Excel table for each station. CLIGEN has various applications including being used to force soil erosion models. This dataset may reduce the effort needed in preparing climate inputs for such applications.

Revised input files added on 11/16/20. These files were revised from the original dataset. Fixed metadata issues with the headings of each file. Fixed inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months.

Second revision input files added on 2/12/20. A formatting error was fixed that affected transition probabilities for 238 stations with zero recorded precipitation for one or more months. The affected stations were predominantly in Australia and Mexico.

Resources in this dataset:

Resource Title: 30-year input files.
File Name: 30-year.zip
Resource Description: CLIGEN .par input files based on 30-year minimum record lengths. May be viewed with text editor.
Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

Resource Title: 20-year input files.
File Name: 20-year.zip
Resource Description: CLIGEN .par input files based on 20-year minimum record lengths. May be viewed with text editor.
Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

Resource Title: 10-year input files.
File Name: 10-year.zip
Resource Description: CLIGEN .par input files based on 10-year minimum record lengths. May be viewed with text editor.
Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

Resource Title: Map Layer.
File Name: MapLayer.kmz
Resource Description: Map Layer showing locations of the new CLIGEN stations. This layer may be imported into Google Earth and used to find the station closest to an area of interest.
Resource Software Recommended: Google Earth, url: https://www.google.com/earth/

Resource Title: Temporal Ranges of Years Queried.
File Name: GHCN-Daily Year Ranges.xlsx
Resource Description: Excel tables of the first and last years queried from GHCN-Daily when searching for complete monthly records (with no gaps in data). Any ranges in excess of 30 years, 20 years and 10 years, for respective datasets, are due to data gaps.

Resource Title: 30-year input files (revised).
File Name: 30-year revised.zip
Resource Description: CLIGEN .par input files based on 30-year minimum record lengths. May be viewed with text editor. Revised from the original dataset. Fixed metadata issues with the headings of each file. Fixed inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months.
Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

Resource Title: 20-year input files (revised).
File Name: 20-year revised.zip
Resource Description: CLIGEN .par input files based on 20-year minimum record lengths. May be viewed with text editor. Revised from the original dataset. Fixed metadata issues with the headings of each file. Fixed inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months.
Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

Resource Title: 10-year input files (revised).
File Name: 10-year revised.zip
Resource Description: CLIGEN .par input files based on 10-year minimum record lengths. May be viewed with text editor. Revised from the original dataset. Fixed metadata issues with the headings of each file. Fixed inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months.
Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

Resource Title: 30-year input files (revised 2).
File Name: 30-year revised 2.zip
Resource Description: CLIGEN .par input files based on 30-year minimum record lengths. May be viewed with text editor. Fixed formatting issue for 238 stations that affected transition probabilities.
Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

Resource Title: 20-year input files (revised 2).
File Name: 20-year revised 2.zip
Resource Description: CLIGEN .par input files based on 20-year minimum record lengths. May be viewed with text editor. Fixed formatting issue for 238 stations that affected transition probabilities.
Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/

Resource Title: 10-year input files (revised 2).
File Name: 10-year revised 2.zip
Resource Description: CLIGEN .par input files based on 10-year minimum record lengths. May be viewed with text editor. Fixed formatting issue for 238 stations that affected transition probabilities.
Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
EMHIRES dataset: wind and solar power generation [archived]
zenodo.org
data.niaid.nih.gov
pdf, zip
Updated Jul 19, 2024
Cite
Iratxe Gonzalez-Aparicio; Andreas Zucker; Francesco Careri; Fabio Monforti; Thomas Huld; Jake Badger (2024). EMHIRES dataset: wind and solar power generation [archived] [Dataset]. http://doi.org/10.5281/zenodo.4803353
Available download formats: zip, pdf
Unique identifier
https://doi.org/10.5281/zenodo.4803353
Dataset updated
Jul 19, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Iratxe Gonzalez-Aparicio; Andreas Zucker; Francesco Careri; Fabio Monforti; Thomas Huld; Jake Badger
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository is now archived. The official repository for the EMHIRES dataset is now "EMHIRES dataset: wind and solar power generation" on Zenodo.

EMHIRES Wind

The first version of the EMHIRES dataset releases four files containing hourly wind power generation time series for 30 years (1986-2015), taking into account the wind fleet existing at the end of 2015, for each country (onshore and offshore), bidding zone, and NUTS 1 and NUTS 2 region. The time series are given as capacity factors. The installed capacities used to calculate the capacity factors are summarised in the annexes of the report.

- https://setis.ec.europa.eu/emhires-dataset-part-i-wind-power-generation_en
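
Because the wind time series are given as capacity factors, converting them to power requires multiplying by an installed capacity. The sketch below shows that arithmetic under stated assumptions: the function names and the sample numbers are hypothetical, and the actual installed capacities should be taken from the annexes of the report.

```python
def to_power_mw(capacity_factors, installed_capacity_mw):
    """Convert hourly capacity factors (0..1) to power in MW for a fixed fleet size."""
    return [cf * installed_capacity_mw for cf in capacity_factors]


def mean_capacity_factor(capacity_factors):
    """Average capacity factor over the period covered by the series."""
    return sum(capacity_factors) / len(capacity_factors)


# Hypothetical three-hour excerpt and a hypothetical 1000 MW fleet.
hourly_cf = [0.0, 0.25, 0.5]
print(to_power_mw(hourly_cf, 1000.0))
print(mean_capacity_factor(hourly_cf))
```

Since the installed capacity is held fixed at its end-of-2015 level, power values derived this way describe how that fleet would respond to each year's weather, not the historical output of earlier years.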

EMHIRES Solar

EMHIRES provides RES-E generation time series for the EU-28 and neighbouring countries. The solar power time series are released at hourly granularity and at different aggregation levels: by country, power market bidding zone, and by the European Nomenclature of territorial units for statistics (NUTS) defined by EUROSTAT; in particular, by NUTS 1 and NUTS 2 level. The time series provided by bidding zones include special aggregations to reflect the power market reality where this deviates from political or territorial boundaries.

The overall scope of EMHIRES is to allow users to assess the impact of meteorological and climate variability on the generation of solar power in Europe, not to mimic the actual evolution of solar power production in recent decades. For this reason, the hourly solar power generation time series are released for meteorological conditions of the years 1986-2015 (30 years) without considering any changes in the solar installed capacity. Thus, the installed capacity considered is fixed as the one installed at the end of 2015. For this reason, data from EMHIRES should not be compared with actual power generation data other than referring to the reference year 2015.

- https://setis.ec.europa.eu/emhires-dataset-part-ii-solar-power-generation_en
Embedded Generation by Type (SPEN_010) Data Quality Checks - Dataset -...
demo.dev.datopian.com
Updated May 27, 2025
+ more versions
Cite
(2025). Embedded Generation by Type (SPEN_010) Data Quality Checks - Dataset - Datopian CKAN instance [Dataset]. https://demo.dev.datopian.com/dataset/sp-energy-networks--spen_data_quality_embedded_generation
Dataset updated
May 27, 2025
Description
This data table provides the detailed data quality assessment scores for the Embedded Generation by Type dataset. The quality assessment was carried out on the 31st March. At SPEN, we are dedicated to sharing high-quality data with our stakeholders and being transparent about its quality. This is why we openly share the results of our data quality assessments. We collaborate closely with Data Owners to address any identified issues and enhance our overall data quality. To demonstrate our progress we conduct, at a minimum, bi-annual assessments of our data quality. For datasets that are refreshed more frequently than this, please note that the quality assessment may be based on an earlier version of the dataset. To learn more about how we assess data quality, visit Data Quality - SP Energy Networks. We welcome feedback and questions from our stakeholders regarding this process. Our Open Data Team is available to answer any enquiries or receive feedback on the assessments. You can contact them via our Open Data mailbox at opendata@spenergynetworks.co.uk. The first phase of our comprehensive data quality assessment measures the quality of our datasets across three dimensions. Please refer to the data table schema for the definitions of these dimensions. We are now in the process of expanding our quality assessments to include additional dimensions to provide a more comprehensive evaluation and will update the data tables with the results when available.
Norwegian General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Norwegian General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-norwegian-norway
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Norwegian General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Norwegian speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Norwegian communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Norwegian speech models that understand and respond to authentic Norwegian accents and dialects.
Speech Data
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Norwegian. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
• Participant Diversity:
  • Speakers: 60 verified native Norwegian speakers from FutureBeeAI’s contributor community.
  • Regions: Representing various provinces of Norway to ensure dialectal diversity and demographic balance.
  • Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
• Recording Details:
  • Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
  • Duration: Each conversation ranges from 15 to 60 minutes.
  • Audio Format: Stereo WAV files, 16-bit depth, recorded at a 16 kHz sample rate.
  • Environment: Quiet, echo-free settings with no background noise.
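The recording details above can be checked against a downloaded file before training. The following is a minimal sketch using only Python's standard `wave` module; the helper names (`describe_wav`, `matches_stated_format`, `make_silent_wav`) are hypothetical and not part of any FutureBeeAI tooling, and the silent in-memory file merely stands in for a real recording.

```python
import io
import wave


def describe_wav(source):
    """Return (channels, bits_per_sample, sample_rate_hz, duration_s) for a WAV file.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8          # sample width is reported in bytes
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, bits, rate, duration


def matches_stated_format(source):
    """True if the file is stereo, 16-bit, 16 kHz, as the listing states."""
    channels, bits, rate, _ = describe_wav(source)
    return (channels, bits, rate) == (2, 16, 16000)


def make_silent_wav(seconds=1, channels=2, rate=16000):
    """Build a silent 16-bit WAV in memory (a stand-in for a real recording)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)                    # 2 bytes = 16 bits per sample
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * rate * seconds)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(describe_wav(make_silent_wav()))
    print(matches_stated_format(make_silent_wav(channels=1)))
```

With a real download, passing the file path to `matches_stated_format` would flag any recording that was resampled or downmixed along the way.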

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
• Sample Topics Include:
  • Family & Relationships
  • Food & Recipes
  • Education & Career
  • Healthcare Discussions
  • Social Issues
  • Technology & Gadgets
  • Travel & Local Culture
  • Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
• Transcription Highlights:
  • Speaker-segmented dialogues
  • Time-coded utterances
  • Non-speech elements (pauses, laughter, etc.)
  • High transcription accuracy, achieved through a double QA pass (average WER < 5%)
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
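For readers unfamiliar with the WER figure quoted above, word error rate is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition; it is not FutureBeeAI's evaluation code, and the function name and whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Tokenization is plain whitespace splitting (an assumption; real
    evaluations usually normalize case and punctuation first).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("a b c d", "a c d"))
```

Under this definition, a WER below 5% means fewer than one word-level error per twenty reference words on average.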
Metadata
The dataset comes with granular metadata for both speakers and recordings:
• Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
• Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
Usage and Applications
This dataset is a versatile resource for multiple Norwegian speech and language AI applications:
• ASR Development: Train accurate speech-to-text systems for Norwegian.
• Voice Assistants: Build smart assistants capable of understanding natural Norwegian conversations.
Dataset of books called Boomer nation : the largest and richest generation...
workwithdata.com
Updated Apr 17, 2025
Cite
Work With Data (2025). Dataset of books called Boomer nation : the largest and richest generation ever, and how it changed America [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Boomer+nation+%3A+the+largest+and+richest+generation+ever%2C+and+how+it+changed+America
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
United States
Description
This dataset is about books. It has 1 row and is filtered where the book is Boomer nation : the largest and richest generation ever, and how it changed America. It features 7 columns including author, publication date, language, and book publisher.
rlu_atari_checkpoints_ordered
tensorflow.org
Updated Dec 9, 2021
+ more versions
Cite
(2021). rlu_atari_checkpoints_ordered [Dataset]. https://www.tensorflow.org/datasets/catalog/rlu_atari_checkpoints_ordered
Explore at:
Dataset updated
Dec 9, 2021
Description
RL Unplugged is a suite of benchmarks for offline reinforcement learning. RL Unplugged is designed around the following considerations: to facilitate ease of use, we provide the datasets with a unified API, which makes it easy for the practitioner to work with all data in the suite once a general pipeline has been established.

The datasets follow the RLDS format to represent steps and episodes.

We are releasing a large and diverse dataset of gameplay following the protocol described by Agarwal et al., 2020, which can be used to evaluate several discrete offline RL algorithms. The dataset is generated by running an online DQN agent and recording transitions from its replay during training with sticky actions (Machado et al., 2018). As stated in Agarwal et al., 2020, for each game we use data from five runs with 50 million transitions each. We release datasets for 46 Atari games. For details on how the dataset was generated, please refer to the paper. Please see this note about the ROM versions used to generate the datasets.

Atari is a standard RL benchmark. We recommend trying offline RL methods on Atari if you are interested in comparing your approach to other state-of-the-art offline RL methods with discrete actions.

The reward of each step is clipped (obtained with [-1, 1] clipping) and the episode includes the sum of the clipped reward per episode.
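As an illustration of the clipping described above, the sketch below clamps each step reward to [-1, 1] and sums the clipped values over an episode. It is a plain-Python approximation of the idea, not the code used to build the dataset; the function names and the example rewards are made up.

```python
def clip_reward(reward, low=-1.0, high=1.0):
    """Clamp a single step reward into the closed interval [low, high]."""
    return max(low, min(high, reward))


def clipped_episode_return(rewards):
    """Sum of the clipped step rewards over one episode."""
    return sum(clip_reward(r) for r in rewards)


if __name__ == "__main__":
    raw_rewards = [0, 4, -7, 1, 0.5]   # made-up step rewards
    print([clip_reward(r) for r in raw_rewards])
    print(clipped_episode_return(raw_rewards))
```

Because large rewards saturate at the bounds, the clipped episode sum generally differs from the raw game score.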

Each of the configurations is broken into splits. Splits correspond to checkpoints of 1M steps (note that the number of episodes may differ). Checkpoints are ordered in time (so checkpoint 0 ran before checkpoint 1).
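The figures quoted in this description can be combined into a rough size estimate. The arithmetic below uses only numbers stated in the text (five runs per game, 50 million transitions per run, 46 games, checkpoints of 1M steps); treating one step as one transition is an assumption, so the checkpoint count per run is indicative rather than authoritative.

```python
RUNS_PER_GAME = 5                   # "five runs" per game
TRANSITIONS_PER_RUN = 50_000_000    # "50 million transitions each"
GAMES = 46                          # "datasets for 46 Atari games"
STEPS_PER_CHECKPOINT = 1_000_000    # "checkpoints of 1M steps"

transitions_per_game = RUNS_PER_GAME * TRANSITIONS_PER_RUN
total_transitions = GAMES * transitions_per_game

# Assumption: one step corresponds to one transition.
checkpoints_per_run = TRANSITIONS_PER_RUN // STEPS_PER_CHECKPOINT

if __name__ == "__main__":
    print(f"{transitions_per_game:,} transitions per game")
    print(f"{total_transitions:,} transitions across all games")
    print(f"{checkpoints_per_run} checkpoints per run (under the stated assumption)")
```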

Episodes within each split are ordered. Check https://www.tensorflow.org/datasets/determinism if you want to ensure that you read episodes in order.

This dataset corresponds to the one used in the DQN replay paper. https://research.google/tools/datasets/dqn-replay/

To use this dataset:

import tensorflow_datasets as tfds

ds = tfds.load('rlu_atari_checkpoints_ordered', split='train')
for ex in ds.take(4):
    print(ex)

See the guide for more information on tensorflow_datasets.
Next Generation Simulation (NGSIM) Vehicle Trajectories and Supporting Data
catalog.data.gov
data.transportation.gov
+5 more
Updated Jun 16, 2025
Cite
Federal Highway Administration (2025). Next Generation Simulation (NGSIM) Vehicle Trajectories and Supporting Data [Dataset]. https://catalog.data.gov/dataset/next-generation-simulation-ngsim-vehicle-trajectories-and-supporting-data
Explore at:
Dataset updated
Jun 16, 2025
Dataset provided by
Federal Highway Administration
Description
Click “Export” on the right to download the vehicle trajectory data. The associated metadata and additional data can be downloaded below under "Attachments".

Researchers for the Next Generation Simulation (NGSIM) program collected detailed vehicle trajectory data on southbound US 101 and Lankershim Boulevard in Los Angeles, CA, eastbound I-80 in Emeryville, CA, and Peachtree Street in Atlanta, Georgia. Data was collected through a network of synchronized digital video cameras. NGVIDEO, a customized software application developed for the NGSIM program, transcribed the vehicle trajectory data from the video. This vehicle trajectory data provided the precise location of each vehicle within the study area every one-tenth of a second, resulting in detailed lane positions and locations relative to other vehicles.

Click the "Show More" button below to find additional contextual data and metadata for this dataset.

For site-specific NGSIM video file datasets, please see the following:
- NGSIM I-80 Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-I-80-Vide/2577-gpny
- NGSIM US-101 Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-US-101-Vi/4qzi-thur
- NGSIM Lankershim Boulevard Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-Lankershi/uv3e-y54k
- NGSIM Peachtree Street Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-Peachtree/mupt-aksf
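Because positions are recorded every one-tenth of a second, a speed estimate follows from the distance between consecutive samples divided by 0.1 s. The sketch below shows that calculation on made-up coordinates; the function name and the (x, y) tuple layout are hypothetical and do not reflect the actual NGSIM column names or units, which should be taken from the dataset's own metadata.

```python
import math

SAMPLE_INTERVAL_S = 0.1  # one position per vehicle every one-tenth of a second


def speeds_from_positions(points, dt=SAMPLE_INTERVAL_S):
    """Estimate speed between consecutive (x, y) samples of one vehicle.

    Returns one value per pair of neighbouring samples, in position units
    per second (the position units are whatever the input uses).
    """
    return [
        math.hypot(x1 - x0, y1 - y0) / dt
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


if __name__ == "__main__":
    track = [(0.0, 0.0), (3.0, 4.0), (3.0, 4.0)]  # made-up positions
    print(speeds_from_positions(track))
```

Finite differences at this sampling rate amplify position noise, so smoothing the trajectory first is a common precaution.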
Architectural interior styles sample Dataset
data.niaid.nih.gov
Updated Sep 20, 2023
+ more versions
Cite
Marcin Kostrzewski (2023). Architectural interior styles sample Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8360664
Explore at:
Dataset updated
Sep 20, 2023
Dataset provided by
Michał Ulaniuk
Marcin Kostrzewski
Adam Wojdyła
Description
The dataset contains around 1600 images depicting a particular interior style. The photos belong to one of eight classes: rustic, industrial, classic, vintage, modernist, art-deco, scandinavian, glamour.

The source of the dataset is Houzz.com. The images were downloaded from the website and grouped into folders.

You may use the dataset under the following terms:

Research and Development Purposes Only: Access to the dataset hosted on Zenodo is granted exclusively for research and development purposes. Users are required to clearly state their intention for using the dataset in this context.

Acknowledgment and Citation: Users must commit to providing proper acknowledgment and citation of the dataset in their research or development work. They should include the dataset's DOI and a reference to the original source in all publications, presentations, or reports derived from the dataset.

No Commercial Use: The dataset is not to be used for any commercial, for-profit, or financially exploitative purposes. Users must refrain from any activities that generate direct monetary gains from the dataset.

Ethical Use: Users are required to use the dataset in a manner consistent with ethical research practices. This includes respecting privacy, complying with relevant laws and regulations, and ensuring that the use of the data does not harm individuals, groups, or communities.

No Redistribution: Users are strictly prohibited from redistributing the dataset to third parties without prior written consent from the dataset owner. Any sharing of the dataset should be done solely for collaboration within the context of the research or development project.

Non-Discrimination: Access to the dataset should not be denied or granted based on factors such as race, ethnicity, gender, religion, nationality, or any other discriminatory criteria. All requests for access will be evaluated solely based on the justification provided by the user.

No Charge for Access: Users will not be charged any fees for accessing the data hosted on Zenodo. Access is provided free of charge, and users should not be required to make any payments to obtain or use the dataset.

Compliance with Zenodo's Terms of Use: Users are expected to comply with Zenodo's terms of use, including any additional terms or policies specific to the platform.
Comparison of the Predictive Performance and Interpretability of Random...
acs.figshare.com
figshare.com
zip
Updated Jun 5, 2023
Cite
Richard L. Marchese Robinson; Anna Palczewska; Jan Palczewski; Nathan Kidley (2023). Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets [Dataset]. http://doi.org/10.1021/acs.jcim.6b00753.s006
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1021/acs.jcim.6b00753.s006
Dataset updated
Jun 5, 2023
Dataset provided by
ACS Publications
Authors
Richard L. Marchese Robinson; Anna Palczewska; Jan Palczewski; Nathan Kidley
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The ability to interpret the predictions made by quantitative structure–activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. 
These programs are the rfFC package (https://r-forge.r-project.org/R/?group_id=1725) for the R statistical programming language and the Python program HeatMapWrapper [https://doi.org/10.5281/zenodo.495163] for heat map generation.
Network Flow: Power, Current and Embedded Generation (SPEN_008) Data Quality...
spenergynetworks.opendatasoft.com
Updated Mar 28, 2025
+ more versions
Cite
(2025). Network Flow: Power, Current and Embedded Generation (SPEN_008) Data Quality Checks [Dataset]. https://spenergynetworks.opendatasoft.com/explore/dataset/spen_data_quality_network_flow/
Explore at:
Dataset updated
Mar 28, 2025
Description
This data table provides the detailed data quality assessment scores for the Network Flow: Power, Current and Embedded Generation dataset. The quality assessment was carried out on the 31st March. At SPEN, we are dedicated to sharing high-quality data with our stakeholders and being transparent about its quality. This is why we openly share the results of our data quality assessments. We collaborate closely with Data Owners to address any identified issues and enhance our overall data quality. To demonstrate our progress we conduct, at a minimum, bi-annual assessments of our data quality. For datasets that are refreshed more frequently than this, please note that the quality assessment may be based on an earlier version of the dataset. To learn more about how we assess data quality, visit Data Quality - SP Energy Networks. We welcome feedback and questions from our stakeholders regarding this process. Our Open Data Team is available to answer any enquiries or receive feedback on the assessments. You can contact them via our Open Data mailbox at opendata@spenergynetworks.co.uk.

The first phase of our comprehensive data quality assessment measures the quality of our datasets across three dimensions. Please refer to the data table schema for the definitions of these dimensions. We are now in the process of expanding our quality assessments to include additional dimensions to provide a more comprehensive evaluation and will update the data tables with the results when available.
Election 2017 May General Voting Results
data.lacity.org
s.cnmilf.com
+1 more
application/rdfxml +5
Updated Jul 13, 2017
+ more versions
Cite
(2017). Election 2017 May General Voting Results [Dataset]. https://data.lacity.org/Administration-Finance/Election-2017-May-General-Voting-Results/qpi4-ig3x
Explore at:
Available download formats: csv, application/rdfxml, xml, tsv, application/rssxml, json
Dataset updated
Jul 13, 2017
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Statement of Votes Cast for the election results.
