Amazon Customer Reviews Dataset is a dataset of user-generated product reviews on the shopping website Amazon. It contains over 130 million product reviews.
This dataset contains a tiny fraction of that dataset processed and prepared specifically for language generation.
For details on how the dataset was prepared, see the GitHub repository: https://github.com/imdeepmind/AmazonReview-LanguageGenerationDataset
The dataset is stored in an SQLite database containing a single table, reviews, with two columns: sequence and next.
The sequence column contains character sequences, each 40 characters long.
The next column contains the character that follows the sequence.
The dataset contains about 200 million samples.
Thanks to Amazon for making this awesome dataset. Here is the link for the dataset: https://s3.amazonaws.com/amazon-reviews-pds/readme.html
This dataset can be used for Language Generation. As it contains 200 million samples, complex Deep Learning models can be trained on this data.
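As a rough illustration of the table layout described above, the sketch below builds a tiny in-memory SQLite database with a reviews table and sequence/next columns, fills it with 40-character sliding windows, and reads a few rows back. The sample text, the helper name make_samples, and the in-memory database are assumptions made for illustration; the actual preparation scripts are in the GitHub repository and may differ.

```python
import sqlite3

SEQ_LEN = 40  # sequence length stated in the dataset description


def make_samples(text, seq_len=SEQ_LEN):
    """Yield (sequence, next_char) pairs with a sliding window (hypothetical helper)."""
    for i in range(len(text) - seq_len):
        yield text[i:i + seq_len], text[i + seq_len]


# Made-up review text standing in for real Amazon reviews.
sample_text = "This product works exactly as described and arrived two days early."

conn = sqlite3.connect(":memory:")  # stand-in for the real SQLite file
# The column name "next" is quoted defensively in case it clashes with an SQL keyword.
conn.execute('CREATE TABLE reviews (sequence TEXT, "next" TEXT)')
conn.executemany('INSERT INTO reviews (sequence, "next") VALUES (?, ?)',
                 make_samples(sample_text))
conn.commit()

# Read back the first few samples in insertion order.
rows = conn.execute(
    'SELECT sequence, "next" FROM reviews ORDER BY rowid LIMIT 3'
).fetchall()
total = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]

for sequence, next_char in rows:
    print(repr(sequence), "->", repr(next_char))
```

With the real database, only the connection target would change; on a table of this size, reading in batches (for example with LIMIT and OFFSET, or by iterating over the cursor) is likely to be more practical than fetching everything at once.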
Dataset Card for instruction-dataset-mini-with-generations
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/vonewman/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info… See the full description on the dataset page: https://huggingface.co/datasets/vonewman/instruction-dataset-mini-with-generations.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 2 rows and is filtered where the book is The M-factor : how the millennial generation is rocking the workplace. It features 7 columns including author, publication date, language, and book publisher.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
ManimBench v1 is a curated dataset designed to support fine-tuning and evaluation of models that generate or interpret Manim animations. It pairs Manim code snippets with natural language descriptions, enabling research in code generation, animation synthesis, and multimodal understanding.
🔗 GitHub Repository for Fine-Tuning: SuienS/manim-fine-tune
📄 Research Paper: Coming Soon
The dataset can also be accessed directly from the HuggingFace Hub.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of 1,785,625 jets from the Jet Primary Dataset of the CMS 2011A Open Data reprocessed into the MOD HDF5 format. Jets are selected from the hardest two anti-kT R=0.5 jets in events passing the Jet300 High Level Trigger and are required to have $p_T^{\text{jet}} > 375$ GeV, where $p_T^{\text{jet}}$ includes a jet energy correction factor. Particle Flow Candidates (PFCs) for each jet are provided and include information about the PFC kinematics, PDG ID, and vertex. Additionally, jets have metadata describing their kinematics and provenance in the original CMS AOD files.
For additional details about the dataset, please see the accompanying paper, Exploring the Space of Jets with CMS Open Data. There, jets were further restricted to have $|\eta^{\text{jet}}| < 1.9$ to ensure tracking coverage and have "medium" quality to reject fake jets.
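The kinematic cuts quoted above can be sketched on toy data. The jet records below are invented, and the dictionary keys pt and eta are hypothetical names rather than the field names used in the MOD HDF5 files or the EnergyFlow package. The "medium" quality requirement is not modelled here.

```python
PT_MIN_GEV = 375.0   # corrected jet pT threshold quoted in the description
ABS_ETA_MAX = 1.9    # pseudorapidity cut applied in the accompanying paper

# Invented jets; "pt" is assumed to already include the jet energy correction.
jets = [
    {"pt": 412.3, "eta": 0.45},
    {"pt": 380.1, "eta": -2.10},   # fails the |eta| cut
    {"pt": 360.0, "eta": 1.20},    # fails the pT cut
    {"pt": 905.7, "eta": -1.85},
]


def passes_selection(jet):
    """Return True if the jet satisfies both kinematic cuts."""
    return jet["pt"] > PT_MIN_GEV and abs(jet["eta"]) < ABS_ETA_MAX


selected = [jet for jet in jets if passes_selection(jet)]
print(len(selected), "of", len(jets), "toy jets pass the cuts")
```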
The supported method for downloading, reading, and using this dataset is through the EnergyFlow Python package, which has additional documentation about how to read and use this and related datasets. Should any problems be encountered, please submit an issue on GitHub.
There are corresponding datasets of simulated jets organized by hard parton $\hat{p}_T$ also available on Zenodo:
SIM/GEN QCD Jets 170-300 GeV
SIM/GEN QCD Jets 300-470 GeV
SIM/GEN QCD Jets 470-600 GeV
SIM/GEN QCD Jets 600-800 GeV
SIM/GEN QCD Jets 800-1000 GeV
SIM/GEN QCD Jets 1000-1400 GeV
SIM/GEN QCD Jets 1400-1800 GeV
SIM/GEN QCD Jets 1800-$\infty$ GeV
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An ocean current is a continuous, directed movement of seawater generated by forces acting upon this mean flow, such as breaking waves, wind, the Coriolis effect, cabbeling, and temperature and salinity differences, while tides are caused by the gravitational pull of the Sun and Moon. Depth contours, shoreline configurations, and interactions with other currents influence a current's direction and strength. Ocean currents flow for great distances and together create the global conveyor belt, which plays a dominant role in determining the climate of many of the Earth's regions. More specifically, ocean currents influence the temperature of the regions through which they travel. General near-surface ocean current data was provided by the Petroleum Affairs Division. The data was created as part of the Irish Offshore Strategic Environmental Assessment (IOSEA).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NLUCat is a dataset of NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is accompanied, in addition, by the instructions received by the annotator who wrote it.
The intents taken into account are the usual ones for a virtual home assistant (activity calendar, IoT, list management, leisure, etc.), but specific ones have also been added to cover the social and healthcare needs of vulnerable people (information on administrative procedures, menu and medication reminders, etc.).
The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can easily be grouped for use in robust systems.
The examples are not only written in Catalan, but also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.).
This dataset can be used to train models for intent classification, span identification and example generation.
This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published on HuggingFace.
In this repository you'll find the following items:
This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.
Intent classification, span identification and example generation.
The dataset is in Catalan (ca-ES).
Three JSON files, one for each split.
Example
An example looks as follows:
{
"example": "Demana una ambulància; la meva dona està de part.",
"annotation": {
"intent": "call_emergency",
"slots": [
{
"Tag": "service",
"Text": "ambulància",
"Start_char": 11,
"End_char": 21
},
{
"Tag": "situation",
"Text": "la meva dona està de part",
"Start_char": 23,
"End_char": 48
}
]
}
}
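A quick way to sanity-check annotations of this shape is to slice the example text with each slot's character offsets and compare the result with the slot's Text field. The sketch below does this for the record shown above. It assumes that End_char is exclusive, which is consistent with this example but is not stated in the dataset description, and the helper name find_offset_mismatches is hypothetical.

```python
import json

# The record shown above, as a standalone JSON document.
record_json = """
{
  "example": "Demana una ambulància; la meva dona està de part.",
  "annotation": {
    "intent": "call_emergency",
    "slots": [
      {"Tag": "service", "Text": "ambulància",
       "Start_char": 11, "End_char": 21},
      {"Tag": "situation", "Text": "la meva dona està de part",
       "Start_char": 23, "End_char": 48}
    ]
  }
}
"""


def find_offset_mismatches(record):
    """Return the slots whose offsets do not reproduce their Text (hypothetical helper)."""
    text = record["example"]
    mismatches = []
    for slot in record["annotation"]["slots"]:
        # Assumption: Start_char is inclusive and End_char is exclusive.
        if text[slot["Start_char"]:slot["End_char"]] != slot["Text"]:
            mismatches.append(slot)
    return mismatches


record = json.loads(record_json)
mismatches = find_offset_mismatches(record)
print("intent:", record["annotation"]["intent"])
print("slots with inconsistent offsets:", len(mismatches))
```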
We created this dataset to contribute to the development of language models in Catalan, a low-resource language.
When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.
Initial Data Collection and Normalization
We commissioned a company to create fictitious examples for this dataset.
Who are the source language producers?
We commissioned the company m47 labs to write the examples.
Annotation process
This dataset was produced in three steps, modelled on the process followed for the NLU-Evaluation-Data dataset, as explained in the paper.
* First step: translation or drafting of the instructions given to the annotators to write the examples.
* Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
* Third step: annotating the intents and the slots of each example. In this step, some modifications were made to the annotation guides to adjust them to the real situations.
Who are the annotators?
The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.
No personal or sensitive information is included.
The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.
We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.
When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.
[N/A]
Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a Residential PV generation and consumption data set from an Estonian house. At the time of submission, one year (2023) of data was available. The data was logged at a 10-second resolution. The untouched dataset can be found in the raw data folder, which is separated month-wise. A few missing points in the dataset were filled with a simple KNN algorithm. However, improved data imputation methods based on machine learning are also possible. To carry out the imputing, run the scripts in the script folder one by one in the numerical serial order (SC1..py, SC2..py, etc.).
Data Descriptor (Scientific Data): https://doi.org/10.1038/s41597-025-04747-w
General Information:
Duration: January 2023 – December 2023
Resolution: 10 seconds
Dataset Type: Aggregated consumption and PV generation data
Logging Device: Camille Bauer PQ1000 (×2)
Load/Appliance Information:
Measurement Points:
Measured Parameters:
Script Description:
SC1_PV_auto_sort.py : This fixes timestamp continuity by resampling at the original sampling rate for PV generation data.
SC2_L2_auto_sort.py : This fixes timestamp continuity by resampling at the original sampling rate for meter-side measurement data.
SC3_PV_KNN_impute.py : Filling missing data points by simple KNN for PV generation data.
SC4_L2_KNN_impute.py : Filling missing data points by simple KNN for meter-side measurement data.
SC5_Final_data_gen.py : Merge PV and meter-side measurement data, and calculate load consumption.
The dataset provides all the outcomes (CSV files) from the scripts. All processed variables (PV generation, load, power import, and export) are expressed in kW units.
Update: 'SC1_PV_auto_sort.py' and 'SC2_L2_auto_sort.py' are adequate for cleaning up the data and making the missing points visible. 'SC3_PV_KNN_impute.py' and 'SC4_L2_KNN_impute.py' work well for short runs of missing data points; however, these two scripts will not help much when data is missing over a longer period. They are provided as examples of one method of processing the data. Future updates will include proper ML-based forecasting to predict missing data points.
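The processing steps described above (regularize the timestamps, fill short gaps, then derive load) can be sketched in plain Python. This is not the authors' code: the gap filling below is a simple average of the k readings nearest in time, standing in for the KNN step, and the energy balance load = PV + import - export is an assumption, as are all function names and sample values.

```python
from datetime import datetime, timedelta

STEP = timedelta(seconds=10)  # logging resolution stated in the description


def regularize(readings, step=STEP):
    """Place readings on a gap-free time grid; missing slots become None."""
    start, end = min(readings), max(readings)
    grid = []
    t = start
    while t <= end:
        grid.append((t, readings.get(t)))
        t += step
    return grid


def knn_fill(grid, k=2):
    """Fill each None with the mean of the k known values nearest in time."""
    known = [(i, v) for i, (_, v) in enumerate(grid) if v is not None]
    filled = []
    for i, (t, v) in enumerate(grid):
        if v is None:
            nearest = sorted(known, key=lambda item: abs(item[0] - i))[:k]
            v = sum(val for _, val in nearest) / len(nearest)
        filled.append((t, v))
    return filled


def load_kw(pv_kw, import_kw, export_kw):
    """Assumed energy balance: load = PV generation + grid import - grid export."""
    return pv_kw + import_kw - export_kw


# Invented PV readings in kW with one missing 10-second slot.
t0 = datetime(2023, 1, 1, 12, 0, 0)
pv_raw = {t0: 1.0, t0 + STEP: 1.2, t0 + 3 * STEP: 1.6, t0 + 4 * STEP: 1.8}

pv_filled = knn_fill(regularize(pv_raw))
print([round(v, 2) for _, v in pv_filled])
print(load_kw(pv_kw=2.0, import_kw=0.5, export_kw=1.0))
```

As the update note above points out for the original scripts, a neighbour-based fill of this kind is only reasonable for short gaps; over long outages it simply repeats values from the edges of the gap.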
Funding Agency and Grant Number:
200 speeches recorded by professional debaters discussing 50 controversial topics (with their manual and automatic transcriptions), and 55 general-purpose claim-rebuttal pairs, along with the results of several annotation experiments performed on these data.
The dataset includes:
- Audio files of 200 debating speeches. [first released in IBM Debater® - Recorded Debating Dataset - Release #2]
- Manual and automatic transcripts of the speeches, in both raw and cleaned (processed) versions. [first released in IBM Debater® - Recorded Debating Dataset - Release #2]
- 55 general-purpose claim-rebuttal pairs written by an expert human debater
- The results of several annotation experiments performed using the general-purpose claim-rebuttal pairs and the speeches
Size: 3.2 GB
This dataset represents CLIGEN input parameters for locations in 68 countries. CLIGEN is a point-scale stochastic weather generator that produces long-term weather simulations with daily output. The input parameters are essentially monthly climate statistics that also serve as climate benchmarks. Three unique input parameter sets are differentiated by having been produced from 30-year, 20-year and 10-year minimum record lengths that correspond to 7673, 2336, and 2694 stations, respectively. The primary source of data is the NOAA GHCN-Daily dataset, and due to data gaps, records longer than the three minimum record lengths were often queried to produce the needed number of complete monthly records. The vast majority of stations used at least some data from the 2000's, and temporal coverages are shown in the Excel table for each station. CLIGEN has various applications including being used to force soil erosion models. This dataset may reduce the effort needed in preparing climate inputs for such applications.
Revised input files added on 11/16/20. These files were revised from the original dataset. Fixed metadata issues with the headings of each file. Fixed inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months.
Second revision input files added on 2/12/20. A formatting error was fixed that affected transition probabilities for 238 stations with zero recorded precipitation for one or more months. The affected stations were predominantly in Australia and Mexico.
Resources in this dataset:
- Resource Title: 30-year input files. File Name: 30-year.zip. Resource Description: CLIGEN .par input files based on 30-year minimum record lengths. May be viewed with text editor. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
- Resource Title: 20-year input files. File Name: 20-year.zip. Resource Description: CLIGEN .par input files based on 20-year minimum record lengths. May be viewed with text editor. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
- Resource Title: 10-year input files. File Name: 10-year.zip. Resource Description: CLIGEN .par input files based on 10-year minimum record lengths. May be viewed with text editor. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
- Resource Title: Map Layer. File Name: MapLayer.kmz. Resource Description: Map Layer showing locations of the new CLIGEN stations. This layer may be imported into Google Earth and used to find the station closest to an area of interest. Resource Software Recommended: Google Earth, url: https://www.google.com/earth/
- Resource Title: Temporal Ranges of Years Queried. File Name: GHCN-Daily Year Ranges.xlsx. Resource Description: Excel tables of the first and last years queried from GHCN-Daily when searching for complete monthly records (with no gaps in data). Any ranges in excess of 30 years, 20 years and 10 years, for respective datasets, are due to data gaps.
- Resource Title: 30-year input files (revised). File Name: 30-year revised.zip. Resource Description: CLIGEN .par input files based on 30-year minimum record lengths. May be viewed with text editor. Revised from the original dataset. Fixed metadata issues with the headings of each file. Fixed inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
- Resource Title: 20-year input files (revised). File Name: 20-year revised.zip. Resource Description: CLIGEN .par input files based on 20-year minimum record lengths. May be viewed with text editor. Revised from the original dataset. Fixed metadata issues with the headings of each file. Fixed inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
- Resource Title: 10-year input files (revised). File Name: 10-year revised.zip. Resource Description: CLIGEN .par input files based on 10-year minimum record lengths. May be viewed with text editor. Revised from the original dataset. Fixed metadata issues with the headings of each file. Fixed inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
- Resource Title: 30-year input files (revised 2). File Name: 30-year revised 2.zip. Resource Description: CLIGEN .par input files based on 30-year minimum record lengths. May be viewed with text editor. Fixed formatting issue for 238 stations that affected transition probabilities. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
- Resource Title: 20-year input files (revised 2). File Name: 20-year revised 2.zip. Resource Description: CLIGEN .par input files based on 20-year minimum record lengths. May be viewed with text editor. Fixed formatting issue for 238 stations that affected transition probabilities. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
- Resource Title: 10-year input files (revised 2). File Name: 10-year revised 2.zip. Resource Description: CLIGEN .par input files based on 10-year minimum record lengths. May be viewed with text editor. Fixed formatting issue for 238 stations that affected transition probabilities. Resource Software Recommended: CLIGEN v5.3, url: https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository is now archived. The official repository for the EMHIRES dataset is now the Zenodo record "EMHIRES dataset: wind and solar power generation".
EMHIRES Wind
The first version of the EMHIRES dataset releases four files containing hourly wind power generation time series covering 30 years (1986-2015), based on the wind fleet existing at the end of 2015, for each country (onshore and offshore), bidding zone, and NUTS 1 and NUTS 2 region. The time series are given as capacity factors. The installed capacities used to calculate the capacity factors are summarised in the annexes of the report.
- https://setis.ec.europa.eu/emhires-dataset-part-i-wind-power-generation_en
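Because the series are given as capacity factors, converting them to power requires multiplying by an installed capacity. The sketch below shows that arithmetic with invented numbers; the real installed capacities are the ones summarised in the annexes of the report, and the function name is hypothetical.

```python
def capacity_factors_to_power(capacity_factors, installed_capacity_mw):
    """Convert capacity factors (0..1) to power in MW for a fixed installed capacity."""
    return [cf * installed_capacity_mw for cf in capacity_factors]


# Invented hourly capacity factors and installed capacity, for illustration only.
hourly_cf = [0.10, 0.25, 0.40, 0.35]
installed_mw = 2000.0

power_mw = capacity_factors_to_power(hourly_cf, installed_mw)
energy_mwh = sum(power_mw)                 # one-hour steps, so MW sums directly to MWh
mean_cf = sum(hourly_cf) / len(hourly_cf)  # average capacity factor over the window

print(power_mw, energy_mwh, round(mean_cf, 3))
```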
EMHIRES Solar
EMHIRES provides RES-E generation time series for the EU-28 and neighbouring countries. The solar power time series are released at hourly granularity and at different aggregation levels: by country, power market bidding zone, and by the European Nomenclature of territorial units for statistics (NUTS) defined by EUROSTAT; in particular, by NUTS 1 and NUTS 2 level. The time series provided by bidding zones include special aggregations to reflect the power market reality where this deviates from political or territorial boundaries.
The overall scope of EMHIRES is to allow users to assess the impact of meteorological and climate variability on the generation of solar power in Europe, not to mimic the actual evolution of solar power production in recent decades. For this reason, the hourly solar power generation time series are released for the meteorological conditions of the years 1986-2015 (30 years) without considering any changes in the installed solar capacity. The installed capacity is therefore fixed at the level installed at the end of 2015, and data from EMHIRES should not be compared with actual power generation data except for the reference year 2015.
- https://setis.ec.europa.eu/emhires-dataset-part-ii-solar-power-generation_en
This data table provides the detailed data quality assessment scores for the Embedded Generation by Type dataset. The quality assessment was carried out on the 31st March. At SPEN, we are dedicated to sharing high-quality data with our stakeholders and being transparent about its quality. This is why we openly share the results of our data quality assessments. We collaborate closely with Data Owners to address any identified issues and enhance our overall data quality. To demonstrate our progress we conduct, at a minimum, bi-annual assessments of our data quality. For datasets that are refreshed more frequently than this, please note that the quality assessment may be based on an earlier version of the dataset. To learn more about how we assess data quality, visit Data Quality - SP Energy Networks. We welcome feedback and questions from our stakeholders regarding this process. Our Open Data Team is available to answer any enquiries or receive feedback on the assessments. You can contact them via our Open Data mailbox at opendata@spenergynetworks.co.uk.
The first phase of our comprehensive data quality assessment measures the quality of our datasets across three dimensions. Please refer to the data table schema for the definitions of these dimensions. We are now in the process of expanding our quality assessments to include additional dimensions to provide a more comprehensive evaluation and will update the data tables with the results when available.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Norwegian General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Norwegian speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Norwegian communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Norwegian speech models that understand and respond to authentic Norwegian accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Norwegian. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Norwegian speech and language AI applications:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Boomer nation : the largest and richest generation ever, and how it changed America. It features 7 columns including author, publication date, language, and book publisher.
RL Unplugged is a suite of benchmarks for offline reinforcement learning. RL Unplugged is designed around the following considerations: to facilitate ease of use, we provide the datasets with a unified API, which makes it easy for the practitioner to work with all data in the suite once a general pipeline has been established.
The datasets follow the RLDS format to represent steps and episodes.
We are releasing a large and diverse dataset of gameplay following the protocol described by Agarwal et al., 2020, which can be used to evaluate several discrete offline RL algorithms. The dataset is generated by running an online DQN agent and recording transitions from its replay during training with sticky actions (Machado et al., 2018). As stated in Agarwal et al., 2020, for each game we use data from five runs with 50 million transitions each. We release datasets for 46 Atari games. For details on how the dataset was generated, please refer to the paper. Please see this note about the ROM versions used to generate the datasets.
Atari is a standard RL benchmark. We recommend trying offline RL methods on Atari if you are interested in comparing your approach to other state-of-the-art offline RL methods with discrete actions.
The reward of each step is clipped to the range [-1, 1], and each episode includes the sum of its clipped rewards.
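The clipping described above amounts to bounding each step reward to [-1, 1] and summing the bounded values per episode. The sketch below uses invented rewards and a hypothetical helper name; it does not use the dataset's actual field names.

```python
def clip_reward(reward, low=-1.0, high=1.0):
    """Clip a single step reward to the interval [low, high]."""
    return max(low, min(high, reward))


# Invented raw rewards for one short episode.
raw_rewards = [0.0, 4.0, -7.0, 0.5, 1.0]

clipped_rewards = [clip_reward(r) for r in raw_rewards]
clipped_episode_return = sum(clipped_rewards)

print(clipped_rewards, clipped_episode_return)
```

Clipping changes the magnitude of large rewards, so the clipped episode return is generally not comparable to the raw game score.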
Each of the configurations is broken into splits. Splits correspond to checkpoints of 1M steps (note that the number of episodes may differ). Checkpoints are ordered in time (so checkpoint 0 ran before checkpoint 1).
Episodes within each split are ordered. Check https://www.tensorflow.org/datasets/determinism if you want to ensure that you read episodes in order.
This dataset corresponds to the one used in the DQN replay paper. https://research.google/tools/datasets/dqn-replay/
To use this dataset:
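import tensorflow_datasets as tfds

# The split name is kept from the original listing; since splits are described
# as checkpoints, it may need to be replaced with a checkpoint split name.
ds = tfds.load('rlu_atari_checkpoints_ordered', split='train')
# Print the first four elements of the dataset.
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.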
Researchers for the Next Generation Simulation (NGSIM) program collected detailed vehicle trajectory data on southbound US 101 and Lankershim Boulevard in Los Angeles, CA, eastbound I-80 in Emeryville, CA and Peachtree Street in Atlanta, Georgia. Data was collected through a network of synchronized digital video cameras. NGVIDEO, a customized software application developed for the NGSIM program, transcribed the vehicle trajectory data from the video. This vehicle trajectory data provided the precise location of each vehicle within the study area every one-tenth of a second, resulting in detailed lane positions and locations relative to other vehicles. The associated metadata and additional data are provided as attachments to the dataset.
For site-specific NGSIM video file datasets, please see the following:
- NGSIM I-80 Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-I-80-Vide/2577-gpny
- NGSIM US-101 Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-US-101-Vi/4qzi-thur
- NGSIM Lankershim Boulevard Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-Lankershi/uv3e-y54k
- NGSIM Peachtree Street Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-Peachtree/mupt-aksf
The dataset contains around 1600 images depicting a particular interior style. The photos belong to one of eight classes: rustic, industrial, classic, vintage, modernist, art-deco, scandinavian, glamour.
The source of the dataset is Houzz.com. The images were downloaded from the website and grouped into folders.
You may use the dataset under the following terms:
Research and Development Purposes Only: Access to the dataset hosted on Zenodo is granted exclusively for research and development purposes. Users are required to clearly state their intention for using the dataset in this context.
Acknowledgment and Citation: Users must commit to providing proper acknowledgment and citation of the dataset in their research or development work. They should include the dataset's DOI and a reference to the original source in all publications, presentations, or reports derived from the dataset.
No Commercial Use: The dataset is not to be used for any commercial, for-profit, or financially exploitative purposes. Users must refrain from any activities that generate direct monetary gains from the dataset.
Ethical Use: Users are required to use the dataset in a manner consistent with ethical research practices. This includes respecting privacy, complying with relevant laws and regulations, and ensuring that the use of the data does not harm individuals, groups, or communities.
No Redistribution: Users are strictly prohibited from redistributing the dataset to third parties without prior written consent from the dataset owner. Any sharing of the dataset should be done solely for collaboration within the context of the research or development project.
Non-Discrimination: Access to the dataset should not be denied or granted based on factors such as race, ethnicity, gender, religion, nationality, or any other discriminatory criteria. All requests for access will be evaluated solely based on the justification provided by the user.
No Charge for Access: Users will not be charged any fees for accessing the data hosted on Zenodo. Access is provided free of charge, and users should not be required to make any payments to obtain or use the dataset.
Compliance with Zenodo's Terms of Use: Users are expected to comply with Zenodo's terms of use, including any additional terms or policies specific to the platform.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The ability to interpret the predictions made by quantitative structure–activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. 
These programs are the rfFC package (https://r-forge.r-project.org/R/?group_id=1725) for the R statistical programming language and the Python program HeatMapWrapper [https://doi.org/10.5281/zenodo.495163] for heat map generation.
This data table provides the detailed data quality assessment scores for the Network Flow: Power, Current and Embedded Generation dataset. The quality assessment was carried out on the 31st March. At SPEN, we are dedicated to sharing high-quality data with our stakeholders and being transparent about its quality. This is why we openly share the results of our data quality assessments. We collaborate closely with Data Owners to address any identified issues and enhance our overall data quality. To demonstrate our progress we conduct, at a minimum, bi-annual assessments of our data quality. For datasets that are refreshed more frequently than this, please note that the quality assessment may be based on an earlier version of the dataset. To learn more about how we assess data quality, visit Data Quality - SP Energy Networks. We welcome feedback and questions from our stakeholders regarding this process. Our Open Data Team is available to answer any enquiries or receive feedback on the assessments. You can contact them via our Open Data mailbox at opendata@spenergynetworks.co.uk.
The first phase of our comprehensive data quality assessment measures the quality of our datasets across three dimensions. Please refer to the data table schema for the definitions of these dimensions. We are now in the process of expanding our quality assessments to include additional dimensions to provide a more comprehensive evaluation and will update the data tables with the results when available.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Statement of Votes Cast of the election Results