36 datasets found
  1. Language Generation Dataset: 200M Samples

    • kaggle.com
    zip
    Updated Sep 7, 2019
    Cite
    Abhishek Chatterjee (2019). Language Generation Dataset: 200M Samples [Dataset]. https://www.kaggle.com/datasets/imdeepmind/language-generation-dataset-200m-samples
    Explore at:
    zip (3416608411 bytes). Available download formats.
    Dataset updated
    Sep 7, 2019
    Authors
    Abhishek Chatterjee
    Description

    Context

    Amazon Customer Reviews Dataset is a dataset of user-generated product reviews on the shopping website Amazon. It contains over 130 million product reviews.

    This dataset contains a tiny fraction of that dataset processed and prepared specifically for language generation.

    To learn how the dataset was prepared, please check the dataset's GitHub repository: https://github.com/imdeepmind/AmazonReview-LanguageGenerationDataset

    Content

    The dataset is stored in an SQLite database. The database contains one table called reviews. This table contains two columns: sequence and next.

    The sequence column contains sequences of characters. In this dataset, each sequence is 40 characters long.

    The next column contains the next character after the sequence.

    There are about 200 million samples in the dataset.
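
    For illustration, here is a minimal Python sketch for sampling rows; the file name is hypothetical, and the schema follows the description above (table reviews with columns sequence and next).

    import sqlite3

    # Hypothetical path to the downloaded SQLite file.
    conn = sqlite3.connect("language_generation.db")
    cur = conn.cursor()

    # Stream a handful of (sequence, next) pairs instead of loading
    # all ~200 million rows at once; "next" is quoted defensively.
    cur.execute('SELECT sequence, "next" FROM reviews LIMIT 5')
    for sequence, next_char in cur.fetchall():
        print(repr(sequence), "->", repr(next_char))

    conn.close()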

    Acknowledgements

    Thanks to Amazon for making this awesome dataset. Here is the link for the dataset: https://s3.amazonaws.com/amazon-reviews-pds/readme.html

    Inspiration

    This dataset can be used for Language Generation. As it contains 200 million samples, complex Deep Learning models can be trained on this data.

  2. instruction-dataset-mini-with-generations

    • huggingface.co
    + more versions
    Cite
    Abdoulaye Diallo, instruction-dataset-mini-with-generations [Dataset]. https://huggingface.co/datasets/vonewman/instruction-dataset-mini-with-generations
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Abdoulaye Diallo
    Description

    Dataset Card for instruction-dataset-mini-with-generations

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel, using the distilabel CLI:

    distilabel pipeline run --config "https://huggingface.co/datasets/vonewman/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml"

    or explore the configuration: distilabel pipeline info… See the full description on the dataset page: https://huggingface.co/datasets/vonewman/instruction-dataset-mini-with-generations.
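
    For convenience, here is a minimal sketch for loading the dataset with the Hugging Face datasets library; the "train" split name is an assumption, so check the dataset page for the available splits.

    from datasets import load_dataset

    # Loads directly from the Hugging Face Hub.
    ds = load_dataset("vonewman/instruction-dataset-mini-with-generations", split="train")
    print(ds[0])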

  3. Dataset of books called The M-factor : how the millennial generation is...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called The M-factor : how the millennial generation is rocking the workplace [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=The+M-factor+%3A+how+the+millennial+generation+is+rocking+the+workplace
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 2 rows and is filtered where the book is The M-factor : how the millennial generation is rocking the workplace. It features 7 columns including author, publication date, language, and book publisher.

  4. ManimBench v1

    • kaggle.com
    • huggingface.co
    Updated Jun 5, 2025
    Cite
    Ravidu Silva (2025). ManimBench v1 [Dataset]. https://www.kaggle.com/datasets/ravidussilva/manim-sft
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Ravidu Silva
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    📚 ManimBench Dataset v1

    Overview

    ManimBench v1 is a curated dataset designed to support fine-tuning and evaluation of models that generate or interpret Manim animations. It pairs Manim code snippets with natural language descriptions, enabling research in code generation, animation synthesis, and multimodal understanding.

    🔗 GitHub Repository for Fine-Tuning: SuienS/manim-fine-tune

    📄 Research Paper: Coming Soon

    The dataset can also be accessed directly from the HuggingFace Hub.
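
    For illustration, here is a minimal sketch for fetching the Kaggle copy with the kagglehub package, using the dataset handle from the citation above.

    import kagglehub

    # Downloads the latest version and returns the local path.
    path = kagglehub.dataset_download("ravidussilva/manim-sft")
    print("Files downloaded to:", path)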

    🧠 Use Cases

    • Fine-tuning LLMs for code generation or animation synthesis
    • Benchmarking natural language to animation tasks
    • Studying alignment between code and human-readable descriptions
  5. CMS 2011A Open Data | Jet Primary Dataset | pT > 375 GeV | MOD HDF5 Format

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Mastandrea, Radha (2020). CMS 2011A Open Data | Jet Primary Dataset | pT > 375 GeV | MOD HDF5 Format [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3340204
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Komiske, Patrick
    Naik, Preksha
    Thaler, Jesse
    Metodiev, Eric
    Mastandrea, Radha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset of 1,785,625 jets from the Jet Primary Dataset of the CMS 2011A Open Data reprocessed into the MOD HDF5 format. Jets are selected from the hardest two anti-kT R=0.5 jets in events passing the Jet300 High Level Trigger and are required to have \(p_T^\text{jet} > 375\) GeV, where \(p_T^\text{jet}\) includes a jet energy correction factor. Particle Flow Candidates (PFCs) for each jet are provided and include information about the PFC kinematics, PDG ID, and vertex. Additionally, jets have metadata describing their kinematics and provenance in the original CMS AOD files.

    For additional details about the dataset, please see the accompanying paper, Exploring the Space of Jets with CMS Open Data. There, jets were further restricted to have \(|\eta^\text{jet}| < 1.9\) to ensure tracking coverage and have "medium" quality to reject fake jets.

    The supported method for downloading, reading, and using this dataset is through the EnergyFlow Python package, which has additional documentation about how to read and use this and related datasets. Should any problems be encountered, please submit an issue on GitHub.
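
    As a minimal sketch, assuming the energyflow.mod.load interface documented by the EnergyFlow package (verify argument names against the current EnergyFlow docs):

    import energyflow as ef

    # Load a small fraction of the collection to keep the download modest;
    # `amount` and `dataset` follow the EnergyFlow MOD documentation.
    jets = ef.mod.load(amount=0.01, dataset='cms')
    print(jets)  # summary of the loaded jets and their per-jet arrays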

    There are corresponding datasets of simulated jets organized by hard parton \(\hat{p}_T\) also available on Zenodo:

    SIM/GEN QCD Jets 170-300 GeV

    SIM/GEN QCD Jets 300-470 GeV

    SIM/GEN QCD Jets 470-600 GeV

    SIM/GEN QCD Jets 600-800 GeV

    SIM/GEN QCD Jets 800-1000 GeV

    SIM/GEN QCD Jets 1000-1400 GeV

    SIM/GEN QCD Jets 1400-1800 GeV

    SIM/GEN QCD Jets 1800-\(\infty\) GeV

  6. General Near Surface Ocean Current - Dataset - data.gov.ie

    • data.gov.ie
    Updated Nov 11, 2016
    Cite
    data.gov.ie (2016). General Near Surface Ocean Current - Dataset - data.gov.ie [Dataset]. https://data.gov.ie/dataset/general-near-surface-ocean-current
    Explore at:
    Dataset updated
    Nov 11, 2016
    Dataset provided by
    data.gov.ie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    An ocean current is a continuous, directed movement of seawater generated by forces acting upon this mean flow, such as breaking waves, wind, the Coriolis effect, cabbeling, and temperature and salinity differences, while tides are caused by the gravitational pull of the Sun and Moon. Depth contours, shoreline configurations, and interactions with other currents influence a current's direction and strength. Ocean currents flow for great distances and together create the global conveyor belt, which plays a dominant role in determining the climate of many of the Earth's regions. More specifically, ocean currents influence the temperature of the regions through which they travel. General near surface ocean current data was provided by the Petroleum Affairs Division. Data was created as part of the Irish Offshore Strategic Environmental Assessment (IOSEA).

  7. NLUCat

    • zenodo.org
    • huggingface.co
    • +1 more
    zip
    Updated Mar 4, 2024
    Cite
    Zenodo (2024). NLUCat [Dataset]. http://doi.org/10.5281/zenodo.10721193
    Explore at:
    zip. Available download formats.
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NLUCat

    Dataset Description

    Dataset Summary

    NLUCat is a dataset for NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is also accompanied by the instructions that the annotator received when writing it.

    The intents taken into account are the usual ones for a virtual home assistant (activity calendar, IoT, list management, leisure, etc.), but specific ones have also been added to address the social and healthcare needs of vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

    The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.

    The examples are not only written in Catalan; they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.).

    This dataset can be used to train models for intent classification, span identification and example generation.

    This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.

    In this repository you'll find the following items:

    • NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team
    • NLUCat_dataset.json: the complete NLUCat dataset
    • NLUCat_stats.tsv: statistics about the NLUCat dataset
    • dataset: folder with the dataset as published in HuggingFace, split and prepared for training and evaluating intent classifiers
    • reports: folder with the reports provided as feedback to the annotators during the annotation process

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Supported Tasks and Leaderboards

    Intent classification, span identification and example generation.

    Languages

    The dataset is in Catalan (ca-ES).

    Dataset Structure

    Data Instances

    Three JSON files, one for each split.

    Data Fields

    • example: `str`. Example text
    • annotation: `dict`. Annotation of the example
    • intent: `str`. Intent tag
    • slots: `list`. List of slots
    • Tag: `str`. Tag of the slot
    • Text: `str`. Text of the slot
    • Start_char: `int`. First character of the span
    • End_char: `int`. Last character of the span

    Example


    An example looks as follows:

    {
      "example": "Demana una ambulància; la meva dona està de part.",
      "annotation": {
        "intent": "call_emergency",
        "slots": [
          {
            "Tag": "service",
            "Text": "ambulància",
            "Start_char": 11,
            "End_char": 21
          },
          {
            "Tag": "situation",
            "Text": "la meva dona està de part",
            "Start_char": 23,
            "End_char": 48
          }
        ]
      }
    },
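
    As a quick sanity check, here is a minimal Python sketch that tallies intents, assuming NLUCat_dataset.json holds a list of records shaped like the example above (adjust if the real file nests them differently).

    import json
    from collections import Counter

    # Assumes a list of records of the form shown above:
    # {"example": ..., "annotation": {"intent": ..., "slots": [...]}}
    with open("NLUCat_dataset.json", encoding="utf-8") as f:
        records = json.load(f)

    intents = Counter(r["annotation"]["intent"] for r in records)
    print(intents.most_common(10))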


    Data Splits

    • NLUCat.train: 9128 examples
    • NLUCat.dev: 1441 examples
    • NLUCat.test: 1441 examples

    Dataset Creation

    Curation Rationale

    We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

    When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.

    Source Data

    Initial Data Collection and Normalization

    We commissioned a company to create fictitious examples for the creation of this dataset.

    Who are the source language producers?

    We commissioned the writing of the examples to the company m47 labs.

    Annotations

    Annotation process

    The elaboration of this dataset was done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
    * First step: translation or elaboration of the instructions given to the annotators to write the examples.
    * Second step: writing the examples. This step also included grammatical correction and normalization of the texts.
    * Third step: annotating the intents and the slots of each example. In this step, some modifications were made to the annotation guidelines to adjust them to the situations actually encountered.

    Who are the annotators?

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

    Personal and Sensitive Information

    No personal or sensitive information included.

    The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.

    Considerations for Using the Data

    Social Impact of Dataset

    We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.

    Discussion of Biases

    When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
    Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.

    Other Known Limitations

    [N/A]

    Additional Information

    Dataset Curators

    Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

    Licensing Information

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0.
    Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Citation Information

    DOI

    Contributions

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

  8. PV Generation and Consumption Dataset of an Estonian Residential Dwelling

    • data.taltech.ee
    Updated Mar 22, 2025
    Cite
    Sayeed Hasan; Andrei Blinov; Andrii Chub; Dmitri Vinnikov (2025). PV Generation and Consumption Dataset of an Estonian Residential Dwelling [Dataset]. http://doi.org/10.48726/6hayh-x0h25
    Explore at:
    Dataset updated
    Mar 22, 2025
    Dataset provided by
    TalTech Data Repository
    Authors
    Sayeed Hasan; Andrei Blinov; Andrii Chub; Dmitri Vinnikov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Estonia
    Description

    This is a residential PV generation and consumption dataset from an Estonian house. At the time of submission, one year (2023) of data was available. The data was logged at a 10-second resolution. The untouched dataset can be found in the raw data folder, which is separated month-wise. A few missing points in the dataset were filled with a simple KNN algorithm; however, improved data imputation methods based on machine learning are also possible. To carry out the imputation, run the scripts in the script folder one by one in numerical order (SC1..py, SC2..py, etc.).

    Data Descriptor (Scientific Data): https://doi.org/10.1038/s41597-025-04747-w

    General Information:

    Duration: January 2023 – December 2023

    Resolution: 10 seconds

    Dataset Type: Aggregated consumption and PV generation data

    Logging Device: Camile Bauer PQ1000 (×2)

    Load/Appliance Information:

    • 5 kW rooftop PV array connected to the AC bus via a 4.2 kW 3-ϕ inverter
    • Air conditioner: 0.44 kW (cooling), 0.62 kW (heating)
    • Air-to-Water (ATW) heat pump: 2.5 kW (cooling), 2.6 kW (heating)
    • ATW cylinder unit: 0.21 kW (controller), 9 kW (booster heater)
    • Microwave oven: 0.9 kW
    • Coffee maker: 1 kW
    • Cooktop hot plate: 4.6 kW
    • TV: 0.103 kW
    • Vacuum cleaner: 1.5 kW
    • Ventilation: 0.1 kW
    • Washing machine: 2.2 kW
    • Electric sauna: 10 kW
    • Lighting: 0.25 kW
    • EV charger: 2.4 kW, 1-ϕ

    Measurement Points:

    1. PV converter-side current transformer, potential transformer (Measurement of PV generation).
    2. Utility meter-side current transformer, potential transformer (Measurement of power exchange with the grid).

    Measured Parameters:

    • Per-phase mean power recorded within the sampling period
    • Per-phase minimum power recorded within the sampling period
    • Per-phase maximum power recorded within the sampling period
    • Quadrant-wise mean power recorded within the sampling period (1st + 3rd), (2nd + 4th)
    • Quadrant-wise minimum power recorded within the sampling period (1st + 3rd), (2nd + 4th)
    • Quadrant-wise maximum power recorded within the sampling period (1st + 3rd), (2nd + 4th)
    • Mean power factor recorded within the sampling period
    • Minimum power factor recorded within the sampling period
    • Maximum power factor recorded within the sampling period
    • System voltage
    • Minimum system voltage
    • Maximum system voltage
    • Mean voltage between phase and neutral
    • Minimum voltage between phase and neutral
    • Maximum voltage between phase and neutral
    • Zero displacement voltage, 4-wire systems (mean, min, max)

    Script Description:

    SC1_PV_auto_sort.py : This fixes timestamp continuity by resampling at the original sampling rate for PV generation data.

    SC2_L2_auto_sort.py : This fixes timestamp continuity by resampling at the original sampling rate for meter-side measurement data.

    SC3_PV_KNN_impute.py : Filling missing data points by simple KNN for PV generation data.

    SC4_L2_KNN_impute.py : Filling missing data points by simple KNN for meter-side measurement data.

    SC5_Final_data_gen.py : Merge PV and meter-side measurement data, and calculate load consumption.
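
    For illustration, here is a minimal sketch of the same resample-then-impute idea using pandas and scikit-learn. The CSV name and column layout are hypothetical; the provided SC1-SC4 scripts remain the supported route.

    import pandas as pd
    from sklearn.impute import KNNImputer

    # Hypothetical raw file: a timestamp column plus numeric power columns.
    df = pd.read_csv("pv_raw_2023_01.csv", index_col="timestamp", parse_dates=True)

    # Re-establish a continuous 10-second grid (what SC1/SC2 do);
    # gaps in the log show up as NaN rows after resampling.
    df = df.resample("10s").mean()

    # Fill short gaps from the 5 nearest rows in feature space (what
    # SC3/SC4 do); long outages need a forecasting model instead.
    df[:] = KNNImputer(n_neighbors=5).fit_transform(df)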

    The dataset provides all the outcomes (CSV files) from the scripts. All processed variables (PV generation, load, power import, and export) are expressed in kW units.

    Update: 'SC1_PV_auto_sort.py' & 'SC2_L2_auto_sort.py' are adequate for cleaning up the data and making the missing points visible. 'SC3_PV_KNN_impute.py' & 'SC4_L2_KNN_impute.py' work fine for short-range missing data points; however, these two scripts won't help much with data missing over longer periods. They are provided as examples of one method of processing the data. Future updates will include proper ML-based forecasting to predict missing data points.


    Funding Agency and Grant Number:

    1. European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement no. 955614.
    2. Estonian Research Council under Grant PRG1086.
    3. Estonian Centre of Excellence in Energy Efficiency, ENER, funded by the Estonian Ministry of Education and Research under Grant TK230.
  9. IBM Debater® - Recorded Debating Dataset - Release #4 (Full version) +...

    • research.ibm.com
    Updated Sep 25, 2017
    + more versions
    Cite
    (2017). IBM Debater® - Recorded Debating Dataset - Release #4 (Full version) + Annotated general-purpose claim-rebuttal pairs [Dataset]. https://research.ibm.com/haifa/dept/vst/debating_data.shtml
    Explore at:
    Dataset updated
    Sep 25, 2017
    Description

    200 speeches recorded by professional debaters discussing 50 controversial topics (with their manual and automatic transcriptions), and 55 general-purpose claim-rebuttal pairs, along with the results of several annotation experiments performed on these data. The dataset includes:

    • Audio files of 200 debating speeches [first released in IBM Debater® - Recorded Debating Dataset - Release #2]
    • Manual and automatic transcripts of the speeches, in both raw and cleaned (processed) versions [first released in IBM Debater® - Recorded Debating Dataset - Release #2]
    • 55 general-purpose claim-rebuttal pairs written by an expert human debater
    • The results of several annotation experiments performed using the general-purpose claim-rebuttal pairs and the speeches

    Size: 3.2 GB

  10. Data from: International Climate Benchmarks and Input Parameters for a...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Jun 5, 2025
    Cite
    Agricultural Research Service (2025). International Climate Benchmarks and Input Parameters for a Stochastic Weather Generator, CLIGEN [Dataset]. https://catalog.data.gov/dataset/international-climate-benchmarks-and-input-parameters-for-a-stochastic-weather-generator-c-74051
    Explore at:
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset represents CLIGEN input parameters for locations in 68 countries. CLIGEN is a point-scale stochastic weather generator that produces long-term weather simulations with daily output. The input parameters are essentially monthly climate statistics that also serve as climate benchmarks. Three unique input parameter sets are differentiated by having been produced from 30-year, 20-year and 10-year minimum record lengths, corresponding to 7673, 2336, and 2694 stations, respectively. The primary source of data is the NOAA GHCN-Daily dataset, and due to data gaps, records longer than the three minimum record lengths were often queried to produce the needed number of complete monthly records. The vast majority of stations used at least some data from the 2000s, and temporal coverages are shown in the Excel table for each station. CLIGEN has various applications, including being used to force soil erosion models. This dataset may reduce the effort needed in preparing climate inputs for such applications.

    Revised input files added on 11/16/20: revised from the original dataset, fixing metadata issues with the headings of each file and inconsistencies with MX.5P and transition probability values for extremely dry climates and/or months.

    Second revision input files added on 2/12/20: a formatting error was fixed that affected transition probabilities for 238 stations with zero recorded precipitation for one or more months. The affected stations were predominantly in Australia and Mexico.

    Resources in this dataset:

    • 30-year input files (30-year.zip): CLIGEN .par input files based on 30-year minimum record lengths. May be viewed with a text editor. Recommended software: CLIGEN v5.3, https://www.ars.usda.gov/midwest-area/west-lafayette-in/national-soil-erosion-research/docs/wepp/cligen/
    • 20-year input files (20-year.zip): CLIGEN .par input files based on 20-year minimum record lengths. May be viewed with a text editor. Recommended software: CLIGEN v5.3 (URL above).
    • 10-year input files (10-year.zip): CLIGEN .par input files based on 10-year minimum record lengths. May be viewed with a text editor. Recommended software: CLIGEN v5.3 (URL above).
    • Map Layer (MapLayer.kmz): locations of the new CLIGEN stations. May be imported into Google Earth and used to find the station closest to an area of interest. Recommended software: Google Earth, https://www.google.com/earth/
    • Temporal Ranges of Years Queried (GHCN-Daily Year Ranges.xlsx): Excel tables of the first and last years queried from GHCN-Daily when searching for complete monthly records (with no gaps in data). Any ranges in excess of 30, 20 and 10 years, for the respective datasets, are due to data gaps.
    • 30-year, 20-year and 10-year input files, revised (30-year revised.zip, 20-year revised.zip, 10-year revised.zip): as above, incorporating the metadata-heading and MX.5P/transition-probability fixes. Recommended software: CLIGEN v5.3 (URL above).
    • 30-year, 20-year and 10-year input files, revised 2 (30-year revised 2.zip, 20-year revised 2.zip, 10-year revised 2.zip): as above, additionally fixing a formatting issue for 238 stations that affected transition probabilities. Recommended software: CLIGEN v5.3 (URL above).

  11. EMHIRES dataset: wind and solar power generation [archived]

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jul 19, 2024
    Cite
    Iratxe Gonzalez-Aparicio; Andreas Zucker; Francesco Careri; Fabio Monforti; Thomas Huld; Jake Badger (2024). EMHIRES dataset: wind and solar power generation [archived] [Dataset]. http://doi.org/10.5281/zenodo.4803353
    Explore at:
    zip, pdf. Available download formats.
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Iratxe Gonzalez-Aparicio; Andreas Zucker; Francesco Careri; Fabio Monforti; Thomas Huld; Jake Badger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository is now archived. The official repository for the EMHIRES dataset is now "EMHIRES dataset: wind and solar power generation" on Zenodo.

    EMHIRES Wind

    The first version of the EMHIRES dataset releases four different files with hourly wind power generation time series covering 30 years (1986-2015), taking into account the existing wind fleet at the end of 2015, for each country (onshore and offshore), bidding zone, and NUTS 1 and NUTS 2 region. The time series are given as capacity factors. The installed capacities used to calculate the capacity factors are summarised in the annexes of the report.

    - https://setis.ec.europa.eu/emhires-dataset-part-i-wind-power-generation_en

    EMHIRES Solar

    EMHIRES provides RES-E generation time series for the EU-28 and neighbouring countries. The solar power time series are released at hourly granularity and at different aggregation levels: by country, power market bidding zone, and by the European Nomenclature of territorial units for statistics (NUTS) defined by EUROSTAT; in particular, by NUTS 1 and NUTS 2 level. The time series provided by bidding zones include special aggregations to reflect the power market reality where this deviates from political or territorial boundaries.

    The overall scope of EMHIRES is to allow users to assess the impact of meteorological and climate variability on the generation of solar power in Europe, not to mimic the actual evolution of solar power production in recent decades. For this reason, the hourly solar power generation time series are released for the meteorological conditions of the years 1986-2015 (30 years) without considering any changes in solar installed capacity. The installed capacity is fixed at the level installed at the end of 2015. For this reason, data from EMHIRES should not be compared with actual power generation data other than for the reference year 2015.

    - https://setis.ec.europa.eu/emhires-dataset-part-ii-solar-power-generation_en
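
    As an illustration of working with the released files, here is a minimal pandas sketch that averages the hourly capacity factors per country. The file name and the one-column-per-country layout are assumptions to verify against the actual release.

    import pandas as pd

    # Hypothetical file name; EMHIRES releases hourly capacity factors
    # per country for 1986-2015.
    cf = pd.read_csv("EMHIRES_wind_country_level.csv")

    # Mean capacity factor per country over the 30-year window.
    print(cf.mean(numeric_only=True).sort_values(ascending=False).head(10))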

  12. Embedded Generation by Type (SPEN_010) Data Quality Checks - Dataset -...

    • demo.dev.datopian.com
    Updated May 27, 2025
    + more versions
    Cite
    (2025). Embedded Generation by Type (SPEN_010) Data Quality Checks - Dataset - Datopian CKAN instance [Dataset]. https://demo.dev.datopian.com/dataset/sp-energy-networks--spen_data_quality_embedded_generation
    Explore at:
    Dataset updated
    May 27, 2025
    Description

    This data table provides the detailed data quality assessment scores for the Embedded Generation by Type dataset. The quality assessment was carried out on the 31st March.

    At SPEN, we are dedicated to sharing high-quality data with our stakeholders and being transparent about its quality, which is why we openly share the results of our data quality assessments. We collaborate closely with Data Owners to address any identified issues and enhance our overall data quality. To demonstrate our progress we conduct, at a minimum, bi-annual assessments of our data quality; for datasets that are refreshed more frequently than this, please note that the quality assessment may be based on an earlier version of the dataset. To learn more about our approach to how we assess data quality, visit Data Quality - SP Energy Networks.

    We welcome feedback and questions from our stakeholders regarding this process. Our Open Data Team is available to answer any enquiries or receive feedback on the assessments. You can contact them via our Open Data mailbox at opendata@spenergynetworks.co.uk.

    The first phase of our comprehensive data quality assessment measures the quality of our datasets across three dimensions. Please refer to the data table schema for the definitions of these dimensions. We are now in the process of expanding our quality assessments to include additional dimensions to provide a more comprehensive evaluation, and we will update the data tables with the results when available.

  13. Norwegian General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Norwegian General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-norwegian-norway
    Explore at:
    wav. Available download formats.
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Norwegian General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Norwegian speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Norwegian communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Norwegian speech models that understand and respond to authentic Norwegian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Norwegian. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Norwegian speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Norway to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
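
    Here is a minimal standard-library sketch for pairing an audio file with its transcription. The file names and JSON field names are hypothetical; inspect a real transcript for the actual schema.

    import json
    import wave

    # Hypothetical file names; each WAV is paired with a JSON transcript.
    with wave.open("conversation_001.wav", "rb") as w:
        print(w.getframerate(), "Hz,", w.getnchannels(), "channel(s)")

    with open("conversation_001.json", encoding="utf-8") as f:
        transcript = json.load(f)

    # Field names are assumptions, not the documented schema.
    for utt in transcript.get("utterances", [])[:3]:
        print(utt.get("speaker"), utt.get("start"), utt.get("text"))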

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Norwegian speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Norwegian.
    Voice Assistants: Build smart assistants capable of understanding natural Norwegian conversations.

  14. Dataset of books called Boomer nation : the largest and richest generation...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Boomer nation : the largest and richest generation ever, and how it changed America [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Boomer+nation+%3A+the+largest+and+richest+generation+ever%2C+and+how+it+changed+America
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    This dataset is about books. It has 1 row and is filtered where the book is Boomer nation : the largest and richest generation ever, and how it changed America. It features 7 columns including author, publication date, language, and book publisher.

  15. rlu_atari_checkpoints_ordered

    • tensorflow.org
    Updated Dec 9, 2021
    + more versions
    Cite
    (2021). rlu_atari_checkpoints_ordered [Dataset]. https://www.tensorflow.org/datasets/catalog/rlu_atari_checkpoints_ordered
    Explore at:
    Dataset updated
    Dec 9, 2021
    Description

    RL Unplugged is a suite of benchmarks for offline reinforcement learning. RL Unplugged is designed around the following considerations: to facilitate ease of use, we provide the datasets with a unified API which makes it easy for the practitioner to work with all data in the suite once a general pipeline has been established.

    The datasets follow the RLDS format to represent steps and episodes.

    We are releasing a large and diverse dataset of gameplay following the protocol described by Agarwal et al., 2020, which can be used to evaluate several discrete offline RL algorithms. The dataset is generated by running an online DQN agent and recording transitions from its replay during training with sticky actions Machado et al., 2018. As stated in Agarwal et al., 2020, for each game we use data from five runs with 50 million transitions each. We release datasets for 46 Atari games. For details on how the dataset was generated, please refer to the paper. Please see this note about the ROM versions used to generate the datasets.

    Atari is a standard RL benchmark. We recommend you to try offline RL methods on Atari if you are interested in comparing your approach to other state of the art offline RL methods with discrete actions.

    The reward of each step is clipped to [-1, 1], and each episode includes the sum of its clipped rewards.

    Each of the configurations is broken into splits. Splits correspond to checkpoints of 1M steps (note that the number of episodes may differ). Checkpoints are ordered in time (so checkpoint 0 ran before checkpoint 1).

    Episodes within each split are ordered. Check https://www.tensorflow.org/datasets/determinism if you want to ensure that you read episodes in order.

    This dataset corresponds to the one used in the DQN replay paper. https://research.google/tools/datasets/dqn-replay/

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('rlu_atari_checkpoints_ordered', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  16. Next Generation Simulation (NGSIM) Vehicle Trajectories and Supporting Data

    • catalog.data.gov
    • data.transportation.gov
    • +5 more
    Updated Jun 16, 2025
    Cite
    Federal Highway Administration (2025). Next Generation Simulation (NGSIM) Vehicle Trajectories and Supporting Data [Dataset]. https://catalog.data.gov/dataset/next-generation-simulation-ngsim-vehicle-trajectories-and-supporting-data
    Explore at:
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    Federal Highway Administration
    Description

    Click “Export” on the right to download the vehicle trajectory data. The associated metadata and additional data can be downloaded below under "Attachments". Researchers for the Next Generation Simulation (NGSIM) program collected detailed vehicle trajectory data on southbound US 101 and Lankershim Boulevard in Los Angeles, CA, on eastbound I-80 in Emeryville, CA, and on Peachtree Street in Atlanta, Georgia. Data was collected through a network of synchronized digital video cameras. NGVIDEO, a customized software application developed for the NGSIM program, transcribed the vehicle trajectory data from the video. This vehicle trajectory data provides the precise location of each vehicle within the study area every one-tenth of a second, resulting in detailed lane positions and locations relative to other vehicles. Click the "Show More" button below to find additional contextual data and metadata for this dataset.

    For site-specific NGSIM video file datasets, please see the following:

    • NGSIM I-80 Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-I-80-Vide/2577-gpny
    • NGSIM US-101 Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-US-101-Vi/4qzi-thur
    • NGSIM Lankershim Boulevard Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-Lankershi/uv3e-y54k
    • NGSIM Peachtree Street Videos: https://data.transportation.gov/Automobiles/Next-Generation-Simulation-NGSIM-Program-Peachtree/mupt-aksf
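
    As an illustration, here is a hedged pandas sketch that estimates per-vehicle speed from the 0.1-second position samples. The file name is hypothetical and the column names follow the commonly cited NGSIM schema (Vehicle_ID, Frame_ID, Local_Y), so verify them against the downloaded export.

    import pandas as pd

    # Hypothetical export of the trajectory table.
    df = pd.read_csv("ngsim_us101_trajectories.csv")

    # Positions are sampled every 0.1 s, so longitudinal speed can be
    # estimated from successive Local_Y values within each vehicle track.
    df = df.sort_values(["Vehicle_ID", "Frame_ID"])
    df["est_speed"] = df.groupby("Vehicle_ID")["Local_Y"].diff() / 0.1  # ft/s if Local_Y is in feet

    print(df[["Vehicle_ID", "Frame_ID", "Local_Y", "est_speed"]].head())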

  17. Architectural interior styles sample Dataset

    • data.niaid.nih.gov
    Updated Sep 20, 2023
    + more versions
    Cite
    Marcin Kostrzewski (2023). Architectural interior styles sample Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8360664
    Explore at:
    Dataset updated
    Sep 20, 2023
    Dataset provided by
    Michał Ulaniuk
    Marcin Kostrzewski
    Adam Wojdyła
    Description

    The dataset contains around 1600 images depicting a particular interior style. The photos belong to one of eight classes: rustic, industrial, classic, vintage, modernist, art-deco, scandinavian, glamour.

    The source of the dataset is Houzz.com. The images were downloaded from the website and grouped into folders.
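
    Because the images are grouped into one folder per style, a minimal torchvision sketch can load them directly; the root path here is hypothetical.

    from torchvision import datasets, transforms

    # One subfolder per style (rustic/, industrial/, classic/, ...).
    ds = datasets.ImageFolder(
        "interior_styles/",
        transform=transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ]),
    )
    print(ds.classes)  # the eight style labels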

    You may use the dataset under the following terms:

    Research and Development Purposes Only: Access to the dataset hosted on Zenodo is granted exclusively for research and development purposes. Users are required to clearly state their intention for using the dataset in this context.

    Acknowledgment and Citation: Users must commit to providing proper acknowledgment and citation of the dataset in their research or development work. They should include the dataset's DOI and a reference to the original source in all publications, presentations, or reports derived from the dataset.

    No Commercial Use: The dataset is not to be used for any commercial, for-profit, or financially exploitative purposes. Users must refrain from any activities that generate direct monetary gains from the dataset.

    Ethical Use: Users are required to use the dataset in a manner consistent with ethical research practices. This includes respecting privacy, complying with relevant laws and regulations, and ensuring that the use of the data does not harm individuals, groups, or communities.

    No Redistribution: Users are strictly prohibited from redistributing the dataset to third parties without prior written consent from the dataset owner. Any sharing of the dataset should be done solely for collaboration within the context of the research or development project.

    Non-Discrimination: Access to the dataset should not be denied or granted based on factors such as race, ethnicity, gender, religion, nationality, or any other discriminatory criteria. All requests for access will be evaluated solely based on the justification provided by the user.

    No Charge for Access: Users will not be charged any fees for accessing the data hosted on Zenodo. Access is provided free of charge, and users should not be required to make any payments to obtain or use the dataset.

    Compliance with Zenodo's Terms of Use: Users are expected to comply with Zenodo's terms of use, including any additional terms or policies specific to the platform.

  18. Comparison of the Predictive Performance and Interpretability of Random...

    • acs.figshare.com
    • figshare.com
    zip
    Updated Jun 5, 2023
    Cite
    Richard L. Marchese Robinson; Anna Palczewska; Jan Palczewski; Nathan Kidley (2023). Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets [Dataset]. http://doi.org/10.1021/acs.jcim.6b00753.s006
    Explore at:
    zip. Available download formats.
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    ACS Publications
    Authors
    Richard L. Marchese Robinson; Anna Palczewska; Jan Palczewski; Nathan Kidley
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The ability to interpret the predictions made by quantitative structure–activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. These programs are the rfFC package (https://r-forge.r-project.org/R/?group_id=1725) for the R statistical programming language and the Python program HeatMapWrapper [https://doi.org/10.5281/zenodo.495163] for heat map generation.
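
    For a flavor of the kind of comparison described (not the paper's actual pipeline or benchmark data), here is a minimal scikit-learn sketch that cross-validates Random Forest against linear SVR and PLS on a synthetic regression set.

    from sklearn.cross_decomposition import PLSRegression
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVR

    # Synthetic stand-in for a benchmark set; the paper uses public-domain
    # hit-identification and toxicology data instead.
    X, y = make_regression(n_samples=500, n_features=50, noise=0.5, random_state=0)

    models = [
        ("Random Forest", RandomForestRegressor(n_estimators=200, random_state=0)),
        ("Linear SVR", LinearSVR(max_iter=10000)),
        ("PLS", PLSRegression(n_components=5)),
    ]
    for name, model in models:
        scores = cross_val_score(model, X, y, cv=5, scoring="r2")
        print(f"{name}: mean R^2 = {scores.mean():.3f}")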

  19. Network Flow: Power, Current and Embedded Generation (SPEN_008) Data Quality...

    • spenergynetworks.opendatasoft.com
    Updated Mar 28, 2025
    + more versions
    Cite
    (2025). Network Flow: Power, Current and Embedded Generation (SPEN_008) Data Quality Checks [Dataset]. https://spenergynetworks.opendatasoft.com/explore/dataset/spen_data_quality_network_flow/
    Explore at:
    Dataset updated
    Mar 28, 2025
    Description

    This data table provides the detailed data quality assessment scores for the Network Flow: Power, Current and Embedded Generation dataset. The quality assessment was carried out on the 31st March.

    At SPEN, we are dedicated to sharing high-quality data with our stakeholders and being transparent about its quality, which is why we openly share the results of our data quality assessments. We collaborate closely with Data Owners to address any identified issues and enhance our overall data quality. To demonstrate our progress we conduct, at a minimum, bi-annual assessments of our data quality; for datasets that are refreshed more frequently than this, please note that the quality assessment may be based on an earlier version of the dataset. To learn more about our approach to how we assess data quality, visit Data Quality - SP Energy Networks.

    We welcome feedback and questions from our stakeholders regarding this process. Our Open Data Team is available to answer any enquiries or receive feedback on the assessments. You can contact them via our Open Data mailbox at opendata@spenergynetworks.co.uk.

    The first phase of our comprehensive data quality assessment measures the quality of our datasets across three dimensions. Please refer to the data table schema for the definitions of these dimensions. We are now in the process of expanding our quality assessments to include additional dimensions to provide a more comprehensive evaluation, and we will update the data tables with the results when available.

  20. Election 2017 May General Voting Results

    • data.lacity.org
    • s.cnmilf.com
    • +1 more
    application/rdfxml +5
    Updated Jul 13, 2017
    + more versions
    Cite
    (2017). Election 2017 May General Voting Results [Dataset]. https://data.lacity.org/Administration-Finance/Election-2017-May-General-Voting-Results/qpi4-ig3x
    Explore at:
    csv, application/rdfxml, xml, tsv, application/rssxml, json. Available download formats.
    Dataset updated
    Jul 13, 2017
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Statement of Votes Cast for the election results.
