https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains a couple of great open source models!
One is NousResearch/Nous-Hermes-Llama2-13b, which we can load on Kaggle (I didn't manage to load anything larger than 13B). It also ships curated-transformers, which should allow for easier modifications of the underlying architectures, plus all the dependencies we need to load the model in 8-bit, if that is what you would like to do (updated versions of transformers, accelerate, etc.).
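The notebook linked below walks through the full setup; as a quick, minimal sketch (assuming the bundled dependencies, including bitsandbytes for 8-bit support, are installed), loading the model might look like this:

```python
# Sketch: load Nous-Hermes-Llama2-13b in 8-bit (assumes transformers,
# accelerate and bitsandbytes are installed, e.g. from this dataset's wheels).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/Nous-Hermes-Llama2-13b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # quantize weights to 8-bit via bitsandbytes
    device_map="auto",   # spread layers across available GPU(s)/CPU
)

prompt = "### Instruction:\nName three uses of a 13B LLM.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```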
I show how to load and run Nous-Hermes-Llama2-13b in the following notebook:
👉 💡 Best Open Source LLM Starter Pack 🧪🚀
If you find this dataset helpful, please leave an upvote! 🙂 Thank you! 🙏
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is the official website for downloading the CA sub-dataset of the LargeST benchmark dataset. There are a total of 7 files on this page. Among them, 5 files in .h5 format contain the raw traffic flow data from 2017 to 2021, 1 file in .csv format provides the metadata for the sensors, and 1 file in .npy format represents the adjacency matrix constructed from road network distances. Please refer to https://github.com/liuxu77/LargeST for more information.
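As an illustration, the three file types can be read with standard Python tooling (a sketch; the file names below are assumptions based on the description above, not the actual names on the download page):

```python
# Sketch: read the three file types in the CA sub-dataset.
# File names are hypothetical; substitute the ones you actually download.
import numpy as np
import pandas as pd

flow = pd.read_hdf("ca_his_2019.h5")   # raw traffic flow for one year
meta = pd.read_csv("ca_meta.csv")      # per-sensor metadata
adj = np.load("ca_rn_adj.npy")         # road-network adjacency matrix

print(flow.shape, meta.shape, adj.shape)
assert adj.shape[0] == adj.shape[1]    # square: sensors x sensors
```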
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BUTTER Empirical Deep Learning Dataset represents an empirical study of deep learning phenomena on dense fully connected networks, scanning across thirteen datasets, eight network shapes, fourteen depths, twenty-three network sizes (number of trainable parameters), four learning rates, six minibatch sizes, four levels of label noise, and fourteen levels each of L1 and L2 regularization. Multiple repetitions (typically 30, sometimes 10) of each combination of hyperparameters were performed, and statistics including training and test loss (using an 80% / 20% shuffled train-test split) were recorded at the end of each training epoch. In total, this dataset covers 178 thousand distinct hyperparameter settings ("experiments"), 3.55 million individual training runs (an average of 20 repetitions of each experiment), and a total of 13.3 billion training epochs (three thousand epochs were covered by most runs). Accumulating this dataset consumed 5,448.4 CPU core-years, 17.8 GPU-years, and 111.2 node-years.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
V2 is out!!!
Simple "Reflection" method dataset inspired by mattshumer
This is the prompt and response version. Find ShareGPT version here
This dataset was synthetically generated using Glaive AI.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Coups d'état are important events in the life of a country. They constitute an important subset of irregular transfers of political power that can have significant and enduring consequences for national well-being. There are only a limited number of datasets available to study these events (Powell and Thyne 2011, Marshall and Marshall 2019). Seeking to facilitate research on post-WWII coups by compiling a more comprehensive list and categorization of these events, the Cline Center for Advanced Social Research (previously the Cline Center for Democracy) initiated the Coup d'État Project as part of its Societal Infrastructures and Development (SID) project. More specifically, this dataset identifies the outcomes of coup events (i.e., realized, unrealized, or conspiracy), the type of actor(s) who initiated the coup (i.e., military, rebels, etc.), as well as the fate of the deposed leader.
Version 2.1.3 adds 19 additional coup events to the dataset, corrects the date of a coup in Tunisia, and reclassifies an attempted coup in Brazil in December 2022 as a conspiracy. Version 2.1.2 added 6 additional coup events that occurred in 2022 and updated the coding of an attempted coup event in Kazakhstan in January 2022. Version 2.1.1 corrected a mistake in version 2.1.0, where the designation of "dissident coup" had been dropped in error for coup_id: 00201062021; version 2.1.1 fixed this omission by marking the case as both a dissident coup and an auto-coup. Version 2.1.0 added 36 cases to the dataset and removed two cases from the v2.0.0 data; this update also added actor coding for 46 coup events and added executive outcomes to 18 events from version 2.0.0. A few other changes were made to correct inconsistencies in the coup ID variable and the date of the event. Version 2.0.0 improved several aspects of the previous version (v1.0.0) and incorporated additional source material to include:
• Reconciling missing event data
• Removing events with irreconcilable event dates
• Removing events with insufficient sourcing (each event needs at least two sources)
• Removing events that were inaccurately coded as coup events
• Removing variables that fell below the threshold of inter-coder reliability required by the project
• Removing the spreadsheet 'CoupInventory.xls' because of inadequate attribution and citations in the event summaries
• Extending the period covered from 1945-2005 to 1945-2019
• Adding events from Powell and Thyne's Coup Data (Powell and Thyne, 2011)
Items in this Dataset
1. Cline Center Coup d'État Codebook v.2.1.3 Codebook.pdf - This 15-page document describes the Cline Center Coup d'État Project dataset. The first section of this codebook provides a summary of the different versions of the data. The second section provides a succinct definition of a coup d'état used by the Coup d'État Project and an overview of the categories used to differentiate the wide array of events that meet the project's definition. It also defines coup outcomes. The third section describes the methodology used to produce the data. Revised February 2024
2. Coup Data v2.1.3.csv - This CSV (Comma Separated Values) file contains all of the coup event data from the Cline Center Coup d'État Project. It contains 29 variables and 1000 observations. Revised February 2024
3. Source Document v2.1.3.pdf - This 325-page document provides the sources used for each of the coup events identified in this dataset. Please use the value in the coup_id variable to identify the sources used to identify that particular event. Revised February 2024
4. README.md - This file contains useful information for the user about the dataset. It is a text file written in markdown language. Revised February 2024
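A quick look at the event file listed above might go like this (a sketch; apart from coup_id, which the description names, nothing about the column layout is confirmed):

```python
# Sketch: inspect the coup event file (29 variables, 1000 observations).
import pandas as pd

df = pd.read_csv("Coup Data v2.1.3.csv")
print(df.shape)              # expected: (1000, 29)
print(df["coup_id"].head())  # coup_id links each event to the Source Document
```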
Citation Guidelines
1. To cite the codebook (or any other documentation associated with the Cline Center Coup d'État Project Dataset), please use the following citation: Peyton, Buddy, Joseph Bajjalieh, Dan Shalmon, Michael Martin, Jonathan Bonaguro, and Scott Althaus. 2024. "Cline Center Coup d'État Project Dataset Codebook". Cline Center Coup d'État Project Dataset. Cline Center for Advanced Social Research. V.2.1.3. February 27. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V7
2. To cite data from the Cline Center Coup d'État Project Dataset, please use the following citation (filling in the correct date of access): Peyton, Buddy, Joseph Bajjalieh, Dan Shalmon, Michael Martin, Jonathan Bonaguro, and Emilio Soto. 2024. Cline Center Coup d'État Project Dataset. Cline Center for Advanced Social Research. V.2.1.3. February 27. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V7
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LinTO DataSet Audio for Arabic Tunisian Augmented - a collection of Tunisian dialect audio and its annotations for the STT task.
This is the augmented dataset used to train the LinTO Tunisian dialect (with code-switching) STT model linagora/linto-asr-ar-tn.
Dataset Summary
The LinTO DataSet Audio for Arabic Tunisian Augmented is a dataset that builds on LinTO… See the full description on the dataset page: https://huggingface.co/datasets/linagora/linto-dataset-audio-ar-tn-augmented.
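Loading it through the datasets library would presumably look like this (a sketch; the split name and column layout are assumptions, not confirmed by the card excerpt above):

```python
# Sketch: load the augmented Tunisian ASR data from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("linagora/linto-dataset-audio-ar-tn-augmented", split="train")
sample = ds[0]
print(sample.keys())  # expect an audio column plus its transcription
```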
http://opendatacommons.org/licenses/dbcl/1.0/
Overview This dataset comprises detailed records of customer support tickets, providing valuable insights into various aspects of customer service operations. It is designed to aid in the analysis and modeling of customer support processes, offering a wealth of information for data scientists, machine learning practitioners, and business analysts.
Dataset Description The dataset includes the following features:
Ticket ID: Unique identifier for each support ticket.
Customer Name: Name of the customer who submitted the ticket.
Customer Email: Email address of the customer.
Customer Age: Age of the customer.
Customer Gender: Gender of the customer.
Product Purchased: Product for which the customer has requested support.
Date of Purchase: Date when the product was purchased.
Ticket Type: Type of support ticket (e.g., Technical Issue, Billing Inquiry).
Ticket Subject: Brief subject or title of the ticket.
Ticket Description: Detailed description of the issue or inquiry.
Ticket Status: Current status of the ticket (e.g., Open, Closed, Pending).
Resolution: Description of how the ticket was resolved.
Ticket Priority: Priority level of the ticket (e.g., High, Medium, Low).
Ticket Channel: The channel through which the ticket was submitted (e.g., Email, Phone, Web).
First Response Time: Time taken for the first response to the ticket.
Time to Resolution: Total time taken to resolve the ticket.
Customer Satisfaction Rating: Customer satisfaction rating for the support received.
Usage
This dataset can be utilized for various analytical and modeling purposes, including but not limited to:
Customer Support Analysis: Understand trends and patterns in customer support requests, and analyze ticket volumes, response times, and resolution effectiveness.
NLP for Ticket Categorization: Develop natural language processing models to automatically classify tickets based on their content.
Customer Satisfaction Prediction: Build predictive models to estimate customer satisfaction based on ticket attributes.
Ticket Resolution Time Prediction: Predict the time required to resolve tickets based on historical data.
Customer Segmentation: Segment customers based on their support interactions and demographics.
Recommender Systems: Develop systems to recommend products or solutions based on past support tickets.
Potential Applications:
Enhancing customer support workflows by identifying bottlenecks and areas for improvement.
Automating the ticket triaging process to ensure timely responses.
Improving customer satisfaction through predictive analytics.
Personalizing customer support based on segmentation and past interactions.
File information:
The dataset is provided in CSV format and contains 8470 records and [number of columns] features.
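As an example, a first pass over the file could look like this (a sketch; the file name is hypothetical, and the column names follow the feature list above):

```python
# Sketch: basic exploration of the support-ticket CSV (8470 records).
import pandas as pd

df = pd.read_csv("customer_support_tickets.csv")  # hypothetical file name
print(len(df))                                    # expected: 8470

# Average satisfaction per ticket channel, counting only rated tickets.
rated = df.dropna(subset=["Customer Satisfaction Rating"])
print(rated.groupby("Ticket Channel")["Customer Satisfaction Rating"].mean())
```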
This repo consists of the datasets used for the TaCo paper. There are four datasets:
1. Multilingual Alpaca-52K GPT-4 dataset
2. Multilingual Dolly-15K GPT-4 dataset
3. TaCo dataset
4. Multilingual Vicuna Benchmark dataset
We translated the first three datasets using Google Cloud Translation. The TaCo dataset was created using the TaCo approach described in our paper, combining the Alpaca-52K and Dolly-15K datasets. If you would like to create the TaCo dataset for a specific language, you can… See the full description on the dataset page: https://huggingface.co/datasets/saillab/taco-datasets.
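Pulling one of the sets from the Hub would presumably look like this (a sketch; the data_dir value is a guess at the repository layout, so check the dataset page for the real paths):

```python
# Sketch: load TaCo data files from the Hugging Face Hub.
from datasets import load_dataset

# The repo hosts several sub-datasets; pass the directory of the one you need.
ds = load_dataset(
    "saillab/taco-datasets",
    data_dir="multilingual-alpaca-52k-gpt-4",  # hypothetical directory name
    split="train",
)
print(ds[0])
```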
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Dataset Ow is a dataset for object detection tasks - it contains Player annotations for 10,000 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
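For instance, pulling the dataset into a local project with the roboflow package might look like this (a sketch; the API key, workspace, project ID, and version number are placeholders, not the real identifiers):

```python
# Sketch: download a Roboflow dataset for local training.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")                           # your Roboflow key
project = rf.workspace("your-workspace").project("dataset-ow")  # placeholder IDs
dataset = project.version(1).download("coco")                   # or "yolov8", etc.
print(dataset.location)                                         # local folder path
```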
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Smart Inventory Management: Develop an automated inventory system for grocery stores, where the "Grocery Items" computer vision model identifies and tracks product quantities on shelves, helping store managers optimize restocking and reduce product waste.
Assisted Shopping Experience: Implement a user-friendly app for visually impaired users, where the computer vision model recognizes specific grocery items, making it easier for these individuals to identify and locate the products they need while shopping.
Autonomous Grocery Robots: Develop a shopping assistant robot that uses the computer vision model to identify and collect specific items from a shopping list for customers, improving shopping efficiency and convenience.
Data-driven Marketing Analysis: Leverage the computer vision technology to gather in-store data on product placement and store layout, enabling retailers to make better informed decisions about promotions, discounts, and product placement to maximize sales.
Checkout-less Stores: Create a fully automated grocery store where the "Grocery Items" computer vision model tracks picked up and returned items, allowing customers to simply walk out of the store with their selected items while automatically generating their bills, increasing checkout efficiency and reducing wait times.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
VHS-22 is a heterogeneous, flow-level dataset which combines the ISOT, CICIDS-17, Booters and CTU-13 datasets, as well as traffic from the Malware Traffic Analysis (MTA) site, to increase the variety of malicious and legitimate traffic flows. It contains 27.7 million flows (20.3 million legitimate and 7.4 million attacks). The flows are represented in the form of 45 features; apart from classical NetFlow features, VHS-22 contains statistical parameters and network-level features. Their detailed description and the results of initial detection experiments are presented in the paper:
Paweł Szumełda, Natan Orzechowski, Mariusz Rawski, and Artur Janicki. 2022. VHS-22 – A Very Heterogeneous Set of Network Traffic Data for Threat Detection. In Proc. European Interdisciplinary Cybersecurity Conference (EICC 2022), June 15–16, 2022, Barcelona, Spain. ACM, New York, NY, USA, https://doi.org/10.1145/3528580.3532843
Every day contains different attacks mixed with legitimate traffic:
01-01-2022: Botnet attacks from the ISOT dataset.
02-01-2022: Various attacks from the MTA dataset.
03-01-2022: Web attacks from the CICIDS-17 dataset.
04-01-2022: Bruteforce attacks from the CICIDS-17 dataset.
05-01-2022: Botnet attacks from the CICIDS-17 dataset.
06-01-2022: DDoS attacks from the CICIDS-17 dataset.
07-01-2022 to 11-01-2022: DDoS attacks from the Booters dataset.
12-01-2022 to 23-01-2022: Botnet traffic from the CTU-13 dataset.
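As an example, loading one day's flows (published as CSV, as noted below) could look like this (a sketch; the file name and label values are hypothetical, since the description does not list them):

```python
# Sketch: load one day of VHS-22 flows and split by label.
import pandas as pd

day = pd.read_csv("vhs22_2022-01-05.csv")  # hypothetical name (CICIDS-17 botnet day)
print(day.shape)                           # 45 feature columns plus a label

malicious = day[day["label"] != "legitimate"]  # hypothetical label column/values
print(len(malicious), "attack flows,", len(day) - len(malicious), "legitimate")
```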
The VHS-22 dataset consists of labeled network flows and all data is publicly available for researchers in .csv format. When using VHS-22, please cite our paper which describes the VHS-22 dataset in detail, as well as the publications describing the source datasets:
Paweł Szumełda, Natan Orzechowski, Mariusz Rawski, and Artur Janicki. 2022. VHS-22 – A Very Heterogeneous Set of Network Traffic Data for Threat Detection. In Proc. European Interdisciplinary Cybersecurity Conference (EICC 2022), June 15–16, 2022, Barcelona, Spain. ACM, New York, NY, USA, https://doi.org/10.1145/3528580.3532843
Sherif Saad, Issa Traore, Ali Ghorbani, Bassam Sayed, David Zhao, Wei Lu, John Felix, and Payman Hakimian. 2011. Detecting P2P botnets through network behavior analysis and machine learning. In Proc. International Conference on Privacy, Security and Trust. IEEE, Montreal, Canada, 174–1
Iman Sharafaldin, Arash Habibi Lashkari, and Ali A. Ghorbani. 2018. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization, In Proc. 4th International Conference on Information Systems Security and Privacy (ICISSP 2018), Funchal, Portugal
José Jair Santanna, Romain Durban, Anna Sperotto, and Aiko Pras. 2015. Inside booters: An analysis on operational databases. In Proc. International Symposium on Integrated Network Management (INM 2015). IFIP/IEEE, Ottawa, Canada, 432–440. https://doi.org/10.1109/INM.2015.71403
Riaz Khan, Xiaosong Zhang, Rajesh Kumar, Abubakar Sharif, Noorbakhsh Amiri Golilarz, and Mamoun Alazab. 2019. An Adaptive Multi-Layer Botnet Detection Technique Using Machine Learning Classifiers. Applied Sciences 9 (06 2019), 2375. https://doi.org/10.3390/app91123
The Malware Traffic Analysis data originate from https://www.malware-traffic-analysis.net, authored by Brad.
The work has been funded by the SIMARGL Project -- Secure Intelligent Methods for Advanced RecoGnition of malware and stegomalware, with the support of the European Commission and the Horizon 2020 Program, under Grant Agreement No. 833042.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Corpus for the ICDAR2019 Competition on Post-OCR Text Correction (October 2019)
Christophe Rigaud, Antoine Doucet, Mickael Coustaty, Jean-Philippe Moreux
http://l3i.univ-larochelle.fr/ICDAR2019PostOCR
These are the supplementary materials for the ICDAR 2019 paper "ICDAR 2019 Competition on Post-OCR Text Correction". Please use the following citation:
@inproceedings{rigaud2019pocr,
  title={ICDAR 2019 Competition on Post-OCR Text Correction},
  author={Rigaud, Christophe and Doucet, Antoine and Coustaty, Mickael and Moreux, Jean-Philippe},
  year={2019},
  booktitle={Proceedings of the 15th International Conference on Document Analysis and Recognition (2019)}
}
Description: The corpus accounts for 22M OCRed characters along with the corresponding Gold Standard (GS). The documents come from different digital collections available, among others, at the National Library of France (BnF) and the British Library (BL). The corresponding GS comes both from BnF's internal projects and from external initiatives such as Europeana Newspapers, IMPACT, Project Gutenberg, Perseus and Wikisource.
Repartition of the dataset:
- ICDAR2019_Post_OCR_correction_training_18M.zip: 80% of the full dataset, provided to train participants' methods.
- ICDAR2019_Post_OCR_correction_evaluation_4M: 20% of the full dataset, used for the evaluation (with the Gold Standard made public after the competition).
- ICDAR2019_Post_OCR_correction_full_22M: full dataset, made publicly available after the competition.
Special case for the Finnish language: Material from the National Library of Finland (Finnish dataset FI > FI1) is not allowed to be re-shared on other websites. Please follow these guidelines to get and format the data from the original website:
1. Go to https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en;
2. Download "OCR Ground Truth Pages (Finnish Fraktur) [v1]" (4.8GB) from the Digitalia (2015-17) package;
3. Convert the Excel file "~/metadata/nlf_ocr_gt_tescomb5_2017.xlsx" to Comma Separated Values (.csv) using the save-as function of a spreadsheet program (e.g. Excel, Calc) and copy it into "FI/FI1/HOWTO_get_data/input/";
4. Go to "FI/FI1/HOWTO_get_data/" and run "script_1.py" to generate the full "FI1" dataset in "output/full/";
5. Run "script_2.py" to split the "output/full/" dataset into "output/training/" and "output/evaluation/" subsets.
At the end of the process, you should have "training", "evaluation" and "full" folders with 1579528, 380817 and 1960345 characters respectively.
Licenses: free to use for non-commercial purposes, according to the sources detailed below:
- BG1: IMPACT - National Library of Bulgaria: CC BY NC ND
- CZ1: IMPACT - National Library of the Czech Republic: CC BY NC SA
- DE1: Front pages of the Swiss newspaper NZZ: Creative Commons Attribution 4.0 International (https://zenodo.org/record/3333627)
- DE2: IMPACT - German National Library: CC BY NC ND
- DE3: GT4Hist-dta19 dataset: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE4: GT4Hist - EarlyModernLatin: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE5: GT4Hist - Kallimachos: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE6: GT4Hist - RefCorpus-ENHG-Incunabula: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE7: GT4Hist - RIDGES-Fraktur: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- EN1: IMPACT - British Library: CC BY NC SA 3.0
- ES1: IMPACT - National Library of Spain: CC BY NC SA
- FI1: National Library of Finland: no re-sharing allowed; follow the section above to get the data. (https://digi.kansalliskirjasto.fi/opendata)
- FR1: HIMANIS Project: CC0 (https://www.himanis.org)
- FR2: IMPACT - National Library of France: CC BY NC SA 3.0
- FR3: RECEIPT dataset: CC0 (http://findit.univ-lr.fr)
- NL1: IMPACT - National Library of the Netherlands: CC BY
- PL1: IMPACT - National Library of Poland: CC BY
- SL1: IMPACT - Slovak National Library: CC BY NC
Text post-processing such as cleaning and alignment has been applied to the resources mentioned above, so the Gold Standard and the OCRs provided are not necessarily identical to the originals.
Structure:
- **Content** [./lang_type/sub_folder/#.txt]
  - "[OCR_toInput]" => Raw OCRed text to be de-noised.
  - "[OCR_aligned]" => Aligned OCRed text.
  - "[ GS_aligned]" => Aligned Gold Standard text.
The aligned OCRed/GS texts are provided for training and test purposes. The alignment was made at the character level using "@" symbols. "#" symbols correspond to the absence of GS, either related to alignment uncertainties or to unreadable characters in the source document. For a better view of the alignment, make sure to disable the "word wrap" option in your text editor.
The Error Rate and the quality of the alignment vary according to the nature and the state of degradation of the source documents. Periodicals (mostly historical newspapers), for example, have been reported to be especially challenging due to their complex layout and original fonts. In addition, it should be mentioned that the quality of the Gold Standard also varies, as the dataset aggregates resources from different projects that each have their own annotation procedure, and it obviously contains some errors.
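A parser for this three-line record format might look like this (a sketch based on the layout described above; it assumes each file carries the three bracketed prefixes verbatim):

```python
# Sketch: read one ICDAR2019 Post-OCR file and pair aligned characters.
def read_icdar_file(path):
    fields = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            for key in ("[OCR_toInput]", "[OCR_aligned]", "[ GS_aligned]"):
                if line.startswith(key):
                    fields[key] = line[len(key):].rstrip("\n")
    ocr = fields["[OCR_aligned]"]
    gs = fields["[ GS_aligned]"]
    # "@" pads the character-level alignment; "#" marks positions with no GS.
    pairs = [(o, g) for o, g in zip(ocr, gs) if g != "#"]
    return fields["[OCR_toInput]"], pairs
```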
ICDAR2019 competition
Information related to the tasks, formats and evaluation metrics is detailed at: https://sites.google.com/view/icdar2019-postcorrectionocr/evaluation
References:
- IMPACT, European Commission's 7th Framework Program, grant agreement 215064
- Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter (2018). Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin.
- https://digi.nationallibrary.fi, Wiipuri, 31.12.1904, Digital Collections of the National Library of Finland
- EU Horizon 2020 research and innovation programme, grant agreement No 770299
Contact:
- christophe.rigaud(at)univ-lr.fr
- antoine.doucet(at)univ-lr.fr
- mickael.coustaty(at)univ-lr.fr
- jean-philippe.moreux(at)bnf.fr
L3i - University of La Rochelle, http://l3i.univ-larochelle.fr
BnF - French National Library, http://www.bnf.fr
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Beyond Learning is a dataset for object detection tasks - it contains Focused annotations for 472 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
This dataset is part of a series of datasets in which batteries are continuously cycled with randomly generated current profiles. Reference charging and discharging cycles are also performed after a fixed interval of randomized usage to provide reference benchmarks for battery state of health. In this dataset, four 18650 Li-ion batteries (identified as RW9, RW10, RW11 and RW12) were continuously operated using a sequence of charging and discharging currents between -4.5A and 4.5A. This type of charging and discharging operation is referred to here as random walk (RW) operation. Each of the loading periods lasted 5 minutes, and after 1500 periods (about 5 days) a series of reference charging and discharging cycles were performed in order to provide reference benchmarks for battery state of health.
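To make the loading profile concrete, a random walk like the one described might be simulated as follows (an illustrative sketch, not the project's actual generator; the 0.5A discretization of the current levels is an assumption):

```python
# Sketch: simulate a random-walk (RW) loading schedule like the one described.
import random

def rw_schedule(num_periods=1500, period_minutes=5):
    """Yield (duration_min, current_A) pairs with currents in [-4.5, 4.5] A."""
    levels = [i * 0.5 for i in range(-9, 10)]  # assumed 0.5A steps
    for _ in range(num_periods):
        yield period_minutes, random.choice(levels)

total_min = sum(duration for duration, _ in rw_schedule())
print(total_min / 60 / 24, "days")  # 1500 x 5 min ≈ 5.2 days, matching the text
```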
https://brightdata.com/license
The Booking Hotel Listings Dataset provides a structured and in-depth view of accommodations worldwide, offering essential data for travel industry professionals, market analysts, and businesses. This dataset includes key details such as hotel names, locations, star ratings, pricing, availability, room configurations, amenities, guest reviews, sustainability features, and cancellation policies.
With this dataset, users can:
Analyze market trends to understand booking behaviors, pricing dynamics, and seasonal demand.
Enhance travel recommendations by identifying top-rated hotels based on reviews, location, and amenities.
Optimize pricing and revenue strategies by benchmarking property performance and availability patterns.
Assess guest satisfaction through sentiment analysis of ratings and reviews.
Evaluate sustainability efforts by examining eco-friendly features and certifications.
Designed for hospitality businesses, travel platforms, AI-powered recommendation engines, and pricing strategists, this dataset enables data-driven decision-making to improve customer experience and business performance.
Use Cases
Booking Hotel Listings in Greece
Gain insights into Greece’s diverse hospitality landscape, from luxury resorts in Santorini to boutique hotels in Athens. Analyze review scores, availability trends, and traveler preferences to refine booking strategies.
Booking Hotel Listings in Croatia
Explore hotel data across Croatia’s coastal and inland destinations, ideal for travel planners targeting visitors to Dubrovnik, Split, and Plitvice Lakes. This dataset includes review scores, pricing, and sustainability features.
Booking Hotel Listings with Review Scores Greater Than 9
A curated selection of high-rated hotels worldwide, ideal for luxury travel planners and market researchers focused on premium accommodations that consistently exceed guest expectations.
Booking Hotel Listings in France with More Than 1000 Reviews
Analyze well-established and highly reviewed hotels across France, ensuring reliable guest feedback for market insights and customer satisfaction benchmarking.
This dataset serves as an indispensable resource for travel analysts, hospitality businesses, and data-driven decision-makers, providing the intelligence needed to stay competitive in the ever-evolving travel industry.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Retail Inventory Automation: The computer vision model could be utilized by retail store owners to automate their inventory management. By scanning the shelves using the model, they can easily identify what products they have in stock, and in what quantities. This could significantly save time and cost in inventory management in retail businesses.
Automated Checkout Systems: The computer vision model could be used for creating automated or “self-checkout” systems in stores. Shoppers could quickly and easily checkout by simply scanning their items, and this would reduce the need for cashier staff and reduce waiting times for customers.
Smart Vending Machines: The model can be used to create intelligent vending machines that can identify the specific product a customer picked from the display, automatically calculate the total cost, and process payment.
Customer Behaviour Analysis: Shops can use this model to track customer behavior in the store - which products they consider, how often they pick up a product, put it back, etc. This data can then be used for analytics and improving store layout, product placement, or promotions.
Waste Management: In recycling and waste management, it could be used to identify and sort different products and brands, making it easier to recycle items correctly and maintain sustainability practices.
https://academictorrents.com/nolicensespecified
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Vehicles Dataset is a dataset for object detection tasks - it contains Vehicles annotations for 1,452 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Data Usage (download from Hugging Face)
We provide separate list files for all data and SFT data. The all_data_list.json file contains the YouTube video IDs and the names of the clips obtained from the video segmentation (these names serve as unique identifiers and can be used to locate the corresponding annotations in the annotation folder). Each YouTube video ID is specific to a single video on youtube.com; for example, you can access 8Hg_-5aUOYo through Link… See the full description on the dataset page: https://huggingface.co/datasets/dorni/SpeakerVid-5M-Dataset.
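For illustration, mapping clip names back to their source videos and annotation files might go as follows (a sketch; apart from all_data_list.json and the annotation folder, which the description names, the JSON layout and annotation paths are assumptions):

```python
# Sketch: resolve clips in all_data_list.json to YouTube URLs and annotations.
import json
from pathlib import Path

with open("all_data_list.json") as f:
    data = json.load(f)  # assumed layout: {video_id: [clip_name, ...]}

for video_id, clips in data.items():
    url = f"https://www.youtube.com/watch?v={video_id}"
    for clip in clips:
        ann = Path("annotation") / f"{clip}.json"  # hypothetical annotation path
        print(url, "->", ann)
```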
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Fire Smoke Datasets is a dataset for object detection tasks - it contains Fire Smoke annotations for 9,429 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).