By [source]
The WikiTableQuestions dataset poses complex questions about the contents of semi-structured Wikipedia tables. Beyond merely testing a model's knowledge retrieval capabilities, these questions require an understanding of both the natural language used and the structure of the table itself in order to provide a correct answer. This makes the dataset an excellent testing ground for AI models that aim to replicate or exceed human-level intelligence.
In order to use the WikiTableQuestions dataset, you will first need to understand its structure. The dataset is composed of two types of files: questions and answers. The questions are posed in natural language and are designed to test a model's ability to understand the table structure, understand the question, and reason about the answer. The answers are provided in a list format and supply additional information about each table that can be used to answer the questions.
To start working with the WikiTableQuestions dataset, you will need to download both the questions and answers files. Once you have downloaded both files, you can begin working with the dataset by loading it into a pandas dataframe. From there, you can begin exploring the data and developing your own models for answering the questions.
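Loading a table into a pandas dataframe is a one-liner. The sketch below parses a tiny inline stand-in table to show the pattern; in practice you would point `pd.read_csv` at one of the table files listed further down (e.g. `0.csv`), adjusting the path to wherever you extracted the dataset.

```python
import io
import pandas as pd

# Each table ships as a plain CSV file (0.csv, 1.csv, ...). A small
# inline stand-in is parsed here; in practice use pd.read_csv("0.csv").
csv_text = io.StringIO(
    "Year,City,Country\n"
    "2000,Sydney,Australia\n"
    "2004,Athens,Greece\n"
)
table = pd.read_csv(csv_text)

# Inspect the table's structure before trying to answer questions about it.
print(table.shape)           # (2, 3)
print(list(table.columns))   # ['Year', 'City', 'Country']
```

From here, `table.dtypes` and `table.head()` are the usual first steps for understanding what a given table contains.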
Happy Kaggling!
- The WikiTableQuestions dataset can be used to train a model to answer complex questions about semi-structured Wikipedia tables.
- The WikiTableQuestions dataset can be used to train a model to understand the structure of semi-structured Wikipedia tables.
- The WikiTableQuestions dataset can be used to train a model to understand the natural language questions and reason about the answers.
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: 0.csv
File: 1.csv
File: 10.csv
File: 11.csv
File: 12.csv
File: 14.csv
File: 15.csv
File: 17.csv
File: 18.csv
By Huggingface Hub [source]
The Airoboros-3.1 dataset is designed to help machine learning models excel in the difficult realm of complicated mathematical operations. This data collection features thousands of conversations between machines and humans, formatted in ShareGPT for easy use in open-source fine-tuning ecosystems. The dataset's focus on advanced subjects like factorials, trigonometry, and larger numerical values will help drive machine learning models to the next level, facilitating the acquisition of sophisticated mathematical skills that are essential for ML success. As AI technology advances at a rapid pace, training neural networks to keep up can be a daunting challenge; with Airoboros-3.1's datasets built around difficult mathematical operations, that goal is one step closer.
To get started, download the dataset from Kaggle and use the train.csv file. This file contains over two thousand examples of conversations between ML models and humans, formatted using ShareGPT, a conversation format widely supported by open-source fine-tuning tools. The file includes two columns, category and conversations, both stored as strings.
Once you have downloaded the train file you can begin setting up your own ML training environment by using any of your preferred frameworks or methods. Your model should focus on predicting what kind of mathematical operations will likely be involved in future conversations by referring back to previous dialogues within this dataset for reference (category column). You can also create your own test sets from this data, adding new conversation topics either by modifying existing rows or creating new ones entirely with conversation topics related to mathematics. Finally, compare your model’s results against other established models or algorithms that are already published online!
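As a first exploration step, the category column alone already supports the analysis described above. The sketch below counts conversations per category with pandas; the rows are made-up stand-ins for the real train.csv.

```python
import io
import pandas as pd

# Stand-in rows mimicking train.csv's two string columns; replace the
# StringIO buffer with pd.read_csv("train.csv") for the real data.
sample = io.StringIO(
    "category,conversations\n"
    'math,"What is 7 factorial?"\n'
    'math,"Compute the sine of 30 degrees."\n'
    'trivia,"Who wrote Hamlet?"\n'
)
df = pd.read_csv(sample)

# Distribution of conversation topics, useful as a prior when predicting
# which operations future conversations will involve.
counts = df["category"].value_counts()
print(counts.to_dict())   # {'math': 2, 'trivia': 1}
```

The same `value_counts` output is a natural baseline to compare your model's predicted category distribution against.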
Happy training!
- It can be used to build custom neural networks or machine learning algorithms that are specifically designed for complex mathematical operations.
- This data set can be used to teach and debug more general-purpose machine learning models to recognize large numbers, and intricate calculations within natural language processing (NLP).
- The Airoboros-3.1 dataset can also be utilized as a supervised learning task: models could learn from the conversations provided in the dataset how to respond correctly when presented with complex mathematical operations.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name   | Description                                                                  |
|:--------------|:-----------------------------------------------------------------------------|
| category      | The type of mathematical operation being discussed. (String)                  |
| conversations | The conversations between the machine learning model and the human. (String) |
If you use this dataset in your research, please credit Huggingface Hub.
The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and decompositions (decomposition of complex actions into simpler ones). This dataset has been integrated with DBpedia (259,568 links). For more information see:
- The project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm
- The data is also available on datahub: https://datahub.io/dataset/human-activities-and-instructions

- Quickstart: if you want to experiment with the most high-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally the files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'.
- Data representation: based on the PROHOW vocabulary (http://w3id.org/prohow#). Data extracted from existing web resources is linked to the original resources using the Open Annotation specification.
- Data Model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions could have multiple methods, steps could have further sub-steps, and complex requirements could be decomposed into sub-requirements.

Statistics:
- 211,696: number of instructions. From wikiHow: 167,232 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 44,464 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
- 2,609,236: number of RDF nodes within the instructions. From wikiHow: 1,871,468 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 737,768 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
- 255,101: number of process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs)
- 4,467: number of process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs)
- 376,795: number of step links between 114,166 different sets of instructions (dataset Process - Step Links)
Datasets for readability and text simplicity evaluation in three sizes: 94, 300, 3000 and 160 disjunctive data entries. One data entry contains the following information:
Licenses of the different datasets apply for the respective texts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description. This project contains the dataset relative to the Galatanet survey, conducted in 2009 and 2010 at the Galatasaray University in Istanbul (Turkey). The goal of this survey was to retrieve information regarding the social relationships between students, their feeling regarding the university in general, and their purchase behavior. The survey was conducted during two phases: the first one in 2009 and the second in 2010.
The dataset includes two kinds of data. First, the answers to most of the questions are contained in a large table, available in both CSV and MS Excel formats. A description file explains the meaning of each field appearing in the table. Note that the survey form is also contained in the archive, for reference (it is in French and Turkish only, though). Second, the social network of students is available in both Pajek and GraphML formats. Having both individual (nodal attributes) and relational (links) information in the same dataset is, to our knowledge, rare and difficult to find in public sources, and this makes (in our opinion) this dataset interesting and valuable.
All data are completely anonymous: students' names have been replaced by random numbers. Note that the survey is not exactly the same between the two phases: some small adjustments were applied thanks to the feedback from the first phase (but the datasets have been normalized since then). Also, the electronic form was very much improved for the second phase, which explains why the answers are much more complete than in the first phase.
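Because the network ships in GraphML (and Pajek), graph libraries can read it directly; in Python, networkx's `read_graphml` would apply. The stand-in graph below shows how nodal attributes and links coexist in one structure; the node attributes and the filename are assumptions for illustration.

```python
import networkx as nx

# In practice: g = nx.read_graphml("galatanet.graphml")  # filename assumed
g = nx.Graph()
g.add_node("17", satisfaction=4)   # nodal attribute from the survey table
g.add_node("42", satisfaction=5)
g.add_edge("17", "42")             # relational information: a social tie

print(g.nodes["17"]["satisfaction"], g.number_of_edges())
```

Having attributes and ties in one graph object is exactly what makes joint individual/relational analyses convenient.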
The data were used in our following publications:
Citation. If you use this data, please cite the following article:
@InProceedings{Labatut2010,
  author    = {Labatut, Vincent and Balasque, Jean-Michel},
  title     = {Business-oriented Analysis of a Social Network of University Students},
  booktitle = {International Conference on Advances in Social Networks Analysis and Mining},
  year      = {2010},
  pages     = {25-32},
  address   = {Odense, DK},
  publisher = {IEEE Publishing},
  doi       = {10.1109/ASONAM.2010.15},
}
Contact. 2009-2010 by Jean-Michel Balasque (jmbalasque@gsu.edu.tr) & Vincent Labatut (vlabatut@gsu.edu.tr)
License. This dataset is open data: you can redistribute it and/or use it under the terms of the Creative Commons Zero license (see `license.txt`).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R2), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R2 values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
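To make the procedure concrete, here is a toy reconstruction (not the authors' code) of the REVS idea in Python/NumPy: score every predictor subset by AIC via all-subsets OLS, take each predictor's empirical support as the summed Akaike weight of the subsets containing it, then build the nested model sequence from most- to least-supported variables.

```python
import itertools
import numpy as np

# Synthetic data: 4 candidate predictors, of which only the first two
# carry signal. All numbers are illustrative.
rng = np.random.default_rng(0)
n = 60
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

def aic(cols):
    """AIC of an OLS fit using the given predictor columns."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    k = A.shape[1] + 1                   # coefficients + error variance
    return n * np.log(rss / n) + 2 * k

# All-subsets regression: every non-empty subset of the 4 predictors.
subsets = [s for r in range(1, 5) for s in itertools.combinations(range(4), r)]
aics = np.array([aic(s) for s in subsets])
weights = np.exp(-0.5 * (aics - aics.min()))
weights /= weights.sum()                 # Akaike weights

# Empirical support: summed weight of the subsets containing each variable.
support = np.array([sum(w for s, w in zip(subsets, weights) if v in s)
                    for v in range(4)])
ranking = [int(v) for v in np.argsort(-support)]   # most-supported first

# Nested REVS models: top variable, then top two, and so on
# (n models = number of predictors, so post-hoc comparison stays small).
revs_models = [ranking[: i + 1] for i in range(4)]
print(ranking, revs_models[0])
```

With this construction the number of candidate models to compare post hoc equals the number of predictors, matching the compactness the passage emphasizes.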
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is a realistic simulation based on the original Titanic dataset commonly used in machine learning tutorials. It has been expanded to 3000 passengers and 24 features, including both original fields and engineered ones. It can be used for classification tasks (e.g., predicting survival), feature engineering practice, and algorithm testing.
The dataset preserves the statistical distribution of the original data while offering a larger scale for more complex modeling.
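As a sketch of the classification use case, the toy fit below trains a survival classifier on two illustrative features; the feature choice and all row values are assumptions (the real CSV would be loaded with pandas and the features you engineer).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy survival classifier on [passenger class, fare]; rows are made up.
X = np.array([[1, 80.0], [3, 7.5], [2, 20.0], [3, 8.1], [1, 95.0], [2, 26.0]])
y = np.array([1, 0, 1, 0, 1, 0])   # 1 = survived

clf = LogisticRegression().fit(X, y)

# Training accuracy of the toy model (not a meaningful benchmark).
print(clf.score(X, y))
```

With the full 3000-passenger table, the same pattern extends to a proper train/test split and the 24 available features.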
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The iRead4Skills Dataset 2: annotated corpora by level of complexity for FR, PT and SP is a collection of texts categorized by complexity level and annotated for complexity features, presented in xlsx format. These corpora were compiled, classified and annotated under the scope of the project iRead4Skills – Intelligent Reading Improvement System for Fundamental and Transversal Skills Development, funded by the European Commission (grant number: 1010094837). The project aims to enhance reading skills within the adult population by creating an intelligent system that assesses text complexity and recommends suitable reading materials to adults with low literacy skills, contributing to reducing skills gaps and facilitating access to information and culture (https://iread4skills.com/).

This dataset is the result of specifically devised classification and annotation tasks, in which selected texts were organized and distributed to trainers in Adult Learning (AL) and Vocational Education Training (VET) Centres, as well as to adult students in AL and VET centres. This task was conducted via the Qualtrics platform.

The Dataset 2: annotated corpora by level of complexity for FR, PT and SP is derived from the iRead4Skills Dataset 1: corpora by level of complexity for FR, PT and SP (https://doi.org/10.5281/zenodo.10055909), which comprises written texts of various genres and complexity levels. From this collection, a subset of texts was selected for classification and annotation. This classification and annotation task aimed to provide additional data and test sets for the complexity analysis systems for the three languages of the project: French, Portuguese, and Spanish. The texts in each of the language corpora were selected taking into account the diversity of topics/domains, genres, and the reading preferences of the target audience of the iRead4Skills project.
This selection amounted to a total of 462 texts per language, divided by level of complexity as follows:
- 140 Very Easy texts
- 140 Easy texts
- 140 Plain texts
- 42 More Complex texts

Trainers were asked to classify the texts according to the complexity levels of the project, here informally defined as:
They were also asked to annotate the parts of the texts considered complex according to various types of features, at word level and at sentence level (e.g., word order, sentence composition, etc.), using the following categories:

Lexical/word-related features:
- unknown word
- word too technical/specialized or archaic
- complex derived word
- points to a previous reference that is not obvious
- word (other)

Syntactic/sentence-level features:
- unusual word order
- too much embedded secondary information
- too many connectors in the same sentence
- sentence (other)
- other (please specify)

The sets were divided in three parts in Qualtrics and, in each part, the texts are shown randomly to the annotator. Students were asked to confirm that they could read without difficulty texts adequate to their literacy level. Each set contained texts from a given level, plus one text of the level immediately above. They were also asked to annotate words or sequences of words in the text that they did not understand, according to the following categories:
- difficult word
- difficult part of the text

The complete results and datasets are in TSV/Excel format, in pairs of two files: one file concerning the results from the classification (trainers)/validation (students) task and one file concerning the results from the annotation task. The complete datasets will be available under CC BY-NC-ND 4.0.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Galatanet datasets 2009-2010 by Jean-Michel Balasque (jmbalasque@gsu.edu.tr) & Vincent Labatut (vlabatut@gsu.edu.tr), http://www.gsu.edu.tr
This project contains the datasets relative to the Galatanet survey, conducted in 2009 and 2010 at the Galatasaray University in Istanbul (Turkey). The goal of this survey was to retrieve information regarding the social relationships between students, their feeling regarding the university in general, and their purchase behavior. The survey was conducted during two phases: the first one in 2009 and the second in 2010. For the moment, only the data corresponding to the first phase are available here, because those from the second phase will be used in some publications to come.

The dataset includes two kinds of data. First, the answers to most of the questions are contained in a large table, available in both CSV and MS Excel formats. An explicative file allows understanding the meaning of each field appearing in the table. Note that the survey form is also contained in the archive, for reference (it is in French and Turkish only, though). Second, the social network of students is available in both Pajek and GraphML formats. Having both individual (nodal attributes) and relational (links) information in the same dataset is, to our knowledge, rare and difficult to find in public sources, and this makes (in our opinion) this dataset interesting and valuable.

All data are completely anonymous: students' names have been replaced by random numbers. Note that the survey is not exactly the same between the two phases: some small adjustments were applied thanks to the feedback from the first phase (but the datasets have been normalized since then). Also, the electronic form was very much improved for the second phase, which explains why the answers are much more complete than in the first phase.

If you use this data, please cite the following article: Labatut, V. & Balasque, J.-M. (2010). Business-oriented Analysis of a Social Network of University Students. In: International Conference on Advances in Social Network Analysis and Mining, 25-32. Odense, DK: IEEE. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5562794

Note the data was used in other publications, too:
- An extended version of the original article: Labatut, V. & Balasque, J.-M. (2013). Informative Value of Individual and Relational Data Compared Through Business-Oriented Community Detection. In: Özyer, T.; Rokne, J.; Wagner, G. & Reuser, A. H. (Eds.), The Influence of Technology on Social Network Analysis and Mining, Springer, chap. 6, 303-330. http://link.springer.com/chapter/10.1007/978-3-7091-1346-2_13
- A more didactic article using some of these data just for illustration purposes: Labatut, V. & Balasque, J.-M. (2012). Detection and Interpretation of Communities in Complex Networks: Methods and Practical Application. In: Abraham, A. & Hassanien, A.-E. (Eds.), Computational Social Networks: Tools, Perspectives and Applications, Springer, chap. 4, 81-113. http://link.springer.com/chapter/10.1007/978-1-4471-4048-1_4
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MMLU-Pro-ita Dataset Introduction
This is an Italian translation of MMLU-Pro, a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. This dataset contains 12K complex questions across various disciplines.
1. What's new about MMLU-Pro
Compared to the original MMLU, there are three major differences:
The original MMLU dataset only contains 4 options, MMLU-Pro increases it to 10… See the full description on the dataset page: https://huggingface.co/datasets/efederici/MMLU-Pro-ita.
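One practical consequence of the wider option set: answer letters run A through J instead of A through D. A small sketch of handling that (the option texts below are invented; real records would come from `load_dataset("efederici/MMLU-Pro-ita")` via the Hugging Face datasets library):

```python
import string

# Map option letters to option texts for one made-up 10-option item.
options = ["Roma", "Milano", "Napoli", "Torino", "Palermo",
           "Genova", "Bologna", "Firenze", "Bari", "Catania"]
letters = string.ascii_uppercase[:len(options)]
labelled = dict(zip(letters, options))

print(len(labelled), labelled["J"])   # 10 Catania
```

Any evaluation harness written for 4-option MMLU needs this letter range widened before it can score MMLU-Pro-ita.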
According to our latest research, the global Synthetic Dataplace market size reached USD 2.3 billion in 2024, demonstrating robust momentum driven by increasing demand for privacy-preserving data solutions and advanced AI training datasets. The market is expected to expand at a remarkable CAGR of 31.2% from 2025 to 2033, and is projected to reach USD 23.8 billion by 2033. This extraordinary growth is underpinned by the surge in AI and machine learning adoption across industries, coupled with stringent data privacy regulations that are pushing enterprises to seek synthetic data alternatives.
One of the primary growth factors for the Synthetic Dataplace market is the escalating need for high-quality, diverse, and privacy-compliant datasets to power artificial intelligence and machine learning applications. As organizations across healthcare, finance, and automotive sectors increasingly rely on data-driven insights, the limitations of traditional data—such as scarcity, bias, and privacy risks—have become more pronounced. Synthetic data, generated through advanced algorithms and generative models, offers a promising solution by providing realistic, representative, and fully anonymized datasets. This capability not only accelerates model development and testing but also ensures compliance with global data protection laws, making synthetic dataplace solutions indispensable in modern digital transformation strategies.
Another significant driver propelling the Synthetic Dataplace market is the rapid proliferation of digital technologies and the growing sophistication of cyber threats. Enterprises are recognizing the value of synthetic data in fortifying their cybersecurity postures, enabling them to simulate attack scenarios and stress-test security systems without exposing sensitive information. Additionally, the rise of cloud computing and edge technologies has amplified the need for scalable and secure data generation platforms. Synthetic dataplace solutions, with their ability to generate vast volumes of data on-demand, are increasingly being integrated into cloud architectures, facilitating seamless data sharing and collaboration while minimizing risk. This trend is particularly evident in sectors like finance and healthcare, where data sensitivity and regulatory compliance are paramount.
The Synthetic Dataplace market is also benefiting from advancements in generative AI technologies, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), which have significantly improved the fidelity and utility of synthetic data. These innovations are enabling enterprises to create highly realistic datasets that mimic complex real-world scenarios, thereby enhancing the robustness of AI models. Furthermore, the growing emphasis on ethical AI practices and the need to eliminate bias from training data are prompting organizations to adopt synthetic dataplace solutions as a means to achieve greater fairness and transparency. As a result, vendors in this market are investing heavily in R&D to develop cutting-edge synthetic data generation tools that cater to a wide range of industry-specific requirements.
The emergence of a Synthetic Tabular Data Platform is transforming how organizations approach data generation and utilization. These platforms are designed to create highly realistic tabular datasets that mimic real-world data structures, enabling businesses to conduct robust data analysis and machine learning training without compromising privacy. By leveraging advanced algorithms and statistical techniques, synthetic tabular data platforms ensure that the generated data retains the statistical properties of the original datasets, making them invaluable for industries with stringent data privacy requirements. As companies continue to navigate complex data landscapes, the adoption of synthetic tabular data platforms is expected to rise, offering a scalable and secure solution for data-driven decision-making.
From a regional perspective, North America continues to dominate the Synthetic Dataplace market, accounting for the largest share in 2024, driven by the strong presence of leading technology companies, progressive regulatory frameworks, and early adoption of AI and data-driven solutions. Europe follows closely, supported by stringent data privacy regulations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
It contains the following files:
- spider-realistic.json
# The spider-realistic evaluation set
# Examples: 508
# Databases: 19
- dev.json
# The original dev split of Spider
# Examples: 1034
# Databases: 20
- tables.json
# The original DB schemas from Spider
# Databases: 166
- README.txt
- license
The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
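Each entry in spider-realistic.json pairs a natural language question with its unchanged SQL query. A sketch of reading the file with the standard library (the inline record is a stand-in, and the field names follow the Spider format documented on the Spider GitHub page):

```python
import json

# Stand-in for the contents of spider-realistic.json.
sample = """[
  {"db_id": "concert_singer",
   "question": "How many singers do we have?",
   "query": "SELECT count(*) FROM singer"}
]"""
examples = json.loads(sample)
# In practice:
#   with open("spider-realistic.json") as f:
#       examples = json.load(f)

for ex in examples:
    print(ex["db_id"], "->", ex["query"])
```

Joining each example's `db_id` against tables.json recovers the schema needed to evaluate NL-to-schema alignment.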
This dataset is distributed under the CC BY-SA 4.0 license.
If you use the dataset, please cite the following papers including the original Spider datasets, Finegan-Dollak et al., 2018 and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.
@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}
@inproceedings{Yu&al.18c,
year = 2018,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
booktitle = {EMNLP},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
}
@InProceedings{P18-1033,
author = "Finegan-Dollak, Catherine
and Kummerfeld, Jonathan K.
and Zhang, Li
and Ramanathan, Karthik
and Sadasivam, Sesh
and Zhang, Rui
and Radev, Dragomir",
title = "Improving Text-to-SQL Evaluation Methodology",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "351--360",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1033"
}
@InProceedings{data-sql-imdb-yelp,
dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
title = {SQLizer: Query Synthesis from Natural Language},
booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
month = {October},
year = {2017},
pages = {63:1--63:26},
url = {http://doi.org/10.1145/3133887},
}
@article{data-academic,
dataset = {Academic},
author = {Fei Li and H. V. Jagadish},
title = {Constructing an Interactive Natural Language Interface for Relational Databases},
journal = {Proceedings of the VLDB Endowment},
volume = {8},
number = {1},
month = {September},
year = {2014},
pages = {73--84},
url = {http://dx.doi.org/10.14778/2735461.2735468},
}
@InProceedings{data-atis-geography-scholar,
dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
title = {Learning a Neural Semantic Parser from User Feedback},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2017},
pages = {963--973},
location = {Vancouver, Canada},
url = {http://www.aclweb.org/anthology/P17-1089},
}
@inproceedings{data-geography-original,
dataset = {Geography, original},
author = {John M. Zelle and Raymond J. Mooney},
title = {Learning to Parse Database Queries Using Inductive Logic Programming},
booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
year = {1996},
pages = {1050--1055},
location = {Portland, Oregon},
url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}
@inproceedings{data-restaurants-logic,
author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
year = {2000},
pages = {133--141},
location = {Hong Kong, China},
url = {http://www.aclweb.org/anthology/W00-1317},
}
@inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
title = {Towards a Theory of Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
year = {2003},
location = {Miami, Florida, USA},
pages = {149--157},
url = {http://doi.acm.org/10.1145/604045.604070},
}
@inproceedings{data-restaurants,
author = {Alessandra Giordani and Alessandro Moschitti},
title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
year = {2012},
location = {Montpellier, France},
pages = {59--76},
url = {https://doi.org/10.1007/978-3-642-45260-4_5},
}
The UnifiedQA benchmark consists of 20 main question answering (QA) datasets (each may have multiple versions) that target different formats as well as various complex linguistic phenomena. These datasets are grouped into several formats/categories, including: extractive QA, abstractive QA, multiple-choice QA, and yes/no QA. Additionally, contrast sets are used for several datasets (denoted with "contrast_sets_"). These evaluation sets are expert-generated perturbations that deviate from the patterns common in the original dataset. For several datasets that do not come with evidence paragraphs, two variants are included: one where the datasets are used as-is and another that uses paragraphs fetched via an information retrieval system as additional evidence, indicated with "_ir" tags.
More information can be found at: https://github.com/allenai/unifiedqa.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split; a specific sub-dataset can be selected
# with a config name (see the TFDS catalog for available configs).
ds = tfds.load('unified_qa', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
It is a widely accepted fact that evolving software systems change and grow. However, it is less well understood how change is distributed over time, specifically in object-oriented software systems. The patterns and techniques used to measure growth permit developers to identify specific releases where significant change took place, as well as to inform them of the longer-term trend in the distribution profile. This knowledge assists developers in recording systemic and substantial changes to a release, and provides useful input into a potential release retrospective. However, these analysis methods can only be applied after a mature release of the code has been developed. In order to manage the evolution of complex software systems effectively, it is important to identify change-prone classes as early as possible. Specifically, developers need to know where they can expect change, the likelihood of a change, and the magnitude of these modifications in order to take proactive steps and mitigate any potential risks arising from these changes.
Previous research into change-prone classes has identified some common aspects, with different studies suggesting that complex and large classes tend to undergo more changes, and that classes that changed recently are likely to undergo modifications in the near future. Though the guidance provided is helpful, developers need more specific guidance for it to be applicable in practice. Furthermore, the information needs to be available at a level that can help in developing tools that highlight and monitor evolution-prone parts of a system, as well as support effort estimation activities.
The specific research questions that we address in this chapter are:
(1) What is the likelihood that a class will change from a given version to the next? (a) Does this probability change over time? (b) Is this likelihood project specific, or general?
(2) How is modification frequency distributed for classes that change?
(3) What is the distribution of the magnitude of change? Are most modifications minor adjustments, or substantive modifications?
(4) Does structural complexity make a class susceptible to change?
(5) Does popularity make a class more change-prone?
We make recommendations that can help developers proactively monitor and manage change. These are derived from a statistical analysis of change in approximately 55,000 unique classes across all projects under investigation. The analysis methods that we applied took into consideration the highly skewed nature of the metric data distributions. The raw metric data (4 .txt files and 4 .log files in a .zip file measuring ~2 MB in total) is provided in comma-separated values (CSV) format, and the first line of each CSV file contains the header. A detailed output of the statistical analysis undertaken is provided as log files generated directly from Stata (statistical analysis software).
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
MMLU-Pro Dataset
The MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset, tailored to more rigorously benchmark large language models' capabilities. It contains 12K complex questions across various disciplines. |Github | 🏆Leaderboard | 📖Paper |
🚀 What's New
[2025.10.25] Posted a consolidated note on Health-category issues and minor category updates (does not change overall micro-averaged scores; may slightly affect… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the data on 144 daily smokers each rating 44 preparatory activities for quitting smoking (e.g., envisioning one's desired future self after quitting smoking, tracking one's smoking behavior, learning about progressive muscle relaxation) on their perceived ease/difficulty and required completion time. Since becoming more physically active can make it easier to quit smoking, some activities were also about becoming more physically active (e.g., tracking one's physical activity behavior, learning about what physical activity is recommended, envisioning one's desired future self after becoming more physically active). Moreover, participants provided a free-text response on what makes some activities more difficult than others.
Study
The data was gathered during a study on the online crowdsourcing platform Prolific between 6 September and 16 November 2022. The Human Research Ethics Committee of Delft University of Technology granted ethical approval for the research (Letter of Approval number: 2338).
In this study, daily smokers who were contemplating or preparing to quit smoking first filled in a prescreening questionnaire and were then invited to a repertory grid study if they passed the prescreening. In the repertory grid study, participants were asked to divide sets of 3 preparatory activities for quitting smoking into two subgroups. Afterward, they rated all preparatory activities on the perceived ease of doing them and the perceived required time to do them. Participants also provided a free-text response on what makes some activities more difficult than others.
The study was pre-registered in the Open Science Framework (OSF): https://osf.io/cax6f. This pre-registration describes the study setup, measures, etc. Note that this dataset contains only part of the collected data: the data related to studying the perceived difficulty of preparatory activities.
The file "Preparatory_Activity_Formulations.xlsx" contains the formulations of the 44 preparatory activities used in this study.
Data
This dataset contains three types of data:
- Data from participants' Prolific profiles. This includes, for example, the age, gender, weekly exercise amount, and smoking frequency.
- Data from a prescreening questionnaire. This includes, for example, the stage of change for quitting smoking and whether people previously tried to quit smoking.
- Data from the repertory grid study. This includes the ratings of the 44 activities on ease and required time as well as the free-text responses on what makes some activities more difficult than others.
Each data file is accompanied by a file that explains its columns. For example, the file "prolific_profile_data_explanation.xlsx" contains the column explanations for the data gathered from participants' Prolific profiles.
Each data file contains a column called "rand_id" that can be used to link the data from the data files.
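The linking via "rand_id" can be sketched with pandas as follows; the DataFrames below stand in for the actual data files, and the column names other than "rand_id" are hypothetical examples:

```python
import pandas as pd

# Minimal sketch of linking two of the data files on the shared
# "rand_id" column; the non-"rand_id" columns here are illustrative.
profiles = pd.DataFrame({"rand_id": [1, 2], "age": [34, 29]})
prescreening = pd.DataFrame({"rand_id": [1, 2],
                             "stage_of_change": ["contemplation", "preparation"]})

# Inner join keeps only participants present in both files.
merged = profiles.merge(prescreening, on="rand_id", how="inner")
print(merged)
```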
In the case of questions, please contact Nele Albers (n.albers@tudelft.nl) or Willem-Paul Brinkman (w.p.brinkman@tudelft.nl).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Finding communities in gene co-expression networks is a common first step toward extracting biological insight from these complex datasets. Most community detection algorithms expect genes to be organized into assortative modules, that is, groups of genes that are more associated with each other than with genes in other groups. While it is reasonable to expect that these modules exist, using methods that assume they exist a priori is risky, as it guarantees that alternative organizations of gene interactions will be ignored. Here, we ask: can we find meaningful communities without imposing a modular organization on gene co-expression networks, and how modular are these communities? For this, we use a recently developed community detection method, the weighted degree corrected stochastic block model (SBM), that does not assume that assortative modules exist. Instead, the SBM attempts to efficiently use all information contained in the co-expression network to separate the genes into hierarchically organized blocks of genes. Using RNAseq gene expression data measured in two tissues derived from an outbred population of Drosophila melanogaster, we show that (a) the SBM is able to find ten times as many groups as competing methods, that (b) several of those gene groups are not modular, and that (c) the functional enrichment for non-modular groups is as strong as for modular communities. These results show that the transcriptome is structured in more complex ways than traditionally thought and that we should revisit the long-standing assumption that modularity is the main driver of the structuring of gene co-expression networks.
GPL 3.0: https://choosealicense.com/licenses/gpl-3.0/
ChartQAR
ChartQAR is an extended version of the ChartQA dataset. It builds upon the original chart question answering task by introducing rationales and a wider variety of question types.
This dataset is designed to help models not only answer questions about charts, but also explain their reasoning and handle more complex queries such as multi-step, trend analysis, and type-based reasoning.
Question Types
The dataset covers a broad range of question categories:… See the full description on the dataset page: https://huggingface.co/datasets/YuukiAsuna/ChartQAR.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AIT Log Data Sets
This repository contains synthetic log data suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. A detailed description of the dataset is available in [1]. The logs were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.
In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps is launched against the network. Log data is collected from all hosts and includes Apache access and error logs, authentication logs, DNS logs, VPN logs, audit logs, Suricata logs, network traffic packet captures, horde logs, exim logs, syslog, and system monitoring logs. Separate ground truth files are used to label events that are related to the attacks. Compared to the AIT-LDSv1.1, a more complex network and diverse user behavior is simulated, and logs are collected from all hosts in the network. If you are only interested in network traffic analysis, we also provide the AIT-NDS containing the labeled netflows of the testbed networks. We also provide the AIT-ADS, an alert data set derived by forensically applying open-source intrusion detection systems on the log data.
The datasets in this repository have the following structure:
The gather directory contains all logs collected from the testbed. Logs collected from each host are located in gather//logs/.
The labels directory contains the ground truth of the dataset that indicates which events are related to attacks. The directory mirrors the structure of the gather directory so that each label file is located at the same path and has the same name as the corresponding log file. Each line in the label files references the log event corresponding to an attack by the line number counted from the beginning of the file ("line"), the labels assigned to the line that state the respective attack step ("labels"), and the labeling rules that assigned the labels ("rules"). An example is provided below.
The processing directory contains the source code that was used to generate the labels.
The rules directory contains the labeling rules.
The environment directory contains the source code that was used to deploy the testbed and run the simulation using the Kyoushi Testbed Environment.
The dataset.yml file specifies the start and end time of the simulation.
The following table summarizes relevant properties of the datasets:
fox
Simulation time: 2022-01-15 00:00 - 2022-01-20 00:00
Attack time: 2022-01-18 11:59 - 2022-01-18 13:15
Scan volume: High
Unpacked size: 26 GB
harrison
Simulation time: 2022-02-04 00:00 - 2022-02-09 00:00
Attack time: 2022-02-08 07:07 - 2022-02-08 08:38
Scan volume: High
Unpacked size: 27 GB
russellmitchell
Simulation time: 2022-01-21 00:00 - 2022-01-25 00:00
Attack time: 2022-01-24 03:01 - 2022-01-24 04:39
Scan volume: Low
Unpacked size: 14 GB
santos
Simulation time: 2022-01-14 00:00 - 2022-01-18 00:00
Attack time: 2022-01-17 11:15 - 2022-01-17 11:59
Scan volume: Low
Unpacked size: 17 GB
shaw
Simulation time: 2022-01-25 00:00 - 2022-01-31 00:00
Attack time: 2022-01-29 14:37 - 2022-01-29 15:21
Scan volume: Low
Data exfiltration is not visible in DNS logs
Unpacked size: 27 GB
wardbeck
Simulation time: 2022-01-19 00:00 - 2022-01-24 00:00
Attack time: 2022-01-23 12:10 - 2022-01-23 12:56
Scan volume: Low
Unpacked size: 26 GB
wheeler
Simulation time: 2022-01-26 00:00 - 2022-01-31 00:00
Attack time: 2022-01-30 07:35 - 2022-01-30 17:53
Scan volume: High
No password cracking in attack chain
Unpacked size: 30 GB
wilson
Simulation time: 2022-02-03 00:00 - 2022-02-09 00:00
Attack time: 2022-02-07 10:57 - 2022-02-07 11:49
Scan volume: High
Unpacked size: 39 GB
The following attacks are launched in the network:
Scans (nmap, WPScan, dirb)
Webshell upload (CVE-2020-24186)
Password cracking (John the Ripper)
Privilege escalation
Remote command execution
Data exfiltration (DNSteal)
Note that attack parameters and their execution orders vary in each dataset. Labeled log files are trimmed to the simulation time to ensure that their labels (which reference the related event by the line number in the file) are not misleading. Other log files, however, also contain log events generated before or after the simulation time and may therefore be affected by testbed setup or data collection. It is therefore recommended to only consider logs with timestamps within the simulation time for analysis.
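Restricting an analysis to the simulation window can be sketched as follows; the timestamps below use the fox dataset's simulation time, and the event tuples are hypothetical placeholders for parsed log records:

```python
from datetime import datetime

# Minimal sketch of trimming parsed events to the simulation window;
# the (timestamp, message) pairs are illustrative, not real log data.
start = datetime(2022, 1, 15)
end = datetime(2022, 1, 20)
events = [
    (datetime(2022, 1, 14, 23, 50), "testbed setup noise"),
    (datetime(2022, 1, 18, 12, 5), "event during simulation"),
]

# Keep only events whose timestamp falls inside the simulation time.
in_window = [e for e in events if start <= e[0] < end]
print(len(in_window))
```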
The structure of labels is explained using the audit logs from the intranet server in the russellmitchell data set as an example in the following. The first four labels in the labels/intranet_server/logs/audit/audit.log file are as follows:
{"line": 1860, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1861, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1862, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1863, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
Each JSON object in this file assigns a label to one specific log line in the corresponding log file located at gather/intranet_server/logs/audit/audit.log. The field "line" in the JSON objects specifies the line number of the respective event in the original log file, while the field "labels" comprises the corresponding labels. For example, the lines in the sample above provide the information that lines 1860-1863 in the gather/intranet_server/logs/audit/audit.log file are labeled with "attacker_change_user" and "escalate", corresponding to the attack step where the attacker receives escalated privileges. Inspecting these lines shows that they indeed correspond to the user authenticating as root:
type=USER_AUTH msg=audit(1642999060.603:2226): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=USER_ACCT msg=audit(1642999060.603:2227): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:accounting acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=CRED_ACQ msg=audit(1642999060.615:2228): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:setcred acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=USER_START msg=audit(1642999060.627:2229): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:session_open acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
The same applies to all other labels for this log file and all other log files. There are no labels for logs generated by "normal" (i.e., non-attack) behavior; instead, all log events that have no corresponding JSON object in one of the files from the labels directory, such as the lines 1-1859 in the example above, can be considered to be labeled as "normal". This means that in order to figure out the labels for the log data it is necessary to store the line numbers when processing the original logs from the gather directory and see if these line numbers also appear in the corresponding file in the labels directory.
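The join-by-line-number procedure described above can be sketched as follows; the log lines and label records are toy placeholders standing in for a file from gather/ and its counterpart from labels/:

```python
import json

# Minimal sketch of attaching labels to raw log lines by line number.
# Placeholder data; real records come from the gather/ and labels/ dirs.
log_lines = ["normal event", "normal event", "su: session opened for root"]
label_file = '{"line": 3, "labels": ["attacker_change_user", "escalate"], "rules": {}}'
label_records = [json.loads(label_file)]

# Index the label records by their "line" field.
labels_by_line = {rec["line"]: rec["labels"] for rec in label_records}

# Any line without an entry in the labels file is considered "normal".
labeled = [(event, labels_by_line.get(lineno, ["normal"]))
           for lineno, event in enumerate(log_lines, start=1)]
print(labeled[-1])
```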
Besides the attack labels, an overview of the exact times when specific attack steps are launched is available in gather/attacker_0/logs/attacks.log. An enumeration of all hosts and their IP addresses is stated in processing/config/servers.yml. Moreover, configurations of each host are provided in gather//configs/ and gather//facts.json.
Version history:
AIT-LDS-v1.x: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.
AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).
If you use the dataset, please cite the following publications:
[1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner, M. Wurzenberger, and A. Rauber. "Maintainable Log Datasets for Evaluation of Intrusion Detection Systems". IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 3466-3482, doi: 10.1109/TDSC.2022.3201582. [PDF]
[2] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317. [PDF]
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data were acquired from a small simulated environment consisting of one Windows host (host data collection) and a router that observes all network traffic passing to the host. Two attack scenarios were performed in this environment, and data relevant to these attacks were extracted and further processed. In the first scenario, the attack was based on a vulnerability in the Drupal web application, which enabled downloading and running malicious code that provided a remote shell to the attacker. In the second scenario, the attack was based on an old version of the Samba file-sharing service that was vulnerable to the EternalBlue attack, allowing the attacker to execute commands and obtain a remote shell. The dataset is divided into separate directories according to the attacks contained. For the Drupal vulnerability scenario, datasets from a failed and a successful attempt to exploit the vulnerability are included. Four datasets were created during the individual phases of the SMB file-sharing vulnerability scenario. Each directory contains a normalized network traffic capture and corresponding host data in preformatted JSON.
Drupal Vulnerability Scenario
The attack scenario is based on an old Drupal server (v 8.5.0) with the known vulnerability CVE-2018-7600 (also called Drupalgeddon). This vulnerability is exploited by an attacker to remotely run code and gain access to the vulnerable server via a remote shell. This connection is realized by a Meterpreter trojan of type python/meterpreter/reverse_tcp. The binary is created by the Metasploit generator msfvenom and obfuscated using the attacker's custom obfuscation technique to bypass Windows antivirus. The created binary file is delivered to the victim host using remote code execution in Drupal: the "finger" command is executed to download the payload from the payload delivery and C2 server.
This trojan is then launched by the attacker using additional commands injected through the Drupal vulnerability. Once launched, it automatically establishes a connection with the attacker (remote shell) through the payload delivery and C2 server. As a result, the attacker gains full access to the system and can execute arbitrary commands (in the scenario, only the "whoami" command is executed). Two datasets were generated during the scenario and its preparation. The first was obtained during the preparatory work, when the server's defense mechanisms blocked the attacker's attempt to download the file (the command "MpCmdRun.exe" was used instead of the "finger" command). The second dataset contains a complete attack performed after modifying the executed commands to overcome the mentioned defense mechanisms.
Samba File Sharing Vulnerability Scenario
The attack scenario is based on an unpatched Windows 7 host with the known vulnerability CVE-2017-0144 (also called EternalBlue). The scenario is divided into four parts covering the individual phases of the attack and failed exploitation attempts. In the first part, the attacker performs a scan of open ports on the client device and verifies whether the SMB file-sharing service is vulnerable to the EternalBlue attack. In the next phase, the attacker unsuccessfully tries to exploit the vulnerability using a standard Metasploit module; this procedure does not result in a remote connection. In the third phase, a specialized exploit is used to attack the service using previously known credentials. In the fourth phase, the attacker tries another script, enabling the attack to be performed without credentials and making the scenario more complex. For each phase, a separate dataset was generated, capturing all events in the form of packet traces and corresponding host data.
Dataset Features
For packet capture, the dataset contains standard PCAP files with all captured packets, including the complete application layer. The raw host data were reduced to contain only the following attributes:
- event_id - unique identifier of the event, assigned by a preprocessor
- event_type - the type of the event
- time_created - time when the sensor recorded the event
- event_data - event type-specific payload
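Reading one of the reduced host events can be sketched as follows; the record below is a fabricated illustration of the four attributes, not an actual event from the dataset, and the exact file layout may differ:

```python
import json

# Minimal sketch of parsing a reduced host event; the record is
# illustrative and only mirrors the four documented attributes.
raw = ('{"event_id": 1, "event_type": "USER_AUTH", '
       '"time_created": "2022-01-24T03:05:00", '
       '"event_data": {"acct": "jhall"}}')
event = json.loads(raw)

# Access the event type and its type-specific payload.
print(event["event_type"], event["event_data"]["acct"])
```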
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
By [source]
The WikiTableQuestions dataset poses complex questions about the contents of semi-structured Wikipedia tables. Beyond merely testing a model's knowledge retrieval capabilities, these questions require an understanding of both the natural language used and the structure of the table itself in order to provide a correct answer. This makes the dataset an excellent testing ground for AI models that aim to replicate or exceed human-level intelligence.
In order to use the WikiTableQuestions dataset, you will need to first understand its structure. The dataset comprises two types of files: questions and answers. The questions are in natural language, and are designed to test a model's ability to understand the table structure, understand the natural language question, and reason about the answer. The answers are in a list format, and provide additional information about each table that can be used to answer the questions.
To start working with the WikiTableQuestions dataset, you will need to download both the questions and answers files. Once you have downloaded both files, you can begin working with the dataset by loading it into a pandas dataframe. From there, you can begin exploring the data and developing your own models for answering the questions.
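Loading one of the table files into a pandas DataFrame can be sketched as follows; the inline CSV stands in for one of the dataset's table files (e.g. "0.csv"), whose actual columns vary per table:

```python
import io
import pandas as pd

# Minimal sketch using an inline CSV in place of a downloaded table
# file; in practice you would pass a path such as "0.csv" instead.
csv_text = "Rank,Name,Year\n1,Alpha,1999\n2,Beta,2003\n"
table = pd.read_csv(io.StringIO(csv_text))

# Inspect the table structure before building a question-answering model.
print(table.shape)
print(table.columns.tolist())
```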
Happy Kaggling!
The WikiTableQuestions dataset can be used to train a model to answer complex questions about semi-structured Wikipedia tables.
The WikiTableQuestions dataset can be used to train a model to understand the structure of semi-structured Wikipedia tables.
The WikiTableQuestions dataset can be used to train a model to understand the natural language questions and reason about the answers
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: 0.csv
File: 1.csv
File: 10.csv
File: 11.csv
File: 12.csv
File: 14.csv
File: 15.csv
File: 17.csv
File: 18.csv
If you use this dataset in your research, please credit the original authors.