By [source]
The WikiTableQuestions dataset poses complex questions about the contents of semi-structured Wikipedia tables. Beyond merely testing a model's knowledge retrieval capabilities, these questions require an understanding of both the natural language used and the structure of the table itself in order to provide a correct answer. This makes the dataset an excellent testing ground for AI models that aim to replicate or exceed human-level intelligence.
In order to use the WikiTableQuestions dataset, you will first need to understand its structure. The dataset is composed of two types of files: questions and answers. The questions are posed in natural language and are designed to test a model's ability to understand the table structure, understand the question, and reason about the answer. The answers are provided in a list format and supply additional information about each table that can be used to answer the questions.
To start working with the WikiTableQuestions dataset, you will need to download both the questions and answers files. Once you have downloaded both files, you can begin working with the dataset by loading it into a pandas dataframe. From there, you can begin exploring the data and developing your own models for answering the questions.
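Loading a table into a pandas dataframe is a one-liner. The sketch below parses a tiny inline stand-in table to show the pattern; in practice you would point `pd.read_csv` at one of the table files listed further down (e.g. `0.csv`), adjusting the path to wherever you extracted the dataset.

```python
import io
import pandas as pd

# Each table ships as a plain CSV file (0.csv, 1.csv, ...). A small
# inline stand-in is parsed here; in practice use pd.read_csv("0.csv").
csv_text = io.StringIO(
    "Year,City,Country\n"
    "2000,Sydney,Australia\n"
    "2004,Athens,Greece\n"
)
table = pd.read_csv(csv_text)

# Inspect the table's structure before trying to answer questions about it.
print(table.shape)           # (2, 3)
print(list(table.columns))   # ['Year', 'City', 'Country']
```

From here, `table.dtypes` and `table.head()` are the usual first steps for understanding what a given table contains.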
Happy Kaggling!
- The WikiTableQuestions dataset can be used to train a model to answer complex questions about semi-structured Wikipedia tables.
- The WikiTableQuestions dataset can be used to train a model to understand the structure of semi-structured Wikipedia tables.
- The WikiTableQuestions dataset can be used to train a model to understand the natural language questions and reason about the answers.
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: 0.csv
File: 1.csv
File: 10.csv
File: 11.csv
File: 12.csv
File: 14.csv
File: 15.csv
File: 17.csv
File: 18.csv
By Huggingface Hub [source]
The Airoboros-3.1 dataset is designed to help machine learning models excel in the difficult realm of complicated mathematical operations. This data collection features thousands of conversations between machines and humans, formatted in ShareGPT for easy use in open-source fine-tuning ecosystems. The dataset's focus on advanced subjects like factorials, trigonometry, and larger numerical values will help drive machine learning models to the next level, facilitating the acquisition of sophisticated mathematical skills that are essential for ML success. As AI technology advances at a rapid pace, training neural networks to keep up can be a daunting challenge; with Airoboros-3.1's datasets built around difficult mathematical operations, that goal is one step closer.
To get started, download the dataset from Kaggle and use the train.csv file. This file contains over two thousand examples of conversations between ML models and humans, formatted using ShareGPT, a conversation format widely supported by open-source fine-tuning tools. The file includes two columns, category and conversations, both stored as strings.
Once you have downloaded the train file you can begin setting up your own ML training environment by using any of your preferred frameworks or methods. Your model should focus on predicting what kind of mathematical operations will likely be involved in future conversations by referring back to previous dialogues within this dataset for reference (category column). You can also create your own test sets from this data, adding new conversation topics either by modifying existing rows or creating new ones entirely with conversation topics related to mathematics. Finally, compare your model’s results against other established models or algorithms that are already published online!
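As a first exploration step, the category column alone already supports the analysis described above. The sketch below counts conversations per category with pandas; the rows are made-up stand-ins for the real train.csv.

```python
import io
import pandas as pd

# Stand-in rows mimicking train.csv's two string columns; replace the
# StringIO buffer with pd.read_csv("train.csv") for the real data.
sample = io.StringIO(
    "category,conversations\n"
    'math,"What is 7 factorial?"\n'
    'math,"Compute the sine of 30 degrees."\n'
    'trivia,"Who wrote Hamlet?"\n'
)
df = pd.read_csv(sample)

# Distribution of conversation topics, useful as a prior when predicting
# which operations future conversations will involve.
counts = df["category"].value_counts()
print(counts.to_dict())   # {'math': 2, 'trivia': 1}
```

The same `value_counts` output is a natural baseline to compare your model's predicted category distribution against.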
Happy training!
- It can be used to build custom neural networks or machine learning algorithms that are specifically designed for complex mathematical operations.
- This data set can be used to teach and debug more general-purpose machine learning models to recognize large numbers, and intricate calculations within natural language processing (NLP).
- The Airoboros-3.1 dataset can also be utilized as a supervised learning task: models could learn from the conversations provided in the dataset how to respond correctly when presented with complex mathematical operations.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name   | Description                                                                  |
|:--------------|:-----------------------------------------------------------------------------|
| category      | The type of mathematical operation being discussed. (String)                  |
| conversations | The conversations between the machine learning model and the human. (String) |
If you use this dataset in your research, please credit Huggingface Hub.
The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and decompositions (decomposition of complex actions into simpler ones). This dataset has been integrated with DBpedia (259,568 links). For more information see:
- The project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm
- The data is also available on datahub: https://datahub.io/dataset/human-activities-and-instructions

- Quickstart: if you want to experiment with the most high-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally the files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'.
- Data representation: based on the PROHOW vocabulary (http://w3id.org/prohow#). Data extracted from existing web resources is linked to the original resources using the Open Annotation specification.
- Data Model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions could have multiple methods, steps could have further sub-steps, and complex requirements could be decomposed into sub-requirements.

Statistics:
- 211,696: number of instructions. From wikiHow: 167,232 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 44,464 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
- 2,609,236: number of RDF nodes within the instructions. From wikiHow: 1,871,468 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 737,768 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
- 255,101: number of process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs)
- 4,467: number of process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs)
- 376,795: number of step links between 114,166 different sets of instructions (dataset Process - Step Links)
Datasets for readability and text simplicity evaluation in three sizes: 94, 300, 3000 and 160 disjunctive data entries. One data entry contains the following information:
Licenses of the different datasets apply for the respective texts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description. This project contains the dataset relative to the Galatanet survey, conducted in 2009 and 2010 at the Galatasaray University in Istanbul (Turkey). The goal of this survey was to retrieve information regarding the social relationships between students, their feeling regarding the university in general, and their purchase behavior. The survey was conducted during two phases: the first one in 2009 and the second in 2010.
The dataset includes two kinds of data. First, the answers to most of the questions are contained in a large table, available in both CSV and MS Excel formats. A description file explains the meaning of each field appearing in the table. Note that the survey form is also contained in the archive, for reference (it is in French and Turkish only, though). Second, the social network of students is available in both Pajek and GraphML formats. Having both individual (nodal attributes) and relational (links) information in the same dataset is, to our knowledge, rare and difficult to find in public sources, and this makes (in our opinion) this dataset interesting and valuable.
All data are completely anonymous: students' names have been replaced by random numbers. Note that the survey is not exactly the same between the two phases: some small adjustments were applied thanks to the feedback from the first phase (but the datasets have been normalized since then). Also, the electronic form was very much improved for the second phase, which explains why the answers are much more complete than in the first phase.
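Because the network ships in GraphML (and Pajek), graph libraries can read it directly; in Python, networkx's `read_graphml` would apply. The stand-in graph below shows how nodal attributes and links coexist in one structure; the node attributes and the filename are assumptions for illustration.

```python
import networkx as nx

# In practice: g = nx.read_graphml("galatanet.graphml")  # filename assumed
g = nx.Graph()
g.add_node("17", satisfaction=4)   # nodal attribute from the survey table
g.add_node("42", satisfaction=5)
g.add_edge("17", "42")             # relational information: a social tie

print(g.nodes["17"]["satisfaction"], g.number_of_edges())
```

Having attributes and ties in one graph object is exactly what makes joint individual/relational analyses convenient.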
The data were used in our following publications:
Citation. If you use this data, please cite the following article:
@InProceedings{Labatut2010,
  author    = {Labatut, Vincent and Balasque, Jean-Michel},
  title     = {Business-oriented Analysis of a Social Network of University Students},
  booktitle = {International Conference on Advances in Social Networks Analysis and Mining},
  year      = {2010},
  pages     = {25-32},
  address   = {Odense, DK},
  publisher = {IEEE Publishing},
  doi       = {10.1109/ASONAM.2010.15},
}
Contact. 2009-2010 by Jean-Michel Balasque (jmbalasque@gsu.edu.tr) & Vincent Labatut (vlabatut@gsu.edu.tr)
License. This dataset is open data: you can redistribute it and/or use it under the terms of the Creative Commons Zero license (see `license.txt`).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R2), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R2 values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
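To make the procedure concrete, here is a toy reconstruction (not the authors' code) of the REVS idea in Python/NumPy: score every predictor subset by AIC via all-subsets OLS, take each predictor's empirical support as the summed Akaike weight of the subsets containing it, then build the nested model sequence from most- to least-supported variables.

```python
import itertools
import numpy as np

# Synthetic data: 4 candidate predictors, of which only the first two
# carry signal. All numbers are illustrative.
rng = np.random.default_rng(0)
n = 60
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

def aic(cols):
    """AIC of an OLS fit using the given predictor columns."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    k = A.shape[1] + 1                   # coefficients + error variance
    return n * np.log(rss / n) + 2 * k

# All-subsets regression: every non-empty subset of the 4 predictors.
subsets = [s for r in range(1, 5) for s in itertools.combinations(range(4), r)]
aics = np.array([aic(s) for s in subsets])
weights = np.exp(-0.5 * (aics - aics.min()))
weights /= weights.sum()                 # Akaike weights

# Empirical support: summed weight of the subsets containing each variable.
support = np.array([sum(w for s, w in zip(subsets, weights) if v in s)
                    for v in range(4)])
ranking = [int(v) for v in np.argsort(-support)]   # most-supported first

# Nested REVS models: top variable, then top two, and so on
# (n models = number of predictors, so post-hoc comparison stays small).
revs_models = [ranking[: i + 1] for i in range(4)]
print(ranking, revs_models[0])
```

With this construction the number of candidate models to compare post hoc equals the number of predictors, matching the compactness the passage emphasizes.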
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is a realistic simulation based on the original Titanic dataset commonly used in machine learning tutorials. It has been expanded to 3000 passengers and 24 features, including both original fields and engineered ones. It can be used for classification tasks (e.g., predicting survival), feature engineering practice, and algorithm testing.
The dataset preserves the statistical distribution of the original data while offering a larger scale for more complex modeling.
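As a sketch of the classification use case, the toy fit below trains a survival classifier on two illustrative features; the feature choice and all row values are assumptions (the real CSV would be loaded with pandas and the features you engineer).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy survival classifier on [passenger class, fare]; rows are made up.
X = np.array([[1, 80.0], [3, 7.5], [2, 20.0], [3, 8.1], [1, 95.0], [2, 26.0]])
y = np.array([1, 0, 1, 0, 1, 0])   # 1 = survived

clf = LogisticRegression().fit(X, y)

# Training accuracy of the toy model (not a meaningful benchmark).
print(clf.score(X, y))
```

With the full 3000-passenger table, the same pattern extends to a proper train/test split and the 24 available features.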
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The iRead4Skills Dataset 2: annotated corpora by level of complexity for FR, PT and SP is a collection of texts categorized by complexity level and annotated for complexity features, presented in xlsx format. These corpora were compiled, classified and annotated under the scope of the project iRead4Skills – Intelligent Reading Improvement System for Fundamental and Transversal Skills Development, funded by the European Commission (grant number: 1010094837). The project aims to enhance reading skills within the adult population by creating an intelligent system that assesses text complexity and recommends suitable reading materials to adults with low literacy skills, contributing to reducing skills gaps and facilitating access to information and culture (https://iread4skills.com/).

This dataset is the result of specifically devised classification and annotation tasks, in which selected texts were organized and distributed to trainers in Adult Learning (AL) and Vocational Education Training (VET) Centres, as well as to adult students in AL and VET centres. This task was conducted via the Qualtrics platform.

The Dataset 2: annotated corpora by level of complexity for FR, PT and SP is derived from the iRead4Skills Dataset 1: corpora by level of complexity for FR, PT and SP (https://doi.org/10.5281/zenodo.10055909), which comprises written texts of various genres and complexity levels. From this collection, a subset of texts was selected for classification and annotation. This classification and annotation task aimed to provide additional data and test sets for the complexity analysis systems for the three languages of the project: French, Portuguese, and Spanish. The texts in each of the language corpora were selected taking into account the diversity of topics/domains, genres, and the reading preferences of the target audience of the iRead4Skills project.
This selection amounted to a total of 462 texts per language, divided by level of complexity as follows:
- 140 Very Easy texts
- 140 Easy texts
- 140 Plain texts
- 42 More Complex texts

Trainers were asked to classify the texts according to the complexity levels of the project, here informally defined as:
They were also asked to annotate the parts of the texts considered complex according to various types of features, at word level and at sentence level (e.g., word order, sentence composition, etc.), using the following categories:

Lexical/word-related features:
- unknown word
- word too technical/specialized or archaic
- complex derived word
- points to a previous reference that is not obvious
- word (other)

Syntactic/sentence-level features:
- unusual word order
- too much embedded secondary information
- too many connectors in the same sentence
- sentence (other)
- other (please specify)

The sets were divided in three parts in Qualtrics and, in each part, the texts are shown randomly to the annotator. Students were asked to confirm that they could read without difficulty texts adequate to their literacy level. Each set contained texts from a given level, plus one text of the level immediately above. They were also asked to annotate words or sequences of words in the text that they did not understand, according to the following categories:
- difficult word
- difficult part of the text

The complete results and datasets are in TSV/Excel format, in pairs of two files: one file concerning the results from the classification (trainers)/validation (students) task and one file concerning the results from the annotation task. The complete datasets will be available under CC BY-NC-ND 4.0.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Galatanet datasets 2009-2010 by Jean-Michel Balasque (jmbalasque@gsu.edu.tr) & Vincent Labatut (vlabatut@gsu.edu.tr), http://www.gsu.edu.tr
This project contains the datasets relative to the Galatanet survey, conducted in 2009 and 2010 at the Galatasaray University in Istanbul (Turkey). The goal of this survey was to retrieve information regarding the social relationships between students, their feeling regarding the university in general, and their purchase behavior. The survey was conducted during two phases: the first one in 2009 and the second in 2010. For the moment, only the data corresponding to the first phase are available here, because those from the second phase will be used in some publications to come.

The dataset includes two kinds of data. First, the answers to most of the questions are contained in a large table, available in both CSV and MS Excel formats. An explicative file allows understanding the meaning of each field appearing in the table. Note that the survey form is also contained in the archive, for reference (it is in French and Turkish only, though). Second, the social network of students is available in both Pajek and GraphML formats. Having both individual (nodal attributes) and relational (links) information in the same dataset is, to our knowledge, rare and difficult to find in public sources, and this makes (in our opinion) this dataset interesting and valuable.

All data are completely anonymous: students' names have been replaced by random numbers. Note that the survey is not exactly the same between the two phases: some small adjustments were applied thanks to the feedback from the first phase (but the datasets have been normalized since then). Also, the electronic form was very much improved for the second phase, which explains why the answers are much more complete than in the first phase.

If you use this data, please cite the following article: Labatut, V. & Balasque, J.-M. (2010). Business-oriented Analysis of a Social Network of University Students. In: International Conference on Advances in Social Network Analysis and Mining, 25-32. Odense, DK: IEEE. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5562794

Note the data was used in other publications, too:
- An extended version of the original article: Labatut, V. & Balasque, J.-M. (2013). Informative Value of Individual and Relational Data Compared Through Business-Oriented Community Detection. In: Özyer, T.; Rokne, J.; Wagner, G. & Reuser, A. H. (Eds.), The Influence of Technology on Social Network Analysis and Mining, Springer, chap. 6, 303-330. http://link.springer.com/chapter/10.1007/978-3-7091-1346-2_13
- A more didactic article using some of these data just for illustration purposes: Labatut, V. & Balasque, J.-M. (2012). Detection and Interpretation of Communities in Complex Networks: Methods and Practical Application. In: Abraham, A. & Hassanien, A.-E. (Eds.), Computational Social Networks: Tools, Perspectives and Applications, Springer, chap. 4, 81-113. http://link.springer.com/chapter/10.1007/978-1-4471-4048-1_4
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
MMLU-Pro-ita Dataset Introduction
This is an Italian translation of MMLU-Pro, a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. This dataset contains 12K complex questions across various disciplines.
1. What's new about MMLU-Pro
Compared to the original MMLU, there are three major differences:
The original MMLU dataset only contains 4 options, MMLU-Pro increases it to 10… See the full description on the dataset page: https://huggingface.co/datasets/efederici/MMLU-Pro-ita.
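One practical consequence of the wider option set: answer letters run A through J instead of A through D. A small sketch of handling that (the option texts below are invented; real records would come from `load_dataset("efederici/MMLU-Pro-ita")` via the Hugging Face datasets library):

```python
import string

# Map option letters to option texts for one made-up 10-option item.
options = ["Roma", "Milano", "Napoli", "Torino", "Palermo",
           "Genova", "Bologna", "Firenze", "Bari", "Catania"]
letters = string.ascii_uppercase[:len(options)]
labelled = dict(zip(letters, options))

print(len(labelled), labelled["J"])   # 10 Catania
```

Any evaluation harness written for 4-option MMLU needs this letter range widened before it can score MMLU-Pro-ita.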
According to our latest research, the global Synthetic Dataplace market size reached USD 2.3 billion in 2024, demonstrating robust momentum driven by increasing demand for privacy-preserving data solutions and advanced AI training datasets. The market is expected to expand at a remarkable CAGR of 31.2% from 2025 to 2033, and is projected to reach USD 23.8 billion by 2033. This extraordinary growth is underpinned by the surge in AI and machine learning adoption across industries, coupled with stringent data privacy regulations that are pushing enterprises to seek synthetic data alternatives.
One of the primary growth factors for the Synthetic Dataplace market is the escalating need for high-quality, diverse, and privacy-compliant datasets to power artificial intelligence and machine learning applications. As organizations across healthcare, finance, and automotive sectors increasingly rely on data-driven insights, the limitations of traditional data—such as scarcity, bias, and privacy risks—have become more pronounced. Synthetic data, generated through advanced algorithms and generative models, offers a promising solution by providing realistic, representative, and fully anonymized datasets. This capability not only accelerates model development and testing but also ensures compliance with global data protection laws, making synthetic dataplace solutions indispensable in modern digital transformation strategies.
Another significant driver propelling the Synthetic Dataplace market is the rapid proliferation of digital technologies and the growing sophistication of cyber threats. Enterprises are recognizing the value of synthetic data in fortifying their cybersecurity postures, enabling them to simulate attack scenarios and stress-test security systems without exposing sensitive information. Additionally, the rise of cloud computing and edge technologies has amplified the need for scalable and secure data generation platforms. Synthetic dataplace solutions, with their ability to generate vast volumes of data on-demand, are increasingly being integrated into cloud architectures, facilitating seamless data sharing and collaboration while minimizing risk. This trend is particularly evident in sectors like finance and healthcare, where data sensitivity and regulatory compliance are paramount.
The Synthetic Dataplace market is also benefiting from advancements in generative AI technologies, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), which have significantly improved the fidelity and utility of synthetic data. These innovations are enabling enterprises to create highly realistic datasets that mimic complex real-world scenarios, thereby enhancing the robustness of AI models. Furthermore, the growing emphasis on ethical AI practices and the need to eliminate bias from training data are prompting organizations to adopt synthetic dataplace solutions as a means to achieve greater fairness and transparency. As a result, vendors in this market are investing heavily in R&D to develop cutting-edge synthetic data generation tools that cater to a wide range of industry-specific requirements.
The emergence of a Synthetic Tabular Data Platform is transforming how organizations approach data generation and utilization. These platforms are designed to create highly realistic tabular datasets that mimic real-world data structures, enabling businesses to conduct robust data analysis and machine learning training without compromising privacy. By leveraging advanced algorithms and statistical techniques, synthetic tabular data platforms ensure that the generated data retains the statistical properties of the original datasets, making them invaluable for industries with stringent data privacy requirements. As companies continue to navigate complex data landscapes, the adoption of synthetic tabular data platforms is expected to rise, offering a scalable and secure solution for data-driven decision-making.
From a regional perspective, North America continues to dominate the Synthetic Dataplace market, accounting for the largest share in 2024, driven by the strong presence of leading technology companies, progressive regulatory frameworks, and early adoption of AI and data-driven solutions. Europe follows closely, supported by stringent data privacy regulations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
It contains the following files:
- spider-realistic.json
# The spider-realistic evaluation set
# Examples: 508
# Databases: 19
- dev.json
# The original dev split of Spider
# Examples: 1034
# Databases: 20
- tables.json
# The original DB schemas from Spider
# Databases: 166
- README.txt
- license
The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
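Each entry in spider-realistic.json pairs a natural language question with its unchanged SQL query. A sketch of reading the file with the standard library (the inline record is a stand-in, and the field names follow the Spider format documented on the Spider GitHub page):

```python
import json

# Stand-in for the contents of spider-realistic.json.
sample = """[
  {"db_id": "concert_singer",
   "question": "How many singers do we have?",
   "query": "SELECT count(*) FROM singer"}
]"""
examples = json.loads(sample)
# In practice:
#   with open("spider-realistic.json") as f:
#       examples = json.load(f)

for ex in examples:
    print(ex["db_id"], "->", ex["query"])
```

Joining each example's `db_id` against tables.json recovers the schema needed to evaluate NL-to-schema alignment.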
This dataset is distributed under the CC BY-SA 4.0 license.
If you use the dataset, please cite the following papers including the original Spider datasets, Finegan-Dollak et al., 2018 and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.
@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}
@inproceedings{Yu&al.18c,
year = 2018,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
booktitle = {EMNLP},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
}
@InProceedings{P18-1033,
author = "Finegan-Dollak, Catherine
and Kummerfeld, Jonathan K.
and Zhang, Li
and Ramanathan, Karthik
and Sadasivam, Sesh
and Zhang, Rui
and Radev, Dragomir",
title = "Improving Text-to-SQL Evaluation Methodology",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "351--360",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1033"
}
@InProceedings{data-sql-imdb-yelp,
dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
title = {SQLizer: Query Synthesis from Natural Language},
booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
month = {October},
year = {2017},
pages = {63:1--63:26},
url = {http://doi.org/10.1145/3133887},
}
@article{data-academic,
dataset = {Academic},
author = {Fei Li and H. V. Jagadish},
title = {Constructing an Interactive Natural Language Interface for Relational Databases},
journal = {Proceedings of the VLDB Endowment},
volume = {8},
number = {1},
month = {September},
year = {2014},
pages = {73--84},
url = {http://dx.doi.org/10.14778/2735461.2735468},
}
@InProceedings{data-atis-geography-scholar,
dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
title = {Learning a Neural Semantic Parser from User Feedback},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2017},
pages = {963--973},
location = {Vancouver, Canada},
url = {http://www.aclweb.org/anthology/P17-1089},
}
@inproceedings{data-geography-original,
dataset = {Geography, original},
author = {John M. Zelle and Raymond J. Mooney},
title = {Learning to Parse Database Queries Using Inductive Logic Programming},
booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
year = {1996},
pages = {1050--1055},
location = {Portland, Oregon},
url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}
@inproceedings{data-restaurants-logic,
author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
year = {2000},
pages = {133--141},
location = {Hong Kong, China},
url = {http://www.aclweb.org/anthology/W00-1317},
}
@inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
title = {Towards a Theory of Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
year = {2003},
location = {Miami, Florida, USA},
pages = {149--157},
url = {http://doi.acm.org/10.1145/604045.604070},
}
@inproceedings{data-restaurants,
author = {Alessandra Giordani and Alessandro Moschitti},
title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
year = {2012},
location = {Montpellier, France},
pages = {59--76},
url = {https://doi.org/10.1007/978-3-642-45260-4_5},
}
The UnifiedQA benchmark consists of 20 main question answering (QA) datasets (each may have multiple versions) that target different formats as well as various complex linguistic phenomena. These datasets are grouped into several formats/categories, including: extractive QA, abstractive QA, multiple-choice QA, and yes/no QA. Additionally, contrast sets are used for several datasets (denoted with "contrast_sets_"). These evaluation sets are expert-generated perturbations that deviate from the patterns common in the original dataset. For several datasets that do not come with evidence paragraphs, two variants are included: one where the datasets are used as-is and another that uses paragraphs fetched via an information retrieval system as additional evidence, indicated with "_ir" tags.
More information can be found at: https://github.com/allenai/unifiedqa.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split; a specific sub-dataset can be selected
# with a config name (see the TFDS catalog for available configs).
ds = tfds.load('unified_qa', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
It is a widely accepted fact that evolving software systems change and grow. However, it is less well understood how change is distributed over time, specifically in object-oriented software systems. The patterns and techniques used to measure growth permit developers to identify specific releases where significant change took place, as well as to inform them of the longer-term trend in the distribution profile. This knowledge assists developers in recording systemic and substantial changes to a release, and provides useful input into a potential release retrospective. However, these analysis methods can only be applied after a mature release of the code has been developed. In order to manage the evolution of complex software systems effectively, it is important to identify change-prone classes as early as possible. Specifically, developers need to know where they can expect change, the likelihood of a change, and the magnitude of these modifications in order to take proactive steps and mitigate any potential risks arising from these changes.
Previous research into change-prone classes has identified some common aspects, with different studies suggesting that complex and large classes tend to undergo more changes, and that classes that changed recently are likely to undergo modifications in the near future. Though the guidance provided is helpful, developers need more specific guidance for it to be applicable in practice. Furthermore, the information needs to be available at a level that can help in developing tools that highlight and monitor evolution-prone parts of a system, as well as support effort estimation activities.
The specific research questions that we address in this chapter are:
(1) What is the likelihood that a class will change from a given version to the next? (a) Does this probability change over time? (b) Is this likelihood project specific, or general?
(2) How is modification frequency distributed for classes that change?
(3) What is the distribution of the magnitude of change? Are most modifications minor adjustments, or substantive modifications?
(4) Does structural complexity make a class susceptible to change?
(5) Does popularity make a class more change-prone?
We make recommendations that can help developers proactively monitor and manage change. These are derived from a statistical analysis of change in approximately 55,000 unique classes across all projects under investigation. The analysis methods that we applied took into consideration the highly skewed nature of the metric data distributions. The raw metric data (4 .txt files and 4 .log files in a .zip file measuring ~2 MB in total) is provided in comma-separated values (CSV) format, and the first line of each CSV file contains the header. A detailed output of the statistical analysis undertaken is provided as log files generated directly from Stata (statistical analysis software).
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
MMLU-Pro Dataset
The MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset, tailored to more rigorously benchmark large language models' capabilities. It contains 12K complex questions across various disciplines. |Github | 🏆Leaderboard | 📖Paper |
🚀 What's New
[2025.10.25] Posted a consolidated note on Health-category issues and minor category updates (does not change overall micro-averaged scores; may slightly affect… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the data on 144 daily smokers each rating 44 preparatory activities for quitting smoking (e.g., envisioning one's desired future self after quitting smoking, tracking one's smoking behavior, learning about progressive muscle relaxation) on their perceived ease/difficulty and required completion time. Since becoming more physically active can make it easier to quit smoking, some activities were also about becoming more physically active (e.g., tracking one's physical activity behavior, learning about what physical activity is recommended, envisioning one's desired future self after becoming more physically active). Moreover, participants provided a free-text response on what makes some activities more difficult than others.
Study
The data was gathered during a study on the online crowdsourcing platform Prolific between 6 September and 16 November 2022. The Human Research Ethics Committee of Delft University of Technology granted ethical approval for the research (Letter of Approval number: 2338).
In this study, daily smokers who were contemplating or preparing to quit smoking first filled in a prescreening questionnaire and were then invited to a repertory grid study if they passed the prescreening. In the repertory grid study, participants were asked to divide sets of 3 preparatory activities for quitting smoking into two subgroups. Afterward, they rated all preparatory activities on the perceived ease of doing them and the perceived required time to do them. Participants also provided a free-text response on what makes some activities more difficult than others.
The study was pre-registered in the Open Science Framework (OSF): https://osf.io/cax6f. This pre-registration describes the study setup, measures, etc. Note that this dataset contains only part of the collected data: the data related to studying the perceived difficulty of preparatory activities.
The file "Preparatory_Activity_Formulations.xlsx" contains the formulations of the 44 preparatory activities used in this study.
Data
This dataset contains three types of data:
- Data from participants' Prolific profiles. This includes, for example, the age, gender, weekly exercise amount, and smoking frequency.
- Data from a prescreening questionnaire. This includes, for example, the stage of change for quitting smoking and whether people previously tried to quit smoking.
- Data from the repertory grid study. This includes the ratings of the 44 activities on ease and required time as well as the free-text responses on what makes some activities more difficult than others.
Each data file is accompanied by a file that explains its columns. For example, the file "prolific_profile_data_explanation.xlsx" contains the column explanations for the data gathered from participants' Prolific profiles.
Each data file contains a column called "rand_id" that can be used to link the data from the data files.
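The linking via "rand_id" can be sketched with pandas as follows; the DataFrames below stand in for the actual data files, and the column names other than "rand_id" are hypothetical examples:

```python
import pandas as pd

# Minimal sketch of linking two of the data files on the shared
# "rand_id" column; the non-"rand_id" columns here are illustrative.
profiles = pd.DataFrame({"rand_id": [1, 2], "age": [34, 29]})
prescreening = pd.DataFrame({"rand_id": [1, 2],
                             "stage_of_change": ["contemplation", "preparation"]})

# Inner join keeps only participants present in both files.
merged = profiles.merge(prescreening, on="rand_id", how="inner")
print(merged)
```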
In the case of questions, please contact Nele Albers (n.albers@tudelft.nl) or Willem-Paul Brinkman (w.p.brinkman@tudelft.nl).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Finding communities in gene co-expression networks is a common first step toward extracting biological insight from these complex datasets. Most community detection algorithms expect genes to be organized into assortative modules, that is, groups of genes that are more associated with each other than with genes in other groups. While it is reasonable to expect that these modules exist, using methods that assume they exist a priori is risky, as it guarantees that alternative organizations of gene interactions will be ignored. Here, we ask: can we find meaningful communities without imposing a modular organization on gene co-expression networks, and how modular are these communities? For this, we use a recently developed community detection method, the weighted degree corrected stochastic block model (SBM), that does not assume that assortative modules exist. Instead, the SBM attempts to efficiently use all information contained in the co-expression network to separate the genes into hierarchically organized blocks of genes. Using RNAseq gene expression data measured in two tissues derived from an outbred population of Drosophila melanogaster, we show that (a) the SBM is able to find ten times as many groups as competing methods, that (b) several of those gene groups are not modular, and that (c) the functional enrichment for non-modular groups is as strong as for modular communities. These results show that the transcriptome is structured in more complex ways than traditionally thought and that we should revisit the long-standing assumption that modularity is the main driver of the structuring of gene co-expression networks.
GPL 3.0: https://choosealicense.com/licenses/gpl-3.0/
ChartQAR
ChartQAR is an extended version of the ChartQA dataset. It builds upon the original chart question answering task by introducing rationales and a wider variety of question types.
This dataset is designed to help models not only answer questions about charts, but also explain their reasoning and handle more complex queries such as multi-step, trend analysis, and type-based reasoning.
Question Types
The dataset covers a broad range of question categories:… See the full description on the dataset page: https://huggingface.co/datasets/YuukiAsuna/ChartQAR.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AIT Log Data Sets
This repository contains synthetic log data suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. A detailed description of the dataset is available in [1]. The logs were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.
In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps is launched against the network. Log data is collected from all hosts and includes Apache access and error logs, authentication logs, DNS logs, VPN logs, audit logs, Suricata logs, network traffic packet captures, horde logs, exim logs, syslog, and system monitoring logs. Separate ground truth files are used to label events that are related to the attacks. Compared to the AIT-LDSv1.1, a more complex network and diverse user behavior is simulated, and logs are collected from all hosts in the network. If you are only interested in network traffic analysis, we also provide the AIT-NDS containing the labeled netflows of the testbed networks. We also provide the AIT-ADS, an alert data set derived by forensically applying open-source intrusion detection systems on the log data.
The datasets in this repository have the following structure:
The gather directory contains all logs collected from the testbed. Logs collected from each host are located in gather//logs/.
The labels directory contains the ground truth of the dataset that indicates which events are related to attacks. The directory mirrors the structure of the gather directory so that each label file is located at the same path and has the same name as the corresponding log file. Each line in the label files references the log event corresponding to an attack by the line number counted from the beginning of the file ("line"), the labels assigned to the line that state the respective attack step ("labels"), and the labeling rules that assigned the labels ("rules"). An example is provided below.
The processing directory contains the source code that was used to generate the labels.
The rules directory contains the labeling rules.
The environment directory contains the source code that was used to deploy the testbed and run the simulation using the Kyoushi Testbed Environment.
The dataset.yml file specifies the start and end time of the simulation.
The following table summarizes relevant properties of the datasets:
fox
Simulation time: 2022-01-15 00:00 - 2022-01-20 00:00
Attack time: 2022-01-18 11:59 - 2022-01-18 13:15
Scan volume: High
Unpacked size: 26 GB
harrison
Simulation time: 2022-02-04 00:00 - 2022-02-09 00:00
Attack time: 2022-02-08 07:07 - 2022-02-08 08:38
Scan volume: High
Unpacked size: 27 GB
russellmitchell
Simulation time: 2022-01-21 00:00 - 2022-01-25 00:00
Attack time: 2022-01-24 03:01 - 2022-01-24 04:39
Scan volume: Low
Unpacked size: 14 GB
santos
Simulation time: 2022-01-14 00:00 - 2022-01-18 00:00
Attack time: 2022-01-17 11:15 - 2022-01-17 11:59
Scan volume: Low
Unpacked size: 17 GB
shaw
Simulation time: 2022-01-25 00:00 - 2022-01-31 00:00
Attack time: 2022-01-29 14:37 - 2022-01-29 15:21
Scan volume: Low
Data exfiltration is not visible in DNS logs
Unpacked size: 27 GB
wardbeck
Simulation time: 2022-01-19 00:00 - 2022-01-24 00:00
Attack time: 2022-01-23 12:10 - 2022-01-23 12:56
Scan volume: Low
Unpacked size: 26 GB
wheeler
Simulation time: 2022-01-26 00:00 - 2022-01-31 00:00
Attack time: 2022-01-30 07:35 - 2022-01-30 17:53
Scan volume: High
No password cracking in attack chain
Unpacked size: 30 GB
wilson
Simulation time: 2022-02-03 00:00 - 2022-02-09 00:00
Attack time: 2022-02-07 10:57 - 2022-02-07 11:49
Scan volume: High
Unpacked size: 39 GB
The following attacks are launched in the network:
Scans (nmap, WPScan, dirb)
Webshell upload (CVE-2020-24186)
Password cracking (John the Ripper)
Privilege escalation
Remote command execution
Data exfiltration (DNSteal)
Note that attack parameters and their execution orders vary in each dataset. Labeled log files are trimmed to the simulation time to ensure that their labels (which reference the related event by the line number in the file) are not misleading. Other log files, however, also contain log events generated before or after the simulation time and may therefore be affected by testbed setup or data collection. It is therefore recommended to only consider logs with timestamps within the simulation time for analysis.
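Restricting an analysis to the simulation window can be sketched as follows; the timestamps below use the fox dataset's simulation time, and the event tuples are hypothetical placeholders for parsed log records:

```python
from datetime import datetime

# Minimal sketch of trimming parsed events to the simulation window;
# the (timestamp, message) pairs are illustrative, not real log data.
start = datetime(2022, 1, 15)
end = datetime(2022, 1, 20)
events = [
    (datetime(2022, 1, 14, 23, 50), "testbed setup noise"),
    (datetime(2022, 1, 18, 12, 5), "event during simulation"),
]

# Keep only events whose timestamp falls inside the simulation time.
in_window = [e for e in events if start <= e[0] < end]
print(len(in_window))
```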
The structure of labels is explained using the audit logs from the intranet server in the russellmitchell data set as an example in the following. The first four labels in the labels/intranet_server/logs/audit/audit.log file are as follows:
{"line": 1860, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1861, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1862, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1863, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
Each JSON object in this file assigns a label to one specific log line in the corresponding log file located at gather/intranet_server/logs/audit/audit.log. The field "line" in the JSON objects specifies the line number of the respective event in the original log file, while the field "labels" comprises the corresponding labels. For example, the lines in the sample above provide the information that lines 1860-1863 in the gather/intranet_server/logs/audit/audit.log file are labeled with "attacker_change_user" and "escalate", corresponding to the attack step where the attacker receives escalated privileges. Inspecting these lines shows that they indeed correspond to the user authenticating as root:
type=USER_AUTH msg=audit(1642999060.603:2226): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=USER_ACCT msg=audit(1642999060.603:2227): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:accounting acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=CRED_ACQ msg=audit(1642999060.615:2228): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:setcred acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=USER_START msg=audit(1642999060.627:2229): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:session_open acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
The same applies to all other labels for this log file and all other log files. There are no labels for logs generated by "normal" (i.e., non-attack) behavior; instead, all log events that have no corresponding JSON object in one of the files from the labels directory, such as the lines 1-1859 in the example above, can be considered to be labeled as "normal". This means that in order to figure out the labels for the log data it is necessary to store the line numbers when processing the original logs from the gather directory and see if these line numbers also appear in the corresponding file in the labels directory.
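The join-by-line-number procedure described above can be sketched as follows; the log lines and label records are toy placeholders standing in for a file from gather/ and its counterpart from labels/:

```python
import json

# Minimal sketch of attaching labels to raw log lines by line number.
# Placeholder data; real records come from the gather/ and labels/ dirs.
log_lines = ["normal event", "normal event", "su: session opened for root"]
label_file = '{"line": 3, "labels": ["attacker_change_user", "escalate"], "rules": {}}'
label_records = [json.loads(label_file)]

# Index the label records by their "line" field.
labels_by_line = {rec["line"]: rec["labels"] for rec in label_records}

# Any line without an entry in the labels file is considered "normal".
labeled = [(event, labels_by_line.get(lineno, ["normal"]))
           for lineno, event in enumerate(log_lines, start=1)]
print(labeled[-1])
```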
Besides the attack labels, an overview of the exact times when specific attack steps are launched is available in gather/attacker_0/logs/attacks.log. An enumeration of all hosts and their IP addresses is stated in processing/config/servers.yml. Moreover, configurations of each host are provided in gather//configs/ and gather//facts.json.
Version history:
AIT-LDS-v1.x: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.
AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).
If you use the dataset, please cite the following publications:
[1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner, M. Wurzenberger, and A. Rauber. "Maintainable Log Datasets for Evaluation of Intrusion Detection Systems". IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 3466-3482, doi: 10.1109/TDSC.2022.3201582. [PDF]
[2] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317. [PDF]
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data were acquired from a small simulated environment consisting of one Windows host (host data collection) and a router that observes all network traffic passing to the host. Two attack scenarios were performed in this environment, and data relevant to these attacks were extracted and further processed. In the first scenario, the attack was based on a vulnerability in the Drupal web application, which enabled downloading and running malicious code that provided a remote shell to the attacker. In the second scenario, the attack was based on an old version of the Samba file-sharing service that was vulnerable to the EternalBlue attack, allowing the attacker to execute commands and obtain a remote shell. The dataset is divided into separate directories according to the attacks contained. For the Drupal vulnerability scenario, datasets from a failed and a successful attempt to exploit the vulnerability are included. Four datasets were created during the individual phases of the SMB file-sharing vulnerability scenario. Each directory contains a normalized network traffic capture and corresponding host data in preformatted JSON.
Drupal Vulnerability Scenario
The attack scenario is based on an old Drupal server (v 8.5.0) with the known vulnerability CVE-2018-7600 (also called Drupalgeddon). This vulnerability is exploited by an attacker to remotely run code and gain access to the vulnerable server via a remote shell. This connection is realized by a Meterpreter trojan of type python/meterpreter/reverse_tcp. The binary is created by the Metasploit generator msfvenom and obfuscated using the attacker's custom obfuscation technique to bypass Windows antivirus. The created binary file is delivered to the victim host using remote code execution in Drupal: the "finger" command is executed to download the payload from the payload delivery and C2 server.
This trojan is then launched by the attacker using additional commands injected through the Drupal vulnerability. Once launched, it automatically establishes a connection with the attacker (remote shell) through the payload delivery and C2 server. As a result, the attacker gains full access to the system and can execute arbitrary commands (in the scenario, only the "whoami" command is executed). Two datasets were generated during the scenario and its preparation. The first was obtained during the preparatory work, when the server's defense mechanisms blocked the attacker's attempt to download the file (the command "MpCmdRun.exe" was used instead of the "finger" command). The second dataset contains a complete attack performed after modifying the executed commands to overcome the mentioned defense mechanisms.
Samba File Sharing Vulnerability Scenario
The attack scenario is based on an unpatched Windows 7 host with the known vulnerability CVE-2017-0144 (also called EternalBlue). The scenario is divided into four parts covering the individual phases of the attack and failed exploitation attempts. In the first part, the attacker performs a scan of open ports on the client device and verifies whether the SMB file-sharing service is vulnerable to the EternalBlue attack. In the next phase, the attacker unsuccessfully tries to exploit the vulnerability using a standard Metasploit module; this procedure does not result in a remote connection. In the third phase, a specialized exploit is used to attack the service using previously known credentials. In the fourth phase, the attacker tries another script, enabling the attack to be performed without credentials and making the scenario more complex. For each phase, a separate dataset was generated, capturing all events in the form of packet traces and corresponding host data.
Dataset Features
For packet capture, the dataset contains standard PCAP files with all captured packets, including the complete application layer. The raw host data were reduced to contain only the following attributes:
- event_id - unique identifier of the event, assigned by a preprocessor
- event_type - the type of the event
- time_created - time when the sensor recorded the event
- event_data - event type-specific payload
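Reading one of the reduced host events can be sketched as follows; the record below is a fabricated illustration of the four attributes, not an actual event from the dataset, and the exact file layout may differ:

```python
import json

# Minimal sketch of parsing a reduced host event; the record is
# illustrative and only mirrors the four documented attributes.
raw = ('{"event_id": 1, "event_type": "USER_AUTH", '
       '"time_created": "2022-01-24T03:05:00", '
       '"event_data": {"acct": "jhall"}}')
event = json.loads(raw)

# Access the event type and its type-specific payload.
print(event["event_type"], event["event_data"]["acct"])
```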
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
By [source]
The WikiTableQuestions dataset poses complex questions about the contents of semi-structured Wikipedia tables. Beyond merely testing a model's knowledge retrieval capabilities, these questions require an understanding of both the natural language used and the structure of the table itself in order to provide a correct answer. This makes the dataset an excellent testing ground for AI models that aim to replicate or exceed human-level intelligence.
In order to use the WikiTableQuestions dataset, you will need to first understand its structure. The dataset comprises two types of files: questions and answers. The questions are in natural language, and are designed to test a model's ability to understand the table structure, understand the natural language question, and reason about the answer. The answers are in a list format, and provide additional information about each table that can be used to answer the questions.
To start working with the WikiTableQuestions dataset, you will need to download both the questions and answers files. Once you have downloaded both files, you can begin working with the dataset by loading it into a pandas dataframe. From there, you can begin exploring the data and developing your own models for answering the questions.
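Loading one of the table files into a pandas DataFrame can be sketched as follows; the inline CSV stands in for one of the dataset's table files (e.g. "0.csv"), whose actual columns vary per table:

```python
import io
import pandas as pd

# Minimal sketch using an inline CSV in place of a downloaded table
# file; in practice you would pass a path such as "0.csv" instead.
csv_text = "Rank,Name,Year\n1,Alpha,1999\n2,Beta,2003\n"
table = pd.read_csv(io.StringIO(csv_text))

# Inspect the table structure before building a question-answering model.
print(table.shape)
print(table.columns.tolist())
```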
Happy Kaggling!
The WikiTableQuestions dataset can be used to train a model to answer complex questions about semi-structured Wikipedia tables.
The WikiTableQuestions dataset can be used to train a model to understand the structure of semi-structured Wikipedia tables.
The WikiTableQuestions dataset can be used to train a model to understand the natural language questions and reason about the answers
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: 0.csv
File: 1.csv
File: 10.csv
File: 11.csv
File: 12.csv
File: 14.csv
File: 15.csv
File: 17.csv
File: 18.csv
If you use this dataset in your research, please credit the original authors.