81 datasets found

Data from: Multi-Source Distributed System Data for AI-powered Analytics
zenodo.org
explore.openaire.eu
+1more
zip
Updated Nov 10, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao; Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao (2022). Multi-Source Distributed System Data for AI-powered Analytics [Dataset]. http://doi.org/10.5281/zenodo.3549604
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3549604
Dataset updated
Nov 10, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao; Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract:

In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems.
The major contributions have been materialized in the form of novel algorithms.
Typically, researchers took the challenge of exploring one specific type of observability data sources, such as application logs, metrics, and distributed traces, to create new algorithms.
Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms that have better performance.
Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research.
Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment, statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.

General Information:

This repository contains the simple scripts for data statistics, and link to the multi-source distributed system dataset.

You may find details of this dataset from the original paper:

Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Kumar Mandapati, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics".

If you use the data, implementation, or any details of the paper, please cite!

BIBTEX:

_

@inproceedings{nedelkoski2020multi, title={Multi-source Distributed System Data for AI-Powered Analytics}, author={Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay Kumar and Becker, Soeren and Cardoso, Jorge and Kao, Odej}, booktitle={European Conference on Service-Oriented and Cloud Computing}, pages={161--176}, year={2020}, organization={Springer} }

_

The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced from running a complex distributed system (Openstack). In addition, we also provide the workload and fault scripts together with the Rally report which can serve as ground truth. We provide two datasets, which differ on how the workload is executed. The sequential_data is generated via executing workload of sequential user requests. The concurrent_data is generated via executing workload of concurrent user requests.

The raw logs in both datasets contain the same files. If the user wants the logs filetered by time with respect to the two datasets, should refer to the timestamps at the metrics (they provide the time window). In addition, we suggest to use the provided aggregated time ranged logs for both datasets in CSV format.

Important: The logs and the metrics are synchronized with respect time and they are both recorded on CEST (central european standard time). The traces are on UTC (Coordinated Universal Time -2 hours). They should be synchronized if the user develops multimodal methods. Please read the IMPORTANT_experiment_start_end.txt file before working with the data.

Our GitHub repository with the code for the workloads and scripts for basic analysis can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/
p
AI-Driven Mental Health Literacy - An Interventional Study from India (Data...
psycharchives.org
Updated Oct 2, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). AI-Driven Mental Health Literacy - An Interventional Study from India (Data from main study).csv [Dataset]. https://psycharchives.org/handle/20.500.12034/8771
Explore at:
Dataset updated
Oct 2, 2023
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Area covered
India
Description
The dataset is from an Indian study which made use of ChatGPT- a natural language processing model by OpenAI to design a mental health literacy intervention for college students. Prompt engineering tactics were used to formulate prompts that acted as anchors in the conversations with the AI agent regarding mental health. An intervention lasting for 20 days was designed with sessions of 15-20 minutes on alternative days. Fifty-one students completed pre-test and post-test measures of mental health literacy, mental help-seeking attitude, stigma, mental health self-efficacy, positive and negative experiences, and flourishing in the main study, which were then analyzed using paired t-tests. The results suggest that the intervention is effective among college students as statistically significant changes were noted in mental health literacy and mental health self-efficacy scores. The study affirms the practicality, acceptance, and initial indications of AI-driven methods in advancing mental health literacy and suggests the promising prospects of innovative platforms such as ChatGPT within the field of applied positive psychology.: Data used in analysis for the intervention study
Mental Health Chatbot Pairs
kaggle.com
Updated Nov 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). Mental Health Chatbot Pairs [Dataset]. https://www.kaggle.com/datasets/thedevastator/mental-health-chatbot-pairs
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 27, 2023
Dataset provided by
Kaggle
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Mental Health Chatbot Pairs

AI-based Tailored Support for Mental Health Conversation

By Huggingface Hub [source]

About this dataset

This dataset contains a compilation of carefully-crafted Q&A pairs which are designed to provide AI-based tailored support for mental health. These carefully chosen questions and answers offer an avenue for those looking for help to gain the assistance they need. With these pre-processed conversations, Artificial Intelligence (AI) solutions can be developed and deployed to better understand and respond appropriately to individual needs based on their input. This comprehensive dataset is crafted by experts in the mental health field, providing insightful content that will further research in this growing area. These data points will be invaluable for developing the next generation of personalized AI-based mental health chatbots capable of truly understanding what people need

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

This dataset contains pre-processed Q&A pairs for AI-based tailored support for mental health. As such, it represents an excellent starting point in building a conversational model which can handle conversations about mental health issues. Here are some tips on how to use this dataset to its fullest potential:

Understand your data: Spend time getting to know the text of the conversation between the user and the chatbot and familiarize yourself with what type of questions and answers are included in this specific dataset. This will help you better formulate queries for your own conversational model or develop new ones you can add yourself.

Refine your language processing models: By studying the patterns in syntax, grammar, tone, voice, etc., within this conversational data set you can hone your natural language processing capabilities - such as keyword extractions or entity extraction – prior to implementing them into a larger bot system .

Test assumptions: Have an idea of what you think may work best with a particular audience or context? See if these assumptions pan out by applying different variations of text to this dataset to see if it works before rolling out changes across other channels or programs that utilize AI/chatbot services

Research & Analyze Results : After testing out different scenarios on real-world users by using various forms of q&a within this chatbot pair data set , analyze & record any relevant results pertaining towards understanding user behavior better through further analysis after being exposed to tailored texted conversations about Mental Health topics both passively & actively . The more information you collect here , leads us closer towards creating effective AI powered conversations that bring our desired outcomes from our customer base .

Research Ideas

Developing a chatbot for personalized mental health advice and guidance tailored to individuals' unique needs, experiences, and struggles.

Creating an AI-driven diagnostic system that can interpret mental health conversations and provide targeted recommendations for interventions or treatments based on clinical expertise.

Designing an AI-powered recommendation engine to suggest relevant content such as articles, videos, or podcasts based on users’ questions or topics of discussion during their conversation with the chatbot

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv | Column name | Description | |:--------------|:------------------------------------------------------------------------| | text | The text of the conversation between the user and the chatbot. (String) |

Acknowledgements

If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.
A
‘School Dataset’ analyzed by Analyst-2
analyst-2.ai
Updated Feb 13, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘School Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-school-dataset-3c70/2a80983f/?iid=004-128&v=presentation
Explore at:
Dataset updated
Feb 13, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘School Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/smeilisa07/number of school teacher student class on 13 February 2022.

--- Dataset description provided by original source is as follows ---

Context

This is my first analyst data. This dataset i got from open data Jakarta website (http://data.jakarta.go.id/), so mostly the dataset is in Indonesian. But i have try describe it that you can find it on VARIABLE DESCRIPTION.txt file.

Content

The title of this dataset is jumlah-sekolah-guru-murid-dan-ruang-kelas-menurut-jenis-sekolah-2011-2016, with type is CSV, so you can easily access it. If you not understand, the title means the number of school, teacher, student, and classroom according to the type of school 2011 - 2016. I think, if you just read from the title, you can imagine the contents. So this dataset have 50 observations and 8 variables, taken from 2011 until 2016.

In general, this dataset is about the quality of education in Jakarta, which each year some of school level always decreasing and some is increase, but not significant.

Acknowledgements

This dataset comes from Indonesian education authorities, which is already established in the CSV file by Open Data Jakarta.

Inspiration

Althought this data given from Open Data Jakarta publicly, i want always continue to improve my Data Scientist skill, especially in R programming, because i think R programming is easy to learn and really help me to be always curious about Data Scientist. So, this dataset that I am still struggle with below problem, and i need solution.

Question :

How can i cleaning this dataset ? I have try cleaning this dataset, but i still not sure. You can check on
my_hypothesis.txt file, when i try cleaning and visualize this dataset.

How can i specify the model for machine learning ? What recommended steps i should take ?

How should i cluster my dataset, if i want the label is not number but tingkat_sekolah for every tahun and
jenis_sekolah ? You can check on my_hypothesis.txt file.

--- Original source retains full ownership of the source dataset ---

XAI-FUNGI: Dataset from the user study on comprehensibility of XAI...

zenodo.org
data.niaid.nih.gov

csv, pdf, zip

Updated Mar 7, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Szymon Bobek; Szymon Bobek; Paloma Korycińska; Paloma Korycińska; Monika Krakowska; Monika Krakowska; Maciej Mozolewski; Maciej Mozolewski; Dorota Rak; Dorota Rak; Magdalena Zych; Magdalena Zych; Magdalena Wójcik; Magdalena Wójcik; Grzegorz J. Nalepa; Grzegorz J. Nalepa (2025). XAI-FUNGI: Dataset from the user study on comprehensibility of XAI algorithms [Dataset]. http://doi.org/10.5281/zenodo.14980793

Explore at:

csv, zip, pdfAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.14980793

Dataset updated

Mar 7, 2025

Dataset provided by

Zenodohttp://zenodo.org/

Authors

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

XAI-FUNGI: Dataset from the user study on comprehensibility of XAI algorithms

We present the dataset which was created during a user study on evaluation of explainability of artificial intelligence (AI) at the Jagielloninan University as a collaborative work of computer science (GEIST team) and information sciences research groups. The main goal of the research was to explore effective explanations of AI model patterns to diverse audiences.

The dataset contains material collected from 39 participants during the interviews conducted by the Information Sciences research group. The participants were recruited from 149 candidates to form three groups that represented domain experts in the field of mycology (DE), students with data science and visualization background (IT) and students from social sciences and humanities (SSH). Each group was given an explanation of a machine learning model trained to predict edible and non-edible mushrooms and asked to interpret the explanations and answer various questions during the interview. The machine learning model and explanations for its decision were prepared by the computer science research team.

The resulting dataset was constructed from the surveys obtained from the candidates, anonymized transcripts of the interviews, the results from thematic analysis, and original explanations with modifications suggested by the participants. The dataset is complemented with the source code allowing one to reproduce the initial machine leaning model and explanations.

The general structure of the dataset is described in the following table. The files that contain in their names [RR]_[SS]_[NN] contain the individual results obtained from particular participant. The meaning of the prefix is as follows:

RR - initials of the researcher conducting the interview,
SS - type of the participant (DE for domain expert, SSH for social sciences and humanities students, or IT for computer science students),
NN - number of the participant

File	Description
SURVEY.csv	The results from a survey that was filled by 149 participants out of which 39 were selected to form a final group of particiapnts.
SURVEY_en.csv	Content of the SURVEY translated into English.
CODEBOOK.csv	The codebook used in thematic analysis and MAXQDA coding
QUESTIONS.csv	List of questions that the participants were asked during interviews.
SLIDES.csv	List of slides used in the study with their interpretation and reference to MAXQDA themes and VISUAL_MODIFICATIONS tables.
MAXQDA_SUMMARY.csv	Summary of thematic analysis performed with codes used in CODEBOOK for each participant
PROBLEMS.csv	List of problems that participants were asked to solve during interviews. They correspond to three instances from the dataset that the participants had to classify using knowledge gained from explanations.
PROBLEMS_en.csv	Content of the PROBLEMS file translated into English.
PROBLEMS_RESPONSES.csv	The responses to the problems for each participant to the problems listed in PROBLEMS.csv
VISUALIZATION_MODIFICATIONS.csv	Information on how the order of the slides was modified by the participant, which slides (explanations) were removed, and what kind of additional explanation was suggested.
ORIGINAL_VISUZALIZATIONS.pdf	The PDF file containing the visualization of explanations presented to the participants during the interviews
ORIGINAL_VISUZALIZATIONS_EN.pdf	Content of the ORIGINAL_VISUZALIZATIONS translated into English.
VISUALIZATION_MODIFICATIONS.zip	The PDF file containing the original slides from ORIGINAL_VISUZALIZATIONS.pdf with the modifications suggested by the participant. Each file is a PDF file named with the participant ID, i.e. [RR]_[SS]_[NN].pdf
TRANSCRIPTS.zip	The anonymized transcripts of interviews for each given participant, zipped into one archive. Each transcript is named after the particiapnt ID, i.e. [RR]_[SS]_[NN].csv and contains text tagged with slide number that it related to, question number from QUESTIONS.csv, and problem number from PROBLEMS.csv.

The detailed structure of the files presented in the previous Table is given in the Technical info section.

The source code used to train ML model and to generate explanations is available on Gitlab

o
AI Question Answering Data
opendatabay.com
.undefined
Updated Jul 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Datasimple (2025). AI Question Answering Data [Dataset]. https://www.opendatabay.com/data/ai-ml/d3c37fed-f830-444b-a988-c893d3396fd7
Explore at:
.undefinedAvailable download formats
Dataset updated
Jul 5, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Data Science and Analytics
Description
This dataset provides essential information for entries related to question answering tasks using AI models. It is designed to offer valuable insights for researchers and practitioners, enabling them to effectively train and rigorously evaluate their machine learning models. The dataset serves as a valuable resource for building and assessing question-answering systems. It is available free of charge.

Columns

instruction: Contains the specific instructions given to a model to generate a response.

responses: Includes the responses generated by the model based on the given instructions.

next_response: Provides the subsequent response from the model, following a previous response, which facilitates a conversational interaction.

answer: Lists the correct answer for each question presented in the instruction, acting as a reference for assessing the model's accuracy.

is_human_response: A boolean column that indicates whether a particular response was created by a human or by a machine learning model, helping to differentiate between the two. Out of nearly 19,300 entries, 254 are human-generated responses, while 18,974 were generated by models.

Distribution

The data files are typically in CSV format, with a dedicated train.csv file for training data and a test.csv file for testing purposes. The training file contains a large number of examples. Specific dates are not included within this dataset description, focusing solely on providing accurate and informative details about its content and purpose. Specific numbers for rows or records are not detailed in the available information.

Usage

This dataset is ideal for a variety of applications and use cases: * Training and Testing: Utilise train.csv to train question-answering models or algorithms, and test.csv to evaluate their performance on unseen questions. * Machine Learning Model Creation: Develop machine learning models specifically for question-answering by leveraging the instructional components, including instructions, responses, next responses, and human-generated answers, along with their is_human_response labels. * Model Performance Evaluation: Assess model performance by comparing predicted responses with actual human-generated answers from the test.csv file. * Data Augmentation: Expand existing data by paraphrasing instructions or generating alternative responses within similar contexts. * Conversational Agents: Build conversational agents or chatbots by utilising the instruction-response pairs for training. * Language Understanding: Train models to understand language and generate responses based on instructions and previous responses. * Educational Materials: Develop interactive quizzes or study guides, with models providing instant feedback to students. * Information Retrieval Systems: Create systems that help users find specific answers from large datasets. * Customer Support: Train customer support chatbots to provide quick and accurate responses to inquiries. * Language Generation Research: Develop novel algorithms for generating coherent responses in question-answering scenarios. * Automatic Summarisation Systems: Train systems to generate concise summaries by understanding main content through question answering. * Dialogue Systems Evaluation: Use the instruction-response pairs as a benchmark for evaluating dialogue system performance. * NLP Algorithm Benchmarking: Establish baselines against which other NLP tools and methods can be measured.

Coverage

The dataset's geographic scope is global. There is no specific time range or demographic scope noted within the available details, as specific dates are not included.

License

CC0

Who Can Use It

This dataset is highly suitable for: * Researchers and Practitioners: To gain insights into question answering tasks using AI models. * Developers: To train models, create chatbots, and build conversational agents. * Students: For developing educational materials and enhancing their learning experience through interactive tools. * Individuals and teams working on Natural Language Processing (NLP) projects. * Those creating information retrieval systems or customer support solutions. * Experts in natural language generation (NLG) and automatic summarisation systems. * Anyone involved in the evaluation of dialogue systems and machine learning model training.

Dataset Name Suggestions

AI Question Answering Data

Conversational AI Training Data

NLP Question-Answering Dataset

Model Evaluation QA Data

Dialogue Response Dataset

Attributes

Original Data Source: Question-Answering Training and Testing Data
T
Impact of AI in Education Processes
dataverse.tdl.org
Updated Feb 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Saksham Adhikari; Saksham Adhikari (2024). Impact of AI in Education Processes [Dataset]. http://doi.org/10.18738/T8/RXUCHK
Explore at:
application/x-ipynb+json(428065), pptx(80640), tsv(7079)Available download formats
Unique identifier
https://doi.org/10.18738/T8/RXUCHK
Dataset updated
Feb 20, 2024
Dataset provided by
Texas Data Repository
Authors
Saksham Adhikari; Saksham Adhikari
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
We did data analysis on a open dataset which contained responses regarding a survey about how useful students find AI in the educational process. We cleaned the data, preprocessed and then did analysis on it. We did an EDA (Exploratory Data Analysis) on the dataset and visualized the results and our findings. Then we interpreted the findings into our digital poster.
Data from: IA Tweets Analysis Dataset (Spanish)
zenodo.org
data.niaid.nih.gov
csv
Updated Aug 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gabriel Guerrero-Contreras; Gabriel Guerrero-Contreras; Sara Balderas-Díaz; Sara Balderas-Díaz; Alejandro Serrano-Fernández; Andrés Muñoz; Andrés Muñoz; Alejandro Serrano-Fernández (2024). IA Tweets Analysis Dataset (Spanish) [Dataset]. http://doi.org/10.5281/zenodo.10821485
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10821485
Dataset updated
Aug 3, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gabriel Guerrero-Contreras; Gabriel Guerrero-Contreras; Sara Balderas-Díaz; Sara Balderas-Díaz; Alejandro Serrano-Fernández; Andrés Muñoz; Andrés Muñoz; Alejandro Serrano-Fernández
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
General Description

This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.

Data Collection Method

Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.

Dataset Content

ID: A unique identifier for each tweet.

text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.

polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).

favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.

retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.

user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.

user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.

user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.

user_followers_count: The current number of followers the account has. It is a non-negative integer.

user_friends_count: The number of users that the account is following. It is a non-negative integer.

user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.

user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.

user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.

user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.

Cite as

Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.

Potential Use Cases

This dataset is aimed at academic researchers and practitioners with interests in:

Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.

Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.

Exploring correlations between user engagement metrics and sentiment in discussions about AI.

Data Format and File Type

The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.

License

The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.
o
NLP Expert QA Dataset
opendatabay.com
.undefined
Updated Jul 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Datasimple (2025). NLP Expert QA Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/c030902d-7b02-48a2-b32f-8f7140dd1de7
Explore at:
.undefinedAvailable download formats
Dataset updated
Jul 7, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Data Science and Analytics
Description
This dataset, QASPER: NLP Questions and Evidence, is an exceptional collection of over 5,000 questions and answers focused on Natural Language Processing (NLP) papers. It has been crowdsourced from experienced NLP practitioners, with each question meticulously crafted based solely on the titles and abstracts of the respective papers. The answers provided are expertly enriched with evidence taken directly from the full text of each paper. QASPER features structured fields including 'qas' for questions and answers, 'evidence' for supporting information, paper titles, abstracts, figures and tables, and full text. This makes it a valuable resource for researchers aiming to understand how practitioners interpret NLP topics and to validate solutions for problems found in existing literature. The dataset contains 5,049 questions spanning 1,585 distinct papers.

Columns

title: The title of the paper. (String)

abstract: A summary of the paper. (String)

full_text: The full text of the paper. (String)

qas: Questions and answers about the paper. (Object)

figures_and_tables: Figures and tables from the paper. (Object)

id: Unique identifier for the paper.

Distribution

The QASPER dataset comprises 5,049 questions across 1,585 papers. It is distributed across five files in .csv format, with one additional .json file for figures and tables. These include two test datasets (test.csv and validation.csv), two train datasets (train-v2-0_lessons_only_.csv and trainv2-0_unsplit.csv), and a figures dataset (figures_and_tables_.json). Each CSV file contains distinct datasets with columns dedicated to titles, abstracts, full texts, and Q&A fields, along with evidence for each paper mentioned in the respective rows.

Usage

This dataset is ideal for various applications, including: * Developing AI models to automatically generate questions and answers from paper titles and abstracts. * Enhancing machine learning algorithms by combining answers with evidence to discover relationships between papers. * Creating online forums for NLP practitioners, using dataset questions to spark discussion within the community. * Conducting basic descriptive statistics or advanced predictive analytics, such as logistic regression or naive Bayes models. * Summarising basic crosstabs between any two variables, like titles and abstracts. * Correlating title lengths with the number of words in their corresponding abstracts to identify patterns. * Utilising text mining technologies like topic modelling, machine learning techniques, or automated processes to summarise underlying patterns. * Filtering terms relevant to specific research hypotheses and processing them via web crawlers, search engines, or document similarity algorithms.

Coverage

The dataset has a GLOBAL region scope. It focuses on papers within the field of Natural Language Processing. The questions and answers are crowdsourced from experienced NLP practitioners. The dataset was listed on 22/06/2025.

License

CC0

Who Can Use It

This dataset is highly suitable for: * Researchers seeking insights into how NLP practitioners interpret complex topics. * Those requiring effective validation for developing clear-cut solutions to problems encountered in existing NLP literature. * NLP practitioners looking for a resource to stimulate discussions within their community. * Data scientists and analysts interested in exploring NLP datasets through descriptive statistics or advanced predictive analytics. * Developers and researchers working with text mining, machine learning techniques, or automated text processing.

Dataset Name Suggestions

NLP Expert QA Dataset

QASPER: NLP Paper Questions and Evidence

Academic NLP Q&A Corpus

Natural Language Processing Research Questions

Attributes

Original Data Source: QASPER: NLP Questions and Evidence
Company Datasets for Business Profiling
datarade.ai
Updated Feb 23, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
Explore at:
.json, .xml, .csv, .xlsAvailable download formats
Dataset updated
Feb 23, 2017
Dataset authored and provided by
Oxylabs
Area covered
Canada, Bangladesh, Isle of Man, Tunisia, British Indian Ocean Territory, Andorra, Northern Mariana Islands, Moldova (Republic of), Nepal, Taiwan
Description
Company Datasets for valuable business insights!

Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

Owler: Gain valuable business insights and competitive intelligence. -AngelList: Receive fresh startup data transformed into actionable insights. -CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies. -Craft.co: Make data-informed business decisions with Craft.co's company datasets. -Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

Company name;

Size;

Founding date;

Location;

Industry;

Revenue;

Employee count;

Competitors.

You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

With Oxylabs Datasets, you can count on:

Fresh and accurate data collected and parsed by our expert web scraping team.

Time and resource savings, allowing you to focus on data analysis and achieving your business goals.

A customized approach tailored to your specific business needs.

Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

Pricing Options:

Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

Experience a seamless journey with Oxylabs:

Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.

Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.

Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.

Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!
Art Dataset for Interactive System & Sensor Data
kaggle.com
Updated Jun 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ziya (2025). Art Dataset for Interactive System & Sensor Data [Dataset]. https://www.kaggle.com/datasets/ziya07/art-dataset-for-interactive-system-and-sensor-data/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 13, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ziya
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is designed to support research and experimentation in the intersection of art, technology, and interactive systems. It contains data generated from images belonging to five distinct art styles: Drawings and Watercolors, Paintings, Sculptures, Graphic Art, and Iconography (Russian Art). Each entry includes a unique Art ID, the corresponding art style, sensor readings (simulating environmental or system data), system status (indicating the state of the system at the time of interaction), interaction count (representing user interactions), and timestamps for each event or action.

The dataset is intended for use in analyzing the relationship between art styles and system interactions in embedded environments. It can be used for training machine learning models, exploring system optimization techniques, or developing creative technologies that merge artistic expression with digital interaction. The synthetic nature of the data allows for a wide range of exploratory tasks, including classification, anomaly detection, and time-series analysis, and is well-suited for applications in AI-driven creative industries.

Dataset Contents: Images: The dataset includes approximately 9,000 images of artwork across five categories:

Drawings and Watercolors Paintings Sculptures Graphic Art Iconography (Russian Art) These images are sourced from various online repositories and cover diverse styles and artistic expressions.

CSV File: A corresponding CSV file, art_data.csv, is provided, containing the following columns:

Art ID: A unique identifier for each artwork. Art Style: The category of the artwork (e.g., Drawings and Watercolors, Paintings, Sculptures, Graphic Art, Iconography). Sensor Reading: Numeric values representing sensor data (e.g., environmental or system measurements). System Status: The current state of the system (e.g., Active, Idle, Processed, or Error). Interaction Count: The number of interactions or views of the image. Timestamp: The timestamp indicating when the interaction or event occurred. The CSV file can be used for training, analysis, and developing machine learning models for interactive art systems, while the image dataset provides the visual content necessary for studying art in a digital context.
u
Data from: IA Tweets Analysis Dataset (Spanish)
produccioncientifica.uca.es
Updated 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés; Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés (2024). IA Tweets Analysis Dataset (Spanish) [Dataset]. https://produccioncientifica.uca.es/documentos/67321e53aea56d4af04854c2
Explore at:
Dataset updated
2024
Authors
Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés; Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés
Description
Cite as

Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.

General Description

This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.

Data Collection Method

Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.

Dataset Content

ID: A unique identifier for each tweet.

text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.

polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).

favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.

retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.

user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.

user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.

user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.

user_followers_count: The current number of followers the account has. It is a non-negative integer.

user_friends_count: The number of users that the account is following. It is a non-negative integer.

user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.

user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.

user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.

user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.

Potential Use Cases

This dataset is aimed at academic researchers and practitioners with interests in:

Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.

Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.

Exploring correlations between user engagement metrics and sentiment in discussions about AI.

Data Format and File Type

The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.

License

The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.
c
Fashion category dataset from SSENSE
crawlfeeds.com
csv, zip
Updated Jun 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2025). Fashion category dataset from SSENSE [Dataset]. https://crawlfeeds.com/datasets/fashion-category-dataset-from-ssense
Explore at:
zip, csvAvailable download formats
Dataset updated
Jun 7, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Description
The SSENSE Fashion Dataset provides a curated and detailed snapshot of high-end fashion products listed on the SSENSE platform. With 78,000+ product records, this dataset spans a wide range of categories including apparel, footwear, and accessories from global luxury and streetwear brands.

Each entry includes:

Product title, brand, price, and currency

Availability and formatted pricing

SKU, item ID, and unique identifiers

Descriptions, gender tags, and high-quality image links

Crawl timestamp for freshness and tracking

Delivered as a ZIP-compressed CSV, this dataset is perfect for fashion tech startups, trend analysis, pricing research, or building AI models trained on real eCommerce data.
c
Ulta Beauty Ingredient-Focused Product Dataset (CSV)
crawlfeeds.com
csv, zip
Updated Jun 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2025). Ulta Beauty Ingredient-Focused Product Dataset (CSV) [Dataset]. https://crawlfeeds.com/datasets/ulta-beauty-ingredient-focused-product-dataset-csv
Explore at:
zip, csvAvailable download formats
Dataset updated
Jun 30, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Description
This dataset features only products from Ulta.com that include detailed ingredient lists, ideal for product transparency tools, clean label research, and beauty data modeling.

Designed for professionals and researchers working in beauty tech, compliance, formulation, and product analysis, it focuses on ingredient-rich listings for advanced use cases.

What’s Inside:

Product Name

Brand

Full Ingredient List

Category (e.g., Hair, Skin, Makeup)

Product URL

Price (if available)

Description

Images

Date Extracted

Best For:

Clean beauty app builders

Ingredient risk assessment and allergen tracking

Comparative cosmetic formulation

Beauty AI and ML dataset training

Ingredient transparency dashboards for e-commerce

Updates:

Available weekly or monthly or on request

Network traffic datasets created by Single Flow Time Series Analysis

zenodo.org
data.niaid.nih.gov

csv, pdf

Updated Jul 11, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Josef Koumar; Josef Koumar; Karel Hynek; Karel Hynek; Tomáš Čejka; Tomáš Čejka (2024). Network traffic datasets created by Single Flow Time Series Analysis [Dataset]. http://doi.org/10.5281/zenodo.8035724

Explore at:

csv, pdfAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.8035724

Dataset updated

Jul 11, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Josef Koumar; Josef Koumar; Karel Hynek; Karel Hynek; Tomáš Čejka; Tomáš Čejka

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Network traffic datasets created by Single Flow Time Series Analysis

Datasets were created for the paper: Network Traffic Classification based on Single Flow Time Series Analysis -- Josef Koumar, Karel Hynek, Tomáš Čejka -- which was published at The 19th International Conference on Network and Service Management (CNSM) 2023. Please cite usage of our datasets as:

J. Koumar, K. Hynek and T. Čejka, "Network Traffic Classification Based on Single Flow Time Series Analysis," 2023 19th International Conference on Network and Service Management (CNSM), Niagara Falls, ON, Canada, 2023, pp. 1-7, doi: 10.23919/CNSM59352.2023.10327876.

This Zenodo repository contains 23 datasets created from 15 well-known published datasets which are cited in the table below. Each dataset contains 69 features created by Time Series Analysis of Single Flow Time Series. The detailed description of features from datasets is in the file: feature_description.pdf

In the following table is a description of each dataset file:

File name	Detection problem	Citation of original raw dataset
botnet_binary.csv	Binary detection of botnet	S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
botnet_multiclass.csv	Multi-class classification of botnet	S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014.
cryptomining_design.csv	Binary detection of cryptomining; the design part	Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022
cryptomining_evaluation.csv	Binary detection of cryptomining; the evaluation part	Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022
dns_malware.csv	Binary detection of malware DNS	Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021.
doh_cic.csv	Binary detection of DoH	Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020
doh_real_world.csv	Binary detection of DoH	Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022
dos.csv	Binary detection of DoS	Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019.
edge_iiot_binary.csv	Binary detection of IoT malware	Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022.
edge_iiot_multiclass.csv	Multi-class classification of IoT malware	Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022.
https_brute_force.csv	Binary detection of HTTPS Brute Force	Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020
ids_cic_binary.csv	Binary detection of intrusion in IDS	Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018.
ids_cic_multiclass.csv	Multi-class classification of intrusion in IDS	Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018.
ids_unsw_nb_15_binary.csv	Binary detection of intrusion in IDS	Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015.
ids_unsw_nb_15_multiclass.csv	Multi-class classification of intrusion in IDS	Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015.
iot_23.csv	Binary detection of IoT malware	Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details here https://www.stratosphereips.org /datasets-iot23
ton_iot_binary.csv	Binary detection of IoT malware	Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021
ton_iot_multiclass.csv	Multi-class classification of IoT malware	Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021
tor_binary.csv	Binary detection of TOR	Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017.
tor_multiclass.csv	Multi-class classification of TOR	Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017.
vpn_iscx_binary.csv	Binary detection of VPN	Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016.
vpn_iscx_multiclass.csv	Multi-class classification of VPN	Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016.
vpn_vnat_binary.csv	Binary detection of VPN	Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022
vpn_vnat_multiclass.csv	Multi-class classification of VPN	Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022

Pokemon TCG Pocket Dataset

kaggle.com

Updated Jun 26, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

JoaoCoelho03 (2025). Pokemon TCG Pocket Dataset [Dataset]. https://www.kaggle.com/datasets/joaocoelho03/pocket-tcg-dataset

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jun 26, 2025

Dataset provided by

Kaggle

Authors

JoaoCoelho03

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Pokémon TCG Pocket Card Dataset

This dataset contains detailed information about all cards available in the Pokémon Trading Card Game Pocket mobile app. The data has been carefully curated and cleaned to provide Pokémon enthusiasts and developers with accurate and comprehensive card information.

Dataset Contents

8+ Complete Sets: All major card sets including latest expansions
1000+ Cards: Every card with detailed metadata and classifications
Clean Format: CSV format optimized for analysis, machine learning, and research

Key Features

🃏 Complete Card Data

Card names and numbers with proper formatting
Complete set and pack organization structure
Release dates for all sets and expansions
Total card counts per set for completion tracking

💎 Rarity Classifications

7+ Rarity Types including:
- Common, Uncommon, Rare
- Ultra Rare, Secret Rare, Special Art Rare
- Crown Rare and other premium classifications
Includes shiny and special variant cards
Standardized rarity naming conventions

Use Cases

📊 Data Analysis & Research

Card rarity distribution analysis across sets
Set completion and collection tracking

🤖 Machine Learning & AI

Card classification models
Recommendation systems for collectors
Rarity prediction algorithms
Collection optimization models

📈 Visualization & Dashboards

Interactive card browsers
Collection progress tracking
Rarity distribution charts
Set release timeline visualizations

Data Quality

✅ Manually Verified: All card information cross-checked for accuracy
✅ Standardized Format: Consistent naming and classification across all entries
✅ Complete Coverage: All available cards from the mobile game
✅ Clean Structure: Optimized for both human readability and machine processing

Technical Specifications

📋 File Format

Format: CSV (Comma Separated Values)
Encoding: UTF-8 with full international character support
Delimiter: Comma (,)
Headers: Included in first row

🗂️ Column Structure (9 columns)

Column	Description	Example
`set_name`	Full name of the card set	"Eevee Grove"
`set_code`	Official set identifier	"a3b"
`set_release_date`	Set release date	"June 26, 2025"
`set_total_cards`	Total cards in the set	107
`pack_name`	Name of the specific pack	"Eevee Grove"
`card_name`	Full card name	"Leafeon"
`card_number`	Card number within set	"2"
`card_rarity`	Rarity classification	"Rare"
`card_type`	Card type category	"Pokémon"

If you find this dataset useful, consider giving it an upvote — it really helps others discover it too! 🔼😊

Happy analyzing! 🎯📊

Meta Kaggle Code
kaggle.com
zip
Updated Jul 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
Explore at:
zip(148301844275 bytes)Available download formats
Dataset updated
Jul 10, 2025
Dataset authored and provided by
Kagglehttp://kaggle.com/
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Explore our public notebook content!

Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebooks versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

Why we’re releasing this dataset

By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

Sensitive data

While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

Joining with Meta Kaggle

The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

File organization

The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.

The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

Questions / Comments

We love feedback! Let us know in the Discussion tab.

Happy Kaggling!
f
Data Sheet 1_Harnessing machine learning and AI-driven analytics to identify...
frontiersin.figshare.com
csv
Updated Feb 19, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaojia Qin; Biyu Deng; Dan Mo; Zhengyou Zhang; Xuan Wei; Zhougui Ling (2025). Data Sheet 1_Harnessing machine learning and AI-driven analytics to identify novel drug targets and predict chemotherapy efficacy in NSCLC.csv [Dataset]. http://doi.org/10.3389/fphar.2025.1555040.s002
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.3389/fphar.2025.1555040.s002
Dataset updated
Feb 19, 2025
Dataset provided by
Frontiers
Authors
Shaojia Qin; Biyu Deng; Dan Mo; Zhengyou Zhang; Xuan Wei; Zhougui Ling
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IntroductionNon-small cell lung cancer (NSCLC) constitutes the majority of lung cancer cases and exhibits marked heterogeneity in both clinical presentation and molecular profiles, leading to variable responses to chemotherapy. Emerging evidence suggests that mitochondria-derived RNAs (mtRNAs) may serve as novel biomarkers, although their role in predicting chemotherapy outcomes remains to be fully explored.MethodsIn this study, peripheral blood mononuclear cells were obtained from NSCLC patients for analysis of the mtRNA ratio (mt_tRNA-Tyr-GTA_5_end to mt_tRNA-Phe-GAA), while thoracic CT images were processed to derive an AI-driven BiomedGPT variable. Although individual clinical factors (Sex, Age, History_of_smoking, Pathological_type, Stage) offered limited predictive power when used in isolation, their integration into a random forest model improved sensitivity in the training set, albeit with reduced generalizability in the validation cohort. The subsequent integration of the BiomedGPT score and mtRNA ratio significantly enhanced predictive performance across both training and validation datasets.ResultsAn all-inclusive model combining clinical data, AI-derived variables, and mtRNA biomarkers produced a risk score capable of discriminating patients into high- and low-risk groups for progression-free survival and overall survival, with statistically significant differences observed between these groups.DiscussionThese findings highlight the potential of integrating mtRNA biomarkers with advanced AI methods to refine therapeutic decision-making in NSCLC, underscoring the importance of combining diverse data sources in precision oncology.
c
Cult Beauty Products with Ingredients Dataset (CSV)
crawlfeeds.com
csv, zip
Updated Jun 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2025). Cult Beauty Products with Ingredients Dataset (CSV) [Dataset]. https://crawlfeeds.com/datasets/cult-beauty-products-with-ingredients-dataset-csv
Explore at:
csv, zipAvailable download formats
Dataset updated
Jun 30, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Description
This curated dataset contains only products from CultBeauty.com that include detailed ingredient information, ideal for brands, formulators, analysts, and researchers seeking transparency in cosmetics and skincare data.

It focuses on ingredient-rich listings — allowing deep analysis of formulation trends, compliance mapping, and clean beauty initiatives. Whether you're building an internal database or powering an AI model, this dataset offers a clean, structured foundation for insight.

What’s Included:

Product Name

Brand

Full Ingredient List

Category

Product URL

Price (if available)

Description

Image links

Timestamps

Use Cases:

Ingredient analysis for clean beauty scoring

Competitor formulation comparison

Cosmetic safety mapping (e.g., for allergen research)

Building training sets for AI/ML models in skincare

Trend monitoring across skincare and cosmetic products

Update Frequency:

Monthly or on demand
m
MQTTEEB-D: A Real-World IoT Cybersecurity Dataset for AI-Powered Threat...
data.mendeley.com
Updated Mar 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ABDERRAHMANE AQACHTOUL (2025). MQTTEEB-D: A Real-World IoT Cybersecurity Dataset for AI-Powered Threat Detection in MQTT Networks [Dataset]. http://doi.org/10.17632/jfttfjn6tr.1
Explore at:
Unique identifier
https://doi.org/10.17632/jfttfjn6tr.1
Dataset updated
Mar 20, 2025
Authors
ABDERRAHMANE AQACHTOUL
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset accompanies the research article on MQTTEEB-D and is intended for public use in cybersecurity research. The MQTTEEB-D dataset is a practical real-world data set for intrusion detection improvement in Message Queuing Telemetry Transport (MQTT)-based Internet of Things (IoT) networks. In contrast to already existing datasets that are constructed on simulated network traffic, MQTTEEB-D is obtained from a real-time IoT deployment at the International University of Rabat (UIR), Morocco. Using MySignals IoT health sensors, Raspberry Pi 4, and an MQTT broker server, this dataset represents the actual complexity of the active IoT communication process, which synthetic data fails to offer. To narrow the gap between simulated and real-world attack scenarios, various cyberattacks including Denial of Service (DoS), Slow DoS against Internet of Things Environments (SlowITe), Malformed Data Injection, Brute Force, and MQTT publish flooding were carried out in real-time, permitting close monitoring of network traffic anomalies. The data was captured using Python wrapper for tshark (PyShark) and organized into multiple Comma-Separated Values (CSV) files. To ensure high data quality, we performed pre-processing steps, such as outlier removal, normalization, standardization, and class balance. Several processed forms (raw, cleaned, normalized, standardized, Synthetic Minority Over-sampling Technique (SMOTE)) applied for this dataset are provided, along with detailed metadata to facilitate ease of use in cybersecurity research. This dataset provides an opportunity for researchers to develop and validate intrusion detection models in a real-world MQTT environment - a critical ingredient in Artificial Intelligence (AI)-driven cybersecurity solutions for IoT networks. The dataset will support future research IoT security and anomaly detection domains.

Facebook

Twitter

Click to copy link

Link copied

Cite

Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao; Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao (2022). Multi-Source Distributed System Data for AI-powered Analytics [Dataset]. http://doi.org/10.5281/zenodo.3549604

Data from: Multi-Source Distributed System Data for AI-powered Analytics

Explore at:

3 scholarly articles cite this dataset (View in Google Scholar)

zipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.3549604

Dataset updated

Nov 10, 2022

Dataset provided by

Zenodohttp://zenodo.org/

Authors

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Abstract:

In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems.
The major contributions have been materialized in the form of novel algorithms.
Typically, researchers took the challenge of exploring one specific type of observability data sources, such as application logs, metrics, and distributed traces, to create new algorithms.
Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms that have better performance.
Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research.
Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment, statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.

General Information:

This repository contains the simple scripts for data statistics, and link to the multi-source distributed system dataset.

You may find details of this dataset from the original paper:

Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Kumar Mandapati, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics".

If you use the data, implementation, or any details of the paper, please cite!

BIBTEX:

@inproceedings{nedelkoski2020multi,
 title={Multi-source Distributed System Data for AI-Powered Analytics},
 author={Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay Kumar and Becker, Soeren and Cardoso, Jorge and Kao, Odej},
 booktitle={European Conference on Service-Oriented and Cloud Computing},
 pages={161--176},
 year={2020},
 organization={Springer}
}

The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced from running a complex distributed system (Openstack). In addition, we also provide the workload and fault scripts together with the Rally report which can serve as ground truth. We provide two datasets, which differ on how the workload is executed. The sequential_data is generated via executing workload of sequential user requests. The concurrent_data is generated via executing workload of concurrent user requests.

The raw logs in both datasets contain the same files. If the user wants the logs filetered by time with respect to the two datasets, should refer to the timestamps at the metrics (they provide the time window). In addition, we suggest to use the provided aggregated time ranged logs for both datasets in CSV format.

Important: The logs and the metrics are synchronized with respect time and they are both recorded on CEST (central european standard time). The traces are on UTC (Coordinated Universal Time -2 hours). They should be synchronized if the user develops multimodal methods. Please read the IMPORTANT_experiment_start_end.txt file before working with the data.

Our GitHub repository with the code for the workloads and scripts for basic analysis can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/

Clear search

Close search

Google apps

Main menu

Data from: Multi-Source Distributed System Data for AI-powered Analytics

AI-Driven Mental Health Literacy - An Interventional Study from India (Data...

Mental Health Chatbot Pairs

Mental Health Chatbot Pairs

AI-based Tailored Support for Mental Health Conversation

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

Acknowledgements

‘School Dataset’ analyzed by Analyst-2

Context

Content

Acknowledgements

Inspiration

XAI-FUNGI: Dataset from the user study on comprehensibility of XAI...

XAI-FUNGI: Dataset from the user study on comprehensibility of XAI algorithms

AI Question Answering Data

Columns

Distribution

Usage

Coverage

License

Who Can Use It

Dataset Name Suggestions

Attributes

Impact of AI in Education Processes

Data from: IA Tweets Analysis Dataset (Spanish)

General Description

Data Collection Method

Dataset Content

Cite as

Potential Use Cases

Data Format and File Type

License

NLP Expert QA Dataset

Columns

Distribution

Usage

Coverage

License

Who Can Use It

Dataset Name Suggestions

Attributes

Company Datasets for Business Profiling

Art Dataset for Interactive System & Sensor Data

Data from: IA Tweets Analysis Dataset (Spanish)

Fashion category dataset from SSENSE

Ulta Beauty Ingredient-Focused Product Dataset (CSV)

What’s Inside:

Best For:

Updates:

Network traffic datasets created by Single Flow Time Series Analysis

Pokemon TCG Pocket Dataset

Pokémon TCG Pocket Card Dataset

Dataset Contents

Key Features

🃏 Complete Card Data

💎 Rarity Classifications

Use Cases

📊 Data Analysis & Research

🤖 Machine Learning & AI

📈 Visualization & Dashboards

Data Quality

Technical Specifications

📋 File Format

🗂️ Column Structure (9 columns)

Meta Kaggle Code

Explore our public notebook content!

Why we’re releasing this dataset

Sensitive data

Joining with Meta Kaggle

File organization

Questions / Comments

Data Sheet 1_Harnessing machine learning and AI-driven analytics to identify...

Cult Beauty Products with Ingredients Dataset (CSV)

Data from: Multi-Source Distributed System Data for AI-powered Analytics