100+ datasets found
  1. AI Vs Human Text

    • kaggle.com
    Updated Jan 10, 2024
    Cite
    Shayan Gerami (2024). AI Vs Human Text [Dataset]. https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 10, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shayan Gerami
    Description

    Around 500K essays are available in this dataset, both AI-generated and human-written.

    I gathered the data from multiple sources, combined them, and removed the duplicates.
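    The combine-and-deduplicate step described above can be sketched as follows. This is an illustrative sketch, not the author's pipeline; the (text, label) record shape and the normalization rule are assumptions, since the source does not document the schema.

    ```python
    # Hypothetical sketch of merging several essay sources and dropping
    # duplicate texts, as described above. Record shape (text, label) is assumed.

    def merge_and_dedupe(*sources):
        """Combine lists of (text, label) pairs and drop exact duplicate texts."""
        seen = set()
        merged = []
        for source in sources:
            for text, label in source:
                key = text.strip().lower()  # normalize before comparing
                if key not in seen:
                    seen.add(key)
                    merged.append((text, label))
        return merged

    human = [("The sky is blue.", 0), ("Cats purr.", 0)]
    ai = [("The sky is blue.", 1), ("LLMs generate text.", 1)]

    combined = merge_and_dedupe(human, ai)
    print(len(combined))  # 3 unique essays remain
    ```

    Note that first occurrence wins here, so a text present in both pools keeps the label of whichever source is listed first.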

  2. Phishing Websites Dataset

    • kaggle.com
    zip
    Updated Mar 23, 2024
    + more versions
    Cite
    Arnav Samal (2024). Phishing Websites Dataset [Dataset]. https://www.kaggle.com/datasets/arnavs19/phishing-websites-dataset
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 23, 2024
    Authors
    Arnav Samal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These data consist of a collection of legitimate as well as phishing website instances. Each website is represented by a set of features that denotes whether it is legitimate or not, and the data can serve as input for a machine-learning process.

    Two variants of the Phishing Dataset are presented here:

    1. Full variant - dataset_full.csv

      • Total number of instances: 88,647
      • Number of legitimate website instances (labeled as 0): 58,000
      • Number of phishing website instances (labeled as 1): 30,647
      • Total number of features: 111
    2. Small variant - dataset_small.csv

      • Total number of instances: 58,645
      • Number of legitimate website instances (labeled as 0): 27,998
      • Number of phishing website instances (labeled as 1): 30,647
      • Total number of features: 111
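    The class balance listed above can be verified after downloading either CSV variant. A minimal sketch, assuming the label column is named "phishing" with 0 = legitimate and 1 = phishing (the column name is an assumption; check the actual CSV header):

    ```python
    # Hypothetical sketch of checking the label distribution of the
    # phishing dataset. A tiny in-memory stand-in replaces the real
    # dataset_full.csv / dataset_small.csv; the "phishing" column name
    # is assumed, not confirmed by the source.
    import csv
    import io
    from collections import Counter

    sample_csv = io.StringIO(
        "qty_dot_url,phishing\n"  # feature column name also illustrative
        "3,0\n"
        "1,0\n"
        "7,1\n"
    )

    reader = csv.DictReader(sample_csv)
    counts = Counter(row["phishing"] for row in reader)
    print(counts)  # Counter({'0': 2, '1': 1})
    ```

    Against the real dataset_full.csv, the same count should come out to 58,000 legitimate and 30,647 phishing instances if the file matches the numbers above.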
  3. ‘Popular Website Traffic Over Time’ analyzed by Analyst-2

    • analyst-2.ai
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘Popular Website Traffic Over Time ’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-popular-website-traffic-over-time-62e4/62549059/?iid=003-357&v=presentation
    Explore at:
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Popular Website Traffic Over Time ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/popular-website-traffice on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    Background

    Have you ever been in a conversation where the question comes up: who uses Bing? This question comes up occasionally because people wonder whether these sites get any views. For this research study, we are going to explore website traffic for many popular websites.

    Methodology

    The data collected originates from SimilarWeb.com.

    Source

    For the analysis and study, go to The Concept Center

    This dataset was created by Chase Willden and contains around 0 samples along with 1/1/2017, Social Media, technical information and other features such as: - 12/1/2016 - 3/1/2017 - and more.

    How to use this dataset

    • Analyze 11/1/2016 in relation to 2/1/2017
    • Study the influence of 4/1/2017 on 1/1/2017
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit Chase Willden

    Start A New Notebook!

    --- Original source retains full ownership of the source dataset ---

  4. ‘Kaggle Datasets Ranking’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Kaggle Datasets Ranking’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-kaggle-datasets-ranking-2744/64eafea2/?iid=003-771&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Kaggle Datasets Ranking’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vivovinco/kaggle-datasets-ranking on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This dataset contains Kaggle ranking of datasets.

    Content

    More than 800 rows and 8 columns. Column descriptions are listed below.

    • Rank : Rank of the user
    • Tier : Grandmaster, Master or Expert
    • Username : Name of the user
    • Join Date : Year the user joined
    • Gold Medals : Number of gold medals
    • Silver Medals : Number of silver medals
    • Bronze Medals : Number of bronze medals
    • Points : Total points

    Acknowledgements

    Data from Kaggle. Image from The Guardian.

    If you're reading this, please upvote.

    --- Original source retains full ownership of the source dataset ---

  5. Community-Driven Model Service Platform Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 9, 2025
    + more versions
    Cite
    Market Report Analytics (2025). Community-Driven Model Service Platform Report [Dataset]. https://www.marketreportanalytics.com/reports/community-driven-model-service-platform-73124
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Apr 9, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Community-Driven Model Service Platform market is experiencing robust growth, projected to reach $35.14 billion in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 10.1% from 2025 to 2033. This expansion is fueled by several key drivers. The increasing availability of open-source machine learning models and datasets, facilitated by platforms like Kaggle, GitHub, and Hugging Face, significantly lowers the barrier to entry for developers and researchers. Furthermore, the growing demand for specialized, niche models that cater to specific industry needs is propelling market growth. The rise of cloud-based solutions offers scalability and cost-effectiveness, attracting a wider user base.

    The market segmentation reveals strong growth in both adult and children's applications, reflecting the broadening use cases for these platforms across various age demographics. Cloud-based platforms dominate the market share due to their flexibility and accessibility, though on-premise solutions retain a significant presence in enterprises prioritizing data security and control. Geographic distribution showcases a strong presence in North America and Europe, driven by established tech ecosystems and high adoption rates of AI technologies. However, rapid growth is anticipated in the Asia-Pacific region, particularly in countries like China and India, due to increasing digitalization and investment in AI infrastructure.

    The market's restraints are primarily focused on the need for robust data governance and ethical considerations surrounding AI model development and deployment. Concerns about data bias, model explainability, and potential misuse of AI models need to be addressed to ensure responsible growth. Furthermore, the skill gap in AI development and deployment presents a challenge, requiring substantial investment in education and training initiatives to support the expanding market. Competition among platform providers is intense, necessitating continuous innovation and the development of unique value propositions to maintain market share.

    Despite these challenges, the long-term outlook for the Community-Driven Model Service Platform market remains exceptionally positive, driven by sustained technological advancements and the increasing reliance on AI across diverse industries. The platform's potential to foster collaboration and democratize access to powerful machine learning tools is poised to drive considerable market expansion in the coming years.

  6. arxiv-kaggle

    • zenodo.org
    json
    Updated Jul 7, 2025
    Cite
    Brian Maltzan (2025). arxiv-kaggle [Dataset]. http://doi.org/10.5281/zenodo.15808027
    Explore at:
    Available download formats: json
    Dataset updated
    Jul 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Brian Maltzan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About Dataset

    This is version 239. The following is a blurb taken from the Kaggle website where this dataset originates:

    About ArXiv

    For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

    In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

    Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

    The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!

    ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.

    The release of this dataset was featured further in a Kaggle blog post here.

  7. Q&A sites

    • kaggle.com
    Updated Nov 14, 2023
    Cite
    TianBaojie (2023). Q&A sites [Dataset]. https://www.kaggle.com/datasets/tianbaojie/q-and-a-sites
    Explore at:
    Croissant
    Dataset updated
    Nov 14, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    TianBaojie
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    In recent years, social Q&A sites have developed rapidly, like Zhihu and Quora, with hundreds of millions of users, providing a convenient platform for users to ask questions and share knowledge. Most users share knowledge through real-name answers, but this sometimes hinders knowledge sharing, such as employees sharing company salaries and students sharing inside information about the laboratory. Revealing the author's true identity, in this case, can cause significant harm to the author. In order to tackle the above problems, social question-and-answer websites provide users with an anonymous answer function, which replaces the real author id of the answer with an anonymous id. Countless users who cannot disclose their identities use anonymous answers to share valuable knowledge.

    Although there are countless anonymous answers in the Q&A community, few anonymization techniques have been applied. The two super-large Q&A sites, Quora and Zhihu, use two anonymization technologies: hiding the author's information and protecting the storage of anonymous user information. However, anonymous answers, questions, comments, and topics and their topological structure contain distinct attributes unknown to most users, providing valuable features for de-anonymization attacks. Although the question-and-answer websites warn anonymous users that specific personal information and language styles in answers may lead to privacy leaks, the question-and-answer community cannot give the probability or cause of a privacy leak for a specific anonymous answer.

    In this paper, we propose a novel task, the de-anonymization of the Q&A websites, which refers to recovering the identity information of the real author of the anonymous answer. This task aims to evaluate the risk of privacy leakage of a specific anonymous answer in the question-and-answer websites and explain why the answer is vulnerable to de-anonymization.

    To explore the effectiveness of various methodologies, we employ web scraping techniques on public answers from online platforms Zhihu and Quora. The first step involves the selection of seed users related to the ten popular topics. We selected one seed user from each topic. This step can ensure that the collected Q&A community dataset encompasses a diverse range of popular topics. In the second step, we recursively crawl the social relationships between users based on the ten seed users crawled in the first step. In order to make the crawled user pool more widely distributed, we only crawled the first 100 following users for each user until the crawled user pool reached 700,000. Step three is to ascertain the users to be ultimately crawled. A community discovery algorithm is employed to identify a community with the highest transitivity. This community must have a population exceeding 5,000. As a result, we will crawl all users in this community. The fourth step involves extracting the data related to all users within the chosen community. To mitigate the issue of excessive data volume, this article sets a constraint on the number of answers collected by individual users during the crawling process. This upper limit ensures that all answers of 95% of users are crawled. The crawled data consists of user homepage information, relationships between users, user-generated questions and answers, and associated comments and topics. Each question includes its title and the name of the person who asked it. Each answer contains the author's name, time of submission, the content of the answer, and any first-level comments. Additionally, each comment includes its author, submission time, and content.
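    The community-selection criterion in step three (among discovered communities above a size threshold, pick the one with the highest transitivity) can be sketched as below. This is an illustrative reimplementation, not the authors' code; the toy graph, the candidate communities, and the reduced size threshold (3 instead of the paper's 5,000) are all placeholders.

    ```python
    # Illustrative sketch of step three above: among candidate communities,
    # keep those exceeding a minimum size and pick the one whose induced
    # subgraph has the highest transitivity. Toy data throughout.
    from itertools import combinations

    def transitivity(adj):
        """Fraction of connected triads (length-2 paths) that are closed."""
        triads = closed = 0
        for node, nbrs in adj.items():
            for u, v in combinations(nbrs, 2):
                triads += 1
                if v in adj[u]:
                    closed += 1
        return closed / triads if triads else 0.0

    def subgraph(adj, nodes):
        nodes = set(nodes)
        return {n: adj[n] & nodes for n in nodes}

    # Toy undirected follower graph and two candidate communities.
    adj = {
        "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
        "d": {"c", "e"}, "e": {"d", "f"}, "f": {"e"},
    }
    communities = [{"a", "b", "c"}, {"d", "e", "f"}]
    min_size = 3  # stands in for the 5,000-member threshold

    best = max(
        (c for c in communities if len(c) >= min_size),
        key=lambda c: transitivity(subgraph(adj, c)),
    )
    print(sorted(best))  # ['a', 'b', 'c']: the triangle has transitivity 1.0
    ```

    The triangle {a, b, c} wins because every length-2 path inside it closes into a triangle, while the chain {d, e, f} has transitivity 0.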

    Here is an example code to read the dataset https://www.kaggle.com/tianbaojie/example-code.

  8. ‘CENSORED WEB-SITES BY ALL COUNTRIES’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘CENSORED WEB-SITES BY ALL COUNTRIES’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-censored-web-sites-by-all-countries-a1aa/2679f88d/?iid=006-696&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘CENSORED WEB-SITES BY ALL COUNTRIES’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/brsdincer/censored-websites-by-all-countries on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    CENSORED WEB-SITES BY ALL COUNTRIES

    Sites that were or are currently banned.

    This data was created by each country's own users.

    • Some of the sites listed may have since become active again.

    --- Original source retains full ownership of the source dataset ---

  9. buds-lab/building-data-genome-project-2: v1.0

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Sep 2, 2020
    Cite
    Clayton Miller; Anjukan Kathirgamanathan; Bianca Picchetti; Pandarasamy Arjunan; June Young Park; Zoltan Nagy; Paul Raftery; Brodie W. Hobson; Zixiao Shi; Forrest Meggers (2020). buds-lab/building-data-genome-project-2: v1.0 [Dataset]. http://doi.org/10.5281/zenodo.3887306
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 2, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Clayton Miller; Anjukan Kathirgamanathan; Bianca Picchetti; Pandarasamy Arjunan; June Young Park; Zoltan Nagy; Paul Raftery; Brodie W. Hobson; Zixiao Shi; Forrest Meggers
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The BDG2 open data set consists of 3,053 energy meters from 1,636 non-residential buildings with a range of two full years (2016 and 2017) at an hourly frequency (17,544 measurements per meter resulting in approximately 53.6 million measurements). These meters are collected from 19 sites across North America and Europe, and they measure electrical, heating and cooling water, steam, and solar energy as well as water and irrigation meters. Part of these data was used in the Great Energy Predictor III (GEPIII) competition hosted by the ASHRAE organization in October-December 2019. This subset includes data from 2,380 meters from 1,448 buildings that were used in the GEPIII, a machine learning competition for long-term prediction with an application to measurement and verification. This paper describes the process of data collection, cleaning, and convergence of time-series meter data, the meta-data about the buildings, and complementary weather data. This data set can be used for further prediction benchmarking and prototyping as well as anomaly detection, energy analysis, and building type classification.
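    The per-meter and total counts quoted above follow directly from two full calendar years at hourly frequency (2016 was a leap year), and can be checked in a couple of lines:

    ```python
    # Sanity-check the measurement counts quoted above: two full years
    # (2016 is a leap year) at hourly frequency, across 3,053 meters.
    hours_per_meter = (366 + 365) * 24   # 2016 + 2017, hourly readings
    total = hours_per_meter * 3053       # all BDG2 meters

    print(hours_per_meter)  # 17544
    print(total)            # 53561832, i.e. roughly 53.6 million
    ```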

  10. Kaggle-post-and-comments-question-answer-topic

    • huggingface.co
    Cite
    Duverne Mathieu, Kaggle-post-and-comments-question-answer-topic [Dataset]. https://huggingface.co/datasets/Raaxx/Kaggle-post-and-comments-question-answer-topic
    Explore at:
    Croissant
    Authors
    Duverne Mathieu
    Description

    This is a dataset containing 10,000 posts from Kaggle and 60,000 comments related to those posts in the question-answer topic.

      Data Fields

      kaggle_post

    • 'pseudo': the question's author.
    • 'title': title of the post.
    • 'question': the question's body.
    • 'vote': voting on Kaggle is similar to liking.
    • 'medal': I will share with you the Kaggle medal system, which can be found at https://www.kaggle.com/progression. The system awards medals to users based on… See the full description on the dataset page: https://huggingface.co/datasets/Raaxx/Kaggle-post-and-comments-question-answer-topic.

  11. kaggle-mbti

    • huggingface.co
    Updated Jul 24, 2024
    Cite
    Jing Jie Tan (2024). kaggle-mbti [Dataset]. http://doi.org/10.57967/hf/3955
    Explore at:
    Croissant
    Dataset updated
    Jul 24, 2024
    Authors
    Jing Jie Tan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Personality Dataset

    Essays: https://huggingface.co/datasets/jingjietan/essays-big5
    MBTI: https://huggingface.co/datasets/jingjietan/kaggle-mbti
    Pandora: https://huggingface.co/datasets/jingjietan/pandora-big5

    Please contact jingjietan.com for another dataset.

    Cite:

    @software{jingjietan-apr-dataset,
      author = {Jing Jie, Tan},
      title = {Personality Kaggle Dataset Splitting},
      url = {https://huggingface.co/datasets/jingjietan/kaggle-mbti},
      version = {1.0.0},
      year = {2024}
    }

  12. FSDKaggle2018

    • zenodo.org
    • opendatalab.com
    • +1 more
    zip
    Updated Jan 24, 2020
    + more versions
    Cite
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra (2020). FSDKaggle2018 [Dataset]. http://doi.org/10.5281/zenodo.2552860
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra
    Description

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.

    Citation

    If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    About this dataset

    Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.

    The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.

    All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.

    The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:

    "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

    Some other relevant characteristics of FSDKaggle2018:

    • The dataset is split into a train set and a test set.

    • The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.

    • Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.

    • Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems.

    • The test set is composed of 1.6k samples with manually-verified annotations and with a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.

    • All audio samples in this dataset have a single label (i.e. are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.
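    Since the flags in train.csv let participants separate the verified and non-verified annotation pools, a minimal sketch of that split might look like the following. The column names (fname, label, manually_verified) are assumed to match the competition's train.csv; a tiny in-memory stand-in replaces the real file.

    ```python
    # Minimal sketch: split train.csv rows into verified and non-verified
    # pools using the manually_verified flag. Column names are assumptions
    # based on the description above; verify against the real train.csv.
    import csv
    import io

    train_csv = io.StringIO(
        "fname,label,manually_verified\n"
        "00044347.wav,Hi-hat,1\n"
        "001ca53d.wav,Saxophone,0\n"
        "002d256b.wav,Trumpet,1\n"
    )

    verified, unverified = [], []
    for row in csv.DictReader(train_csv):
        pool = verified if row["manually_verified"] == "1" else unverified
        pool.append(row["fname"])

    print(len(verified), len(unverified))  # 2 1
    ```

    Against the real train set, the two pools should come out to roughly 3.7k verified and 5.8k non-verified clips if the counts above hold.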

    Data labeling process

    The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.

    Then, a data validation process was carried out in which a number of participants did listen to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.

    Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 are annotations that have been manually validated as present and predominant (some with inter-annotator agreement but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.

    The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically out of the set of the 41 categories, but in a few cases they could be within.

    More details about the data labeling process can be found in [3].

    License

    FSDKaggle2018 has licenses at two different levels, as explained next.

    All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of the audio clips included in FSDKaggle2018 and their corresponding license. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.

    In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.

    Files

    FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2018.audio_train/                      Audio clips in the train set
    │
    └───FSDKaggle2018.audio_test/                       Audio clips in the test set
    │
    └───FSDKaggle2018.meta/                             Files for evaluation setup
    │   │
    │   └───train_post_competition.csv                  Data split and ground truth for the train set
    │   │
    │   └───test_post_competition_scoring_clips.csv     Ground truth for the test set
    │
    └───FSDKaggle2018.doc/
        │
        └───README.md                                   The dataset description file you are reading
        │
        └───LICENSE-DATASET

  13. 42156 English Website

    • kaggle.com
    Updated Jul 10, 2018
    Cite
    Ida Sörndal (2018). 42156 English Website [Dataset]. https://www.kaggle.com/idasorndal/42156-english-website/code
    Explore at:
    Croissant
    Dataset updated
    Jul 10, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ida Sörndal
    Description

    About sites

    All the sites were obtained from different sources.

  14. The Test-Case Dataset

    • kaggle.com
    Updated Nov 29, 2020
    Cite
    sapal6 (2020). The Test-Case Dataset [Dataset]. https://www.kaggle.com/datasets/sapal6/the-testcase-dataset/code
    Dataset updated
    Nov 29, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    sapal6
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    There are lots of datasets available for different machine learning tasks like NLP, computer vision, etc. However, I couldn't find any dataset that catered to the domain of software testing. This is one area with a lot of potential for applying machine learning techniques, especially deep learning.

    This was the reason I wanted such a dataset to exist. So, I made one.

    Content

    New version [28th Nov '20] - Uploaded testing-related questions and details from Stack Overflow. These are query results collected using Stack Overflow's query viewer; the result set contains posts that include the words "testing web pages".

    New version [27th Nov '20] - Created a CSV file containing pairs of test case titles and test case descriptions.

    This dataset is very tiny (approximately 200 rows of data). I have collected sample test cases from around the web and compiled them into a single text file. The text file has sections, and under each section there are numbered rows of test cases.

    Acknowledgements

    I would like to thank websites like guru99.com, softwaretestinghelp.com and many other such websites which host a great many sample test cases. These were the sources for the test cases in this dataset.

    Inspiration

    My inspiration to create this dataset was the scarcity of examples showcasing the application of machine learning to the domain of software testing. I would like to see if this dataset can be used to answer questions similar to the following:

    • Finding semantic similarity between different test cases ranging across products and applications.
    • Automating the elimination of duplicate test cases in a test case repository.
    • Can a recommendation system be built for suggesting domain-specific test cases to software testers?

  15. LLM prompts in the context of machine learning

    • kaggle.com
    Updated Jul 1, 2024
    Jordan Nelson (2024). LLM prompts in the context of machine learning [Dataset]. https://www.kaggle.com/datasets/jordanln/llm-prompts-in-the-context-of-machine-learning
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Kaggle
    Authors
    Jordan Nelson
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is an extension of my previous work on creating a dataset for natural language processing tasks. It leverages binary representation to characterise various machine learning models. The attributes in the dataset are derived from a dictionary, which was constructed from a corpus of prompts typically provided to a large language model (LLM). These prompts reference specific machine learning algorithms and their implementations. For instance, consider a user asking an LLM or a generative AI to create a Multi-Layer Perceptron (MLP) model for a particular application. By applying this concept to multiple machine learning models, we constructed our corpus. This corpus was then transformed into the current dataset using a bag-of-words approach.

    In this dataset, each attribute corresponds to a word from our dictionary, represented as a binary value: 1 indicates the presence of the word in a given prompt, and 0 indicates its absence. At the end of each entry, there is a label. Each entry in the dataset pertains to a single class, where each class represents a distinct machine learning model or algorithm. This dataset is intended for multi-class classification tasks, not multi-label classification, as each entry is associated with only one label and does not belong to multiple labels simultaneously.

    This dataset has been utilised with a Convolutional Neural Network (CNN) using the Keras AutoModel API, achieving training and testing accuracy rates exceeding 97%. Post-training, the model's predictive performance was evaluated in a production environment, where it continued to demonstrate high accuracy. For this evaluation, we employed a series of questions, which are listed below. These questions were intentionally designed to be similar, to ensure that the model can effectively distinguish between different machine learning models even when the prompts are closely related.
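    The binary bag-of-words encoding described above can be sketched in a few lines. This is a minimal illustration only: the vocabulary and prompts below are hypothetical stand-ins, not the dataset's actual dictionary or corpus.

    ```python
    # Sketch of a binary bag-of-words encoding: each attribute is a
    # dictionary word, set to 1 if it appears in the prompt, else 0.
    # The prompts here are invented examples, not from the dataset.

    def build_vocab(prompts):
        """Collect every distinct lowercase word across the corpus."""
        return sorted({word for p in prompts for word in p.lower().split()})

    def encode(prompt, vocab):
        """Return the binary presence vector of `prompt` over `vocab`."""
        words = set(prompt.lower().split())
        return [1 if w in words else 0 for w in vocab]

    prompts = [
        "create a knn model to classify emails",
        "build a decision tree model for fraud detection",
    ]
    vocab = build_vocab(prompts)
    rows = [encode(p, vocab) for p in prompts]
    # Appending a class label (the referenced algorithm) to each row
    # yields the multi-class dataset described above.
    ```

    Because each prompt references exactly one algorithm, each encoded row carries a single label, which is what makes this a multi-class rather than multi-label problem.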

    KNN

    • How would you create a KNN model to classify emails as spam or not spam based on their content and metadata?
    • How could you implement a KNN model to classify handwritten digits using the MNIST dataset?
    • How would you use a KNN approach to build a recommendation system for suggesting movies to users based on their ratings and preferences?
    • How could you employ a KNN algorithm to predict the price of a house based on features such as its location, size, number of bedrooms, etc.?
    • Can you create a KNN model for classifying different species of flowers based on their petal length, petal width, sepal length, and sepal width?
    • How would you utilise a KNN model to predict the sentiment (positive, negative, or neutral) of text reviews or comments?
    • Can you create a KNN model for me that could be used in malware classification?
    • Can you make me a KNN model that can detect a network intrusion when looking at encrypted network traffic?
    • Can you make a KNN model that would predict the stock price of a given stock for the next week?
    • Can you create a KNN model that could be used to detect malware when using a dataset relating to certain permissions a piece of software may have access to?

    Decision Tree

    • Can you describe the steps involved in building a decision tree model to classify medical images as malignant or benign for cancer diagnosis and return a model for me?
    • How can you utilise a decision tree approach to develop a model for classifying news articles into different categories (e.g., politics, sports, entertainment) based on their textual content?
    • What approach would you take to create a decision tree model for recommending personalised university courses to students based on their academic strengths and weaknesses?
    • Can you describe how to create a decision tree model for identifying potential fraud in financial transactions based on transaction history, user behaviour, and other relevant data?
    • In what ways might you apply a decision tree model to classify customer complaints into different categories, determining the severity of language used?
    • Can you create a decision tree classifier for me?
    • Can you make me a decision tree model that will help me determine the best course of action across a given set of strategies?
    • Can you create a decision tree model for me that can recommend certain cars to customers based on their preferences and budget?
    • How can you make a decision tree model that will predict the movement of star constellations in the sky based on data provided by the NASA website?
    • How do I create a decision tree for time-series forecasting?

    Random Forest

    • Can you describe the steps involved in building a random forest model to classify different types of anomalies in network traffic data for cybersecurity purposes and return the code for me?
    • In what ways could you implement a random forest model to predict the severity of traffic congestion in urban areas based on historical traffic patterns, weather...

  16. notMNIST

    • kaggle.com
    • opendatalab.com
    • +3more
    Updated Feb 14, 2018
    jwjohnson314 (2018). notMNIST [Dataset]. https://www.kaggle.com/datasets/jwjohnson314/notmnist/code
    Dataset updated
    Feb 14, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    jwjohnson314
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The MNIST dataset is one of the best-known image classification problems out there, and a veritable classic of the field of machine learning. This dataset is a more challenging version of the same root problem: classifying letters from images. This is a multiclass classification dataset of glyphs of the English letters A through J.

    This dataset is used extensively in the Udacity Deep Learning course, and is available in the TensorFlow GitHub repo (under Examples). I'm not aware of any license governing the use of this data, so I'm posting it here so that the community can use it with Kaggle kernels.

    Content

    notMNIST_large.zip is a large but dirty version of the dataset with 529,119 images, and notMNIST_small.zip is a small hand-cleaned version of the dataset with 18,726 images. The dataset was assembled by Yaroslav Bulatov and can be obtained from his blog. According to this blog entry, there is about a 6.5% label error rate on the large uncleaned dataset, and a 0.5% label error rate on the small hand-cleaned dataset.

    The two files each contain 28x28 grayscale images of the letters A through J, organized into directories by letter. notMNIST_large.zip contains 529,119 images and notMNIST_small.zip contains 18,726 images.
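    With this one-directory-per-letter layout, the class label of each image is simply the name of its parent directory. A small sketch of that idea, using a temporary folder with empty placeholder files as a hypothetical stand-in for the unzipped archive (real use would point `root` at the extracted notMNIST_small directory):

    ```python
    # Sketch: deriving class labels from a per-letter directory layout
    # (one folder per class). The temporary root and placeholder files
    # are illustrative stand-ins for the real unzipped dataset.
    import pathlib
    import tempfile

    root = pathlib.Path(tempfile.mkdtemp())
    for letter in "ABC":  # a subset of A-J, for illustration
        class_dir = root / letter
        class_dir.mkdir()
        (class_dir / f"sample_{letter}.png").touch()

    # The label of each image is the name of its parent directory.
    samples = sorted((path, path.parent.name) for path in root.glob("*/*.png"))
    labels = [label for _, label in samples]
    ```

    Note that directory-derived labels inherit the label noise quoted above, so the hand-cleaned small split is the safer choice for evaluation.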

    Acknowledgements

    Thanks to Yaroslav Bulatov for putting together the dataset.

  17. LLM: 7 prompt training dataset

    • kaggle.com
    Updated Nov 15, 2023
    Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carl McBride Ellis
    License

    https://cdla.io/sharing-1-0/

    Description
    • Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
      File: train_essays_RDizzl3_seven_v2.csv
      Human texts: 14,247; LLM texts: 3,004

      See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts



    • Version 3: "The RDizzl3 Seven"
      File: train_essays_RDizzl3_seven_v1.csv

    • "Car-free cities"

    • "Does the electoral college work?"

    • "Exploring Venus"

    • "The Face on Mars"

    • "Facial action coding system"

    • "A Cowboy Who Rode the Waves"

    • "Driverless cars"

    How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

    • Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1,638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

    Namely:

    • "Car-free cities"
    • "Does the electoral college work?"
    • "Exploring Venus"
    • "The Face on Mars"
    • "Facial action coding system"
    • "Seeking multiple opinions"
    • "Phones and driving"

    This dataset is a derivative of the datasets

    as well as the original competition training dataset

    • Version 1: This dataset is composed of 13,712 human texts and 1,165 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

  18. Predictive Maintenance Dataset

    • kaggle.com
    Updated Nov 7, 2022
    Himanshu Agarwal (2022). Predictive Maintenance Dataset [Dataset]. https://www.kaggle.com/datasets/hiimanshuagarwal/predictive-maintenance-dataset
    Dataset updated
    Nov 7, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Himanshu Agarwal
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    A company has a fleet of devices transmitting daily sensor readings. They would like to create a predictive maintenance solution to proactively identify when maintenance should be performed. This approach promises cost savings over routine or time-based preventive maintenance, because tasks are performed only when warranted.

    The task is to build a predictive model using machine learning to predict the probability of a device failure. When building this model, be sure to minimize false positives and false negatives. The column to predict is called failure, with binary value 0 for non-failure and 1 for failure.
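    Since the task asks specifically to minimize false positives and false negatives, it helps to count them separately rather than fold them into overall accuracy. A minimal sketch of that evaluation step, using invented labels and predictions rather than the real sensor data:

    ```python
    # Sketch: counting false positives and false negatives for the
    # binary `failure` column. The labels and predictions below are
    # hypothetical, not from the actual dataset.

    def confusion(y_true, y_pred):
        """Return (tp, fp, fn, tn) counts for binary 0/1 labels."""
        tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
        tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
        return tp, fp, fn, tn

    y_true = [0, 0, 1, 1, 0, 1]   # 1 = device failure
    y_pred = [0, 1, 1, 0, 0, 1]   # hypothetical model output
    tp, fp, fn, tn = confusion(y_true, y_pred)
    # With failures being rare, report fp and fn separately: a trivial
    # all-zero predictor scores high accuracy while catching nothing.
    ```

    A probabilistic model then lets you move the decision threshold to trade false positives against false negatives, which is where the minimization requirement above actually bites.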

  19. Website Screenshots

    • kaggle.com
    • universe.roboflow.com
    Updated May 19, 2025
    Pooria Mostafapoor (2025). Website Screenshots [Dataset]. https://www.kaggle.com/datasets/pooriamst/website-screenshots
    Dataset updated
    May 19, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Pooria Mostafapoor
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The Website Screenshots dataset is a synthetically generated dataset composed of screenshots from over 1,000 of the world's top websites. They have been automatically annotated to label the following classes:

    • button - navigation links, tabs, etc.
    • heading - text that was enclosed in <h1> to <h6> tags.
    • link - inline, textual <a> tags.
    • label - text labeling form fields.
    • text - all other text.
    • image - <img>, <svg>, or <video> tags, and icons.
    • iframe - ads and 3rd party content.

    Example

    This is an example image and annotation from the dataset: https://i.imgur.com/mOG3u3Z.png (Wikipedia screenshot)

    Usage

    Annotated screenshots are very useful in Robotic Process Automation, but they can be expensive to label. This dataset would cost over $4,000 for humans to label on popular labeling services. We hope this dataset provides a good starting point for your project. Try it with a model from our model library.

    The dataset contains 1,689 training images, 243 test images, and 483 validation images.

  20. What social Media People like the most and why?

    • kaggle.com
    Updated Feb 17, 2023
    Nina Luquez (2023). What social Media People like the most and why? [Dataset]. https://www.kaggle.com/ninaluquez/what-social-media-people-like-the-most-and-why/discussion
    Dataset updated
    Feb 17, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nina Luquez
    Description

    Dataset

    This dataset was created by Nina Luquez
