CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Sarcasm Detection:
Steps:
1. Load or prepare the dataset [dataset link: https://github.com/PawanKrGunjan/Natural-Language-Processing/blob/main/Sarcasm%20Detection/sarcasm.json]
2. Preprocess the text if it is loaded from a file instead of added manually to the code
3. Vectorize the text using TfidfVectorizer
4. Reduce the dimensionality using PCA
5. Cluster the documents
6. Plot the clusters using matplotlib
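A minimal sketch of the six steps above is shown below. The file layout (JSON Lines with a "headline" field) and the number of clusters are assumptions, not taken from the original notebook.

```python
# Minimal sketch of the steps above; file layout, cluster count and
# preprocessing choices are assumptions, not the author's exact code.
import json
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# 1./2. Load the dataset, keeping the raw headline text
with open("sarcasm.json") as f:
    headlines = [json.loads(line)["headline"] for line in f]

# 3. Vectorize the text
X = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(headlines)

# 4. Reduce the dimensionality with PCA (PCA needs a dense matrix)
coords = PCA(n_components=2).fit_transform(X.toarray())

# 5. Cluster the documents
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(coords)

# 6. Plot the clusters
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5)
plt.title("Headline clusters (TF-IDF + PCA + KMeans)")
plt.show()
```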
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset can be used as a benchmark for clustering word embeddings for German. The dataset contains book titles and is based on the dataset from the GermEval 2019 Shared Task on Hierarchical Classification of Blurbs. It contains 18'084 unique samples, 28 splits with 177 to 16'425 samples and 4 to 93 unique classes. Splits are built similarly to MTEB's ArxivClusteringP2P. Have a look at the German Text Embedding Clustering Benchmark (GitHub, Paper) for more information, datasets and evaluation… See the full description on the dataset page: https://huggingface.co/datasets/slvnwhrl/blurbs-clustering-p2p.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Over 12k scraped YouTube EN subtitles for videos on GitHub topics.
How? Based on the topics https://github.com/topics I searched YouTube with the phrase "What is {topic}?" and downloaded up to 100 video subtitles for a given topic. The extracted text can be found in the dataset together with the topic name, video title and video URL.
Why? I want to know whether we can rate videos based on their information value, especially when we use YouTube as an information source.
You can find the source code here: https://github.com/detrin/text-info-value
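The author's actual scraper lives in the linked repository; purely as an illustration, a comparable collection step could look like the sketch below (yt-dlp, the topic value and the output layout are assumptions, not necessarily what the original project used).

```python
# Illustrative only: the original project's code is at github.com/detrin/text-info-value.
# This sketch assumes yt-dlp; the topic and output layout are invented for the example.
from yt_dlp import YoutubeDL

def download_subtitles(topic: str, max_results: int = 100) -> None:
    """Search YouTube for 'What is {topic}?' and save English subtitles only."""
    opts = {
        "skip_download": True,        # subtitles only, no video files
        "writesubtitles": True,
        "writeautomaticsub": True,    # fall back to auto-generated captions
        "subtitleslangs": ["en"],
        "outtmpl": f"subs/{topic}/%(id)s.%(ext)s",
        "quiet": True,
    }
    query = f"ytsearch{max_results}:What is {topic}?"
    with YoutubeDL(opts) as ydl:
        ydl.download([query])

download_subtitles("machine-learning")
```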
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Output files from applying our R software (available at https://github.com/wilkinsonlab/robust-clustering-metagenomics) to several previously published microbiome datasets.
Prefixes:
Suffixes:
_All: all taxa
_Dominant: only the 1% most abundant taxa
_NonDominant: remaining taxa after removing the dominant taxa above
_GenusAll: taxa aggregated at genus level
_GenusDominant: taxa aggregated at genus level, then only the 1% most abundant taxa selected
_GenusNonDominant: taxa aggregated at genus level, then the 1% most abundant taxa removed
Each folder contains 3 output files related to the same input dataset:
- data.normAndDist_definitiveClustering_XXX.RData: R data file with a) a phyloseq object (including OTU table, meta-data and cluster assigned to each sample); and b) a distance matrix object.
- definitiveClusteringResults_XXX.txt: text file with assessment measures of the selected clustering.
- sampleId-cluster_pairs_XXX.txt: comma-separated text file with two columns: sampleID,clusterID
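For example, the sample-to-cluster assignments can be loaded directly into Python; whether the two-column files carry a header row is not stated here, so that part of the sketch is an assumption.

```python
# Minimal loading sketch for one sampleId-cluster pairs file; "XXX" stands for the
# dataset-specific suffix. header=0 assumes the first line is the "sampleID,clusterID"
# header; use header=None if the files have no header row.
import pandas as pd

pairs = pd.read_csv("sampleId-cluster_pairs_XXX.txt", header=0,
                    names=["sampleID", "clusterID"])
print(pairs["clusterID"].value_counts())  # number of samples per cluster
```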
Abstract of the associated paper:
The analysis of microbiome dynamics would allow us to elucidate patterns within microbial community evolution; however, microbiome state-transition dynamics have been scarcely studied. This is in part because a necessary first step in such analyses has not been well defined: how to deterministically describe a microbiome's "state". Clustering into states has been widely studied, although no standard has yet been agreed upon. We propose a generic, domain-independent and automatic procedure to determine a reliable set of microbiome sub-states within a specific dataset, with respect to the conditions of the study. The robustness of sub-state identification is established by combining diverse techniques for stable cluster verification. We reuse four distinct longitudinal microbiome datasets to demonstrate the broad applicability of our method, analysing results with different taxa subsets to show how the procedure can be adjusted to the goal of the application, and showing that the methodology provides a set of robust sub-states to examine in downstream studies of microbiome dynamics.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset can be used as a benchmark for clustering word embeddings for German. The dataset contains news article titles and is based on the dataset of the One Million Posts Corpus and 10kGNAD. It contains 10'267 unique samples, 10 splits with 1'436 to 9'962 samples and 9 unique classes. Splits are built similarly to MTEB's TwentyNewsgroupsClustering. Have a look at the German Text Embedding Clustering Benchmark (GitHub, Paper) for more information, datasets and evaluation results. If you use this… See the full description on the dataset page: https://huggingface.co/datasets/slvnwhrl/tenkgnad-clustering-s2s.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Word embeddings and word clusters for Clinical Danish, drawn from the heavily-anonymised E4C resource (https://doi.org/10.1177/1460458216647760) and presented here as statistical aggregate data over those records. Vocabulary of 382,737 words. Vectors have 100 dimensions. Clusters were generated using Generalised Brown clustering with a=2500 and a minimum count of 3; coarser clusters can be generated rapidly from the included mergefile (see https://github.com/sean-chester/generalised-brown/blob/master/cluster_generator/cluster.py). A data statement is included.
Quantifying Iconicity - Zenodo
This dataset contains the material collected for the article "Distant reading 940,000 online circulations of 26 iconic photographs" (to be) published in New Media & Society (DOI: 10.1177/14614448211049459). We identified 26 iconic photographs based on earlier work (Van der Hoeven, 2019). The Google Cloud Vision (GCV) API was subsequently used to identify webpages that host a reproduction of the iconic image. The GCV API uses computer vision methods and the Google index to retrieve these reproductions. The code for calling the API and parsing the data can be found on GitHub: https://github.com/rubenros1795/ReACT_GCV.
The core dataset consists of .tsv files with the URLs that refer to the webpages. Other metadata provided by the GCV API, as well as manually generated metadata, is also found in the files. This includes:
- the URL that refers specifically to the image. This can be a URL that refers to a full match or a partial match
- the title of the page
- the iteration number. Because the GCV API puts a limit on its output, we had to reupload the identified images to the API to extend our search. We continued these iterations until no more new unique URLs were found
- the language found by the langid Python module, along with the normalized score.
- the labels associated with the image by Google
- the scrape date
Alongside the .tsv-files, there are several other elements in the following folder structure:
├── data
│   ├── embeddings
│   │   ├── doc2vec
│   │   ├── input-text
│   │   ├── metadata
│   │   └── umap
│   ├── evaluation
│   ├── results
│   │   ├── diachronic-plots
│   │   └── top-words
│   └── tsv
The /embeddings folder contains the doc2vec models, the training input for the models, the metadata (id, URL, date) and the UMAP embeddings used in the GMM clustering. Please note that the date parser was not able to find dates for all webpages, so not all training texts have associated metadata.
The /evaluation folder contains the AIC and BIC scores for GMM clustering with different numbers of clusters.
The /results folder contains the top words associated with the clusters and the diachronic cluster prominence plots.
Our pipeline contained several interventions to prevent noise in the data. First, in between the iterations we manually checked the scraped photos for relevance. We did so because reuploading an iconic image that is paired with another, irrelevant one results in reproductions of the irrelevant one in the next iteration. Because we did not catch all noise, we used Scale Invariant Feature Transform (SIFT), a basic computer vision algorithm, to remove images that did not meet a threshold of ten keypoints. By doing so we removed completely unrelated photographs, but left room for variations of the original (such as painted versions of Che Guevara, or cropped versions of the Napalm Girl image). Another issue was the parsing of webpage texts. After experimenting with different webpage parsers that aim to extract 'relevant' text, it proved too difficult to use one solution for all our webpages. Therefore we simply parsed all the text contained in commonly used html-tags.
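One plausible reading of the keypoint filter described above is shown in the sketch below: a scraped image is kept only if at least ten SIFT keypoints match the reference iconic photograph. The file paths, the matcher and the ratio test are assumptions, not the authors' original implementation.

```python
# Sketch of a SIFT keypoint-match filter with a threshold of ten matches.
# Paths are placeholders; this is not the authors' original code.
import cv2

def enough_matching_keypoints(reference_path: str, candidate_path: str,
                              min_matches: int = 10) -> bool:
    ref = cv2.imread(reference_path, cv2.IMREAD_GRAYSCALE)
    cand = cv2.imread(candidate_path, cv2.IMREAD_GRAYSCALE)
    if ref is None or cand is None:
        return False
    sift = cv2.SIFT_create()
    _, ref_desc = sift.detectAndCompute(ref, None)
    _, cand_desc = sift.detectAndCompute(cand, None)
    if ref_desc is None or cand_desc is None:
        return False
    matches = cv2.BFMatcher().knnMatch(ref_desc, cand_desc, k=2)
    # Lowe's ratio test keeps only distinctive matches
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
    return len(good) >= min_matches

print(enough_matching_keypoints("originals/iconic_photo.jpg", "scraped/example.jpg"))
```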
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accompanying datasets that are referenced in the journal article 'Comparing the crisis of 806/1403-4 and the Fatimid fitna (450-466/1058-1073): al-Maqrīzī as a historian of the Fatimids - Datasets'. The texts used in the analysis are taken from the OpenITI corpus release (Version 2021.2.5). If the ID (the final part of the text URI) has changed from the OpenITI release to this data release, then the text has been modified for this case study. File extensions following the text URI indicate that the text has had additional tags applied (either date tags, or text reuse cluster tags). csv file names indicate the text file from which the csv file was generated. This is a published part of an active research project. For other datasets and the scripts used to generate this data, see the relevant GitHub repository: https://github.com/mabarber92/fitna-study
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Preface
This is the data repository for the paper accepted for publication in NUSA's special issue on Linguistic studies using large annotated corpora (co-edited by Hiroki Nomoto and David Moeljadi).
How to cite the dataset
If you use, adapt, and/or modify any of the datasets in this repository for your research or teaching purposes (except for the malindo_dbase, see below), please cite as:
Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019): Dataset for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Fileset. https://doi.org/10.6084/m9.figshare.8187155
Alternatively, click on the dark pink Cite button to browse different citation styles (the default is DataCite).
The malindo_dbase data in this repository is from Nomoto et al. (2018) (cf. the GitHub repository), so please also cite their work if you use it for your research:
Nomoto, Hiroki, Hannah Choi, David Moeljadi and Francis Bond. 2018. MALINDO Morph: Morphological dictionary and analyser for Malay/Indonesian. Kiyoaki Shirai (ed.) Proceedings of the LREC 2018 Workshop "The 13th Workshop on Asian Language Resources", 36-43.
A tutorial on how to use the data together with the R Markdown Notebook for the analyses is available on GitHub and figshare:
Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019): R Markdown Notebook for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Software. https://doi.org/10.6084/m9.figshare.9970205
Dataset description
1. Leipzig_w2v_vector_full.bin is the vector space model used in the paper. We built it using the wordVectors package (Schmidt & Li 2017) via the MonARCH High Performance Computing Cluster (we thank Philip Chan for his help with access to MonARCH).
2. Files beginning with ngramexmpl_... are data for the n-grams (i.e. word sequences) of the verbs discussed in the paper. The files are in tab-separated format.
3. Files beginning with sentence_... are full sentences for the verbs discussed in the paper (in plain text format and R dataset format [.rds]). Information on the corpus file and sentence number in which the verb is found is included.
4. me_parsed_nountaggedbase (in three different file formats) contains a database of the me- words with noun-tagged roots that MorphInd identified as occurring in the three morphological schemas we focus on (me-, me-/-kan, and me-/-i). The database has columns for the verbs' token frequency in the corpus, root forms, and MorphInd parsing output, among others.
5. wordcount_leipzig_allcorpus (in three different file formats) contains information on the size of each corpus file used in the paper and from which the vector space model is built.
6. wordlist_leipzig_ME_DI_TER_percorpus.tsv is a tab-separated frequency list of words prefixed with me-, di-, and ter- in all thirteen corpus files used. The wordlist is built by first tokenising each corpus file, lowercasing the tokens, and then extracting the words with the corresponding three prefixes using the following regular expressions (illustrated in the sketch below):
- For me-: ^(?i)(me)([a-z-]{3,})$
- For di-: ^(?i)(di)([a-z-]{3,})$
- For ter-: ^(?i)(ter)([a-z-]{3,})$
7. malindo_dbase is the MALINDO Morphological Dictionary (see above).
References
Schmidt, Ben & Jian Li. 2017. wordVectors: Tools for creating and analyzing vector-space models of texts. R package. http://github.com/bmschmidt/wordVectors.
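As a small illustration of the prefix extraction described in item 6: the original pipeline is in R, so the Python version and the whitespace tokenisation below are assumptions.

```python
# Sketch of the prefix extraction in item 6. The original pipeline is in R;
# this Python version and the simple whitespace tokenisation are assumptions.
# re.IGNORECASE replaces the inline (?i) flag used in the listed patterns.
import re
from collections import Counter

PREFIX_PATTERNS = {
    "me": re.compile(r"^(me)([a-z-]{3,})$", re.IGNORECASE),
    "di": re.compile(r"^(di)([a-z-]{3,})$", re.IGNORECASE),
    "ter": re.compile(r"^(ter)([a-z-]{3,})$", re.IGNORECASE),
}

def prefix_frequencies(corpus_text: str) -> dict:
    """Tokenise, lowercase, and count words matching each prefix pattern."""
    tokens = [t.lower() for t in corpus_text.split()]
    counts = {prefix: Counter() for prefix in PREFIX_PATTERNS}
    for token in tokens:
        for prefix, pattern in PREFIX_PATTERNS.items():
            if pattern.match(token):
                counts[prefix][token] += 1
    return counts

freqs = prefix_frequencies("Dia membaca buku itu dan kemudian ditulis ulang")
print(freqs["me"].most_common(5))
```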
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This is a side project for my thesis “Classification/Clustering Techniques for Large Web Data Collections”.
My main goal was to provide a new, enriched, ground-truth labeled dataset to the Machine Learning community. All labels have been collected by crawling/scraping Amazon.com over a period of several months. By labels I mean the categories in which the products are classified (see the green underlined labels in the screenshot below).
Screenshot: http://i.imgur.com/mAiuoO6.png
Please, if you feel you can make any contribution that will improve this dataset, fork it on github.com.
The Amazon Movies Reviews dataset consists of 7,911,684 reviews that Amazon users left between Aug 1997 and Oct 2012.
Data format:
where:
All the collected data (for every ASIN of the SNAP Dataset, ~253k products for ~8m reviews) are stored in a csv file labels.csv in the following format:
The new data format will be:
You can follow the steps below to get the enriched dataset:
Download the original dataset from the SNAP website (~3.3 GB compressed) and put it in the root folder of the repository (where you can also find the labels.csv file).
Execute the python file enrich.py (available in the GitHub project) so that the new enriched multi-labeled dataset is exported. The name of the new file should be output.txt.gz.
Notice: Please be patient as the python script will take a while to parse all these reviews.
The python script generates a new compressed file that is essentially the same as the original one, but with an extra feature (product/categories).
In fact, the python script applies a mapping between the ASIN values in both files and adds the product's label data to every review instance as an extra column.
Here is the code:
import gzip
import csv
import ast

def look_up(asin, diction):
    # Return the category labels for an ASIN, or an empty list if unknown.
    try:
        return diction[asin]
    except KeyError:
        return []

def load_labels():
    # Build a dictionary mapping each ASIN to its list of category labels.
    labels_dictionary = {}
    with open('labels.csv', mode='r') as infile:
        csvreader = csv.reader(infile)
        next(csvreader)  # skip the header row
        for rows in csvreader:
            labels_dictionary[rows[0]] = ast.literal_eval(rows[1])
    return labels_dictionary

def parse(filename):
    # Stream the SNAP reviews file and yield one review entry (dict) at a time.
    labels_dict = load_labels()
    f = gzip.open(filename, 'rt')  # text mode so lines are str, not bytes
    entry = {}
    for l in f:
        l = l.strip()
        colonPos = l.find(':')
        if colonPos == -1:
            # A blank line marks the end of a review entry.
            yield entry
            entry = {}
            continue
        eName = l[:colonPos]
        rest = l[colonPos+2:]
        entry[eName] = rest
        if eName == 'product/productId':
            # The original snippet is truncated here; presumably the ASIN is
            # mapped to its scraped category labels, e.g.:
            entry['product/categories'] = look_up(rest, labels_dict)
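A possible driver for the generator above, writing the enriched entries back out as gzipped text, is sketched below; the SNAP input file name and the exact serialization used by enrich.py are assumptions.

```python
# Hypothetical driver: stream the enriched entries and write them back out in
# the original "key: value" block format. File names are assumptions.
import gzip

with gzip.open("output.txt.gz", "wt") as out:
    for entry in parse("movies.txt.gz"):
        for key, value in entry.items():
            out.write(f"{key}: {value}\n")
        out.write("\n")  # blank line separates review entries
```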
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains output data from cluster population simulations performed with Atmospheric Cluster Dynamics Code (ACDC) model, which simulates the formation of clusters from atmospheric vapors and the growth of these clusters by further molecular and cluster-cluster collisions. The data can be used for investigating the formation and growth of atmospheric particles from inorganic and organic vapors.
The data is output of a computational process model, and hence does not represent a specific time period or location. Simulation sets are calculated for one- or two-component systems containing a quasi-unary inorganic compound representing a mixture of sulfuric acid and ammonia (SA) and/or oxidized organic vapors corresponding to a low-volatility organic compound (LVOC) and an extremely-low-volatility organic compound (ELVOC). The external conditions in the simulations correspond to those in the CLOUD (Cosmics Leaving Outdoor Droplets) chamber at a temperature of 5 °C.
Data are provided for 14 simulations.
References
Kontkanen J, Stolzenburg D, Olenius T, Yan C, Dada L, Ahonen L, Simon M, Lehtipalo K, Riipinen I (2022) What controls the observed size-dependency of the growth rates of sub-10 nm atmospheric particles?. Environ. Sci.: Atmos. https://doi.org/10.1039/d1ea00103e
Olenius T, Riipinen I (2017) Molecular-resolution simulations of new particle formation: Evaluation of common assumptions made in describing nucleation in aerosol dynamics models. Aerosol Sci. Tech. 51:397 – 408. https://doi.org/10.1080/02786826.2016.1262530
Olenius T, Atmospheric Cluster Dynamics Code. https://github.com/tolenius/ACDC
McGrath MJ et al. (2012) Atmospheric Cluster Dynamics Code: a flexible method for solution of the birth-death equations. Atmos. Chem. Phys. 12:2345 – 2355. https://doi.org/10.5194/acp-12-2345-2012
Data description
The data is in the form of text files. The provided data files (total compressed size ~10GB) correspond to simulation output from the ACDC model. Simulation sets are shown in the table below and further described in Kontkanen et al. (2022). For the interpretation of the model output, the interested user is referred to the manual of ACDC model (https://github.com/tolenius/ACDC).
| Simulation set | Model compounds | Vapor concentrations (cm⁻³) | Method to retrieve evaporation rates |
|---|---|---|---|
| 1 | SA | C_SA = 8.0×10⁶, 2.0×10⁷, 4.7×10⁷, 1.1×10⁸ | Kelvin eq. (classical evaporation rates) |
| 2 | SA | C_SA = 2.0×10⁷, 4.7×10⁷, 1.1×10⁸ | QC data and Kelvin eq. (non-classical evaporation rates) |
| 3 | LVOC | C_LVOC = 5.0×10⁷, 1×10⁸ | Kelvin eq. (classical evaporation rates) |
| 4 | LVOC, ELVOC | C_LVOC = 5.0×10⁷, 1×10⁸; C_ELVOC = 1.0×10⁷ | Kelvin eq. (classical evaporation rates) |
| 5 | LVOC, SA | C_LVOC = 2.0×10⁷, 5.0×10⁷, 1×10⁸; C_SA = 8.0×10⁶ | Kelvin eq. (classical evaporation rates) |
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This data has been collected through a standard scraping process from the Medium website, looking for published articles.
Each row in the data is a different article published on Medium. For each article, you have the following features:
- title [string]: The title of the article.
- text [string]: The text content of the article.
- url [string]: The URL associated with the article.
- authors [list of strings]: The article authors.
- timestamp [string]: The publication datetime of the article.
- tags [list of strings]: List of tags associated with the article.
You can find a very quick data analysis in this notebook.
Scraping has been done with Python and the requests library. Starting from a random article on Medium, the next articles to scrape are selected by visiting:
1. The author archive pages.
2. The publication archive pages (if present).
3. The tags archives (if present).
The article HTML pages have been parsed with the newspaper Python library.
Published articles have been filtered for English articles only, using the Python langdetect library.
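A rough sketch of the parse-and-filter step described above is shown below; the URL is a placeholder and the original crawler logic is more involved than this.

```python
# Sketch of the parse-and-filter step: download an article page, extract its
# text with newspaper, and keep it only if langdetect says it is English.
# The URL is a placeholder; the original crawler logic is more involved.
from newspaper import Article
from langdetect import detect

def fetch_english_article(url: str):
    article = Article(url)
    article.download()
    article.parse()
    if detect(article.text) != "en":
        return None
    return {
        "title": article.title,
        "text": article.text,
        "url": url,
        "authors": article.authors,
        "timestamp": str(article.publish_date),
        "tags": list(article.tags),
    }

print(fetch_english_article("https://medium.com/some-publication/some-article"))
```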
As a consequence of the collection methodology, the scraped articles do not come from a uniform publication date distribution. This means that there are articles published in 2016 and in 2022, but the number of articles in this dataset published in 2016 is not the same as the number published in 2022. In particular, there is a strong prevalence of articles published in 2020. Have a look at the accompanying notebook to see the distribution of the publication dates.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
TenKGnadClusteringP2P.v2, an MTEB (Massive Text Embedding Benchmark) dataset
Clustering of news article titles+subheadings+texts. Clustering of 10 splits on the news article category.
Task category t2c
Domains News, Non-fiction, Written
Reference https://tblock.github.io/10kGNAD/
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code: import mteb
task = mteb.get_task("TenKGnadClusteringP2P.v2") evaluator… See the full description on the dataset page: https://huggingface.co/datasets/mteb/TenKGnadClusteringP2P.v2.
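The snippet above is truncated on the dataset page; a complete evaluation run with the mteb package typically looks like the sketch below. The embedding model is a placeholder, and the API details may differ slightly between mteb versions.

```python
# Rough sketch of a full MTEB evaluation; the model choice is a placeholder and
# API details may vary between mteb versions.
import mteb
from sentence_transformers import SentenceTransformer

task = mteb.get_task("TenKGnadClusteringP2P.v2")
evaluator = mteb.MTEB(tasks=[task])

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
results = evaluator.run(model, output_folder="results")
print(results)
```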
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data set has been downloaded via the OAI-PMH endpoint of the Berlin State Library/Staatsbibliothek zu Berlin’s Digitized Collections (https://digital.staatsbibliothek-berlin.de/oai) on March 1st 2019 and converted into common tabular formats on the basis of the provided Dublin Core metadata. It contains 146,000 records.
In addition to the bibliographic metadata, representative images of the works have been downloaded, resized to a 512 pixel maximum thumbnail image and saved in JPEG format. The image data is split into title pages and first pages. Title pages have been derived from structural metadata created by scan operators and librarians. If this information was not available, first pages of the media have been downloaded. In case of multi-volume media, title pages are not available.
In total, 141,206 title/first page images are available.
Furthermore, the tabular data has been cleaned and extended with geo-spatial coordinates provided by the OpenStreetMap project (https://www.openstreetmap.org). The actual data processing steps are summarized in the next section. For the sake of transparency and reproducibility, the original data taken from the OAI-PMH endpoint is still present in the table.
In addition, various graphs in GML file format are available that can be loaded directly into graph analysis tools such as Gephi (https://gephi.org/).
The implementation of the data processing steps (incl. graph creation) is available as a Jupyter notebook provided at https://github.com/elektrobohemian/SBBrowse2018/blob/master/DataProcessing.ipynb.
Tabular Metadata
The metadata is available in Excel (cleanedData.xlsx) and CSV (cleanedData.csv) file formats with equal content.
The table contains the following columns. Columns in italics have not been processed.
· title The title of the medium
· creator Its creator (family name, first name)
· subject A collection’s name as provided by the library
· type The type of medium
· format A MIME type for full metadata download
· identifier An additional identifier (most often the PPN)
· language A 3-letter language code of the medium
· date The date of creation/publication or a time span
· relation A relation to a project or collection a medium has been digitized for.
· coverage The location of publication or origin (ranging from cities to continents)
· publisher The publisher of the medium.
· rights Copyright information.
· PPN The unique identifier that can be used to find more information about the current medium in all information systems of Berlin State Library/Staatsbibliothek zu Berlin.
· spatialClean In case of multiple entries in coverage, only the first place of origin has been extracted. Additionally, characters such as question marks, brackets, or the like have been removed. The entries have been normalized regarding whitespaces and writing variants with the help of regular expressions.
· dateClean As the original date may contain various format variants to indicate unclear creation dates (e.g., time spans or question marks), this field contains a mapping to a certain point in time.
· spatialCluster The cluster ID determined with the help of the Jaro-Winkler distance on the spatialClean string. This step is needed because the spatialClean fields still contain a huge amount of orthographic variants and latinizations of geographic names.
· spatialClusterName A verbal cluster name (controlled manually).
· latitude The latitude provided by OpenStreetMap of the spatialClusterName if the location could be found.
· longitude The longitude provided by OpenStreetMap of the spatialClusterName if the location could be found.
· century A century derived from the date.
· textCluster A text cluster ID based on a k-means clustering of the title field, using a tf*idf model with a vocabulary size of 125,000 and k=5,000 (see the sketch after this list).
· creatorCluster A text cluster ID based on the creator field with k=20,000.
· titleImage The path to the first/title page relative to the img/ subdirectory or None in case of a multi-volume work.
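A rough sketch of how a clustering like textCluster can be reproduced is given below; the exact preprocessing and k-means variant used for the published columns are not documented here, so the parameters are illustrative only.

```python
# Illustrative sketch of a title clustering like the textCluster column:
# tf*idf over the title field with a 125,000-word vocabulary and k = 5,000.
# The actual preprocessing used for the published dataset may differ.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

df = pd.read_csv("cleanedData.csv")
titles = df["title"].fillna("").astype(str)

tfidf = TfidfVectorizer(max_features=125_000)
X = tfidf.fit_transform(titles)

# MiniBatchKMeans keeps k = 5,000 tractable on ~146,000 documents.
kmeans = MiniBatchKMeans(n_clusters=5_000, random_state=0, n_init=3)
df["textCluster"] = kmeans.fit_predict(X)
```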
Other Data
graphs.zip
Various pre-computed graphs.
img.zip
First and title pages in JPEG format.
json.zip
JSON files for each record in the following format:
ppn "PPN57346250X"
dateClean "1625"
title "M. Georgii Gutkii, Gymnasii Berlinensis Rectoris Habitus Primorum Principiorum, Seu Intelligentia; Annexae Sunt Appendicis loco Disputationes super eodem habitu tum in Academia Wittebergensi, tum in Gymnasio Berlinensi ventilatae"
creator "Gutke, Georg"
spatialClusterName "Berlin"
spatialClean "Berolini"
spatialRaw "Berolini"
mediatype "monograph"
subject "Historische Drucke"
publisher "Kallius"
lat "52.5170365"
lng "13.3888599"
textCluster "45"
creatorCluster "5040"
titleImage "titlepages/PPN57346250X.jpg"
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A dataset of 277,439 English-only Steam user reviews for Dead by Daylight from 2019 to November 2025, collected through the official Steam API.
Each row represents a single review, including sentiment labels, playtime, and engagement metrics.
This dataset is ideal for natural language processing, sentiment analysis, and behavioral data studies.
A separate CSV with all the patches released for Dead by Daylight is included in the download for your convenience.
| Field | Description |
|---|---|
| review | Full review text |
| sentiment | 1 = positive review, 0 = negative |
| purchased | 1 if purchased on Steam |
| received_for_free | 1 if the game was received for free |
| votes_up | Number of helpful votes |
| votes_funny | Number of “funny” votes |
| date_created | Review creation date (YYYY-MM-DD, UTC) |
| date_updated | Last update date (YYYY-MM-DD, UTC) |
| author_num_games_owned | Total games owned by reviewer |
| author_num_reviews | Total reviews written by reviewer |
| author_playtime_forever_min | Total playtime in minutes |
| author_playtime_at_review_min | Playtime when the review was written (minutes) |
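As a quick start, the fields above can be loaded with pandas; the CSV file name below is an assumption about the downloaded archive.

```python
# Minimal loading sketch; the CSV file name is an assumption about the download.
import pandas as pd

reviews = pd.read_csv("dead_by_daylight_reviews.csv")
print(reviews["sentiment"].value_counts(normalize=True))                      # label balance
print(reviews.groupby("sentiment")["author_playtime_forever_min"].median())  # playtime by label
```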
Reviews were collected using the SirDarcanos/Steam-Reviews-Scraper script.
This dataset includes only publicly accessible user content and metadata.
Each record is factual and unaltered beyond format normalization.
Updates will be performed irregularly and only when new data is collected. Users are welcome to suggest improvements or request updates via the discussion section.
Created by Nicola Mustone.
This dataset and its author are not affiliated with, endorsed by, or sponsored by Valve Corporation or Behaviour Interactive Inc.
All product names, logos, brands, and trademarks are the property of their respective owners.
The data included in this dataset was collected from publicly available user reviews through the official Steam Web API, and is provided solely for educational and research purposes.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
GeoreviewClusteringP2P, an MTEB (Massive Text Embedding Benchmark) dataset
Review clustering based on Yandex Georeview dataset
Task category t2c
Domains Reviews, Written
Reference https://github.com/yandex/geo-reviews-dataset-2023
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code: import mteb
task = mteb.get_tasks(["GeoreviewClusteringP2P"]) evaluator = mteb.MTEB(task)
model =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/GeoreviewClusteringP2P.
This dataset was compiled for the Philosophy Data Project and used to develop the features available on that site. As a former philosophy teacher and now data scientist, I thought it would be interesting to apply the tools of data science to the history of philosophy.
The initial goal was to build a classification model with the data. After all, a book of philosophy represents an effort to systematically organize one's thought about the world. Using the data from the history of philosophy to classify texts would thus enable us to, by proxy, classify how people think about the world. Where some projects focus on sentiment analysis, here we focus on conceptual, or ideological analysis. Once we understand a person's worldview, there is no limit to what we can do with that information - from advertising to political campaigning through to self-exploration and therapy.
After that, I built several features to help people explore philosophical ideas and do comparisons. These included a w2v model for word use comparison, a set of basic stats for each text and school, and a feature enabling users to search the corpus.
After finishing initial work on the site and its data tools, I thought it would be worthwhile to make the data publicly available so others could work with it.
The dataset contains over 300,000 sentences from over 50 texts spanning 10 major schools of philosophy. The represented schools are: Plato, Aristotle, Rationalism, Empiricism, German Idealism, Communism, Capitalism, Phenomenology, Continental Philosophy, and Analytic Philosophy.
Texts were taken either from Project Gutenberg or from my own personal library of pdfs. The dataset is updated periodically as I add new texts to the corpus.
The texts were cleaned extensively before being tokenized and organized in the way they're presented here. For information on the cleaning steps, check out the github repo for the initial project, which contains a notebook with all the cleaning steps.
There are a ton of cool project ideas! Here are a few:
- use some clustering technique to see if the sentences would naturally cluster into their corresponding schools (a rough sketch follows below)
- build a text completion or chat-bot app by training on the sources
- compare the texts to secondary literature on the philosophers to see if the secondary literature gets the interpretation right
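For the first idea, one way to check whether unsupervised clusters line up with the school labels is sketched below; the CSV file name and column names are assumptions about the published data, not a documented schema.

```python
# Sketch for the first project idea: cluster sentences and compare the result
# to the school labels. The CSV name and column names are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

df = pd.read_csv("philosophy_data.csv")                  # assumed file name
sentences, schools = df["sentence_str"], df["school"]    # assumed column names

X = TfidfVectorizer(max_features=20_000, stop_words="english").fit_transform(sentences)
clusters = KMeans(n_clusters=schools.nunique(), random_state=0, n_init=10).fit_predict(X)

# Adjusted Rand Index: 1.0 means clusters match the schools exactly, ~0 means chance.
print(adjusted_rand_score(schools, clusters))
```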
If you come up with any cool visualizations or insights you want to share, please do contact me and we can definitely feature your work on the Philosophy Data Project website. Looking forward to seeing what you come up with :)
License: unknown (https://choosealicense.com/licenses/unknown/)
RuSciBenchGRNTIClusteringP2P, an MTEB (Massive Text Embedding Benchmark) dataset
Clustering of scientific papers (title+abstract) by rubric
Task category t2c
Domains Academic, Written
Reference https://github.com/mlsa-iai-msu-lab/ru_sci_bench/
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code: import mteb
task = mteb.get_task("RuSciBenchGRNTIClusteringP2P") evaluator = mteb.MTEB([task])
model =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/RuSciBenchGRNTIClusteringP2P.
License: other (https://choosealicense.com/licenses/other/)
LivedoorNewsClustering, an MTEB (Massive Text Embedding Benchmark) dataset
Clustering of news reports from the Japanese news site Livedoor News, collected by RONDHUIT Co., Ltd. in 2012. It contains over 7,000 news report texts across 9 categories (topics).
Task category t2c
Domains News, Written
Reference https://github.com/sbintuitions/JMTEB
Source datasets:
sbintuitions/JMTEB
How to evaluate on this task
You can evaluate an embedding model on this dataset using… See the full description on the dataset page: https://huggingface.co/datasets/mteb/LivedoorNewsClustering.