25 datasets found
  1. Sarcasm Detection Dataset

    • kaggle.com
    zip
    Updated Jan 20, 2025
    Cite
    RK (2025). Sarcasm Detection Dataset [Dataset]. https://www.kaggle.com/datasets/ruchikakumbhar/sarcasm-detection-dataset
    Explore at:
    Available download formats: zip (1670891 bytes)
    Dataset updated
    Jan 20, 2025
    Authors
    RK
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Sarcasm Detection:

    Steps:
    1. Load or prepare the dataset (dataset link: https://github.com/PawanKrGunjan/Natural-Language-Processing/blob/main/Sarcasm%20Detection/sarcasm.json)
    2. Preprocess the text if it is loaded from a file rather than added manually in the code
    3. Vectorize the text using TfidfVectorizer
    4. Reduce the dimensionality using PCA
    5. Cluster the documents
    6. Plot the clusters using matplotlib
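
    A minimal sketch of the pipeline described in the steps above (TF-IDF, PCA, k-means, matplotlib). The file path, the JSON-lines layout with a "headline" field, and the number of clusters are assumptions rather than part of the dataset description, so adjust the loading step to the actual file.

    import json

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Assumed layout: one JSON object per line with a "headline" field (as in the linked sarcasm.json).
    with open("sarcasm.json") as f:
        texts = [json.loads(line)["headline"] for line in f]

    # Steps 3-6: vectorize, reduce to 2-D, cluster, plot.
    X = TfidfVectorizer(stop_words="english").fit_transform(texts)
    X2 = PCA(n_components=2).fit_transform(X.toarray())
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X2)

    plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=5)
    plt.title("TF-IDF + PCA + k-means clusters")
    plt.show()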

  2. blurbs-clustering-p2p

    • huggingface.co
    Updated Apr 22, 2023
    + more versions
    Cite
    Silvan (2023). blurbs-clustering-p2p [Dataset]. https://huggingface.co/datasets/slvnwhrl/blurbs-clustering-p2p
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 22, 2023
    Authors
    Silvan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset can be used as a benchmark for clustering word embeddings for German. The dataset contains book titles and is based on the dataset from the GermEval 2019 Shared Task on Hierarchical Classification of Blurbs. It contains 18'084 unique samples, 28 splits with 177 to 16'425 samples and 4 to 93 unique classes. Splits are built similarly to MTEB's ArxivClusteringP2P. Have a look at the German Text Embedding Clustering Benchmark (GitHub, Paper) for more information, datasets and evaluation… See the full description on the dataset page: https://huggingface.co/datasets/slvnwhrl/blurbs-clustering-p2p.
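
    A minimal sketch for pulling the benchmark from the Hugging Face Hub. The split name and the column names ('sentences', 'labels') are assumptions based on typical MTEB-style clustering sets; check the dataset card for the exact schema.

    from datasets import load_dataset

    ds = load_dataset("slvnwhrl/blurbs-clustering-p2p", split="test")
    print(ds)            # clustering splits and their columns
    print(ds[0].keys())  # expected to expose e.g. 'sentences' and 'labels' per split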

  3. GSDMM: Short text clustering

    • kaggle.com
    zip
    Updated Nov 9, 2025
    Cite
    catherine (2025). GSDMM: Short text clustering [Dataset]. https://www.kaggle.com/datasets/ptfrwrd/gsdmm-short-text-clustering/discussion
    Explore at:
    Available download formats: zip (8048 bytes)
    Dataset updated
    Nov 9, 2025
    Authors
    catherine
    Description

    Dataset

    This dataset was created by catherine

    Contents

    From https://github.com/rwalk/gsdmm
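
    A minimal sketch of short-text clustering with the GSDMM implementation linked above (https://github.com/rwalk/gsdmm). The example documents and hyperparameters are illustrative assumptions.

    from gsdmm import MovieGroupProcess

    # Tokenised short documents (illustrative only).
    docs = [
        ["cheap", "flights", "to", "rome"],
        ["weather", "forecast", "rome"],
        ["best", "pizza", "recipe"],
        ["easy", "pasta", "recipe"],
    ]
    vocab_size = len({tok for doc in docs for tok in doc})

    # K is an upper bound on the number of clusters; GSDMM leaves many of them empty during sampling.
    mgp = MovieGroupProcess(K=8, alpha=0.1, beta=0.1, n_iters=30)
    labels = mgp.fit(docs, vocab_size)
    print(labels)  # cluster id assigned to each document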

  4. YouTube Video Subtitles

    • kaggle.com
    zip
    Updated Feb 5, 2025
    Cite
    Daniel Herman (2025). YouTube Video Subtitles [Dataset]. https://www.kaggle.com/datasets/jetakow/youtube-videos-subtitles/data
    Explore at:
    Available download formats: zip (42191918 bytes)
    Dataset updated
    Feb 5, 2025
    Authors
    Daniel Herman
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    YouTube
    Description

    Over 12k scraped YouTube EN subtitles for videos on GitHub topics.

    How? Based on the topics listed at https://github.com/topics, I searched YouTube with the phrase "What is {topic}?" and downloaded up to 100 video subtitles for each topic. The extracted text can be found in the dataset together with the topic name, video title and video URL.

    Why? I want to know whether we can rate videos based on their information value, especially when we use YouTube as an information source.

    You can find the source code here: https://github.com/detrin/text-info-value

  5. Data from: Automatic Definition of Robust Microbiome Sub-states in...

    • zenodo.org
    txt, zip
    Updated Jan 24, 2020
    Cite
    Beatriz García-Jiménez; Mark D. Wilkinson; Beatriz García-Jiménez; Mark D. Wilkinson (2020). Data from: Automatic Definition of Robust Microbiome Sub-states in Longitudinal Data [Dataset]. http://doi.org/10.5281/zenodo.167376
    Explore at:
    Available download formats: zip, txt
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Beatriz García-Jiménez; Mark D. Wilkinson; Beatriz García-Jiménez; Mark D. Wilkinson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Output files of the application of our R software (available at https://github.com/wilkinsonlab/robust-clustering-metagenomics) to different microbiome datasets already published.

    Prefixes:

    Suffixes:

    • _All: all taxa

    • _Dominant: only 1% most abundant taxa

    • _NonDominant: remaining taxa after removing above dominant taxa

    • _GenusAll: taxa aggregated at genus level

    • _GenusDominant: taxa aggregated at genus level, then only the 1% most abundant taxa selected

    • _GenusNonDominant: taxa aggregated at genus level, then the 1% most abundant taxa removed

    Each folder contains 3 output files related to the same input dataset:
    - data.normAndDist_definitiveClustering_XXX.RData: R data file with a) a phyloseq object (including OTU table, meta-data and cluster assigned to each sample); and b) a distance matrix object.
    - definitiveClusteringResults_XXX.txt: text file with assessment measures of the selected clustering.
    - sampleId-cluster_pairs_XXX.txt: text file with two comma-separated columns: sampleID,clusterID
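
    A minimal sketch for reading one of the sample-to-cluster files into a data frame; the concrete file name (the XXX part) is dataset-specific, so the one below is a placeholder.

    import pandas as pd

    # Two comma-separated columns, sampleID and clusterID, as described above;
    # adjust the header handling if the file turns out to have no header row.
    pairs = pd.read_csv("sampleId-cluster_pairs_XXX.txt")
    print(pairs.head())
    print(pairs.iloc[:, 1].value_counts())  # number of samples per microbiome sub-state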

    Abstract of the associated paper:

    The analysis of microbiome dynamics would allow us to elucidate patterns within microbial community evolution; however, microbiome state-transition dynamics have been scarcely studied. This is in part because a necessary first step in such analyses has not been well defined: how to deterministically describe a microbiome's "state". Clustering into states has been widely studied, although no standard has yet emerged. We propose a generic, domain-independent and automatic procedure to determine a reliable set of microbiome sub-states within a specific dataset, and with respect to the conditions of the study. The robustness of sub-state identification is established by combining diverse techniques for stable cluster verification. We reuse four distinct longitudinal microbiome datasets to demonstrate the broad applicability of our method, analysing results with different taxa subsets so the procedure can be adjusted to the application goal, and showing that the methodology provides a set of robust sub-states to examine in downstream studies of microbiome dynamics.

  6. tenkgnad-clustering-s2s

    • huggingface.co
    Updated Apr 21, 2023
    + more versions
    Cite
    Silvan (2023). tenkgnad-clustering-s2s [Dataset]. https://huggingface.co/datasets/slvnwhrl/tenkgnad-clustering-s2s
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 21, 2023
    Authors
    Silvan
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset can be used as a benchmark for clustering word embeddings for German. The dataset contains news article titles and is based on the dataset of the One Million Posts Corpus and 10kGNAD. It contains 10'267 unique samples, 10 splits with 1'436 to 9'962 samples and 9 unique classes. Splits are built similarly to MTEB's TwentyNewsgroupsClustering. Have a look at the German Text Embedding Clustering Benchmark (GitHub, Paper) for more information, datasets and evaluation results. If you use this… See the full description on the dataset page: https://huggingface.co/datasets/slvnwhrl/tenkgnad-clustering-s2s.
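
    A sketch of how such a clustering benchmark is typically scored: embed the titles, run k-means with the true number of classes, and compute the V-measure. The model name, split handling and column names are assumptions; see the dataset card and MTEB for the exact protocol.

    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    from sklearn.metrics import v_measure_score

    split = load_dataset("slvnwhrl/tenkgnad-clustering-s2s", split="test")[0]  # one clustering split
    texts, labels = split["sentences"], split["labels"]                        # assumed column names

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")       # any German-capable model
    embeddings = model.encode(texts)
    pred = KMeans(n_clusters=len(set(labels)), n_init=10, random_state=0).fit_predict(embeddings)
    print("V-measure:", v_measure_score(labels, pred))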

  7. Word Representations for Clinical Danish

    • figshare.com
    tar
    Updated May 27, 2020
    Cite
    Leon Derczynski (2020). Word Representations for Clinical Danish [Dataset]. http://doi.org/10.6084/m9.figshare.12377858.v1
    Explore at:
    Available download formats: tar
    Dataset updated
    May 27, 2020
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Leon Derczynski
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Word embeddings and word clusters for Clinical Danish, drawn from the heavily-anonymised E4C resource (https://doi.org/10.1177/1460458216647760) and presented here as statistical aggregate data over those records. Vocabulary of 382737 words. Vectors have 100 dimensions. Clusters were generated using Generalised Brown clustering with a=2500 and a minimum count of 3; coarser clusters can be generated rapidly from the included mergefile (see https://github.com/sean-chester/generalised-brown/blob/master/cluster_generator/cluster.py). A data statement is included.
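
    A minimal sketch for loading the vectors with gensim, assuming they are shipped in the common word2vec text format; both the file name and the format are assumptions, so check the archive contents first.

    from gensim.models import KeyedVectors

    # Placeholder file name; replace with the actual vector file from the tar archive.
    vectors = KeyedVectors.load_word2vec_format("clinical_danish_vectors.txt", binary=False)
    print(vectors.vector_size)                      # 100 dimensions, per the description
    print(vectors.most_similar("patient", topn=5))  # nearest neighbours of an example token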

  8. Dataset and trained models belonging to the article 'Distant reading...

    • data.niaid.nih.gov
    Updated Sep 28, 2021
    Cite
    Smits, Thomas; Ros, Ruben (2021). Dataset and trained models belonging to the article 'Distant reading patterns of iconicity in 940.000 online circulations of 26 iconic photographs' [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4244000
    Explore at:
    Dataset updated
    Sep 28, 2021
    Dataset provided by
    Luxembourg Centre for Contemporary and Digital History
    Utrecht University
    Authors
    Smits, Thomas; Ros, Ruben
    Description

    Quantifying Iconicity - Zenodo

    The Dataset

    This dataset contains the material collected for the article "Distant reading 940,000 online circulations of 26 iconic photographs" (to be) published in New Media & Society (DOI: 10.1177/14614448211049459). We identified 26 iconic photographs based on earlier work (Van der Hoeven, 2019). The Google Cloud Vision (GCV) API was subsequently used to identify webpages that host a reproduction of the iconic image. The GCV API uses computer vision methods and the Google index to retrieve these reproductions. The code for calling the API and parsing the data can be found on GitHub: https://github.com/rubenros1795/ReACT_GCV.

    The core dataset consists of .tsv files with the URLs that refer to the webpages. Other metadata provided by the GCV API, together with manually generated metadata, is also found in the files. This includes:
    • the URL that refers specifically to the image; this can be a URL that refers to a full match or a partial match
    • the title of the page
    • the iteration number. Because the GCV API puts a limit on its output, we had to reupload the identified images to the API to extend our search. We continued these iterations until no more new unique URLs were found
    • the language found by the langid Python module, along with the normalized score
    • the labels associated with the image by Google
    • the scrape date

    Alongside the .tsv-files, there are several other elements in the following folder structure:

    ├── data
    │  ├── embeddings
    │        └── doc2vec
    │        └── input-text
    │        └── metadata
    │        └── umap
    │  └── evaluation
    │  └── results
    │        └── diachronic-plots
    │        └── top-words
    │  └── tsv
    
    1. The /embeddings folder contains the doc2vec models, the training input for the models, the metadata (id, URL, date) and the UMAP embeddings used in the GMM clustering. Please note that the date parser was not able to find dates for all webpages and for this reason not all training texts have associated metadata.
    2. The /evaluation folder contains the AIC and BIC scores for GMM clustering with different numbers of clusters.
    3. The /results folder contains the top words associated with the clusters and the diachronic cluster prominence plots.

    Data Cleaning and Curation

    Our pipeline contained several interventions to prevent noise in the data. First, in between the iterations we manually checked the scraped photos for relevance. We did so because reuploading an iconic image that is paired with another, irrelevant one results in reproductions of the irrelevant one in the next iteration. Because we did not catch all noise, we used Scale Invariant Feature Transform (SIFT), a basic computer vision algorithm, to remove images that did not meet a threshold of ten keypoints. By doing so we removed completely unrelated photographs, but left room for variations of the original (such as painted versions of Che Guevara, or cropped versions of the Napalm Girl image). Another issue was the parsing of webpage texts. After experimenting with different webpage parsers that aim to extract 'relevant' text, it proved too difficult to use one solution for all our webpages. Therefore we simply parsed all the text contained in commonly used HTML tags.
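
    A sketch of the clustering approach described above (UMAP projection of doc2vec vectors followed by Gaussian mixture modelling, with the number of components chosen by an information criterion). The input file, dimensionality and candidate cluster counts are illustrative assumptions, not the authors' exact settings.

    import numpy as np
    import umap
    from sklearn.mixture import GaussianMixture

    doc_vectors = np.load("doc2vec_vectors.npy")  # placeholder: one doc2vec vector per webpage text
    embedding = umap.UMAP(n_components=5, random_state=0).fit_transform(doc_vectors)

    # Fit GMMs for a range of cluster counts and keep the one with the lowest BIC
    # (the dataset's /evaluation folder stores AIC and BIC scores for this step).
    bic_scores = {}
    for k in range(5, 55, 5):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(embedding)
        bic_scores[k] = gmm.bic(embedding)

    best_k = min(bic_scores, key=bic_scores.get)
    clusters = GaussianMixture(n_components=best_k, random_state=0).fit_predict(embedding)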

  9. Comparing the crisis of 806/1403-4 and the Fatimid fitna...

    • data.europa.eu
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). Comparing the crisis of 806/1403-4 and the Fatimid fitna (450-466/1058-1073): al-Maqrīzī as a historian of the Fatimids - Datasets [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7062745?locale=ga
    Explore at:
    Available download formats: unknown (151)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo: http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Fatimid Caliphate
    Description

    Accompanying datasets that are referenced in the journal article 'Comparing the crisis of 806/1403-4 and the Fatimid fitna (450-466/1058-1073): al-Maqrīzī as a historian of the Fatimids - Datasets'. The texts used in the analysis are taken from the OpenITI corpus release (Version 2021.2.5). If the ID (the final part of the text URI) has changed from the OpenITI release to this data release, then the text has been modified for this case study. File extensions following the text URI indicate that the text has had additional tags applied (either date tags, or text reuse cluster tags). CSV file names indicate the text file from which each CSV was generated. This is a published part of an active research project. For other datasets and the scripts used to generate this data, see the relevant GitHub repository: https://github.com/mabarber92/fitna-study

  10. Data from: Dataset for Vector space model and the usage patterns of...

    • figshare.com
    bin
    Updated May 30, 2023
    Cite
    Gede Primahadi Wijaya Rajeg; Karlina Denistia; Simon Musgrave (2023). Dataset for Vector space model and the usage patterns of Indonesian denominal verbs [Dataset]. http://doi.org/10.6084/m9.figshare.8187155.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Gede Primahadi Wijaya Rajeg; Karlina Denistia; Simon Musgrave
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Preface

    This is the data repository for the paper accepted for publication in NUSA's special issue on Linguistic studies using large annotated corpora (co-edited by Hiroki Nomoto and David Moeljadi).

    How to cite the dataset

    If you use, adapt, and/or modify any of the datasets in this repository for your research or teaching purposes (except for the malindo_dbase, see below), please cite as: Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019). Dataset for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Fileset. https://doi.org/10.6084/m9.figshare.8187155. Alternatively, click on the dark pink Cite button to browse different citation styles (the default is DataCite).

    The malindo_dbase data in this repository is from Nomoto et al. (2018) (cf. the GitHub repository), so please also cite their work if you use it for your research: Nomoto, Hiroki, Hannah Choi, David Moeljadi and Francis Bond. 2018. MALINDO Morph: Morphological dictionary and analyser for Malay/Indonesian. Kiyoaki Shirai (ed.), Proceedings of the LREC 2018 Workshop "The 13th Workshop on Asian Language Resources", 36-43.

    A tutorial on how to use the data, together with the R Markdown Notebook for the analyses, is available on GitHub and figshare: Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019). R Markdown Notebook for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Software. https://doi.org/10.6084/m9.figshare.9970205

    Dataset description

    1. Leipzig_w2v_vector_full.bin is the vector space model used in the paper. We built it using the wordVectors package (Schmidt & Li 2017) via the MonARCH High Performance Computing Cluster (we thank Philip Chan for his help with access to MonARCH).
    2. Files beginning with ngramexmpl_... contain the n-grams (i.e. word sequences) of the verbs discussed in the paper. The files are in tab-separated format.
    3. Files beginning with sentence_... contain full sentences for the verbs discussed in the paper (in plain-text and R dataset [.rds] formats). Information on the corpus file and sentence number in which each verb is found is included.
    4. me_parsed_nountaggedbase (in three different file formats) contains a database of the me- words with noun-tagged roots that MorphInd identified as occurring in the three morphological schemas we focus on (me-, me-/-kan, and me-/-i). The database has columns for the verbs' token frequency in the corpus, root forms, and MorphInd parsing output, among others.
    5. wordcount_leipzig_allcorpus (in three different file formats) contains information on the size of each corpus file used in the paper and from which the vector space model is built.
    6. wordlist_leipzig_ME_DI_TER_percorpus.tsv is a tab-separated frequency list of words prefixed with me-, di-, and ter- in all thirteen corpus files used. The wordlist is built by first tokenising each corpus file, lowercasing the tokens, and then extracting the words with the corresponding three prefixes using the following regular expressions (a Python rendering is sketched below):
       • For me-: ^(?i)(me)([a-z-]{3,})$
       • For di-: ^(?i)(di)([a-z-]{3,})$
       • For ter-: ^(?i)(ter)([a-z-]{3,})$
    7. malindo_dbase is the MALINDO Morphological Dictionary (see above).

    References

    Schmidt, Ben & Jian Li. 2017. wordVectors: Tools for creating and analyzing vector-space models of texts. R package. http://github.com/bmschmidt/wordVectors.
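
    A minimal Python rendering of the prefix-extraction step quoted above. The token list is illustrative only; in the dataset this runs over the tokenised, lowercased Leipzig corpus files. Note that recent Python versions reject an inline (?i) flag that is not at the very start of the pattern, so the equivalent re.IGNORECASE flag is used instead.

    import re
    from collections import Counter

    patterns = {
        "me-":  re.compile(r"^(me)([a-z-]{3,})$", re.IGNORECASE),
        "di-":  re.compile(r"^(di)([a-z-]{3,})$", re.IGNORECASE),
        "ter-": re.compile(r"^(ter)([a-z-]{3,})$", re.IGNORECASE),
    }

    tokens = ["membaca", "dibaca", "terbaca", "buku", "menulis"]  # illustrative tokens, not from the corpus

    counts = {prefix: Counter() for prefix in patterns}
    for tok in tokens:
        tok = tok.lower()
        for prefix, pattern in patterns.items():
            if pattern.match(tok):
                counts[prefix][tok] += 1

    print(counts)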

  11. Ground truth labels - Amazon movie reviews dataset

    • kaggle.com
    zip
    Updated Jul 8, 2017
    Cite
    Konstantinos Bazakos (2017). Ground truth labels - Amazon movie reviews dataset [Dataset]. https://www.kaggle.com/thebuzz/ground-truth-labels-amazon-movie-reviews-dataset
    Explore at:
    Available download formats: zip (6829166 bytes)
    Dataset updated
    Jul 8, 2017
    Authors
    Konstantinos Bazakos
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Addition of ground truth labels on Amazon movie reviews

    [Image: http://i.imgur.com/aDVUwMz.png]

    What is it?

    This is a side project for my thesis “Classification/Clustering Techniques for Large Web Data Collections”.

    My main goal was to provide a new, enriched, ground-truth-labeled dataset to the Machine Learning community. All labels have been collected by crawling/scraping Amazon.com over a period of some months. By labels I mean the categories in which the products are classified (see the green underlined labels in the screenshot below).

    [Image: http://i.imgur.com/mAiuoO6.png]

    Please, if you feel you can make any contribution that will improve this dataset, fork it on github.com.

    The original dataset

    The Amazon Movies Reviews dataset consists of 7,911,684 reviews that Amazon users left between Aug 1997 and Oct 2012.

    Data format:

    • product/productId: B00006HAXW
    • review/userId: A1RSDE90N6RSZF
    • review/profileName: Joseph M. Kotow
    • review/helpfulness: 9/9
    • review/score: 5.0
    • review/time: 1042502400
    • review/summary: Pittsburgh - Home of the OLDIES
    • review/text: I have all of the doo wop DVD's and this one is as good or better than the 1st ones. Remember once these performers are gone, we'll never get to see them again. Rhino did an excellent job and if you like or love doo wop and Rock n Roll you'll LOVE this DVD!!

    where:

    • product/productId: asin, e.g. amazon.com/dp/B00006HAXW
    • review/userId: id of the user, e.g. A1RSDE90N6RSZF
    • review/profileName: name of the user
    • review/helpfulness: fraction of users who found the review helpful
    • review/score: rating of the product
    • review/time: time of the review (unix time)
    • review/summary: review summary
    • review/text: text of the review

    The new labeled dataset

    All the collected data (for every ASIN of the SNAP Dataset, ~253k products for ~8m reviews) are stored in a csv file labels.csv in the following format:

    • ASIN: unique identifier for the product
    • Categories: [label, label, label,..., label]

    The new data format will be:

    • product/productId: B00006HAXW
    • review/userId: A1RSDE90N6RSZF
    • review/profileName: Joseph M. Kotow
    • review/helpfulness: 9/9
    • review/score: 5.0
    • review/time: 1042502400
    • review/summary: Pittsburgh - Home of the OLDIES
    • review/text: I have all of the doo wop DVD's and this one is as good or better than the 1st ones. Remember once these performers are gone, we'll never get to see them again. Rhino did an excellent job and if you like or love doo wop and Rock n Roll you'll LOVE this DVD!!
    • product/categories: ['CDs & Vinyl', 'Pop', 'Oldies', 'Doo Wop']

    Instructions

    You can follow the steps mentioned below on how to get the enriched dataset:

    1. Download the original dataset from the SNAP website (~3.3 GB compressed) and put it in the root folder of the repository (where you can also find the labels.csv file).

    2. Execute the Python file enrich.py (available in the GitHub project) so that the new enriched, multi-labeled dataset is exported. The name of the new file should be output.txt.gz.

    Notice: Please be patient, as the Python script will take a while to parse all these reviews.

    The Python script generates a new compressed file that is identical to the original one, but with an extra feature (product/categories).

    In fact, the script applies a mapping between the ASIN values in both files and adds the product's label data to every review instance of that product, as an extra column.

    Here is the code:

    import gzip
    import csv
    import ast
    
    def look_up(asin, diction):
      # Return the list of category labels for an ASIN, or an empty list if it is unknown.
      try:
        return diction[asin]
      except KeyError:
        return []
    
    def load_labels():
      # Build a dict mapping ASIN -> list of category labels from labels.csv.
      labels_dictionary = {}
      with open('labels.csv', mode='r') as infile:
        csvreader = csv.reader(infile)
        next(csvreader)  # skip the header row
        for rows in csvreader:
          labels_dictionary[rows[0]] = ast.literal_eval(rows[1])
      return labels_dictionary
    
    def parse(filename):
      # Stream the SNAP review file and yield one dict per review,
      # enriched with the product/categories field.
      labels_dict = load_labels()
      f = gzip.open(filename, 'rt', encoding='latin-1')  # text mode; latin-1 copes with non-UTF-8 bytes
      entry = {}
      for l in f:
        l = l.strip()
        colonPos = l.find(':')
        if colonPos == -1:
          # A blank line marks the end of a review record.
          yield entry
          entry = {}
          continue
        eName = l[:colonPos]
        rest = l[colonPos+2:]
        entry[eName] = rest
        if eName == 'product/productId':
          entry['product/categories'] = look_up(rest, labels_dict)
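
    Example usage (a sketch; 'movies.txt.gz' is a placeholder for the SNAP review file downloaded in step 1):

    # Stream the enriched records and print each product id together with its categories.
    for entry in parse('movies.txt.gz'):
        if entry:
            print(entry.get('product/productId'), entry.get('product/categories'))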
    
  12. Simulation data on the growth of atmospheric molecular clusters and...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Mar 20, 2022
    Cite
    Kontkanen, Jenni; Olenius, Tinja; Stolzenburg, Dominik; Lehtipalo, Katrianne; Riipinen, Ilona (2022). Simulation data on the growth of atmospheric molecular clusters and particles [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6370140
    Explore at:
    Dataset updated
    Mar 20, 2022
    Dataset provided by
    Department of Environmental Science (ACES) and Bolin Centre for Climate Research
    University of Helsinki, Finnish Meteorological Institute
    Swedish Meteorological and Hydrological Institute
    University of Helsinki
    Authors
    Kontkanen, Jenni; Olenius, Tinja; Stolzenburg, Dominik; Lehtipalo, Katrianne; Riipinen, Ilona
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains output data from cluster population simulations performed with Atmospheric Cluster Dynamics Code (ACDC) model, which simulates the formation of clusters from atmospheric vapors and the growth of these clusters by further molecular and cluster-cluster collisions. The data can be used for investigating the formation and growth of atmospheric particles from inorganic and organic vapors.

    The data is the output of a computational process model and hence does not represent a specific time period or location. Simulation sets are calculated for a one- or two-component system containing a quasi-unary inorganic compound representing a mixture of sulfuric acid and ammonia (SA) and/or oxidized organic vapors corresponding to a low-volatility organic compound (LVOC) and an extremely low-volatility organic compound (ELVOC). The external conditions in the simulations correspond to those in the CLOUD (Cosmics Leaving Outdoor Droplets) chamber at a temperature of 5 °C.

    Data are provided for 14 simulations.

    References

    Kontkanen J, Stolzenburg D, Olenius T, Yan C, Dada L, Ahonen L, Simon M, Lehtipalo K, Riipinen I (2022) What controls the observed size-dependency of the growth rates of sub-10 nm atmospheric particles? Environ. Sci.: Atmos. https://doi.org/10.1039/d1ea00103e

    Olenius T, Riipinen I (2017) Molecular-resolution simulations of new particle formation: Evaluation of common assumptions made in describing nucleation in aerosol dynamics models. Aerosol Sci. Tech. 51:397–408. https://doi.org/10.1080/02786826.2016.1262530

    Olenius T, Atmospheric Cluster Dynamics Code. https://github.com/tolenius/ACDC

    McGrath MJ et al. (2012) Atmospheric Cluster Dynamics Code: a flexible method for solution of the birth-death equations. Atmos. Chem. Phys. 12:2345–2355. https://doi.org/10.5194/acp-12-2345-2012

    Data description

    The data is in the form of text files. The provided data files (total compressed size ~10 GB) correspond to simulation output from the ACDC model. Simulation sets are shown in the table below and further described in Kontkanen et al. (2022). For the interpretation of the model output, the interested user is referred to the manual of the ACDC model (https://github.com/tolenius/ACDC).

    Simulation set | Model compounds | Vapor concentrations (cm⁻³) | Method to retrieve evaporation rates
    1 | SA | C_SA = 8.0×10⁶, 2.0×10⁷, 4.7×10⁷, 1.1×10⁸ | Kelvin eq. (classical evaporation rates)
    2 | SA | C_SA = 2.0×10⁷, 4.7×10⁷, 1.1×10⁸ | QC data and Kelvin eq. (non-classical evaporation rates)
    3 | LVOC | C_LVOC = 5.0×10⁷, 1×10⁸ | Kelvin eq. (classical evaporation rates)
    4 | LVOC, ELVOC | C_LVOC = 5.0×10⁷, 1×10⁸; C_ELVOC = 1.0×10⁷ | Kelvin eq. (classical evaporation rates)
    5 | LVOC, SA | C_LVOC = 2.0×10⁷, 5.0×10⁷, 1×10⁸; C_SA = 8.0×10⁶ | Kelvin eq. (classical evaporation rates)

  13. 190k+ Medium Articles

    • kaggle.com
    zip
    Updated Apr 26, 2022
    + more versions
    Cite
    Fabio Chiusano (2022). 190k+ Medium Articles [Dataset]. https://www.kaggle.com/datasets/fabiochiusano/medium-articles
    Explore at:
    Available download formats: zip (386824829 bytes)
    Dataset updated
    Apr 26, 2022
    Authors
    Fabio Chiusano
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data source

    This data has been collected through a standard scraping process from the Medium website, looking for published articles.

    Data description

    Each row in the data is a different article published on Medium. For each article, you have the following features:
    • title [string]: The title of the article.
    • text [string]: The text content of the article.
    • url [string]: The URL associated with the article.
    • authors [list of strings]: The article authors.
    • timestamp [string]: The publication datetime of the article.
    • tags [list of strings]: List of tags associated with the article.

    Data analysis

    You can find a very quick data analysis in this notebook.

    What can I do with this data?

    • A multilabel classification model that assigns tags to articles.
    • A seq2seq model that generates article titles.
    • Text analysis.
    • Finetune text generation models on the general domain of Medium, or on specific domains by filtering articles by the appropriate tags.

    Collection methodology

    Scraping has been done with Python and the requests library. Starting from a random article on Medium, the next articles to scrape are selected by visiting: 1. The author archive pages. 2. The publication archive pages (if present). 3. The tags archives (if present).

    The article HTML pages have been parsed with the newspaper Python library.

    Published articles have been filtered for English articles only, using the Python langdetect library.

    As a consequence of the collection methodology, the scraped articles do not come from a uniform publication-date distribution. This means that there are articles published in 2016 and in 2022, but the number of articles in this dataset published in 2016 is not the same as the number published in 2022. In particular, there is a strong prevalence of articles published in 2020. Have a look at the accompanying notebook to see the distribution of the publication dates.
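
    A minimal sketch for the tag-filtering use case mentioned above (e.g. to build a domain-specific fine-tuning corpus). The CSV file name is a placeholder, and the assumption that the tags column is stored as a string representation of a Python list (hence ast.literal_eval) should be checked against the actual export.

    import ast

    import pandas as pd

    df = pd.read_csv("medium_articles.csv")           # placeholder file name
    df["tags"] = df["tags"].apply(ast.literal_eval)   # parse the stringified list of tags
    subset = df[df["tags"].apply(lambda tags: "Data Science" in tags)]
    print(len(subset), "articles tagged 'Data Science'")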

  14. TenKGnadClusteringP2P.v2

    • huggingface.co
    + more versions
    Cite
    Massive Text Embedding Benchmark, TenKGnadClusteringP2P.v2 [Dataset]. https://huggingface.co/datasets/mteb/TenKGnadClusteringP2P.v2
    Explore at:
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    TenKGnadClusteringP2P.v2 An MTEB dataset Massive Text Embedding Benchmark

    Clustering of news article titles+subheadings+texts. Clustering of 10 splits on the news article category.

    Task category t2c

    Domains News, Non-fiction, Written

    Reference https://tblock.github.io/10kGNAD/

      How to evaluate on this task
    

    You can evaluate an embedding model on this dataset using the following code: import mteb

    task = mteb.get_task("TenKGnadClusteringP2P.v2") evaluator… See the full description on the dataset page: https://huggingface.co/datasets/mteb/TenKGnadClusteringP2P.v2.
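
    A complete, hedged version of the truncated snippet above, using the mteb package together with sentence-transformers; the model name is an arbitrary example, not part of the dataset card.

    import mteb
    from sentence_transformers import SentenceTransformer

    task = mteb.get_task("TenKGnadClusteringP2P.v2")
    evaluator = mteb.MTEB(tasks=[task])
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    results = evaluator.run(model, output_folder="results")
    print(results)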

  15. Metadata, Title Pages, and Network Graph of the Digitized Content of the...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jul 25, 2024
    Cite
    Zellhöfer, David (2024). Metadata, Title Pages, and Network Graph of the Digitized Content of the Berlin State Library (146,000 items) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_2582481
    Explore at:
    Dataset updated
    Jul 25, 2024
    Authors
    Zellhöfer, David
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Berlin
    Description

    The data set has been downloaded via the OAI-PMH endpoint of the Berlin State Library/Staatsbibliothek zu Berlin’s Digitized Collections (https://digital.staatsbibliothek-berlin.de/oai) on March 1st 2019 and converted into common tabular formats on the basis of the provided Dublin Core metadata. It contains 146,000 records.

    In addition to the bibliographic metadata, representative images of the works have been downloaded, resized to a 512 pixel maximum thumbnail image and saved in JPEG format. The image data is split into title pages and first pages. Title pages have been derived from structural metadata created by scan operators and librarians. If this information was not available, first pages of the media have been downloaded. In case of multi-volume media, title pages are not available.

    In total, 141,206 title/first-page images are available.

    Furthermore, the tabular data has been cleaned and extended with geo-spatial coordinates provided by the OpenStreetMap project (https://www.openstreetmap.org). The actual data processing steps are summarized in the next section. For the sake of transparency and reproducibility, the original data taken from the OAI-PMH endpoint is still present in the table.

    Finally, various graphs in GML file format are available that can be loaded directly into graph analysis tools such as Gephi (https://gephi.org/).

    The implementation of the data processing steps (incl. graph creation) are available as a Jupyter notebook provided at https://github.com/elektrobohemian/SBBrowse2018/blob/master/DataProcessing.ipynb.

    Tabular Metadata

    The metadata is available in Excel (cleanedData.xlsx) and CSV (cleanedData.csv) file formats with equal content.

    The table contains the following columns; the columns shown in italics on the dataset page have not been processed.

    · title: The title of the medium
    · creator: Its creator (family name, first name)
    · subject: A collection's name as provided by the library
    · type: The type of medium
    · format: A MIME type for full metadata download
    · identifier: An additional identifier (most often the PPN)
    · language: A 3-letter language code of the medium
    · date: The date of creation/publication or a time span
    · relation: A relation to a project or collection a medium has been digitized for
    · coverage: The location of publication or origin (ranging from cities to continents)
    · publisher: The publisher of the medium
    · rights: Copyright information
    · PPN: The unique identifier that can be used to find more information about the current medium in all information systems of Berlin State Library/Staatsbibliothek zu Berlin
    · spatialClean: In case of multiple entries in coverage, only the first place of origin has been extracted. Additionally, characters such as question marks, brackets, or the like have been removed. The entries have been normalized regarding whitespaces and writing variants with the help of regular expressions
    · dateClean: As the original date may contain various format variants to indicate unclear creation dates (e.g., time spans or question marks), this field contains a mapping to a certain point in time
    · spatialCluster: The cluster ID determined with the help of the Jaro-Winkler distance on the spatialClean string. This step is needed because the spatialClean fields still contain a huge amount of orthographic variants and latinizations of geographic names
    · spatialClusterName: A verbal cluster name (controlled manually)
    · latitude: The latitude provided by OpenStreetMap for the spatialClusterName if the location could be found
    · longitude: The longitude provided by OpenStreetMap for the spatialClusterName if the location could be found
    · century: A century derived from the date
    · textCluster: A text cluster ID on the basis of a k-means clustering relying on the title field with a vocabulary size of 125,000 using the tf*idf model and k=5,000
    · creatorCluster: A text cluster ID based on the creator field with k=20,000
    · titleImage: The path to the first/title page relative to the img/ subdirectory, or None in case of a multi-volume work

    Other Data

    graphs.zip

    Various pre-computed graphs.

    img.zip

    First and title pages in JPEG format.

    json.zip

    JSON files for each record in the following format:

    {
      "ppn": "PPN57346250X",
      "dateClean": "1625",
      "title": "M. Georgii Gutkii, Gymnasii Berlinensis Rectoris Habitus Primorum Principiorum, Seu Intelligentia; Annexae Sunt Appendicis loco Disputationes super eodem habitu tum in Academia Wittebergensi, tum in Gymnasio Berlinensi ventilatae",
      "creator": "Gutke, Georg",
      "spatialClusterName": "Berlin",
      "spatialClean": "Berolini",
      "spatialRaw": "Berolini",
      "mediatype": "monograph",
      "subject": "Historische Drucke",
      "publisher": "Kallius",
      "lat": "52.5170365",
      "lng": "13.3888599",
      "textCluster": "45",
      "creatorCluster": "5040",
      "titleImage": "titlepages/PPN57346250X.jpg"
    }

  16. Steam Reviews English - Dead by Daylight

    • kaggle.com
    zip
    Updated Nov 20, 2025
    Cite
    Nicola Mustone (2025). Steam Reviews English - Dead by Daylight [Dataset]. https://www.kaggle.com/datasets/nicolamustone/steam-reviews-english-dead-by-daylight
    Explore at:
    Available download formats: zip (22155467 bytes)
    Dataset updated
    Nov 20, 2025
    Authors
    Nicola Mustone
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Steam Reviews — Dead by Daylight (App 381210)

    A dataset of 277,439 English-only Steam user reviews for Dead by Daylight from 2019 to November 2025, collected through the official Steam API. Each row represents a single review, including sentiment labels, playtime, and engagement metrics.
    This dataset is ideal for natural language processing, sentiment analysis, and behavioral data studies.

    A separate CSV with all the patches released for Dead by Daylight is included in the download for your convenience.

    Dataset Summary

    Field | Description
    review | Full review text
    sentiment | 1 = positive review, 0 = negative
    purchased | 1 if purchased on Steam
    received_for_free | 1 if the game was received for free
    votes_up | Number of helpful votes
    votes_funny | Number of “funny” votes
    date_created | Review creation date (YYYY-MM-DD, UTC)
    date_updated | Last update date (YYYY-MM-DD, UTC)
    author_num_games_owned | Total games owned by reviewer
    author_num_reviews | Total reviews written by reviewer
    author_playtime_forever_min | Total playtime in minutes
    author_playtime_at_review_min | Playtime when the review was written (minutes)

    Example Use Cases

    • Sentiment Analysis: Train classifiers using user tone and voting patterns (a minimal baseline is sketched after this list).
    • Text Embeddings: Extract embeddings for clustering or topic modeling.
    • Behavioral Correlation: Relate sentiment to playtime or review length.
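
    A minimal sketch for the sentiment-analysis use case: a TF-IDF plus logistic-regression baseline on the review and sentiment columns from the field table above. The CSV file name is a placeholder.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("steam_reviews_dead_by_daylight.csv").dropna(subset=["review"])  # placeholder file name
    X_train, X_test, y_train, y_test = train_test_split(
        df["review"], df["sentiment"], test_size=0.2, random_state=0
    )

    vec = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_train), y_train)
    print("accuracy:", accuracy_score(y_test, clf.predict(vec.transform(X_test))))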

    Data Source

    Reviews were collected using the SirDarcanos/Steam-Reviews-Scraper script.

    This dataset includes only publicly accessible user content and metadata.
    Each record is factual and unaltered beyond format normalization.

    Licensing

    • Dataset: MIT License
      Free for commercial and non-commercial use with attribution.
    • Collection Script: GPLv3 License
      Ensures derivative software remains open-source.

    Update Schedule

    Updates will be performed irregularly and only when new data is collected. Users are welcome to suggest improvements or request updates via the discussion section.

    Credits

    Created by Nicola Mustone.

    Disclaimer

    This dataset and its author are not affiliated with, endorsed by, or sponsored by Valve Corporation or Behaviour Interactive Inc.

    All product names, logos, brands, and trademarks are the property of their respective owners.

    The data included in this dataset was collected from publicly available user reviews through the official Steam Web API, and is provided solely for educational and research purposes.

  17. GeoreviewClusteringP2P

    • huggingface.co
    Updated May 11, 2025
    + more versions
    Cite
    Massive Text Embedding Benchmark (2025). GeoreviewClusteringP2P [Dataset]. https://huggingface.co/datasets/mteb/GeoreviewClusteringP2P
    Explore at:
    Dataset updated
    May 11, 2025
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    GeoreviewClusteringP2P An MTEB dataset Massive Text Embedding Benchmark

    Review clustering based on Yandex Georeview dataset

    Task category t2c

    Domains Reviews, Written

    Reference https://github.com/yandex/geo-reviews-dataset-2023

      How to evaluate on this task
    

    You can evaluate an embedding model on this dataset using the following code: import mteb

    task = mteb.get_tasks(["GeoreviewClusteringP2P"]) evaluator = mteb.MTEB(task)

    model =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/GeoreviewClusteringP2P.

  18. History of Philosophy

    • kaggle.com
    zip
    Updated Mar 31, 2021
    Cite
    Kourosh Alizadeh (2021). History of Philosophy [Dataset]. https://www.kaggle.com/kouroshalizadeh/history-of-philosophy
    Explore at:
    Available download formats: zip (57826536 bytes)
    Dataset updated
    Mar 31, 2021
    Authors
    Kourosh Alizadeh
    Description

    Context

    This dataset was compiled for the Philosophy Data Project and used to develop the features available on that site. As a former philosophy teacher and now data scientist, I thought it would be interesting to apply the tools of data science to the history of philosophy.

    The initial goal was to build a classification model with the data. After all, a book of philosophy represents an effort to systematically organize one's thought about the world. Using the data from the history of philosophy to classify texts would thus enable us to, by proxy, classify how people think about the world. Where some projects focus on sentiment analysis, here we focus on conceptual, or ideological analysis. Once we understand a person's worldview, there is no limit to what we can do with that information - from advertising to political campaigning through to self-exploration and therapy.

    After that, I built several features to help people explore philosophical ideas and do comparisons. These included a w2v model for word use comparison, a set of basic stats for each text and school, and a feature enabling users to search the corpus.

    After finishing initial work on the site and its data tools, I thought it would be worthwhile to make the data publicly available so others could work with it.

    Content

    The dataset contains over 300,000 sentences from over 50 texts spanning 10 major schools of philosophy. The represented schools are: Plato, Aristotle, Rationalism, Empiricism, German Idealism, Communism, Capitalism, Phenomenology, Continental Philosophy, and Analytic Philosophy.

    Texts were taken either from Project Gutenberg or from my own personal library of pdfs. The dataset is updated periodically as I add new texts to the corpus.

    The texts were cleaned extensively before being tokenized and organized in the way they're presented here. For information on the cleaning steps, check out the github repo for the initial project, which contains a notebook with all the cleaning steps.

    Inspiration

    There are a ton of cool project ideas! Here are a few:
    • use some clustering technique to see if the sentences would naturally cluster into their corresponding schools
    • build a text completion or chat-bot app by training on the sources
    • compare the texts to secondary literature on the philosophers to see if the secondary literature gets the interpretation right

    If you come up with any cool visualizations or insights you want to share, please do contact me and we can definitely feature your work on the Philosophy Data Project website. Looking forward to seeing what you come up with :)

  19. RuSciBenchGRNTIClusteringP2P

    • huggingface.co
    Updated Jun 21, 2025
    + more versions
    Cite
    Massive Text Embedding Benchmark (2025). RuSciBenchGRNTIClusteringP2P [Dataset]. https://huggingface.co/datasets/mteb/RuSciBenchGRNTIClusteringP2P
    Explore at:
    Dataset updated
    Jun 21, 2025
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    https://choosealicense.com/licenses/unknown/

    Description

    RuSciBenchGRNTIClusteringP2P An MTEB dataset Massive Text Embedding Benchmark

    Clustering of scientific papers (title+abstract) by rubric

    Task category t2c

    Domains Academic, Written

    Reference https://github.com/mlsa-iai-msu-lab/ru_sci_bench/

      How to evaluate on this task
    

    You can evaluate an embedding model on this dataset using the following code: import mteb

    task = mteb.get_task("RuSciBenchGRNTIClusteringP2P") evaluator = mteb.MTEB([task])

    model =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/RuSciBenchGRNTIClusteringP2P.

  20. LivedoorNewsClustering

    • huggingface.co
    Updated Sep 9, 2025
    Cite
    Massive Text Embedding Benchmark (2025). LivedoorNewsClustering [Dataset]. https://huggingface.co/datasets/mteb/LivedoorNewsClustering
    Explore at:
    Dataset updated
    Sep 9, 2025
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    https://choosealicense.com/licenses/other/

    Description

    LivedoorNewsClustering An MTEB dataset Massive Text Embedding Benchmark

    Clustering of news reports from the Japanese news site Livedoor News, collected by RONDHUIT Co., Ltd. in 2012. It contains over 7,000 news report texts across 9 categories (topics).

    Task category t2c

    Domains News, Written

    Reference https://github.com/sbintuitions/JMTEB

    Source datasets:

    sbintuitions/JMTEB

      How to evaluate on this task
    

    You can evaluate an embedding model on this dataset using… See the full description on the dataset page: https://huggingface.co/datasets/mteb/LivedoorNewsClustering.
