CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Sarcasm Detection:
Steps:
1. Load or prepare the dataset [dataset link: https://github.com/PawanKrGunjan/Natural-Language-Processing/blob/main/Sarcasm%20Detection/sarcasm.json]
2. Preprocess the text if it is loaded from a file instead of added manually to the code
3. Vectorize the text using TfidfVectorizer
4. Reduce the dimensionality using PCA
5. Cluster the documents
6. Plot the clusters using matplotlib
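A minimal sketch of the six steps above is shown below. The file layout (JSON Lines with a "headline" field) and the number of clusters are assumptions, not taken from the original notebook.

```python
# Minimal sketch of the steps above; file layout, cluster count and
# preprocessing choices are assumptions, not the author's exact code.
import json
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# 1./2. Load the dataset, keeping the raw headline text
with open("sarcasm.json") as f:
    headlines = [json.loads(line)["headline"] for line in f]

# 3. Vectorize the text
X = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(headlines)

# 4. Reduce the dimensionality with PCA (PCA needs a dense matrix)
coords = PCA(n_components=2).fit_transform(X.toarray())

# 5. Cluster the documents
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(coords)

# 6. Plot the clusters
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5)
plt.title("Headline clusters (TF-IDF + PCA + KMeans)")
plt.show()
```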
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset can be used as a benchmark for clustering word embeddings for German. The dataset contains book titles and is based on the dataset from the GermEval 2019 Shared Task on Hierarchical Classification of Blurbs. It contains 18'084 unique samples, 28 splits with 177 to 16'425 samples and 4 to 93 unique classes. Splits are built similarly to MTEB's ArxivClusteringP2P. Have a look at the German Text Embedding Clustering Benchmark (GitHub, Paper) for more information, datasets and evaluation… See the full description on the dataset page: https://huggingface.co/datasets/slvnwhrl/blurbs-clustering-p2p.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Over 12k scraped YouTube EN subtitles for videos on GitHub topics.
How? Based on the topics https://github.com/topics I searched YouTube with the phrase "What is {topic}?" and downloaded up to 100 video subtitles for a given topic. The extracted text can be found in the dataset together with the topic name, video title and video URL.
Why? I want to know whether we can rate videos based on their information value, especially when we use YouTube as an information source.
You can find the source code here: https://github.com/detrin/text-info-value
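The author's actual scraper lives in the linked repository; purely as an illustration, a comparable collection step could look like the sketch below (yt-dlp, the topic value and the output layout are assumptions, not necessarily what the original project used).

```python
# Illustrative only: the original project's code is at github.com/detrin/text-info-value.
# This sketch assumes yt-dlp; the topic and output layout are invented for the example.
from yt_dlp import YoutubeDL

def download_subtitles(topic: str, max_results: int = 100) -> None:
    """Search YouTube for 'What is {topic}?' and save English subtitles only."""
    opts = {
        "skip_download": True,        # subtitles only, no video files
        "writesubtitles": True,
        "writeautomaticsub": True,    # fall back to auto-generated captions
        "subtitleslangs": ["en"],
        "outtmpl": f"subs/{topic}/%(id)s.%(ext)s",
        "quiet": True,
    }
    query = f"ytsearch{max_results}:What is {topic}?"
    with YoutubeDL(opts) as ydl:
        ydl.download([query])

download_subtitles("machine-learning")
```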
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Output files from applying our R software (available at https://github.com/wilkinsonlab/robust-clustering-metagenomics) to several previously published microbiome datasets.
Prefixes:
Suffixes:
_All: all taxa
_Dominant: only the 1% most abundant taxa
_NonDominant: remaining taxa after removing the dominant taxa above
_GenusAll: taxa aggregated at genus level
_GenusDominant: taxa aggregated at genus level, then only the 1% most abundant taxa selected
_GenusNonDominant: taxa aggregated at genus level, then the 1% most abundant taxa removed
Each folder contains 3 output files related to the same input dataset:
- data.normAndDist_definitiveClustering_XXX.RData: R data file with a) a phyloseq object (including OTU table, meta-data and cluster assigned to each sample); and b) a distance matrix object.
- definitiveClusteringResults_XXX.txt: text file with assessment measures of the selected clustering.
- sampleId-cluster_pairs_XXX.txt: comma-separated text file with two columns: sampleID,clusterID
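For example, the sample-to-cluster assignments can be loaded directly into Python; whether the two-column files carry a header row is not stated here, so that part of the sketch is an assumption.

```python
# Minimal loading sketch for one sampleId-cluster pairs file; "XXX" stands for the
# dataset-specific suffix. header=0 assumes the first line is the "sampleID,clusterID"
# header; use header=None if the files have no header row.
import pandas as pd

pairs = pd.read_csv("sampleId-cluster_pairs_XXX.txt", header=0,
                    names=["sampleID", "clusterID"])
print(pairs["clusterID"].value_counts())  # number of samples per cluster
```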
Abstract of the associated paper:
The analysis of microbiome dynamics would allow us to elucidate patterns within microbial community evolution; however, microbiome state-transition dynamics have been scarcely studied. This is in part because a necessary first step in such analyses has not been well defined: how to deterministically describe a microbiome's "state". Clustering into states has been widely studied, although no standard has yet been agreed upon. We propose a generic, domain-independent and automatic procedure to determine a reliable set of microbiome sub-states within a specific dataset, with respect to the conditions of the study. The robustness of sub-state identification is established by combining diverse techniques for stable cluster verification. We reuse four distinct longitudinal microbiome datasets to demonstrate the broad applicability of our method, analysing results with different taxa subsets to show how the procedure can be adjusted to the goal of the application, and showing that the methodology provides a set of robust sub-states to examine in downstream studies of microbiome dynamics.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset can be used as a benchmark for clustering word embeddings for German. The dataset contains news article titles and is based on the dataset of the One Million Posts Corpus and 10kGNAD. It contains 10'267 unique samples, 10 splits with 1'436 to 9'962 samples and 9 unique classes. Splits are built similarly to MTEB's TwentyNewsgroupsClustering. Have a look at the German Text Embedding Clustering Benchmark (GitHub, Paper) for more information, datasets and evaluation results. If you use this… See the full description on the dataset page: https://huggingface.co/datasets/slvnwhrl/tenkgnad-clustering-s2s.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Word embeddings and word clusters for Clinical Danish, drawn from the heavily-anonymised E4C resource (https://doi.org/10.1177/1460458216647760) and presented here as statistical aggregate data over those records. Vocabulary of 382,737 words. Vectors have 100 dimensions. Clusters were generated using Generalised Brown clustering with a=2500 and a minimum count of 3; coarser clusters can be generated rapidly from the included mergefile (see https://github.com/sean-chester/generalised-brown/blob/master/cluster_generator/cluster.py). A data statement is included.
Quantifying Iconicity - Zenodo
This dataset contains the material collected for the article "Distant reading 940,000 online circulations of 26 iconic photographs" (to be) published in New Media & Society (DOI: 10.1177/14614448211049459). We identified 26 iconic photographs based on earlier work (Van der Hoeven, 2019). The Google Cloud Vision (GCV) API was subsequently used to identify webpages that host a reproduction of the iconic image. The GCV API uses computer vision methods and the Google index to retrieve these reproductions. The code for calling the API and parsing the data can be found on GitHub: https://github.com/rubenros1795/ReACT_GCV.
The core dataset consists of .tsv files with the URLs that refer to the webpages. Other metadata provided by the GCV API, as well as manually generated metadata, is also found in the files. This includes:
- the URL that refers specifically to the image. This can be a URL that refers to a full match or a partial match
- the title of the page
- the iteration number. Because the GCV API puts a limit on its output, we had to reupload the identified images to the API to extend our search. We continued these iterations until no more new unique URLs were found
- the language found by the langid Python module, along with the normalized score.
- the labels associated with the image by Google
- the scrape date
Alongside the .tsv-files, there are several other elements in the following folder structure:
├── data
│   ├── embeddings
│   │   ├── doc2vec
│   │   ├── input-text
│   │   ├── metadata
│   │   └── umap
│   ├── evaluation
│   ├── results
│   │   ├── diachronic-plots
│   │   └── top-words
│   └── tsv
The /embeddings folder contains the doc2vec models, the training input for the models, the metadata (id, URL, date) and the UMAP embeddings used in the GMM clustering. Please note that the date parser was not able to find dates for all webpages, so not all training texts have associated metadata.
The /evaluation folder contains the AIC and BIC scores for GMM clustering with different numbers of clusters.
The /results folder contains the top words associated with the clusters and the diachronic cluster prominence plots.
Our pipeline contained several interventions to prevent noise in the data. First, in between the iterations we manually checked the scraped photos for relevance. We did so because reuploading an iconic image that is paired with another, irrelevant one results in reproductions of the irrelevant one in the next iteration. Because we did not catch all noise, we used Scale Invariant Feature Transform (SIFT), a basic computer vision algorithm, to remove images that did not meet a threshold of ten keypoints. By doing so we removed completely unrelated photographs, but left room for variations of the original (such as painted versions of Che Guevara, or cropped versions of the Napalm Girl image). Another issue was the parsing of webpage texts. After experimenting with different webpage parsers that aim to extract 'relevant' text, it proved too difficult to use one solution for all our webpages. Therefore we simply parsed all the text contained in commonly used html-tags.
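One plausible reading of the keypoint filter described above is shown in the sketch below: a scraped image is kept only if at least ten SIFT keypoints match the reference iconic photograph. The file paths, the matcher and the ratio test are assumptions, not the authors' original implementation.

```python
# Sketch of a SIFT keypoint-match filter with a threshold of ten matches.
# Paths are placeholders; this is not the authors' original code.
import cv2

def enough_matching_keypoints(reference_path: str, candidate_path: str,
                              min_matches: int = 10) -> bool:
    ref = cv2.imread(reference_path, cv2.IMREAD_GRAYSCALE)
    cand = cv2.imread(candidate_path, cv2.IMREAD_GRAYSCALE)
    if ref is None or cand is None:
        return False
    sift = cv2.SIFT_create()
    _, ref_desc = sift.detectAndCompute(ref, None)
    _, cand_desc = sift.detectAndCompute(cand, None)
    if ref_desc is None or cand_desc is None:
        return False
    matches = cv2.BFMatcher().knnMatch(ref_desc, cand_desc, k=2)
    # Lowe's ratio test keeps only distinctive matches
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
    return len(good) >= min_matches

print(enough_matching_keypoints("originals/iconic_photo.jpg", "scraped/example.jpg"))
```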
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accompanying datasets that are referenced in the journal article 'Comparing the crisis of 806/1403-4 and the Fatimid fitna (450-466/1058-1073): al-Maqrīzī as a historian of the Fatimids - Datasets'. The texts used in the analysis are taken from the OpenITI corpus release (Version 2021.2.5). If the ID (the final part of the text URI) has changed from the OpenITI release to this data release, then the text has been modified for this case study. File extensions following the text URI indicate that the text has had additional tags applied (either date tags, or text reuse cluster tags). csv file names indicate the text file from which the csv file was generated. This is a published part of an active research project. For other datasets and the scripts used to generate this data, see the relevant GitHub repository: https://github.com/mabarber92/fitna-study
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Preface
This is the data repository for the paper accepted for publication in NUSA's special issue on Linguistic studies using large annotated corpora (co-edited by Hiroki Nomoto and David Moeljadi).
How to cite the dataset
If you use, adapt, and/or modify any of the datasets in this repository for your research or teaching purposes (except for the malindo_dbase, see below), please cite as:
Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019): Dataset for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Fileset. https://doi.org/10.6084/m9.figshare.8187155
Alternatively, click on the dark pink Cite button to browse different citation styles (the default is DataCite).
The malindo_dbase data in this repository is from Nomoto et al. (2018) (cf. the GitHub repository), so please also cite their work if you use it for your research:
Nomoto, Hiroki, Hannah Choi, David Moeljadi and Francis Bond. 2018. MALINDO Morph: Morphological dictionary and analyser for Malay/Indonesian. Kiyoaki Shirai (ed.) Proceedings of the LREC 2018 Workshop "The 13th Workshop on Asian Language Resources", 36-43.
A tutorial on how to use the data together with the R Markdown Notebook for the analyses is available on GitHub and figshare:
Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019): R Markdown Notebook for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Software. https://doi.org/10.6084/m9.figshare.9970205
Dataset description
1. Leipzig_w2v_vector_full.bin is the vector space model used in the paper. We built it using the wordVectors package (Schmidt & Li 2017) via the MonARCH High Performance Computing Cluster (we thank Philip Chan for his help with access to MonARCH).
2. Files beginning with ngramexmpl_... are data for the n-grams (i.e. word sequences) of the verbs discussed in the paper. The files are in tab-separated format.
3. Files beginning with sentence_... are full sentences for the verbs discussed in the paper (in plain text format and R dataset format [.rds]). Information on the corpus file and sentence number in which the verb is found is included.
4. me_parsed_nountaggedbase (in three different file formats) contains a database of the me- words with noun-tagged roots that MorphInd identified as occurring in the three morphological schemas we focus on (me-, me-/-kan, and me-/-i). The database has columns for the verbs' token frequency in the corpus, root forms, and MorphInd parsing output, among others.
5. wordcount_leipzig_allcorpus (in three different file formats) contains information on the size of each corpus file used in the paper and from which the vector space model is built.
6. wordlist_leipzig_ME_DI_TER_percorpus.tsv is a tab-separated frequency list of words prefixed with me-, di-, and ter- in all thirteen corpus files used. The wordlist is built by first tokenising each corpus file, lowercasing the tokens, and then extracting the words with the corresponding three prefixes using the following regular expressions (illustrated in the sketch below):
- For me-: ^(?i)(me)([a-z-]{3,})$
- For di-: ^(?i)(di)([a-z-]{3,})$
- For ter-: ^(?i)(ter)([a-z-]{3,})$
7. malindo_dbase is the MALINDO Morphological Dictionary (see above).
References
Schmidt, Ben & Jian Li. 2017. wordVectors: Tools for creating and analyzing vector-space models of texts. R package. http://github.com/bmschmidt/wordVectors.
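As a small illustration of the prefix extraction described in item 6: the original pipeline is in R, so the Python version and the whitespace tokenisation below are assumptions.

```python
# Sketch of the prefix extraction in item 6. The original pipeline is in R;
# this Python version and the simple whitespace tokenisation are assumptions.
# re.IGNORECASE replaces the inline (?i) flag used in the listed patterns.
import re
from collections import Counter

PREFIX_PATTERNS = {
    "me": re.compile(r"^(me)([a-z-]{3,})$", re.IGNORECASE),
    "di": re.compile(r"^(di)([a-z-]{3,})$", re.IGNORECASE),
    "ter": re.compile(r"^(ter)([a-z-]{3,})$", re.IGNORECASE),
}

def prefix_frequencies(corpus_text: str) -> dict:
    """Tokenise, lowercase, and count words matching each prefix pattern."""
    tokens = [t.lower() for t in corpus_text.split()]
    counts = {prefix: Counter() for prefix in PREFIX_PATTERNS}
    for token in tokens:
        for prefix, pattern in PREFIX_PATTERNS.items():
            if pattern.match(token):
                counts[prefix][token] += 1
    return counts

freqs = prefix_frequencies("Dia membaca buku itu dan kemudian ditulis ulang")
print(freqs["me"].most_common(5))
```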
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This is a side project for my thesis “Classification/Clustering Techniques for Large Web Data Collections”.
My main goal was to provide a new, enriched, ground-truth labeled dataset to the Machine Learning community. All labels have been collected by crawling/scraping Amazon.com over a period of several months. By labels I mean the categories in which the products are classified (see the green underlined labels in the screenshot below).
Screenshot: http://i.imgur.com/mAiuoO6.png
Please, if you feel you can make any contribution that will improve this dataset, fork it on github.com.
The Amazon Movies Reviews dataset consists of 7,911,684 reviews that Amazon users left between Aug 1997 and Oct 2012.
Data format:
where:
All the collected data (for every ASIN of the SNAP Dataset, ~253k products for ~8m reviews) are stored in a csv file labels.csv in the following format:
The new data format will be:
You can follow the steps below to get the enriched dataset:
Download the original dataset from the SNAP website (~3.3 GB compressed) and put it in the root folder of the repository (where you can also find the labels.csv file).
Execute the python file enrich.py (available in the GitHub project) so that the new enriched multi-labeled dataset is exported. The name of the new file should be output.txt.gz.
Notice: Please be patient as the python script will take a while to parse all these reviews.
The python script generates a new compressed file that is essentially the same as the original one, but with an extra feature (product/categories).
In fact, the python script applies a mapping between the ASIN values in both files and adds the product's label data to every review instance as an extra column.
Here is the code:
import gzip
import csv
import ast

def look_up(asin, diction):
    # Return the category labels for an ASIN, or an empty list if unknown.
    try:
        return diction[asin]
    except KeyError:
        return []

def load_labels():
    # Build a dictionary mapping each ASIN to its list of category labels.
    labels_dictionary = {}
    with open('labels.csv', mode='r') as infile:
        csvreader = csv.reader(infile)
        next(csvreader)  # skip the header row
        for rows in csvreader:
            labels_dictionary[rows[0]] = ast.literal_eval(rows[1])
    return labels_dictionary

def parse(filename):
    # Stream the SNAP reviews file and yield one review entry (dict) at a time.
    labels_dict = load_labels()
    f = gzip.open(filename, 'rt')  # text mode so lines are str, not bytes
    entry = {}
    for l in f:
        l = l.strip()
        colonPos = l.find(':')
        if colonPos == -1:
            # A blank line marks the end of a review entry.
            yield entry
            entry = {}
            continue
        eName = l[:colonPos]
        rest = l[colonPos+2:]
        entry[eName] = rest
        if eName == 'product/productId':
            # The original snippet is truncated here; presumably the ASIN is
            # mapped to its scraped category labels, e.g.:
            entry['product/categories'] = look_up(rest, labels_dict)
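A possible driver for the generator above, writing the enriched entries back out as gzipped text, is sketched below; the SNAP input file name and the exact serialization used by enrich.py are assumptions.

```python
# Hypothetical driver: stream the enriched entries and write them back out in
# the original "key: value" block format. File names are assumptions.
import gzip

with gzip.open("output.txt.gz", "wt") as out:
    for entry in parse("movies.txt.gz"):
        for key, value in entry.items():
            out.write(f"{key}: {value}\n")
        out.write("\n")  # blank line separates review entries
```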
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains output data from cluster population simulations performed with Atmospheric Cluster Dynamics Code (ACDC) model, which simulates the formation of clusters from atmospheric vapors and the growth of these clusters by further molecular and cluster-cluster collisions. The data can be used for investigating the formation and growth of atmospheric particles from inorganic and organic vapors.
The data is output of a computational process model, and hence does not represent a specific time period or location. Simulation sets are calculated for one- or two-component systems containing a quasi-unary inorganic compound representing a mixture of sulfuric acid and ammonia (SA) and/or oxidized organic vapors corresponding to a low-volatility organic compound (LVOC) and an extremely-low-volatility organic compound (ELVOC). The external conditions in the simulations correspond to those in the CLOUD (Cosmics Leaving Outdoor Droplets) chamber at a temperature of 5 °C.
Data are provided for 14 simulations.
References
Kontkanen J, Stolzenburg D, Olenius T, Yan C, Dada L, Ahonen L, Simon M, Lehtipalo K, Riipinen I (2022) What controls the observed size-dependency of the growth rates of sub-10 nm atmospheric particles?. Environ. Sci.: Atmos. https://doi.org/10.1039/d1ea00103e
Olenius T, Riipinen I (2017) Molecular-resolution simulations of new particle formation: Evaluation of common assumptions made in describing nucleation in aerosol dynamics models. Aerosol Sci. Tech. 51:397 – 408. https://doi.org/10.1080/02786826.2016.1262530
Olenius T, Atmospheric Cluster Dynamics Code. https://github.com/tolenius/ACDC
McGrath MJ et al. (2012) Atmospheric Cluster Dynamics Code: a flexible method for solution of the birth-death equations. Atmos. Chem. Phys. 12:2345 – 2355. https://doi.org/10.5194/acp-12-2345-2012
Data description
The data is in the form of text files. The provided data files (total compressed size ~10GB) correspond to simulation output from the ACDC model. Simulation sets are shown in the table below and further described in Kontkanen et al. (2022). For the interpretation of the model output, the interested user is referred to the manual of ACDC model (https://github.com/tolenius/ACDC).
| Simulation set | Model compounds | Vapor concentrations (cm⁻³) | Method to retrieve evaporation rates |
|---|---|---|---|
| 1 | SA | C_SA = 8.0×10⁶, 2.0×10⁷, 4.7×10⁷, 1.1×10⁸ | Kelvin eq. (classical evaporation rates) |
| 2 | SA | C_SA = 2.0×10⁷, 4.7×10⁷, 1.1×10⁸ | QC data and Kelvin eq. (non-classical evaporation rates) |
| 3 | LVOC | C_LVOC = 5.0×10⁷, 1×10⁸ | Kelvin eq. (classical evaporation rates) |
| 4 | LVOC, ELVOC | C_LVOC = 5.0×10⁷, 1×10⁸; C_ELVOC = 1.0×10⁷ | Kelvin eq. (classical evaporation rates) |
| 5 | LVOC, SA | C_LVOC = 2.0×10⁷, 5.0×10⁷, 1×10⁸; C_SA = 8.0×10⁶ | Kelvin eq. (classical evaporation rates) |
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This data has been collected through a standard scraping process from the Medium website, looking for published articles.
Each row in the data is a different article published on Medium. For each article, you have the following features:
- title [string]: The title of the article.
- text [string]: The text content of the article.
- url [string]: The URL associated with the article.
- authors [list of strings]: The article authors.
- timestamp [string]: The publication datetime of the article.
- tags [list of strings]: List of tags associated with the article.
You can find a very quick data analysis in this notebook.
Scraping has been done with Python and the requests library. Starting from a random article on Medium, the next articles to scrape are selected by visiting:
1. The author archive pages.
2. The publication archive pages (if present).
3. The tags archives (if present).
The article HTML pages have been parsed with the newspaper Python library.
Published articles have been filtered for English articles only, using the Python langdetect library.
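A rough sketch of the parse-and-filter step described above is shown below; the URL is a placeholder and the original crawler logic is more involved than this.

```python
# Sketch of the parse-and-filter step: download an article page, extract its
# text with newspaper, and keep it only if langdetect says it is English.
# The URL is a placeholder; the original crawler logic is more involved.
from newspaper import Article
from langdetect import detect

def fetch_english_article(url: str):
    article = Article(url)
    article.download()
    article.parse()
    if detect(article.text) != "en":
        return None
    return {
        "title": article.title,
        "text": article.text,
        "url": url,
        "authors": article.authors,
        "timestamp": str(article.publish_date),
        "tags": list(article.tags),
    }

print(fetch_english_article("https://medium.com/some-publication/some-article"))
```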
As a consequence of the collection methodology, the scraped articles do not come from a uniform publication date distribution. This means that there are articles published in 2016 and in 2022, but the number of articles in this dataset published in 2016 is not the same as the number published in 2022. In particular, there is a strong prevalence of articles published in 2020. Have a look at the accompanying notebook to see the distribution of the publication dates.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
TenKGnadClusteringP2P.v2, an MTEB (Massive Text Embedding Benchmark) dataset
Clustering of news article titles+subheadings+texts. Clustering of 10 splits on the news article category.
Task category t2c
Domains News, Non-fiction, Written
Reference https://tblock.github.io/10kGNAD/
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code: import mteb
task = mteb.get_task("TenKGnadClusteringP2P.v2") evaluator… See the full description on the dataset page: https://huggingface.co/datasets/mteb/TenKGnadClusteringP2P.v2.
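The snippet above is truncated on the dataset page; a complete evaluation run with the mteb package typically looks like the sketch below. The embedding model is a placeholder, and the API details may differ slightly between mteb versions.

```python
# Rough sketch of a full MTEB evaluation; the model choice is a placeholder and
# API details may vary between mteb versions.
import mteb
from sentence_transformers import SentenceTransformer

task = mteb.get_task("TenKGnadClusteringP2P.v2")
evaluator = mteb.MTEB(tasks=[task])

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
results = evaluator.run(model, output_folder="results")
print(results)
```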
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data set has been downloaded via the OAI-PMH endpoint of the Berlin State Library/Staatsbibliothek zu Berlin’s Digitized Collections (https://digital.staatsbibliothek-berlin.de/oai) on March 1st 2019 and converted into common tabular formats on the basis of the provided Dublin Core metadata. It contains 146,000 records.
In addition to the bibliographic metadata, representative images of the works have been downloaded, resized to a 512 pixel maximum thumbnail image and saved in JPEG format. The image data is split into title pages and first pages. Title pages have been derived from structural metadata created by scan operators and librarians. If this information was not available, first pages of the media have been downloaded. In case of multi-volume media, title pages are not available.
In total, 141,206 title/first page images are available.
Furthermore, the tabular data has been cleaned and extended with geo-spatial coordinates provided by the OpenStreetMap project (https://www.openstreetmap.org). The actual data processing steps are summarized in the next section. For the sake of transparency and reproducibility, the original data taken from the OAI-PMH endpoint is still present in the table.
In addition, various graphs in GML file format are available that can be loaded directly into graph analysis tools such as Gephi (https://gephi.org/).
The implementation of the data processing steps (incl. graph creation) is available as a Jupyter notebook provided at https://github.com/elektrobohemian/SBBrowse2018/blob/master/DataProcessing.ipynb.
Tabular Metadata
The metadata is available in Excel (cleanedData.xlsx) and CSV (cleanedData.csv) file formats with equal content.
The table contains the following columns. Columns in italics have not been processed.
· title The title of the medium
· creator Its creator (family name, first name)
· subject A collection’s name as provided by the library
· type The type of medium
· format A MIME type for full metadata download
· identifier An additional identifier (most often the PPN)
· language A 3-letter language code of the medium
· date The date of creation/publication or a time span
· relation A relation to a project or collection a medium has been digitized for.
· coverage The location of publication or origin (ranging from cities to continents)
· publisher The publisher of the medium.
· rights Copyright information.
· PPN The unique identifier that can be used to find more information about the current medium in all information systems of Berlin State Library/Staatsbibliothek zu Berlin.
· spatialClean In case of multiple entries in coverage, only the first place of origin has been extracted. Additionally, characters such as question marks, brackets, or the like have been removed. The entries have been normalized regarding whitespaces and writing variants with the help of regular expressions.
· dateClean As the original date may contain various format variants to indicate unclear creation dates (e.g., time spans or question marks), this field contains a mapping to a certain point in time.
· spatialCluster The cluster ID determined with the help of the Jaro-Winkler distance on the spatialClean string. This step is needed because the spatialClean fields still contain a huge amount of orthographic variants and latinizations of geographic names.
· spatialClusterName A verbal cluster name (controlled manually).
· latitude The latitude provided by OpenStreetMap of the spatialClusterName if the location could be found.
· longitude The longitude provided by OpenStreetMap of the spatialClusterName if the location could be found.
· century A century derived from the date.
· textCluster A text cluster ID based on a k-means clustering of the title field, using a tf*idf model with a vocabulary size of 125,000 and k=5,000 (see the sketch after this list).
· creatorCluster A text cluster ID based on the creator field with k=20,000.
· titleImage The path to the first/title page relative to the img/ subdirectory or None in case of a multi-volume work.
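A rough sketch of how a clustering like textCluster can be reproduced is given below; the exact preprocessing and k-means variant used for the published columns are not documented here, so the parameters are illustrative only.

```python
# Illustrative sketch of a title clustering like the textCluster column:
# tf*idf over the title field with a 125,000-word vocabulary and k = 5,000.
# The actual preprocessing used for the published dataset may differ.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

df = pd.read_csv("cleanedData.csv")
titles = df["title"].fillna("").astype(str)

tfidf = TfidfVectorizer(max_features=125_000)
X = tfidf.fit_transform(titles)

# MiniBatchKMeans keeps k = 5,000 tractable on ~146,000 documents.
kmeans = MiniBatchKMeans(n_clusters=5_000, random_state=0, n_init=3)
df["textCluster"] = kmeans.fit_predict(X)
```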
Other Data
graphs.zip
Various pre-computed graphs.
img.zip
First and title pages in JPEG format.
json.zip
JSON files for each record in the following format:
ppn "PPN57346250X"
dateClean "1625"
title "M. Georgii Gutkii, Gymnasii Berlinensis Rectoris Habitus Primorum Principiorum, Seu Intelligentia; Annexae Sunt Appendicis loco Disputationes super eodem habitu tum in Academia Wittebergensi, tum in Gymnasio Berlinensi ventilatae"
creator "Gutke, Georg"
spatialClusterName "Berlin"
spatialClean "Berolini"
spatialRaw "Berolini"
mediatype "monograph"
subject "Historische Drucke"
publisher "Kallius"
lat "52.5170365"
lng "13.3888599"
textCluster "45"
creatorCluster "5040"
titleImage "titlepages/PPN57346250X.jpg"
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A dataset of 277,439 English-only Steam user reviews for Dead by Daylight from 2019 to November 2025, collected through the official Steam API.
Each row represents a single review, including sentiment labels, playtime, and engagement metrics.
This dataset is ideal for natural language processing, sentiment analysis, and behavioral data studies.
A separate CSV with all the patches released for Dead by Daylight is included in the download for your convenience.
| Field | Description |
|---|---|
| review | Full review text |
| sentiment | 1 = positive review, 0 = negative |
| purchased | 1 if purchased on Steam |
| received_for_free | 1 if the game was received for free |
| votes_up | Number of helpful votes |
| votes_funny | Number of “funny” votes |
| date_created | Review creation date (YYYY-MM-DD, UTC) |
| date_updated | Last update date (YYYY-MM-DD, UTC) |
| author_num_games_owned | Total games owned by reviewer |
| author_num_reviews | Total reviews written by reviewer |
| author_playtime_forever_min | Total playtime in minutes |
| author_playtime_at_review_min | Playtime when the review was written (minutes) |
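As a quick start, the fields above can be loaded with pandas; the CSV file name below is an assumption about the downloaded archive.

```python
# Minimal loading sketch; the CSV file name is an assumption about the download.
import pandas as pd

reviews = pd.read_csv("dead_by_daylight_reviews.csv")
print(reviews["sentiment"].value_counts(normalize=True))                      # label balance
print(reviews.groupby("sentiment")["author_playtime_forever_min"].median())  # playtime by label
```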
Reviews were collected using the SirDarcanos/Steam-Reviews-Scraper script.
This dataset includes only publicly accessible user content and metadata.
Each record is factual and unaltered beyond format normalization.
Updates will be performed irregularly and only when new data is collected. Users are welcome to suggest improvements or request updates via the discussion section.
Created by Nicola Mustone.
This dataset and its author are not affiliated with, endorsed by, or sponsored by Valve Corporation or Behaviour Interactive Inc.
All product names, logos, brands, and trademarks are the property of their respective owners.
The data included in this dataset was collected from publicly available user reviews through the official Steam Web API, and is provided solely for educational and research purposes.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
GeoreviewClusteringP2P, an MTEB (Massive Text Embedding Benchmark) dataset
Review clustering based on Yandex Georeview dataset
Task category t2c
Domains Reviews, Written
Reference https://github.com/yandex/geo-reviews-dataset-2023
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code: import mteb
task = mteb.get_tasks(["GeoreviewClusteringP2P"]) evaluator = mteb.MTEB(task)
model =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/GeoreviewClusteringP2P.
This dataset was compiled for the Philosophy Data Project and used to develop the features available on that site. As a former philosophy teacher and now data scientist, I thought it would be interesting to apply the tools of data science to the history of philosophy.
The initial goal was to build a classification model with the data. After all, a book of philosophy represents an effort to systematically organize one's thought about the world. Using the data from the history of philosophy to classify texts would thus enable us to, by proxy, classify how people think about the world. Where some projects focus on sentiment analysis, here we focus on conceptual, or ideological analysis. Once we understand a person's worldview, there is no limit to what we can do with that information - from advertising to political campaigning through to self-exploration and therapy.
After that, I built several features to help people explore philosophical ideas and do comparisons. These included a w2v model for word use comparison, a set of basic stats for each text and school, and a feature enabling users to search the corpus.
After finishing initial work on the site and its data tools, I thought it would be worthwhile to make the data publicly available so others could work with it.
The dataset contains over 300,000 sentences from over 50 texts spanning 10 major schools of philosophy. The represented schools are: Plato, Aristotle, Rationalism, Empiricism, German Idealism, Communism, Capitalism, Phenomenology, Continental Philosophy, and Analytic Philosophy.
Texts were taken either from Project Gutenberg or from my own personal library of pdfs. The dataset is updated periodically as I add new texts to the corpus.
The texts were cleaned extensively before being tokenized and organized in the way they're presented here. For information on the cleaning steps, check out the github repo for the initial project, which contains a notebook with all the cleaning steps.
There are a ton of cool project ideas! Here are a few:
- use some clustering technique to see if the sentences would naturally cluster into their corresponding schools (a rough sketch follows below)
- build a text completion or chat-bot app by training on the sources
- compare the texts to secondary literature on the philosophers to see if the secondary literature gets the interpretation right
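For the first idea, one way to check whether unsupervised clusters line up with the school labels is sketched below; the CSV file name and column names are assumptions about the published data, not a documented schema.

```python
# Sketch for the first project idea: cluster sentences and compare the result
# to the school labels. The CSV name and column names are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

df = pd.read_csv("philosophy_data.csv")                  # assumed file name
sentences, schools = df["sentence_str"], df["school"]    # assumed column names

X = TfidfVectorizer(max_features=20_000, stop_words="english").fit_transform(sentences)
clusters = KMeans(n_clusters=schools.nunique(), random_state=0, n_init=10).fit_predict(X)

# Adjusted Rand Index: 1.0 means clusters match the schools exactly, ~0 means chance.
print(adjusted_rand_score(schools, clusters))
```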
If you come up with any cool visualizations or insights you want to share, please do contact me and we can definitely feature your work on the Philosophy Data Project website. Looking forward to seeing what you come up with :)
License: unknown (https://choosealicense.com/licenses/unknown/)
RuSciBenchGRNTIClusteringP2P, an MTEB (Massive Text Embedding Benchmark) dataset
Clustering of scientific papers (title+abstract) by rubric
Task category t2c
Domains Academic, Written
Reference https://github.com/mlsa-iai-msu-lab/ru_sci_bench/
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code: import mteb
task = mteb.get_task("RuSciBenchGRNTIClusteringP2P") evaluator = mteb.MTEB([task])
model =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/RuSciBenchGRNTIClusteringP2P.
License: other (https://choosealicense.com/licenses/other/)
LivedoorNewsClustering, an MTEB (Massive Text Embedding Benchmark) dataset
Clustering of news reports from the Japanese news site Livedoor News, collected by RONDHUIT Co., Ltd. in 2012. It contains over 7,000 news report texts across 9 categories (topics).
Task category t2c
Domains News, Written
Reference https://github.com/sbintuitions/JMTEB
Source datasets:
sbintuitions/JMTEB
How to evaluate on this task
You can evaluate an embedding model on this dataset using… See the full description on the dataset page: https://huggingface.co/datasets/mteb/LivedoorNewsClustering.