Please download the text file. It contains 22 sentences related to the «cat» topic:
- Cat (the animal)
- the UNIX utility cat for displaying the contents of files
- versions of the OS X operating system named after the feline family
Your task is to find the two sentences that are closest in meaning to the first sentence in the document («In comparison to dogs, cats have not undergone .......»). We will use the cosine distance as the measure of proximity.
Steps:
1. Open the file.
2. Each line is one sentence. Convert them all to lower case using the string function lower(). EXAMPLE: in comparison to dogs, cats have not undergone major changes during the domestication process.
3. Tokenization: split each sentence into words. For that purpose you can use a regular expression that splits on spaces or any other symbols that are not letters: re.split('[^a-z]', t). Do not forget to remove empty tokens. EXAMPLE: ['in', 'comparison', 'to', 'dogs', '', 'cats', 'have', 'not', 'undergone', 'major', 'changes', 'during', 'the', 'domestication', 'process'].
4. Make a list of all the words that appear in the sentences. Note: all the words are unique. Give each word an index from 0 to the number of unique words minus one. You can use a dict. Example: {0: 'mac', 1: 'permanently', 2: 'osx', 3: 'download', 4: 'between', 5: 'based', 6: 'which', ............., 252: 'safer', 253: 'will'}. Hint: there are 254 unique words.
5. Create a matrix with N x D dimensions, where N is the number of sentences and D is the number of unique words (22 x 254). Fill it in: the element with index (i, j) in this matrix must be equal to the number of occurrences of the j-th word in the i-th sentence (bag of words).
A Python sketch of these steps is given below.
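One possible sketch of steps 1-5 plus the cosine-distance lookup from the task statement. The filename sentences.txt is an assumption; use the actual name of the downloaded file.

import re
import numpy as np
from scipy.spatial.distance import cosine

# Steps 1-2: read the file and lower-case each sentence.
with open("sentences.txt") as f:
    sentences = [line.lower() for line in f if line.strip()]

# Step 3: tokenize on non-letter characters and drop empty tokens.
tokenized = [[w for w in re.split("[^a-z]", s) if w] for s in sentences]

# Step 4: assign an index to every unique word.
word_index = {}
for tokens in tokenized:
    for w in tokens:
        if w not in word_index:
            word_index[w] = len(word_index)

# Step 5: bag-of-words matrix, entry (i, j) = count of word j in sentence i.
counts = np.zeros((len(tokenized), len(word_index)))
for i, tokens in enumerate(tokenized):
    for w in tokens:
        counts[i, word_index[w]] += 1

# Cosine distance from the first sentence to every other sentence.
dists = [(i, cosine(counts[0], counts[i])) for i in range(1, len(counts))]
print(sorted(dists, key=lambda p: p[1])[:2])  # the two closest sentences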
Of course, in Task 1 we implemented a very simple method. For example, in this method «cat» and «cats» are two different words, even though the meaning is the same.
For the second task, repeat steps 1-4 from Task 1. In this task you will create a Term Frequency-Inverse Document Frequency (TF-IDF) matrix. Find the cosine distance from the first sentence to all the other sentences. Which two sentences are closest to the first sentence? You can use scipy.spatial.distance.cosine. Is there any difference from the result of the previous task? Note: you should not use any existing libraries for TF-IDF. All the steps are similar to the previous example.
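A possible sketch of the TF-IDF step, reusing counts from the previous sketch. The plain log(N / df) form of IDF used here is only one common convention; any consistent variant works, since no library may be used for TF-IDF.

import numpy as np
from scipy.spatial.distance import cosine

n_sentences = counts.shape[0]

# Term frequency: word counts normalized by sentence length.
tf = counts / counts.sum(axis=1, keepdims=True)

# Inverse document frequency: log(number of sentences / number of
# sentences that contain the word).
df = (counts > 0).sum(axis=0)
idf = np.log(n_sentences / df)

tfidf = tf * idf

# Cosine distance from the first sentence to all the others.
dists = [(i, cosine(tfidf[0], tfidf[i])) for i in range(1, n_sentences)]
print(sorted(dists, key=lambda p: p[1])[:2])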
Run the hierarchical clustering algorithm for Task 1 and Task 2, plot the dendrograms, and explain your results.
NOTE: by default scipy.cluster.hierarchy uses the euclidean distance. You should change it to the cosine distance.
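A minimal sketch of the clustering step with scipy, shown for the bag-of-words matrix counts; the same call applies to the tfidf matrix. Note that 'ward' linkage requires euclidean distance, so a method such as 'average' is used here with the cosine metric.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# linkage defaults to the euclidean metric; request cosine explicitly.
Z = linkage(counts, method="average", metric="cosine")
dendrogram(Z)
plt.show()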
This is a text document classification dataset which contains 2225 text samples and five categories of documents. The five categories are politics, sport, tech, entertainment and business. This dataset can be used for document classification and document clustering.
About Dataset
- The dataset contains two features: text and label.
- No. of rows: 2225
- No. of columns: 2
Text: contains the different categories of text data. Label: contains the labels for the five categories: 0, 1, 2, 3, 4.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
I built the dataset by:
- Sampling 500 texts with 6 different labels from https://medium.com/r/?url=https%3A%2F%2Fwww.kaggle.com%2Fdatasets%2Fhaithemhermessi%2Fsanad-dataset%2Fdata%3Fselect%3DCulture. I made sure that all the categories got a different random number of samples: Politics: 94, Sports: 110, Finance: 83, Tech: 67, Religion: 66, Medical: 80. I also made sure the text lengths vary across the samples.
- Sampling 500 texts with 6 different labels from https://www.kaggle.com/datasets/micchaeelwijaya/news-topics-classification-dataset. I made sure that all the categories got a different random number of samples: Politics: 50, Sport: 87, Business: 81, Tech: 55, Religion: 169, Entertainment: 58. I also made sure the text lengths vary across the samples.
- I then put the English and Arabic texts that belong to the same category together (treating Business and Finance as the same category), and left the Medical category as Arabic texts only and the Entertainment category as English texts only.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset can be used as a benchmark for clustering word embeddings for German. The dataset contains book titles and is based on the dataset from the GermEval 2019 Shared Task on Hierarchical Classification of Blurbs. It contains 18'084 unique samples, 28 splits with 177 to 16'425 samples and 4 to 93 unique classes. Splits are built similarly to MTEB's ArxivClusteringP2P. Have a look at the German Text Embedding Clustering Benchmark (Github, Paper) for more information, datasets and evaluation… See the full description on the dataset page: https://huggingface.co/datasets/slvnwhrl/blurbs-clustering-p2p.
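A minimal sketch for pulling the data from the Hugging Face Hub with the datasets library; inspect the returned object for the actual split names and features.

from datasets import load_dataset

# Download the clustering benchmark from the Hugging Face Hub.
# If the dataset defines multiple configurations, pass the config name as well.
ds = load_dataset("slvnwhrl/blurbs-clustering-p2p")
print(ds)  # shows the available splits and their features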
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite this dataset as:
Nicolas Turenne, Ziwei Chen, Guitao Fan, Jianlong Li, Yiwen Li, Siyuan Wang, Jiaqi Zhou (2021) Mining an English-Chinese parallel Corpus of Financial News, BNU HKBU UIC, technical report
The dataset comes from Financial Times news website (https://www.ft.com/)
News articles are written in both Chinese and English.
FTIE.zip contains each document as an individual file.
FT-en-zh.rar contains all documents in a single file.
Below is a sample document from the dataset, defined by the following fields and syntax:
id;time;english_title;chinese_title;integer;english_body;chinese_body
1021892;2008-09-10T00:00:00Z;FLAW IN TWIN TOWERS REVEALED;科学家发现纽约双子塔倒塌的根本原因;1;Scientists have discovered the fundamental reason the Twin Towers collapsed on September 11 2001. The steel used in the buildings softened fatally at 500?C – far below its melting point – as a result of a magnetic change in the metal. @ The finding, announced at the BA Festival of Science in Liverpool yesterday, should lead to a new generation of steels capable of retaining strength at much higher temperatures.;科学家发现了纽约世贸双子大厦(Twin Towers)在2001年9月11日倒塌的根本原因。由于磁性变化,大厦使用的钢在500摄氏度——远远低于其熔点——时变软,从而产生致命后果。 @ 这一发现在昨日利物浦举行的BA科学节(BA Festival of Science)上公布。这应会推动能够在更高温度下保持强度的新一代钢铁的问世。
The dataset contains 60,473 bilingual documents.
The time range is from 2007 to 2020.
This dataset has been used for parallel bilingual news mining in the finance domain.
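A minimal sketch for parsing the single-file export, assuming one semicolon-delimited record per line with the field order shown above; the filename used here is a placeholder for whatever FT-en-zh.rar unpacks to.

import csv

fields = ["id", "time", "english_title", "chinese_title",
          "integer", "english_body", "chinese_body"]

# Read the semicolon-delimited records into dictionaries.
with open("FT-en-zh.txt", encoding="utf-8") as f:  # placeholder filename
    reader = csv.DictReader(f, fieldnames=fields, delimiter=";")
    for row in reader:
        print(row["english_title"], "|", row["chinese_title"])
        break  # show only the first record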
AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.
The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
The AG's news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('ag_news_subset', split='train')  # load the training split
for ex in ds.take(4):  # iterate over the first four examples
    print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset has been created for implementing a content-based recommender system in the context of the Open Research Knowledge Graph (ORKG). The recommender system accepts a research paper's title and abstract as input and recommends existing ORKG predicates that are semantically relevant to the given paper.
The paper instances in the dataset are grouped by ORKG comparisons and therefore the data.json file is more comprehensive than training_set.json and test_set.json.
data.json
The main JSON object consists of a list of comparisons. Each comparison object has an ID, a label, a list of papers and a list of predicates, whereas each paper object has an ID, label, DOI, research field, research problems and abstract. Each predicate object has an ID and a label. See an example instance below.
{ "comparisons": [ { "id": "R108331", "label": "Analysis of approaches based on required elements in way of modeling", "papers": [ { "id": "R108312", "label": "Rapid knowledge work visualization for organizations", "doi": "10.1108/13673270710762747", "research_field": { "id": "R134", "label": "Computer and Systems Architecture" }, "research_problems": [ { "id": "R108294", "label": "Enterprise engineering" } ], "abstract": "Purpose \u2013 The purpose of this contribution is to motivate a new, rapid approach to modeling knowledge work in organizational settings and to introduce a software tool that demonstrates the viability of the envisioned concept.Design/methodology/approach \u2013 Based on existing modeling structures, the KnowFlow toolset that aids knowledge analysts in rapidly conducting interviews and in conducting multi\u2010perspective analysis of organizational knowledge work is introduced.Findings \u2013 This article demonstrates how rapid knowledge work visualization can be conducted largely without human modelers by developing an interview structure that allows for self\u2010service interviews. Two application scenarios illustrate the pressing need for and the potentials of rapid knowledge work visualizations in organizational settings.Research limitations/implications \u2013 The efforts necessary for traditional modeling approaches in the area of knowledge management are often prohibitive. This contribution argues that future research needs ..." }, .... ], "predicates": [ { "id": "P37126", "label": "activities, behaviours, means [for knowledge development and/or for knowledge conveyance and transformation" }, { "id": "P36081", "label": "approach name" }, .... ] }, .... ] }
training_set.json and test_set.json
The main JSON object consists of a list of training/test instances. Each instance has an instance_id with the format (comparison_id X paper_id) and a text. The text is a concatenation of the paper's label (title) and abstract. See an example instance below.
Note that test instances are not duplicated and do not occur in the training set. Training instances are also not duplicated, BUT training papers can be duplicated in a concatenation with different comparisons.
{ "instances": [ { "instance_id": "R108331xR108301", "comparison_id": "R108331", "paper_id": "R108301", "text": "A notation for Knowledge-Intensive Processes Business process modeling has become essential for managing organizational knowledge artifacts. However, this is not an easy task, especially when it comes to the so-called Knowledge-Intensive Processes (KIPs). A KIP comprises activities based on acquisition, sharing, storage, and (re)use of knowledge, as well as collaboration among participants, so that the amount of value added to the organization depends on process agents' knowledge. The previously developed Knowledge Intensive Process Ontology (KIPO) structures all the concepts (and relationships among them) to make a KIP explicit. Nevertheless, KIPO does not include a graphical notation, which is crucial for KIP stakeholders to reach a common understanding about it. This paper proposes the Knowledge Intensive Process Notation (KIPN), a notation for building knowledge-intensive processes graphical models." }, ... ] }
Dataset Statistics:
| | Papers | Predicates | Research Fields | Research Problems |
|---|---|---|---|---|
| Min/Comparison | 2 | 2 | 1 | 0 |
| Max/Comparison | 202 | 112 | 5 | 23 |
| Avg./Comparison | 21.54 | 12.79 | 1.20 | 1.09 |
| Total | 4060 | 1816 | 46 | 178 |
Dataset Splits:
| | Papers | Comparisons |
|---|---|---|
| Training Set | 2857 | 214 |
| Test Set | 1203 | 180 |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Examples of some retrieved concept relevance records (in part, as some phrase records have been omitted here).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Output files of the application of our R software (available at https://github.com/wilkinsonlab/robust-clustering-metagenomics) to different microbiome datasets already published.
Prefixes:
Suffixes:
_All: all taxa
_Dominant: only 1% most abundant taxa
_NonDominant: remaining taxa after removing above dominant taxa
_GenusAll: taxa aggregated at genus level
_GenusDominant: taxa aggregated at genus level, keeping only the 1% most abundant taxa
_GenusNonDominant: taxa aggregated at genus level, with the 1% most abundant taxa removed
Each folder contains 3 output files related to the same input dataset:
- data.normAndDist_definitiveClustering_XXX.RData: R data file with a) a phyloseq object (including OTU table, meta-data and cluster assigned to each sample); and b) a distance matrix object.
- definitiveClusteringResults_XXX.txt: text file with assessment measures of the selected clustering.
- sampleId-cluster_pairs_XXX.txt: text file with two comma-separated columns: sampleID,clusterID
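A small sketch for reading the sample-to-cluster pairs in Python (whether the file carries a header row is not stated in the description, so adjust the header argument if needed):

import pandas as pd

# Two comma-separated columns: sampleID, clusterID.
pairs = pd.read_csv("sampleId-cluster_pairs_XXX.txt",
                    header=None, names=["sampleID", "clusterID"])
print(pairs["clusterID"].value_counts())  # number of samples per cluster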
Abstract of the associated paper:
The analysis of microbiome dynamics would allow us to elucidate patterns within microbial community evolution; however, microbiome state-transition dynamics have been scarcely studied. This is in part because a necessary first step in such analyses has not been well defined: how to deterministically describe a microbiome's "state". Clustering into states has been widely studied, although no standard has been agreed upon yet. We propose a generic, domain-independent and automatic procedure to determine a reliable set of microbiome sub-states within a specific dataset, and with respect to the conditions of the study. The robustness of sub-state identification is established by the combination of diverse techniques for stable cluster verification. We reuse four distinct longitudinal microbiome datasets to demonstrate the broad applicability of our method, analysing results with different taxa subsets, which allows the procedure to be adjusted depending on the application goal, and showing that the methodology provides a set of robust sub-states to examine in downstream studies about microbiome dynamics.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.
PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.
The document types are:
Here are a few example entries from the CSV file:
This dataset can be used for:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For a long time, it has taken a lot of time and energy for psychological workers to classify the psychological problems of college students. In order to quickly and efficiently understand the common psychological problems of college students in the region for real-time analysis in the post-epidemic era, 2,000 college students' psychological problems were selected as research data from the community question section of the "Su Xin" application, a psychological self-help and mutual aid platform for college students in Jiangsu Province. First, word segmentation, removal of stop words, establishment of word vectors, etc. were used for preprocessing the research data. Second, the data was divided into 9 common psychological problems by LDA clustering analysis, combined with previous research. Third, the text information was processed into word vectors and fed into an Attention-Based Bidirectional Long Short-Term Memory network (AB-LSTM). The experimental results showed that the proposed model achieves a higher test accuracy of 78% compared with other models.
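The paper's own code is not included here; as a rough illustration of the LDA step only, the following scikit-learn sketch clusters a toy list of already-preprocessed texts into 9 topics, mirroring the number of problem categories mentioned above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in for the segmented, stop-word-filtered question texts.
docs = [
    "exam stress sleep problems",
    "conflict with roommate dormitory",
    "anxiety about future employment",
]

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=9, random_state=0)
topic_distribution = lda.fit_transform(X)  # per-document topic weights
print(topic_distribution.shape)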
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary material for the preprint "Analyzing the Possibilities of Using the Scilit Platform to Identify Current Energy Efficiency and Conservation Issues".
Purpose of publication:
- Preparation of bibliometric data exported from the Scilit platform on energy efficiency and conservation for further analysis to identify relevant research topics.
- To identify potential issues in the processing of data exported from the Scilit platform.
- Providing colleagues with the opportunity to use the prepared data and examples of their analysis for independent research on topical issues of energy efficiency and energy conservation using materials provided by the Scilit platform.
I have prepared a preprint and plan to post it on the platform https://www.preprints.org/search?field1=title_keywords&search2=Chigarev&field2=authors&clause=AND
In this archive there is a file Energy_Efficiency-En.html with active links, for convenience in finding the full content of the tables used in the text. You can download the entire archive to your computer and use the data for your research using the algorithms and services listed in Energy_Efficiency-En.html.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Finding useful information in 30,000 papers is a hard task, and understanding the information in all those papers takes time. With advanced AI methods, we can find and extract similar patterns from text data. This method uses advanced AI to find patterns in an unsupervised way, which is equivalent to comparing every sentence with every other sentence in a brute-force manner.
This method goes beyond sentence-level co-occurrence pattern finding. Because it compares each sentence with the other sentences, similar or comparable patterns between sentences are extracted, rather than the co-occurrence patterns produced by other methods.
Because it compares concepts and patterns rather than words, hidden but related words or phrases can be found easily. In other words, it goes beyond keyword search to bring all the related sentences into one place. This also reduces the reading requirement.
This dataset was created with unsupervised learning methods, so it extracts all the sentences that are nearly similar to each other. It contains some noisy data that may not be useful, because the method is fully unsupervised.
The data was cleaned, stop words were removed, and only English-language papers were considered. The final result is 4.5 million sentences. These were processed to find relevant clusters of sentences with the desired similarity.
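The description does not name the exact embedding model; as a rough sketch of the brute-force all-pairs comparison it describes, the following uses TF-IDF vectors and cosine similarity (a real run would substitute stronger sentence embeddings).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "antibiotics are ranked by predicted clinical efficacy in adults",
    "relative rank order of predicted clinical efficacy for antibiotics",
    "the trial enrolled pediatric patients with mild disease",
]

X = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(X)  # every sentence against every other one
print(similarity.round(2))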
One example is given below.
For the full text of the papers, please refer to the https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge data.
title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : Antibiotics can be placed into the following relative rank order of predicted clinical efficacy for adults: 90% to 92% ch respiratory fluoroquinolones (gatifloxacin, levofloxacin, moxifloxacin), ceftriaxone, high-dose amoxicillin/clavulanate (4 g/250 mg/day), and amoxicillin/clavulanate (1.75 g/250 mg/day); 83% to 88% ch high-dose amoxicillin (4 g/day), amoxicillin (1.5 g/day), cefpodoxime proxetil, cefixime (based on H influenzae and M catarrhalis coverage), cefuroxime axetil, cefdinir, and TMP/SMX; 77% to 81% ch doxycycline, clindamycin (based on gram-positive coverage only), azithromycin, clarithromycin and erythromycin, and telithromycin; 65% to 66% ch cefaclor and loracarbef.
title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : Antibiotics can be placed into the following relative rank order of predicted clinical efficacy in children with ABRS: 91% to 92% ch ceftriaxone, high-dose amoxicillin/clavulanate (90 mg/6.4 mg per kg per day) and amoxicillin/clavulanate (45 mg/6.4 mg per kg per day); 82% to 87% ch highdose amoxicillin (90 mg/kg per day), amoxicillin (45 mg/kg per day), cefpodoxime proxetil, cefixime (based on H influenzae and M catarrhalis coverage only), cefuroxime axetil, cefdinir, and TMP/SMX; and 78% to 80% ch clindamycin (based on gram-positive coverage only), cefprozil, azithromycin, clarithromycin, and erythromycin; 67% to 68% ch cefaclor and loracarbef.
title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : Recommendations for initial therapy for adult patients with mild disease (who have not received antibiotics in the previous 4 to 6 weeks) include the following choices: amoxicillin/clavulanate (1.75 to 4 g/250 mg per day), amoxicillin (1.5 to 4 g/day), cefpodoxime proxetil, cefuroxime axetil, or cefdinir.
title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : Recommendations for initial therapy for children with mild disease and who have not received antibiotics in the previous 4 to 6 weeks include the following: high-dose amoxicillin/clavulanate (90 mg/6.4 mg per kg per day), amoxicillin (90 mg/kg per day), cefpodoxime proxetil, cefuroxime axetil, or cefdinir.
title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : The relative antimicrobial activity against isolates of S pneumoniae based on PK/PD breakpoints, 89 can be listed as: gatifloxacin / levofloxacin / moxifloxacin ([?]99%); ceftriaxone / high-dose amoxicillin (Ti clavulanate [extended-release or extra strength]) (95% to 97%); amoxicillin (Ti clavulanate) / clindamycin (90% to 92%) ; cefpodoxime proxetil /cefuroxime axetil / cefdinir /erythromycin /cla...
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Patent Clustering by Inventor
Dataset Description
This dataset is part of PatenTEB, a comprehensive benchmark for evaluating text embedding models on patent-specific tasks. PatenTEB comprises 15 tasks across retrieval, classification, paraphrase detection, and clustering, with 2.06 million examples designed to reflect real-world patent analysis workflows. Paper: PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding
Task Details… See the full description on the dataset page: https://huggingface.co/datasets/datalyes/clusters_inventor.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
These are BAAI/bge-m3 embeddings of the Meta Kaggle ForumTopics.csv and ForumMessages.csv
This is a supplemental dataset for the Meta Kaggle Hackathon
The embeddings were generated with a 2048-token context size and normalize_embeddings set to true. The actual text data that I fed into the embedding model can be seen in this dataset.
Check the ./sample/*.parquet folder to see what the data looks like before you download the whole dataset (16 GB). (The header image was generated with Bing Image Generator.)
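A minimal sketch for peeking at the sample files before committing to the 16 GB download (column names are not documented above, so inspect them first):

import glob
import pandas as pd

# Load only the provided sample parquet files.
sample_files = glob.glob("./sample/*.parquet")
df = pd.concat(pd.read_parquet(path) for path in sample_files)
print(df.columns.tolist())
print(len(df))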
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
A set of Monte Carlo simulated events, for the evaluation of top quarks' (and their child particles') momentum reconstruction, produced using the HEPData4ML package [1]. Specifically, the entries in this dataset correspond with top quark jets, and the momentum of the jets' constituent particles. This is a newer version of the "Top Quark Momentum Reconstruction Dataset" [2], but with sufficiently large changes to warrant this separate posting.
The dataset is saved in HDF5 format, as sets of arrays with keys (as detailed below). There are ~1.5M events, approximately broken down into the following sets:
Training: 700k events (files with "_train" suffix)
Validation: 200k events (files with "_valid" suffix)
Testing (small): 100k events (files with "_test" suffix)
Testing (large): 500k events (files with "_test_large" suffix)
The two separate types of testing files -- small and large -- are independent from one another, the former for conveniently running quicker testing and the latter for testing with a larger sample.
There are four versions of the dataset present, with the versions indicated by the filenames. The different versions correspond with whether or not fast detector simulation was performed (versus truth-level jets), and whether or not the W-boson mass was modified: one version of the dataset uses the nominal value of (m_W = 80.385 \text{ GeV}) as used by Pythia8 [3], whereas another uses a variable mW taking on 101 evenly-spaced values in the range (m_W \in [64.308, 96.462] \text{ GeV}). The dataset naming scheme is as follows:
train.h5 : jets clustered from truth-level, nominal mW
train_mW.h5: jets clustered from truth-level, variable mW
train_delphes.h5: jets clustered from Delphes outputs, nominal mW
train_delphes_mW.h5: jets clustered from Delphes outputs, variable mW
Description
13 TeV center-of-mass energy, fully hadronic top quark decays, simulated with Pythia8. ((t \rightarrow W \, b, \; W\rightarrow q \, q'))
Events are generated with leading top quark pT in [550,650] GeV. (set via Pythia8's (\hat{p}_{T,\text{ min}}) and (\hat{p}_{T,\text{ max}}) variables)
No initial- or final-state radiation (ISR/FSR), nor multi-parton interactions (MPI)
Where applicable, detector simulation is done using DELPHES [4], with the ATLAS detector card.
Clustering of particles/objects is done via FastJet [5], using the anti-kT algorithm, with (R=0.8) .
For the truth-level data, inputs to jet clustering are truth-level, final-state particles (i.e. clustering "truth jets").
For the data with detector simulation, the inputs are calorimeter towers from DELPHES.
Tower objects from DELPHES (not E-flow objects, no tracking information)
Each entry in the dataset corresponds with a single top quark jet, extracted from a (t\bar{t}) event.
All jets are matched to a parton-level top quark within (\Delta R < 0.8) . We choose the jet nearest the parton-level top quark.
Jets are required to have (|\eta| < 2), and (p_{T} > 15 \text{ GeV}).
The 200 leading (highest-pT) jet constituent four-momenta are stored in Cartesian coordinates (E,px,py,pz), sorted by decreasing pT, with zero-padding.
The jet four-momentum is stored in Cartesian coordinates (E, px, py, pz), as well as in cylindrical coordinates ((p_T,\eta,\phi,m)).
The truth (parton-level) four-momenta of the top quark, the bottom quark, the W-boson, and the quarks to which the W-boson decays, are stored in Cartesian coordinates.
In addition, the momenta of the 120 leading stable daughter particles of the W-boson are stored in Cartesian coordinates.
Description of data fields & metadata
Below is a brief description of the various fields in the dataset. The dataset also contains metadata fields, stored using HDF5's "attributes". This is used for fields that are common across many events, and stores information such as generator-level configurations (in principle, all the information is stored so as to be able to recreate the dataset with the HEPData4ML tool).
Note that fields whose keys have the prefix "jh_" correspond with output from the Johns Hopkins top tagger [6], as implemented in FastJet.
Also note that for the keys corresponding with four-momenta in Cartesian coordinates, there are rotated versions of these fields -- the data has been rotated so that the W-boson is at ((\theta=0, \phi=0)), and the b-quark is in the ((\theta=0, \phi < 0)) plane. This rotation is potentially useful for visualizations of the events.
Nobj: The number of constituents in the jet.
Pmu: The four-momenta of the jet constituents, in (E, px, py, pz). Sorted by decreasing pT and zero-padded to a length of 200.
Pmu_rot: Rotated version.
contained_daughter_sum_Pmu: Four-momentum sum of the stable daughter particles of the W-boson that fall within (\Delta R < 0.8) of the jet centroid.
contained_daughter_sum_Pmu_rot: Rotated version.
cross_section: Cross-section for the corresponding process, reported by Pythia8.
cross_section_uncertainty: Cross-section uncertainty for the corresponding process, reported by Pythia8.
energy_ratio_smeared: Ratio of the true energy of W-boson daughter particles contributing to this calorimeter tower, divided by the total smeared energy in this calorimeter tower.
Only relevant for the DELPHES datasets.
energy_ratio_truth: Ratio of the true energy of W-boson daughter particles contributing to this calorimeter tower, divided by the total true energy of particles contributing to this calorimeter tower.
The above definition is relevant only for the DELPHES datasets. For the truth-level datasets, this field is repurposed to store a value (0 or 1) indicating whether or not the given particle (whose momentum is in the Pmu field) is a W-boson daughter.
event_idx: Redundant -- used for event indexing during the event generation process.
is_signal: Redundant -- indicates whether an event is signal or background, but this is a fully signal dataset. Potentially useful if combining with other datasets produced with HEPData4ML.
jet_Pmu: Four-momentum of the jet, in (E, px, py, pz).
jet_Pmu_rot: Rotated version.
jet_Pmu_cyl: Four-momentum of the jet, in ((p_T,\eta,\phi,m)).
jet_bqq_contained_dR06: Boolean flag indicating whether or not the truth-level b and the two quarks from W decay are contained within (\Delta R < 0.6) of the jet centroid.
jet_bqq_contained_dR08: Boolean flag indicating whether or not the truth-level b and the two quarks from W decay are contained within (\Delta R < 0.8) of the jet centroid.
jet_bqq_dr_max: Maximum of (\big\lbrace \Delta R \left( \text{jet},b \right), \; \Delta R \left( \text{jet},q \right), \; \Delta R \left( \text{jet},q' \right) \big\rbrace).
jet_qq_contained_dR06: Boolean flag indicating whether or not the two quarks from W decay are contained within (\Delta R < 0.6) of the jet centroid.
jet_qq_contained_dR08: Boolean flag indicating whether or not the two quarks from W decay are contained within (\Delta R < 0.8) of the jet centroid.
jet_qq_dr_max: Maximum of (\big\lbrace \Delta R \left( \text{jet},q \right), \; \Delta R \left( \text{jet},q' \right) \big\rbrace).
jet_top_daughters_contained_dR08: Boolean flag indicating whether the final-state daughters of the top quark are within (\Delta R < 0.8) of the jet centroid. Specifically, the algorithm for this flag checks that the jet contains the stable daughters of both the b quark and the W boson. For the b and W each, daughter particles are allowed to be uncontained as long as (for each particle) the (p_T) of the sum of uncontained daughters is below (2.5 \text{ GeV}).
jh_W_Nobj: Number of constituents in the W-boson candidate identified by the JH tagger.
jh_W_Pmu: Four-momentum of the JH tagger W-boson candidate, in (E, px, py, pz).
jh_W_Pmu_rot: Rotated version.
jh_W_constituent_Pmu: Four-momentum of the constituents of the JH tagger W-boson candidate, in (E, px, py, pz).
jh_W_constituent_Pmu_rot: Rotated version.
jh_m: Mass of the JH W-boson candidate.
jh_m_resolution: Ratio of JH W-boson candidate mass, versus the true W-boson mass.
jh_pt: (p_T) of the JH W-boson candidate.
jh_pt_resolution: Ratio of JH W-boson candidate (p_T), versus the true W-boson (p_T).
jh_tag: Whether or not a jet was tagged by the JH tagger.
mc_weight: Monte Carlo weight for this event, reported by Pythia8.
process_code: Process code reported by Pythia8.
rotation_matrix: Rotation matrix for rotating the events' 3-momenta as to produce the rotated copies stored in the dataset.
truth_Nobj: Number of truth-level particles (saved in truth_Pmu).
truth_Pdg: PDG codes of the truth-level particles.
truth_Pmu: Truth-level particles: The top quark, bottom quark, W boson, q, q', and 120 leading, stable W-boson daughter particles, in (E, px, py, pz). A few of these are also stored in separate keys:
truth_Pmu_0: Top quark.
truth_Pmu_0_rot: Rotated version.
truth_Pmu_1: Bottom quark.
truth_Pmu_1_rot: Rotated version.
truth_Pmu_2: W-boson.
truth_Pmu_2_rot: Rotated version.
truth_Pmu_3: q from W decay.
truth_Pmu_3_rot: Rotated version.
truth_Pmu_4: q' from W decay.
truth_Pmu_4_rot: Rotated version.
truth_Pmu_rot: Rotated version of truth_Pmu.
The following fields correspond with metadata -- they provide the index of the corresponding metadata entry for each event:
command_line_arguments: The command-line arguments passed to HEPData4ML's run.py script.
config_file: The contents of the Python configuration file used for HEPData4ML. This, together with the command-line arguments, defines how the tool was run, what processes, jet clustering and post-processing was done, etc.
git_hash: Git hash for HEPData4ML.
timestamp: Timestamp for when the dataset was created
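A minimal h5py sketch for reading a few of the fields described above from one of the files; the array shapes in the comments follow the description (200 zero-padded constituents per jet) but should be verified, and the location of the metadata attributes is an assumption.

import h5py

with h5py.File("train.h5", "r") as f:
    print(list(f.keys()))        # all per-event fields
    nobj = f["Nobj"][:]          # number of constituents per jet
    pmu = f["Pmu"][:]            # constituent four-momenta, zero-padded to 200
    jet_pmu = f["jet_Pmu"][:]    # jet four-momentum (E, px, py, pz)
    print(nobj.shape, pmu.shape, jet_pmu.shape)
    print(dict(f.attrs))         # metadata (assumed to be file-level attributes)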
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Huggingface Hub: link
The ag_news dataset provides a new opportunity for text classification research. It is a large dataset consisting of a training set of 10,000 examples and a test set of 5,000 examples. The examples are split evenly into two classes: positive and negative. This makes the dataset well-suited for research into text classification methods
If you're looking to do text classification research, the ag_news dataset is a great new dataset to use. It consists of a training set of 10,000 examples and a test set of 5,000 examples, split evenly between positive and negative class labels. The data is well-balanced and should be suitable for many different text classification tasks
- This dataset can be used to train a text classifier to automatically categorize news articles into positive and negative categories.
- This dataset can be used to develop a system that can identify positive and negative sentiment in news articles.
- This dataset can be used to study the difference in how positive and negative news is reported by different media outlets
AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine that has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.
License
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:--------------|:-----------------------------------------|
| text | The text of the news article. (string) |
| label | The label of the news article. (integer) |

File: test.csv

| Column name | Description |
|:--------------|:-----------------------------------------|
| text | The text of the news article. (string) |
| label | The label of the news article. (integer) |
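A minimal sketch for loading the two CSV splits described above with pandas:

import pandas as pd

train = pd.read_csv("train.csv")  # columns: text, label
test = pd.read_csv("test.csv")
print(train["label"].value_counts())
print(train.loc[0, "text"][:100])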
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this paper, we consider the problem of modeling a matrix of count data, where multiple features are observed as counts over a number of samples. Due to the nature of the data generating mechanism, such data are often characterized by a high number of zeros and overdispersion. In order to take into account the skewness and heterogeneity of the data, some type of normalization and regularization is necessary for conducting inference on the occurrences of features across samples. We propose a zero-inflated Poisson mixture modeling framework that incorporates a model-based normalization through prior distributions with mean constraints, as well as a feature selection mechanism, which allows us to identify a parsimonious set of discriminatory features, and simultaneously cluster the samples into homogeneous groups. We show how our approach improves on the accuracy of the clustering with respect to more standard approaches for the analysis of count data, by means of a simulation study and an application to a bag-of-words benchmark data set, where the features are represented by the frequencies of occurrence of each word.
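For reference, the zero-inflated Poisson distribution that the mixture builds on has the standard textbook form (notation here is generic, not taken from the paper): for mixing weight (\pi) and rate (\lambda), (P(Y = 0) = \pi + (1 - \pi)\,e^{-\lambda}) and (P(Y = k) = (1 - \pi)\,\dfrac{\lambda^{k} e^{-\lambda}}{k!}) for (k = 1, 2, \dots), where (\pi) is the probability of a structural (excess) zero.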
After the Cold War, some countries gradually sought regional cooperation when they could not handle various transnational challenges alone. The Shanghai Cooperation Organization (SCO) is a good example: it brought Central Asian countries together. This paper applies text-mining methods, using co-word analysis, a co-occurrence matrix, cluster analysis, and a strategic diagram to analyze the selected newspaper articles quantitatively and visually. In order to investigate the Chinese government's attitude toward the SCO, this study collected data from the China Core Newspaper Full-text Database, which contains high-impact government newspapers revealing the Chinese government's perception of the SCO. This study characterizes the changing role of the SCO as perceived by the Chinese government from 2001 to 2019. Beijing's changing expectations in each of the three identified subperiods are described.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Geostatistics analyzes and predicts the values associated with spatial or spatial-temporal phenomena. It incorporates the spatial (and in some cases temporal) coordinates of the data within the analyses. It is a practical means of describing spatial patterns and interpolating values for locations where samples were not taken (and measures the uncertainty of those values, which is critical to informed decision making). This archive contains results of geostatistical analysis of COVID-19 case counts for all available US counties. Test results were obtained with ArcGIS Pro (ESRI). Sources are state health departments, which are scraped and aggregated by the Johns Hopkins Coronavirus Resource Center and then pre-processed by MappingSupport.com.
This update of the Zenodo dataset (version 6) consists of three compressed archives containing geostatistical analyses of SARS-CoV-2 testing data. This dataset utilizes many of the geostatistical techniques used in previous versions of this Zenodo archive, but has been significantly expanded to include analyses of up-to-date U.S. COVID-19 case data (from March 24th to September 8th, 2020):
Archive #1: “1.Geostat. Space-Time analysis of SARS-CoV-2 in the US (Mar24-Sept6).zip” – results of a geostatistical analysis of COVID-19 cases incorporating spatially-weighted hotspots that are conserved over one-week timespans. Results are reported starting from when U.S. COVID-19 case data first became available (March 24th, 2020) for 25 consecutive 1-week intervals (March 24th through to September 6th, 2020). Hotspots, where found, are reported in each individual state, rather than the entire continental United States.
Archive #2: "2.Geostat. Spatial analysis of SARS-CoV-2 in the US (Mar24-Sept8).zip" – the results from geostatistical spatial analyses only of corrected COVID-19 case data for the continental United States, spanning the period from March 24th through September 8th, 2020. The geostatistical techniques utilized in this archive includes ‘Hot Spot’ analysis and ‘Cluster and Outlier’ analysis.
Archive #3: "3.Kriging and Densification of SARS-CoV-2 in LA and MA.zip" – this dataset provides preliminary kriging and densification analysis of COVID-19 case data for certain dates within the U.S. states of Louisiana and Massachusetts.
These archives consist of map files (as both static images and as animations) and data files (including text files which contain the underlying data of said map files [where applicable]) which were generated when performing the following Geostatistical analyses: Hot Spot analysis (Getis-Ord Gi*) [‘Archive #1’: consecutive weeklong Space-Time Hot Spot analysis; ‘Archive #2’: daily Hot Spot Analysis], Cluster and Outlier analysis (Anselin Local Moran's I) [‘Archive #2’], Spatial Autocorrelation (Global Moran's I) [‘Archive #2’], and point-to-point comparisons with Kriging and Densification analysis [‘Archive #3’].
The Word document provided ("Description-of-Archive.Updated-Geostatistical-Analysis-of-SARS-CoV-2 (version 6).docx") details the contents of each file and folder within these three archives and gives general interpretations of these results.