61 datasets found
  1. Document Clustering

    • kaggle.com
    zip
    Updated Mar 7, 2022
    Cite
    Arai Seisenbek (2022). Document Clustering [Dataset]. https://www.kaggle.com/datasets/nenriki/document-clustering
    Explore at:
    zip (1587 bytes)
    Dataset updated
    Mar 7, 2022
    Authors
    Arai Seisenbek
    Description

    Assignment 1. Text similarity and Agglomerative Document Clustering.

    Learning outcomes:

    1. Read texts from a file and split them into words.
    2. Transform texts into vector spaces and calculate distances in those spaces.
    3. Bag-of-words and TF/IDF vectorizers.

    Task 1.

    Please download the text file. It contains 22 sentences related to the «cat» topic: the cat (animal), the UNIX utility cat for displaying the contents of files, and versions of the OS X operating system named after the feline family. Your task is to find the two sentences that are closest in meaning to the first sentence in the document («In comparison to dogs, cats have not undergone .......»). We will use the cosine distance as the measure of proximity.

    Steps:
    1. Open the file.
    2. Each line is one sentence. Convert them all to lower case using the string function lower(). EXAMPLE: in comparison to dogs, cats have not undergone major changes during the domestication process.
    3. Tokenization, i.e. splitting the sentences into words. For that purpose you can use regular expressions that split on spaces or any other characters that are not letters: re.split('[^a-z]', t). Do not forget to remove empty words. EXAMPLE: ['in', 'comparison', 'to', 'dogs', '', 'cats', 'have', 'not', 'undergone', 'major', 'changes', 'during', 'the', 'domestication', 'process'].
    4. Make a list of all the words that appear in the sentences (all words are unique) and give each word an index from 0 to #of_the_unique_words. You can use a dict. Example: {0: 'mac', 1: 'permanently', 2: 'osx', 3: 'download', 4: 'between', 5: 'based', 6: 'which', ............., 252: 'safer', 253: 'will'}. Hint: we have 254 unique words.
    5. Create a matrix with N x D dimensions, where N is the number of sentences and D is the number of unique words (22 x 254). Fill it in: the element with index (i, j) in this matrix must be equal to the number of occurrences of the j-th word in the i-th sentence (bag of words).

    6. Find the cosine distance from the first sentence to all the other sentences. Which two sentences are closest to the first sentence? You can use scipy.spatial.distance.cosine.
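
    Below is a minimal Python sketch of the whole Task 1 pipeline; the file name "sentences.txt" is a placeholder for the text file provided with the assignment.

    import re
    import numpy as np
    from scipy.spatial.distance import cosine

    # Steps 1-2: read the file and lower-case each sentence (one sentence per line).
    with open("sentences.txt") as f:
        sentences = [line.lower() for line in f if line.strip()]

    # Step 3: split on anything that is not a lowercase letter, then drop empty tokens.
    tokenized = [[w for w in re.split('[^a-z]', s) if w] for s in sentences]

    # Step 4: give every unique word an index (the assignment expects 254 of them).
    vocab = {}
    for tokens in tokenized:
        for w in tokens:
            if w not in vocab:
                vocab[w] = len(vocab)

    # Step 5: bag-of-words matrix; element (i, j) counts word j in sentence i.
    counts = np.zeros((len(sentences), len(vocab)))
    for i, tokens in enumerate(tokenized):
        for w in tokens:
            counts[i, vocab[w]] += 1

    # Step 6: cosine distance from the first sentence to every other sentence.
    dists = [(i, cosine(counts[0], counts[i])) for i in range(1, len(sentences))]
    print(sorted(dists, key=lambda d: d[1])[:2])  # the two closest sentences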

    Of course, in Task 1 we implemented a very simple method. For example, in this method «cat» and «cats» are two different words, even though the meaning is the same.

    Task 2.

    For the second task, repeat the same steps from Task 1 (steps 1-4). In this task you will create a Term Frequency - Inverse Document Frequency (TF-IDF) matrix. Find the cosine distance from the first sentence to all the other sentences. Which two sentences are closest to the first sentence? You can use scipy.spatial.distance.cosine. Is there any difference from the result of the previous task? Note: you should not use any existing libraries for TF/IDF. All the steps are similar to the previous example.
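
    A minimal sketch of Task 2, reusing the counts array from the Task 1 sketch above; the IDF formula shown is one common variant, so the exact numbers may differ slightly from other definitions.

    import numpy as np
    from scipy.spatial.distance import cosine

    tf = counts / counts.sum(axis=1, keepdims=True)  # term frequency per sentence
    df = (counts > 0).sum(axis=0)                    # number of sentences containing each word
    idf = np.log(counts.shape[0] / df)               # inverse document frequency
    tfidf = tf * idf

    dists = [(i, cosine(tfidf[0], tfidf[i])) for i in range(1, tfidf.shape[0])]
    print(sorted(dists, key=lambda d: d[1])[:2])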

    Task 3.

    Please run the hierarchical clustering algorithm on the results of Task 1 and Task 2, plot the dendrograms, and explain your results.

    NOTE: by default scipy.cluster.hierarchy uses the Euclidean distance. You should change it to the cosine distance.
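
    A minimal sketch of Task 3, again reusing the matrices built above; linkage is given metric="cosine" explicitly, and an average linkage is used because methods such as "ward" only support Euclidean distances.

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    Z = linkage(counts, method="average", metric="cosine")  # repeat with tfidf for Task 2
    dendrogram(Z)
    plt.show()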

  2. Text Document Classification Dataset

    • kaggle.com
    zip
    Updated Dec 4, 2023
    Cite
    sunil thite (2023). Text Document Classification Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/text-document-classification-dataset
    Explore at:
    zip (1941393 bytes)
    Dataset updated
    Dec 4, 2023
    Authors
    sunil thite
    Description

    This is a text document classification dataset containing 2225 text documents in five categories: politics, sport, tech, entertainment, and business. It can be used for document classification and document clustering.

    About Dataset - The dataset contains two features, text and label. - No. of Rows: 2225 - No. of Columns: 2

    Text: contains the text of documents from the different categories. Label: contains the integer label for the five categories: 0, 1, 2, 3, 4.

    1. Politics = 0
    2. Sport = 1
    3. Technology = 2
    4. Entertainment =3
    5. Business = 4
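
    A minimal sketch of training a five-class classifier on this dataset; the CSV file name and the exact column names ("text", "label") are assumptions, so check the downloaded file's header first.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    df = pd.read_csv("df.csv")  # hypothetical file name
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=0)

    # TF-IDF features + logistic regression as a simple, strong baseline.
    clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                        LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    print("accuracy:", clf.score(X_test, y_test))
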
  3. bilingual text clusters -English and Arabic-

    • kaggle.com
    zip
    Updated Apr 25, 2025
    Cite
    Aml Hassan Esmil (2025). bilingual text clusters -English and Arabic- [Dataset]. https://www.kaggle.com/datasets/amlhassan/bilingual-text-clusters
    Explore at:
    zip (1168763 bytes)
    Dataset updated
    Apr 25, 2025
    Authors
    Aml Hassan Esmil
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    I built the dataset by:
    - Sampling 500 texts with 6 different labels from https://medium.com/r/?url=https%3A%2F%2Fwww.kaggle.com%2Fdatasets%2Fhaithemhermessi%2Fsanad-dataset%2Fdata%3Fselect%3DCulture. I made sure that each category received a different random number of samples (Politics: 94, Sports: 110, Finance: 83, Tech: 67, Religion: 66, Medical: 80) and that the text lengths vary across the samples.
    - Sampling 500 texts with 6 different labels from https://www.kaggle.com/datasets/micchaeelwijaya/news-topics-classification-dataset. Again, each category received a different random number of samples (Politics: 50, Sport: 87, Business: 81, Tech: 55, Religion: 169, Entertainment: 58) and the text lengths vary across the samples.
    - I then put the English and Arabic texts that belong to the same category together (treating Business and Finance as the same category), and left the Medical category as Arabic-only texts and the Entertainment category as English-only texts.

  4. blurbs-clustering-p2p

    • huggingface.co
    Updated Apr 22, 2023
    + more versions
    Cite
    Silvan (2023). blurbs-clustering-p2p [Dataset]. https://huggingface.co/datasets/slvnwhrl/blurbs-clustering-p2p
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 22, 2023
    Authors
    Silvan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset can be used as a benchmark for clustering word embeddings for German. The dataset contains book titles and is based on the dataset from the GermEval 2019 Shared Task on Hierarchical Classification of Blurbs. It contains 18'084 unique samples, 28 splits with 177 to 16'425 samples and 4 to 93 unique classes. Splits are built similarly to MTEB's ArxivClusteringP2P. Have a look at the German Text Embedding Clustering Benchmark (Github, Paper) for more info, datasets and evaluation… See the full description on the dataset page: https://huggingface.co/datasets/slvnwhrl/blurbs-clustering-p2p.

  5. Financial News dataset for text mining

    • data.niaid.nih.gov
    Updated Oct 23, 2021
    Cite
    turenne nicolas (2021). Financial News dataset for text mining [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5569112
    Explore at:
    Dataset updated
    Oct 23, 2021
    Dataset provided by
    INRAE
    Authors
    turenne nicolas
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please cite this dataset as:

    Nicolas Turenne, Ziwei Chen, Guitao Fan, Jianlong Li, Yiwen Li, Siyuan Wang, Jiaqi Zhou (2021) Mining an English-Chinese parallel Corpus of Financial News, BNU HKBU UIC, technical report

    The dataset comes from the Financial Times news website (https://www.ft.com/).

    News articles are written in both Chinese and English.

    FTIE.zip contains each document as an individual file.

    FT-en-zh.rar contains all documents in a single file.

    Below is a sample document from the dataset, defined by the following fields and syntax:

    id;time;english_title;chinese_title;integer;english_body;chinese_body

    1021892;2008-09-10T00:00:00Z;FLAW IN TWIN TOWERS REVEALED;科学家发现纽约双子塔倒塌的根本原因;1;Scientists have discovered the fundamental reason the Twin Towers collapsed on September 11 2001. The steel used in the buildings softened fatally at 500?C – far below its melting point – as a result of a magnetic change in the metal. @ The finding, announced at the BA Festival of Science in Liverpool yesterday, should lead to a new generation of steels capable of retaining strength at much higher temperatures.;科学家发现了纽约世贸双子大厦(Twin Towers)在2001年9月11日倒塌的根本原因。由于磁性变化,大厦使用的钢在500摄氏度——远远低于其熔点——时变软,从而产生致命后果。 @ 这一发现在昨日利物浦举行的BA科学节(BA Festival of Science)上公布。这应会推动能够在更高温度下保持强度的新一代钢铁的问世。
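
    A minimal sketch of parsing one record of this semicolon-delimited format; the extracted file name and the assumption that only the last (Chinese body) field may itself contain ';' are simplifications, so adjust to the actual archive contents.

    FIELDS = ["id", "time", "english_title", "chinese_title", "integer",
              "english_body", "chinese_body"]

    def parse_record(line: str) -> dict:
        # Split on the first 6 semicolons only, so ';' inside the last field survives.
        return dict(zip(FIELDS, line.rstrip("\n").split(";", 6)))

    with open("FT-en-zh.txt", encoding="utf-8") as f:  # hypothetical extracted file name
        for line in f:
            doc = parse_record(line)
            print(doc["id"], doc["english_title"])
            break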

    The dataset contains 60,473 bilingual documents.

    The time range is from 2007 to 2020.

    This dataset has been used for parallel bilingual news mining in the finance domain.

  6. ag_news_subset

    • tensorflow.org
    Updated Dec 6, 2022
    + more versions
    Cite
    (2022). ag_news_subset [Dataset]. http://identifiers.org/arxiv:1509.01626
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

    The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

    The AG's news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('ag_news_subset', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  7. Dataset - Clustering Semantic Predicates in the Open Research Knowledge...

    • data.niaid.nih.gov
    Updated Aug 8, 2022
    Cite
    Arab Oghli, Omar (2022). Dataset - Clustering Semantic Predicates in the Open Research Knowledge Graph [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6513498
    Explore at:
    Dataset updated
    Aug 8, 2022
    Authors
    Arab Oghli, Omar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset has been created for implementing a content-based recommender system in the context of the Open Research Knowledge Graph (ORKG). The recommender system accepts a research paper's title and abstract as input and recommends existing ORKG predicates that are semantically relevant to the given paper.

    The paper instances in the dataset are grouped by ORKG comparisons and therefore the data.json file is more comprehensive than training_set.json and test_set.json.

    data.json

    The main JSON object consists of a list of comparisons. Each comparison object has an ID, a label, a list of papers and a list of predicates. Each paper object has an ID, label, DOI, research field, research problems and abstract. Each predicate object has an ID and a label. See an example instance below.

    { "comparisons": [ { "id": "R108331", "label": "Analysis of approaches based on required elements in way of modeling", "papers": [ { "id": "R108312", "label": "Rapid knowledge work visualization for organizations", "doi": "10.1108/13673270710762747", "research_field": { "id": "R134", "label": "Computer and Systems Architecture" }, "research_problems": [ { "id": "R108294", "label": "Enterprise engineering" } ], "abstract": "Purpose \u2013 The purpose of this contribution is to motivate a new, rapid approach to modeling knowledge work in organizational settings and to introduce a software tool that demonstrates the viability of the envisioned concept.Design/methodology/approach \u2013 Based on existing modeling structures, the KnowFlow toolset that aids knowledge analysts in rapidly conducting interviews and in conducting multi\u2010perspective analysis of organizational knowledge work is introduced.Findings \u2013 This article demonstrates how rapid knowledge work visualization can be conducted largely without human modelers by developing an interview structure that allows for self\u2010service interviews. Two application scenarios illustrate the pressing need for and the potentials of rapid knowledge work visualizations in organizational settings.Research limitations/implications \u2013 The efforts necessary for traditional modeling approaches in the area of knowledge management are often prohibitive. This contribution argues that future research needs ..." }, .... ], "predicates": [ { "id": "P37126", "label": "activities, behaviours, means [for knowledge development and/or for knowledge conveyance and transformation" }, { "id": "P36081", "label": "approach name" }, .... ] }, .... ] }

    training_set.json and test_set.json

    The main JSON object consists of a list of training/test instances. Each instance has an instance_id with the format (comparison_id X paper_id) and a text. The text is a concatenation of the paper's label (title) and abstract. See an example instance below.

    Note that test instances are not duplicated and do not occur in the training set. Training instances are also not duplicated, BUT training papers can be duplicated in a concatenation with different comparisons.

    { "instances": [ { "instance_id": "R108331xR108301", "comparison_id": "R108331", "paper_id": "R108301", "text": "A notation for Knowledge-Intensive Processes Business process modeling has become essential for managing organizational knowledge artifacts. However, this is not an easy task, especially when it comes to the so-called Knowledge-Intensive Processes (KIPs). A KIP comprises activities based on acquisition, sharing, storage, and (re)use of knowledge, as well as collaboration among participants, so that the amount of value added to the organization depends on process agents' knowledge. The previously developed Knowledge Intensive Process Ontology (KIPO) structures all the concepts (and relationships among them) to make a KIP explicit. Nevertheless, KIPO does not include a graphical notation, which is crucial for KIP stakeholders to reach a common understanding about it. This paper proposes the Knowledge Intensive Process Notation (KIPN), a notation for building knowledge-intensive processes graphical models." }, ... ] }

    Dataset Statistics:

    | - | Papers | Predicates | Research Fields | Research Problems |
    |---|---|---|---|---|
    | Min/Comparison | 2 | 2 | 1 | 0 |
    | Max/Comparison | 202 | 112 | 5 | 23 |
    | Avg./Comparison | 21.54 | 12.79 | 1.20 | 1.09 |
    | Total | 4060 | 1816 | 46 | 178 |

    Dataset Splits:

    | - | Papers | Comparisons |
    |---|---|---|
    | Training Set | 2857 | 214 |
    | Test Set | 1203 | 180 |
  8. The examples of some retrieved concept relevance records (in part. As some...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Yuanchao Liu; Ming Liu; Xin Wang (2023). The examples of some retrieved concept relevance records (in part. As some phrases records have been omitted here). [Dataset]. http://doi.org/10.1371/journal.pone.0117390.t002
    Explore at:
    xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS, http://plos.org/
    Authors
    Yuanchao Liu; Ming Liu; Xin Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The examples of some retrieved concept relevance records (in part. As some phrases records have been omitted here).

  9. Data from: Automatic Definition of Robust Microbiome Sub-states in...

    • zenodo.org
    txt, zip
    Updated Jan 24, 2020
    Cite
    Beatriz García-Jiménez; Mark D. Wilkinson; Beatriz García-Jiménez; Mark D. Wilkinson (2020). Data from: Automatic Definition of Robust Microbiome Sub-states in Longitudinal Data [Dataset]. http://doi.org/10.5281/zenodo.167376
    Explore at:
    zip, txt
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Beatriz García-Jiménez; Mark D. Wilkinson; Beatriz García-Jiménez; Mark D. Wilkinson
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Output files of the application of our R software (available at https://github.com/wilkinsonlab/robust-clustering-metagenomics) to different microbiome datasets already published.

    Prefixes:

    Suffixes:

    • _All: all taxa

    • _Dominant: only 1% most abundant taxa

    • _NonDominant: remaining taxa after removing above dominant taxa

    • _GenusAll: taxa aggregated at genus level

    • _GenusDominant: taxa aggregated at genus level and then keeping only the 1% most abundant taxa

    • _GenusNonDominant: taxa aggregated at genus level and then removing the 1% most abundant taxa

    Each folder contains 3 output files related to the same input dataset:
    - data.normAndDist_definitiveClustering_XXX.RData: R data file with a) a phyloseq object (including OTU table, meta-data and cluster assigned to each sample); and b) a distance matrix object.
    - definitiveClusteringResults_XXX.txt: text file with assessment measures of the selected clustering.
    - sampleId-cluster_pairs_XXX.txt: text file with two comma-separated columns: sampleID,clusterID
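
    A minimal sketch of reading the sampleID/cluster assignments from one of these files; "XXX" stands for the dataset suffix described above, and header=None is an assumption about the file having no header row.

    import pandas as pd

    pairs = pd.read_csv("sampleId-cluster_pairs_XXX.txt", header=None,
                        names=["sampleID", "clusterID"])
    print(pairs["clusterID"].value_counts())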

    Abstract of the associated paper:

    The analysis of microbiome dynamics would allow us to elucidate patterns within microbial community evolution; however, microbiome state-transition dynamics have been scarcely studied. This is in part because a necessary first-step in such analyses has not been well-defined: how to deterministically describe a microbiome's "state". Clustering in states have been widely studied, although no standard has been concluded yet. We propose a generic, domain-independent and automatic procedure to determine a reliable set of microbiome sub-states within a specific dataset, and with respect to the conditions of the study. The robustness of sub-state identification is established by the combination of diverse techniques for stable cluster verification. We reuse four distinct longitudinal microbiome datasets to demonstrate the broad applicability of our method, analysing results with different taxa subset allowing to adjust it depending on the application goal, and showing that the methodology provides a set of robust sub-states to examine in downstream studies about dynamics in microbiome.

  10. Company Documents Dataset

    • kaggle.com
    zip
    Updated May 23, 2024
    + more versions
    Cite
    Ayoub Cherguelaine (2024). Company Documents Dataset [Dataset]. https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
    Explore at:
    zip (9789538 bytes)
    Dataset updated
    May 23, 2024
    Authors
    Ayoub Cherguelaine
    License

    Apache License v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.

    Dataset Content

    PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.

    The document types are:

    • Invoices: Detailed records of transactions between a buyer and a seller.
    • Inventory Reports: Records of inventory levels, including items in stock and units sold.
    • Purchase Orders: Requests made by a buyer to a seller to purchase products or services.
    • Shipping Orders: Instructions for the delivery of goods to specified recipients.

    Example Entries

    Here are a few example entries from the CSV file:

    Shipping Order:

    • Order ID: 10718
    • Shipping Details: "Ship Name: Königlich Essen, Ship Address: Maubelstr. 90, Ship City: ..."
    • Word Count: 120

    Invoice:

    • Order ID: 10707
    • Customer Details: "Customer ID: Arout, Order Date: 2017-10-16, Contact Name: Th..."
    • Word Count: 66

    Purchase Order:

    • Order ID: 10892
    • Order Details: "Order Date: 2018-02-17, Customer Name: Catherine Dewey, Products: Product ..."
    • Word Count: 26

    Applications

    This dataset can be used for:

    • Text Classification: Train models to classify documents into their respective categories.
    • Information Extraction: Extract specific fields and details from the documents.
    • Document Clustering: Group similar documents together based on their content (see the sketch after this list).
    • OCR and Text Mining: Improve OCR (Optical Character Recognition) models and text mining techniques using real-world data.
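
    A minimal sketch of the document-clustering application: TF-IDF features plus k-means over the extracted text column. The CSV file name and the column name "text" are assumptions, so check the header of the file shipped with the dataset.

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pd.read_csv("company_documents.csv")  # hypothetical file name
    X = TfidfVectorizer(stop_words="english", max_features=20000).fit_transform(df["text"])

    # Four clusters, matching the four document types described above.
    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
    df["cluster"] = km.labels_
    print(df["cluster"].value_counts())
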
  11. Data_Sheet_1_A deep learning-based prediction model of college students’...

    • figshare.com
    • frontiersin.figshare.com
    docx
    Updated Jun 14, 2023
    Cite
    Yongheng Liu; Yajing Shen; Zhiyong Cai (2023). Data_Sheet_1_A deep learning-based prediction model of college students’ psychological problem categories for post-epidemic era—Taking college students in Jiangsu Province, China as an example.docx [Dataset]. http://doi.org/10.3389/fpsyg.2022.975493.s001
    Explore at:
    docx
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    Frontiers
    Authors
    Yongheng Liu; Yajing Shen; Zhiyong Cai
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    For a long time, it takes a lot of time and energy for psychological workers to classify the psychological problems of college students. In order to quickly and efficiently understand the common psychological problems of college students in the region for real-time analysis in the post-epidemic era, 2,000 college students’ psychological problems were selected as research data in the community question section of the “Su Xin” application, a psychological self-help and mutual aid platform for college students in Jiangsu Province. First, word segmentation, removal of stop words, establishment of word vectors, etc. were used for the preprocessing of research data. Secondly, it was divided into 9 common psychological problems by LDA clustering analysis, which also combined with previous researches. Thirdly, the text information was processed into word vectors and transferred to the Attention-Based Bidirectional Long Short-Term Memory Networks (AB-LSTM). The experimental results showed that the proposed model has a higher test accuracy of 78% compared with other models.

  12. Supplementary material for preprint "Analyzing the Possibilities of Using...

    • figshare.com
    png
    Updated Apr 10, 2024
    Cite
    Boris Chigarev (2024). Supplementary material for preprint "Analyzing the Possibilities of Using the Scilit Platform to Identify Current Energy Efficiency and Conservation Issues" [Dataset]. http://doi.org/10.6084/m9.figshare.25574058.v1
    Explore at:
    png
    Dataset updated
    Apr 10, 2024
    Dataset provided by
    Figshare, http://figshare.com/
    Authors
    Boris Chigarev
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary material for the preprint "Analyzing the Possibilities of Using the Scilit Platform to Identify Current Energy Efficiency and Conservation Issues".

    Purpose of publication:
    - Preparation of bibliometric data exported from the Scilit platform on energy efficiency and conservation for further analysis to identify relevant research topics.
    - To identify potential issues in the processing of data exported from the Scilit platform.
    - Providing colleagues with the opportunity to use the prepared data and examples of their analysis for independent research on topical issues of energy efficiency and energy conservation, using materials provided by the Scilit platform.

    I have prepared a preprint and plan to post it on the platform https://www.preprints.org/search?field1=title_keywords&search2=Chigarev&field2=authors&clause=AND. In this archive there is a file Energy_Efficiency-En.html with active links, for convenience in finding the full content of the tables used in the text. You can download the entire archive to your computer and use the data for your research using the algorithms and services listed in Energy_Efficiency-En.html.

  13. COVID-19 Open Research Dataset Sentence Clustering

    • kaggle.com
    zip
    Updated Apr 6, 2020
    Cite
    Rajasankar Viswanathan (2020). COVID-19 Open Research Dataset Sentence Clustering [Dataset]. https://www.kaggle.com/rajasankar/covid19-open-research-dataset-sentence-clustering
    Explore at:
    zip (74817024 bytes)
    Dataset updated
    Apr 6, 2020
    Authors
    Rajasankar Viswanathan
    License

    Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Context

    Finding useful information in 30,000 papers is a hard task, and understanding the information in all those papers takes time. With advanced AI methods, we can find and extract similar patterns from text data. This method uses advanced AI to find patterns in an unsupervised way, which is equivalent to comparing every sentence with every other sentence in a brute-force manner.

    How this is different from other AI methods

    This method goes beyond sentence-level co-occurrence pattern finding. As it compares each sentence with the other sentences, similar or comparable patterns between the sentences are extracted, rather than the co-occurrence patterns produced by other methods.

    As it compares concepts and patterns rather than words, hidden but related words or phrases can be found easily. In other words, it goes beyond keyword search to bring all the related sentences into one place. This also reduces the amount of reading required.

    Content

    This dataset groups similar sentences using unsupervised learning methods, extracting all sentences that are nearly similar. It contains some noisy data that may not be useful, because the method is fully unsupervised.

    The data was cleaned, stopwords were removed, and only English-language papers were considered. The final result is 4.5 million sentences, which were processed to find relevant clusters of sentences with the desired similarity.

    One example is given below.

    For full text of the paper, please refer to https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge data.

    title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : Antibiotics can be placed into the following relative rank order of predicted clinical efficacy for adults: 90% to 92% ch respiratory fluoroquinolones (gatifloxacin, levofloxacin, moxifloxacin), ceftriaxone, high-dose amoxicillin/clavulanate (4 g/250 mg/day), and amoxicillin/clavulanate (1.75 g/250 mg/day); 83% to 88% ch high-dose amoxicillin (4 g/day), amoxicillin (1.5 g/day), cefpodoxime proxetil, cefixime (based on H influenzae and M catarrhalis coverage), cefuroxime axetil, cefdinir, and TMP/SMX; 77% to 81% ch doxycycline, clindamycin (based on gram-positive coverage only), azithromycin, clarithromycin and erythromycin, and telithromycin; 65% to 66% ch cefaclor and loracarbef.

    title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : Antibiotics can be placed into the following relative rank order of predicted clinical efficacy in children with ABRS: 91% to 92% ch ceftriaxone, high-dose amoxicillin/clavulanate (90 mg/6.4 mg per kg per day) and amoxicillin/clavulanate (45 mg/6.4 mg per kg per day); 82% to 87% ch highdose amoxicillin (90 mg/kg per day), amoxicillin (45 mg/kg per day), cefpodoxime proxetil, cefixime (based on H influenzae and M catarrhalis coverage only), cefuroxime axetil, cefdinir, and TMP/SMX; and 78% to 80% ch clindamycin (based on gram-positive coverage only), cefprozil, azithromycin, clarithromycin, and erythromycin; 67% to 68% ch cefaclor and loracarbef.

    title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : Recommendations for initial therapy for adult patients with mild disease (who have not received antibiotics in the previous 4 to 6 weeks) include the following choices: amoxicillin/clavulanate (1.75 to 4 g/250 mg per day), amoxicillin (1.5 to 4 g/day), cefpodoxime proxetil, cefuroxime axetil, or cefdinir.

    title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : Recommendations for initial therapy for children with mild disease and who have not received antibiotics in the previous 4 to 6 weeks include the following: high-dose amoxicillin/clavulanate (90 mg/6.4 mg per kg per day), amoxicillin (90 mg/kg per day), cefpodoxime proxetil, cefuroxime axetil, or cefdinir.

    title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : The relative antimicrobial activity against isolates of S pneumoniae based on PK/PD breakpoints, 89 can be listed as: gatifloxacin / levofloxacin / moxifloxacin ([?]99%); ceftriaxone / high-dose amoxicillin (Ti clavulanate [extended-release or extra strength]) (95% to 97%); amoxicillin (Ti clavulanate) / clindamycin (90% to 92%) ; cefpodoxime proxetil /cefuroxime axetil / cefdinir /erythromycin /cla...

  14. clusters_inventor

    • huggingface.co
    Updated Oct 28, 2025
    + more versions
    Cite
    iliass ayaou (2025). clusters_inventor [Dataset]. https://huggingface.co/datasets/datalyes/clusters_inventor
    Explore at:
    Dataset updated
    Oct 28, 2025
    Authors
    iliass ayaou
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Patent Clustering by Inventor

    Dataset Description

    This dataset is part of PatenTEB, a comprehensive benchmark for evaluating text embedding models on patent-specific tasks. PatenTEB comprises 15 tasks across retrieval, classification, paraphrase detection, and clustering, with 2.06 million examples designed to reflect real-world patent analysis workflows. Paper: PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding

    Task Details… See the full description on the dataset page: https://huggingface.co/datasets/datalyes/clusters_inventor.
  15. MetaKaggle Forum Data BGE-M3 Embeddings

    • kaggle.com
    zip
    Updated Jun 2, 2025
    Cite
    BwandoWando (2025). MetaKaggle Forum Data BGE-M3 Embeddings [Dataset]. https://www.kaggle.com/datasets/bwandowando/meta-kaggle-forum-data-embeddings-with-baaibge-m3
    Explore at:
    zip (14351380269 bytes)
    Dataset updated
    Jun 2, 2025
    Authors
    BwandoWando
    License

    Apache License v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Context

    These are BAAI/bge-m3 embeddings of the Meta Kaggle ForumTopics.csv and ForumMessages.csv

    Intended purpose

    This is a supplemental dataset for the Meta Kaggle Hackathon

    How I preprocessed the text data

    1. I removed html elements using BeautifulSoup
    2. I replaced any URL value with a placeholder <url> value
    3. I removed emojis and symbols
    4. I replaced 1 or more carriage returns with just a single white space
    5. BAAI/bge-m3 was set to 2048 tokens context size and normalize_embeddings is set to true
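
    A rough sketch of the preprocessing steps listed above followed by BGE-M3 encoding via sentence-transformers; the exact cleaning rules and model wrapper used by the dataset author may differ, and the ASCII filter is only a crude stand-in for emoji removal.

    import re
    from bs4 import BeautifulSoup
    from sentence_transformers import SentenceTransformer

    def preprocess(text: str) -> str:
        text = BeautifulSoup(text, "html.parser").get_text()   # 1. strip HTML elements
        text = re.sub(r"https?://\S+", "<url>", text)          # 2. replace URLs with a placeholder
        text = text.encode("ascii", "ignore").decode()         # 3. drop emojis/symbols (rough approximation)
        return re.sub(r"\s+", " ", text).strip()               # 4. collapse whitespace and newlines

    model = SentenceTransformer("BAAI/bge-m3")
    model.max_seq_length = 2048                                 # 5. 2048-token context
    embeddings = model.encode([preprocess("<p>Hello Kaggle!</p>")],
                              normalize_embeddings=True)
    print(embeddings.shape)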

    Sample Data

    The actual text data that I fed into the embedding model can be seen in this dataset

    How to use

    • Download the original csvs from Meta Kaggle dataset so that you can see the original text values and compare it to the preprocessed values.
    • You can also just download the samples in the ./sample/*.parquet folder to see what the data looks like, before you download the whole dataset (16 GB)
    • These are normalized embeddings that you can use with Cosine Similarity

    See Related Datasets


  16. Z

    Top Jet W-Momentum Reconstruction Dataset

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    • +1more
    Updated Mar 5, 2024
    Cite
    Hoffman, Timothy (2024). Top Jet W-Momentum Reconstruction Dataset [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_8197722
    Explore at:
    Dataset updated
    Mar 5, 2024
    Dataset provided by
    Hoffman, Timothy
    Bogatskiy, Alexander
    Offermann, Jan Tuzlić
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    A set of Monte Carlo simulated events, for the evaluation of top quarks' (and their child particles') momentum reconstruction, produced using the HEPData4ML package [1]. Specifically, the entries in this dataset correspond with top quark jets, and the momentum of the jets' constituent particles. This is a newer version of the "Top Quark Momentum Reconstruction Dataset" [2], but with sufficiently large changes to warrant this separate posting.

    The dataset is saved in HDF5 format, as sets of arrays with keys (as detailed below). There are ~1.5M events, approximately broken down into the following sets:

    Training: 700k events (files with "_train" suffix)

    Validation: 200k events (files with "_valid" suffix)

    Testing (small): 100k events (files with "_test" suffix)

    Testing (large): 500k events (files with "_test_large" suffix)

    The two separate types of testing files -- small and large -- are independent from one another, the former for conveniently running quicker testing and the latter for testing with a larger sample.

    There are four versions of the dataset present, with the versions indicated by the filenames. The different versions correspond with whether or not fast detector simulation was performed (versus truth-level jets), and whether or not the W-boson mass was modified: one version of the dataset uses the nominal value of (m_W = 80.385 \text{ GeV}) as used by Pythia8 [3], whereas another uses a variable mW taking on 101 values evenly spaced in (m_W \in [ 64.308, 96.462 ] \text{ GeV}). The dataset naming scheme is as follows:

    train.h5 : jets clustered from truth-level, nominal mW

    train_mW.h5: jets clustered from truth-level, variable mW

    train_delphes.h5: jets clustered from Delphes outputs, nominal mW

    train_delphes_mW.h5: jets clustered from Delphes outputs, variable mW

    Description

    13 TeV center-of-mass energy, fully hadronic top quark decays, simulated with Pythia8. ((t \rightarrow W \, b, \; W\rightarrow q \, q'))

    Events are generated with leading top quark pT in [550,650] GeV. (set via Pythia8's (\hat{p}_{T,\text{ min}}) and (\hat{p}_{T,\text{ max}}) variables)

    No initial- or final-state radiation (ISR/FSR), nor multi-parton interactions (MPI)

    Where applicable, detector simulation is done using DELPHES [4], with the ATLAS detector card.

    Clustering of particles/objects is done via FastJet [5], using the anti-kT algorithm, with (R=0.8) .

    For the truth-level data, inputs to jet clustering are truth-level, final-state particles (i.e. clustering "truth jets").

    For the data with detector simulation, the inputs are calorimeter towers from DELPHES.

    Tower objects from DELPHES (not E-flow objects, no tracking information)

    Each entry in the dataset corresponds with a single top quark jet, extracted from a (t\bar{t}) event.

    All jets are matched to a parton-level top quark within (\Delta R < 0.8) . We choose the jet nearest the parton-level top quark.

    Jets are required to have (|\eta| < 2), and (p_{T} > 15 \text{ GeV}).

    The 200 leading (highest-pT) jet constituent four-momenta are stored in Cartesian coordinates (E,px,py,pz), sorted by decreasing pT, with zero-padding.

    The jet four-momentum is stored in Cartesian coordinates (E, px, py, pz), as well as in cylindrical coordinates ((p_T,\eta,\phi,m)).

    The truth (parton-level) four-momenta of the top quark, the bottom quark, the W-boson, and the quarks to which the W-boson decays are stored in Cartesian coordinates.

    In addition, the momenta of the 120 leading stable daughter particles of the W-boson are stored in Cartesian coordinates.

    Description of data fields & metadataBelow is a brief description of the various fields in the dataset. The dataset also contains metadata fields, stored using HDF5's "attributes". This is used for fields that are common across many events, and stores information such as generator-level configurations (in principle, all the information is stored as to be able to recreate the dataset with the HEPData4ML tool).

    Note that fields whose keys have the prefix "jh_" correspond with output from the Johns Hopkins top tagger [6], as implemented in FastJet.

    Also note that for the keys corresponding with four-momenta in Cartesian coordinates, there are rotated versions of these fields -- the data has been rotated so that the W-boson is at ((\theta=0, \phi=0)), and the b-quark is in the ((\theta=0, \phi < 0)) plane. This rotation is potentially useful for visualizations of the events.

    Nobj: The number of constituents in the jet.

    Pmu: The four-momenta of the jet constituents, in (E, px, py, pz). Sorted by decreasing pT and zero-padded to a length of 200.

    Pmu_rot: Rotated version.

    contained_daughter_sum_Pmu: Four-momentum sum of the stable daughter particles of the W-boson that fall within (\Delta R < 0.8) of the jet centroid.

    contained_daughter_sum_Pmu_rot: Rotated version.

    cross_section: Cross-section for the corresponding process, reported by Pythia8.

    cross_section_uncertainty: Cross-section uncertainty for the corresponding process, reported by Pythia8.

    energy_ratio_smeared: Ratio of the true energy of W-boson daughter particles contributing to this calorimeter tower, divided by the total smeared energy in this calorimeter tower.

    Only relevant for the DELPHES datasets.

    energy_ratio_truth: Ratio of the true energy of W-boson daughter particles contributing to this calorimeter tower, divided by the total true energy of particles contributing to this calorimeter tower.

    The above definition is relevant only for the DELPHES datasets. For the truth-level datasets, this field is repurposed to store a value (0 or 1) indicating whether or not the given particle (whose momentum is in the Pmu field) is a W-boson daughter.

    event_idx: Redundant -- used for event indexing during the event generation process.

    is_signal: Redundant -- indicates whether an event is signal or background, but this is a fully signal dataset. Potentially useful if combining with other datasets produced with HEPData4ML.

    jet_Pmu: Four-momentum of the jet, in (E, px, py, pz).

    jet_Pmu_rot: Rotated version.

    jet_Pmu_cyl: Four-momentum of the jet, in ((p_T,\eta,\phi,m)).

    jet_bqq_contained_dR06: Boolean flag indicating whether or not the truth-level b and the two quarks from W decay are contained within (\Delta R < 0.6) of the jet centroid.

    jet_bqq_contained_dR08: Boolean flag indicating whether or not the truth-level b and the two quarks from W decay are contained within (\Delta R < 0.8) of the jet centroid.

    jet_bqq_dr_max: Maximum of (\big\lbrace \Delta R \left( \text{jet},b \right), \; \Delta R \left( \text{jet},q \right), \; \Delta R \left( \text{jet},q' \right) \big\rbrace).

    jet_qq_contained_dR06: Boolean flag indicating whether or not the two quarks from W decay are contained within (\Delta R < 0.6) of the jet centroid.

    jet_qq_contained_dR08: Boolean flag indicating whether or not the two quarks from W decay are contained within (\Delta R < 0.8) of the jet centroid.

    jet_qq_dr_max: Maximum of (\big\lbrace \Delta R \left( \text{jet},q \right), \; \Delta R \left( \text{jet},q' \right) \big\rbrace).

    jet_top_daughters_contained_dR08: Boolean flag indicating whether the final-state daughters of the top quark are within (\Delta R < 0.8) of the jet centroid. Specifically, the algorithm for this flag checks that the jet contains the stable daughters of both the b quark and the W boson. For the b and W each, daughter particles are allowed to be uncontained as long as (for each particle) the (p_T) of the sum of uncontained daughters is below (2.5 \text{ GeV}).

    jh_W_Nobj: Number of constituents in the W-boson candidate identified by the JH tagger.

    jh_W_Pmu: Four-momentum of the JH tagger W-boson candidate, in (E, px, py, pz).

    jh_W_Pmu_rot: Rotated version.

    jh_W_constituent_Pmu: Four-momentum of the constituents of the JH tagger W-boson candidate, in (E, px, py, pz).

    jh_W_constituent_Pmu_rot: Rotated version.

    jh_m: Mass of the JH W-boson candidate.

    jh_m_resolution: Ratio of JH W-boson candidate mass, versus the true W-boson mass.

    jh_pt: (p_T) of the JH W-boson candidate.

    jh_pt_resolution: Ratio of JH W-boson candidate (p_T), versus the true W-boson (p_T).

    jh_tag: Whether or not a jet was tagged by the JH tagger.

    mc_weight: Monte Carlo weight for this event, reported by Pythia8.

    process_code: Process code reported by Pythia8.

    rotation_matrix: Rotation matrix for rotating the events' 3-momenta as to produce the rotated copies stored in the dataset.

    truth_Nobj: Number of truth-level particles (saved in truth_Pmu).

    truth_Pdg: PDG codes of the truth-level particles.

    truth_Pmu: Truth-level particles: The top quark, bottom quark, W boson, q, q', and 120 leading, stable W-boson daughter particles, in (E, px, py, pz). A few of these are also stored in separate keys:

    truth_Pmu_0: Top quark.

    truth_Pmu_0_rot: Rotated version.

    truth_Pmu_1: Bottom quark.

    truth_Pmu_1_rot: Rotated version.

    truth_Pmu_2: W-boson.

    truth_Pmu_2_rot: Rotated version.

    truth_Pmu_3: q from W decay.

    truth_Pmu_3_rot: Rotated version.

    truth_Pmu_4: q' from W decay.

    truth_Pmu_4_rot: Rotated version.

    truth_Pmu_rot: Rotated version of truth_Pmu.

    The following fields correspond with metadata -- they provide the index of the corresponding metadata entry for each event:

    command_line_arguments: The command-line arguments passed to HEPData4ML's run.py script.

    config_file: The contents of the Python configuration file used for HEPData4ML. This, together with the command-line arguments, defines how the tool was run, what processes, jet clustering and post-processing was done, etc.

    git_hash: Git hash for HEPData4ML.

    timestamp: Timestamp for when the dataset was created
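
    A minimal sketch of reading a few of the fields described above with h5py; the file name follows the naming scheme given earlier, and the exact array shapes and attribute layout are assumptions based on this description.

    import h5py

    with h5py.File("train.h5", "r") as f:
        nobj = f["Nobj"][:]            # number of constituents per jet
        pmu = f["Pmu"][:]              # constituent four-momenta (E, px, py, pz), zero-padded to 200
        jet_pmu = f["jet_Pmu"][:]      # jet four-momentum (E, px, py, pz)
        truth_pmu = f["truth_Pmu"][:]  # parton-level top, b, W, q, q' and W daughters
        # Metadata (e.g. the generator configuration) is stored via HDF5 attributes,
        # assumed here to live at the file level.
        print(list(f.attrs.keys()))

    print(pmu.shape, jet_pmu.shape, nobj[:5])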

  17. AG News (News articles)

    • kaggle.com
    zip
    Updated Nov 20, 2022
    Cite
    The Devastator (2022). AG News (News articles) [Dataset]. https://www.kaggle.com/datasets/thedevastator/new-dataset-for-text-classification-ag-news/code
    Explore at:
    zip (11831597 bytes)
    Dataset updated
    Nov 20, 2022
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    AG News (News articles)

    News Articles Text Classification

    Source

    Huggingface Hub: link

    About this dataset

    The ag_news dataset provides a new opportunity for text classification research. It is a large dataset consisting of a training set of 10,000 examples and a test set of 5,000 examples. The examples are split evenly into two classes: positive and negative. This makes the dataset well-suited for research into text classification methods

    How to use the dataset

    If you're looking to do text classification research, the ag_news dataset is a great new dataset to use. It consists of a training set of 10,000 examples and a test set of 5,000 examples, split evenly between positive and negative class labels. The data is well-balanced and should be suitable for many different text classification tasks

    Research Ideas

    • This dataset can be used to train a text classifier to automatically categorize news articles into positive and negative categories.
    • This dataset can be used to develop a system that can identify positive and negative sentiment in news articles.
    • This dataset can be used to study the difference in how positive and negative news is reported by different media outlets

    Acknowledgements

    AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:--------------|:-----------------------------------------|
    | text | The text of the news article. (string) |
    | label | The label of the news article. (integer) |

    File: test.csv

    | Column name | Description |
    |:--------------|:-----------------------------------------|
    | text | The text of the news article. (string) |
    | label | The label of the news article. (integer) |

  18. Dataset for: A Bayesian Mixture Model for Clustering and Selection of...

    • wiley.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Qiwei Li; Michele Guindani; Brian Reich; Howard Bondell; Marina Vannucci (2023). Dataset for: A Bayesian Mixture Model for Clustering and Selection of Feature Occurrence Rates under Mean Constraints [Dataset]. http://doi.org/10.6084/m9.figshare.5016386.v1
    Explore at:
    txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley, https://www.wiley.com/
    Authors
    Qiwei Li; Michele Guindani; Brian Reich; Howard Bondell; Marina Vannucci
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In this paper, we consider the problem of modeling a matrix of count data, where multiple features are observed as counts over a number of samples. Due to the nature of the data generating mechanism, such data are often characterized by a high number of zeros and overdispersion. In order to take into account the skewness and heterogeneity of the data, some type of normalization and regularization is necessary for conducting inference on the occurrences of features across samples. We propose a zero-inflated Poisson mixture modeling framework that incorporates a model-based normalization through prior distributions with mean constraints, as well as a feature selection mechanism, which allows us to identify a parsimonious set of discriminatory features, and simultaneously cluster the samples into homogeneous groups. We show how our approach improves on the accuracy of the clustering with respect to more standard approaches for the analysis of count data, by means of a simulation study and an application to a bag-of-words benchmark data set, where the features are represented by the frequencies of occurrence of each word.

  19. Density and centrality value of each cluster.

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Mar 9, 2023
    + more versions
    Cite
    Xu, Xiaohan; Rogers, Roy Anthony (2023). Density and centrality value of each cluster. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000937837
    Explore at:
    Dataset updated
    Mar 9, 2023
    Authors
    Xu, Xiaohan; Rogers, Roy Anthony
    Description

    After the Cold War, some countries gradually sought regional cooperation when they could not handle various transnational challenges alone. The Shanghai Cooperation Organization (SCO) is a good example: it brought Central Asian countries together. This paper applies the text-mining method, using co-word analysis, a co-occurrence matrix, cluster analysis, and a strategic diagram to analyze the selected articles from newspapers quantitatively and visually. In order to investigate the Chinese government's attitude toward the SCO, this study collected data from the China Core Newspaper Full-text Database, which contains high-impact government newspapers revealing the Chinese government's perception of the SCO. This study characterizes the changing role of the SCO as perceived by the Chinese government from 2001 to 2019. Beijing's changing expectations in each of the three identified subperiods are described.

  20. Geostatistical Analysis of SARS-CoV-2 Positive Cases in the United States

    • zenodo.org
    • data.niaid.nih.gov
    Updated Sep 17, 2020
    Cite
    Peter K. Rogan; Peter K. Rogan (2020). Geostatistical Analysis of SARS-CoV-2 Positive Cases in the United States [Dataset]. http://doi.org/10.5281/zenodo.4032708
    Explore at:
    Dataset updated
    Sep 17, 2020
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Peter K. Rogan; Peter K. Rogan
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Geostatistics analyzes and predicts the values associated with spatial or spatial-temporal phenomena. It incorporates the spatial (and in some cases temporal) coordinates of the data within the analyses. It is a practical means of describing spatial patterns and interpolating values for locations where samples were not taken (and measures the uncertainty of those values, which is critical to informed decision making). This archive contains results of geostatistical analysis of COVID-19 case counts for all available US counties. Test results were obtained with ArcGIS Pro (ESRI). Sources are state health departments, which are scraped and aggregated by the Johns Hopkins Coronavirus Resource Center and then pre-processed by MappingSupport.com.

    This update of the Zenodo dataset (version 6) consists of three compressed archives containing geostatistical analyses of SARS-CoV-2 testing data. This dataset utilizes many of the geostatistical techniques used in previous versions of this Zenodo archive, but has been significantly expanded to include analyses of up-to-date U.S. COVID-19 case data (from March 24th to September 8th, 2020):

    Archive #1: “1.Geostat. Space-Time analysis of SARS-CoV-2 in the US (Mar24-Sept6).zip” – results of a geostatistical analysis of COVID-19 cases incorporating spatially-weighted hotspots that are conserved over one-week timespans. Results are reported starting from when U.S. COVID-19 case data first became available (March 24th, 2020) for 25 consecutive 1-week intervals (March 24th through to September 6th, 2020). Hotspots, where found, are reported in each individual state, rather than the entire continental United States.

    Archive #2: "2.Geostat. Spatial analysis of SARS-CoV-2 in the US (Mar24-Sept8).zip" – the results from geostatistical spatial analyses only of corrected COVID-19 case data for the continental United States, spanning the period from March 24th through September 8th, 2020. The geostatistical techniques utilized in this archive includes ‘Hot Spot’ analysis and ‘Cluster and Outlier’ analysis.

    Archive #3: "3.Kriging and Densification of SARS-CoV-2 in LA and MA.zip" – this dataset provides preliminary kriging and densification analysis of COVID-19 case data for certain dates within the U.S. states of Louisiana and Massachusetts.

    These archives consist of map files (as both static images and as animations) and data files (including text files which contain the underlying data of said map files [where applicable]) which were generated when performing the following Geostatistical analyses: Hot Spot analysis (Getis-Ord Gi*) [‘Archive #1’: consecutive weeklong Space-Time Hot Spot analysis; ‘Archive #2’: daily Hot Spot Analysis], Cluster and Outlier analysis (Anselin Local Moran's I) [‘Archive #2’], Spatial Autocorrelation (Global Moran's I) [‘Archive #2’], and point-to-point comparisons with Kriging and Densification analysis [‘Archive #3’].

    The Word document provided ("Description-of-Archive.Updated-Geostatistical-Analysis-of-SARS-CoV-2 (version 6).docx") details the contents of each file and folder within these three archives and gives general interpretations of these results.
