Please download the text file. It contains 22 sentences related to the «cat» topic:
- Cat (the animal)
- the UNIX utility cat for displaying the contents of files
- versions of the OS X operating system named after the feline family
Your task is to find the two sentences that are closest in meaning to the first sentence in the document («In comparison to dogs, cats have not undergone .......»). We will use the cosine distance as the measure of proximity.
Steps:
1. Open the file.
2. Each line is one sentence. Convert them all to lower case using the string function lower(). EXAMPLE: in comparison to dogs, cats have not undergone major changes during the domestication process.
3. Tokenization: split each sentence into words. For that purpose you can use a regular expression that splits on spaces or any other symbols that are not letters: re.split('[^a-z]', t). Do not forget to remove empty tokens. EXAMPLE: ['in', 'comparison', 'to', 'dogs', '', 'cats', 'have', 'not', 'undergone', 'major', 'changes', 'during', 'the', 'domestication', 'process'].
4. Make a list of all the words that appear in the sentences. Note: all the words are unique. Give each word an index from 0 to the number of unique words minus one. You can use a dict. Example: {0: 'mac', 1: 'permanently', 2: 'osx', 3: 'download', 4: 'between', 5: 'based', 6: 'which', ............., 252: 'safer', 253: 'will'}. Hint: there are 254 unique words.
5. Create a matrix with N x D dimensions, where N is the number of sentences and D is the number of unique words (22 x 254). Fill it in: the element with index (i, j) in this matrix must be equal to the number of occurrences of the j-th word in the i-th sentence (bag of words).
A Python sketch of these steps is given below.
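One possible sketch of steps 1-5 plus the cosine-distance lookup from the task statement. The filename sentences.txt is an assumption; use the actual name of the downloaded file.

import re
import numpy as np
from scipy.spatial.distance import cosine

# Steps 1-2: read the file and lower-case each sentence.
with open("sentences.txt") as f:
    sentences = [line.lower() for line in f if line.strip()]

# Step 3: tokenize on non-letter characters and drop empty tokens.
tokenized = [[w for w in re.split("[^a-z]", s) if w] for s in sentences]

# Step 4: assign an index to every unique word.
word_index = {}
for tokens in tokenized:
    for w in tokens:
        if w not in word_index:
            word_index[w] = len(word_index)

# Step 5: bag-of-words matrix, entry (i, j) = count of word j in sentence i.
counts = np.zeros((len(tokenized), len(word_index)))
for i, tokens in enumerate(tokenized):
    for w in tokens:
        counts[i, word_index[w]] += 1

# Cosine distance from the first sentence to every other sentence.
dists = [(i, cosine(counts[0], counts[i])) for i in range(1, len(counts))]
print(sorted(dists, key=lambda p: p[1])[:2])  # the two closest sentences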
Of course, in Task 1 we implemented a very simple method. For example, in this method «cat» and «cats» are two different words, even though the meaning is the same.
For the second task, repeat steps 1-4 from Task 1. In this task you will create a Term Frequency-Inverse Document Frequency (TF-IDF) matrix. Find the cosine distance from the first sentence to all the other sentences. Which two sentences are closest to the first sentence? You can use scipy.spatial.distance.cosine. Is there any difference from the result of the previous task? Note: you should not use any existing libraries for TF-IDF. All the steps are similar to the previous example.
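A possible sketch of the TF-IDF step, reusing counts from the previous sketch. The plain log(N / df) form of IDF used here is only one common convention; any consistent variant works, since no library may be used for TF-IDF.

import numpy as np
from scipy.spatial.distance import cosine

n_sentences = counts.shape[0]

# Term frequency: word counts normalized by sentence length.
tf = counts / counts.sum(axis=1, keepdims=True)

# Inverse document frequency: log(number of sentences / number of
# sentences that contain the word).
df = (counts > 0).sum(axis=0)
idf = np.log(n_sentences / df)

tfidf = tf * idf

# Cosine distance from the first sentence to all the others.
dists = [(i, cosine(tfidf[0], tfidf[i])) for i in range(1, n_sentences)]
print(sorted(dists, key=lambda p: p[1])[:2])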
Run the hierarchical clustering algorithm for Task 1 and Task 2, plot the dendrograms, and explain your results.
NOTE: by default scipy.cluster.hierarchy uses the euclidean distance. You should change it to the cosine distance.
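A minimal sketch of the clustering step with scipy, shown for the bag-of-words matrix counts; the same call applies to the tfidf matrix. Note that 'ward' linkage requires euclidean distance, so a method such as 'average' is used here with the cosine metric.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# linkage defaults to the euclidean metric; request cosine explicitly.
Z = linkage(counts, method="average", metric="cosine")
dendrogram(Z)
plt.show()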
This is a text document classification dataset which contains 2225 text samples and five categories of documents. The five categories are politics, sport, tech, entertainment and business. This dataset can be used for document classification and document clustering.
About Dataset
- The dataset contains two features: text and label.
- No. of rows: 2225
- No. of columns: 2
Text: contains the different categories of text data. Label: contains the labels for the five categories: 0, 1, 2, 3, 4.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
I built the dataset by:
- Sampling 500 texts with 6 different labels from https://medium.com/r/?url=https%3A%2F%2Fwww.kaggle.com%2Fdatasets%2Fhaithemhermessi%2Fsanad-dataset%2Fdata%3Fselect%3DCulture. I made sure that all the categories got a different random number of samples: Politics: 94, Sports: 110, Finance: 83, Tech: 67, Religion: 66, Medical: 80. I also made sure the text lengths vary across the samples.
- Sampling 500 texts with 6 different labels from https://www.kaggle.com/datasets/micchaeelwijaya/news-topics-classification-dataset. I made sure that all the categories got a different random number of samples: Politics: 50, Sport: 87, Business: 81, Tech: 55, Religion: 169, Entertainment: 58. I also made sure the text lengths vary across the samples.
- I then put the English and Arabic texts that belong to the same category together (treating Business and Finance as the same category), and left the Medical category as Arabic texts only and the Entertainment category as English texts only.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset can be used as a benchmark for clustering word embeddings for German. The dataset contains book titles and is based on the dataset from the GermEval 2019 Shared Task on Hierarchical Classification of Blurbs. It contains 18'084 unique samples, 28 splits with 177 to 16'425 samples and 4 to 93 unique classes. Splits are built similarly to MTEB's ArxivClusteringP2P. Have a look at the German Text Embedding Clustering Benchmark (Github, Paper) for more information, datasets and evaluation… See the full description on the dataset page: https://huggingface.co/datasets/slvnwhrl/blurbs-clustering-p2p.
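A minimal sketch for pulling the data from the Hugging Face Hub with the datasets library; inspect the returned object for the actual split names and features.

from datasets import load_dataset

# Download the clustering benchmark from the Hugging Face Hub.
# If the dataset defines multiple configurations, pass the config name as well.
ds = load_dataset("slvnwhrl/blurbs-clustering-p2p")
print(ds)  # shows the available splits and their features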
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite this dataset as:
Nicolas Turenne, Ziwei Chen, Guitao Fan, Jianlong Li, Yiwen Li, Siyuan Wang, Jiaqi Zhou (2021) Mining an English-Chinese parallel Corpus of Financial News, BNU HKBU UIC, technical report
The dataset comes from Financial Times news website (https://www.ft.com/)
News articles are written in both Chinese and English.
FTIE.zip contains each document as an individual file.
FT-en-zh.rar contains all documents in a single file.
Below is a sample document from the dataset, defined by the following fields and syntax:
id;time;english_title;chinese_title;integer;english_body;chinese_body
1021892;2008-09-10T00:00:00Z;FLAW IN TWIN TOWERS REVEALED;科学家发现纽约双子塔倒塌的根本原因;1;Scientists have discovered the fundamental reason the Twin Towers collapsed on September 11 2001. The steel used in the buildings softened fatally at 500?C – far below its melting point – as a result of a magnetic change in the metal. @ The finding, announced at the BA Festival of Science in Liverpool yesterday, should lead to a new generation of steels capable of retaining strength at much higher temperatures.;科学家发现了纽约世贸双子大厦(Twin Towers)在2001年9月11日倒塌的根本原因。由于磁性变化,大厦使用的钢在500摄氏度——远远低于其熔点——时变软,从而产生致命后果。 @ 这一发现在昨日利物浦举行的BA科学节(BA Festival of Science)上公布。这应会推动能够在更高温度下保持强度的新一代钢铁的问世。
The dataset contains 60,473 bilingual documents.
The time range is from 2007 to 2020.
This dataset has been used for parallel bilingual news mining in the finance domain.
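A minimal sketch for parsing the single-file export, assuming one semicolon-delimited record per line with the field order shown above; the filename used here is a placeholder for whatever FT-en-zh.rar unpacks to.

import csv

fields = ["id", "time", "english_title", "chinese_title",
          "integer", "english_body", "chinese_body"]

# Read the semicolon-delimited records into dictionaries.
with open("FT-en-zh.txt", encoding="utf-8") as f:  # placeholder filename
    reader = csv.DictReader(f, fieldnames=fields, delimiter=";")
    for row in reader:
        print(row["english_title"], "|", row["chinese_title"])
        break  # show only the first record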
AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.
The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
The AG's news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('ag_news_subset', split='train')  # load the training split
for ex in ds.take(4):  # iterate over the first four examples
    print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset has been created for implementing a content-based recommender system in the context of the Open Research Knowledge Graph (ORKG). The recommender system accepts a research paper's title and abstract as input and recommends existing ORKG predicates that are semantically relevant to the given paper.
The paper instances in the dataset are grouped by ORKG comparisons and therefore the data.json file is more comprehensive than training_set.json and test_set.json.
data.json
The main JSON object consists of a list of comparisons. Each comparison object has an ID, a label, a list of papers and a list of predicates, whereas each paper object has an ID, label, DOI, research field, research problems and abstract. Each predicate object has an ID and a label. See an example instance below.
{ "comparisons": [ { "id": "R108331", "label": "Analysis of approaches based on required elements in way of modeling", "papers": [ { "id": "R108312", "label": "Rapid knowledge work visualization for organizations", "doi": "10.1108/13673270710762747", "research_field": { "id": "R134", "label": "Computer and Systems Architecture" }, "research_problems": [ { "id": "R108294", "label": "Enterprise engineering" } ], "abstract": "Purpose \u2013 The purpose of this contribution is to motivate a new, rapid approach to modeling knowledge work in organizational settings and to introduce a software tool that demonstrates the viability of the envisioned concept.Design/methodology/approach \u2013 Based on existing modeling structures, the KnowFlow toolset that aids knowledge analysts in rapidly conducting interviews and in conducting multi\u2010perspective analysis of organizational knowledge work is introduced.Findings \u2013 This article demonstrates how rapid knowledge work visualization can be conducted largely without human modelers by developing an interview structure that allows for self\u2010service interviews. Two application scenarios illustrate the pressing need for and the potentials of rapid knowledge work visualizations in organizational settings.Research limitations/implications \u2013 The efforts necessary for traditional modeling approaches in the area of knowledge management are often prohibitive. This contribution argues that future research needs ..." }, .... ], "predicates": [ { "id": "P37126", "label": "activities, behaviours, means [for knowledge development and/or for knowledge conveyance and transformation" }, { "id": "P36081", "label": "approach name" }, .... ] }, .... ] }
training_set.json and test_set.json
The main JSON object consists of a list of training/test instances. Each instance has an instance_id with the format (comparison_id X paper_id) and a text. The text is a concatenation of the paper's label (title) and abstract. See an example instance below.
Note that test instances are not duplicated and do not occur in the training set. Training instances are also not duplicated, BUT training papers can be duplicated in a concatenation with different comparisons.
{ "instances": [ { "instance_id": "R108331xR108301", "comparison_id": "R108331", "paper_id": "R108301", "text": "A notation for Knowledge-Intensive Processes Business process modeling has become essential for managing organizational knowledge artifacts. However, this is not an easy task, especially when it comes to the so-called Knowledge-Intensive Processes (KIPs). A KIP comprises activities based on acquisition, sharing, storage, and (re)use of knowledge, as well as collaboration among participants, so that the amount of value added to the organization depends on process agents' knowledge. The previously developed Knowledge Intensive Process Ontology (KIPO) structures all the concepts (and relationships among them) to make a KIP explicit. Nevertheless, KIPO does not include a graphical notation, which is crucial for KIP stakeholders to reach a common understanding about it. This paper proposes the Knowledge Intensive Process Notation (KIPN), a notation for building knowledge-intensive processes graphical models." }, ... ] }
Dataset Statistics:
| | Papers | Predicates | Research Fields | Research Problems |
|---|---|---|---|---|
| Min/Comparison | 2 | 2 | 1 | 0 |
| Max/Comparison | 202 | 112 | 5 | 23 |
| Avg./Comparison | 21.54 | 12.79 | 1.20 | 1.09 |
| Total | 4060 | 1816 | 46 | 178 |
Dataset Splits:
| | Papers | Comparisons |
|---|---|---|
| Training Set | 2857 | 214 |
| Test Set | 1203 | 180 |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Examples of some retrieved concept relevance records (in part, as some phrase records have been omitted here).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Output files of the application of our R software (available at https://github.com/wilkinsonlab/robust-clustering-metagenomics) to different microbiome datasets already published.
Prefixes:
Suffixes:
_All: all taxa
_Dominant: only 1% most abundant taxa
_NonDominant: remaining taxa after removing above dominant taxa
_GenusAll: taxa aggregated at genus level
_GenusDominant: taxa aggregated at genus level, keeping only the 1% most abundant taxa
_GenusNonDominant: taxa aggregated at genus level, with the 1% most abundant taxa removed
Each folder contains 3 output files related to the same input dataset:
- data.normAndDist_definitiveClustering_XXX.RData: R data file with a) a phyloseq object (including OTU table, meta-data and cluster assigned to each sample); and b) a distance matrix object.
- definitiveClusteringResults_XXX.txt: text file with assessment measures of the selected clustering.
- sampleId-cluster_pairs_XXX.txt: text file with two comma-separated columns: sampleID,clusterID
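A small sketch for reading the sample-to-cluster pairs in Python (whether the file carries a header row is not stated in the description, so adjust the header argument if needed):

import pandas as pd

# Two comma-separated columns: sampleID, clusterID.
pairs = pd.read_csv("sampleId-cluster_pairs_XXX.txt",
                    header=None, names=["sampleID", "clusterID"])
print(pairs["clusterID"].value_counts())  # number of samples per cluster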
Abstract of the associated paper:
The analysis of microbiome dynamics would allow us to elucidate patterns within microbial community evolution; however, microbiome state-transition dynamics have been scarcely studied. This is in part because a necessary first step in such analyses has not been well defined: how to deterministically describe a microbiome's "state". Clustering into states has been widely studied, although no standard has been agreed upon yet. We propose a generic, domain-independent and automatic procedure to determine a reliable set of microbiome sub-states within a specific dataset, and with respect to the conditions of the study. The robustness of sub-state identification is established by the combination of diverse techniques for stable cluster verification. We reuse four distinct longitudinal microbiome datasets to demonstrate the broad applicability of our method, analysing results with different taxa subsets, which allows the procedure to be adjusted depending on the application goal, and showing that the methodology provides a set of robust sub-states to examine in downstream studies about microbiome dynamics.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.
PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.
The document types are:
Here are a few example entries from the CSV file:
This dataset can be used for:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For a long time, it has taken a lot of time and energy for psychological workers to classify the psychological problems of college students. In order to quickly and efficiently understand the common psychological problems of college students in the region for real-time analysis in the post-epidemic era, 2,000 college students' psychological problems were selected as research data from the community question section of the "Su Xin" application, a psychological self-help and mutual aid platform for college students in Jiangsu Province. First, word segmentation, removal of stop words, establishment of word vectors, etc. were used for preprocessing the research data. Second, the data was divided into 9 common psychological problems by LDA clustering analysis, combined with previous research. Third, the text information was processed into word vectors and fed into an Attention-Based Bidirectional Long Short-Term Memory network (AB-LSTM). The experimental results showed that the proposed model achieves a higher test accuracy of 78% compared with other models.
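The paper's own code is not included here; as a rough illustration of the LDA step only, the following scikit-learn sketch clusters a toy list of already-preprocessed texts into 9 topics, mirroring the number of problem categories mentioned above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in for the segmented, stop-word-filtered question texts.
docs = [
    "exam stress sleep problems",
    "conflict with roommate dormitory",
    "anxiety about future employment",
]

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=9, random_state=0)
topic_distribution = lda.fit_transform(X)  # per-document topic weights
print(topic_distribution.shape)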
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary material for the preprint "Analyzing the Possibilities of Using the Scilit Platform to Identify Current Energy Efficiency and Conservation Issues".
Purpose of publication:
- Preparation of bibliometric data exported from the Scilit platform on energy efficiency and conservation for further analysis to identify relevant research topics.
- To identify potential issues in the processing of data exported from the Scilit platform.
- Providing colleagues with the opportunity to use the prepared data and examples of their analysis for independent research on topical issues of energy efficiency and energy conservation using materials provided by the Scilit platform.
I have prepared a preprint and plan to post it on the platform https://www.preprints.org/search?field1=title_keywords&search2=Chigarev&field2=authors&clause=AND
In this archive there is a file Energy_Efficiency-En.html with active links, for convenience in finding the full content of the tables used in the text. You can download the entire archive to your computer and use the data for your research using the algorithms and services listed in Energy_Efficiency-En.html.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Finding useful information in 30,000 papers is a hard task, and understanding the information in all those papers takes time. With advanced AI methods, we can find and extract similar patterns from text data. This method uses advanced AI to find patterns in an unsupervised way, which is equivalent to comparing every sentence with every other sentence in a brute-force manner.
This method goes beyond sentence-level co-occurrence pattern finding. Because it compares each sentence with the other sentences, similar or comparable patterns between sentences are extracted, rather than the co-occurrence patterns produced by other methods.
Because it compares concepts and patterns rather than words, hidden but related words or phrases can be found easily. In other words, it goes beyond keyword search to bring all the related sentences into one place. This also reduces the reading requirement.
This dataset was created with unsupervised learning methods, so it extracts all the sentences that are nearly similar to each other. It contains some noisy data that may not be useful, because the method is fully unsupervised.
The data was cleaned, stop words were removed, and only English-language papers were considered. The final result is 4.5 million sentences. These were processed to find relevant clusters of sentences with the desired similarity.
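The description does not name the exact embedding model; as a rough sketch of the brute-force all-pairs comparison it describes, the following uses TF-IDF vectors and cosine similarity (a real run would substitute stronger sentence embeddings).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "antibiotics are ranked by predicted clinical efficacy in adults",
    "relative rank order of predicted clinical efficacy for antibiotics",
    "the trial enrolled pediatric patients with mild disease",
]

X = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(X)  # every sentence against every other one
print(similarity.round(2))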
One example is given below.
For the full text of the papers, please refer to the https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge data.
title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : Antibiotics can be placed into the following relative rank order of predicted clinical efficacy for adults: 90% to 92% ch respiratory fluoroquinolones (gatifloxacin, levofloxacin, moxifloxacin), ceftriaxone, high-dose amoxicillin/clavulanate (4 g/250 mg/day), and amoxicillin/clavulanate (1.75 g/250 mg/day); 83% to 88% ch high-dose amoxicillin (4 g/day), amoxicillin (1.5 g/day), cefpodoxime proxetil, cefixime (based on H influenzae and M catarrhalis coverage), cefuroxime axetil, cefdinir, and TMP/SMX; 77% to 81% ch doxycycline, clindamycin (based on gram-positive coverage only), azithromycin, clarithromycin and erythromycin, and telithromycin; 65% to 66% ch cefaclor and loracarbef.
title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : Antibiotics can be placed into the following relative rank order of predicted clinical efficacy in children with ABRS: 91% to 92% ch ceftriaxone, high-dose amoxicillin/clavulanate (90 mg/6.4 mg per kg per day) and amoxicillin/clavulanate (45 mg/6.4 mg per kg per day); 82% to 87% ch highdose amoxicillin (90 mg/kg per day), amoxicillin (45 mg/kg per day), cefpodoxime proxetil, cefixime (based on H influenzae and M catarrhalis coverage only), cefuroxime axetil, cefdinir, and TMP/SMX; and 78% to 80% ch clindamycin (based on gram-positive coverage only), cefprozil, azithromycin, clarithromycin, and erythromycin; 67% to 68% ch cefaclor and loracarbef.
title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : Recommendations for initial therapy for adult patients with mild disease (who have not received antibiotics in the previous 4 to 6 weeks) include the following choices: amoxicillin/clavulanate (1.75 to 4 g/250 mg per day), amoxicillin (1.5 to 4 g/day), cefpodoxime proxetil, cefuroxime axetil, or cefdinir.
title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : Recommendations for initial therapy for children with mild disease and who have not received antibiotics in the previous 4 to 6 weeks include the following: high-dose amoxicillin/clavulanate (90 mg/6.4 mg per kg per day), amoxicillin (90 mg/kg per day), cefpodoxime proxetil, cefuroxime axetil, or cefdinir.
title : Antimicrobial treatment guidelines for acute bacterial rhinosinusitis Executive Summary SINUS AND ALLERGY HEALTH PARTNERSHIP* paper id : 32d8d8a2e5e0a499c98a53c9f71a22469752247e line : The relative antimicrobial activity against isolates of S pneumoniae based on PK/PD breakpoints, 89 can be listed as: gatifloxacin / levofloxacin / moxifloxacin ([?]99%); ceftriaxone / high-dose amoxicillin (Ti clavulanate [extended-release or extra strength]) (95% to 97%); amoxicillin (Ti clavulanate) / clindamycin (90% to 92%) ; cefpodoxime proxetil /cefuroxime axetil / cefdinir /erythromycin /cla...
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Patent Clustering by Inventor
Dataset Description
This dataset is part of PatenTEB, a comprehensive benchmark for evaluating text embedding models on patent-specific tasks. PatenTEB comprises 15 tasks across retrieval, classification, paraphrase detection, and clustering, with 2.06 million examples designed to reflect real-world patent analysis workflows. Paper: PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding
Task Details… See the full description on the dataset page: https://huggingface.co/datasets/datalyes/clusters_inventor.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
These are BAAI/bge-m3 embeddings of the Meta Kaggle ForumTopics.csv and ForumMessages.csv
This is a supplemental dataset for the Meta Kaggle Hackathon
The embeddings were generated with a 2048-token context size and normalize_embeddings set to true. The actual text data that I fed into the embedding model can be seen in this dataset.
Check the ./sample/*.parquet folder to see what the data looks like before you download the whole dataset (16 GB). (The header image was generated with Bing Image Generator.)
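A minimal sketch for peeking at the sample files before committing to the 16 GB download (column names are not documented above, so inspect them first):

import glob
import pandas as pd

# Load only the provided sample parquet files.
sample_files = glob.glob("./sample/*.parquet")
df = pd.concat(pd.read_parquet(path) for path in sample_files)
print(df.columns.tolist())
print(len(df))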
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
A set of Monte Carlo simulated events, for the evaluation of top quarks' (and their child particles') momentum reconstruction, produced using the HEPData4ML package [1]. Specifically, the entries in this dataset correspond with top quark jets, and the momentum of the jets' constituent particles. This is a newer version of the "Top Quark Momentum Reconstruction Dataset" [2], but with sufficiently large changes to warrant this separate posting.
The dataset is saved in HDF5 format, as sets of arrays with keys (as detailed below). There are ~1.5M events, approximately broken down into the following sets:
Training: 700k events (files with "_train" suffix)
Validation: 200k events (files with "_valid" suffix)
Testing (small): 100k events (files with "_test" suffix)
Testing (large): 500k events (files with "_test_large" suffix)
The two separate types of testing files -- small and large -- are independent from one another, the former for conveniently running quicker testing and the latter for testing with a larger sample.
There are four versions of the dataset present, with the versions indicated by the filenames. The different versions correspond with whether or not fast detector simulation was performed (versus truth-level jets), and whether or not the W-boson mass was modified: one version of the dataset uses the nominal value of (m_W = 80.385 \text{ GeV}) as used by Pythia8 [3], whereas another uses a variable mW taking on 101 evenly-spaced values in the range (m_W \in [64.308, 96.462] \text{ GeV}). The dataset naming scheme is as follows:
train.h5 : jets clustered from truth-level, nominal mW
train_mW.h5: jets clustered from truth-level, variable mW
train_delphes.h5: jets clustered from Delphes outputs, nominal mW
train_delphes_mW.h5: jets clustered from Delphes outputs, variable mW
Description
13 TeV center-of-mass energy, fully hadronic top quark decays, simulated with Pythia8. ((t \rightarrow W \, b, \; W\rightarrow q \, q'))
Events are generated with leading top quark pT in [550,650] GeV. (set via Pythia8's (\hat{p}_{T,\text{ min}}) and (\hat{p}_{T,\text{ max}}) variables)
No initial- or final-state radiation (ISR/FSR), nor multi-parton interactions (MPI)
Where applicable, detector simulation is done using DELPHES [4], with the ATLAS detector card.
Clustering of particles/objects is done via FastJet [5], using the anti-kT algorithm, with (R=0.8) .
For the truth-level data, inputs to jet clustering are truth-level, final-state particles (i.e. clustering "truth jets").
For the data with detector simulation, the inputs are calorimeter towers from DELPHES.
Tower objects from DELPHES (not E-flow objects, no tracking information)
Each entry in the dataset corresponds with a single top quark jet, extracted from a (t\bar{t}) event.
All jets are matched to a parton-level top quark within (\Delta R < 0.8) . We choose the jet nearest the parton-level top quark.
Jets are required to have (|\eta| < 2), and (p_{T} > 15 \text{ GeV}).
The 200 leading (highest-pT) jet constituent four-momenta are stored in Cartesian coordinates (E,px,py,pz), sorted by decreasing pT, with zero-padding.
The jet four-momentum is stored in Cartesian coordinates (E, px, py, pz), as well as in cylindrical coordinates ((p_T,\eta,\phi,m)).
The truth (parton-level) four-momenta of the top quark, the bottom quark, the W-boson, and the quarks to which the W-boson decays, are stored in Cartesian coordinates.
In addition, the momenta of the 120 leading stable daughter particles of the W-boson are stored in Cartesian coordinates.
Description of data fields & metadata
Below is a brief description of the various fields in the dataset. The dataset also contains metadata fields, stored using HDF5's "attributes". This is used for fields that are common across many events, and stores information such as generator-level configurations (in principle, all the information is stored so as to be able to recreate the dataset with the HEPData4ML tool).
Note that fields whose keys have the prefix "jh_" correspond with output from the Johns Hopkins top tagger [6], as implemented in FastJet.
Also note that for the keys corresponding with four-momenta in Cartesian coordinates, there are rotated versions of these fields -- the data has been rotated so that the W-boson is at ((\theta=0, \phi=0)), and the b-quark is in the ((\theta=0, \phi < 0)) plane. This rotation is potentially useful for visualizations of the events.
Nobj: The number of constituents in the jet.
Pmu: The four-momenta of the jet constituents, in (E, px, py, pz). Sorted by decreasing pT and zero-padded to a length of 200.
Pmu_rot: Rotated version.
contained_daughter_sum_Pmu: Four-momentum sum of the stable daughter particles of the W-boson that fall within (\Delta R < 0.8) of the jet centroid.
contained_daughter_sum_Pmu_rot: Rotated version.
cross_section: Cross-section for the corresponding process, reported by Pythia8.
cross_section_uncertainty: Cross-section uncertainty for the corresponding process, reported by Pythia8.
energy_ratio_smeared: Ratio of the true energy of W-boson daughter particles contributing to this calorimeter tower, divided by the total smeared energy in this calorimeter tower.
Only relevant for the DELPHES datasets.
energy_ratio_truth: Ratio of the true energy of W-boson daughter particles contributing to this calorimeter tower, divided by the total true energy of particles contributing to this calorimeter tower.
The above definition is relevant only for the DELPHES datasets. For the truth-level datasets, this field is repurposed to store a value (0 or 1) indicating whether or not the given particle (whose momentum is in the Pmu field) is a W-boson daughter.
event_idx: Redundant -- used for event indexing during the event generation process.
is_signal: Redundant -- indicates whether an event is signal or background, but this is a fully signal dataset. Potentially useful if combining with other datasets produced with HEPData4ML.
jet_Pmu: Four-momentum of the jet, in (E, px, py, pz).
jet_Pmu_rot: Rotated version.
jet_Pmu_cyl: Four-momentum of the jet, in ((p_T,\eta,\phi,m)).
jet_bqq_contained_dR06: Boolean flag indicating whether or not the truth-level b and the two quarks from W decay are contained within (\Delta R < 0.6) of the jet centroid.
jet_bqq_contained_dR08: Boolean flag indicating whether or not the truth-level b and the two quarks from W decay are contained within (\Delta R < 0.8) of the jet centroid.
jet_bqq_dr_max: Maximum of (\big\lbrace \Delta R \left( \text{jet},b \right), \; \Delta R \left( \text{jet},q \right), \; \Delta R \left( \text{jet},q' \right) \big\rbrace).
jet_qq_contained_dR06: Boolean flag indicating whether or not the two quarks from W decay are contained within (\Delta R < 0.6) of the jet centroid.
jet_qq_contained_dR08: Boolean flag indicating whether or not the two quarks from W decay are contained within (\Delta R < 0.8) of the jet centroid.
jet_qq_dr_max: Maximum of (\big\lbrace \Delta R \left( \text{jet},q \right), \; \Delta R \left( \text{jet},q' \right) \big\rbrace).
jet_top_daughters_contained_dR08: Boolean flag indicating whether the final-state daughters of the top quark are within (\Delta R < 0.8) of the jet centroid. Specifically, the algorithm for this flag checks that the jet contains the stable daughters of both the b quark and the W boson. For the b and W each, daughter particles are allowed to be uncontained as long as (for each particle) the (p_T) of the sum of uncontained daughters is below (2.5 \text{ GeV}).
jh_W_Nobj: Number of constituents in the W-boson candidate identified by the JH tagger.
jh_W_Pmu: Four-momentum of the JH tagger W-boson candidate, in (E, px, py, pz).
jh_W_Pmu_rot: Rotated version.
jh_W_constituent_Pmu: Four-momentum of the constituents of the JH tagger W-boson candidate, in (E, px, py, pz).
jh_W_constituent_Pmu_rot: Rotated version.
jh_m: Mass of the JH W-boson candidate.
jh_m_resolution: Ratio of JH W-boson candidate mass, versus the true W-boson mass.
jh_pt: (p_T) of the JH W-boson candidate.
jh_pt_resolution: Ratio of JH W-boson candidate (p_T), versus the true W-boson (p_T).
jh_tag: Whether or not a jet was tagged by the JH tagger.
mc_weight: Monte Carlo weight for this event, reported by Pythia8.
process_code: Process code reported by Pythia8.
rotation_matrix: Rotation matrix for rotating the events' 3-momenta as to produce the rotated copies stored in the dataset.
truth_Nobj: Number of truth-level particles (saved in truth_Pmu).
truth_Pdg: PDG codes of the truth-level particles.
truth_Pmu: Truth-level particles: The top quark, bottom quark, W boson, q, q', and 120 leading, stable W-boson daughter particles, in (E, px, py, pz). A few of these are also stored in separate keys:
truth_Pmu_0: Top quark.
truth_Pmu_0_rot: Rotated version.
truth_Pmu_1: Bottom quark.
truth_Pmu_1_rot: Rotated version.
truth_Pmu_2: W-boson.
truth_Pmu_2_rot: Rotated version.
truth_Pmu_3: q from W decay.
truth_Pmu_3_rot: Rotated version.
truth_Pmu_4: q' from W decay.
truth_Pmu_4_rot: Rotated version.
truth_Pmu_rot: Rotated version of truth_Pmu.
The following fields correspond with metadata -- they provide the index of the corresponding metadata entry for each event:
command_line_arguments: The command-line arguments passed to HEPData4ML's run.py script.
config_file: The contents of the Python configuration file used for HEPData4ML. This, together with the command-line arguments, defines how the tool was run, what processes, jet clustering and post-processing was done, etc.
git_hash: Git hash for HEPData4ML.
timestamp: Timestamp for when the dataset was created
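A minimal h5py sketch for reading a few of the fields described above from one of the files; the array shapes in the comments follow the description (200 zero-padded constituents per jet) but should be verified, and the location of the metadata attributes is an assumption.

import h5py

with h5py.File("train.h5", "r") as f:
    print(list(f.keys()))        # all per-event fields
    nobj = f["Nobj"][:]          # number of constituents per jet
    pmu = f["Pmu"][:]            # constituent four-momenta, zero-padded to 200
    jet_pmu = f["jet_Pmu"][:]    # jet four-momentum (E, px, py, pz)
    print(nobj.shape, pmu.shape, jet_pmu.shape)
    print(dict(f.attrs))         # metadata (assumed to be file-level attributes)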
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Huggingface Hub: link
The ag_news dataset provides a new opportunity for text classification research. It is a large dataset consisting of a training set of 10,000 examples and a test set of 5,000 examples. The examples are split evenly into two classes: positive and negative. This makes the dataset well-suited for research into text classification methods
If you're looking to do text classification research, the ag_news dataset is a great new dataset to use. It consists of a training set of 10,000 examples and a test set of 5,000 examples, split evenly between positive and negative class labels. The data is well-balanced and should be suitable for many different text classification tasks
- This dataset can be used to train a text classifier to automatically categorize news articles into positive and negative categories.
- This dataset can be used to develop a system that can identify positive and negative sentiment in news articles.
- This dataset can be used to study the difference in how positive and negative news is reported by different media outlets
AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine that has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.
License
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:--------------|:-----------------------------------------|
| text | The text of the news article. (string) |
| label | The label of the news article. (integer) |

File: test.csv

| Column name | Description |
|:--------------|:-----------------------------------------|
| text | The text of the news article. (string) |
| label | The label of the news article. (integer) |
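A minimal sketch for loading the two CSV splits described above with pandas:

import pandas as pd

train = pd.read_csv("train.csv")  # columns: text, label
test = pd.read_csv("test.csv")
print(train["label"].value_counts())
print(train.loc[0, "text"][:100])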
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this paper, we consider the problem of modeling a matrix of count data, where multiple features are observed as counts over a number of samples. Due to the nature of the data generating mechanism, such data are often characterized by a high number of zeros and overdispersion. In order to take into account the skewness and heterogeneity of the data, some type of normalization and regularization is necessary for conducting inference on the occurrences of features across samples. We propose a zero-inflated Poisson mixture modeling framework that incorporates a model-based normalization through prior distributions with mean constraints, as well as a feature selection mechanism, which allows us to identify a parsimonious set of discriminatory features, and simultaneously cluster the samples into homogeneous groups. We show how our approach improves on the accuracy of the clustering with respect to more standard approaches for the analysis of count data, by means of a simulation study and an application to a bag-of-words benchmark data set, where the features are represented by the frequencies of occurrence of each word.
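For reference, the zero-inflated Poisson distribution that the mixture builds on has the standard textbook form (notation here is generic, not taken from the paper): for mixing weight (\pi) and rate (\lambda), (P(Y = 0) = \pi + (1 - \pi)\,e^{-\lambda}) and (P(Y = k) = (1 - \pi)\,\dfrac{\lambda^{k} e^{-\lambda}}{k!}) for (k = 1, 2, \dots), where (\pi) is the probability of a structural (excess) zero.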
After the Cold War, some countries gradually sought regional cooperation when they could not handle various transnational challenges alone. The Shanghai Cooperation Organization (SCO) is a good example: it brought Central Asian countries together. This paper applies text-mining methods, using co-word analysis, a co-occurrence matrix, cluster analysis, and a strategic diagram to analyze the selected newspaper articles quantitatively and visually. In order to investigate the Chinese government's attitude toward the SCO, this study collected data from the China Core Newspaper Full-text Database, which contains high-impact government newspapers revealing the Chinese government's perception of the SCO. This study characterizes the changing role of the SCO as perceived by the Chinese government from 2001 to 2019. Beijing's changing expectations in each of the three identified subperiods are described.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Geostatistics analyzes and predicts the values associated with spatial or spatial-temporal phenomena. It incorporates the spatial (and in some cases temporal) coordinates of the data within the analyses. It is a practical means of describing spatial patterns and interpolating values for locations where samples were not taken (and measures the uncertainty of those values, which is critical to informed decision making). This archive contains results of geostatistical analysis of COVID-19 case counts for all available US counties. Test results were obtained with ArcGIS Pro (ESRI). Sources are state health departments, which are scraped and aggregated by the Johns Hopkins Coronavirus Resource Center and then pre-processed by MappingSupport.com.
This update of the Zenodo dataset (version 6) consists of three compressed archives containing geostatistical analyses of SARS-CoV-2 testing data. This dataset utilizes many of the geostatistical techniques used in previous versions of this Zenodo archive, but has been significantly expanded to include analyses of up-to-date U.S. COVID-19 case data (from March 24th to September 8th, 2020):
Archive #1: “1.Geostat. Space-Time analysis of SARS-CoV-2 in the US (Mar24-Sept6).zip” – results of a geostatistical analysis of COVID-19 cases incorporating spatially-weighted hotspots that are conserved over one-week timespans. Results are reported starting from when U.S. COVID-19 case data first became available (March 24th, 2020) for 25 consecutive 1-week intervals (March 24th through to September 6th, 2020). Hotspots, where found, are reported in each individual state, rather than the entire continental United States.
Archive #2: "2.Geostat. Spatial analysis of SARS-CoV-2 in the US (Mar24-Sept8).zip" – the results from geostatistical spatial analyses only of corrected COVID-19 case data for the continental United States, spanning the period from March 24th through September 8th, 2020. The geostatistical techniques utilized in this archive includes ‘Hot Spot’ analysis and ‘Cluster and Outlier’ analysis.
Archive #3: "3.Kriging and Densification of SARS-CoV-2 in LA and MA.zip" – this dataset provides preliminary kriging and densification analysis of COVID-19 case data for certain dates within the U.S. states of Louisiana and Massachusetts.
These archives consist of map files (as both static images and as animations) and data files (including text files which contain the underlying data of said map files [where applicable]) which were generated when performing the following Geostatistical analyses: Hot Spot analysis (Getis-Ord Gi*) [‘Archive #1’: consecutive weeklong Space-Time Hot Spot analysis; ‘Archive #2’: daily Hot Spot Analysis], Cluster and Outlier analysis (Anselin Local Moran's I) [‘Archive #2’], Spatial Autocorrelation (Global Moran's I) [‘Archive #2’], and point-to-point comparisons with Kriging and Densification analysis [‘Archive #3’].
The Word document provided ("Description-of-Archive.Updated-Geostatistical-Analysis-of-SARS-CoV-2 (version 6).docx") details the contents of each file and folder within these three archives and gives general interpretations of these results.