Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created specifically to test LLMs' capabilities in processing and extracting topic-specific articles from historical, unstructured newspaper issues. While traditional article separation tasks rely on layout information or a combination of layout and semantic understanding, this dataset evaluates a novel approach using OCR'd text and contextual understanding. This method can considerably improve the corpus-building process for individual researchers working on specific topics such as migration or disasters. The dataset consists of French, German, and English newspapers from 1909 and contains multiple layers of information: detailed metadata about each newspaper issue (including identifiers, titles, dates, and institutional information), full-text content of newspaper pages or sections, a context window for processing, and human-annotated ground-truth extractions. The dataset is structured to enable a three-step evaluation of LLMs: first, their ability to classify content as relevant or not relevant to a specific topic (such as the 1908 Messina earthquake); second, their accuracy in extracting complete relevant articles from the broader newspaper text; and third, their ability to correctly mark the beginning and end of articles, especially when several articles were published in the same newspaper issue. By providing human-annotated ground truth, the dataset allows for systematic assessment of how well LLMs can understand historical text, maintain contextual relevance, and perform precise information extraction. This testing framework helps evaluate LLMs' effectiveness in handling real-world historical document processing tasks while maintaining accuracy and contextual understanding.
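As a loose illustration of how such a three-step evaluation might be scored against the ground truth, the sketch below compares one model prediction with one annotated record; the field names (is_relevant, article_text, start_offset, end_offset) and the similarity-based extraction score are hypothetical choices, not the dataset's actual schema or metric.

```python
# Hedged sketch: scoring a model's output against human-annotated ground truth
# for the three evaluation steps described above. Field names are hypothetical.
from difflib import SequenceMatcher

def score_record(prediction: dict, gold: dict) -> dict:
    """Compare one model prediction with one annotated newspaper issue."""
    # Step 1: relevance classification (e.g. "about the 1908 Messina earthquake?")
    relevance_correct = prediction["is_relevant"] == gold["is_relevant"]

    # Step 2: extraction accuracy, approximated here by character-level similarity
    # between the extracted article text and the annotated ground-truth text.
    extraction_similarity = SequenceMatcher(
        None, prediction.get("article_text", ""), gold.get("article_text", "")
    ).ratio()

    # Step 3: boundary accuracy - did the model mark the same start/end offsets
    # in the OCR'd issue text? Exact match is the strictest possible criterion.
    boundaries_correct = (
        prediction.get("start_offset") == gold.get("start_offset")
        and prediction.get("end_offset") == gold.get("end_offset")
    )
    return {
        "relevance_correct": relevance_correct,
        "extraction_similarity": extraction_similarity,
        "boundaries_correct": boundaries_correct,
    }

# Example with made-up values:
print(score_record(
    {"is_relevant": True, "article_text": "Messina lies in ruins...", "start_offset": 120, "end_offset": 890},
    {"is_relevant": True, "article_text": "Messina lies in ruins...", "start_offset": 120, "end_offset": 890},
))
```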
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
TravelAgentProject/var-extraction-n-task-classification dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Outline of the different combinations of feature extraction and classification methods used in the present study.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification accuracy (%) for all subjects using different feature extraction methods.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Automatic Term Recognition (ATR) is a research task that deals with the identification of domain-specific terms. Terms, in simple words, are textual realizations of significant concepts in an expertise domain. Additionally, domain-specific terms may be classified into a number of categories, in which each category represents a significant concept. A term classification task is often defined on top of an ATR procedure to perform such categorization. For instance, in the biomedical domain, terms can be classified as drugs, proteins, and genes. This is a reference dataset for terminology extraction and classification research in computational linguistics. It is a set of manually annotated terms in the English language that are extracted from the ACL Anthology Reference Corpus (ACL ARC). The ACL ARC is a canonicalised and frozen subset of scientific publications in the domain of Human Language Technologies (HLT). It consists of 10,921 articles from 1965 to 2006. The dataset, called ACL RD-TEC, comprises more than 69,000 candidate terms that are manually annotated as valid or invalid. Furthermore, valid terms are classified as technology and non-technology terms. Technology terms refer to a method, process, or, in general, a technological concept in the domain of HLT, e.g. machine translation, word sense disambiguation, and language modelling. On the other hand, non-technology terms refer to important concepts other than technological ones; examples of such terms in the domain of HLT are multilingual lexicon, corpora, word sense, and language model. The dataset is created to serve as a gold standard for the comparison of algorithms for term recognition and classification.
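A minimal sketch of how a gold standard like this might be used to score a term recognition system follows; it assumes a tab-separated file with hypothetical term and label columns, which may differ from the actual ACL RD-TEC distribution format.

```python
# Hedged sketch: comparing an ATR system's output against a gold-standard term
# list such as ACL RD-TEC. The file layout and column names here are assumed.
import csv

def load_gold_terms(path: str) -> dict:
    """Map each annotated candidate term to its label (e.g. valid/invalid, tech/non-tech)."""
    gold = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            gold[row["term"].lower()] = row["label"]
    return gold

def precision_recall(extracted: set, gold: dict) -> tuple:
    """Score a set of automatically extracted terms against the annotated valid terms."""
    valid = {t for t, label in gold.items() if label != "invalid"}
    true_positives = extracted & valid
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    recall = len(true_positives) / len(valid) if valid else 0.0
    return precision, recall
```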
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table contains the data extraction results for the study Machine-based Stereotypes: How Machine Learning Algorithms Evaluate Ethnicity from Face Data. It contains 24 columns and 74 rows.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The data is in uncleaned format and was collected using the Twitter API. The tweets have been filtered to keep only English content. The dataset targets mental-health classification of the user at the tweet level. Also check out the notebooks I have provided, which demonstrate data cleaning and feature extraction techniques on the given dataset:
Topic modelling features using LDA (Latent Dirichlet Allocation), i.e. summarizing a tweet into one of the top k topics.
Emoji sentiment features, i.e. counts of positive, negative, and neutral expression emojis present in the tweet.
Original Data Source: Depression: Twitter Dataset + Feature Extraction
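The two feature families described above could be reproduced roughly as follows; this is a hedged sketch using scikit-learn, with illustrative emoji sets and topic count rather than the notebook's actual choices.

```python
# Hedged sketch of LDA topic features and emoji sentiment counts for tweets.
# The emoji sets and the number of topics are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "feeling really low today, nothing helps :(",
    "great run this morning, best mood in weeks 😄",
]

# Topic-modelling feature: assign each tweet to its most probable LDA topic.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_of_tweet = lda.fit_transform(doc_term).argmax(axis=1)

# Emoji sentiment feature: counts of positive / negative / neutral emojis.
POSITIVE, NEGATIVE, NEUTRAL = set("😄😊🙂"), set("😢😞😠"), set("😐🤔")

def emoji_counts(text: str) -> dict:
    return {
        "pos": sum(ch in POSITIVE for ch in text),
        "neg": sum(ch in NEGATIVE for ch in text),
        "neu": sum(ch in NEUTRAL for ch in text),
    }

for tweet, topic in zip(tweets, topic_of_tweet):
    print(topic, emoji_counts(tweet))
```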
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WoLLaI Mal-Eng is a carefully curated and annotated dataset for word-level language identification in Malayalam-English code-mixed text. The dataset consists of 12,402 sentences, thoroughly tokenized for optimal representation. The dataset file is organized into three columns: sentence#, words, and language. Language annotation is categorized into four distinct classes: Mal, Eng, Mix, and Othr. Words that belong to the Malayalam language and are recognized by Malayalam speakers are annotated as Mal. Words that belong to the English language and are easily recognized by English speakers are annotated as Eng. Words formed by combining Malayalam and English, where Malayalam suffixes are added to the end of English words or parts of English words to enhance comprehension for Malayalam speakers, are annotated as Mix. Words covering diverse other elements, such as numbers, abbreviations, and named entities, are annotated as Othr.
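A small sketch of how the three-column file might be loaded and summarised; the CSV format and the exact column spellings are assumptions based on the description above.

```python
# Hedged sketch: reading the three-column annotation file (sentence#, words,
# language) and summarising the label distribution. The file format is assumed.
import csv
from collections import Counter, defaultdict

def load_wollai(path: str):
    sentences = defaultdict(list)   # sentence# -> list of (word, language) pairs
    labels = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            sentences[row["sentence#"]].append((row["words"], row["language"]))
            labels[row["language"]] += 1   # one of Mal, Eng, Mix, Othr
    return sentences, labels
```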
Content:
This dataset was made in response to the mediocre-to-decent output quality of 0.5B - 3B models finetuned on the v1/v1.2 dataset, as a way to cut down the computation power needed to extract data while hopefully maintaining the same output quality. This dataset has an added input_type tag to separate conversation and advertisement user input for better classification and extraction compared to the v1/v1.2 datasets, and the number of data rows in each input_type is equal to ensure there is no… See the full description on the dataset page: https://huggingface.co/datasets/fevohh/Rayman-Extraction-Dataset-v0.
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
In response to NASA Topic S7.01, Visual Learning Systems, Inc. (VLS) will develop a novel hyperspectral plug-in toolkit for its award-winning Feature Analyst® software that will (a) leverage VLS' proven algorithms to provide a new, simple, and long-awaited approach to materials classification from hyperspectral imagery (HSI), and (b) improve Feature Analyst's state-of-the-art automated feature extraction (AFE) capabilities by effectively incorporating detailed spectral information into its extraction process. HSI techniques, such as spectral end-member classification, can provide effective materials classification; however, current methods are slow (or manual), cumbersome, complex for analysts, and limited to materials classification only. Feature Analyst, on the other hand, has a simple workflow of (a) an analyst providing a few examples (e.g., pixels of a certain material) and (b) an advanced software agent classifying the rest of the imagery based on the examples. This simple yet powerful approach will be used as a new paradigm for materials classification. In addition, Feature Analyst uses, along with spectral information, feature characteristics such as spatial association, size, shape, texture, pattern, and shadow in its generic AFE process. Incorporating the best spectral classifier techniques with the best AFE approach promises to greatly increase the usefulness and applicability of HSI.
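As a generic stand-in for the example-based workflow described above (a few labelled pixels train a classifier that is then applied to the whole image), the sketch below is not the Feature Analyst implementation; the cube dimensions, labels, and classifier are invented for illustration.

```python
# Hedged sketch of example-based hyperspectral classification: analyst-labelled
# pixels train a model that classifies every pixel of the cube. Illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
cube = rng.random((100, 100, 50))          # hypothetical HSI cube: rows x cols x bands

# Analyst-provided examples: (row, col) -> material label
examples = {(10, 12): "asphalt", (40, 7): "vegetation", (75, 60): "water"}
X_train = np.array([cube[r, c] for r, c in examples])
y_train = np.array(list(examples.values()))

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
material_map = clf.predict(cube.reshape(-1, 50)).reshape(100, 100)
```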
https://creativecommons.org/publicdomain/zero/1.0/
The datasets are derived from eye fundus images provided in Kaggle's 'APTOS 2019 Blindness Detection' competition. The competition involves classification of eye fundus images into 5 levels of severity in diabetic retinopathy.
Unlike most participants, who used a deep learning approach to this classification problem, here we tried using Fractal Dimensions and Persistent Homology (one of the major tools in Topological Data Analysis, TDA) to extract features from the images, which serve as inputs to simpler ML algorithms such as SVM. This approach shows some promising results.
There are three files in this dataset:
Process_Images.html - R scripts for extracting Fractal Dimensions and Persistent Homology features from images.
train_features.RDS and test_features.RDS - the output RDS (R dataset files) for training and testing images for the above Kaggle competition.
Columns in train_features.RDS & test_features.RDS:
id_code - image id
diagnosis - severity of diabetic retinopathy on a scale of 0 to 4: 0=No DR; 1=Mild; 2=Moderate; 3=Severe; 4=Proliferative DR; Artificially set to be 0 for test_features.RDS
n - number of persistent homology components detected from the image
fd1 to fd21 - proportion of sliding windows having a specific fractal dimension: fd1 = proportion of windows with FD=2; fd2 = proportion of windows with FD in (2, 2.05]; ...; fd21 = proportion of windows with FD in (2.95, 3.00]
l1_2 to l1_499 - silhouette (p=0.1, dim=1) at various time steps.
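A hedged sketch of how these RDS feature files might be consumed from Python; pyreadr is one way to read RDS files, and the SVM settings are illustrative rather than those used in the original analysis.

```python
# Hedged sketch: loading train_features.RDS from Python and fitting an SVM,
# mirroring the column layout described above. Modelling choices are illustrative.
import pyreadr
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

train = pyreadr.read_r("train_features.RDS")[None]   # RDS objects load under the key None

feature_cols = (
    ["n"]
    + [f"fd{i}" for i in range(1, 22)]                   # fractal-dimension histogram bins
    + [c for c in train.columns if c.startswith("l1_")]  # persistence silhouette values
)
X, y = train[feature_cols].values, train["diagnosis"].astype(int).values

svm = SVC(kernel="rbf", C=1.0)
print(cross_val_score(svm, X, y, cv=5).mean())
```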
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Temporal Model "UIE-Base" finetuned on the "Fullpate" dataset to solve the tasks of extraction and classification of temporal entities. Model produced in the master's thesis "Extraction and Classification of Time in Unstructured Data [Kirsanov, 2023]".
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
prompt folder, the two bash scripts used to run it on the university HPC.
results folder, which contains the log files, and one csv file per model.
platine_check, indicating which gold results were re-checked.
figure_tables.R.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A brain tumor is a cluster of abnormal and uncontrolled cell growth, leading to a short life expectancy at its highest grade. Accurate and timely clinical diagnosis of such tumors plays a critical role in treatment planning and patient care. Various imaging techniques such as Computed Tomography (CT), ultrasound imaging, Magnetic Resonance Imaging (MRI), and biopsy are used to evaluate brain tumors. Among these four, MRI is the most commonly used non-invasive technique; however, the key challenge with these images is the low-level visual information captured by MRI machines, which needs high-level interpretation by an experienced radiologist. Manual interpretation is a tedious, challenging, and error-prone task. In this paper, we propose a novel Convolutional Feature based Euclidean Distance (ConFED) method for faster and more accurate tumor classification. The method consists of convolutional features and Euclidean distance based one-step learning. The proposed method is evaluated on the Contrast-Enhanced Magnetic Resonance Images (CE-MRI) benchmark dataset. The proposed method is more generic as it does not use any handcrafted features, requires minimal preprocessing, and achieves an average accuracy of 97.02% using five-fold cross-validation. Extensive experiments, along with statistical tests, revealed that the proposed method outperforms state-of-the-art classification methods on the CE-MRI dataset.
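One plausible reading of "Euclidean distance based one-step learning" is a nearest-class-centroid rule over convolutional feature vectors; the sketch below implements that reading with random stand-in features and is an assumption, not the paper's actual ConFED implementation.

```python
# Hedged sketch: nearest-centroid classification over convolutional features,
# one interpretation of Euclidean-distance-based one-step learning.
import numpy as np

def fit_centroids(features: np.ndarray, labels: np.ndarray) -> dict:
    """One-step 'training': compute the mean feature vector of each tumor class."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict(features: np.ndarray, centroids: dict) -> np.ndarray:
    """Assign each feature vector to the class with the nearest centroid."""
    classes = np.array(list(centroids))
    dists = np.stack(
        [np.linalg.norm(features - centroids[c], axis=1) for c in classes], axis=1
    )
    return classes[dists.argmin(axis=1)]

# Example with random stand-ins for convolutional features of 3 tumor classes:
rng = np.random.default_rng(0)
X, y = rng.normal(size=(30, 128)), rng.integers(0, 3, size=30)
centroids = fit_centroids(X, y)
print(predict(X[:5], centroids))
```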
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
This table shows the volume of intake water for the mineral extraction industries at the 4-digit Industry Classification, by water source, for Canada. The unit of measure is millions of cubic metres.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The data presented here consists of three parts.
Dataset 1: In this set, we extracted 327,322 faces from our entire collection of 3,389 issues and automatically classified each face as male or female. We present this data as a single table with columns identifying the date, issue, page number, the coordinates identifying the position of the face on the page, and classification (male or female). The coordinates identifying the position of the face on the page are based on the size and resolution of the pages found in the “Time Vault”.
Dataset 2: Dataset 2 consists of 8,789 classified faces from 100 selected issues. Human labor was used to identify and extract 3,299 face images from 39 issues, which were later classified by another set of workers. This selection of 39 issues contains one issue per decade spanned by the archive plus one issue per year between 1961 and 1991, and the extracted face images were used to train the face extraction algorithm. The remaining 5,490 faces from 61 issues were extracted via machine learning before being classified by human coders. These 61 issues were chosen to complement the first selection of 39 issues: one issue per year for all years in the archive excluding those between 1961 and 1991. Thus, Dataset 2 contains fully-labelled faces from at least one issue per year.
Dataset 3: In the interest of transparency, Dataset 3 consists of the raw data collected to create Dataset 2, and consists of 2 tables. Before explaining these tables, we first briefly describe our data collection and verification procedures, which have been fully described elsewhere. A custom AMT interface was used to enable human workers to classify faces according to the categories in Table 4. Each worker was given a randomly-selected batch of 25 pages, each with a clearly highlighted face to be categorized, of which three pages were verification pages with known features, which were used for quality control. Each face was labeled by two distinct human coders, determined at random so that the pairing of coders varied with the image. A proficiency rating was calculated for each coder by considering all images they annotated and computing the average number of labels that matched those identified by the image’s other coder. The tables in Dataset 2 were created by resolving inconsistencies between the two image coders by selecting the labels from the coder with the highest proficiency rating. Prior to calculating the proficiency score, all faces that were tagged as having ‘Poor’ or ‘Error’ image quality by either of the two coders were eliminated. Due to technical bugs when the AMT interface was first implemented, a small number of images were only labeled once; these were also eliminated from Datasets 2 and 3. In Dataset 3, we present the raw annotations for each coder that tagged each face, along with demographic data for each coder. Dataset 3 consists of two tables: the raw data from each of the two sets of coders, and the demographic information for each of the coders.
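The proficiency-based resolution procedure described above could look roughly like the following; the annotation tuple layout is an assumption, not the structure of the released tables.

```python
# Hedged sketch of the resolution procedure: score each coder by how often their
# labels agree with the paired coder, then keep the labels of the higher-scoring
# coder for every doubly-annotated face. The data layout below is assumed.
from collections import defaultdict

def proficiency(annotations):
    """annotations: list of (face_id, coder_id, labels_dict), two coders per face."""
    by_face = defaultdict(list)
    for face_id, coder_id, labels in annotations:
        by_face[face_id].append((coder_id, labels))
    agree, total = defaultdict(int), defaultdict(int)
    for (coder_a, labels_a), (coder_b, labels_b) in (
        pair for pair in by_face.values() if len(pair) == 2
    ):
        # Both coders are assumed to annotate the same set of label fields.
        matches = sum(labels_a[k] == labels_b[k] for k in labels_a)
        for coder in (coder_a, coder_b):
            agree[coder] += matches
            total[coder] += len(labels_a)
    return {coder: agree[coder] / total[coder] for coder in total}

def resolve(annotations, scores):
    """Keep, for each doubly-annotated face, the labels of the more proficient coder."""
    by_face = defaultdict(list)
    for face_id, coder_id, labels in annotations:
        by_face[face_id].append((scores.get(coder_id, 0.0), labels))
    return {
        face_id: max(pair, key=lambda p: p[0])[1]
        for face_id, pair in by_face.items()
        if len(pair) == 2
    }
```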
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study focuses on frequency-based textual category feature extraction methods, investigating the application differences and impacts of "representativeness calculation" and "distinctiveness calculation" in Chinese and English text classification. The core objective is to analyze how text feature extraction algorithms interact with language characteristics and to compare the performance of various algorithms across Chinese and English contexts. To achieve this, I propose a novel feature extraction approach and design a series of experiments to evaluate its effectiveness in both Chinese and English corpora. The dataset for this research includes three Chinese datasets (THUCNews, Sougou Chinese Corpus, CNTC) and three English datasets (20 Newsgroups, R8, R52), which encompass diverse textual features and varying data scales. This selection enables a realistic reflection of the performance differences across different linguistic environments and application scenarios, providing comprehensive and reliable data support for evaluating the effectiveness of the experiments.
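As a generic illustration of the contrast between the two kinds of frequency-based scores, the sketch below computes a within-category rate ("representativeness") and a cross-category ratio ("distinctiveness") for each term; these formulas are stand-ins, not the study's actual calculations.

```python
# Hedged illustration of frequency-based category feature scores; the formulas
# are generic stand-ins, not the study's representativeness/distinctiveness methods.
from collections import Counter

def category_term_scores(docs_by_category: dict) -> dict:
    """docs_by_category maps a category name to a list of tokenised documents."""
    freq = {c: Counter(tok for doc in docs for tok in doc)
            for c, docs in docs_by_category.items()}
    totals = {c: sum(counter.values()) for c, counter in freq.items()}
    scores = {}
    for c, counter in freq.items():
        other_total = sum(totals[o] for o in freq if o != c) or 1
        for term, count in counter.items():
            rate_in_category = count / totals[c]                  # "representativeness"
            rate_elsewhere = sum(freq[o][term] for o in freq if o != c) / other_total
            distinctiveness = rate_in_category / (rate_in_category + rate_elsewhere)
            scores[(c, term)] = (rate_in_category, distinctiveness)
    return scores

# Toy usage with two categories of already-tokenised documents:
print(category_term_scores({
    "sports": [["match", "goal", "team"], ["team", "coach"]],
    "finance": [["market", "stock"], ["stock", "team"]],
}))
```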
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Temporal Model "UIE-Large" finetuned on the "TempEval-3" dataset to solve the tasks of extraction and classification of temporal entities. Model produced in the master's thesis "Extraction and Classification of Time in Unstructured Data [Kirsanov, 2023]".
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
FewRel is a large-scale few-shot relation extraction dataset, which contains more than one hundred relations and tens of thousands of annotated instances across different domains.
https://fred.stlouisfed.org/legal/#copyright-public-domainhttps://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for Chain-Type Quantity Index for Real GDP: Oil and Gas Extraction (211) in Maine (MEOILGASQGSP) from 1997 to 2021 about extraction, quantity index, ME, chained, oil, NAICS, mining, gas, GSP, private industries, private, real, industry, GDP, and USA.