Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created specifically to test LLMs' capabilities in processing and extracting topic-specific articles from historical, unstructured newspaper issues. While traditional article separation tasks rely on layout information or a combination of layout and semantic understanding, this dataset evaluates a novel approach using OCR'd text and contextual understanding. This method can considerably improve the corpus-building process for individual researchers working on specific topics such as migration or disasters. The dataset consists of French, German, and English newspapers from 1909 and contains multiple layers of information: detailed metadata about each newspaper issue (including identifiers, titles, dates, and institutional information), full-text content of newspaper pages or sections, a context window for processing, and human-annotated ground-truth extractions. The dataset is structured to enable a three-step evaluation of LLMs: first, their ability to classify content as relevant or not relevant to a specific topic (such as the 1908 Messina earthquake); second, their accuracy in extracting complete relevant articles from the broader newspaper text; and third, their ability to correctly mark the beginning and end of articles, especially when several articles were published in the same newspaper issue. By providing human-annotated ground truth, the dataset allows for systematic assessment of how well LLMs can understand historical text, maintain contextual relevance, and perform precise information extraction. This testing framework helps evaluate LLMs' effectiveness in handling real-world historical document processing tasks while maintaining accuracy and contextual understanding.
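As a loose illustration of how such a three-step evaluation might be scored against the ground truth, the sketch below compares one model prediction with one annotated record; the field names (is_relevant, article_text, start_offset, end_offset) and the similarity-based extraction score are hypothetical choices, not the dataset's actual schema or metric.

```python
# Hedged sketch: scoring a model's output against human-annotated ground truth
# for the three evaluation steps described above. Field names are hypothetical.
from difflib import SequenceMatcher

def score_record(prediction: dict, gold: dict) -> dict:
    """Compare one model prediction with one annotated newspaper issue."""
    # Step 1: relevance classification (e.g. "about the 1908 Messina earthquake?")
    relevance_correct = prediction["is_relevant"] == gold["is_relevant"]

    # Step 2: extraction accuracy, approximated here by character-level similarity
    # between the extracted article text and the annotated ground-truth text.
    extraction_similarity = SequenceMatcher(
        None, prediction.get("article_text", ""), gold.get("article_text", "")
    ).ratio()

    # Step 3: boundary accuracy - did the model mark the same start/end offsets
    # in the OCR'd issue text? Exact match is the strictest possible criterion.
    boundaries_correct = (
        prediction.get("start_offset") == gold.get("start_offset")
        and prediction.get("end_offset") == gold.get("end_offset")
    )
    return {
        "relevance_correct": relevance_correct,
        "extraction_similarity": extraction_similarity,
        "boundaries_correct": boundaries_correct,
    }

# Example with made-up values:
print(score_record(
    {"is_relevant": True, "article_text": "Messina lies in ruins...", "start_offset": 120, "end_offset": 890},
    {"is_relevant": True, "article_text": "Messina lies in ruins...", "start_offset": 120, "end_offset": 890},
))
```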
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
TravelAgentProject/var-extraction-n-task-classification dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Outline of the different combinations of feature extraction and classification methods used in the present study.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification accuracy (%) for all subjects using different feature extraction methods.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Automatic Term Recognition (ATR) is a research task that deals with the identification of domain-specific terms. Terms, in simple words, are textual realizations of significant concepts in an expertise domain. Additionally, domain-specific terms may be classified into a number of categories, in which each category represents a significant concept. A term classification task is often defined on top of an ATR procedure to perform such categorization. For instance, in the biomedical domain, terms can be classified as drugs, proteins, and genes. This is a reference dataset for terminology extraction and classification research in computational linguistics. It is a set of manually annotated terms in the English language that are extracted from the ACL Anthology Reference Corpus (ACL ARC). The ACL ARC is a canonicalised and frozen subset of scientific publications in the domain of Human Language Technologies (HLT). It consists of 10,921 articles from 1965 to 2006. The dataset, called ACL RD-TEC, comprises more than 69,000 candidate terms that are manually annotated as valid or invalid. Furthermore, valid terms are classified as technology and non-technology terms. Technology terms refer to a method, process, or, in general, a technological concept in the domain of HLT, e.g. machine translation, word sense disambiguation, and language modelling. On the other hand, non-technology terms refer to important concepts other than technological ones; examples of such terms in the domain of HLT are multilingual lexicon, corpora, word sense, and language model. The dataset is created to serve as a gold standard for the comparison of algorithms for term recognition and classification.
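A minimal sketch of how a gold standard like this might be used to score a term recognition system follows; it assumes a tab-separated file with hypothetical term and label columns, which may differ from the actual ACL RD-TEC distribution format.

```python
# Hedged sketch: comparing an ATR system's output against a gold-standard term
# list such as ACL RD-TEC. The file layout and column names here are assumed.
import csv

def load_gold_terms(path: str) -> dict:
    """Map each annotated candidate term to its label (e.g. valid/invalid, tech/non-tech)."""
    gold = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            gold[row["term"].lower()] = row["label"]
    return gold

def precision_recall(extracted: set, gold: dict) -> tuple:
    """Score a set of automatically extracted terms against the annotated valid terms."""
    valid = {t for t, label in gold.items() if label != "invalid"}
    true_positives = extracted & valid
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    recall = len(true_positives) / len(valid) if valid else 0.0
    return precision, recall
```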
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table contains the data extraction results for the study Machine-based Stereotypes: How Machine Learning Algorithms Evaluate Ethnicity from Face Data. It contains 24 columns and 74 rows.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The data is in uncleaned format and was collected using the Twitter API. The tweets have been filtered to keep only English content. The dataset targets mental-health classification of the user at the tweet level. Also check out the notebooks I have provided, which demonstrate data cleaning and feature extraction techniques on the given dataset:
Topic modelling features using LDA (Latent Dirichlet Allocation), i.e. summarizing a tweet into one of the top k topics.
Emoji sentiment features, i.e. counts of positive, negative, and neutral expression emojis present in the tweet.
Original Data Source: Depression: Twitter Dataset + Feature Extraction
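The two feature families described above could be reproduced roughly as follows; this is a hedged sketch using scikit-learn, with illustrative emoji sets and topic count rather than the notebook's actual choices.

```python
# Hedged sketch of LDA topic features and emoji sentiment counts for tweets.
# The emoji sets and the number of topics are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "feeling really low today, nothing helps :(",
    "great run this morning, best mood in weeks 😄",
]

# Topic-modelling feature: assign each tweet to its most probable LDA topic.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_of_tweet = lda.fit_transform(doc_term).argmax(axis=1)

# Emoji sentiment feature: counts of positive / negative / neutral emojis.
POSITIVE, NEGATIVE, NEUTRAL = set("😄😊🙂"), set("😢😞😠"), set("😐🤔")

def emoji_counts(text: str) -> dict:
    return {
        "pos": sum(ch in POSITIVE for ch in text),
        "neg": sum(ch in NEGATIVE for ch in text),
        "neu": sum(ch in NEUTRAL for ch in text),
    }

for tweet, topic in zip(tweets, topic_of_tweet):
    print(topic, emoji_counts(tweet))
```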
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WoLLaI Mal-Eng is a carefully curated and annotated dataset for word-level language identification in Malayalam-English code-mixed text. The dataset consists of 12,402 sentences, thoroughly tokenized for optimal representation. The dataset file is organized into three columns: sentence#, words, and language. Language annotation is categorized into four distinct classes: Mal, Eng, Mix, and Othr. Words that belong to the Malayalam language and are recognized by Malayalam speakers are annotated as Mal. Words that belong to the English language and are easily recognized by English speakers are annotated as Eng. Words formed by combining Malayalam and English, where Malayalam suffixes are added to the end of English words or parts of English words to enhance comprehension for Malayalam speakers, are annotated as Mix. Words covering diverse other elements, such as numbers, abbreviations, and named entities, are annotated as Othr.
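A small sketch of how the three-column file might be loaded and summarised; the CSV format and the exact column spellings are assumptions based on the description above.

```python
# Hedged sketch: reading the three-column annotation file (sentence#, words,
# language) and summarising the label distribution. The file format is assumed.
import csv
from collections import Counter, defaultdict

def load_wollai(path: str):
    sentences = defaultdict(list)   # sentence# -> list of (word, language) pairs
    labels = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            sentences[row["sentence#"]].append((row["words"], row["language"]))
            labels[row["language"]] += 1   # one of Mal, Eng, Mix, Othr
    return sentences, labels
```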
Content:
This dataset was made in response to the mediocre-to-decent output quality of 0.5B - 3B models finetuned on the v1/v1.2 dataset, as a way to cut down the computation power needed to extract data while hopefully maintaining the same output quality. This dataset has an added input_type tag to separate conversation and advertisement user input for better classification and extraction compared to the v1/v1.2 datasets, and the number of data rows in each input_type is equal to ensure there is no… See the full description on the dataset page: https://huggingface.co/datasets/fevohh/Rayman-Extraction-Dataset-v0.
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
In response to NASA Topic S7.01, Visual Learning Systems, Inc. (VLS) will develop a novel hyperspectral plug-in toolkit for its award-winning Feature Analyst® software that will (a) leverage VLS' proven algorithms to provide a new, simple, and long-awaited approach to materials classification from hyperspectral imagery (HSI), and (b) improve Feature Analyst's state-of-the-art automated feature extraction (AFE) capabilities by effectively incorporating detailed spectral information into its extraction process. HSI techniques, such as spectral end-member classification, can provide effective materials classification; however, current methods are slow (or manual), cumbersome, complex for analysts, and limited to materials classification only. Feature Analyst, on the other hand, has a simple workflow of (a) an analyst providing a few examples (e.g., pixels of a certain material) and (b) an advanced software agent classifying the rest of the imagery based on the examples. This simple yet powerful approach will be used as a new paradigm for materials classification. In addition, Feature Analyst uses, along with spectral information, feature characteristics such as spatial association, size, shape, texture, pattern, and shadow in its generic AFE process. Incorporating the best spectral classifier techniques with the best AFE approach promises to greatly increase the usefulness and applicability of HSI.
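As a generic stand-in for the example-based workflow described above (a few labelled pixels train a classifier that is then applied to the whole image), the sketch below is not the Feature Analyst implementation; the cube dimensions, labels, and classifier are invented for illustration.

```python
# Hedged sketch of example-based hyperspectral classification: analyst-labelled
# pixels train a model that classifies every pixel of the cube. Illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
cube = rng.random((100, 100, 50))          # hypothetical HSI cube: rows x cols x bands

# Analyst-provided examples: (row, col) -> material label
examples = {(10, 12): "asphalt", (40, 7): "vegetation", (75, 60): "water"}
X_train = np.array([cube[r, c] for r, c in examples])
y_train = np.array(list(examples.values()))

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
material_map = clf.predict(cube.reshape(-1, 50)).reshape(100, 100)
```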
https://creativecommons.org/publicdomain/zero/1.0/
The datasets are derived from eye fundus images provided in Kaggle's 'APTOS 2019 Blindness Detection' competition. The competition involves classification of eye fundus images into 5 levels of severity in diabetic retinopathy.
Unlike most participants, who used a deep learning approach to this classification problem, here we tried using Fractal Dimensions and Persistent Homology (one of the major tools in Topological Data Analysis, TDA) to extract features from the images, which serve as inputs to simpler ML algorithms such as SVM. This approach shows some promising results.
There are three files in this dataset:
Process_Images.html - R scripts for extracting Fractal Dimensions and Persistent Homology features from images.
train_features.RDS and test_features.RDS - the output RDS (R dataset files) for training and testing images for the above Kaggle competition.
Columns in train_features.RDS & test_features.RDS:
id_code - image id
diagnosis - severity of diabetic retinopathy on a scale of 0 to 4: 0=No DR; 1=Mild; 2=Moderate; 3=Severe; 4=Proliferative DR; Artificially set to be 0 for test_features.RDS
n - number of persistent homology components detected from the image
fd1 to fd21 - proportion of sliding windows having a specific fractal dimension: fd1 = proportion of windows with FD=2; fd2 = proportion of windows with FD in (2, 2.05]; ...; fd21 = proportion of windows with FD in (2.95, 3.00]
l1_2 to l1_499 - silhouette (p=0.1, dim=1) at various time steps.
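A hedged sketch of how these RDS feature files might be consumed from Python; pyreadr is one way to read RDS files, and the SVM settings are illustrative rather than those used in the original analysis.

```python
# Hedged sketch: loading train_features.RDS from Python and fitting an SVM,
# mirroring the column layout described above. Modelling choices are illustrative.
import pyreadr
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

train = pyreadr.read_r("train_features.RDS")[None]   # RDS objects load under the key None

feature_cols = (
    ["n"]
    + [f"fd{i}" for i in range(1, 22)]                   # fractal-dimension histogram bins
    + [c for c in train.columns if c.startswith("l1_")]  # persistence silhouette values
)
X, y = train[feature_cols].values, train["diagnosis"].astype(int).values

svm = SVC(kernel="rbf", C=1.0)
print(cross_val_score(svm, X, y, cv=5).mean())
```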
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Temporal Model "UIE-Base" finetuned on the "Fullpate" dataset to solve the tasks of extraction and classification of temporal entities. Model produced in the master's thesis "Extraction and Classification of Time in Unstructured Data [Kirsanov, 2023]".
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
prompt folder, the two bash scripts used to run it on the university HPC.
results folder, which contains the log files, and one csv file per model.
platine_check, indicating which gold results were re-checked.
figure_tables.R.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A brain tumor is a cluster of abnormal and uncontrolled cell growth, leading to a short life expectancy at its highest grade. Accurate and timely clinical diagnosis of such tumors plays a critical role in treatment planning and patient care. Various imaging techniques such as Computed Tomography (CT), ultrasound imaging, Magnetic Resonance Imaging (MRI), and biopsy are used to evaluate brain tumors. Among these four, MRI is the most commonly used non-invasive technique; however, the key challenge with these images is the low-level visual information captured by MRI machines, which needs high-level interpretation by an experienced radiologist. Manual interpretation is a tedious, challenging, and error-prone task. In this paper, we propose a novel Convolutional Feature based Euclidean Distance (ConFED) method for faster and more accurate tumor classification. The method consists of convolutional features and Euclidean distance based one-step learning. The proposed method is evaluated on the Contrast-Enhanced Magnetic Resonance Images (CE-MRI) benchmark dataset. The proposed method is more generic as it does not use any handcrafted features, requires minimal preprocessing, and achieves an average accuracy of 97.02% using five-fold cross-validation. Extensive experiments, along with statistical tests, revealed that the proposed method outperforms state-of-the-art classification methods on the CE-MRI dataset.
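One plausible reading of "Euclidean distance based one-step learning" is a nearest-class-centroid rule over convolutional feature vectors; the sketch below implements that reading with random stand-in features and is an assumption, not the paper's actual ConFED implementation.

```python
# Hedged sketch: nearest-centroid classification over convolutional features,
# one interpretation of Euclidean-distance-based one-step learning.
import numpy as np

def fit_centroids(features: np.ndarray, labels: np.ndarray) -> dict:
    """One-step 'training': compute the mean feature vector of each tumor class."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict(features: np.ndarray, centroids: dict) -> np.ndarray:
    """Assign each feature vector to the class with the nearest centroid."""
    classes = np.array(list(centroids))
    dists = np.stack(
        [np.linalg.norm(features - centroids[c], axis=1) for c in classes], axis=1
    )
    return classes[dists.argmin(axis=1)]

# Example with random stand-ins for convolutional features of 3 tumor classes:
rng = np.random.default_rng(0)
X, y = rng.normal(size=(30, 128)), rng.integers(0, 3, size=30)
centroids = fit_centroids(X, y)
print(predict(X[:5], centroids))
```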
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
This table shows the volume of intake water for the mineral extraction industries at the 4-digit Industry Classification, by water source, for Canada. The unit of measure is millions of cubic metres.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The data presented here consists of three parts.
Dataset 1: In this set, we extracted 327,322 faces from our entire collection of 3,389 issues and automatically classified each face as male or female. We present this data as a single table with columns identifying the date, issue, page number, the coordinates identifying the position of the face on the page, and classification (male or female). The coordinates identifying the position of the face on the page are based on the size and resolution of the pages found in the “Time Vault”.
Dataset 2: Dataset 2 consists of 8,789 classified faces from 100 selected issues. Human labor was used to identify and extract 3,299 face images from 39 issues, which were later classified by another set of workers. This selection of 39 issues contains one issue per decade spanned by the archive plus one issue per year between 1961 and 1991, and the extracted face images were used to train the face extraction algorithm. The remaining 5,490 faces from 61 issues were extracted via machine learning before being classified by human coders. These 61 issues were chosen to complement the first selection of 39 issues: one issue per year for all years in the archive excluding those between 1961 and 1991. Thus, Dataset 2 contains fully-labelled faces from at least one issue per year.
Dataset 3: In the interest of transparency, Dataset 3 consists of the raw data collected to create Dataset 2, and consists of 2 tables. Before explaining these tables, we first briefly describe our data collection and verification procedures, which have been fully described elsewhere. A custom AMT interface was used to enable human workers to classify faces according to the categories in Table 4. Each worker was given a randomly-selected batch of 25 pages, each with a clearly highlighted face to be categorized, of which three pages were verification pages with known features, which were used for quality control. Each face was labeled by two distinct human coders, determined at random so that the pairing of coders varied with the image. A proficiency rating was calculated for each coder by considering all images they annotated and computing the average number of labels that matched those identified by the image’s other coder. The tables in Dataset 2 were created by resolving inconsistencies between the two image coders by selecting the labels from the coder with the highest proficiency rating. Prior to calculating the proficiency score, all faces that were tagged as having ‘Poor’ or ‘Error’ image quality by either of the two coders were eliminated. Due to technical bugs when the AMT interface was first implemented, a small number of images were only labeled once; these were also eliminated from Datasets 2 and 3. In Dataset 3, we present the raw annotations for each coder that tagged each face, along with demographic data for each coder. Dataset 3 consists of two tables: the raw data from each of the two sets of coders, and the demographic information for each of the coders.
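The proficiency-based resolution procedure described above could look roughly like the following; the annotation tuple layout is an assumption, not the structure of the released tables.

```python
# Hedged sketch of the resolution procedure: score each coder by how often their
# labels agree with the paired coder, then keep the labels of the higher-scoring
# coder for every doubly-annotated face. The data layout below is assumed.
from collections import defaultdict

def proficiency(annotations):
    """annotations: list of (face_id, coder_id, labels_dict), two coders per face."""
    by_face = defaultdict(list)
    for face_id, coder_id, labels in annotations:
        by_face[face_id].append((coder_id, labels))
    agree, total = defaultdict(int), defaultdict(int)
    for (coder_a, labels_a), (coder_b, labels_b) in (
        pair for pair in by_face.values() if len(pair) == 2
    ):
        # Both coders are assumed to annotate the same set of label fields.
        matches = sum(labels_a[k] == labels_b[k] for k in labels_a)
        for coder in (coder_a, coder_b):
            agree[coder] += matches
            total[coder] += len(labels_a)
    return {coder: agree[coder] / total[coder] for coder in total}

def resolve(annotations, scores):
    """Keep, for each doubly-annotated face, the labels of the more proficient coder."""
    by_face = defaultdict(list)
    for face_id, coder_id, labels in annotations:
        by_face[face_id].append((scores.get(coder_id, 0.0), labels))
    return {
        face_id: max(pair, key=lambda p: p[0])[1]
        for face_id, pair in by_face.items()
        if len(pair) == 2
    }
```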
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study focuses on frequency-based textual category feature extraction methods, investigating the application differences and impacts of "representativeness calculation" and "distinctiveness calculation" in Chinese and English text classification. The core objective is to analyze how text feature extraction algorithms interact with language characteristics and to compare the performance of various algorithms across Chinese and English contexts. To achieve this, I propose a novel feature extraction approach and design a series of experiments to evaluate its effectiveness in both Chinese and English corpora. The dataset for this research includes three Chinese datasets (THUCNews, Sougou Chinese Corpus, CNTC) and three English datasets (20 Newsgroups, R8, R52), which encompass diverse textual features and varying data scales. This selection enables a realistic reflection of the performance differences across different linguistic environments and application scenarios, providing comprehensive and reliable data support for evaluating the effectiveness of the experiments.
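As a generic illustration of the contrast between the two kinds of frequency-based scores, the sketch below computes a within-category rate ("representativeness") and a cross-category ratio ("distinctiveness") for each term; these formulas are stand-ins, not the study's actual calculations.

```python
# Hedged illustration of frequency-based category feature scores; the formulas
# are generic stand-ins, not the study's representativeness/distinctiveness methods.
from collections import Counter

def category_term_scores(docs_by_category: dict) -> dict:
    """docs_by_category maps a category name to a list of tokenised documents."""
    freq = {c: Counter(tok for doc in docs for tok in doc)
            for c, docs in docs_by_category.items()}
    totals = {c: sum(counter.values()) for c, counter in freq.items()}
    scores = {}
    for c, counter in freq.items():
        other_total = sum(totals[o] for o in freq if o != c) or 1
        for term, count in counter.items():
            rate_in_category = count / totals[c]                  # "representativeness"
            rate_elsewhere = sum(freq[o][term] for o in freq if o != c) / other_total
            distinctiveness = rate_in_category / (rate_in_category + rate_elsewhere)
            scores[(c, term)] = (rate_in_category, distinctiveness)
    return scores

# Toy usage with two categories of already-tokenised documents:
print(category_term_scores({
    "sports": [["match", "goal", "team"], ["team", "coach"]],
    "finance": [["market", "stock"], ["stock", "team"]],
}))
```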
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Temporal Model "UIE-Large" finetuned on the "TempEval-3" dataset to solve the tasks of extraction and classification of temporal entities. Model produced in the master's thesis "Extraction and Classification of Time in Unstructured Data [Kirsanov, 2023]".
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
FewRel is a large-scale few-shot relation extraction dataset, which contains more than one hundred relations and tens of thousands of annotated instances across different domains.
https://fred.stlouisfed.org/legal/#copyright-public-domainhttps://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for Chain-Type Quantity Index for Real GDP: Oil and Gas Extraction (211) in Maine (MEOILGASQGSP) from 1997 to 2021 about extraction, quantity index, ME, chained, oil, NAICS, mining, gas, GSP, private industries, private, real, industry, GDP, and USA.