100+ datasets found
  1. Multilingual Historical News Article Extraction and Classification Dataset

    • zenodo.org
    csv
    Updated Jan 12, 2025
    Cite
    Johanna Mauermann; Carlos-Emiliano González-Gallardo; Sarah Oberbichler (2025). Multilingual Historical News Article Extraction and Classification Dataset [Dataset]. http://doi.org/10.57967/hf/3965
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johanna Mauermann; Carlos-Emiliano González-Gallardo; Sarah Oberbichler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 20, 2024
    Description

    This dataset was created specifically to test LLMs' capabilities in processing and extracting topic-specific articles from historical, unstructured newspaper issues. While traditional article separation tasks rely on layout information or a combination of layout and semantic understanding, this dataset evaluates a novel approach using OCR'd text and context understanding. This method can considerably improve the corpus-building process for individual researchers working on specific topics such as migration or disasters. The dataset consists of French, German, and English newspapers from 1909 and contains multiple layers of information: detailed metadata about each newspaper issue (including identifiers, titles, dates, and institutional information), full-text content of newspaper pages or sections, a context window for processing, and human-annotated ground-truth extractions. The dataset is structured to enable a three-step evaluation of LLMs: first, their ability to classify content as relevant or not relevant to a specific topic (such as the 1908 Messina earthquake); second, their accuracy in extracting complete relevant articles from the broader newspaper text; and third, their ability to correctly mark the beginning and end of articles, especially when several articles were published in the same newspaper issue. By providing human-annotated ground truth, the dataset allows for systematic assessment of how well LLMs can understand historical text, maintain contextual relevance, and perform precise information extraction. This testing framework helps evaluate LLMs' effectiveness in handling real-world historical document processing tasks while maintaining accuracy and contextual understanding.
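    The first of these evaluation steps, relevance classification against the human-annotated ground truth, can be scored as in the following minimal Python sketch (the document ids and boolean labels are hypothetical placeholders, not the dataset's actual column layout):

    ```python
    def classification_scores(gold, predicted):
        """Compare relevant/not-relevant labels against ground truth.

        gold, predicted: dicts mapping an issue/page id to True (relevant)
        or False (not relevant). The id scheme is a placeholder assumption.
        """
        tp = fp = fn = tn = 0
        for doc_id, is_relevant in gold.items():
            pred = predicted.get(doc_id, False)
            if pred and is_relevant:
                tp += 1
            elif pred and not is_relevant:
                fp += 1
            elif not pred and is_relevant:
                fn += 1
            else:
                tn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        accuracy = (tp + tn) / len(gold) if gold else 0.0
        return {"precision": precision, "recall": recall, "accuracy": accuracy}
    ```

    The extraction and boundary-marking steps would be scored analogously, but against the annotated article spans rather than per-document labels.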

  2. var-extraction-n-task-classification

    • huggingface.co
    Updated Dec 7, 2024
    + more versions
    Cite
    Travel Agent Project G3 (2024). var-extraction-n-task-classification [Dataset]. https://huggingface.co/datasets/TravelAgentProject/var-extraction-n-task-classification
    Explore at:
    Available download formats: Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated
    Dec 7, 2024
    Dataset authored and provided by
    Travel Agent Project G3
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The TravelAgentProject/var-extraction-n-task-classification dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  3. Outline of the different combinations of features extraction and...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Jessica Schrouff; Caroline Kussé; Louis Wehenkel; Pierre Maquet; Christophe Phillips (2023). Outline of the different combinations of features extraction and classification methods used in the present study. [Dataset]. http://doi.org/10.1371/journal.pone.0035860.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Jessica Schrouff; Caroline Kussé; Louis Wehenkel; Pierre Maquet; Christophe Phillips
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Outline of the different combinations of features extraction and classification methods used in the present study.

  4. Classification accuracy (%) for all subjects using different feature...

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Yijun Wang; Yu-Te Wang; Tzyy-Ping Jung (2023). Classification accuracy (%) for all subjects using different feature extraction methods. [Dataset]. http://doi.org/10.1371/journal.pone.0037665.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yijun Wang; Yu-Te Wang; Tzyy-Ping Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification accuracy (%) for all subjects using different feature extraction methods.

  5. ACL RD-TEC: A Reference Dataset for Terminology Extraction and...

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Dec 31, 2014
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2014). ACL RD-TEC: A Reference Dataset for Terminology Extraction and Classification Research in Computational Linguistics [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-T0375/
    Explore at:
    Dataset updated
    Dec 31, 2014
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    Automatic Term Recognition (ATR) is a research task that deals with the identification of domain-specific terms. Terms are, simply put, textual realizations of significant concepts in an expertise domain. Additionally, domain-specific terms may be classified into a number of categories, in which each category represents a significant concept. A term classification task is often defined on top of an ATR procedure to perform such categorization. For instance, in the biomedical domain, terms can be classified as drugs, proteins, and genes. This is a reference dataset for terminology extraction and classification research in computational linguistics. It is a set of manually annotated terms in the English language that are extracted from the ACL Anthology Reference Corpus (ACL ARC). The ACL ARC is a canonicalised and frozen subset of scientific publications in the domain of Human Language Technologies (HLT). It consists of 10,921 articles from 1965 to 2006. The dataset, called ACL RD-TEC, comprises more than 69,000 candidate terms that are manually annotated as valid or invalid terms. Furthermore, valid terms are classified as technology and non-technology terms. Technology terms refer to a method, process, or in general a technological concept in the domain of HLT, e.g. machine translation, word sense disambiguation, and language modelling. On the other hand, non-technology terms refer to important concepts other than technological; examples of such terms in the domain of HLT are multilingual lexicon, corpora, word sense, and language model. The dataset was created to serve as a gold standard for comparing term recognition and classification algorithms.
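    Scoring an ATR system against such valid/invalid annotations reduces to set comparisons; the sketch below is a hypothetical illustration, not the official ACL RD-TEC evaluation script:

    ```python
    def term_precision_recall(gold_valid, gold_invalid, system_terms):
        """Score extracted candidate terms against manual annotations.

        gold_valid / gold_invalid: sets of manually annotated terms
        (valid and invalid, as in ACL RD-TEC); system_terms: terms
        proposed by an ATR system. Terms outside the annotated sets
        are ignored rather than counted as errors (an assumption).
        """
        judged = {t for t in system_terms if t in gold_valid or t in gold_invalid}
        tp = len(judged & gold_valid)
        precision = tp / len(judged) if judged else 0.0
        recall = tp / len(gold_valid) if gold_valid else 0.0
        return precision, recall
    ```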

  6. Data Extraction table for the study Machine-based Stereotypes: How Machine...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Dec 13, 2022
    Cite
    Anonymous Author (2022). Data Extraction table for the study Machine-based Stereotypes: How Machine Learning Algorithms Evaluate Ethnicity from Face Data [Dataset]. http://doi.org/10.5281/zenodo.7430540
    Explore at:
    Dataset updated
    Dec 13, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Author
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This table contains the data extraction results for the study Machine-based Stereotypes: How Machine Learning Algorithms Evaluate Ethnicity from Face Data. It contains 24 columns and 74 rows.

  7. Depression: Twitter Dataset + Feature Extraction

    • opendatabay.com
    .csv
    Updated Jun 7, 2025
    Cite
    Datasimple (2025). Depression: Twitter Dataset + Feature Extraction [Dataset]. https://www.opendatabay.com/data/dataset/528d3302-f98e-4a27-a218-51d2816cabe7
    Explore at:
    Available download formats: .csv
    Dataset updated
    Jun 7, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Mental Health & Wellness
    Description

    The data is in uncleaned format and was collected using the Twitter API. The tweets have been filtered to keep only English-language content. It targets mental-health classification of the user at the tweet level. Also check out the notebooks I have provided, which demonstrate data cleaning and feature extraction techniques on the given dataset:

    - Topic modelling features using LDA (Latent Dirichlet Allocation), i.e. summarizing a tweet into one of the top k topics
    - Emoji sentiment features, i.e. counts of the positive, negative, and neutral emojis present in the tweet
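    The emoji sentiment features can be illustrated with a minimal Python sketch; the emoji inventories below are placeholder assumptions, not the notebook's actual lexicon:

    ```python
    # Hypothetical emoji inventories (single-codepoint emojis only;
    # multi-codepoint sequences would need grapheme-aware handling).
    POSITIVE = {"😀", "😊", "👍"}
    NEGATIVE = {"😢", "😠", "👎"}
    NEUTRAL = {"😐", "🤔"}

    def emoji_sentiment_features(tweet):
        """Count positive/negative/neutral emojis in one tweet."""
        counts = {"pos": 0, "neg": 0, "neu": 0}
        for ch in tweet:
            if ch in POSITIVE:
                counts["pos"] += 1
            elif ch in NEGATIVE:
                counts["neg"] += 1
            elif ch in NEUTRAL:
                counts["neu"] += 1
        return counts
    ```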

    Original Data Source: Depression: Twitter Dataset + Feature Extraction

  8. WoLLaI Mal-Eng: Word Level Language Identification of Malayalam-English...

    • data.mendeley.com
    Updated Jan 15, 2024
    + more versions
    Cite
    AFSAL CP (2024). WoLLaI Mal-Eng: Word Level Language Identification of Malayalam-English Code-Mixed Text [Dataset]. http://doi.org/10.17632/tzrcrrwz4n.1
    Explore at:
    Dataset updated
    Jan 15, 2024
    Authors
    AFSAL CP
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WoLLaI Mal-Eng is a carefully curated and annotated dataset for word-level language identification in Malayalam-English code-mixed text. The dataset consists of 12,402 sentences, thoroughly tokenized for optimal representation. The dataset file is organized into three columns: sentence #, word, and language. Language annotation is categorized into four distinct classes: Mal, Eng, Mix, and Othr. Words that belong to the Malayalam language and are recognized by Malayalam speakers are annotated as Mal. Words that belong to the English language and are easily recognized by English speakers are annotated as Eng. Words formed by combining Malayalam and English, where Malayalam suffixes are added to English words or parts of English words to aid comprehension for Malayalam speakers, are annotated as Mix. Words of diverse elements such as numbers, abbreviations, and named entities are annotated as Othr.
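    For illustration only, a crude rule-based tagger over Unicode script ranges approximates the four-class scheme (the dataset itself was annotated by humans, and cases such as Latin-script abbreviations, which the dataset tags as Othr, would need extra handling):

    ```python
    import re

    def tag_word(word):
        """Heuristic word-level tag (Mal / Eng / Mix / Othr) based on the
        Malayalam Unicode block (U+0D00-U+0D7F) and Latin letters."""
        has_mal = bool(re.search(r"[\u0D00-\u0D7F]", word))
        has_eng = bool(re.search(r"[A-Za-z]", word))
        if has_mal and has_eng:
            return "Mix"   # e.g. an English root with a Malayalam suffix
        if has_mal:
            return "Mal"
        if has_eng:
            return "Eng"
        return "Othr"      # numbers, symbols, etc.
    ```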

  9. Rayman-Extraction-Dataset-v0

    • huggingface.co
    Cite
    fevo, Rayman-Extraction-Dataset-v0 [Dataset]. https://huggingface.co/datasets/fevohh/Rayman-Extraction-Dataset-v0
    Explore at:
    Authors
    fevo
    Description

    Content:

    This dataset was made in response to the mediocre-to-decent output quality of 0.5B-3B models finetuned on the v1/v1.2 dataset, as a way to cut down the computation needed to extract data while hopefully maintaining the same output quality. This dataset has an added input_type tag to separate conversation and advertisement user input for better classification and extraction compared to the v1/v1.2 datasets, and the number of data rows in each input_type is equal to ensure there is no… See the full description on the dataset page: https://huggingface.co/datasets/fevohh/Rayman-Extraction-Dataset-v0.

  10. Automated Feature Extraction from Hyperspectral Imagery, Phase I

    • data.nasa.gov
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). Automated Feature Extraction from Hyperspectral Imagery, Phase I [Dataset]. https://data.nasa.gov/dataset/Automated-Feature-Extraction-from-Hyperspectral-Im/vqhg-9n7f
    Explore at:
    Available download formats: json, application/rssxml, csv, tsv, xml, application/rdfxml
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    In response to NASA Topic S7.01, Visual Learning Systems, Inc. (VLS) will develop a novel hyperspectral plug-in toolkit for its award-winning Feature Analyst® software that will (a) leverage VLS' proven algorithms to provide a new, simple, and long-awaited approach to materials classification from hyperspectral imagery (HSI), and (b) improve the state-of-the-art Feature Analyst's automated feature extraction (AFE) capabilities by effectively incorporating detailed spectral information into its extraction process. HSI techniques, such as spectral end-member classification, can provide effective materials classification; however, current methods are slow (or manual), cumbersome, complex for analysts, and limited to materials classification only. Feature Analyst, on the other hand, has a simple workflow of (a) an analyst providing a few examples (e.g., pixels of a certain material) and (b) an advanced software agent classifying the rest of the imagery based on those examples. This simple yet powerful approach will be used as a new paradigm for materials classification. In addition, Feature Analyst uses, along with spectral information, feature characteristics such as spatial association, size, shape, texture, pattern, and shadow in its generic AFE process. Incorporating the best spectral classifier techniques with the best AFE approach promises to greatly increase the usefulness and applicability of HSI.
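    The example-based workflow described above (a few labeled pixels, then classify the rest) can be approximated by a nearest-mean spectral classifier; this is a simplified stand-in for illustration, not VLS's actual algorithm:

    ```python
    def classify_pixels(examples, pixels):
        """Nearest-mean spectral classification.

        examples: label -> list of example spectra (tuples of band values),
        the "few examples" an analyst would provide.
        pixels: spectra to classify. Returns one label per pixel.
        """
        centroids = {}
        for label, spectra in examples.items():
            bands = list(zip(*spectra))          # group values per band
            centroids[label] = [sum(b) / len(b) for b in bands]

        def dist2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        return [min(centroids, key=lambda l: dist2(centroids[l], p))
                for p in pixels]
    ```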

  11. Feature Extraction

    • kaggle.com
    Updated Sep 4, 2019
    Cite
    Jason (2019). Feature Extraction [Dataset]. https://www.kaggle.com/jclchan/feature-extraction/notebooks
    Explore at:
    Available download formats: Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated
    Sep 4, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jason
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The datasets are derived from eye fundus images provided in Kaggle's 'APTOS 2019 Blindness Detection' competition. The competition involves classification of eye fundus images into 5 levels of severity in diabetic retinopathy.

    Unlike most participants, who used deep learning approaches to this classification problem, here we tried using fractal dimensions and persistent homology (one of the major tools in Topological Data Analysis, TDA) to extract features from images as inputs to simpler ML algorithms like SVM. This approach shows some promising results.

    There are three files in this dataset:

    1. Process_Images.html - R scripts for extracting Fractal Dimensions and Persistent Homology features from images.

    2. train_features.RDS and test_features.RDS - the output RDS (R dataset files) for training and testing images for the above Kaggle competition.

    Columns in train_features.RDS & test_features.RDS:

    1. id_code - image id

    2. diagnosis - severity of diabetic retinopathy on a scale of 0 to 4: 0=No DR; 1=Mild; 2=Moderate; 3=Severe; 4=Proliferative DR; Artificially set to be 0 for test_features.RDS

    3. n - number of persistent homology components detected from the image

    4. fd1 to fd21 - proportion of sliding windows having a specific fractal dimensions: fd1 = proportion of windows having FD=2; fd2=proportion of windows having FD in (2, 2.05];... fd21=proportion of windows having FD in (2.95,3.00]

    5. l1_2 to l1_499 - silhouette (p=0.1, dim=1) at various time steps.
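    The fd1-fd21 binning described in item 4 can be sketched as follows (assuming the per-window fractal dimensions have already been computed, e.g. by the R scripts above):

    ```python
    import math

    def fd_proportions(window_fds):
        """Bin sliding-window fractal dimensions into 21 proportions:
        fd1 counts FD == 2 exactly, and fd(k) for k = 2..21 counts FD in
        (2 + 0.05*(k-2), 2 + 0.05*(k-1)], so fd21 covers (2.95, 3.00].
        """
        counts = [0] * 21
        for fd in window_fds:
            if fd == 2.0:
                counts[0] += 1
            elif 2.0 < fd <= 3.0:
                k = math.ceil((fd - 2.0) / 0.05)  # 1..20 -> bins fd2..fd21
                counts[k] += 1
        n = len(window_fds)
        return [c / n for c in counts] if n else counts
    ```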

  12. Temporal Model UIE-Base_Fullpate

    • fdr.uni-hamburg.de
    zip
    Updated Nov 1, 2023
    + more versions
    Cite
    Kirsanov Simon (2023). Temporal Model UIE-Base_Fullpate [Dataset]. http://doi.org/10.25592/uhhfdm.13601
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 1, 2023
    Authors
    Kirsanov Simon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Temporal Model "UIE-Base" finetuned on the "Fullpate" dataset to solve the tasks of extraction and classification of temporal entities. Model produced in the master's thesis "Extraction and Classification of Time in Unstructured Data [Kirsanov, 2023]".

  13. data and code for "Beyond Human Gold Standards: A Multi-Model Framework for...

    • zenodo.org
    zip
    Updated Jul 7, 2025
    Cite
    Denis Mongin (2025). data and code for "Beyond Human Gold Standards: A Multi-Model Framework for Automated Abstract Classification and Information Extraction" article [Dataset]. http://doi.org/10.5281/zenodo.15829040
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Denis Mongin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This is the public repository for the article "Beyond Human Gold Standards: A Multi-Model Framework for Automated Abstract Classification and Information Extraction" by Delphine S. Courvoisier, Diana Buitrago Garcia, Nils Burgisser, Clément P. Buclin, Michele Iudici, and Denis Mongin.
    The uptodate repository can be found here: https://gitlab.unige.ch/trial_integrity/llm_majority_public

    The structure of the repository is as follows:



    - The folder LLM_inference contains the LLM inferences for the two tasks performed on the abstracts listed in the abstract csv file, by the list of LLMs described in the model_list.csv file. The two tasks are the classification of the intervention (folder abstract_classification) and the extraction of the number of participants (participant_numbers folder). The initial list contained 1,080 abstracts, some of which were not considered in our final analysis because they were protocols, and not randomized.
    - Both folders contain the python script used for the inference using the prompt in the prompt folder, and the two bash scripts used to run it on the university HPC.
    - All inference results are in the results folder, which contains the log files and one csv file per model.
    - The file gold.csv contains, for the final list of 1,020 abstracts, the tasks performed by each reviewer, the human gold standard, and the platinum standard, with a 0/1 variable platine_check indicating which gold results were re-checked.
    - The folder R_analysis contains the R files used to perform the analysis and produce the tables and figures:
    - the file analysis.R contains the code to read the LLM inference results and calculate the accuracy for the different model combinations. It outputs a file in the results folder.
    - the file figure_tables.R contains the R code that uses the result of analysis.R to produce the tables and figures of the article. The figures and tables are created in the figures_tables folder. The file trial_publication_info.csv contains information about the RCTs used for this analysis, coming from the data of the study doi.org/10.1016/j.jclinepi.2024.111586.
    - the file help_func.R contains the functions used to format the table results, and is loaded in figure_tables.R.
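    A minimal sketch of the majority-vote step implied by the multi-model framework (the label names and tie-breaking rule here are assumptions; see the repository for the actual implementation):

    ```python
    from collections import Counter

    def majority_label(predictions):
        """Majority vote over per-model labels for one abstract.

        predictions: mapping model name -> predicted label. Ties are
        broken by taking the alphabetically first tied label, an
        arbitrary choice made for this sketch.
        """
        counts = Counter(predictions.values())
        top = max(counts.values())
        return sorted(lbl for lbl, c in counts.items() if c == top)[0]
    ```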

  14. Data from: A Novel Method Based on Convolutional Features with Non-Iterative...

    • acquire.cqu.edu.au
    • researchdata.edu.au
    Updated Jan 8, 2024
    Cite
    Toshi Sinha (2024). A Novel Method Based on Convolutional Features with Non-Iterative Learning for Brain Tumor Classification [Dataset]. http://doi.org/10.25946/21161026.v1
    Explore at:
    Dataset updated
    Jan 8, 2024
    Dataset provided by
    CQUniversity
    Authors
    Toshi Sinha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A brain tumor is a cluster of abnormal, uncontrolled cell growth, leading to a short life expectancy at its highest grade. Accurate and timely clinical diagnosis of such tumors plays a critical role in treatment planning and patient care. Various imaging techniques such as Computed Tomography (CT), ultrasound imaging, Magnetic Resonance Imaging (MRI), and biopsy are used to evaluate brain tumors. Among the four, MRI is the most commonly used non-invasive technique; however, the key challenge with these images is the low-level visual information captured by MRI machines, which needs high-level interpretation by an experienced radiologist. Manual interpretation is a tedious, challenging, and error-prone task. In this paper, we propose a novel Convolutional Feature based Euclidean Distance (ConFED) method for faster and more accurate tumor classification. The method consists of convolutional features and Euclidean-distance-based one-step learning. The proposed method is evaluated on the Contrast-Enhanced Magnetic Resonance Images (CE-MRI) benchmark dataset. The proposed method is more generic as it does not use any handcrafted features, requires minimal preprocessing, and achieves an average accuracy of 97.02% using five-fold cross-validation. Extensive experiments, along with statistical tests, revealed that the proposed method outperforms state-of-the-art classification methods on the CE-MRI dataset.

  15. Water intake in mineral extraction industries, by source and North American...

    • data.wu.ac.at
    csv, html, xml
    Updated Jul 10, 2018
    + more versions
    Cite
    Statistics Canada | Statistique Canada (2018). Water intake in mineral extraction industries, by source and North American Industry Classification System [Dataset]. https://data.wu.ac.at/schema/www_data_gc_ca/YzdiZjQ5MDQtYTdjNi00M2ZiLWIyMWUtMjNmMjZhZDQwNzQ0
    Explore at:
    Available download formats: html, csv, xml
    Dataset updated
    Jul 10, 2018
    Dataset provided by
    Statistics Canada (https://statcan.gc.ca/en)
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Description

    This table shows the volume of intake water for the mineral extraction industries at the 4-digit Industry Classification, by water source, for Canada. The unit of measure is millions of cubic metres.

  16. Dataset: Faces extracted from Time Magazine 1923-2014

    • dataverse.harvard.edu
    • marketplace.sshopencloud.eu
    Updated Mar 18, 2020
    Cite
    Ana Jofre (2020). Dataset: Faces extracted from Time Magazine 1923-2014 [Dataset]. http://doi.org/10.7910/DVN/JMFQT7
    Explore at:
    Available download formats: Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated
    Mar 18, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Ana Jofre
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The data presented here consists of three parts: Dataset 1: In this set, we extract 327,322 faces from our entire collection of 3,389 issues and automatically classify each face as male or female. We present this data as a single table with columns identifying the date, issue, page number, the coordinates identifying the position of the face on the page, and classification (male or female). The coordinates identifying the position of the face on the page are based on the size and resolution of the pages found in the “Time Vault”. Dataset 2: Dataset 2 consists of 8,789 classified faces from 100 selected issues. Human labor was used to identify and extract 3,299 face images from 39 issues, which were later classified by another set of workers. This selection of 39 issues contains one issue per decade spanned by the archive plus one issue per year between 1961 and 1991, and the extracted face images were used to train the face extraction algorithm. The remaining 5,490 faces from 61 issues were extracted via machine learning before being classified by human coders. These 61 issues were chosen to complement the first selection of 39 issues: one issue per year for all years in the archive excluding those between 1961 and 1991. Thus, Dataset 2 contains fully-labelled faces from at least one issue per year. Dataset 3: In the interest of transparency, Dataset 3 consists of the raw data collected to create Dataset 2, and consists of 2 tables. Before explaining these tables, we first briefly describe our data collection and verification procedures, which have been fully described elsewhere. A custom AMT interface was used to enable human workers to classify faces according to the categories in Table 4. Each worker was given a randomly-selected batch of 25 pages, each with a clearly highlighted face to be categorized, of which three pages were verification pages with known features, used for quality control.
    Each face was labeled by two distinct human coders, determined at random so that the pairing of coders varied with the image. A proficiency rating was calculated for each coder by considering all images they annotated and computing the average number of labels that matched those identified by the image's other coder. The tables in Dataset 2 were created by resolving inconsistencies between the two image coders by selecting the labels from the coder with the highest proficiency rating. Prior to calculating the proficiency score, all faces that were tagged as having 'Poor' or 'Error' image quality by either of the two coders were eliminated. Due to technical bugs when the AMT interface was first implemented, a small number of images were only labeled once; these were also eliminated from Datasets 2 and 3. In Dataset 3, we present the raw annotations from each coder that tagged each face, along with demographic data for each coder. Dataset 3 consists of two tables: the raw data from each of the two sets of coders, and the demographic information for each of the coders.
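    The proficiency rating described above can be sketched as follows (the per-image label layout is a simplified assumption for illustration):

    ```python
    def proficiency(coder_labels, other_labels):
        """Average fraction of labels matching the image's other coder.

        coder_labels / other_labels: lists of per-image label dicts,
        aligned by position (a hypothetical layout; the study's AMT
        exports are not specified here).
        """
        scores = []
        for mine, theirs in zip(coder_labels, other_labels):
            keys = set(mine) | set(theirs)
            matched = sum(1 for k in keys if mine.get(k) == theirs.get(k))
            scores.append(matched / len(keys) if keys else 1.0)
        return sum(scores) / len(scores) if scores else 0.0
    ```

    Resolving a disagreement then amounts to keeping the labels of whichever coder has the higher proficiency score.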

  17. Article Data

    • figshare.com
    zip
    Updated Apr 12, 2025
    Cite
    乔木 温 (2025). Article Data [Dataset]. http://doi.org/10.6084/m9.figshare.28758812.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 12, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    乔木 温
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study focuses on frequency-based textual category feature extraction methods, investigating the application differences and impacts of "representativeness calculation" and "distinctiveness calculation" in Chinese and English text classification. The core objective is to analyze how text feature extraction algorithms interact with language characteristics and to compare the performance of various algorithms across Chinese and English contexts. To achieve this, I propose a novel feature extraction approach and design a series of experiments to evaluate its effectiveness in both Chinese and English corpora. The dataset for this research includes three Chinese datasets (THUCNews, Sougou Chinese Corpus, CNTC) and three English datasets (20 Newsgroups, R8, R52), which encompass diverse textual features and varying data scales. This selection enables a realistic reflection of the performance differences across different linguistic environments and application scenarios, providing comprehensive and reliable data support for evaluating the effectiveness of the experiments.
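    For illustration, one generic frequency-based form of the two calculations (these are not the paper's exact formulas, which are not given here): representativeness as a term's relative frequency within a category, and distinctiveness as that frequency divided by the term's relative frequency across all categories.

    ```python
    from collections import Counter

    def category_term_scores(docs_by_category):
        """Illustrative frequency-based category feature scores.

        docs_by_category: category -> list of whitespace-tokenizable docs.
        Returns category -> {term: (representativeness, distinctiveness)}.
        """
        totals = Counter()
        per_cat = {}
        for cat, docs in docs_by_category.items():
            c = Counter(w for d in docs for w in d.split())
            per_cat[cat] = c
            totals.update(c)
        all_n = sum(totals.values())
        scores = {}
        for cat, c in per_cat.items():
            n = sum(c.values())
            scores[cat] = {
                w: (c[w] / n, (c[w] / n) / (totals[w] / all_n)) for w in c
            }
        return scores
    ```

    Whitespace tokenization is itself an assumption; Chinese text would need a segmenter, which is part of the language-dependence the study investigates.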

  18. Temporal Model UIE-Large_TempEval-3

    • fdr.uni-hamburg.de
    zip
    Updated Nov 1, 2023
    + more versions
    Cite
    Kirsanov (2023). Temporal Model UIE-Large_TempEval-3 [Dataset]. http://doi.org/10.25592/uhhfdm.13615
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 1, 2023
    Authors
    Kirsanov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Temporal Model "UIE-Large" finetuned on the "TempEval-3" dataset to solve the tasks of extraction and classification of temporal entities. Model produced in the master's thesis "Extraction and Classification of Time in Unstructured Data [Kirsanov, 2023]".

  19. few_rel

    • huggingface.co
    Updated Aug 29, 2023
    Cite
    Tsinghua NLP group (2023). few_rel [Dataset]. https://huggingface.co/datasets/thunlp/few_rel
    Explore at:
    Dataset updated
    Aug 29, 2023
    Authors
    Tsinghua NLP group
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    FewRel is a large-scale few-shot relation extraction dataset, which contains more than one hundred relations and tens of thousands of annotated instances across different domains.
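    Few-shot benchmarks like FewRel are typically used via N-way K-shot episodes; a minimal episode sampler, under an assumed relation-to-instances layout (not FewRel's actual file format):

    ```python
    import random

    def sample_episode(data, n_way=5, k_shot=1, q_query=1, seed=0):
        """Sample one N-way K-shot episode.

        data: relation name -> list of instances (a simplified layout).
        Returns (support, query) dicts with disjoint instances per relation.
        """
        rng = random.Random(seed)
        relations = rng.sample(sorted(data), n_way)
        support, query = {}, {}
        for rel in relations:
            inst = rng.sample(data[rel], k_shot + q_query)
            support[rel] = inst[:k_shot]
            query[rel] = inst[k_shot:]
        return support, query
    ```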

  20. Chain-Type Quantity Index for Real GDP: Oil and Gas Extraction (211) in...

    • fred.stlouisfed.org
    json
    Updated Sep 30, 2022
    Cite
    (2022). Chain-Type Quantity Index for Real GDP: Oil and Gas Extraction (211) in Maine [Dataset]. https://fred.stlouisfed.org/series/MEOILGASQGSP
    Explore at:
    Available download formats: json
    Dataset updated
    Sep 30, 2022
    License

    https://fred.stlouisfed.org/legal/#copyright-public-domain

    Description

    Graph and download economic data for Chain-Type Quantity Index for Real GDP: Oil and Gas Extraction (211) in Maine (MEOILGASQGSP) from 1997 to 2021 about extraction, quantity index, ME, chained, oil, NAICS, mining, gas, GSP, private industries, private, real, industry, GDP, and USA.
