4 datasets found
  1. Dataset and Models for Detection of News Agency Releases in Historical...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Sep 12, 2023
    Cite
    Lea Marxen; Maud Ehrmann; Emanuela Boros; Marten Düring (2023). Dataset and Models for Detection of News Agency Releases in Historical Newspapers [Dataset]. http://doi.org/10.5281/zenodo.8333933
    Available download formats: zip
    Dataset updated
    Sep 12, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lea Marxen; Maud Ehrmann; Emanuela Boros; Marten Düring
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This record contains the annotated datasets and models used and produced for the work reported in the Master's thesis "Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers" (link).

    Please cite this report if you are using the models/datasets or find it relevant to your research:

    @article{Marxen:305129,
       title = {Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers},
       author = {Marxen, Lea},
       pages = {114p},
       year = {2023},
       url = {http://infoscience.epfl.ch/record/305129},
    }


    1. DATA

    The newsagency-dataset contains historical newspaper articles annotated with news agency mentions. The articles are divided into French (fr) and German (de) subsets, each split into train, dev and test sets. The data is annotated at the token level in CoNLL format with IOB tags.
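
    As a rough illustration of working with token-level IOB annotations, the sketch below collects agency-mention spans from a CoNLL-style file. The file path and the assumption that the token sits in the first column and the IOB tag in the last are illustrative only; check the released files for the exact layout.

    def read_mentions(path):
        """Collect agency-mention spans from a CoNLL file with IOB tags."""
        mentions, current = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):      # sentence boundary or comment
                    token, tag = None, "O"
                else:
                    cols = line.split("\t")
                    token, tag = cols[0], cols[-1]        # assumed column layout
                if tag.startswith("B-"):                  # a new mention starts
                    if current:
                        mentions.append(" ".join(current))
                    current = [token]
                elif tag.startswith("I-") and current:    # the current mention continues
                    current.append(token)
                else:                                     # O tag or boundary closes any open mention
                    if current:
                        mentions.append(" ".join(current))
                    current = []
        if current:
            mentions.append(" ".join(current))
        return mentions

    print(read_mentions("fr/train.tsv"))                  # hypothetical path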

    The distribution of articles in the different sets is as follows:

    Dataset Statistics

    Set    Lg.   Docs   Agency Mentions
    Train  de    333    493
    Train  fr    903    1,122
    Dev    de    32     26
    Dev    fr    110    114
    Test   de    32     58
    Test   fr    120    163

    Due to an error, there are seven duplicated articles in the French test set (article IDs: courriergdl-1847-10-02-a-i0002, courriergdl-1852-02-14-a-i0002, courriergdl-1860-10-31-a-i0016, courriergdl-1864-12-15-a-i0005, lunion-1860-11-27-a-i0004, lunion-1865-02-05-a-i0012, lunion-1866-02-16-a-i0009).

    2. MODELS

    The two agency detection and classification models used for inference on the impresso corpus are also released:

    • newsagency-model-de: based on German BERT (with maximum sequence length 128), fine-tuned with the German training set of the newsagency-dataset
    • newsagency-model-fr: based on French Europeana BERT (with maximum sequence length 128), fine-tuned with the French training set of the newsagency-dataset

    The models perform multitask classification with two prediction heads: one for token-level agency entity classification and one for sentence-level classification (has_agency: yes/no). They can be run with TorchServe; for details, see the newsagency-classification repository.
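
    As a hedged sketch of serving one of the models, the snippet below posts text to a local TorchServe instance. The host and port are TorchServe defaults; the registered model name and the input payload format are assumptions, so consult the newsagency-classification repository for the actual contract.

    import requests

    # Assumed model name and payload; both may differ from the deployed handler.
    url = "http://localhost:8080/predictions/newsagency-model-fr"
    payload = {"text": "(Havas) Paris, 3 mars. Le gouvernement annonce ..."}   # illustrative input

    response = requests.post(url, json=payload, timeout=30)
    response.raise_for_status()
    print(response.json())   # expected: token-level agency tags plus a sentence-level has_agency flag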

    Please refer to the report for further information or contact us.

    3. CODE

    https://github.com/impresso/newsagency-classification

    4. CONTACT

    Maud Ehrmann (EPFL-DHLAB)
    Emanuela Boros (EPFL-DHLAB)

  2. American Sign Language Dataset

    • kaggle.com
    Updated Dec 30, 2024
    + more versions
    Cite
    M Rasol Esfandiari (2024). American Sign Language Dataset [Dataset]. https://www.kaggle.com/datasets/esfiam/american-sign-language-dataset
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 30, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M Rasol Esfandiari
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    United States
    Description

    About Dataset

    This dataset is designed for training and evaluating machine learning models to recognize American Sign Language (ASL) hand gestures, including both numbers (0-9) and English alphabet letters (a-z). It is a well-organized dataset that can be used for computer vision tasks, particularly image classification and gesture recognition.

    Dataset Structure:

    The dataset contains two main folders:

    1. Train:
       • Used for training the model.
       • Includes 36 subdirectories (one for each class: 0-9 and a-z).
       • Each subdirectory contains 56 images of the corresponding class.

    2. Test:
       • Used for evaluating the model.
       • Includes 36 subdirectories (one for each class: 0-9 and a-z).
       • Each subdirectory contains 14 images of the corresponding class.

    Dataset Summary:

    Folder   Number of Classes   Images per Class   Total Images
    Train    36                  56                 2,016
    Test     36                  14                 504

    Features:

    • Number of Classes: 36 (10 digits + 26 letters).
    • Image Format: JPEG.

    Applications:

    This dataset is ideal for:

    • Training convolutional neural networks (CNNs) for ASL recognition.
    • Exploring data augmentation techniques for image classification.
    • Developing real-world AI applications such as sign language translators.

    Suggested Workflow:

    1. Load the dataset and split it into training and testing sets.
    2. Apply data augmentation to enhance diversity in training data.
    3. Train a CNN model to classify the 36 ASL hand gestures.
    4. Evaluate the model's performance using the provided test set.
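
    A minimal sketch of this workflow, assuming a PyTorch/torchvision setup: the folder names ("Train", "Test"), image size, and hyperparameters are assumptions to adapt to the actual dataset layout, and the dataset's own Train/Test folders are used in place of a manual split.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Step 2: light augmentation on the training images only.
    train_tf = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),
        transforms.ToTensor(),
    ])
    test_tf = transforms.Compose([transforms.Resize((64, 64)), transforms.ToTensor()])

    # Step 1: the dataset already ships as Train/Test folders with 36 class subdirectories.
    train_ds = datasets.ImageFolder("Train", transform=train_tf)
    test_ds = datasets.ImageFolder("Test", transform=test_tf)
    train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)
    test_dl = DataLoader(test_ds, batch_size=32)

    # Step 3: a small CNN for the 36 classes (0-9, a-z).
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(32 * 16 * 16, 36),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(5):
        for images, labels in train_dl:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()

    # Step 4: accuracy on the provided test set.
    model.eval()
    with torch.no_grad():
        correct = sum((model(x).argmax(1) == y).sum().item() for x, y in test_dl)
    print(f"Test accuracy: {correct / len(test_ds):.3f}")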

    Credits:

    This dataset is curated to facilitate the development of models for sign language recognition and gesture-based interaction systems. If you use this dataset in your research or projects, please consider sharing your findings or improvements!

  3. Codecfake dataset - training set (part 2 of 3)

    • zenodo.org
    bin
    Updated May 16, 2024
    + more versions
    Cite
    Yuankun Xie (2024). Codecfake dataset - training set (part 2 of 3) [Dataset]. http://doi.org/10.5281/zenodo.11171720
    Available download formats: bin
    Dataset updated
    May 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yuankun Xie
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This dataset is the training set (part 2 of 3) of the Codecfake dataset, corresponding to the manuscript "The Codecfake Dataset and Countermeasures for Universal Deepfake Audio Detection".

    Abstract

    With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for effective detection methods. Unlike traditional deepfake audio generation, which often involves multi-step processes culminating in vocoder usage, ALM directly utilizes neural codec methods to decode discrete codes into audio. Moreover, driven by large-scale data, ALMs exhibit remarkable robustness and versatility, posing a significant challenge to current audio deepfake detection (ADD) models. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method, the conversion from neural codec to waveform. We initially construct the Codecfake dataset, an open-source large-scale dataset, including two languages, millions of audio samples, and various test conditions, tailored for ALM-based audio detection. Additionally, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original SAM, we propose the CSAM strategy to learn a domain-balanced and generalized minima. Experiment results demonstrate that co-training on the Codecfake dataset and the vocoded dataset with the CSAM strategy yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models.

    Codecfake Dataset

    Due to platform restrictions on the size of Zenodo repositories, we have divided the Codecfake dataset into the subsets listed below:

    • training set (part 1 of 3) & label: train_split.zip & train_split.z01 - train_split.z06 (https://zenodo.org/records/11171708)
    • training set (part 2 of 3): train_split.z07 - train_split.z14 (https://zenodo.org/records/11171720)
    • training set (part 3 of 3): train_split.z15 - train_split.z19 (https://zenodo.org/records/11171724)
    • development set: dev_split.zip & dev_split.z01 - dev_split.z02 (https://zenodo.org/records/11169872)
    • test set (part 1 of 2): Codec test C1.zip - C6.zip & ALM test A1.zip - A3.zip (https://zenodo.org/records/11169781)
    • test set (part 2 of 2): Codec unseen test C7.zip (https://zenodo.org/records/11125029)

    Countermeasure

    The source code of the countermeasure and the pre-trained model are available on GitHub: https://github.com/xieyuankun/Codecfake.

    The Codecfake dataset and pre-trained model are licensed under the CC BY-NC-ND 4.0 license.

  4. A Dataset of Outdoor RSS Measurements for Localization

    • zenodo.org
    • data.niaid.nih.gov
    tiff, zip
    Updated Jul 6, 2024
    Cite
    Frost Mitchell; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara (2024). A Dataset of Outdoor RSS Measurements for Localization [Dataset]. http://doi.org/10.5281/zenodo.10962857
    Available download formats: tiff, zip
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Frost Mitchell; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Update: New version includes additional samples taken in November 2022.

    Dataset Description

    This dataset is a large-scale set of measurements for RSS-based localization. The data consists of received signal strength (RSS) measurements taken using the POWDER Testbed at the University of Utah. Samples include either 0, 1, or 2 active transmitters.

    The dataset consists of 5,214 unique samples, with transmitters in 5,514 unique locations. The majority of the samples contain only 1 transmitter, but there are small sets of samples with 0 or 2 active transmitters, as shown below. Each sample has RSS values from between 10 and 25 receivers. The majority of the receivers are stationary endpoints fixed on the side of buildings, on rooftop towers, or on free-standing poles. A small set of receivers are located on shuttles which travel specific routes throughout campus.

    Sample Type      Sample Count   Receiver Count
    No-Tx samples    46             10 to 25
    1-Tx samples     4,822          10 to 25
    2-Tx samples     346            11 to 12

    The transmitters for this dataset are handheld walkie-talkies (Baofeng BF-F8HP) transmitting in the FRS/GMRS band at 462.7 MHz. These devices have a rated transmission power of 1 W. The raw IQ samples were processed through a 6 kHz bandpass filter to remove neighboring transmissions, and the RSS value was calculated as follows:

    \(RSS = \frac{10}{N} \log_{10}\left(\sum_{i=1}^{N} x_i^2 \right)\)
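
    As an illustrative sketch only, the snippet below computes the RSS value exactly as written above for a block of N samples; the input here is synthetic and the 6 kHz band-pass filtering step is omitted.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(scale=0.01, size=10_000)       # stand-in for N = 10,000 filtered samples
    N = x.size
    rss = (10 / N) * np.log10(np.sum(x ** 2))     # RSS = (10/N) * log10(sum_i x_i^2)
    print(f"RSS = {rss:.4f}")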

    Measurement Parameter    Value
    Frequency                462.7 MHz
    Radio Gain               35 dB
    Receiver Sample Rate     2 MHz
    Sample Length            N = 10,000
    Band-pass Filter         6 kHz
    Transmitters             0 to 2
    Transmission Power       1 W

    Receivers consist of Ettus USRP X310 and B210 radios, and a mix of wide- and narrow-band antennas, as shown in the table below. Each receiver took measurements with a receiver gain of 35 dB. However, devices have different maximum gain settings, and no calibration data was available, so all RSS values in the dataset are uncalibrated and only relative to the device.

    Usage Instructions

    Data is provided in .json format, both as one file and as split files.

    import json

    data_file = 'powder_462.7_rss_data.json'
    with open(data_file) as f:
        data = json.load(f)    # dictionary keyed by sample timestamp
    

    The json data is a dictionary with the sample timestamp as a key. Within each sample are the following keys:

    • rx_data: A list of data from each receiver. Each entry contains RSS value, latitude, longitude, and device name.
    • tx_coords: A list of coordinates for each transmitter. Each entry contains latitude and longitude.
    • metadata: A list of dictionaries containing metadata for each transmitter, in the same order as the rows in tx_coords.
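
    Building on the loading snippet above, a short illustrative walk over a few samples (only the keys listed above are used; inspect a sample to see the exact field names inside each rx_data entry):

    for timestamp, sample in list(data.items())[:3]:
        print(timestamp,
              "transmitters:", len(sample['tx_coords']),
              "receivers:", len(sample['rx_data']))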

    File Separations and Train/Test Splits

    The separated_data.zip archive contains several train/test separations of the data.

    • all_data contains all the data in the main JSON file, separated by the number of transmitters.
    • stationary consists of 3 cases where a stationary receiver remained in one location for several minutes. This may be useful for evaluating localization using mobile shuttles, or measuring the variation in the channel characteristics for stationary receivers.
    • train_test_splits contains unique data splits used for training and evaluating ML models. These splits only used data from the single-tx case. In other words, the union of all splits, along with unused.json, is equivalent to the file all_data/single_tx.json.
      • The random split is a random 80/20 split of the data.
      • special_test_cases contains the stationary transmitter data, indoor transmitter data (with high noise in GPS location), and transmitters off campus.
      • The grid split divides the campus region into a 10 by 10 grid. Each grid square is assigned to the training or test set, with 80 squares in the training set and the remainder in the test set. If a square is assigned to the test set, none of its four neighbors are included in the test set. Transmitters occurring in each grid square are assigned to train or test. One such random assignment of grid squares makes up the grid split.
      • The seasonal split contains data separated by the month of collection: April, July, or November.
      • The transportation split contains data separated by the method of movement for the transmitter: walking, cycling, or driving. The non-driving.json file contains the union of the walking and cycling data.
      • campus.json contains the on-campus data, so it is equivalent to the union of all splits, not including unused.json.
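
    As an illustrative sketch, an 80/20 random split like the one in train_test_splits can be reproduced from the single-transmitter samples. The file name follows the all_data/single_tx.json path mentioned above; for comparable results, prefer the released split files themselves.

    import json
    import random

    with open('all_data/single_tx.json') as f:
        single_tx = json.load(f)

    keys = sorted(single_tx)                 # sample timestamps
    random.seed(0)
    random.shuffle(keys)
    cut = int(0.8 * len(keys))
    train = {k: single_tx[k] for k in keys[:cut]}
    test = {k: single_tx[k] for k in keys[cut:]}
    print(len(train), "train samples,", len(test), "test samples")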

    Digital Surface Model

    The dataset includes a digital surface model (DSM) from a State of Utah 2013-2014 LiDAR survey. This map includes the University of Utah campus and surrounding area. The DSM includes buildings and trees, unlike some digital elevation models.

    To read the data in Python:

    import rasterio as rio
    import numpy as np
    import utm

    dsm_object = rio.open('dsm.tif')
    dsm_map = dsm_object.read(1)            # np.array containing elevation values
    dsm_resolution = dsm_object.res         # tuple containing x, y resolution (0.5 meters)
    dsm_transform = dsm_object.transform    # Affine transform from (col, row) to UTM zone 12 coordinates
    utm_transform = np.array(dsm_transform).reshape((3, 3))[:2]
    # The Affine transform expects (col, row); dsm_object.shape is (rows, cols),
    # so the bottom-right corner uses shape[1] (cols) first, then shape[0] (rows).
    utm_top_left = utm_transform @ np.array([0, 0, 1])
    utm_bottom_right = utm_transform @ np.array([dsm_object.shape[1], dsm_object.shape[0], 1])
    latlon_top_left = utm.to_latlon(utm_top_left[0], utm_top_left[1], 12, 'T')
    latlon_bottom_right = utm.to_latlon(utm_bottom_right[0], utm_bottom_right[1], 12, 'T')
    

    Dataset Acknowledgement: This DSM file is acquired by the State of Utah and its partners, and is in the public domain and can be freely distributed with proper credit to the State of Utah and its partners. The State of Utah and its partners makes no warranty, expressed or implied, regarding its suitability for a particular use and shall not be liable under any circumstances for any direct, indirect, special, incidental, or consequential damages with respect to users of this product.

    DSM DOI: https://doi.org/10.5069/G9TH8JNQ
