Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This record contains the annotated datasets and models used and produced for the work reported in the Master Thesis "Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers" (link).
Please cite this report if you use the models/datasets or find it relevant to your research:
@article{Marxen:305129,
  title = {Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers},
  author = {Marxen, Lea},
  pages = {114p},
  year = {2023},
  url = {http://infoscience.epfl.ch/record/305129},
}
1. DATA
The newsagency-dataset contains historical newspaper articles with annotations of news agency mentions. The articles are divided into French (fr) and German (de) subsets, each with its own train, dev and test set. The data is annotated at the token level in CoNLL format with IOB tags, as illustrated below.
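For illustration, a token-level IOB annotation takes roughly the following shape (the tokens and the label name here are invented for illustration; the actual tag set is defined in the dataset files):

Havas	B-pressagency
meldet	O
aus	O
Paris	O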
The distribution of articles in the different sets is as follows:
Split | Lg. | Docs | Agency Mentions |
---|---|---|---|
Train | de | 333 | 493 |
Train | fr | 903 | 1,122 |
Dev | de | 32 | 26 |
Dev | fr | 110 | 114 |
Test | de | 32 | 58 |
Test | fr | 120 | 163 |
Due to an error, there are seven duplicated articles in the French test set (article IDs: courriergdl-1847-10-02-a-i0002, courriergdl-1852-02-14-a-i0002, courriergdl-1860-10-31-a-i0016, courriergdl-1864-12-15-a-i0005, lunion-1860-11-27-a-i0004, lunion-1865-02-05-a-i0012, lunion-1866-02-16-a-i0009).
2. MODELS
The two agency detection and classification models used for inference on the impresso Corpus are also released.
The models perform multitask classification with two prediction heads: one for token-level agency entity classification and one for sentence-level classification (has_agency: yes/no). They can be run with TorchServe; for details, see the newsagency-classification repository.
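As a rough illustration of this two-head setup (not the released models' actual code; the encoder choice and label count below are assumptions), such a multitask model can be sketched as follows:

import torch.nn as nn
from transformers import AutoModel

class AgencyMultitaskModel(nn.Module):
    # One shared encoder, two prediction heads: token-level IOB tags
    # and a sentence-level has_agency (yes/no) flag.
    def __init__(self, encoder_name='bert-base-multilingual-cased', num_tags=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.token_head = nn.Linear(hidden, num_tags)  # per-token entity logits
        self.sentence_head = nn.Linear(hidden, 2)      # has_agency: yes/no

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        token_logits = self.token_head(out.last_hidden_state)              # (batch, seq, num_tags)
        sentence_logits = self.sentence_head(out.last_hidden_state[:, 0])  # [CLS] token
        return token_logits, sentence_logits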
Please refer to the report for further information or contact us.
3. CODE
https://github.com/impresso/newsagency-classification
4. CONTACT
Maud Ehrmann (EPFL-DHLAB)
Emanuela Boros (EPFL-DHLAB)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is designed for training and evaluating machine learning models to recognize American Sign Language (ASL) hand gestures, including both numbers (0-9) and English alphabet letters (a-z). It is a well-organized dataset that can be used for computer vision tasks, particularly image classification and gesture recognition.
The dataset contains two main folders:
1. Train:
- Used for training the model.
- Includes 36 subdirectories (one for each class: 0-9 and a-z).
- Each subdirectory contains 56 images of the corresponding class.
2. Test:
- Used for evaluating the model.
- Includes 36 subdirectories (one for each class: 0-9 and a-z).
- Each subdirectory contains 14 images of the corresponding class.
Folder | Number of Classes | Images per Class | Total Images |
---|---|---|---|
Train | 36 | 56 | 2,016 |
Test | 36 | 14 | 504 |
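Given this folder layout, the data can be loaded directly with standard image-classification tooling. A minimal sketch using torchvision (the folder names 'Train'/'Test' and the image size are assumptions; adjust them to the actual archive contents):

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((64, 64)),  # assumed size; match the actual image dimensions
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder('Train', transform=transform)  # 36 class subfolders
test_set = datasets.ImageFolder('Test', transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
print(len(train_set), 'training images,', len(train_set.classes), 'classes')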
This dataset is ideal for:
- Training convolutional neural networks (CNNs) for ASL recognition.
- Exploring data augmentation techniques for image classification.
- Developing real-world AI applications like sign language translators.
This dataset is curated to facilitate the development of models for sign language recognition and gesture-based interaction systems. If you use this dataset in your research or projects, please consider sharing your findings or improvements!
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset is the training set (part 2 of 3) of the Codecfake dataset, corresponding to the manuscript "The Codecfake Dataset and Countermeasures for Universal Deepfake Audio Detection".
With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for effective detection methods. Unlike traditional deepfake audio generation, which often involves multi-step processes culminating in vocoder usage, ALM directly utilizes neural codec methods to decode discrete codes into audio. Moreover, driven by large-scale data, ALMs exhibit remarkable robustness and versatility, posing a significant challenge to current audio deepfake detection (ADD) models. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method: the conversion from neural codec to waveform. We initially construct the Codecfake dataset, an open-source large-scale dataset, including two languages, millions of audio samples, and various test conditions, tailored for ALM-based audio detection. Additionally, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original SAM, we propose the CSAM strategy to learn a domain-balanced and generalized minimum. Experiment results demonstrate that co-training on the Codecfake dataset and a vocoded dataset with the CSAM strategy yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models.
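For reference, EER is the operating point where the false acceptance rate equals the false rejection rate. A minimal sketch of computing it from detection scores (illustrative only, not the paper's evaluation code; assumes higher scores mean more likely bona fide):

import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)), np.zeros(len(spoof_scores))])
    labels = labels[np.argsort(scores)]
    # Sweeping the threshold upward: FRR (bona fide rejected) rises, FAR (spoof accepted) falls.
    frr = np.cumsum(labels) / labels.sum()
    far = 1 - np.cumsum(1 - labels) / (1 - labels).sum()
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2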
Due to platform restrictions on the size of zenodo repositories, we have divided the Codecfake dataset into various subsets as shown in the table below:
Codecfake dataset | Description | Link |
---|---|---|
training set (part 1 of 3) & label | train_split.zip & train_split.z01 - train_split.z06 | https://zenodo.org/records/11171708 |
training set (part 2 of 3) | train_split.z07 - train_split.z14 | https://zenodo.org/records/11171720 |
training set (part 3 of 3) | train_split.z15 - train_split.z19 | https://zenodo.org/records/11171724 |
development set | dev_split.zip & dev_split.z01 - dev_split.z02 | https://zenodo.org/records/11169872 |
test set (part 1 of 2) | Codec test: C1.zip - C6.zip & ALM test: A1.zip - A3.zip | https://zenodo.org/records/11169781 |
test set (part 2 of 2) | Codec unseen test: C7.zip | https://zenodo.org/records/11125029 |
The source code of the countermeasure and the pre-trained model are available on GitHub: https://github.com/xieyuankun/Codecfake.
The Codecfake dataset and pre-trained model are licensed under the CC BY-NC-ND 4.0 license.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Update: New version includes additional samples taken in November 2022.
Dataset Description
This dataset is a large-scale set of measurements for RSS-based localization. The data consists of received signal strength (RSS) measurements taken using the POWDER Testbed at the University of Utah. Samples include either 0, 1, or 2 active transmitters.
The dataset consists of 5,214 unique samples, with transmitters in 5,514 unique locations. The majority of the samples contain only 1 transmitter, but there are small sets of samples with 0 or 2 active transmitters, as shown below. Each sample has RSS values from between 10 and 25 receivers. The majority of the receivers are stationary endpoints fixed on the side of buildings, on rooftop towers, or on free-standing poles. A small set of receivers are located on shuttles which travel specific routes throughout campus.
Sample Type | Sample Count | Receiver Count |
---|---|---|
No-Tx Samples | 46 | 10 to 25 |
1-Tx Samples | 4822 | 10 to 25 |
2-Tx Samples | 346 | 11 to 12 |
The transmitters for this dataset are handheld walkie-talkies (Baofeng BF-F8HP) transmitting in the FRS/GMRS band at 462.7 MHz. These devices have a rated transmission power of 1 W. The raw IQ samples were processed through a 6 kHz bandpass filter to remove neighboring transmissions, and the RSS value was calculated as follows:
\(RSS = 10 \log_{10}\left(\frac{1}{N}\sum_{i=1}^{N} x_i^2 \right)\)
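In code, this is simply the average sample power expressed in dB. A sketch (illustrative only; x stands for the length-N array of band-pass filtered samples, not the exact processing pipeline):

import numpy as np

def rss_db(x):
    # mean power of the filtered samples, in dB
    return 10 * np.log10(np.mean(np.abs(x) ** 2))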
Measurement Parameters | Description |
---|---|
Frequency | 462.7 MHz |
Radio Gain | 35 dB |
Receiver Sample Rate | 2 MHz |
Sample Length | N=10,000 |
Band-pass Filter | 6 kHz |
Transmitters | 0 to 2 |
Transmission Power | 1 W |
Receivers consist of Ettus USRP X310 and B210 radios and a mix of wide- and narrow-band antennas, as shown in the table below. Each receiver took measurements with a receiver gain of 35 dB. However, devices have different maximum gain settings, and no calibration data was available, so all RSS values in the dataset are uncalibrated and are only relative to the device.
Usage Instructions
Data is provided in .json format, both as one file and as split files. To load the main file in Python:
import json

data_file = 'powder_462.7_rss_data.json'
with open(data_file) as f:
    data = json.load(f)
The JSON data is a dictionary with the sample timestamp as a key. Within each sample are the following keys:
- rx_data: A list of data from each receiver. Each entry contains RSS value, latitude, longitude, and device name.
- tx_coords: A list of coordinates for each transmitter. Each entry contains latitude and longitude.
- metadata: A list of dictionaries containing metadata for each transmitter, in the same order as the rows in tx_coords.
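For example, to inspect one sample (the field layout inside each entry is assumed from the key descriptions above, not verified against the file):

timestamp, sample = next(iter(data.items()))
print(timestamp)
print(len(sample['rx_data']), 'receiver entries,', len(sample['tx_coords']), 'transmitter locations')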
File Separations and Train/Test Splits
In the separated_data.zip folder there are several train/test separations of the data.
- all_data contains all the data in the main JSON file, separated by the number of transmitters.
- stationary consists of 3 cases where a stationary transmitter remained in one location for several minutes. This may be useful for evaluating localization using mobile shuttles, or for measuring the variation in the channel characteristics for stationary receivers.
- train_test_splits contains unique data splits used for training and evaluating ML models. These splits only used data from the single-tx case. In other words, the union of the splits, along with unused.json, is equivalent to the file all_data/single_tx.json.
  - The random split is a random 80/20 split of the data.
  - special_test_cases contains the stationary transmitter data, indoor transmitter data (with high noise in GPS location), and transmitters off campus.
  - The grid split divides the campus region into a 10 by 10 grid. Each grid square is assigned to the training or test set, with 80 squares in the training set and the remainder in the test set. If a square is assigned to the test set, none of its four neighbors are included in the test set. Transmitters occurring in each grid square are assigned to train or test accordingly. One such random assignment of grid squares makes up the grid split (see the sketch after this list).
  - The seasonal split contains data separated by the month of collection: April, July, or November.
  - The transportation split contains data separated by the method of movement for the transmitter: walking, cycling, or driving. The non-driving.json file contains the union of the walking and cycling data.
  - campus.json contains the on-campus data, so it is equivalent to the union of each split, not including unused.json.
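A minimal sketch of the grid assignment described above (illustrative, not the code used to generate the released split; the greedy selection is one way to satisfy the neighbor constraint):

import random

def assign_grid_squares(n=10, n_test=20, seed=0):
    # Greedily pick test squares so that no two test squares are 4-neighbors;
    # all remaining squares form the training set.
    rng = random.Random(seed)
    squares = [(r, c) for r in range(n) for c in range(n)]
    rng.shuffle(squares)
    test = set()
    for r, c in squares:
        if len(test) == n_test:
            break
        if not ({(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)} & test):
            test.add((r, c))
    train = set(squares) - test
    return train, test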
Digital Surface Model
The dataset includes a digital surface model (DSM) from a State of Utah 2013-2014 LiDAR survey. This map includes the University of Utah campus and surrounding area. The DSM includes buildings and trees, unlike some digital elevation models.
To read the data in Python:
import rasterio as rio
import numpy as np
import utm

dsm_object = rio.open('dsm.tif')
dsm_map = dsm_object.read(1)  # a np.array containing elevation values
dsm_resolution = dsm_object.res  # a tuple containing x,y resolution (0.5 meters)
dsm_transform = dsm_object.transform  # an Affine transform for conversion to UTM-12 coordinates
utm_transform = np.array(dsm_transform).reshape((3, 3))[:2]
utm_top_left = utm_transform @ np.array([0, 0, 1])
# rasterio's shape is (rows, cols) while the affine transform expects (col, row),
# so the width (shape[1]) is the x pixel coordinate and the height (shape[0]) the y
utm_bottom_right = utm_transform @ np.array([dsm_object.shape[1], dsm_object.shape[0], 1])
latlon_top_left = utm.to_latlon(utm_top_left[0], utm_top_left[1], 12, 'T')
latlon_bottom_right = utm.to_latlon(utm_bottom_right[0], utm_bottom_right[1], 12, 'T')
Dataset Acknowledgement: This DSM file was acquired by the State of Utah and its partners, is in the public domain, and can be freely distributed with proper credit to the State of Utah and its partners. The State of Utah and its partners make no warranty, expressed or implied, regarding its suitability for a particular use and shall not be liable under any circumstances for any direct, indirect, special, incidental, or consequential damages with respect to users of this product.
DSM DOI: https://doi.org/10.5069/G9TH8JNQ