Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This record contains the annotated datasets and models used and produced for the work reported in the Master Thesis "Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers" (link).
Please cite this report if you use the models/datasets or find it relevant to your research:
@article{Marxen:305129,
  title = {Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers},
  author = {Marxen, Lea},
  pages = {114p},
  year = {2023},
  url = {http://infoscience.epfl.ch/record/305129},
}
1. DATA
The newsagency-dataset contains historical newspaper articles with annotations of news agency mentions. The articles are divided into French (fr) and German (de) subsets, each with its own train, dev and test set. The data is annotated at the token level in CoNLL format with IOB tags, as illustrated below.
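For illustration, a token-level IOB annotation takes roughly the following shape (the tokens and the label name here are invented for illustration; the actual tag set is defined in the dataset files):

Havas	B-pressagency
meldet	O
aus	O
Paris	O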
The distribution of articles in the different sets is as follows:
Split | Lg. | Docs | Agency Mentions |
---|---|---|---|
Train | de | 333 | 493 |
Train | fr | 903 | 1,122 |
Dev | de | 32 | 26 |
Dev | fr | 110 | 114 |
Test | de | 32 | 58 |
Test | fr | 120 | 163 |
Due to an error, there are seven duplicated articles in the French test set (article IDs: courriergdl-1847-10-02-a-i0002, courriergdl-1852-02-14-a-i0002, courriergdl-1860-10-31-a-i0016, courriergdl-1864-12-15-a-i0005, lunion-1860-11-27-a-i0004, lunion-1865-02-05-a-i0012, lunion-1866-02-16-a-i0009).
2. MODELS
The two agency detection and classification models used for inference on the impresso Corpus are also released.
The models perform multitask classification with two prediction heads: one for token-level agency entity classification and one for sentence-level classification (has_agency: yes/no). They can be run with TorchServe; for details, see the newsagency-classification repository.
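As a rough illustration of this two-head setup (not the released models' actual code; the encoder choice and label count below are assumptions), such a multitask model can be sketched as follows:

import torch.nn as nn
from transformers import AutoModel

class AgencyMultitaskModel(nn.Module):
    # One shared encoder, two prediction heads: token-level IOB tags
    # and a sentence-level has_agency (yes/no) flag.
    def __init__(self, encoder_name='bert-base-multilingual-cased', num_tags=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.token_head = nn.Linear(hidden, num_tags)  # per-token entity logits
        self.sentence_head = nn.Linear(hidden, 2)      # has_agency: yes/no

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        token_logits = self.token_head(out.last_hidden_state)              # (batch, seq, num_tags)
        sentence_logits = self.sentence_head(out.last_hidden_state[:, 0])  # [CLS] token
        return token_logits, sentence_logits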
Please refer to the report for further information or contact us.
3. CODE
https://github.com/impresso/newsagency-classification
4. CONTACT
Maud Ehrmann (EPFL-DHLAB)
Emanuela Boros (EPFL-DHLAB)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is designed for training and evaluating machine learning models to recognize American Sign Language (ASL) hand gestures, including both numbers (0-9) and English alphabet letters (a-z). It is a well-organized dataset that can be used for computer vision tasks, particularly image classification and gesture recognition.
The dataset contains two main folders:
1. Train:
- Used for training the model.
- Includes 36 subdirectories (one for each class: 0-9 and a-z).
- Each subdirectory contains 56 images of the corresponding class.
2. Test:
- Used for evaluating the model.
- Includes 36 subdirectories (one for each class: 0-9 and a-z).
- Each subdirectory contains 14 images of the corresponding class.
Folder | Number of Classes | Images per Class | Total Images |
---|---|---|---|
Train | 36 | 56 | 2,016 |
Test | 36 | 14 | 504 |
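Given this folder layout, the data can be loaded directly with standard image-classification tooling. A minimal sketch using torchvision (the folder names 'Train'/'Test' and the image size are assumptions; adjust them to the actual archive contents):

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((64, 64)),  # assumed size; match the actual image dimensions
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder('Train', transform=transform)  # 36 class subfolders
test_set = datasets.ImageFolder('Test', transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
print(len(train_set), 'training images,', len(train_set.classes), 'classes')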
This dataset is ideal for:
- Training convolutional neural networks (CNNs) for ASL recognition.
- Exploring data augmentation techniques for image classification.
- Developing real-world AI applications like sign language translators.
This dataset is curated to facilitate the development of models for sign language recognition and gesture-based interaction systems. If you use this dataset in your research or projects, please consider sharing your findings or improvements!
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset is the training set (part 2 of 3) of the Codecfake dataset, corresponding to the manuscript "The Codecfake Dataset and Countermeasures for Universal Deepfake Audio Detection".
With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for effective detection methods. Unlike traditional deepfake audio generation, which often involves multi-step processes culminating in vocoder usage, ALM directly utilizes neural codec methods to decode discrete codes into audio. Moreover, driven by large-scale data, ALMs exhibit remarkable robustness and versatility, posing a significant challenge to current audio deepfake detection (ADD) models. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method: the conversion from neural codec to waveform. We initially construct the Codecfake dataset, an open-source large-scale dataset, including two languages, millions of audio samples, and various test conditions, tailored for ALM-based audio detection. Additionally, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original SAM, we propose the CSAM strategy to learn a domain-balanced and generalized minimum. Experiment results demonstrate that co-training on the Codecfake dataset and a vocoded dataset with the CSAM strategy yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models.
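For reference, EER is the operating point where the false acceptance rate equals the false rejection rate. A minimal sketch of computing it from detection scores (illustrative only, not the paper's evaluation code; assumes higher scores mean more likely bona fide):

import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)), np.zeros(len(spoof_scores))])
    labels = labels[np.argsort(scores)]
    # Sweeping the threshold upward: FRR (bona fide rejected) rises, FAR (spoof accepted) falls.
    frr = np.cumsum(labels) / labels.sum()
    far = 1 - np.cumsum(1 - labels) / (1 - labels).sum()
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2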
Due to platform restrictions on the size of zenodo repositories, we have divided the Codecfake dataset into various subsets as shown in the table below:
Codecfake dataset | Description | Link |
---|---|---|
training set (part 1 of 3) & label | train_split.zip & train_split.z01 - train_split.z06 | https://zenodo.org/records/11171708 |
training set (part 2 of 3) | train_split.z07 - train_split.z14 | https://zenodo.org/records/11171720 |
training set (part 3 of 3) | train_split.z15 - train_split.z19 | https://zenodo.org/records/11171724 |
development set | dev_split.zip & dev_split.z01 - dev_split.z02 | https://zenodo.org/records/11169872 |
test set (part 1 of 2) | Codec test: C1.zip - C6.zip & ALM test: A1.zip - A3.zip | https://zenodo.org/records/11169781 |
test set (part 2 of 2) | Codec unseen test: C7.zip | https://zenodo.org/records/11125029 |
The source code of the countermeasure and the pre-trained model are available on GitHub: https://github.com/xieyuankun/Codecfake.
The Codecfake dataset and pre-trained model are licensed under the CC BY-NC-ND 4.0 license.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Update: New version includes additional samples taken in November 2022.
Dataset Description
This dataset is a large-scale set of measurements for RSS-based localization. The data consists of received signal strength (RSS) measurements taken using the POWDER Testbed at the University of Utah. Samples include either 0, 1, or 2 active transmitters.
The dataset consists of 5,214 unique samples, with transmitters in 5,514 unique locations. The majority of the samples contain only 1 transmitter, but there are small sets of samples with 0 or 2 active transmitters, as shown below. Each sample has RSS values from between 10 and 25 receivers. The majority of the receivers are stationary endpoints fixed on the side of buildings, on rooftop towers, or on free-standing poles. A small set of receivers are located on shuttles which travel specific routes throughout campus.
Sample Type | Sample Count | Receiver Count |
---|---|---|
No-Tx Samples | 46 | 10 to 25 |
1-Tx Samples | 4822 | 10 to 25 |
2-Tx Samples | 346 | 11 to 12 |
The transmitters for this dataset are handheld walkie-talkies (Baofeng BF-F8HP) transmitting in the FRS/GMRS band at 462.7 MHz. These devices have a rated transmission power of 1 W. The raw IQ samples were processed through a 6 kHz bandpass filter to remove neighboring transmissions, and the RSS value was calculated as follows:
\(RSS = 10 \log_{10}\left(\frac{1}{N}\sum_{i=1}^{N} x_i^2 \right)\)
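In code, this is simply the average sample power expressed in dB. A sketch (illustrative only; x stands for the length-N array of band-pass filtered samples, not the exact processing pipeline):

import numpy as np

def rss_db(x):
    # mean power of the filtered samples, in dB
    return 10 * np.log10(np.mean(np.abs(x) ** 2))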
Measurement Parameters | Description |
---|---|
Frequency | 462.7 MHz |
Radio Gain | 35 dB |
Receiver Sample Rate | 2 MHz |
Sample Length | N=10,000 |
Band-pass Filter | 6 kHz |
Transmitters | 0 to 2 |
Transmission Power | 1 W |
Receivers consist of Ettus USRP X310 and B210 radios and a mix of wide- and narrow-band antennas, as shown in the table below. Each receiver took measurements with a receiver gain of 35 dB. However, devices have different maximum gain settings, and no calibration data was available, so all RSS values in the dataset are uncalibrated and are only relative to the device.
Usage Instructions
Data is provided in .json format, both as one file and as split files. To load the main file in Python:
import json

data_file = 'powder_462.7_rss_data.json'
with open(data_file) as f:
    data = json.load(f)
The JSON data is a dictionary with the sample timestamp as a key. Within each sample are the following keys:
- rx_data: A list of data from each receiver. Each entry contains RSS value, latitude, longitude, and device name.
- tx_coords: A list of coordinates for each transmitter. Each entry contains latitude and longitude.
- metadata: A list of dictionaries containing metadata for each transmitter, in the same order as the rows in tx_coords.
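For example, to inspect one sample (the field layout inside each entry is assumed from the key descriptions above, not verified against the file):

timestamp, sample = next(iter(data.items()))
print(timestamp)
print(len(sample['rx_data']), 'receiver entries,', len(sample['tx_coords']), 'transmitter locations')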
File Separations and Train/Test Splits
In the separated_data.zip folder there are several train/test separations of the data.
- all_data contains all the data in the main JSON file, separated by the number of transmitters.
- stationary consists of 3 cases where a stationary transmitter remained in one location for several minutes. This may be useful for evaluating localization using mobile shuttles, or for measuring the variation in the channel characteristics for stationary receivers.
- train_test_splits contains unique data splits used for training and evaluating ML models. These splits only used data from the single-tx case. In other words, the union of the splits, along with unused.json, is equivalent to the file all_data/single_tx.json.
  - The random split is a random 80/20 split of the data.
  - special_test_cases contains the stationary transmitter data, indoor transmitter data (with high noise in GPS location), and transmitters off campus.
  - The grid split divides the campus region into a 10 by 10 grid. Each grid square is assigned to the training or test set, with 80 squares in the training set and the remainder in the test set. If a square is assigned to the test set, none of its four neighbors are included in the test set. Transmitters occurring in each grid square are assigned to train or test accordingly. One such random assignment of grid squares makes up the grid split (see the sketch after this list).
  - The seasonal split contains data separated by the month of collection: April, July, or November.
  - The transportation split contains data separated by the method of movement for the transmitter: walking, cycling, or driving. The non-driving.json file contains the union of the walking and cycling data.
  - campus.json contains the on-campus data, so it is equivalent to the union of each split, not including unused.json.
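A minimal sketch of the grid assignment described above (illustrative, not the code used to generate the released split; the greedy selection is one way to satisfy the neighbor constraint):

import random

def assign_grid_squares(n=10, n_test=20, seed=0):
    # Greedily pick test squares so that no two test squares are 4-neighbors;
    # all remaining squares form the training set.
    rng = random.Random(seed)
    squares = [(r, c) for r in range(n) for c in range(n)]
    rng.shuffle(squares)
    test = set()
    for r, c in squares:
        if len(test) == n_test:
            break
        if not ({(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)} & test):
            test.add((r, c))
    train = set(squares) - test
    return train, test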
Digital Surface Model
The dataset includes a digital surface model (DSM) from a State of Utah 2013-2014 LiDAR survey. This map includes the University of Utah campus and surrounding area. The DSM includes buildings and trees, unlike some digital elevation models.
To read the data in Python:
import rasterio as rio
import numpy as np
import utm

dsm_object = rio.open('dsm.tif')
dsm_map = dsm_object.read(1)  # a np.array containing elevation values
dsm_resolution = dsm_object.res  # a tuple containing x,y resolution (0.5 meters)
dsm_transform = dsm_object.transform  # an Affine transform for conversion to UTM-12 coordinates
utm_transform = np.array(dsm_transform).reshape((3, 3))[:2]
utm_top_left = utm_transform @ np.array([0, 0, 1])
# rasterio's shape is (rows, cols) while the affine transform expects (col, row),
# so the width (shape[1]) is the x pixel coordinate and the height (shape[0]) the y
utm_bottom_right = utm_transform @ np.array([dsm_object.shape[1], dsm_object.shape[0], 1])
latlon_top_left = utm.to_latlon(utm_top_left[0], utm_top_left[1], 12, 'T')
latlon_bottom_right = utm.to_latlon(utm_bottom_right[0], utm_bottom_right[1], 12, 'T')
Dataset Acknowledgement: This DSM file was acquired by the State of Utah and its partners, is in the public domain, and can be freely distributed with proper credit to the State of Utah and its partners. The State of Utah and its partners make no warranty, expressed or implied, regarding its suitability for a particular use and shall not be liable under any circumstances for any direct, indirect, special, incidental, or consequential damages with respect to users of this product.
DSM DOI: https://doi.org/10.5069/G9TH8JNQ