65 datasets found

Normalization methods impact the number of significant genus-level...
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Sean M. Gibbons; Claire Duvallet; Eric J. Alm (2023). Normalization methods impact the number of significant genus-level associations between cases and controls across multiple diseases. [Dataset]. http://doi.org/10.1371/journal.pcbi.1006102.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pcbi.1006102.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Sean M. Gibbons; Claire Duvallet; Eric J. Alm
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Normalization methods impact the number of significant genus-level associations between cases and controls across multiple diseases.
Sample dataset for the models trained and tested in the paper 'Can AI be...
zenodo.org
zip
Updated Aug 1, 2024
Cite
Elena Tomasi; Gabriele Franch; Marco Cristoforetti (2024). Sample dataset for the models trained and tested in the paper 'Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy' [Dataset]. http://doi.org/10.5281/zenodo.12934521
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.12934521
Dataset updated
Aug 1, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Elena Tomasi; Gabriele Franch; Marco Cristoforetti
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This repository contains a sample of the input data for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy". It allows the user to test and train the models on a reduced dataset (45GB).

This sample dataset comprises ~3 years of normalized hourly data for both low-resolution predictors and high-resolution target variables. Data has been randomly picked from the whole dataset, from 2000 to 2020, with 70% of data coming from the original training dataset, 15% from the original validation dataset, and 15% from the original test dataset. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.

This sample dataset also includes files relative to metadata, static data, normalization, and plotting.

To use the data, clone the corresponding repository and unzip this zip file in the data folder.
Data from: A systematic evaluation of normalization methods and probe...
data.niaid.nih.gov
datadryad.org
zip
Updated May 30, 2023
Cite
H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.cnp5hqc7v
Dataset updated
May 30, 2023
Dataset provided by
Hospital for Sick Children
University of Toronto
Universidade de São Paulo
Authors
H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
License
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Description
Background: The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
Methods: This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
Results: The method we define as SeSAMe 2, which consists of the application of the regular SeSAMe pipeline with an additional round of QC, pOOBAH masking, was found to be the best-performing normalization method, while quantile-based methods were found to be the worst performing methods. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

Methods

Study Participants and Samples

The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two, and comprised 13 men and 11 women.

All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).

Blood Collection and Processing

Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed manufacturer’s recommended protocols, using Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point), due to discontinuation of the equipment but using the same commercial reagents. DNA was quantified using Nanodrop spectrometer and diluted to 50ng/uL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 out of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data is also available for the samples described above.

Characterization of DNA Methylation using the EPIC array

Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

Processing and Analysis of DNA Methylation Data

The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of each sample and compared the inferred sex to the reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlapped with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes if more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using the empirical distribution of out-of-band probes with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05, and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the resulting probes flagged in both analyses were combined and removed from the data.
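
The bead-number rule above (drop a probe when more than 5% of samples have fewer than 3 beads) can be illustrated with a short sketch. This is plain Python for illustration only, not the authors' R/wateRmelon code; the function name `low_bead_probes`, the dictionary layout, and the toy probe IDs and counts are assumptions.

```python
def low_bead_probes(bead_counts, min_beads=3, max_fraction=0.05):
    """Return probe IDs where more than max_fraction of samples have < min_beads beads.

    bead_counts maps a probe ID to a list of per-sample bead counts
    (hypothetical layout chosen for this sketch).
    """
    flagged = []
    for probe_id, counts in bead_counts.items():
        low = sum(1 for c in counts if c < min_beads)
        if low / len(counts) > max_fraction:
            flagged.append(probe_id)
    return flagged


# Toy data: 20 samples per probe (made-up values).
toy = {
    "cg_a": [12] * 20,           # no low-bead samples: kept
    "cg_b": [12] * 19 + [2],     # 1/20 = 5%, not more than 5%: kept
    "cg_c": [12] * 18 + [2, 1],  # 2/20 = 10%: flagged
}
print(low_bead_probes(toy))
```

Note that the threshold is strict ("more than 5%"), so a probe with exactly 5% low-bead samples is retained in this sketch.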

Normalization Methods Evaluated

The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data was read into R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out in the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input Raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using as input minfi’s Noob normalized data. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out in the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass previous QC, and that had not been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out in the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had in the absolute difference of beta values (|β|) between replicated samples.
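
The replicate comparison described above rests on the absolute difference of beta values between two technical replicates of the same sample. A minimal Python sketch of that metric follows; averaging over shared probes, the function name, and the toy beta values are assumptions made for illustration and are not taken from the study's code.

```python
def mean_abs_beta_difference(replicate_a, replicate_b):
    """Mean |beta_a - beta_b| over the probes present in both replicates."""
    shared = replicate_a.keys() & replicate_b.keys()
    if not shared:
        raise ValueError("replicates share no probes")
    total = sum(abs(replicate_a[p] - replicate_b[p]) for p in shared)
    return total / len(shared)


# Toy beta values for one replicate pair (made-up numbers).
rep1 = {"cg_a": 0.10, "cg_b": 0.50, "cg_c": 0.90}
rep2 = {"cg_a": 0.12, "cg_b": 0.46, "cg_c": 0.90}
print(round(mean_abs_beta_difference(rep1, rep2), 4))
```

A lower value after normalization would indicate that the two replicates agree more closely under that method.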
Data from: Isobaric Matching between Runs and Novel PSM-Level Normalization...
acs.figshare.com
txt
Updated Jun 5, 2023
Cite
Sung-Huan Yu; Pelagia Kyriakidou; Jürgen Cox (2023). Isobaric Matching between Runs and Novel PSM-Level Normalization in MaxQuant Strongly Improve Reporter Ion-Based Quantification [Dataset]. http://doi.org/10.1021/acs.jproteome.0c00209.s002
Available download formats: txt
Unique identifier
https://doi.org/10.1021/acs.jproteome.0c00209.s002
Dataset updated
Jun 5, 2023
Dataset provided by
ACS Publications
Authors
Sung-Huan Yu; Pelagia Kyriakidou; Jürgen Cox
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Isobaric labeling has the promise of combining high sample multiplexing with precise quantification. However, normalization issues and the missing value problem of complete n-plexes hamper quantification across more than one n-plex. Here, we introduce two novel algorithms implemented in MaxQuant that substantially improve the data analysis with multiple n-plexes. First, isobaric matching between runs makes use of the three-dimensional MS1 features to transfer identifications from identified to unidentified MS/MS spectra between liquid chromatography–mass spectrometry runs in order to utilize reporter ion intensities in unidentified spectra for quantification. On typical datasets, we observe a significant gain in MS/MS spectra that can be used for quantification. Second, we introduce a novel PSM-level normalization, applicable to data with and without the common reference channel. It is a weighted median-based method, in which the weights reflect the number of ions that were used for fragmentation. On a typical dataset, we observe complete removal of batch effects and dominance of the biological sample grouping after normalization. Furthermore, we provide many novel processing and normalization options in Perseus, the companion software for the downstream analysis of quantitative proteomics results. All novel tools and algorithms are available with the regular MaxQuant and Perseus releases, which are downloadable at http://maxquant.org.
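
The description only states that the PSM-level normalization is weighted median-based, with weights reflecting the number of ions used for fragmentation. The sketch below shows what a weighted median is and how it could be used to centre one channel; it is not MaxQuant's implementation. The function names, the use of log-ratios, and the toy numbers are all assumptions.

```python
def weighted_median(values, weights):
    """Lower weighted median: smallest value whose cumulative weight reaches half the total."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equally long")
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value


def normalize_channel(log_ratios, weights):
    """Centre one channel's PSM log-ratios on their weighted median (illustrative step)."""
    centre = weighted_median(log_ratios, weights)
    return [x - centre for x in log_ratios]


# Toy PSM log-ratios and ion-count weights (made-up numbers).
log_ratios = [0.5, 0.7, 2.0, 0.6]
weights = [10.0, 30.0, 1.0, 20.0]
print(weighted_median(log_ratios, weights))
print(normalize_channel(log_ratios, weights))
```

In this toy case the outlier 2.0 carries little weight, so it barely influences the centre, which is the usual motivation for weighting by signal quality.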
Dataset of normalised Slovene text KonvNormSl 1.0
clarin.si
live.european-language-grid.eu
Updated Sep 19, 2016
Cite
Nikola Ljubešić; Katja Zupan; Darja Fišer; Tomaž Erjavec (2016). Dataset of normalised Slovene text KonvNormSl 1.0 [Dataset]. https://www.clarin.si/repository/xmlui/handle/11356/1068?locale-attribute=sl
Dataset updated
Sep 19, 2016
Authors
Nikola Ljubešić; Katja Zupan; Darja Fišer; Tomaž Erjavec
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Data used in the experiments described in:

Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany. https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf (https://www.linguistics.rub.de/konvens16/)

Data are split into the "token" folder (experiment on normalising individual tokens) and "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample, the original data (.orig.txt) and the data with hand-normalised words (.norm.txt). The files are aligned by lines.

There are four datasets:
- goo300k-bohoric: historical Slovene, hard case (<1850)
- goo300k-gaj: historical Slovene, easy case (1850 - 1900)
- tweet-L3: Slovene tweets, hard case (non-standard language)
- tweet-L1: Slovene tweets, easy case (mostly standard language)

The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (https://nl.ijs.si/janes/english/).

The text in the files has been split by inserting spaces between characters, with underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark '¿' character.
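
The character-level encoding described above can be reproduced and undone with a few lines of Python. This is a sketch based only on the description (spaces between characters, underscore for the space character); the function names and the sample string are assumptions, and a literal underscore in the source text could not be told apart from an encoded space.

```python
def to_char_level(text):
    """Encode text as space-separated characters, with '_' standing in for spaces."""
    return " ".join("_" if ch == " " else ch for ch in text)


def from_char_level(line):
    """Invert to_char_level: drop the separator spaces and turn '_' back into a space."""
    return "".join(" " if tok == "_" else tok for tok in line.split(" "))


# Sample string (made up); '¿' is the placeholder the dataset uses for irrelevant tokens.
encoded = to_char_level("kaj delaš ¿")
print(encoded)
print(from_char_level(encoded))
```

Because the .orig.txt and .norm.txt files are aligned by lines, the same decoding can be applied to both files line by line to obtain readable sentence pairs.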
Methods for normalizing microbiome data: an ecological perspective
datadryad.org
data.niaid.nih.gov
zip
Updated Oct 30, 2018
Cite
Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger (2018). Methods for normalizing microbiome data: an ecological perspective [Dataset]. http://doi.org/10.5061/dryad.tn8qs35
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.tn8qs35
Dataset updated
Oct 30, 2018
Dataset provided by
Dryad
Authors
Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger
Time period covered
2018
Description
Simulation script 1: This R script will simulate two populations of microbiome samples and compare normalization methods.
Simulation script 2: This R script will simulate two populations of microbiome samples and compare normalization methods via PCoAs.
Sample.OTU.distribution: OTU distribution used in the paper: Methods for normalizing microbiome data: an ecological perspective
Cadastral PLSS Standardized Data - PLSSSecond Division (Dalhart) - Version...
catalog.data.gov
gimi9.com
Updated Dec 2, 2020
Cite
(Point of Contact) (2020). Cadastral PLSS Standardized Data - PLSSSecond Division (Dalhart) - Version 1.1 [Dataset]. https://catalog.data.gov/dataset/cadastral-plss-standardized-data-plsssecond-division-dalhart-version-1-1
Dataset updated
Dec 2, 2020
Dataset provided by
(Point of Contact)
Description
This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data set. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS, that is, the quarter, quarter-quarter, sixteenth or government lot divisions of the PLSS. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.
Cadastral PLSS Standardized Data - PLSSSecond Division (Douglas) - Version...
gstore.unm.edu
gimi9.com
Updated Sep 25, 2011
Cite
(2011). Cadastral PLSS Standardized Data - PLSSSecond Division (Douglas) - Version 1.1 [Dataset]. http://gstore.unm.edu/apps/rgis/datasets/64e5206b-0927-49f6-89c7-14fdbf271ad8/metadata/ISO-19115:2003.html
Dataset updated
Sep 25, 2011
Time period covered
Apr 11, 2011
Area covered
West Bound -110.006112069 East Bound -107.993887964 North Bound 32.0061121667 South Bound 30.9938880847
Description
This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data set. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS, that is, the quarter, quarter-quarter, sixteenth or government lot divisions of the PLSS. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.
Cadastral PLSS Standardized Data - PLSSSecond Division (St Johns) - Version...
catalog.data.gov
gstore.unm.edu
Updated Dec 2, 2020
Cite
(Point of Contact) (2020). Cadastral PLSS Standardized Data - PLSSSecond Division (St Johns) - Version 1.1 [Dataset]. https://catalog.data.gov/dataset/cadastral-plss-standardized-data-plsssecond-division-st-johns-version-1-1
Dataset updated
Dec 2, 2020
Dataset provided by
(Point of Contact)
Description
This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data set. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS, that is, the quarter, quarter-quarter, sixteenth or government lot divisions of the PLSS. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.
Cadastral PLSS Standardized Data - PLSSSecond Division (Roswell) - Version...
catalog.data.gov
gstore.unm.edu
Updated Dec 2, 2020
Cite
(Point of Contact) (2020). Cadastral PLSS Standardized Data - PLSSSecond Division (Roswell) - Version 1.1 [Dataset]. https://catalog.data.gov/dataset/cadastral-plss-standardized-data-plsssecond-division-roswell-version-1-1
Dataset updated
Dec 2, 2020
Dataset provided by
(Point of Contact)
Description
This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data set. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS, that is, the quarter, quarter-quarter, sixteenth or government lot divisions of the PLSS. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.
Traffic Signs Preprocessed
kaggle.com
zip
Updated Aug 31, 2019
Cite
Valentyn Sichkar (2019). Traffic Signs Preprocessed [Dataset]. https://www.kaggle.com/datasets/valentynsichkar/traffic-signs-preprocessed/versions/1
Available download formats: zip (4471082770 bytes)
Dataset updated
Aug 31, 2019
Authors
Valentyn Sichkar
Description
Content

This is ready-to-use preprocessed data for Traffic Signs, saved in nine pickle files.
Original datasets are in the following files:
- train.pickle
- valid.pickle
- test.pickle

Code with detailed description on how datasets were preprocessed is in datasets_preparing.py

Before preprocessing, the training dataset was equalized so that all classes have the same number of examples, as shown in the figure below. The figure is a histogram of the 43 classes of the training dataset with their number of examples for Traffic Signs Classification, before and after equalization by adding transformed images (brightness and rotation) from the original dataset. After equalization, the training dataset increased to 86989 examples.

Figure (histogram): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3400968%2Fb5d9f0189353832e769c2bdd8e25243d%2Fhistogram.png?generation=1567275066871451&alt=media

The nine resulting preprocessed files are as follows:
- data0.pickle - Shuffling
- data1.pickle - Shuffling, /255.0 Normalization
- data2.pickle - Shuffling, /255.0 + Mean Normalization
- data3.pickle - Shuffling, /255.0 + Mean + STD Normalization
- data4.pickle - Grayscale, Shuffling
- data5.pickle - Grayscale, Shuffling, Local Histogram Equalization
- data6.pickle - Grayscale, Shuffling, Local Histogram Equalization, /255.0 Normalization
- data7.pickle - Grayscale, Shuffling, Local Histogram Equalization, /255.0 + Mean Normalization
- data8.pickle - Grayscale, Shuffling, Local Histogram Equalization, /255.0 + Mean + STD Normalization

Datasets data0 - data3 have RGB images and datasets data4 - data8 have grayscale images.

Shapes of data0 - data3 are as follows (RGB):
- x_train: (86989, 3, 32, 32)
- y_train: (86989,)
- x_validation: (4410, 3, 32, 32)
- y_validation: (4410,)
- x_test: (12630, 3, 32, 32)
- y_test: (12630,)

Shapes of data4 - data8 are as follows (Gray):
- x_train: (86989, 1, 32, 32)
- y_train: (86989,)
- x_validation: (4410, 1, 32, 32)
- y_validation: (4410,)
- x_test: (12630, 1, 32, 32)
- y_test: (12630,)

The mean image and standard deviation were calculated from the training dataset and applied to the validation and testing datasets of the corresponding files. When a user's own image is used for classification, it first has to be preprocessed in the same way and in the same order as the dataset chosen among the nine.
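
The "/255.0 + Mean + STD" chain can be sketched as follows. This is a dependency-free Python illustration, not the code in datasets_preparing.py; the function names, the per-pixel standard deviation, the small `eps` guard, and the two-pixel toy images are assumptions. The point it shows is that the mean and standard deviation are fitted on training images only and then reused for any other image.

```python
import math


def scale(image):
    """Map 8-bit pixel values to [0, 1] (the "/255.0" step)."""
    return [p / 255.0 for p in image]


def fit_mean_std(train_images):
    """Per-pixel mean and standard deviation over already-scaled training images."""
    n = len(train_images)
    size = len(train_images[0])
    mean = [sum(img[i] for img in train_images) / n for i in range(size)]
    std = [
        math.sqrt(sum((img[i] - mean[i]) ** 2 for img in train_images) / n)
        for i in range(size)
    ]
    return mean, std


def normalize(image, mean, std, eps=1e-8):
    """Apply scaling, mean subtraction and std division, in that order."""
    scaled = scale(image)
    return [(s - m) / (d + eps) for s, m, d in zip(scaled, mean, std)]


# Toy "images" flattened to two pixels each (made-up values).
train = [scale(img) for img in ([0, 255], [255, 255])]
mean, std = fit_mean_std(train)
user_image = normalize([255, 255], mean, std)  # reuse training statistics
print(mean, std, user_image)
```

An image prepared for a model trained on data3 or data8 would go through all three steps, while one for data1 or data6 would stop after `scale`.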

Acknowledgements

Initial data is German Traffic Sign Recognition Benchmarks (GTSRB).
Cadastral PLSS Standardized Data - PLSSSecond Division (Santa Fe) - Version...
catalog.data.gov
gstore.unm.edu
Updated Dec 2, 2020
Cite
(Point of Contact) (2020). Cadastral PLSS Standardized Data - PLSSSecond Division (Santa Fe) - Version 1.1 [Dataset]. https://catalog.data.gov/dataset/cadastral-plss-standardized-data-plsssecond-division-santa-fe-version-1-1
Dataset updated
Dec 2, 2020
Dataset provided by
(Point of Contact)
Description
This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data set. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS, that is, the quarter, quarter-quarter, sixteenth or government lot divisions of the PLSS. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.
NLUCat
zenodo.org
huggingface.co
zip
Updated Mar 4, 2024
Cite
Zenodo (2024). NLUCat [Dataset]. http://doi.org/10.5281/zenodo.10721193
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.10721193
Dataset updated
Mar 4, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
NLUCat

Dataset Description

Dataset Summary

NLUCat is a dataset of NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is accompanied, in addition, by the instructions received by the annotator who wrote it.

The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IOT, list management, leisure, etc.), but specific ones have also been added to take into account social and healthcare needs for vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.

The examples are not only written in Catalan, but they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.)

This dataset can be used to train models for intent classification, spans identification and examples generation.

This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.

In this repository you'll find the following items:

NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team

NLUCat_dataset.json: the completed NLUCat dataset

NLUCat_stats.tsv: statistics about the NLUCat dataset

dataset: folder with the dataset as published in HuggingFace, split and prepared for training and evaluating intent classifiers

reports: folder with the reports done as feedback to the annotators during the annotation process

This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.

Supported Tasks and Leaderboards

Intent classification, spans identification and examples generation.

Languages

The dataset is in Catalan (ca-ES).

Dataset Structure

Data Instances

Three JSON files, one for each split.

Data Fields

example: `str`. Example

annotation: `dict`. Annotation of the example

intent: `str`. Intent tag

slots: `list`. List of slots

Tag: `str`. Tag of the slot

Text: `str`. Text of the slot

Start_char: `int`. First character of the span

End_char: `int`. Last character of the span

Example

An example looks as follows:

{
  "example": "Demana una ambulància; la meva dona està de part.",
  "annotation": {
    "intent": "call_emergency",
    "slots": [
      {
        "Tag": "service",
        "Text": "ambulància",
        "Start_char": 11,
        "End_char": 21
      },
      {
        "Tag": "situation",
        "Text": "la meva dona està de part",
        "Start_char": 23,
        "End_char": 48
      }
    ]
  }
},
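
A quick way to sanity-check a record is to slice the example string with each slot's offsets and compare the result with the slot's Text field. The Python sketch below does this for the record above; the helper name `check_slots` is an assumption. In this record the offsets behave like Python slice bounds, i.e. End_char points one past the last character of the span, and the check assumes accented letters are stored as single precomposed characters.

```python
import json

record = json.loads("""
{
  "example": "Demana una ambulància; la meva dona està de part.",
  "annotation": {
    "intent": "call_emergency",
    "slots": [
      {"Tag": "service", "Text": "ambulància", "Start_char": 11, "End_char": 21},
      {"Tag": "situation", "Text": "la meva dona està de part", "Start_char": 23, "End_char": 48}
    ]
  }
}
""")


def check_slots(record):
    """Return the slots whose offsets do not reproduce their Text field."""
    text = record["example"]
    return [
        slot
        for slot in record["annotation"]["slots"]
        if text[slot["Start_char"]:slot["End_char"]] != slot["Text"]
    ]


# An empty list means every slot's offsets match its text.
print(check_slots(record))
```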

Data Splits

NLUCat.train: 9128 examples

NLUCat.dev: 1441 examples

NLUCat.test: 1441 examples

Dataset Creation

Curation Rationale

We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.

Source Data

Initial Data Collection and Normalization

We commissioned a company to create fictitious examples for the creation of this dataset.

Who are the source language producers?

We commissioned the writing of the examples to the company m47 labs.

Annotations

Annotation process

The elaboration of this dataset has been done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
* First step: translation or elaboration of the instructions given to the annotators to write the examples.
* Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
* Third step: recording the intents and the slots of each example. In this step, some modifications were made to the annotation guides to adjust them to the real situations.

Who are the annotators?

The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

Personal and Sensitive Information

No personal or sensitive information included.

The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.

Considerations for Using the Data

Social Impact of Dataset

We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.

Discussion of Biases

When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

Licensing Information

This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0 license.
Give appropriate credit, provide a link to the license, and indicate if changes were made.

Citation Information

DOI

Contributions

The drafting of the examples and their annotation were entrusted to the company m47 labs through a public tender process.
Cadastral PLSS Standardized Data - PLSSSecond Division (Las Cruces) -...
gimi9.com
gstore.unm.edu
+2 more
Updated Dec 9, 2024
(2024). Cadastral PLSS Standardized Data - PLSSSecond Division (Las Cruces) - Version 1.1 [Dataset]. https://gimi9.com/dataset/data-gov_cadastral-plss-standardized-data-plsssecond-division-las-cruces-version-1-1
Dataset updated
Dec 9, 2024
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
Las Cruces
Description
This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS, that is, the quarter, quarter-quarter, sixteenth, or government lot divisions of the PLSS. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.
Cadastral PLSS Standardized Data - PLSSSecond Division (Carlsbad) - Version...
gstore.unm.edu
csv, geojson, gml +5
Updated Mar 23, 2025
Earth Data Analysis Center (2025). Cadastral PLSS Standardized Data - PLSSSecond Division (Carlsbad) - Version 1.1 [Dataset]. https://gstore.unm.edu/apps/rgis/datasets/e596d678-49a2-4319-ab2f-5ad238f4feef/metadata/FGDC-STD-001-1998.html
Available download formats: geojson(100), gml(100), shp(100), zip(61), xls(100), kml(100), json(100), csv(100)
Dataset updated
Mar 23, 2025
Dataset provided by
Earth Data Analysis Center
Time period covered
Apr 11, 2011
Area covered
New Mexico; West Bounding Coordinate -106.006111716; East Bounding Coordinate -103.993888406; North Bounding Coordinate 33.0061122267; South Bounding Coordinate 30.993888057
Description
This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS, that is, the quarter, quarter-quarter, sixteenth, or government lot divisions of the PLSS. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.
MNIST Dataset
paperswithcode.com
Updated Nov 16, 2021
Y. LeCun; L. Bottou; Y. Bengio; P. Haffner (2021). MNIST Dataset [Dataset]. https://paperswithcode.com/dataset/mnist
Dataset updated
Nov 16, 2021
Authors
Y. LeCun; L. Bottou; Y. Bengio; P. Haffner
Description
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of the larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students), which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were then centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.
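The centering step described above can be illustrated with a minimal sketch. It is not the original MNIST preprocessing code: the function name `center_by_mass`, the rounding to whole-pixel shifts, and the clamping that keeps the digit inside the field are assumptions made for illustration. The sketch takes an already size-normalized 20x20 grey-level image as input and does not perform the anti-aliased resizing itself.

```python
import numpy as np

def center_by_mass(digit: np.ndarray, size: int = 28) -> np.ndarray:
    """Place a small grey-level digit in a size x size field so that its
    pixel center of mass lands as close as possible to the field center."""
    h, w = digit.shape
    total = digit.sum()
    if total == 0:
        raise ValueError("an empty image has no center of mass")

    # Center of mass: intensity-weighted mean of row and column indices.
    com_row = (digit.sum(axis=1) * np.arange(h)).sum() / total
    com_col = (digit.sum(axis=0) * np.arange(w)).sum() / total

    # Offset that moves the center of mass onto the field center.
    # Assumption: whole-pixel shifts only, clamped so the digit is not cropped.
    center = (size - 1) / 2.0
    top = min(max(int(round(center - com_row)), 0), size - h)
    left = min(max(int(round(center - com_col)), 0), size - w)

    field = np.zeros((size, size), dtype=digit.dtype)
    field[top:top + h, left:left + w] = digit
    return field

# Usage: a hypothetical 20x20 "digit" that is a 4x4 block of ink, off center.
digit = np.zeros((20, 20), dtype=float)
digit[8:12, 6:10] = 1.0
field = center_by_mass(digit)
```

In this toy case the block's center of mass moves from (9.5, 7.5) in the 20x20 box to (13.5, 13.5), the midpoint of the 28x28 field. A real pipeline may use sub-pixel translation rather than the integer shift assumed here.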
Data from: MS-DAP Platform for Downstream Data Analysis of Label-Free...
acs.figshare.com
xlsx
Updated Jun 11, 2023
Frank Koopmans; Ka Wan Li; Remco V. Klaassen; August B. Smit (2023). MS-DAP Platform for Downstream Data Analysis of Label-Free Proteomics Uncovers Optimal Workflows in Benchmark Data Sets and Increased Sensitivity in Analysis of Alzheimer’s Biomarker Data [Dataset]. http://doi.org/10.1021/acs.jproteome.2c00513.s003
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.jproteome.2c00513.s003
Dataset updated
Jun 11, 2023
Dataset provided by
ACS Publications
Authors
Frank Koopmans; Ka Wan Li; Remco V. Klaassen; August B. Smit
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
In the rapidly moving proteomics field, a diverse patchwork of data analysis pipelines and algorithms for data normalization and differential expression analysis is used by the community. We generated a mass spectrometry downstream analysis pipeline (MS-DAP) that integrates both popular and recently developed algorithms for normalization and statistical analyses. Additional algorithms can be easily added in the future as plugins. MS-DAP is open-source and facilitates transparent and reproducible proteome science by generating extensive data visualizations and quality reporting, provided as standardized PDF reports. Second, we performed a systematic evaluation of methods for normalization and statistical analysis on a large variety of data sets, including additional data generated in this study, which revealed key differences. Commonly used approaches for differential testing based on moderated t-statistics were consistently outperformed by more recent statistical models, all integrated in MS-DAP. Third, we introduced a novel normalization algorithm that rescues deficiencies observed in commonly used normalization methods. Finally, we used the MS-DAP platform to reanalyze a recently published large-scale proteomics data set of CSF from AD patients. This revealed increased sensitivity, resulting in additional significant target proteins, which improved overlap with results reported in related studies and include a large set of new potential AD biomarkers in addition to those previously reported.
Cadastral PLSS Standardized Data - PLSSSecond Division (Brownfield) -...
data.wu.ac.at
gstore.unm.edu
+3 more
csv, excel, geojson +9
Updated Jun 25, 2014
Earth Data Analysis Center, University of New Mexico (2014). Cadastral PLSS Standardized Data - PLSSSecond Division (Brownfield) - Version 1.1 [Dataset]. https://data.wu.ac.at/odso/data_gov/OTFjYzNhMjEtYzMxNy00OWU5LWJlNzMtZjJlYmQzODFkNDE3
Available download formats: json, zip, csv, xml, shp, wfs, geojson, html, gml, kml, wms, excel
Dataset updated
Jun 25, 2014
Dataset provided by
Earth Data Analysis Center, University of New Mexico
Description
This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS, that is, the quarter, quarter-quarter, sixteenth, or government lot divisions of the PLSS. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.
Cadastral PLSS Standardized Data - PLSSSecond Division (Silver City) -...
s.cnmilf.com
gstore.unm.edu
+2 more
Updated Dec 2, 2020
(Point of Contact) (2020). Cadastral PLSS Standardized Data - PLSSSecond Division (Silver City) - Version 1.1 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/cadastral-plss-standardized-data-plsssecond-division-silver-city-version-1-1
Dataset updated
Dec 2, 2020
Dataset provided by
(Point of Contact)
Description
This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS, that is, the quarter, quarter-quarter, sixteenth, or government lot divisions of the PLSS. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.
Cadastral PLSS Standardized Data - geodatabase - Version 1.1
s.cnmilf.com
gstore.unm.edu
+1 more
Updated Dec 2, 2020
(Point of Contact) (2020). Cadastral PLSS Standardized Data - geodatabase - Version 1.1 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/cadastral-plss-standardized-data-geodatabase-version-1-1
Dataset updated
Dec 2, 2020
Dataset provided by
(Point of Contact)
Description
This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS, that is, the quarter, quarter-quarter, sixteenth, or government lot divisions of the PLSS. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.

