65 datasets found
  1. Normalization methods impact the number of significant genus-level...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Sean M. Gibbons; Claire Duvallet; Eric J. Alm (2023). Normalization methods impact the number of significant genus-level associations between cases and controls across multiple diseases. [Dataset]. http://doi.org/10.1371/journal.pcbi.1006102.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sean M. Gibbons; Claire Duvallet; Eric J. Alm
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Normalization methods impact the number of significant genus-level associations between cases and controls across multiple diseases.

  2. Sample dataset for the models trained and tested in the paper 'Can AI be...

    • zenodo.org
    zip
    Updated Aug 1, 2024
    + more versions
    Cite
    Elena Tomasi; Gabriele Franch; Marco Cristoforetti (2024). Sample dataset for the models trained and tested in the paper 'Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy' [Dataset]. http://doi.org/10.5281/zenodo.12934521
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Elena Tomasi; Gabriele Franch; Marco Cristoforetti
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This repository contains a sample of the input data for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy". It allows the user to test and train the models on a reduced dataset (45GB).

    This sample dataset comprises ~3 years of normalized hourly data for both low-resolution predictors and high-resolution target variables. Data has been randomly picked from the whole dataset, from 2000 to 2020, with 70% of data coming from the original training dataset, 15% from the original validation dataset, and 15% from the original test dataset. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.

    This sample dataset also includes files relative to metadata, static data, normalization, and plotting.

    To use the data, clone the corresponding repository and unzip this zip file in the data folder.

  3. Data from: A systematic evaluation of normalization methods and probe...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 30, 2023
    + more versions
    Cite
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Hospital for Sick Children
    University of Toronto
    Universidade de São Paulo
    Authors
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Background: The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.

    Methods: This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.

    Results: The method we define as SeSAMe 2, which consists of the application of the regular SeSAMe pipeline with an additional round of QC, pOOBAH masking, was found to be the best-performing normalization method, while quantile-based methods were found to be the worst-performing methods. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

    Methods

    Study Participants and Samples

    The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-Estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals, 13 men and 11 women, were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two.

    All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).

    Blood Collection and Processing

    Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer’s recommended protocols, using the Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction with the same commercial reagents (second time point), because the equipment had been discontinued. DNA was quantified using a Nanodrop spectrometer and diluted to 50 ng/µL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 out of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data is also available for the samples described above.

    Characterization of DNA Methylation using the EPIC array

    Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

    Processing and Analysis of DNA Methylation Data

    The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of each sample and compared the inferred sex to the reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes if more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values from the empirical distribution of out-of-band probes with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05, and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the probes flagged in either analysis were combined and removed from the data.
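    The bead-number filter described above can be illustrated with a minimal Python sketch. This is not the authors' R code: the function name, the dictionary layout and the probe IDs are hypothetical, and only the two thresholds (bead number < 3, more than 5% of samples) come from the description.

```python
def probes_to_remove(bead_counts, min_beads=3, max_fraction=0.05):
    """Return probe IDs for which more than `max_fraction` of samples
    have a bead count below `min_beads`.

    bead_counts: dict mapping a probe ID to a list of per-sample bead counts
    (a hypothetical layout chosen for this illustration).
    """
    flagged = []
    for probe_id, counts in bead_counts.items():
        low = sum(1 for c in counts if c < min_beads)
        if low / len(counts) > max_fraction:
            flagged.append(probe_id)
    return flagged


# Toy data with 20 samples per probe (made-up IDs and counts).
example = {
    "probe_a": [12] * 20,           # no low-bead samples: kept
    "probe_b": [12] * 19 + [2],     # 1/20 = 5%, not more than 5%: kept
    "probe_c": [12] * 18 + [1, 2],  # 2/20 = 10%: removed
}
removed = probes_to_remove(example)
```

    Note that the comparison is strict, so a probe with exactly 5% low-bead samples is kept, matching the "more than 5%" wording.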

    Normalization Methods Evaluated

    The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data was read into R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out in the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input Raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using as input minfi’s Noob normalized data. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out in the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass previous QC, and that had not been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out in the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had in the absolute difference of beta values (|β|) between replicated samples.
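    As a rough illustration of the replicate comparison, the following Python sketch computes the mean absolute beta-value difference between two technical replicates. It assumes each replicate is a dictionary from probe ID to beta value and that only probes present in both replicates are compared; the function name and the toy values are hypothetical and are not taken from the study.

```python
from statistics import mean


def mean_abs_beta_difference(rep1, rep2):
    """Mean absolute difference in beta values over probes shared by two replicates."""
    shared = rep1.keys() & rep2.keys()
    if not shared:
        raise ValueError("replicates share no probes")
    return mean(abs(rep1[p] - rep2[p]) for p in shared)


# Toy beta values for one replicate pair (made-up numbers).
replicate_1 = {"probe_1": 0.10, "probe_2": 0.80, "probe_3": 0.50}
replicate_2 = {"probe_1": 0.12, "probe_2": 0.75, "probe_3": 0.50}
result = mean_abs_beta_difference(replicate_1, replicate_2)
```

    Under this metric a smaller value after normalization indicates better agreement between replicates.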

  4. Data from: Isobaric Matching between Runs and Novel PSM-Level Normalization...

    • acs.figshare.com
    txt
    Updated Jun 5, 2023
    + more versions
    Cite
    Sung-Huan Yu; Pelagia Kyriakidou; Jürgen Cox (2023). Isobaric Matching between Runs and Novel PSM-Level Normalization in MaxQuant Strongly Improve Reporter Ion-Based Quantification [Dataset]. http://doi.org/10.1021/acs.jproteome.0c00209.s002
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    ACS Publications
    Authors
    Sung-Huan Yu; Pelagia Kyriakidou; Jürgen Cox
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Isobaric labeling has the promise of combining high sample multiplexing with precise quantification. However, normalization issues and the missing value problem of complete n-plexes hamper quantification across more than one n-plex. Here, we introduce two novel algorithms implemented in MaxQuant that substantially improve the data analysis with multiple n-plexes. First, isobaric matching between runs makes use of the three-dimensional MS1 features to transfer identifications from identified to unidentified MS/MS spectra between liquid chromatography–mass spectrometry runs in order to utilize reporter ion intensities in unidentified spectra for quantification. On typical datasets, we observe a significant gain in MS/MS spectra that can be used for quantification. Second, we introduce a novel PSM-level normalization, applicable to data with and without the common reference channel. It is a weighted median-based method, in which the weights reflect the number of ions that were used for fragmentation. On a typical dataset, we observe complete removal of batch effects and dominance of the biological sample grouping after normalization. Furthermore, we provide many novel processing and normalization options in Perseus, the companion software for the downstream analysis of quantitative proteomics results. All novel tools and algorithms are available with the regular MaxQuant and Perseus releases, which are downloadable at http://maxquant.org.
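    To make the idea of a weighted median concrete, here is a minimal Python sketch. It is a generic (lower) weighted median and is not MaxQuant's implementation; how MaxQuant derives the weights and applies the result is not described here, so the function name and toy numbers are assumptions for illustration only.

```python
def weighted_median(values, weights):
    """Return the lower weighted median: the smallest value at which the
    cumulative weight reaches at least half of the total weight."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    if total <= 0:
        raise ValueError("total weight must be positive")
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= total / 2:
            return value


# Toy example: the heavily weighted observation dominates the result.
heavy = weighted_median([1.0, 2.0, 10.0], [1, 1, 5])
# With equal weights the result is the ordinary median.
equal = weighted_median([1.0, 2.0, 10.0], [1, 1, 1])
```

    Compared with a plain median, observations with larger weights pull the result towards themselves, which is the property the description relies on.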

  5. Dataset of normalised Slovene text KonvNormSl 1.0

    • clarin.si
    • live.european-language-grid.eu
    Updated Sep 19, 2016
    + more versions
    Cite
    Nikola Ljubešić; Katja Zupan; Darja Fišer; Tomaž Erjavec (2016). Dataset of normalised Slovene text KonvNormSl 1.0 [Dataset]. https://www.clarin.si/repository/xmlui/handle/11356/1068?locale-attribute=sl
    Explore at:
    Dataset updated
    Sep 19, 2016
    Authors
    Nikola Ljubešić; Katja Zupan; Darja Fišer; Tomaž Erjavec
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Data used in the experiments described in:

    Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany. https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf (https://www.linguistics.rub.de/konvens16/)

    Data are split into the "token" folder (experiment on normalising individual tokens) and "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample, the original data (.orig.txt) and the data with hand-normalised words (.norm.txt). The files are aligned by lines.

    There are four datasets:
    - goo300k-bohoric: historical Slovene, hard case (<1850)
    - goo300k-gaj: historical Slovene, easy case (1850 - 1900)
    - tweet-L3: Slovene tweets, hard case (non-standard language)
    - tweet-L1: Slovene tweets, easy case (mostly standard language)

    The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (https://nl.ijs.si/janes/english/).

    The text in the files has been split by inserting spaces between characters, with underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark '¿' character.
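    The character-level format described above can be sketched in Python as follows. This is an illustration, not the script used to build the dataset: the function names are hypothetical, and it assumes a single space as the separator and that the original text contains no underscores.

```python
def to_char_format(text):
    """Insert spaces between characters, writing original spaces as underscores."""
    return " ".join("_" if ch == " " else ch for ch in text)


def from_char_format(encoded):
    """Invert to_char_format: drop separator spaces, turn underscores back into spaces."""
    return "".join(" " if tok == "_" else tok for tok in encoded.split(" "))


encoded = to_char_format("kaj je to")   # "k a j _ j e _ t o"
decoded = from_char_format(encoded)     # "kaj je to"
```

    Because the .orig.txt and .norm.txt files are aligned by line, each encoded line in one file pairs with the line at the same position in the other.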

  6. Methods for normalizing microbiome data: an ecological perspective

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Oct 30, 2018
    Cite
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger (2018). Methods for normalizing microbiome data: an ecological perspective [Dataset]. http://doi.org/10.5061/dryad.tn8qs35
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 30, 2018
    Dataset provided by
    Dryad
    Authors
    Donald T. McKnight; Roger Huerlimann; Deborah S. Bower; Lin Schwarzkopf; Ross A. Alford; Kyall R. Zenger
    Time period covered
    2018
    Description

    Simulation script 1: This R script will simulate two populations of microbiome samples and compare normalization methods.

    Simulation script 2: This R script will simulate two populations of microbiome samples and compare normalization methods via PCoAs.

    Sample.OTU.distribution: OTU distribution used in the paper "Methods for normalizing microbiome data: an ecological perspective".

  7. Cadastral PLSS Standardized Data - PLSSSecond Division (Dalhart) - Version...

    • catalog.data.gov
    • gimi9.com
    • +2more
    Updated Dec 2, 2020
    + more versions
    Cite
    (Point of Contact) (2020). Cadastral PLSS Standardized Data - PLSSSecond Division (Dalhart) - Version 1.1 [Dataset]. https://catalog.data.gov/dataset/cadastral-plss-standardized-data-plsssecond-division-dalhart-version-1-1
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    (Point of Contact)
    Description

    This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS: the quarter, quarter-quarter, sixteenth, or government lot divisions. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.

  8. Cadastral PLSS Standardized Data - PLSSSecond Division (Douglas) - Version...

    • gstore.unm.edu
    • gimi9.com
    • +2more
    Updated Sep 25, 2011
    + more versions
    Cite
    (2011). Cadastral PLSS Standardized Data - PLSSSecond Division (Douglas) - Version 1.1 [Dataset]. http://gstore.unm.edu/apps/rgis/datasets/64e5206b-0927-49f6-89c7-14fdbf271ad8/metadata/ISO-19115:2003.html
    Explore at:
    Dataset updated
    Sep 25, 2011
    Time period covered
    Apr 11, 2011
    Area covered
    West Bound -110.006112069 East Bound -107.993887964 North Bound 32.0061121667 South Bound 30.9938880847
    Description

    This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS: the quarter, quarter-quarter, sixteenth, or government lot divisions. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.

  9. Cadastral PLSS Standardized Data - PLSSSecond Division (St Johns) - Version...

    • catalog.data.gov
    • gstore.unm.edu
    • +1more
    Updated Dec 2, 2020
    + more versions
    Cite
    (Point of Contact) (2020). Cadastral PLSS Standardized Data - PLSSSecond Division (St Johns) - Version 1.1 [Dataset]. https://catalog.data.gov/dataset/cadastral-plss-standardized-data-plsssecond-division-st-johns-version-1-1
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    (Point of Contact)
    Description

    This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS: the quarter, quarter-quarter, sixteenth, or government lot divisions. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.

  10. Cadastral PLSS Standardized Data - PLSSSecond Division (Roswell) - Version...

    • catalog.data.gov
    • gstore.unm.edu
    Updated Dec 2, 2020
    + more versions
    Cite
    (Point of Contact) (2020). Cadastral PLSS Standardized Data - PLSSSecond Division (Roswell) - Version 1.1 [Dataset]. https://catalog.data.gov/dataset/cadastral-plss-standardized-data-plsssecond-division-roswell-version-1-1
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    (Point of Contact)
    Description

    This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS: the quarter, quarter-quarter, sixteenth, or government lot divisions. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.

  11. Traffic Signs Preprocessed

    • kaggle.com
    zip
    Updated Aug 31, 2019
    Cite
    Valentyn Sichkar (2019). Traffic Signs Preprocessed [Dataset]. https://www.kaggle.com/datasets/valentynsichkar/traffic-signs-preprocessed/versions/1
    Explore at:
    Available download formats: zip (4471082770 bytes)
    Dataset updated
    Aug 31, 2019
    Authors
    Valentyn Sichkar
    Description

    Content

    This is ready-to-use preprocessed Traffic Signs data saved in nine pickle files.
    Original datasets are in the following files:
    - train.pickle
    - valid.pickle
    - test.pickle


    Code with a detailed description of how the datasets were preprocessed is in datasets_preparing.py.


    Before preprocessing, the training dataset was equalized so that the classes have equal numbers of examples, as shown in the figure below. The histogram shows the 43 classes of the training dataset with their number of examples for Traffic Signs Classification before and after equalization, which was done by adding transformed images (brightness and rotation) from the original dataset. After equalization, the training dataset grew to 86989 examples.


    Figure (histogram of class sizes before and after equalization): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3400968%2Fb5d9f0189353832e769c2bdd8e25243d%2Fhistogram.png?generation=1567275066871451&alt=media


    The nine resulting preprocessed files are as follows:
    - data0.pickle - Shuffling
    - data1.pickle - Shuffling, /255.0 Normalization
    - data2.pickle - Shuffling, /255.0 + Mean Normalization
    - data3.pickle - Shuffling, /255.0 + Mean + STD Normalization
    - data4.pickle - Grayscale, Shuffling
    - data5.pickle - Grayscale, Shuffling, Local Histogram Equalization
    - data6.pickle - Grayscale, Shuffling, Local Histogram Equalization, /255.0 Normalization
    - data7.pickle - Grayscale, Shuffling, Local Histogram Equalization, /255.0 + Mean Normalization
    - data8.pickle - Grayscale, Shuffling, Local Histogram Equalization, /255.0 + Mean + STD Normalization


    Datasets data0 - data3 contain RGB images and datasets data4 - data8 contain grayscale images.


    Shapes of data0 - data3 are as follows (RGB):
    - x_train: (86989, 3, 32, 32)
    - y_train: (86989,)
    - x_validation: (4410, 3, 32, 32)
    - y_validation: (4410,)
    - x_test: (12630, 3, 32, 32)
    - y_test: (12630,)


    Shapes of data4 - data8 are as follows (Gray):
    - x_train: (86989, 1, 32, 32)
    - y_train: (86989,)
    - x_validation: (4410, 1, 32, 32)
    - y_validation: (4410,)
    - x_test: (12630, 1, 32, 32)
    - y_test: (12630,)


    The mean image and standard deviation were calculated from the training dataset and applied to the validation and testing datasets in the datasets that use them. When classifying a user's own image, it must first be preprocessed in the same way and in the same order as the chosen dataset among the nine.
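    The "/255.0 + Mean + STD" pipeline and the rule of reusing training statistics can be sketched in plain Python. This is not the code from datasets_preparing.py: it assumes each image is a flat list of pixel values and that the mean and standard deviation are computed per pixel over the training set, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def scale(images):
    """Scale raw 0..255 pixel values to the range 0..1."""
    return [[p / 255.0 for p in img] for img in images]


def fit_stats(train_images):
    """Per-pixel mean and standard deviation, computed on training images only."""
    columns = list(zip(*train_images))
    return [mean(col) for col in columns], [pstdev(col) for col in columns]


def normalize(images, mean_img, std_img=None, eps=1e-8):
    """Subtract the training mean image; optionally divide by the training std."""
    out = []
    for img in images:
        row = [p - m for p, m in zip(img, mean_img)]
        if std_img is not None:
            row = [c / (s + eps) for c, s in zip(row, std_img)]  # eps avoids division by zero
        out.append(row)
    return out


# Two tiny 2-pixel "training images" and one "test image" (toy values).
train = scale([[0, 255], [255, 255]])
test = scale([[255, 0]])
mean_img, std_img = fit_stats(train)            # statistics come from train only
test_centered = normalize(test, mean_img)       # mean normalization
test_standardized = normalize(test, mean_img, std_img)  # mean + std normalization
```

    The same `mean_img` and `std_img` would also be applied to any new image before classification, in the same order as for the chosen dataset.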

    Acknowledgements

    The initial data is the German Traffic Sign Recognition Benchmark (GTSRB).

  12. Cadastral PLSS Standardized Data - PLSSSecond Division (Santa Fe) - Version...

    • catalog.data.gov
    • gstore.unm.edu
    • +1more
    Updated Dec 2, 2020
    + more versions
    Cite
    (Point of Contact) (2020). Cadastral PLSS Standardized Data - PLSSSecond Division (Santa Fe) - Version 1.1 [Dataset]. https://catalog.data.gov/dataset/cadastral-plss-standardized-data-plsssecond-division-santa-fe-version-1-1
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    (Point of Contact)
    Description

    This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set more fully describes the entire data set. This feature class is the second division of the PLSS: the quarter, quarter-quarter, sixteenth, or government lot divisions. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division to the sixteenth that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.

  13. NLUCat

    • zenodo.org
    • huggingface.co
    • +1more
    zip
    Updated Mar 4, 2024
    Cite
    Zenodo (2024). NLUCat [Dataset]. http://doi.org/10.5281/zenodo.10721193
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NLUCat

    Dataset Description

    Dataset Summary

    NLUCat is a dataset of NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is accompanied, in addition, by the instructions received by the annotator who wrote it.

    The intents taken into account are the usual ones for a virtual home assistant (activity calendar, IoT, list management, leisure, etc.), but specific ones have also been added to take into account social and healthcare needs for vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

    The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.

    The examples are not only written in Catalan, but they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.)

    This dataset can be used to train models for intent classification, spans identification and examples generation.

    This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.

    In this repository you'll find the following items:

    • NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team
    • NLUCat_dataset.json: the completed NLUCat dataset
    • NLUCat_stats.tsv: statistics about the NLUCat dataset
    • dataset: folder with the dataset as published in HuggingFace, split and prepared for training and evaluating intent classifiers
    • reports: folder with the reports done as feedback to the annotators during the annotation process

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Supported Tasks and Leaderboards

    Intent classification, spans identification and examples generation.

    Languages

    The dataset is in Catalan (ca-ES).

    Dataset Structure

    Data Instances

    Three JSON files, one for each split.

    Data Fields

    • example: `str`. Example
    • annotation: `dict`. Annotation of the example
      • intent: `str`. Intent tag
      • slots: `list`. List of slots
        • Tag: `str`. Tag of the slot
        • Text: `str`. Text of the slot
        • Start_char: `int`. First character of the span
        • End_char: `int`. Last character of the span

    Example


    An example looks as follows:

    {
      "example": "Demana una ambulància; la meva dona està de part.",
      "annotation": {
        "intent": "call_emergency",
        "slots": [
          {
            "Tag": "service",
            "Text": "ambulància",
            "Start_char": 11,
            "End_char": 21
          },
          {
            "Tag": "situation",
            "Text": "la meva dona està de part",
            "Start_char": 23,
            "End_char": 48
          }
        ]
      }
    }
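
The character offsets in the example above can be checked with a short script. This is a minimal sketch, not part of the dataset's tooling: the helper name `slot_spans` is hypothetical, and it assumes that offsets count Unicode code points and that `End_char` is an exclusive end, which is what the example record suggests.

```python
import json

# Record copied from the example above.
record = json.loads("""
{
  "example": "Demana una ambulància; la meva dona està de part.",
  "annotation": {
    "intent": "call_emergency",
    "slots": [
      {"Tag": "service", "Text": "ambulància",
       "Start_char": 11, "End_char": 21},
      {"Tag": "situation", "Text": "la meva dona està de part",
       "Start_char": 23, "End_char": 48}
    ]
  }
}
""")


def slot_spans(rec):
    """Hypothetical helper: slice each slot out of the example text.

    Returns a list of (tag, sliced_text, matches) tuples, where `matches`
    tells whether the slice equals the slot's "Text" field when End_char
    is treated as an exclusive offset (an assumption, see above).
    """
    text = rec["example"]
    result = []
    for slot in rec["annotation"]["slots"]:
        sliced = text[slot["Start_char"]:slot["End_char"]]
        result.append((slot["Tag"], sliced, sliced == slot["Text"]))
    return result


for tag, sliced, matches in slot_spans(record):
    print(tag, repr(sliced), matches)
```

Running the same check over a whole split is one way to catch records whose offsets and slot text have drifted apart after editing.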


    Data Splits

    • NLUCat.train: 9128 examples
    • NLUCat.dev: 1441 examples
    • NLUCat.test: 1441 examples

    Dataset Creation

    Curation Rationale

    We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

    When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.

    Source Data

    Initial Data Collection and Normalization

    We commissioned a company to write fictitious examples for this dataset.

    Who are the source language producers?

    We commissioned the writing of the examples to the company m47 labs.

    Annotations

    Annotation process

    The dataset was created in three steps, modeled on the process followed for the NLU-Evaluation-Data dataset, as explained in the paper.
    * First step: translation or elaboration of the instructions given to the annotators to write the examples.
    * Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
    * Third step: annotating the intents and the slots of each example. In this step, some modifications were made to the annotation guidelines to adapt them to real situations.

    Who are the annotators?

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

    Personal and Sensitive Information

    No personal or sensitive information is included.

    The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.

    Considerations for Using the Data

    Social Impact of Dataset

    We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.

    Discussion of Biases

    When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
    Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.

    Other Known Limitations

    [N/A]

    Additional Information

    Dataset Curators

    Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

    Licensing Information

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0.
    Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Citation Information

    DOI

    Contributions

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

  14. Cadastral PLSS Standardized Data - PLSSSecond Division (Las Cruces) -...

    • gimi9.com
    • gstore.unm.edu
    • +2more
    Updated Dec 9, 2024
    + more versions
    Cite
    (2024). Cadastral PLSS Standardized Data - PLSSSecond Division (Las Cruces) - Version 1.1 [Dataset]. https://gimi9.com/dataset/data-gov_cadastral-plss-standardized-data-plsssecond-division-las-cruces-version-1-1
    Explore at:
    Dataset updated
    Dec 9, 2024
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Las Cruces
    Description

    This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set describes the entire data set more fully. This feature class is the second division of the PLSS: the quarter, quarter-quarter, sixteenth, or government lot divisions. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division, down to the sixteenth, that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.

  15. Cadastral PLSS Standardized Data - PLSSSecond Division (Carlsbad) - Version...

    • gstore.unm.edu
    csv, geojson, gml +5
    Updated Mar 23, 2025
    + more versions
    Cite
    Earth Data Analysis Center (2025). Cadastral PLSS Standardized Data - PLSSSecond Division (Carlsbad) - Version 1.1 [Dataset]. https://gstore.unm.edu/apps/rgis/datasets/e596d678-49a2-4319-ab2f-5ad238f4feef/metadata/FGDC-STD-001-1998.html
    Explore at:
    geojson(100), gml(100), shp(100), zip(61), xls(100), kml(100), json(100), csv(100) Available download formats
    Dataset updated
    Mar 23, 2025
    Dataset provided by
    Earth Data Analysis Center
    Time period covered
    Apr 11, 2011
    Area covered
    New Mexico, West Bounding Coordinate -106.006111716 East Bounding Coordinate -103.993888406 North Bounding Coordinate 33.0061122267 South Bounding Coordinate 30.993888057
    Description

    This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set describes the entire data set more fully. This feature class is the second division of the PLSS: the quarter, quarter-quarter, sixteenth, or government lot divisions. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division, down to the sixteenth, that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.

  16. MNIST Dataset

    • paperswithcode.com
    Updated Nov 16, 2021
    Cite
    Y. LeCun; L. Bottou; Y. Bengio; P. Haffner (2021). MNIST Dataset [Dataset]. https://paperswithcode.com/dataset/mnist
    Explore at:
    Dataset updated
    Nov 16, 2021
    Authors
    Y. LeCun; L. Bottou; Y. Bengio; P. Haffner
    Description

    The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of the larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students), which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.
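
The centering step described above can be illustrated with a small pure-Python sketch. It is not the original MNIST preprocessing code: the function name `center_by_mass` is hypothetical, and the whole-pixel translation, the rounding, and the clamping that keeps the glyph inside the canvas are all assumptions made for illustration.

```python
def center_by_mass(glyph, size=28):
    """Paste a grey-level glyph (list of rows) into a size x size canvas so
    that its intensity-weighted center of mass lands near the canvas center.

    Assumptions: translation by whole pixels only, and the offset is clamped
    so the glyph never leaves the canvas.
    """
    height, width = len(glyph), len(glyph[0])
    total = sum(sum(row) for row in glyph)
    if total == 0:
        raise ValueError("glyph has no ink, center of mass is undefined")

    # Intensity-weighted mean row and column of the glyph.
    mass_row = sum(r * v for r, row in enumerate(glyph) for v in row) / total
    mass_col = sum(c * v for row in glyph for c, v in enumerate(row)) / total

    # Geometric center of the canvas in pixel coordinates (13.5 for 28).
    center = (size - 1) / 2

    # Offset that moves the center of mass onto the canvas center.
    top = round(center - mass_row)
    left = round(center - mass_col)

    # Clamp so that the whole glyph stays inside the canvas.
    top = max(0, min(size - height, top))
    left = max(0, min(size - width, left))

    canvas = [[0] * size for _ in range(size)]
    for r in range(height):
        for c in range(width):
            canvas[top + r][left + c] = glyph[r][c]
    return canvas


# A 20x20 glyph whose ink is a 2x2 block slightly above and left of its middle.
glyph = [[0] * 20 for _ in range(20)]
for r in (8, 9):
    for c in (8, 9):
        glyph[r][c] = 10
centered = center_by_mass(glyph)
```

Because the offset depends on where the ink is rather than on the bounding box, a glyph with most of its mass on one side ends up visibly off-center in its 20x20 box but balanced in the 28x28 field.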

  17. Data from: MS-DAP Platform for Downstream Data Analysis of Label-Free...

    • acs.figshare.com
    xlsx
    Updated Jun 11, 2023
    Cite
    Frank Koopmans; Ka Wan Li; Remco V. Klaassen; August B. Smit (2023). MS-DAP Platform for Downstream Data Analysis of Label-Free Proteomics Uncovers Optimal Workflows in Benchmark Data Sets and Increased Sensitivity in Analysis of Alzheimer’s Biomarker Data [Dataset]. http://doi.org/10.1021/acs.jproteome.2c00513.s003
    Explore at:
    xlsx Available download formats
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    ACS Publications
    Authors
    Frank Koopmans; Ka Wan Li; Remco V. Klaassen; August B. Smit
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In the rapidly moving proteomics field, a diverse patchwork of data analysis pipelines and algorithms for data normalization and differential expression analysis is used by the community. We generated a mass spectrometry downstream analysis pipeline (MS-DAP) that integrates both popular and recently developed algorithms for normalization and statistical analyses. Additional algorithms can be easily added in the future as plugins. MS-DAP is open-source and facilitates transparent and reproducible proteome science by generating extensive data visualizations and quality reporting, provided as standardized PDF reports. Second, we performed a systematic evaluation of methods for normalization and statistical analysis on a large variety of data sets, including additional data generated in this study, which revealed key differences. Commonly used approaches for differential testing based on moderated t-statistics were consistently outperformed by more recent statistical models, all integrated in MS-DAP. Third, we introduced a novel normalization algorithm that rescues deficiencies observed in commonly used normalization methods. Finally, we used the MS-DAP platform to reanalyze a recently published large-scale proteomics data set of CSF from AD patients. This revealed increased sensitivity, resulting in additional significant target proteins, which improved overlap with results reported in related studies and included a large set of new potential AD biomarkers in addition to those previously reported.

  18. Cadastral PLSS Standardized Data - PLSSSecond Division (Brownfield) -...

    • data.wu.ac.at
    • gstore.unm.edu
    • +3more
    csv, excel, geojson +9
    Updated Jun 25, 2014
    + more versions
    Cite
    Earth Data Analysis Center, University of New Mexico (2014). Cadastral PLSS Standardized Data - PLSSSecond Division (Brownfield) - Version 1.1 [Dataset]. https://data.wu.ac.at/odso/data_gov/OTFjYzNhMjEtYzMxNy00OWU5LWJlNzMtZjJlYmQzODFkNDE3
    Explore at:
    json, zip, csv, xml, shp, wfs, geojson, html, gml, kml, wms, excel Available download formats
    Dataset updated
    Jun 25, 2014
    Dataset provided by
    Earth Data Analysis Center, University of New Mexico
    Area covered
    2efadbfcfd842266ea6b06b17dab7b3f423b8b4c
    Description

    This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set describes the entire data set more fully. This feature class is the second division of the PLSS: the quarter, quarter-quarter, sixteenth, or government lot divisions. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division, down to the sixteenth, that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.

  19. Cadastral PLSS Standardized Data - PLSSSecond Division (Silver City) -...

    • s.cnmilf.com
    • gstore.unm.edu
    • +2more
    Updated Dec 2, 2020
    + more versions
    Cite
    (Point of Contact) (2020). Cadastral PLSS Standardized Data - PLSSSecond Division (Silver City) - Version 1.1 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/cadastral-plss-standardized-data-plsssecond-division-silver-city-version-1-1
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    (Point of Contact)
    Description

    This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set describes the entire data set more fully. This feature class is the second division of the PLSS: the quarter, quarter-quarter, sixteenth, or government lot divisions. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division, down to the sixteenth, that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.

  20. Cadastral PLSS Standardized Data - geodatabase - Version 1.1

    • s.cnmilf.com
    • gstore.unm.edu
    • +1more
    Updated Dec 2, 2020
    + more versions
    Cite
    (Point of Contact) (2020). Cadastral PLSS Standardized Data - geodatabase - Version 1.1 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/cadastral-plss-standardized-data-geodatabase-version-1-1
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    (Point of Contact)
    Description

    This feature class is part of the Cadastral National Spatial Data Infrastructure (NSDI) CADNSDI publication data set for rectangular and non-rectangular Public Land Survey System (PLSS) data. The metadata description in the Cadastral Reference System Feature Data Set describes the entire data set more fully. This feature class is the second division of the PLSS: the quarter, quarter-quarter, sixteenth, or government lot divisions. The second and third divisions are combined into this feature class as an intentional de-normalization of the PLSS hierarchical data. The polygons in this feature class represent the smallest division, down to the sixteenth, that has been defined for the first division. For example, in some cases sections have only been divided to the quarter. Divisions below the sixteenth are in the Special Survey or Parcel Feature Class.

Cite
Sean M. Gibbons; Claire Duvallet; Eric J. Alm (2023). Normalization methods impact the number of significant genus-level associations between cases and controls across multiple diseases. [Dataset]. http://doi.org/10.1371/journal.pcbi.1006102.t001

Normalization methods impact the number of significant genus-level associations between cases and controls across multiple diseases.

Related Article
Explore at:
xls Available download formats
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS http://plos.org/
Authors
Sean M. Gibbons; Claire Duvallet; Eric J. Alm
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Normalization methods impact the number of significant genus-level associations between cases and controls across multiple diseases.
