15 datasets found
  1. Wine quality dataset with identified duplicates

    • kaggle.com
    Updated Aug 1, 2024
    Cite
    aahz78 (2024). Wine quality dataset with identified duplicates [Dataset]. https://www.kaggle.com/datasets/aahz78/wine-quality-dataset-with-identified-duplicates
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    aahz78
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This dataset is derived from the original Wine Quality dataset and includes identified duplicates for further analysis and exploration. The original dataset consists of chemical properties of red and white wines along with their quality ratings.

    Content

    The dataset contains all the original features along with an additional column indicating the duplicate status. The duplicates were identified based on a comprehensive analysis that highlights records with high similarity. Additionally, the file ddrw.json contains information about red and white wines with 100% identical characteristics.

    Description

    This dataset aims to provide a refined version of the original wine quality data by highlighting duplicate entries. Duplicates in data can lead to misleading analysis and results. By identifying these duplicates, data scientists and analysts can better understand the structure of the data and apply necessary cleaning and preprocessing steps.

    The file ddrw.json provides information on red and white wines that have 100% identical characteristics. This information can be useful for:

    Studying the similarities between different types of wine.
    Analyzing cases where two different types of wine have the same chemical properties and understanding the reasons behind these similarities.
    Conducting a detailed analysis and improving machine learning models for wine quality prediction by considering identical records.
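
As a rough illustration of the checks described above, the sketch below flags exact duplicates within one wine type and records that appear in both types. The feature values and function names are hypothetical, and this is not the dataset author's procedure; the layout of ddrw.json is not documented here, so plain tuples stand in for the records.

```python
from collections import Counter

def within_type_duplicates(rows):
    """Return the feature tuples that occur more than once in `rows`."""
    counts = Counter(rows)
    return {row for row, n in counts.items() if n > 1}

def cross_type_identical(red_rows, white_rows):
    """Return feature tuples present in both the red and the white table."""
    return set(red_rows) & set(white_rows)

# Toy records with made-up values: (fixed acidity, pH, alcohol, quality)
red = [(7.4, 3.51, 9.4, 5), (7.8, 3.20, 9.8, 5), (7.4, 3.51, 9.4, 5)]
white = [(7.0, 3.00, 8.8, 6), (7.8, 3.20, 9.8, 5)]

red_dupes = within_type_duplicates(red)    # duplicates inside the red table
shared = cross_type_identical(red, white)  # identical red and white records
print(red_dupes)
print(shared)
```

Exact tuple matching only catches fully identical records; near-duplicates ("high similarity") would need a distance threshold, which is left out of this sketch.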
    

    Key Features

    Comprehensive Duplicate Identification: The dataset includes duplicates identified through a robust process, ensuring high accuracy.
    High Similarity Analysis: The dataset highlights the most and least similar records, providing insights into the nature of the duplicates.
    Enhanced Data Quality: By focusing on duplicate detection, this dataset helps in enhancing the overall quality of the data for more accurate analysis.
    File ddrw.json: Contains information about 100% identical characteristics of red and white wines, which can be useful for in-depth analysis.
    

    Usage

    This dataset is useful for:

    Data cleaning and preprocessing exercises.
    Duplicate detection and handling techniques.
    Exploring the impact of duplicates on data analysis and machine learning models.
    Educational purposes for understanding the importance of data quality.
    Studying similarities between different types of wine and their characteristics.
    

    File Structure

    1dd.json: Red wine duplicate records.
    1ddw.json: White wine duplicate records.
    ddrw.json: Information about red and white wines with 100% identical characteristics.
    

    Acknowledgements

    This dataset is built upon the original Wine Quality dataset by Abdelaziz Sami. Special thanks to the original contributors.

  2. Data from: Representations of emotion concepts: Comparison across pairwise,...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Kwon, Mijin; Wager, Tor; Phillips, Jonathan (2023). Representations of emotion concepts: Comparison across pairwise, appraisal feature-based, and word embedding-based similarity spaces [Dataset]. http://doi.org/10.7910/DVN/6DPPKH
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Kwon, Mijin; Wager, Tor; Phillips, Jonathan
    Description

    This repository contains code and data for the publication "Representations of emotion concepts: Comparison across pairwise, appraisal feature-based, and word embedding-based similarity spaces" by Kwon, M., Wager, T., & Phillips, J. (2022), published in the Proceedings of the Annual Meeting of the Cognitive Science Society, 44(44), and available at https://escholarship.org/uc/item/8vj3d366. All code is written in R.

    Updates

    We have updated our analyses based on feedback received since the annual meeting. The changes do not alter our main findings or conclusion. The included analysis script (EMOCON_cogsci2022_revised.Rmd) reflects these updates:

    Additional emotion concept pairs: Pairs involving comfortableness, gratefulness, relaxedness, romanticness, sereneness, and protectiveness, which were omitted in the previous version, were added to the analysis. The revised script includes all pairs and reflects changes in the feature-based similarity matrix, the correlation between the similarity measures, the loading values from principal component analysis (PCA) on appraisal features, the regression coefficients from multiple regression with the affective feature components from PCA, and the difference scores of the affective feature components.
    Scaling before PCA: PCA was rerun after rescaling the data.

    Script

    Scripts used for analysis are included in an R Markdown file.

    Data

    Data for appraisal feature ratings, pairwise similarity ratings, and word embeddings from word2vec models trained on Google News and Wikipedia are included. References for these data are listed below and also in the paper. Word embeddings from GPT-3 can be accessed via OpenAI's API (see the API documentation).

    Word embeddings from the word2vec model trained on Google News: Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 26.
    Word embeddings from the word2vec model trained on Wikipedia: Fares, M., Kutuzov, A., Oepen, S., & Velldal, E. (2017). Word vectors, reuse, and replicability: Towards a community repository of large-text resources. Proceedings of the 21st Nordic Conference on Computational Linguistics, 271–276.
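
To make the idea of comparing similarity spaces concrete, here is a minimal Python sketch (the authors' own code is in R and is not reproduced here). It assumes that embedding-based similarity is the cosine between word vectors and that two spaces are compared with a Pearson correlation over concept pairs; both choices, and all values below, are illustrative assumptions.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical 2-d "embeddings" and made-up pairwise ratings for three concepts
embeddings = {"joy": [1.0, 0.1], "calm": [0.8, 0.5], "fear": [-0.9, 0.3]}
ratings = {("calm", "fear"): 0.2, ("calm", "joy"): 0.8, ("fear", "joy"): 0.1}

# Compare the two spaces over the same ordered list of concept pairs
pairs = list(combinations(sorted(embeddings), 2))
embedding_sim = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
rated_sim = [ratings[p] for p in pairs]
r = pearson(embedding_sim, rated_sim)
print(round(r, 3))
```

Only the upper triangle of each similarity matrix enters the correlation, so every concept pair is counted once.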

  3. Data and code for: Two dominant forms of multisite similarity decline –...

    • data.niaid.nih.gov
    • datadryad.org
    • +1 more
    zip
    Updated Sep 10, 2024
    Cite
    David Deane (2024). Data and code for: Two dominant forms of multisite similarity decline – their origins and interpretation [Dataset]. http://doi.org/10.5061/dryad.rbnzs7hds
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 10, 2024
    Dataset provided by
    La Trobe University
    Authors
    David Deane
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Contains data and code for peer review of the draft manuscript 'Two dominant forms of multisite similarity decline – their origins and interpretation' in review at Ecology and Evolution (Manuscript ID: ECE-2022-10-01523). The data are a subset of the metaCommunity Ecology: Species, Traits, Environment and Space ("CESTES") database reported in "A global database for metacommunity ecology, integrating species, traits, environment and space" by A. Jeliazkov, D. Mijatovic, S. Chantepie, N. Andrew, R. Arlettaz, L. Barbaro, et al., Scientific Data 2020, Vol. 7, Issue 1, Pages e6. Data were downloaded from the Figshare repository: https://doi.org/10.6084/m9.figshare.c.4459637 on 9 Nov 2021.

    Methods

    The 80 datasets in the original database were first filtered to select only abundance (or similar quantitative) estimates of species in each site. This yielded 69 datasets. Several of these were experimental treatments (e.g., logged vs unlogged) or were time series re-surveys of the same sites. Because our interest was in structures within a single habitat type and point in time, these datasets were subdivided into discrete units. In total, this (coincidentally) resulted in 80 datasets for analysis. These data are in the cestesAbSplit.RData object, with matching metadata in mdat_cestesSplit.csv.

    cestesAbSplit.RData: a list containing the sites x species matrices for the 80 datasets used for analysis.
    mdat_cestesSplit.csv: matching metadata for the 80 datasets (name and source of the data, taxonomic grouping, kingdom, realm, extent in km2, total sites, total species). This file is only included to run the R analysis. Data are repeated, with full description, in Appendix S5 (see Readme tab).
    Appendix S5 Metadata.xlsx: metadata and values of standardized effect sizes from empirical analyses, with a ReadMe file explaining each field.

    The effect sizes were calculated using the scripts in Rcode_form_of_zeta.R (requires R_function.R) and can be reproduced by running the scripts (some very small numerical differences might occur in simulated values based on randomisation).
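
For readers unfamiliar with multisite similarity, the following Python sketch shows one way a decline curve can be computed from a sites x species matrix. It assumes that multisite similarity is measured as zeta diversity (the mean number of species shared by every combination of k sites), as the script name Rcode_form_of_zeta.R suggests; the matrix is made up, and the authors' R scripts may differ in detail.

```python
from itertools import combinations

def zeta(site_by_species, order):
    """Mean number of species shared by every combination of `order` sites."""
    # Convert each site's abundance row to the set of species present there
    occupied = [{j for j, n in enumerate(row) if n > 0} for row in site_by_species]
    shared_counts = [
        len(set.intersection(*combo)) for combo in combinations(occupied, order)
    ]
    return sum(shared_counts) / len(shared_counts)

# Hypothetical abundance matrix: 4 sites (rows) x 5 species (columns)
abundance = [
    [3, 0, 1, 2, 0],
    [1, 4, 0, 2, 0],
    [0, 1, 0, 5, 1],
    [2, 0, 0, 1, 0],
]

# Zeta diversity for orders 1..4; the shape of this decline is what is analysed
decline = [zeta(abundance, k) for k in range(1, 5)]
print(decline)
```

Enumerating every site combination is only practical for small matrices; larger datasets are normally handled by sampling combinations.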

  4. Data from: Codon similarity data in ATTED-II ver 8.0 (Bra, Mtr)

    • data.niaid.nih.gov
    Updated Jul 10, 2023
    + more versions
    Cite
    Obayashi, Takeshi (2023). Codon similarity data in ATTED-II ver 8.0 (Bra, Mtr) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8127407
    Dataset updated
    Jul 10, 2023
    Dataset provided by
    Obayashi, Takeshi
    Aoki, Yuichi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Codon similarity data in ATTED-II ver 8.0

    The gene-to-gene codon similarity data is organized in the form of tables, each named according to the Entrez Gene ID of a particular query gene. Each table encompasses three columns, specifying: the Entrez Gene ID of a corresponding gene, an MR (Mutual Rank) value (where a smaller number signifies a stronger relationship), and a Pearson correlation coefficient (where a larger number suggests a stronger association).

    Protein-coding sequences utilized in this study were retrieved from NCBI's RefSeq database. For each gene, a 61-dimensional vector was derived from the count of codons in the protein-coding sequence. In instances where multiple RefSeq sequences were associated with a single gene, the longest sequence was selected for the codon usage calculation. Pearson correlation coefficients (PCCs) were determined between the vectors of any two given genes. These PCCs were subsequently converted into MRs, employed as an index to evaluate the similarity in codon usage between the genes.
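
The pipeline described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the ATTED-II implementation: the gene IDs and sequences are invented, the 61 dimensions are taken to be the sense codons (all 64 codons minus the three stop codons), and the Mutual Rank is taken to be the geometric mean of the two directional ranks.

```python
import math
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = [c for c in ("".join(p) for p in product("ACGT", repeat=3))
                if c not in STOP_CODONS]  # 61 codons

def codon_vector(cds):
    """Count each of the 61 sense codons in an in-frame coding sequence."""
    counts = dict.fromkeys(SENSE_CODONS, 0)
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in counts:
            counts[codon] += 1
    return [counts[c] for c in SENSE_CODONS]

def pearson(x, y):
    """Pearson correlation coefficient between two count vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mutual_ranks(vectors):
    """Return {(a, b): MR}, MR = sqrt(rank of b seen from a * rank of a seen from b)."""
    genes = list(vectors)
    rank = {}
    for a in genes:
        # Rank all other genes by descending PCC with gene a (rank 1 = most similar)
        others = sorted((g for g in genes if g != a),
                        key=lambda g: pearson(vectors[a], vectors[g]), reverse=True)
        for position, b in enumerate(others, start=1):
            rank[a, b] = position
    return {(a, b): math.sqrt(rank[a, b] * rank[b, a])
            for a in genes for b in genes if a < b}

# Invented gene IDs and toy coding sequences
sequences = {"g1": "ATGGCTGCTGCTAAA", "g2": "ATGGCTGCTAAAAAA", "g3": "ATGTTTTTTTTTCCC"}
vectors = {gene: codon_vector(seq) for gene, seq in sequences.items()}
mr = mutual_ranks(vectors)
print(mr)
```

A smaller MR means a stronger relationship, which matches the column description above: g1 and g2 rank each other first, so their MR is 1.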

  5. Data from: Assessing identity, redundancy and confounds in Gene Ontology...

    • search.dataone.org
    • borealisdata.ca
    • +1 more
    Updated Dec 28, 2023
    Cite
    Gillis, Jesse; Pavlidis, Paul (2023). Assessing identity, redundancy and confounds in Gene Ontology annotations over time [Dataset]. http://doi.org/10.5683/SP2/ZLJTVW
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Gillis, Jesse; Pavlidis, Paul
    Description

    MOTIVATION: The Gene Ontology (GO) is heavily used in systems biology, but the potential for redundancy, confounds with other data sources and problems with stability over time have been little explored. RESULTS: We report that GO annotations are stable over short periods, with 3% of genes not being most semantically similar to themselves between monthly GO editions. However, we find that genes can alter their 'functional identity' over time, with 20% of genes not matching to themselves (by semantic similarity) after 2 years. We further find that annotation bias in GO, in which some genes are more characterized than others, has declined in yeast, but generally increased in humans. Finally, we discovered that many entries in protein interaction databases are owing to the same published reports that are used for GO annotations, with 66% of assessed GO groups exhibiting this confound. We provide a case study to illustrate how this information can be used in analyses of gene sets and networks. AVAILABILITY: Data available at http://chibi.ubc.ca/assessGO.
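
The "not most semantically similar to themselves" statistic can be illustrated with a small Python sketch. It is an assumption-laden toy: Jaccard overlap of annotation term sets stands in for the semantic similarity measure used in the study, and the gene names and term labels are placeholders rather than real GO identifiers.

```python
def jaccard(a, b):
    """Jaccard overlap of two term sets (a stand-in for semantic similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def fraction_not_self_matching(old, new):
    """Share of genes whose most similar gene in `new` is a different gene."""
    mismatched = 0
    for gene, terms in old.items():
        best = max(new, key=lambda other: jaccard(terms, new[other]))
        if best != gene:
            mismatched += 1
    return mismatched / len(old)

# Placeholder annotations for two editions of the ontology
old_edition = {"geneA": {"t1", "t2"}, "geneB": {"t3"}, "geneC": {"t4", "t5"}}
new_edition = {
    "geneA": {"t1", "t2", "t6"},
    "geneB": {"t3", "t4", "t5"},
    "geneC": {"t4", "t7", "t8"},
}

fraction = fraction_not_self_matching(old_edition, new_edition)
print(fraction)
```

In this toy example geneC's old annotations now resemble geneB more than geneC itself, so one gene in three has changed its "functional identity".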

  6. EO-based area monitoring markers computed over the Cypriotic pilot region...

    • data.niaid.nih.gov
    Updated Oct 6, 2022
    + more versions
    Cite
    Nejc Vesel (2022). EO-based area monitoring markers computed over the Cypriotic pilot region (2022) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7139348
    Dataset updated
    Oct 6, 2022
    Dataset provided by
    Grega Milcinski
    Nejc Vesel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the context of the EU-funded project DIONE (No. 870378), the following EO-based monitoring marker maps were released over the two pilot regions, containing the results produced from a set of image analysis and machine learning techniques. The latest explores the benefits of Copernicus's multispectral high-resolution Sentinel-2 data acquired from 01-08-2021 until 03-09-2022 and provides tailored information on the needs of European paying agencies (e.g. CAPO and NPA), expressed with the following markers.

    Mowing marker: used to detect mowing events on meadow/grass like Features Of Interest (FOI)

    Mean-NDVI marker: used to detect erroneous claims with no vegetation

    Homogeneity marker: used to determine if a parcel geometry consists of a single crop or if multiple things are growing on the parcel

    Bare soil marker: used to detect observation where bare soil is present on the feature of interest. This indicates agricultural activity on the FOI (plowing, harvest)

    Similarity and distance markers: used to give additional context to the crop classification and to detect erroneous claims

    Land marker: used to detect the land type and non-productive EFAs of the FOI

    Crop-type marker: used to detect the specific crop growing on the FOI

    This dataset is comprised of the geopackage file "markers_summary.gpkg", which was computed for the Cypriotic pilot region. Descriptions are given below.

    Markers summary dataset

    Description of the information contained in the corresponding "markers summary" dataset

    Attribute name  Description
    POLY_ID         Unique polygon identifier
    OBS_ALL         Number of all available observations
    OBS_VALID       Number of all valid observations
    CROP_CODE       Crop identifier
    N_PIXEL         Number of S2 pixels within FOI
    PLOTIDCROP      PLOT_ID_CROP
    BS_APR_END      Number of bare-soil observations between 2022-03-01 and 2022-04-30
    BS_FEB_JUN      Number of bare-soil observations between 2022-02-01 and 2022-06-30
    BS_JAN_APR      Number of bare-soil observations between 2022-01-01 and 2022-04-30
    BS_JAN_MAR      Number of bare-soil observations between 2022-01-01 and 2022-03-31
    BS_OCT_DEC      Number of bare-soil observations between 2021-10-01 and 2021-12-31
    BS_OCT_SEP      Number of bare-soil observations between 2021-10-01 and 2022-09-30
    P_CROP          The FOI label as predicted by the crop group model
    CR_P_SCORE      The pseudoprobability of the crop-group prediction. A score close to 1 indicates that the model is very confident in the prediction
    D_GROUP         Declared crop group
    DIST_CROP       Most similar crop according to the distance marker
    DIST_SC00       Distance marker score of a most similar crop
    DIST_SCORE      Distance marker score of a FOI when compared to nearby FOIs with the same claim. A value close to 100 indicates that a FOI is not similar to other FOIs with the same claim.
    HOM_CLASS       Homogeneous/Heterogeneous
    HOM_SCORE       Homogeneity probability assigned by the homogeneity marker
    P_LAND_GR       The FOI label as predicted by the land group model using crop groupings based on land use
    LGRP_SCORE      The pseudoprobability of the crop-group prediction. A score close to 1 indicates that the model is very confident in the prediction.
    D_LAND          Declared land group
    M_NDVI_A_J      Mean NDVI value in interval 2022-04-01 till 2022-07-31
    M_NDVI_F_M      Mean NDVI value in interval 2022-02-01 till 2022-03-31
    M_NDVI_J_A      Mean NDVI value in interval 2022-01-01 till 2022-04-30
    MW_FEB_JUN      Number of mowing events between 2022-02-01 and 2022-06-30
    MW_OCT_SEP      Number of mowing events between 2021-10-01 and 2022-09-30
    MW_ALL          Number of mowing events in the observation period
    SIM_CROP        Most similar crop according to similarity marker
    SIM_SC00        Similarity marker score of a most similar crop
    SIM_SCORE       Similarity marker score of a FOI when compared to nearby FOIs with the same claim. A value close to 100 indicates that a FOI is not similar to other FOIs with the same claim.
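
As an example of how these attributes might be combined, the Python sketch below flags parcels whose declared crop group looks doubtful. The attribute names come from the listing above, but the records, the crop-group labels, and both thresholds are hypothetical; the sketch also assumes the rows have already been read from markers_summary.gpkg into plain dictionaries.

```python
def flag_suspicious(parcels, min_confidence=0.9, min_distance=80.0):
    """Return POLY_IDs whose declared crop group looks doubtful.

    A parcel is flagged when a confident prediction (CR_P_SCORE) disagrees
    with the declared group, or when DIST_SCORE is high, meaning the parcel
    is unlike nearby parcels with the same claim. Thresholds are illustrative.
    """
    flagged = []
    for p in parcels:
        disagrees = p["P_CROP"] != p["D_GROUP"]
        confident = p["CR_P_SCORE"] >= min_confidence
        dissimilar = p["DIST_SCORE"] >= min_distance
        if (disagrees and confident) or dissimilar:
            flagged.append(p["POLY_ID"])
    return flagged

# Made-up rows using the attribute names from the markers summary
parcels = [
    {"POLY_ID": 1, "P_CROP": "cereals", "D_GROUP": "cereals", "CR_P_SCORE": 0.97, "DIST_SCORE": 12.0},
    {"POLY_ID": 2, "P_CROP": "grassland", "D_GROUP": "cereals", "CR_P_SCORE": 0.95, "DIST_SCORE": 40.0},
    {"POLY_ID": 3, "P_CROP": "cereals", "D_GROUP": "cereals", "CR_P_SCORE": 0.60, "DIST_SCORE": 93.0},
    {"POLY_ID": 4, "P_CROP": "grassland", "D_GROUP": "cereals", "CR_P_SCORE": 0.55, "DIST_SCORE": 20.0},
]

flagged = flag_suspicious(parcels)
print(flagged)
```

Parcel 4 disagrees with its claim but the prediction is not confident, so it is left unflagged; a stricter review could lower `min_confidence`.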
    
  7. Data from: An empirical study of the semantic similarity of geospatial...

    • tandf.figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Niloofar Aflaki; Kristin Stock; Christopher B. Jones; Hans Guesgen; Jeremy Morley (2023). An empirical study of the semantic similarity of geospatial prepositions and their senses [Dataset]. http://doi.org/10.6084/m9.figshare.20517959.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Niloofar Aflaki; Kristin Stock; Christopher B. Jones; Hans Guesgen; Jeremy Morley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Spatial prepositions have been studied in some detail from multiple disciplinary perspectives. However, neither the semantic similarity of these prepositions, nor the relationships between the multiple senses of different spatial prepositions, are well understood. In an empirical study of 24 spatial prepositions, we identify the degree and nature of semantic similarity and extract senses for three semantically similar groups of prepositions using t-SNE, DBSCAN clustering, and Venn diagrams. We validate the work by manual annotation with another data set. We find nuances in meaning among proximity and adjacency prepositions, such as the use of close to instead of near for pairs of lines, and the importance of proximity over contact for the next to preposition, in contrast to other adjacency prepositions.

  8. Food Ingredients and Allergens

    • kaggle.com
    Updated May 24, 2023
    Cite
    Laksika Tharmalingam (2023). Food Ingredients and Allergens [Dataset]. https://www.kaggle.com/datasets/uom190346a/food-ingredients-and-allergens
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Laksika Tharmalingam
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Note: If you found this dataset useful then do upvote the dataset so it can reach further kagglers.

    The Food Allergens Dataset is a collection of information regarding allergens present in various food items. The dataset contains allergen information for a range of food ingredients, enabling the identification and analysis of potential allergens in different dishes and products. It serves as a valuable resource for researchers, food manufacturers, healthcare professionals, and individuals with food allergies.

    Size: The dataset consists of a total of 400 records, with each record representing a specific food item and its associated allergens.

    Allergens: The dataset includes a comprehensive list of allergens found in the food items. These allergens encompass a wide range of ingredients, such as dairy, wheat, nuts (almonds, peanuts, pine nuts), seafood (anchovies, fish, shellfish), grains (oats, rice), animal-based ingredients (chicken, pork), plant-based ingredients (celery, mustard, soybeans), and common ingredients (cocoa, eggs). Additionally, the dataset contains entries where no specific allergens are listed.

    Data Structure: The dataset is structured with multiple columns to provide detailed information. The columns include:

    Food Item: Represents the name of the food item.
    Ingredients: Lists the ingredients present in the food item, categorized into different columns such as sugar, salt, oil, spices, etc.
    Allergens: Indicates the allergens associated with the food item, including the specific allergenic ingredients present.
    Prediction: Labels whether the food product contains allergens or not (contains, does not contain).

    Potential Models and Analysis:

    Allergen Detection Model: A model trained on this dataset can predict whether a food item contains allergens.

    Ingredient Similarity Analysis: This analysis can provide insights into similarities and differences among different types of dishes.

    Allergen Prevalence Analysis: This analysis can provide insights into the prevalence of different allergens in food products.

    Recommender Systems: The dataset can also be used to develop recommender systems for individuals with specific dietary restrictions or allergies.
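
A prevalence analysis of the kind listed above could start from something like the following Python sketch. The records are invented, and the sketch assumes the Allergens column has already been parsed into a list per food item; the actual file layout may differ.

```python
from collections import Counter

def allergen_prevalence(records):
    """Return the fraction of food items in which each allergen appears."""
    counts = Counter()
    for record in records:
        # A set guards against an allergen being listed twice for one item
        for allergen in set(record["Allergens"]):
            counts[allergen] += 1
    total = len(records)
    return {name: n / total for name, n in counts.items()}

# Invented records using the column names described above
records = [
    {"Food Item": "item 1", "Allergens": ["dairy", "wheat"]},
    {"Food Item": "item 2", "Allergens": ["dairy"]},
    {"Food Item": "item 3", "Allergens": []},
    {"Food Item": "item 4", "Allergens": ["wheat", "eggs"]},
]

prevalence = allergen_prevalence(records)
print(prevalence)
```

Items without listed allergens still count toward the total, so the fractions describe prevalence across all food items rather than only the allergenic ones.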

  9. Assessing Low-Intensity Relationships in Complex Networks

    • plos.figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Andreas Spitz; Anna Gimmler; Thorsten Stoeck; Katharina Anna Zweig; Emőke-Ágnes Horvát (2023). Assessing Low-Intensity Relationships in Complex Networks [Dataset]. http://doi.org/10.1371/journal.pone.0152536
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Andreas Spitz; Anna Gimmler; Thorsten Stoeck; Katharina Anna Zweig; Emőke-Ágnes Horvát
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many large network data sets are noisy and contain links representing low-intensity relationships that are difficult to differentiate from random interactions. This is especially relevant for high-throughput data from systems biology, large-scale ecological data, but also for Web 2.0 data on human interactions. In these networks with missing and spurious links, it is possible to refine the data based on the principle of structural similarity, which assesses the shared neighborhood of two nodes. By using similarity measures to globally rank all possible links and choosing the top-ranked pairs, true links can be validated, missing links inferred, and spurious observations removed. While many similarity measures have been proposed to this end, there is no general consensus on which one to use. In this article, we first contribute a set of benchmarks for complex networks from three different settings (e-commerce, systems biology, and social networks) and thus enable a quantitative performance analysis of classic node similarity measures. Based on this, we then propose a new methodology for link assessment called z* that assesses the statistical significance of the number of their common neighbors by comparison with the expected value in a suitably chosen random graph model and which is a consistently top-performing algorithm for all benchmarks. In addition to a global ranking of links, we also use this method to identify the most similar neighbors of each single node in a local ranking, thereby showing the versatility of the method in two distinct scenarios and augmenting its applicability. Finally, we perform an exploratory analysis on an oceanographic plankton data set and find that the distribution of microbes follows similar biogeographic rules as those of macroorganisms, a result that rejects the global dispersal hypothesis for microbes.
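
The general idea of scoring a node pair by how surprising its number of common neighbors is can be sketched as follows. This is not the z* method from the article: the null model here is a simple hypergeometric one, chosen as an assumption for illustration, in which one node's neighbors are drawn at random from all other nodes. The toy graph is made up.

```python
import math

def common_neighbor_z(adj, u, v):
    """z-score of the common-neighbor count of u and v under a hypergeometric null."""
    nu = adj[u] - {v}
    nv = adj[v] - {u}
    pool = len(adj) - 2              # candidate neighbors: every node except u and v
    if pool < 2:
        return 0.0                   # graph too small for the variance formula
    observed = len(nu & nv)
    p = len(nu) / pool               # chance that a random node is a neighbor of u
    mean = len(nv) * p
    var = len(nv) * p * (1 - p) * (pool - len(nv)) / (pool - 1)
    return (observed - mean) / math.sqrt(var) if var > 0 else 0.0

# Toy undirected graph as an adjacency dictionary of sets
adj = {
    "a": {"c", "d", "e"}, "b": {"c", "d", "f"},
    "c": {"a", "b"}, "d": {"a", "b"},
    "e": {"a"}, "f": {"b"},
    "g": {"h"}, "h": {"g"},
}

z_ab = common_neighbor_z(adj, "a", "b")   # two shared neighbors
z_ef = common_neighbor_z(adj, "e", "f")   # no shared neighbors
print(round(z_ab, 3), round(z_ef, 3))
```

Ranking all node pairs by such a score gives a global ranking of candidate links, while ranking the partners of a single node gives the local ranking mentioned above.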

  10. Data from: Matching seed to site by climate similarity: Techniques to...

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Jan 10, 2017
    Cite
    Kyle D. Doherty; Bradley J. Butterfield; Troy E. Wood (2017). Matching seed to site by climate similarity: Techniques to prioritize plant materials development and use in restoration [Dataset]. http://doi.org/10.5061/dryad.43bv0
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 10, 2017
    Dataset provided by
    Dryad
    Authors
    Kyle D. Doherty; Bradley J. Butterfield; Troy E. Wood
    Time period covered
    2017
    Area covered
    global, western North America
    Description

    Climate Similarity Support Scripts
    This compressed file contains scripts and supporting data necessary to reproduce the analyses and products associated with Doherty et al. (2017), “Matching seed to site by climate similarity: Techniques to prioritize plant materials development and use in restoration.”
    Climate Similarity Scripts.zip

  11. DataSheet5_Classifying breast cancer using multi-view graph neural network...

    • frontiersin.figshare.com
    txt
    Updated Feb 20, 2024
    + more versions
    Cite
    Yanjiao Ren; Yimeng Gao; Wei Du; Weibo Qiao; Wei Li; Qianqian Yang; Yanchun Liang; Gaoyang Li (2024). DataSheet5_Classifying breast cancer using multi-view graph neural network based on multi-omics data.CSV [Dataset]. http://doi.org/10.3389/fgene.2024.1363896.s005
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 20, 2024
    Dataset provided by
    Frontiers
    Authors
    Yanjiao Ren; Yimeng Gao; Wei Du; Weibo Qiao; Wei Li; Qianqian Yang; Yanchun Liang; Gaoyang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: As evaluation indices, cancer grading and subtyping have diverse clinical, pathological, and molecular characteristics with prognostic and therapeutic implications. Although researchers have begun to study cancer differentiation and subtype prediction, most of the relevant methods are based on traditional machine learning and rely on single omics data. It is necessary to explore a deep learning algorithm that integrates multi-omics data to achieve classification prediction of cancer differentiation and subtypes.

    Methods: This paper proposes a multi-omics data fusion algorithm based on a multi-view graph neural network (MVGNN) for predicting cancer differentiation and subtype classification. The model framework consists of a graph convolutional network (GCN) module for learning features from different omics data and an attention module for integrating multi-omics data. Three different types of omics data are used. For each type of omics data, feature selection is performed using methods such as the chi-square test and minimum redundancy maximum relevance (mRMR). Weighted patient similarity networks are constructed based on the selected omics features, and the GCN is trained using omics features and the corresponding similarity networks. Finally, an attention module integrates different types of omics features and performs the final cancer classification prediction.

    Results: To validate the cancer classification predictive performance of the MVGNN model, we conducted experimental comparisons with traditional machine learning models and currently popular methods based on integrating multi-omics data using 5-fold cross-validation. Additionally, we performed comparative experiments on cancer differentiation and its subtypes based on single omics data, two omics data, and three omics data.

    Discussion: This paper proposed the MVGNN model, which performed well in cancer classification prediction based on multiple omics data.

  12. Data from: Structural Similarity of Hydrogen-Bonded Networks in Crystals of...

    • figshare.com
    • acs.figshare.com
    txt
    Updated Jun 5, 2023
    Cite
    Adam Duong; Thierry Maris; James D. Wuest (2023). Structural Similarity of Hydrogen-Bonded Networks in Crystals of Isomeric Pyridyl-Substituted Diaminotriazines [Dataset]. http://doi.org/10.1021/cg101290r.s003
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    ACS Publications
    Authors
    Adam Duong; Thierry Maris; James D. Wuest
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Crystals of isomeric pyridyl-substituted diaminotriazines 1a−c and elongated analogues 2a−b were grown under various conditions, and their structures were solved by X-ray crystallography. Analysis of the structures revealed three shared features: (1) The compounds favor flattened conformations; (2) they participate in approximately coplanar hydrogen bonding according to motifs characteristic of diaminotriazines; and (3) these interactions play a key role in directing molecular organization. Together, the consistent molecular topologies and the shared presence of a dominant site of association ensure that the compounds crystallize similarly to give structures that feature chains, tapes, and layers. In certain cases, in fact, the molecular organization adopted by different pyridyl-substituted diaminotriazines is virtually identical, even when the length of the molecule or the orientation of the pyridyl group has been changed. Together, these observations show how functional groups such as diaminotriazinyl, which can control association by forming multiple directional intermolecular interactions according to reliable patterns, can be incorporated within more complex molecular structures to determine how crystallization will occur.

  13. Mixed-effects model fitted to accuracy of all data with difficulty and an...

    • plos.figshare.com
    xls
    Updated Feb 11, 2025
    Cite
    Miki Ikuta; Koji Miwa (2025). Mixed-effects model fitted to accuracy of all data with difficulty and an interaction of language and category. [Dataset]. http://doi.org/10.1371/journal.pone.0318348.t002
    Available download formats: xls
    Dataset updated
    Feb 11, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Miki Ikuta; Koji Miwa
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mixed-effects model fitted to accuracy of all data with difficulty and an interaction of language and category.

  14. Data from: Dynamic effect of tonal similarity in bilingual auditory lexical...

    • tandf.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Junru Wu; Yiya Chen; Vincent J. van Heuven; Niels O. Schiller (2023). Dynamic effect of tonal similarity in bilingual auditory lexical processing [Dataset]. http://doi.org/10.6084/m9.figshare.7409225.v1
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Junru Wu; Yiya Chen; Vincent J. van Heuven; Niels O. Schiller
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Phonological similarity affects bilingual lexical access of etymologically-related translation equivalents (ETEs). Jinan Mandarin (JM) and Standard Chinese (SC) are closely related and share many ETEs, which are usually orthographically and segmentally identical but vary in tonal similarity. Using an auditory lexical decision experiment and Generalised Additive Modelling, the present study investigates how cross-linguistic tonal similarity interacts with language of operation and how the switching of language across blocks influences SC-JM bilinguals’ auditory lexical processing of ETEs. Bilinguals showed a language dominance effect, indicating that ETEs are specified with separated word-form representations. Compared with SC tonal monolinguals, bilinguals showed a discontinuous bilingual auditory lexical advantage, instead of a classical bilingual lexical disadvantage. The dynamic role of cross-linguistic tonal similarity in auditory word processing is discussed in light of the bilinguals’ attentional shift with the change of language mode at the pre-lexical and lexical stages.

  15. Codebook for primary data set.

    • figshare.com
    xlsx
    Updated May 30, 2025
    Cite
    Jan-Ole Hesselberg; Pål Ulleberg; Øystein Sørensen; Knut Inge Fostervold; Sigrid Hegna Ingvaldsen; Ida Svege (2025). Codebook for primary data set. [Dataset]. http://doi.org/10.1371/journal.pone.0322696.s001
    Available download formats: xlsx
    Dataset updated
    May 30, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Jan-Ole Hesselberg; Pål Ulleberg; Øystein Sørensen; Knut Inge Fostervold; Sigrid Hegna Ingvaldsen; Ida Svege
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Codebook describing the data in the primary data set used as the starting point for the analysis script (see S2 Script). (XLSX)


Cite
aahz78 (2024). Wine quality dataset with identified duplicates [Dataset]. https://www.kaggle.com/datasets/aahz78/wine-quality-dataset-with-identified-duplicates

Wine quality dataset with identified duplicates

Dataset updated
Aug 1, 2024
Dataset provided by
Kaggle http://kaggle.com/
Authors
aahz78
License

MIT License https://opensource.org/licenses/MIT
License information was derived automatically

Description

Overview

This dataset is derived from the original Wine Quality dataset and includes identified duplicates for further analysis and exploration. The original dataset consists of chemical properties of red and white wines along with their quality ratings.

Content

The dataset contains all the original features along with an additional column indicating the duplicate status. The duplicates were identified based on a comprehensive analysis that highlights records with high similarity. Additionally, the file ddrw.json contains information about red and white wines with 100% identical characteristics.
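
As a rough illustration of what a duplicate-status column can look like, the sketch below flags rows that are exact copies of one another. It is not the author's method: the description says duplicates were found by a similarity analysis, while this example only handles exact matches. The field names and the `is_duplicate` column name are assumptions made for the example.

```python
from collections import Counter

# Toy records; the field names are illustrative, not taken from the dataset files.
rows = [
    {"fixed_acidity": 7.4, "ph": 3.51, "alcohol": 9.4, "quality": 5},
    {"fixed_acidity": 7.8, "ph": 3.20, "alcohol": 9.8, "quality": 5},
    {"fixed_acidity": 7.4, "ph": 3.51, "alcohol": 9.4, "quality": 5},
]


def flag_duplicates(records):
    """Return copies of the records with a boolean 'is_duplicate' field.

    A record counts as a duplicate when at least one other record has
    exactly the same value in every field (exact match only).
    """
    # Build a hashable key per record from its sorted (field, value) pairs.
    keys = [tuple(sorted(r.items())) for r in records]
    counts = Counter(keys)
    return [dict(r, is_duplicate=counts[k] > 1) for r, k in zip(records, keys)]


flagged = flag_duplicates(rows)
for row in flagged:
    print(row)
```

Both copies of a repeated record are flagged here; whether the dataset marks every copy or only the later ones is not stated in the description.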

Description

This dataset aims to provide a refined version of the original wine quality data by highlighting duplicate entries. Duplicates in data can lead to misleading analysis and results. By identifying these duplicates, data scientists and analysts can better understand the structure of the data and apply necessary cleaning and preprocessing steps.

The file ddrw.json provides information on red and white wines that have 100% identical characteristics. This information can be useful for:

Studying the similarities between different types of wine.
Analyzing cases where two different types of wine have the same chemical properties and understanding the reasons behind these similarities.
Conducting a detailed analysis and improving machine learning models for wine quality prediction by considering identical records.
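
One way to look for red and white wines with identical characteristics is to index one table by its feature values and look up each row of the other. The sketch below assumes that approach and uses made-up feature names and values; the actual layout of ddrw.json is not documented here, so nothing in the example reads that file.

```python
# Hypothetical feature names; the real dataset has more chemical properties.
FEATURES = ("fixed_acidity", "ph", "alcohol")

# Toy data standing in for the red and white wine tables.
red = [
    {"fixed_acidity": 7.4, "ph": 3.51, "alcohol": 9.4},
    {"fixed_acidity": 6.9, "ph": 3.30, "alcohol": 10.1},
]
white = [
    {"fixed_acidity": 7.0, "ph": 3.00, "alcohol": 8.8},
    {"fixed_acidity": 7.4, "ph": 3.51, "alcohol": 9.4},
]


def identical_across(red_rows, white_rows, features=FEATURES):
    """Return (red_index, white_index) pairs whose feature values all match."""

    def key(row):
        return tuple(row[f] for f in features)

    # Index the white wines by their feature tuple for constant-time lookup.
    white_index = {}
    for j, row in enumerate(white_rows):
        white_index.setdefault(key(row), []).append(j)

    return [
        (i, j)
        for i, row in enumerate(red_rows)
        for j in white_index.get(key(row), [])
    ]


pairs = identical_across(red, white)
print(pairs)
```

Exact comparison of floating-point values works here because the toy values are identical literals; values that come from different computations or roundings may need a tolerance instead.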

Key Features

Comprehensive Duplicate Identification: The dataset includes duplicates identified through a robust process, ensuring high accuracy.
High Similarity Analysis: The dataset highlights the most and least similar records, providing insights into the nature of the duplicates.
Enhanced Data Quality: By focusing on duplicate detection, this dataset helps in enhancing the overall quality of the data for more accurate analysis.
File ddrw.json: Contains information about 100% identical characteristics of red and white wines, which can be useful for in-depth analysis.

Usage

This dataset is useful for:

Data cleaning and preprocessing exercises.
Duplicate detection and handling techniques.
Exploring the impact of duplicates on data analysis and machine learning models.
Educational purposes for understanding the importance of data quality.
Studying similarities between different types of wine and their characteristics.

File Structure

1dd.json: Red wine duplicate records.
1ddw.json: White wine duplicate records.
ddrw.json: A file containing information about 100% identical characteristics of red and white wines.

Acknowledgements

This dataset is built upon the original Wine Quality dataset by Abdelaziz Sami. Special thanks to the original contributors.
