15 datasets found

Wine quality dataset with identified duplicates
kaggle.com
Updated Aug 1, 2024
Cite
aahz78 (2024). Wine quality dataset with identified duplicates [Dataset]. https://www.kaggle.com/datasets/aahz78/wine-quality-dataset-with-identified-duplicates
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Aug 1, 2024
Dataset provided by
Kaggle: http://kaggle.com/
Authors
aahz78
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Overview

This dataset is derived from the original Wine Quality dataset and includes identified duplicates for further analysis and exploration. The original dataset consists of chemical properties of red and white wines along with their quality ratings.

Content

The dataset contains all the original features along with an additional column indicating the duplicate status. The duplicates were identified based on a comprehensive analysis that highlights records with high similarity. Additionally, the file ddrw.json contains information about red and white wines with 100% identical characteristics.

Description

This dataset aims to provide a refined version of the original wine quality data by highlighting duplicate entries. Duplicates in data can lead to misleading analysis and results. By identifying these duplicates, data scientists and analysts can better understand the structure of the data and apply necessary cleaning and preprocessing steps.

The file ddrw.json provides information on red and white wines that have 100% identical characteristics. This information can be useful for:

- Studying the similarities between different types of wine.
- Analyzing cases where two different types of wine have the same chemical properties and understanding the reasons behind these similarities.
- Conducting a detailed analysis and improving machine learning models for wine quality prediction by considering identical records.

Key Features

- Comprehensive Duplicate Identification: The dataset includes duplicates identified through a robust process, ensuring high accuracy.
- High Similarity Analysis: The dataset highlights the most and least similar records, providing insights into the nature of the duplicates.
- Enhanced Data Quality: By focusing on duplicate detection, this dataset helps in enhancing the overall quality of the data for more accurate analysis.
- File ddrw.json: Contains information about 100% identical characteristics of red and white wines, which can be useful for in-depth analysis.

Usage

This dataset is useful for:

- Data cleaning and preprocessing exercises.
- Duplicate detection and handling techniques.
- Exploring the impact of duplicates on data analysis and machine learning models.
- Educational purposes for understanding the importance of data quality.
- Studying similarities between different types of wine and their characteristics.

File Structure

- 1dd.json: Red wine duplicate records.
- 1ddw.json: White wine duplicate records.
- ddrw.json: Information about red and white wines with 100% identical characteristics.
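
A minimal sketch of exact-duplicate grouping, under stated assumptions: the records, the column names (`type`, `fixed_acidity`, `alcohol`, `quality`), and both helper functions below are hypothetical and are not read from 1dd.json, 1ddw.json, or ddrw.json, whose internal layout is not described here. It only illustrates how records with 100% identical feature values can be grouped, and how groups that span both red and white wines can be picked out. The dataset author's similarity analysis may work differently.

```python
from collections import defaultdict

# Hypothetical records: column names and values are illustrative only,
# not taken from the actual dataset files.
records = [
    {"type": "red", "fixed_acidity": 7.4, "alcohol": 9.4, "quality": 5},
    {"type": "red", "fixed_acidity": 7.8, "alcohol": 9.8, "quality": 5},
    {"type": "red", "fixed_acidity": 7.4, "alcohol": 9.4, "quality": 5},
    {"type": "white", "fixed_acidity": 7.8, "alcohol": 9.8, "quality": 5},
]

# Columns compared when deciding whether two records are identical (assumed).
FEATURES = ("fixed_acidity", "alcohol", "quality")


def group_identical(rows, features=FEATURES):
    """Group row indices by feature values; keep groups with more than one row."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[name] for name in features)
        groups[key].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


def cross_type_groups(rows, groups):
    """Keep only the groups that contain both a red and a white wine."""
    return [g for g in groups if {rows[i]["type"] for i in g} >= {"red", "white"}]


duplicate_groups = group_identical(records)
red_white_groups = cross_type_groups(records, duplicate_groups)
print(duplicate_groups)
print(red_white_groups)
```

Grouping on a tuple of feature values finds only exact matches. Near-duplicates ("high similarity") would need a distance threshold instead, which this sketch does not attempt.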

Acknowledgements

This dataset is built upon the original Wine Quality dataset by Abdelaziz Sami. Special thanks to the original contributors.
Data from: Representations of emotion concepts: Comparison across pairwise,...
search.dataone.org
dataverse.harvard.edu
Updated Nov 8, 2023
Cite
Kwon, Mijin; Wager, Tor; Phillips, Jonathan (2023). Representations of emotion concepts: Comparison across pairwise, appraisal feature-based, and word embedding-based similarity spaces [Dataset]. http://doi.org/10.7910/DVN/6DPPKH
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/6DPPKH
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Kwon, Mijin; Wager, Tor; Phillips, Jonathan
Description
This repository contains code and data for the publication "Representations of emotion concepts: Comparison across pairwise, appraisal feature-based, and word embedding-based similarity spaces" by Kwon, M., Wager, T., & Phillips, J. (2022), published in the Proceedings of the Annual Meeting of the Cognitive Science Society, 44(44), and available at https://escholarship.org/uc/item/8vj3d366. All code is written in R.

Updates

We have updated our analyses based on feedback received since the annual meeting. The changes do not alter our main findings or conclusion. The included analysis script (EMOCON_cogsci2022_revised.Rmd) reflects these updates:

- Additional emotion concept pairs: Pairs involving comfortableness, gratefulness, relaxedness, romanticness, sereneness, and protectiveness, which were omitted in the previous version, were added to the analysis. The revised script includes all pairs and reflects changes in the feature-based similarity matrix, the correlation between the similarity measures, the loading values from principal component analysis (PCA) on appraisal features, the regression coefficients from multiple regression with the affective feature components from PCA, and the difference scores of the affective feature components.
- Scaling before PCA: PCA was rerun after rescaling the data.

script

Scripts used for analysis are included in an R Markdown file.

data

Data for appraisal feature ratings, pairwise similarity ratings, and word embeddings from word2vec models trained on Google News and Wikipedia are included. References for these data are listed below and also in the paper. Word embeddings from GPT-3 can be accessed via OpenAI's API; see OpenAI's documentation for usage.

- Word embeddings from the word2vec model trained on Google News: Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 26.
- Word embeddings from the word2vec model trained on Wikipedia: Fares, M., Kutuzov, A., Oepen, S., & Velldal, E. (2017). Word vectors, reuse, and replicability: Towards a community repository of large-text resources. Proceedings of the 21st Nordic Conference on Computational Linguistics, 271–276.
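
As a rough illustration of a word embedding-based similarity space, the sketch below builds a pairwise similarity table from toy vectors. Cosine similarity is an assumption here (the listing does not state which measure the authors used), the two-dimensional vectors are invented, and the repository's own code is in R rather than Python.

```python
import math
from itertools import combinations

# Invented 2-D "embeddings"; real word2vec vectors have hundreds of dimensions.
embeddings = {
    "joy": [3.0, 4.0],
    "calm": [4.0, 3.0],
    "fear": [-4.0, 3.0],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One similarity score per unordered pair of concepts.
similarity = {
    (a, b): cosine(embeddings[a], embeddings[b])
    for a, b in combinations(sorted(embeddings), 2)
}
for pair, score in similarity.items():
    print(pair, round(score, 2))
```

A table like this can then be compared with a human-rated pairwise similarity matrix, for example by correlating the scores of matching pairs.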
Data and code for: Two dominant forms of multisite similarity decline –...
data.niaid.nih.gov
datadryad.org
+1 more
zip
Updated Sep 10, 2024
Cite
David Deane (2024). Data and code for: Two dominant forms of multisite similarity decline – their origins and interpretation [Dataset]. http://doi.org/10.5061/dryad.rbnzs7hds
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.rbnzs7hds
Dataset updated
Sep 10, 2024
Dataset provided by
La Trobe University
Authors
David Deane
License
https://spdx.org/licenses/CC0-1.0.html
Description
Contains data and code for peer review of the draft manuscript 'Two dominant forms of multisite similarity decline – their origins and interpretation', in review at Ecology and Evolution (Manuscript ID: ECE-2022-10-01523). The data are a subset of the metaCommunity Ecology: Species, Traits, Environment and Space ("CESTES") database, reported in "A global database for metacommunity ecology, integrating species, traits, environment and space" by A. Jeliazkov, D. Mijatovic, S. Chantepie, N. Andrew, R. Arlettaz, L. Barbaro, et al., Scientific Data 2020, Vol. 7, Issue 1, Pages e6. Data were downloaded from the Figshare repository (https://doi.org/10.6084/m9.figshare.c.4459637) on 9 Nov 2021.

Methods

The 80 datasets in the original database were first filtered to select only those with abundance (or similar quantitative) estimates of species in each site. This yielded 69 datasets. Several of these were experimental treatments (e.g., logged vs unlogged) or were time series re-surveys of the same sites. Because our interest was in structures within a single habitat type and point in time, these datasets were subdivided into discrete units. In total, this (coincidentally) resulted in 80 datasets for analysis. These data are in the cestesAbSplit.RData object, with matching metadata in mdat_cestesSplit.csv.

- cestesAbSplit.RData: a list containing the sites x species matrices for the 80 datasets used for analysis.
- mdat_cestesSplit.csv: matching metadata for the 80 datasets (name and source of the data, taxonomic grouping, kingdom, realm, extent in km2, total sites, total species). This file is only included to run the R analysis. Data are repeated, with full description in Appendix S5 (see Readme tab).
- Appendix S5 Metadata.xlsx: metadata and values of standardized effect sizes from empirical analyses, with a ReadMe file explaining each field.

The effect sizes were calculated using the scripts in Rcode_form_of_zeta.R (requires R_function.R) and can be reproduced by running the scripts (some very small numerical differences might occur in simulated values based on randomisation).
Data from: Codon similarity data in ATTED-II ver 8.0 (Bra, Mtr)
data.niaid.nih.gov
Updated Jul 10, 2023
+ more versions
Cite
Obayashi, Takeshi (2023). Codon similarity data in ATTED-II ver 8.0 (Bra, Mtr) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8127407
Explore at:
Dataset updated
Jul 10, 2023
Dataset provided by
Obayashi, Takeshi
Aoki, Yuichi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Codon similarity data in ATTED-II ver 8.0

The gene-to-gene codon similarity data is organized in the form of tables, each named according to the Entrez Gene ID of a particular query gene. Each table encompasses three columns, specifying: the Entrez Gene ID of a corresponding gene, an MR (Mutual Rank) value (where a smaller number signifies a stronger relationship), and a Pearson correlation coefficient (where a larger number suggests a stronger association).

Protein-coding sequences utilized in this study were retrieved from NCBI's RefSeq database. For each gene, a 61-dimensional vector was derived from the count of codons in the protein-coding sequence. In instances where multiple RefSeq sequences were associated with a single gene, the longest sequence was selected for the codon usage calculation. Pearson correlation coefficients (PCCs) were determined between the vectors of any two given genes. These PCCs were subsequently converted into MRs, employed as an index to evaluate the similarity in codon usage between the genes.
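
The PCC-to-MR conversion described above can be sketched as follows. This is an illustration under assumptions, not the ATTED-II pipeline: the listing does not spell out the MR formula, so the sketch assumes the commonly used definition (the geometric mean of the rank of gene B in gene A's list and the rank of gene A in gene B's list). The gene names are made up, and the vectors have 4 dimensions rather than 61 to keep the example small.

```python
import math

# Hypothetical codon-count vectors (4 dimensions here; the dataset uses 61).
codon_counts = {
    "g1": [1, 2, 3, 4],
    "g2": [1, 2, 3, 5],
    "g3": [4, 3, 2, 1],
    "g4": [1, 3, 2, 4],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mutual_ranks(vectors):
    """Return {(gene_a, gene_b): MR} for every unordered gene pair."""
    genes = sorted(vectors)
    pcc = {a: {b: pearson(vectors[a], vectors[b]) for b in genes if b != a}
           for a in genes}
    # rank[a][b] is the position of b among a's partners, 1 = highest PCC.
    rank = {}
    for a in genes:
        ordered = sorted(pcc[a], key=lambda b: pcc[a][b], reverse=True)
        rank[a] = {b: position + 1 for position, b in enumerate(ordered)}
    # Assumed MR definition: geometric mean of the two directional ranks.
    return {(a, b): math.sqrt(rank[a][b] * rank[b][a])
            for a in genes for b in genes if a < b}


mr = mutual_ranks(codon_counts)
for pair in sorted(mr, key=mr.get):
    print(pair, round(mr[pair], 3))
```

Because MR is built from ranks in both directions, a pair scores well only when each gene is near the top of the other's list, which is why a smaller MR signifies a stronger relationship.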
Data from: Assessing identity, redundancy and confounds in Gene Ontology...
search.dataone.org
borealisdata.ca
+1 more
Updated Dec 28, 2023
Cite
Gillis, Jesse; Pavlidis, Paul (2023). Assessing identity, redundancy and confounds in Gene Ontology annotations over time [Dataset]. http://doi.org/10.5683/SP2/ZLJTVW
Explore at:
Unique identifier
https://doi.org/10.5683/SP2/ZLJTVW
Dataset updated
Dec 28, 2023
Dataset provided by
Borealis
Authors
Gillis, Jesse; Pavlidis, Paul
Description
MOTIVATION: The Gene Ontology (GO) is heavily used in systems biology, but the potential for redundancy, confounds with other data sources and problems with stability over time have been little explored.

RESULTS: We report that GO annotations are stable over short periods, with 3% of genes not being most semantically similar to themselves between monthly GO editions. However, we find that genes can alter their 'functional identity' over time, with 20% of genes not matching to themselves (by semantic similarity) after 2 years. We further find that annotation bias in GO, in which some genes are more characterized than others, has declined in yeast, but generally increased in humans. Finally, we discovered that many entries in protein interaction databases are owing to the same published reports that are used for GO annotations, with 66% of assessed GO groups exhibiting this confound. We provide a case study to illustrate how this information can be used in analyses of gene sets and networks.

AVAILABILITY: Data available at http://chibi.ubc.ca/assessGO.

EO-based area monitoring markers computed over the Cypriotic pilot region...

data.niaid.nih.gov

Updated Oct 6, 2022

+ more versions

Cite

Nejc Vesel (2022). EO-based area monitoring markers computed over the Cypriotic pilot region (2022) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7139348

Explore at:

Dataset updated

Oct 6, 2022

Dataset provided by

Grega Milcinski
Nejc Vesel

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

In the context of the EU-funded project DIONE (No. 870378), the following EO-based monitoring marker maps were released over the two pilot regions, containing the results produced from a set of image analysis and machine learning techniques. The latest explores the benefits of Copernicus's multispectral high-resolution Sentinel-2 data acquired from 01-08-2021 until 03-09-2022 and provides tailored information on the needs of European paying agencies (e.g. CAPO and NPA), expressed with the following markers.

Mowing marker: used to detect mowing events on meadow/grass like Features Of Interest (FOI)

Mean-NDVI marker: used to detect erroneous claims with no vegetation

Homogeneity marker: used to determine if a parcel geometry consists of a single crop or if multiple things are growing on the parcel

Bare soil marker: used to detect observation where bare soil is present on the feature of interest. This indicates agricultural activity on the FOI (plowing, harvest)

Similarity and distance markers: used to give additional context to the crop classification and to detect erroneous claims

Land marker: used to detect the land type and non-productive EFAs of the FOI

Crop-type marker: used to detect the specific crop growing on the FOI

This dataset is comprised of the geopackage file "markers_summary.gpkg", which was computed for the Cypriotic pilot region. Descriptions are given below.

Markers summary dataset

Description of the information contained in the corresponding "markers summary" dataset


| Attribute name | Description |
| --- | --- |
| POLY_ID | Unique polygon identifier |
| OBS_ALL | Number of all available observations |
| OBS_VALID | Number of all valid observations |
| CROP_CODE | Crop identifier |
| N_PIXEL | Number of S2 pixels within FOI |
| PLOTIDCROP | PLOT_ID_CROP |
| BS_APR_END | Number of bare-soil observations between 2022-03-01 and 2022-04-30 |
| BS_FEB_JUN | Number of bare-soil observations between 2022-02-01 and 2022-06-30 |
| BS_JAN_APR | Number of bare-soil observations between 2022-01-01 and 2022-04-30 |
| BS_JAN_MAR | Number of bare-soil observations between 2022-01-01 and 2022-03-31 |
| BS_OCT_DEC | Number of bare-soil observations between 2021-10-01 and 2021-12-31 |
| BS_OCT_SEP | Number of bare-soil observations between 2021-10-01 and 2022-09-30 |
| P_CROP | The FOI label as predicted by the crop group model |
| CR_P_SCORE | The pseudoprobability of the crop-group prediction. A score close to 1 indicates that the model is very confident in the prediction |
| D_GROUP | Declared crop group |
| DIST_CROP | Most similar crop according to the distance marker |
| DIST_SC00 | Distance marker score of a most similar crop |
| DIST_SCORE | Distance marker score of a FOI when compared to nearby FOIs with the same claim. A value close to 100 indicates that a FOI is not similar to other FOIs with the same claim. |
| HOM_CLASS | Homogeneous/Heterogeneous |
| HOM_SCORE | Homogeneity probability assigned by the homogeneity marker |
| P_LAND_GR | The FOI label as predicted by the land group model using crop groupings based on land use |
| LGRP_SCORE | The pseudoprobability of the crop-group prediction. A score close to 1 indicates that the model is very confident in the prediction. |
| D_LAND | Declared land group |
| M_NDVI_A_J | Mean NDVI value in interval 2022-04-01 till 2022-07-31 |
| M_NDVI_F_M | Mean NDVI value in interval 2022-02-01 till 2022-03-31 |
| M_NDVI_J_A | Mean NDVI value in interval 2022-01-01 till 2022-04-30 |
| MW_FEB_JUN | Number of mowing events between 2022-02-01 and 2022-06-30 |
| MW_OCT_SEP | Number of mowing events between 2021-10-01 and 2022-09-30 |
| MW_ALL | Number of mowing events in the observation period |
| SIM_CROP | Most similar crop according to similarity marker |
| SIM_SC00 | Similarity marker score of a most similar crop |
| SIM_SCORE | Similarity marker score of a FOI when compared to nearby FOIs with the same claim. A value close to 100 indicates that a FOI is not similar to other FOIs with the same claim. |
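
One way the attributes above might be combined to shortlist parcels for review is sketched below. The rows, the thresholds (0.9 and 80.0), and the `flag_for_review` helper are all hypothetical, and the data are plain dictionaries rather than records read from markers_summary.gpkg. Only the attribute names and their stated meanings come from the table.

```python
# Invented example rows using attribute names from the markers summary table.
rows = [
    {"POLY_ID": 1, "D_GROUP": "cereals", "P_CROP": "cereals",
     "CR_P_SCORE": 0.97, "DIST_SCORE": 12.0},
    {"POLY_ID": 2, "D_GROUP": "cereals", "P_CROP": "fallow",
     "CR_P_SCORE": 0.93, "DIST_SCORE": 88.0},
    {"POLY_ID": 3, "D_GROUP": "olives", "P_CROP": "citrus",
     "CR_P_SCORE": 0.55, "DIST_SCORE": 40.0},
]


def flag_for_review(row, min_confidence=0.9, min_distance=80.0):
    """Flag a FOI when the model confidently disagrees with the declared
    crop group, or when the FOI looks unlike nearby FOIs with the same claim.
    Both thresholds are illustrative assumptions, not project values."""
    confident_mismatch = (
        row["P_CROP"] != row["D_GROUP"] and row["CR_P_SCORE"] >= min_confidence
    )
    atypical = row["DIST_SCORE"] >= min_distance
    return confident_mismatch or atypical


flagged = [row["POLY_ID"] for row in rows if flag_for_review(row)]
print(flagged)
```

Parcel 3 is not flagged in this toy data because the mismatch comes with a low pseudoprobability, so the prediction is too uncertain to contradict the claim.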

Data from: An empirical study of the semantic similarity of geospatial...
tandf.figshare.com
pdf
Updated May 30, 2023
Cite
Niloofar Aflaki; Kristin Stock; Christopher B. Jones; Hans Guesgen; Jeremy Morley (2023). An empirical study of the semantic similarity of geospatial prepositions and their senses [Dataset]. http://doi.org/10.6084/m9.figshare.20517959.v1
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.20517959.v1
Dataset updated
May 30, 2023
Dataset provided by
Taylor & Francis
Authors
Niloofar Aflaki; Kristin Stock; Christopher B. Jones; Hans Guesgen; Jeremy Morley
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Spatial prepositions have been studied in some detail from multiple disciplinary perspectives. However, neither the semantic similarity of these prepositions, nor the relationships between the multiple senses of different spatial prepositions, are well understood. In an empirical study of 24 spatial prepositions, we identify the degree and nature of semantic similarity and extract senses for three semantically similar groups of prepositions using t-SNE, DBSCAN clustering, and Venn diagrams. We validate the work by manual annotation with another data set. We find nuances in meaning among proximity and adjacency prepositions, such as the use of close to instead of near for pairs of lines, and the importance of proximity over contact for the next to preposition, in contrast to other adjacency prepositions.
Food Ingredients and Allergens
kaggle.com
Updated May 24, 2023
Cite
Laksika Tharmalingam (2023). Food Ingredients and Allergens [Dataset]. https://www.kaggle.com/datasets/uom190346a/food-ingredients-and-allergens
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 24, 2023
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Laksika Tharmalingam
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Note: If you found this dataset useful then do upvote the dataset so it can reach further kagglers.

The Food Allergens Dataset is a collection of information regarding allergens present in various food items. The dataset contains allergen information for a range of food ingredients, enabling the identification and analysis of potential allergens in different dishes and products. It serves as a valuable resource for researchers, food manufacturers, healthcare professionals, and individuals with food allergies.

Size: The dataset consists of a total of 400 records, with each record representing a specific food item and its associated allergens.

Allergens: The dataset includes a comprehensive list of allergens found in the food items. These allergens encompass a wide range of ingredients, such as dairy, wheat, nuts (almonds, peanuts, pine nuts), seafood (anchovies, fish, shellfish), grains (oats, rice), animal-based ingredients (chicken, pork), plant-based ingredients (celery, mustard, soybeans), and common ingredients (cocoa, eggs). Additionally, the dataset contains entries where no specific allergens are listed.

Data Structure: The dataset is structured with multiple columns to provide detailed information. The columns include:
- Food Item: The name of the food item.
- Ingredients: The ingredients present in the food item, categorized into different columns such as sugar, salt, oil, spices, etc.
- Allergens: The allergens associated with the food item, including the specific allergenic ingredients present.
- Prediction: Whether the food product contains allergens (contains / does not contain).

Potential Models and Analysis:

Allergen Detection Model: A classifier can be trained to predict whether a food item contains allergens.

Ingredient Similarity Analysis: This analysis can provide insights into similarities and differences among different types of dishes.

Allergen Prevalence Analysis: The dataset can give insights into the prevalence of different allergens in food products.

Recommender Systems: The dataset can also be used to develop recommender systems for individuals with specific dietary restrictions or allergies.
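
The ingredient similarity analysis mentioned above could be sketched as follows. Jaccard similarity over ingredient sets is an assumed choice (the dataset description does not name a measure), and the food items and ingredient lists are invented for illustration.

```python
from itertools import combinations

# Invented food items; the real dataset spreads ingredients across columns.
foods = {
    "almond cookies": ["almonds", "flour", "sugar", "butter"],
    "shortbread": ["flour", "sugar", "butter"],
    "caesar salad": ["lettuce", "anchovies", "parmesan"],
}


def jaccard(a, b):
    """Share of ingredients two dishes have in common (0.0 to 1.0)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def most_similar_pair(items):
    """Return the pair of dish names with the highest Jaccard similarity."""
    pairs = combinations(sorted(items), 2)
    return max(pairs, key=lambda pair: jaccard(items[pair[0]], items[pair[1]]))


best = most_similar_pair(foods)
print(best, jaccard(foods[best[0]], foods[best[1]]))
```

The same pairwise scores could feed a simple recommender: for a user who must avoid an allergen, suggest the most similar dish whose ingredient set does not include it.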
Assessing Low-Intensity Relationships in Complex Networks
plos.figshare.com
pdf
Updated May 30, 2023
Cite
Andreas Spitz; Anna Gimmler; Thorsten Stoeck; Katharina Anna Zweig; Emőke-Ágnes Horvát (2023). Assessing Low-Intensity Relationships in Complex Networks [Dataset]. http://doi.org/10.1371/journal.pone.0152536
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0152536
Dataset updated
May 30, 2023
Dataset provided by
PLOS ONE
Authors
Andreas Spitz; Anna Gimmler; Thorsten Stoeck; Katharina Anna Zweig; Emőke-Ágnes Horvát
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Many large network data sets are noisy and contain links representing low-intensity relationships that are difficult to differentiate from random interactions. This is especially relevant for high-throughput data from systems biology, large-scale ecological data, but also for Web 2.0 data on human interactions. In these networks with missing and spurious links, it is possible to refine the data based on the principle of structural similarity, which assesses the shared neighborhood of two nodes. By using similarity measures to globally rank all possible links and choosing the top-ranked pairs, true links can be validated, missing links inferred, and spurious observations removed. While many similarity measures have been proposed to this end, there is no general consensus on which one to use. In this article, we first contribute a set of benchmarks for complex networks from three different settings (e-commerce, systems biology, and social networks) and thus enable a quantitative performance analysis of classic node similarity measures. Based on this, we then propose a new methodology for link assessment called z* that assesses the statistical significance of the number of their common neighbors by comparison with the expected value in a suitably chosen random graph model and which is a consistently top-performing algorithm for all benchmarks. In addition to a global ranking of links, we also use this method to identify the most similar neighbors of each single node in a local ranking, thereby showing the versatility of the method in two distinct scenarios and augmenting its applicability. Finally, we perform an exploratory analysis on an oceanographic plankton data set and find that the distribution of microbes follows similar biogeographic rules as those of macroorganisms, a result that rejects the global dispersal hypothesis for microbes.
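
To make the idea of ranking links by structural similarity concrete, the sketch below scores every node pair by its raw number of common neighbors and sorts the pairs globally. This is the plain common-neighbor count, not the z* method from the article, which additionally compares that count with its expected value in a random graph model. The toy graph and function names are invented.

```python
from itertools import combinations

# Invented undirected toy graph.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]


def build_adjacency(edge_list):
    """Map each node to the set of its neighbors."""
    adjacency = {}
    for u, v in edge_list:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def rank_pairs_by_common_neighbors(edge_list):
    """Globally rank all node pairs, most shared neighbors first.
    Ties are broken alphabetically so the order is deterministic."""
    adjacency = build_adjacency(edge_list)
    scores = {
        (u, v): len(adjacency[u] & adjacency[v])
        for u, v in combinations(sorted(adjacency), 2)
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_pairs_by_common_neighbors(edges)
for pair, score in ranking[:3]:
    print(pair, score)
```

In this toy graph the top-ranked pair, a and d, is not an observed edge, so a ranking like this would suggest it as a candidate missing link, while low-scoring observed edges are candidates for spurious ones.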
Data from: Matching seed to site by climate similarity: Techniques to...
datadryad.org
data.niaid.nih.gov
zip
Updated Jan 10, 2017
Cite
Kyle D. Doherty; Bradley J. Butterfield; Troy E. Wood (2017). Matching seed to site by climate similarity: Techniques to prioritize plant materials development and use in restoration [Dataset]. http://doi.org/10.5061/dryad.43bv0
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.43bv0
Dataset updated
Jan 10, 2017
Dataset provided by
Dryad
Authors
Kyle D. Doherty; Bradley J. Butterfield; Troy E. Wood
Time period covered
2017
Area covered
global, western North America
Description
Climate Similarity Support Scripts: This compressed file contains scripts and supporting data necessary to reproduce the analyses and products associated with Doherty et al. (2017), “Matching seed to site by climate similarity: Techniques to prioritize plant materials development and use in restoration.” File: Climate Similarity Scripts.zip
DataSheet5_Classifying breast cancer using multi-view graph neural network...
frontiersin.figshare.com
txt
Updated Feb 20, 2024
+ more versions
Cite
Yanjiao Ren; Yimeng Gao; Wei Du; Weibo Qiao; Wei Li; Qianqian Yang; Yanchun Liang; Gaoyang Li (2024). DataSheet5_Classifying breast cancer using multi-view graph neural network based on multi-omics data.CSV [Dataset]. http://doi.org/10.3389/fgene.2024.1363896.s005
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.3389/fgene.2024.1363896.s005
Dataset updated
Feb 20, 2024
Dataset provided by
Frontiers
Authors
Yanjiao Ren; Yimeng Gao; Wei Du; Weibo Qiao; Wei Li; Qianqian Yang; Yanchun Liang; Gaoyang Li
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction: As the evaluation indices, cancer grading and subtyping have diverse clinical, pathological, and molecular characteristics with prognostic and therapeutic implications. Although researchers have begun to study cancer differentiation and subtype prediction, most relevant methods are based on traditional machine learning and rely on single omics data. It is necessary to explore a deep learning algorithm that integrates multi-omics data to achieve classification prediction of cancer differentiation and subtypes.

Methods: This paper proposes a multi-omics data fusion algorithm based on a multi-view graph neural network (MVGNN) for predicting cancer differentiation and subtype classification. The model framework consists of a graph convolutional network (GCN) module for learning features from different omics data and an attention module for integrating multi-omics data. Three different types of omics data are used. For each type of omics data, feature selection is performed using methods such as the chi-square test and minimum redundancy maximum relevance (mRMR). Weighted patient similarity networks are constructed based on the selected omics features, and GCN is trained using omics features and corresponding similarity networks. Finally, an attention module integrates different types of omics features and performs the final cancer classification prediction.

Results: To validate the cancer classification predictive performance of the MVGNN model, we conducted experimental comparisons with traditional machine learning models and currently popular methods based on integrating multi-omics data using 5-fold cross-validation. Additionally, we performed comparative experiments on cancer differentiation and its subtypes based on single omics data, two omics data, and three omics data.

Discussion: This paper proposed the MVGNN model and it performed well in cancer classification prediction based on multiple omics data.
Data from: Structural Similarity of Hydrogen-Bonded Networks in Crystals of...
figshare.com
acs.figshare.com
txt
Updated Jun 5, 2023
Cite
Adam Duong; Thierry Maris; James D. Wuest (2023). Structural Similarity of Hydrogen-Bonded Networks in Crystals of Isomeric Pyridyl-Substituted Diaminotriazines [Dataset]. http://doi.org/10.1021/cg101290r.s003
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.1021/cg101290r.s003
Dataset updated
Jun 5, 2023
Dataset provided by
ACS Publications
Authors
Adam Duong; Thierry Maris; James D. Wuest
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Crystals of isomeric pyridyl-substituted diaminotriazines 1a−c and elongated analogues 2a−b were grown under various conditions, and their structures were solved by X-ray crystallography. Analysis of the structures revealed three shared features: (1) The compounds favor flattened conformations; (2) they participate in approximately coplanar hydrogen bonding according to motifs characteristic of diaminotriazines; and (3) these interactions play a key role in directing molecular organization. Together, the consistent molecular topologies and the shared presence of a dominant site of association ensure that the compounds crystallize similarly to give structures that feature chains, tapes, and layers. In certain cases, in fact, the molecular organization adopted by different pyridyl-substituted diaminotriazines is virtually identical, even when the length of the molecule or the orientation of the pyridyl group has been changed. Together, these observations show how functional groups such as diaminotriazinyl, which can control association by forming multiple directional intermolecular interactions according to reliable patterns, can be incorporated within more complex molecular structures to determine how crystallization will occur.
Mixed-effects model fitted to accuracy of all data with difficulty and an...
plos.figshare.com
xls
Updated Feb 11, 2025
Cite
Miki Ikuta; Koji Miwa (2025). Mixed-effects model fitted to accuracy of all data with difficulty and an interaction of language and category. [Dataset]. http://doi.org/10.1371/journal.pone.0318348.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0318348.t002
Dataset updated
Feb 11, 2025
Dataset provided by
PLOS ONE
Authors
Miki Ikuta; Koji Miwa
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Mixed-effects model fitted to accuracy of all data with difficulty and an interaction of language and category.
Data from: Dynamic effect of tonal similarity in bilingual auditory lexical...
tandf.figshare.com
docx
Updated May 31, 2023
Cite
Junru Wu; Yiya Chen; Vincent J. van Heuven; Niels O. Schiller (2023). Dynamic effect of tonal similarity in bilingual auditory lexical processing [Dataset]. http://doi.org/10.6084/m9.figshare.7409225.v1
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.6084/m9.figshare.7409225.v1
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Junru Wu; Yiya Chen; Vincent J. van Heuven; Niels O. Schiller
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Phonological similarity affects bilingual lexical access of etymologically-related translation equivalents (ETEs). Jinan Mandarin (JM) and Standard Chinese (SC) are closely related and share many ETEs, which are usually orthographically and segmentally identical but vary in tonal similarity. Using an auditory lexical decision experiment and Generalised Additive Modelling, the present study investigates how cross-linguistic tonal similarity interacts with language of operation and how the switching of language across blocks influences SC-JM bilinguals’ auditory lexical processing of ETEs. Bilinguals showed a language dominance effect, indicating that ETEs are specified with separated word-form representations. Compared with SC tonal monolinguals, bilinguals showed a discontinuous bilingual auditory lexical advantage, instead of a classical bilingual lexical disadvantage. The dynamic role of cross-linguistic tonal similarity in auditory word processing is discussed in light of the bilinguals’ attentional shift with the change of language mode at the pre-lexical and lexical stages.
Codebook for primary data set.
figshare.com
xlsx
Updated May 30, 2025
Cite
Jan-Ole Hesselberg; Pål Ulleberg; Øystein Sørensen; Knut Inge Fostervold; Sigrid Hegna Ingvaldsen; Ida Svege (2025). Codebook for primary data set. [Dataset]. http://doi.org/10.1371/journal.pone.0322696.s001
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0322696.s001
Dataset updated
May 30, 2025
Dataset provided by
PLOS ONE
Authors
Jan-Ole Hesselberg; Pål Ulleberg; Øystein Sørensen; Knut Inge Fostervold; Sigrid Hegna Ingvaldsen; Ida Svege
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Codebook describing the data in the primary data set used as the starting point for the analysis script (see S2 Script). (XLSX)

Cite

aahz78 (2024). Wine quality dataset with identified duplicates [Dataset]. https://www.kaggle.com/datasets/aahz78/wine-quality-dataset-with-identified-duplicates

Wine quality dataset with identified duplicates

Explore at:

Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)

Dataset updated

Aug 1, 2024

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

aahz78

License

MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically

Description

Overview

This dataset is derived from the original Wine Quality dataset and includes identified duplicates for further analysis and exploration. The original dataset consists of chemical properties of red and white wines along with their quality ratings.

Content

The dataset contains all the original features along with an additional column indicating the duplicate status. The duplicates were identified based on a comprehensive analysis that highlights records with high similarity. Additionally, the file ddrw.json contains information about red and white wines with 100% identical characteristics.
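
As a rough illustration of what a duplicate-status column can look like, the sketch below flags exact repeats only. It does not reproduce the author's similarity analysis, which the description does not specify. The sample rows, the three-feature layout, and the function name `flag_exact_duplicates` are hypothetical.

```python
from collections import Counter

# Hypothetical sample rows: (fixed acidity, volatile acidity, alcohol, quality).
# Real records in the Wine Quality data carry more chemical features.
rows = [
    (7.4, 0.70, 9.4, 5),
    (7.8, 0.88, 9.8, 5),
    (7.4, 0.70, 9.4, 5),  # exact repeat of the first row
    (11.2, 0.28, 9.8, 6),
]


def flag_exact_duplicates(records):
    """Return one boolean per record: True if an identical record occurs more than once."""
    counts = Counter(records)
    return [counts[r] > 1 for r in records]


is_duplicate = flag_exact_duplicates(rows)
print(is_duplicate)  # [True, False, True, False]
```

Every copy of a repeated record is flagged here, including the first occurrence. Whether the dataset's own column follows that convention is not stated.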

Description

This dataset aims to provide a refined version of the original wine quality data by highlighting duplicate entries. Duplicates in data can lead to misleading analysis and results. By identifying these duplicates, data scientists and analysts can better understand the structure of the data and apply necessary cleaning and preprocessing steps.

The file ddrw.json provides information on red and white wines that have 100% identical characteristics. This information can be useful for:

Studying the similarities between different types of wine.
Analyzing cases where two different types of wine have the same chemical properties and understanding the reasons behind these similarities.
Conducting a detailed analysis and improving machine learning models for wine quality prediction by considering identical records.
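
One possible way to find red and white wines with identical characteristics is to intersect the two sets of feature tuples. This is a minimal sketch under assumed inputs: the records and the helper name `shared_records` are hypothetical, and the internal layout of ddrw.json is not described in the source, so the file is not read here.

```python
# Hypothetical feature tuples: (fixed acidity, volatile acidity, alcohol).
red = [(7.4, 0.70, 9.4), (7.8, 0.88, 9.8), (6.9, 0.40, 10.1)]
white = [(7.0, 0.27, 8.8), (6.9, 0.40, 10.1), (8.1, 0.28, 10.1)]


def shared_records(a, b):
    """Return feature tuples that occur in both lists, in the order they appear in `a`."""
    b_set = set(b)
    seen = set()
    shared = []
    for rec in a:
        if rec in b_set and rec not in seen:
            seen.add(rec)
            shared.append(rec)
    return shared


identical = shared_records(red, white)
print(identical)  # [(6.9, 0.4, 10.1)]
```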

Key Features

Comprehensive Duplicate Identification: The dataset includes duplicates identified through a robust process, ensuring high accuracy.
High Similarity Analysis: The dataset highlights the most and least similar records, providing insights into the nature of the duplicates.
Enhanced Data Quality: By focusing on duplicate detection, this dataset helps in enhancing the overall quality of the data for more accurate analysis.
File ddrw.json: Contains information about 100% identical characteristics of red and white wines, which can be useful for in-depth analysis.

Usage

This dataset is useful for:

Data cleaning and preprocessing exercises.
Duplicate detection and handling techniques.
Exploring the impact of duplicates on data analysis and machine learning models.
Educational purposes for understanding the importance of data quality.
Studying similarities between different types of wine and their characteristics.
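
To give a sense of how duplicates can shift a simple statistic, the sketch below compares the mean quality rating before and after dropping exact repeats. The records are made up for illustration and are not taken from the dataset.

```python
from statistics import mean

# Hypothetical (fixed acidity, volatile acidity, quality) records;
# the third record repeats the first.
records = [
    (7.4, 0.70, 5),
    (7.8, 0.88, 6),
    (7.4, 0.70, 5),
    (11.2, 0.28, 7),
]

# dict.fromkeys keeps the first occurrence of each record and preserves order.
deduplicated = list(dict.fromkeys(records))

mean_with = mean(r[-1] for r in records)           # duplicates included
mean_without = mean(r[-1] for r in deduplicated)   # duplicates removed

print(mean_with, mean_without)  # 5.75 6
```

In this toy case the repeated low-rated record pulls the average down, and removing it raises the mean from 5.75 to 6.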

File Structure

1dd.json: red wine duplicate records.
1ddw.json: white wine duplicate records.
ddrw.json: A file containing information about 100% identical characteristics of red and white wines.

Acknowledgements

This dataset is built upon the original Wine Quality dataset by Abdelaziz Sami. Special thanks to the original contributors.
