MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Overview
This dataset is derived from the original Wine Quality dataset and includes identified duplicates for further analysis and exploration. The original dataset consists of chemical properties of red and white wines along with their quality ratings.
Content
The dataset contains all the original features along with an additional column indicating the duplicate status. The duplicates were identified based on a comprehensive analysis that highlights records with high similarity. Additionally, the file ddrw.json contains information about red and white wines with 100% identical characteristics.
Description
This dataset aims to provide a refined version of the original wine quality data by highlighting duplicate entries. Duplicates in data can lead to misleading analysis and results. By identifying these duplicates, data scientists and analysts can better understand the structure of the data and apply necessary cleaning and preprocessing steps.
The file ddrw.json provides information on red and white wines that have 100% identical characteristics. This information can be useful for:
Studying the similarities between different types of wine.
Analyzing cases where two different types of wine have the same chemical properties and understanding the reasons behind these similarities.
Conducting a detailed analysis and improving machine learning models for wine quality prediction by considering identical records.
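The idea of red and white wines with 100% identical characteristics can be sketched as follows. This is a minimal illustration, not necessarily the process used to build ddrw.json: the record layout, the field names (type, alcohol, pH, quality) and the sample values are assumptions made for the example.

```python
from collections import defaultdict

def feature_key(record, ignore=("type",)):
    """Build a hashable key from every feature except the ignored ones."""
    return tuple(sorted((k, v) for k, v in record.items() if k not in ignore))

def cross_type_duplicates(red, white):
    """Return (red_index, white_index) pairs whose features match exactly."""
    index = defaultdict(list)
    for i, rec in enumerate(red):
        index[feature_key(rec)].append(i)
    pairs = []
    for j, rec in enumerate(white):
        for i in index.get(feature_key(rec), []):
            pairs.append((i, j))
    return pairs

# Hypothetical sample rows (values are illustrative, not taken from the dataset).
red_wines = [
    {"type": "red", "alcohol": 9.4, "pH": 3.51, "quality": 5},
    {"type": "red", "alcohol": 10.0, "pH": 3.30, "quality": 6},
]
white_wines = [
    {"type": "white", "alcohol": 10.0, "pH": 3.30, "quality": 6},
    {"type": "white", "alcohol": 11.2, "pH": 3.10, "quality": 7},
]

pairs = cross_type_duplicates(red_wines, white_wines)
print(pairs)
```

Hashing the full feature tuple keeps the comparison linear in the number of records. It only catches exact matches, so near-duplicates would need a tolerance-based comparison instead.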
Key Features
Comprehensive Duplicate Identification: The dataset includes duplicates identified through a robust process, ensuring high accuracy.
High Similarity Analysis: The dataset highlights the most and least similar records, providing insights into the nature of the duplicates.
Enhanced Data Quality: By focusing on duplicate detection, this dataset helps in enhancing the overall quality of the data for more accurate analysis.
File ddrw.json: Contains information about 100% identical characteristics of red and white wines, which can be useful for in-depth analysis.
Usage
This dataset is useful for:
Data cleaning and preprocessing exercises.
Duplicate detection and handling techniques.
Exploring the impact of duplicates on data analysis and machine learning models.
Educational purposes for understanding the importance of data quality.
Studying similarities between different types of wine and their characteristics.
File Structure
1dd.json: red wine duplicate records.
1ddw.json: white wine duplicate records.
ddrw.json: A file containing information about 100% identical characteristics of red and white wines.
Acknowledgements
This dataset is built upon the original Wine Quality dataset by Abdelaziz Sami. Special thanks to the original contributors.
This repository contains code and data for the publication "Representations of emotion concepts: Comparison across pairwise, appraisal feature-based, and word embedding-based similarity spaces" by Kwon, M., Wager, T., & Phillips, J. (2022), published in the Proceedings of the Annual Meeting of the Cognitive Science Society, 44(44), and available at https://escholarship.org/uc/item/8vj3d366. All code is written in R.
Updates
We have updated our analyses based on feedback received since the annual meeting. The changes do not alter our main findings or conclusion. The included analysis script (EMOCON_cogsci2022_revised.Rmd) reflects these updates:
Additional emotion concept pairs: Pairs involving comfortableness, gratefulness, relaxedness, romanticness, sereneness, and protectiveness, which were omitted in the previous version, were added to the analysis. The revised script includes all pairs and reflects changes in the feature-based similarity matrix, the correlation between the similarity measures, the loading values from principal component analysis (PCA) on appraisal features, the regression coefficients from multiple regression with the affective feature components from PCA, and the difference scores of the affective feature components.
Scaling before PCA: PCA was rerun after rescaling data.
script
Scripts used for analysis are included in an R Markdown file.
data
Data for appraisal feature ratings, pairwise similarity ratings, and word embeddings from word2vec models trained on Google News and Wikipedia are included. References for these data are listed below and also in the paper. Word embeddings from GPT-3 can be accessed via OpenAI's API; see the OpenAI API documentation for how to retrieve them.
Word embeddings from the word2vec model trained on Google News: Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 26.
Word embeddings from the word2vec model trained on Wikipedia: Fares, M., Kutuzov, A., Oepen, S., & Velldal, E. (2017). Word vectors, reuse, and replicability: Towards a community repository of large-text resources. Proceedings of the 21st Nordic Conference on Computational Linguistics, 271–276.
https://spdx.org/licenses/CC0-1.0.html
Contains data and code for peer review of the draft manuscript 'Two dominant forms of multisite similarity decline – their origins and interpretation' in review at Ecology and Evolution (Manuscript ID: ECE-2022-10-01523). The data are a subset of the metaCommunity Ecology: Species, Traits, Environment and Space ("CESTES") database reported in "A global database for metacommunity ecology, integrating species, traits, environment and space" by A. Jeliazkov, D. Mijatovic, S. Chantepie, N. Andrew, R. Arlettaz, L. Barbaro, et al., Scientific Data 2020, Vol. 7, Issue 1, Pages e6. Data were downloaded from the Figshare repository (https://doi.org/10.6084/m9.figshare.c.4459637) on 9 Nov 2021.
Methods
The 80 datasets in the original database were first filtered to select only abundance (or similar quantitative) estimates of species in each site. This yielded 69 datasets. Several of these were experimental treatments (e.g., logged vs unlogged) or were time series re-surveys of the same sites. Because our interest was in structures within a single habitat type and point in time, these datasets were subdivided into discrete units. In total, this (coincidentally) resulted in 80 datasets for analysis. These data are in the cestesAbSplit.RData object, with matching metadata in mdat_cestesSplit.csv.
cestesAbSplit.RData: a list containing the sites x species matrices for the 80 datasets used for analysis.
mdat_cestesSplit.csv: matching metadata for the 80 datasets (name and source of the data, taxonomic grouping, kingdom, realm, extent in km2, total sites, total species). This file is only included to run the R analysis. Data are repeated, with full description in Appendix S5 (see Readme tab).
Appendix S5 Metadata.xlsx: metadata and values of standardized effect sizes from empirical analyses, with a ReadMe file explaining each field.
The effect sizes were calculated using the scripts in Rcode_form_of_zeta.R (requires R_function.R) and can be reproduced by running the scripts (some very small numerical differences might occur in simulated values based on randomisation).
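As a rough illustration of multisite similarity, the sketch below computes zeta diversity, assuming (as the script name Rcode_form_of_zeta.R suggests) that this is the measure involved. Zeta of order k is the mean number of species shared by every combination of k sites. The site and species names are hypothetical, and the sketch uses presence/absence sets rather than the abundance matrices in cestesAbSplit.RData; it is written in Python for brevity and is not a port of the R scripts.

```python
from itertools import combinations

def zeta(site_species, order):
    """Mean number of species shared by every combination of `order` sites."""
    sites = list(site_species.values())
    shared = [len(set.intersection(*combo)) for combo in combinations(sites, order)]
    return sum(shared) / len(shared)

# Hypothetical presence/absence data: site name -> set of species present.
sites = {
    "s1": {"a", "b", "c"},
    "s2": {"a", "b", "d"},
    "s3": {"a", "c", "d"},
}

zeta_values = [zeta(sites, k) for k in (1, 2, 3)]
print(zeta_values)
```

How these values fall as the order increases is the decline whose form the manuscript title refers to.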
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Codon similarity data in ATTED-II ver 8.0
The gene-to-gene codon similarity data is organized in the form of tables, each named according to the Entrez Gene ID of a particular query gene. Each table encompasses three columns, specifying: the Entrez Gene ID of a corresponding gene, an MR (Mutual Rank) value (where a smaller number signifies a stronger relationship), and a Pearson correlation coefficient (where a larger number suggests a stronger association).
Protein-coding sequences utilized in this study were retrieved from NCBI's RefSeq database. For each gene, a 61-dimensional vector was derived from the count of codons in the protein-coding sequence. In instances where multiple RefSeq sequences were associated with a single gene, the longest sequence was selected for the codon usage calculation. Pearson correlation coefficients (PCCs) were determined between the vectors of any two given genes. These PCCs were subsequently converted into MRs, employed as an index to evaluate the similarity in codon usage between the genes.
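The pipeline described above (codon counts, Pearson correlation, conversion to Mutual Rank) can be sketched as follows. Two points are assumptions rather than statements from the source: the 61 dimensions are taken to be the 64 codons minus the three standard stop codons, and MR is taken to be the geometric mean of the two directional ranks. The gene names and sequences are made up for illustration.

```python
from itertools import product
from math import sqrt

# Assumption: the 61 dimensions are the sense codons (64 minus 3 stop codons).
STOP_CODONS = {"TAA", "TAG", "TGA"}
CODONS = [c for c in ("".join(p) for p in product("ACGT", repeat=3))
          if c not in STOP_CODONS]

def codon_vector(cds):
    """Count each sense codon in a coding sequence read in frame from position 0."""
    counts = dict.fromkeys(CODONS, 0)
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in counts:
            counts[codon] += 1
    return [counts[c] for c in CODONS]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mutual_rank(pcc, a, b):
    """Assumed definition: geometric mean of b's rank for a and a's rank for b."""
    def rank(query, target):
        others = sorted((g for g in pcc if g != query),
                        key=lambda g: pcc[query][g], reverse=True)
        return others.index(target) + 1
    return sqrt(rank(a, b) * rank(b, a))

# Hypothetical genes and sequences, for illustration only.
genes = {
    "g1": "ATGGCTGCTAAA",
    "g2": "ATGGCTGCTAAG",
    "g3": "ATGTTTTTTCCC",
}
vectors = {g: codon_vector(seq) for g, seq in genes.items()}
pcc = {a: {b: pearson(vectors[a], vectors[b]) for b in genes if b != a}
       for a in genes}

print(mutual_rank(pcc, "g1", "g2"))
```

Because g1 and g2 are each other's most correlated gene, both directional ranks are 1, which matches the reading that a smaller MR signifies a stronger relationship.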
MOTIVATION: The Gene Ontology (GO) is heavily used in systems biology, but the potential for redundancy, confounds with other data sources and problems with stability over time have been little explored. RESULTS: We report that GO annotations are stable over short periods, with 3% of genes not being most semantically similar to themselves between monthly GO editions. However, we find that genes can alter their 'functional identity' over time, with 20% of genes not matching to themselves (by semantic similarity) after 2 years. We further find that annotation bias in GO, in which some genes are more characterized than others, has declined in yeast, but generally increased in humans. Finally, we discovered that many entries in protein interaction databases are owing to the same published reports that are used for GO annotations, with 66% of assessed GO groups exhibiting this confound. We provide a case study to illustrate how this information can be used in analyses of gene sets and networks. AVAILABILITY: Data available at http://chibi.ubc.ca/assessGO.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the context of the EU-funded project DIONE (No. 870378), the following EO-based monitoring marker maps were released over the two pilot regions, containing the results produced from a set of image analysis and machine learning techniques. The latest explores the benefits of Copernicus's multispectral high-resolution Sentinel-2 data acquired from 01-08-2021 until 03-09-2022 and provides information tailored to the needs of European paying agencies (e.g. CAPO and NPA), expressed with the following markers.
Mowing marker: used to detect mowing events on meadow/grass like Features Of Interest (FOI)
Mean-NDVI marker: used to detect erroneous claims with no vegetation
Homogeneity marker: used to determine if a parcel geometry consists of a single crop or if multiple things are growing on the parcel
Bare soil marker: used to detect observation where bare soil is present on the feature of interest. This indicates agricultural activity on the FOI (plowing, harvest)
Similarity and distance markers: used to give additional context to the crop classification and to detect erroneous claims
Land marker: used to detect the land type and non-productive EFAs of the FOI
Crop-type marker: used to detect the specific crop growing on the FOI
This dataset comprises the GeoPackage file "markers_summary.gpkg", which was computed for the Cypriot pilot region. Descriptions are given below.
Markers summary dataset
Description of the information contained in the corresponding "markers summary" dataset
Attribute name: Description
POLY_ID: Unique polygon identifier
OBS_ALL: Number of all available observations
OBS_VALID: Number of all valid observations
CROP_CODE: Crop identifier
N_PIXEL: Number of S2 pixels within FOI
PLOTIDCROP: PLOT_ID_CROP
BS_APR_END: Number of bare-soil observations between 2022-03-01 and 2022-04-30
BS_FEB_JUN: Number of bare-soil observations between 2022-02-01 and 2022-06-30
BS_JAN_APR: Number of bare-soil observations between 2022-01-01 and 2022-04-30
BS_JAN_MAR: Number of bare-soil observations between 2022-01-01 and 2022-03-31
BS_OCT_DEC: Number of bare-soil observations between 2021-10-01 and 2021-12-31
BS_OCT_SEP: Number of bare-soil observations between 2021-10-01 and 2022-09-30
P_CROP: The FOI label as predicted by the crop group model
CR_P_SCORE: The pseudoprobability of the crop-group prediction. A score close to 1 indicates that the model is very confident in the prediction
D_GROUP: Declared crop group
DIST_CROP: Most similar crop according to the distance marker
DIST_SC00: Distance marker score of a most similar crop
DIST_SCORE: Distance marker score of a FOI when compared to nearby FOIs with the same claim. A value close to 100 indicates that a FOI is not similar to other FOIs with the same claim.
HOM_CLASS: Homogeneous/Heterogeneous
HOM_SCORE: Homogeneity probability assigned by the homogeneity marker
P_LAND_GR: The FOI label as predicted by the land group model using crop groupings based on land use
LGRP_SCORE: The pseudoprobability of the crop-group prediction. A score close to 1 indicates that the model is very confident in the prediction.
D_LAND: Declared land group
M_NDVI_A_J: Mean NDVI value in interval 2022-04-01 till 2022-07-31
M_NDVI_F_M: Mean NDVI value in interval 2022-02-01 till 2022-03-31
M_NDVI_J_A: Mean NDVI value in interval 2022-01-01 till 2022-04-30
MW_FEB_JUN: Number of mowing events between 2022-02-01 and 2022-06-30
MW_OCT_SEP: Number of mowing events between 2021-10-01 and 2022-09-30
MW_ALL: Number of mowing events in the observation period
SIM_CROP: Most similar crop according to similarity marker
SIM_SC00: Similarity marker score of a most similar crop
SIM_SCORE: Similarity marker score of a FOI when compared to nearby FOIs with the same claim. A value close to 100 indicates that a FOI is not similar to other FOIs with the same claim.
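To show how the attributes listed above might be combined to screen for possibly erroneous claims, here is a minimal sketch. The thresholds (0.9, 90.0, 0.2), the function name flag_foi and the sample rows are all hypothetical and are not taken from the DIONE project. In practice the rows would come from "markers_summary.gpkg"; plain dictionaries are used here so that the example stays self-contained.

```python
def flag_foi(row, min_confidence=0.9, max_distance=90.0, min_ndvi=0.2):
    """Return the reasons a feature of interest (FOI) might deserve a manual check.

    All three thresholds are hypothetical defaults chosen for illustration.
    """
    reasons = []
    # A confident crop-group prediction that disagrees with the declaration.
    if row["P_CROP"] != row["D_GROUP"] and row["CR_P_SCORE"] >= min_confidence:
        reasons.append("confident prediction differs from declared crop group")
    # A distance score close to 100 means the FOI is unlike its neighbours.
    if row["DIST_SCORE"] >= max_distance:
        reasons.append("dissimilar to nearby FOIs with the same claim")
    # A low mean NDVI suggests little or no vegetation.
    if row["M_NDVI_J_A"] < min_ndvi:
        reasons.append("low mean NDVI")
    return reasons

# Hypothetical rows using attribute names from the markers summary dataset.
consistent = {"POLY_ID": 1, "P_CROP": "cereals", "D_GROUP": "cereals",
              "CR_P_SCORE": 0.97, "DIST_SCORE": 12.0, "M_NDVI_J_A": 0.61}
suspicious = {"POLY_ID": 2, "P_CROP": "fallow", "D_GROUP": "cereals",
              "CR_P_SCORE": 0.95, "DIST_SCORE": 96.0, "M_NDVI_J_A": 0.08}

for row in (consistent, suspicious):
    print(row["POLY_ID"], flag_foi(row))
```

Returning a list of reasons rather than a single boolean keeps each marker's contribution visible, which suits a workflow where flagged parcels are reviewed by hand.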
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Spatial prepositions have been studied in some detail from multiple disciplinary perspectives. However, neither the semantic similarity of these prepositions, nor the relationships between the multiple senses of different spatial prepositions, are well understood. In an empirical study of 24 spatial prepositions, we identify the degree and nature of semantic similarity and extract senses for three semantically similar groups of prepositions using t-SNE, DBSCAN clustering, and Venn diagrams. We validate the work by manual annotation with another data set. We find nuances in meaning among proximity and adjacency prepositions, such as the use of close to instead of near for pairs of lines, and the importance of proximity over contact for the next to preposition, in contrast to other adjacency prepositions.
https://creativecommons.org/publicdomain/zero/1.0/
The Food Allergens Dataset is a collection of information regarding allergens present in various food items. The dataset contains allergen information for a range of food ingredients, enabling the identification and analysis of potential allergens in different dishes and products. It serves as a valuable resource for researchers, food manufacturers, healthcare professionals, and individuals with food allergies.
Size: The dataset consists of a total of 400 records, with each record representing a specific food item and its associated allergens.
Allergens: The dataset includes a comprehensive list of allergens found in the food items. These allergens encompass a wide range of ingredients, such as dairy, wheat, nuts (almonds, peanuts, pine nuts), seafood (anchovies, fish, shellfish), grains (oats, rice), animal-based ingredients (chicken, pork), plant-based ingredients (celery, mustard, soybeans), and common ingredients (cocoa, eggs). Additionally, the dataset contains entries where no specific allergens are listed.
Data Structure: The dataset is structured with multiple columns to provide detailed information. The columns include:
- Food Item: Represents the name of the food item.
- Ingredients: Lists the ingredients present in the food item, categorized into different columns such as sugar, salt, oil, spices, etc.
- Allergens: Indicates the allergens associated with the food item, including the specific allergenic ingredients present.
- Prediction: Indicates whether the food product contains allergens or not (contains, does not contain).
Potential Models and Analysis:
Allergen Detection Model: A model can be trained to predict whether a food item contains allergens or not.
Ingredient Similarity Analysis: This analysis can provide insights into similarities and differences among different types of dishes.
Allergen Prevalence Analysis: The dataset can be used to gain insights into the prevalence of different allergens in food products.
Recommender Systems: The dataset can also be used to develop recommender systems for individuals with specific dietary restrictions or allergies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many large network data sets are noisy and contain links representing low-intensity relationships that are difficult to differentiate from random interactions. This is especially relevant for high-throughput data from systems biology, large-scale ecological data, but also for Web 2.0 data on human interactions. In these networks with missing and spurious links, it is possible to refine the data based on the principle of structural similarity, which assesses the shared neighborhood of two nodes. By using similarity measures to globally rank all possible links and choosing the top-ranked pairs, true links can be validated, missing links inferred, and spurious observations removed. While many similarity measures have been proposed to this end, there is no general consensus on which one to use. In this article, we first contribute a set of benchmarks for complex networks from three different settings (e-commerce, systems biology, and social networks) and thus enable a quantitative performance analysis of classic node similarity measures. Based on this, we then propose a new methodology for link assessment called z* that assesses the statistical significance of the number of their common neighbors by comparison with the expected value in a suitably chosen random graph model and which is a consistently top-performing algorithm for all benchmarks. In addition to a global ranking of links, we also use this method to identify the most similar neighbors of each single node in a local ranking, thereby showing the versatility of the method in two distinct scenarios and augmenting its applicability. Finally, we perform an exploratory analysis on an oceanographic plankton data set and find that the distribution of microbes follows similar biogeographic rules as those of macroorganisms, a result that rejects the global dispersal hypothesis for microbes.
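The general idea of scoring a pair of nodes by how far their number of common neighbors deviates from a random expectation can be sketched as below. This is not the z* method itself: the abstract does not specify its random graph model, so a simple hypergeometric null on a bipartite user-item network is swapped in as an assumption. The function names and the basket data are hypothetical.

```python
from itertools import combinations
from math import sqrt

def z_score(a_items, b_items, n_items):
    """z-score of the observed overlap against a hypergeometric null.

    Assumption: b's items are a uniform random draw of len(b_items) items
    out of n_items, of which len(a_items) belong to a.
    """
    k, d = len(a_items), len(b_items)
    observed = len(a_items & b_items)
    p = k / n_items
    mean = d * p
    var = d * p * (1 - p) * (n_items - d) / (n_items - 1)
    return (observed - mean) / sqrt(var) if var > 0 else 0.0

def rank_pairs(baskets):
    """Globally rank all user pairs by z-score, most significant first."""
    n_items = len(set().union(*baskets.values()))
    scored = [((a, b), z_score(baskets[a], baskets[b], n_items))
              for a, b in combinations(sorted(baskets), 2)]
    return sorted(scored, key=lambda pair_score: pair_score[1], reverse=True)

# Hypothetical user -> purchased-items data.
baskets = {
    "u1": {1, 2, 3},
    "u2": {1, 2, 4},
    "u3": {4, 5, 6},
}

ranked = rank_pairs(baskets)
for pair, score in ranked:
    print(pair, round(score, 3))
```

Normalising by the expected overlap means that two high-degree nodes are not ranked highly just because they share many neighbors by chance, which is the motivation for a significance-based score over a raw common-neighbor count.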
Climate Similarity Support Scripts
This compressed file contains scripts and supporting data necessary to reproduce the analyses and products associated with Doherty et al. (2017), “Matching seed to site by climate similarity: Techniques to prioritize plant materials development and use in restoration.”
Climate Similarity Scripts.zip
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: As the evaluation indices, cancer grading and subtyping have diverse clinical, pathological, and molecular characteristics with prognostic and therapeutic implications. Although researchers have begun to study cancer differentiation and subtype prediction, most of relevant methods are based on traditional machine learning and rely on single omics data. It is necessary to explore a deep learning algorithm that integrates multi-omics data to achieve classification prediction of cancer differentiation and subtypes.
Methods: This paper proposes a multi-omics data fusion algorithm based on a multi-view graph neural network (MVGNN) for predicting cancer differentiation and subtype classification. The model framework consists of a graph convolutional network (GCN) module for learning features from different omics data and an attention module for integrating multi-omics data. Three different types of omics data are used. For each type of omics data, feature selection is performed using methods such as the chi-square test and minimum redundancy maximum relevance (mRMR). Weighted patient similarity networks are constructed based on the selected omics features, and GCN is trained using omics features and corresponding similarity networks. Finally, an attention module integrates different types of omics features and performs the final cancer classification prediction.
Results: To validate the cancer classification predictive performance of the MVGNN model, we conducted experimental comparisons with traditional machine learning models and currently popular methods based on integrating multi-omics data using 5-fold cross-validation. Additionally, we performed comparative experiments on cancer differentiation and its subtypes based on single omics data, two omics data, and three omics data.
Discussion: This paper proposed the MVGNN model and it performed well in cancer classification prediction based on multiple omics data.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Crystals of isomeric pyridyl-substituted diaminotriazines 1a−c and elongated analogues 2a−b were grown under various conditions, and their structures were solved by X-ray crystallography. Analysis of the structures revealed three shared features: (1) The compounds favor flattened conformations; (2) they participate in approximately coplanar hydrogen bonding according to motifs characteristic of diaminotriazines; and (3) these interactions play a key role in directing molecular organization. Together, the consistent molecular topologies and the shared presence of a dominant site of association ensure that the compounds crystallize similarly to give structures that feature chains, tapes, and layers. In certain cases, in fact, the molecular organization adopted by different pyridyl-substituted diaminotriazines is virtually identical, even when the length of the molecule or the orientation of the pyridyl group has been changed. Together, these observations show how functional groups such as diaminotriazinyl, which can control association by forming multiple directional intermolecular interactions according to reliable patterns, can be incorporated within more complex molecular structures to determine how crystallization will occur.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mixed-effects model fitted to accuracy of all data with difficulty and an interaction of language and category.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Phonological similarity affects bilingual lexical access of etymologically-related translation equivalents (ETEs). Jinan Mandarin (JM) and Standard Chinese (SC) are closely related and share many ETEs, which are usually orthographically and segmentally identical but vary in tonal similarity. Using an auditory lexical decision experiment and Generalised Additive Modelling, the present study investigates how cross-linguistic tonal similarity interacts with language of operation and how the switching of language across blocks influences SC-JM bilinguals’ auditory lexical processing of ETEs. Bilinguals showed a language dominance effect, indicating that ETEs are specified with separated word-form representations. Compared with SC tonal monolinguals, bilinguals showed a discontinuous bilingual auditory lexical advantage, instead of a classical bilingual lexical disadvantage. The dynamic role of cross-linguistic tonal similarity in auditory word processing is discussed in light of the bilinguals’ attentional shift with the change of language mode at the pre-lexical and lexical stages.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Codebook describing the data in the primary data set used as the starting point for the analysis script (see S2 Script). (XLSX)