97 datasets found
  1. Binary classification using a confusion matrix

    • plos.figshare.com
    xls
    Updated Dec 6, 2024
    Cite
    Chantha Wongoutong (2024). Binary classification using a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0310839.t002
    Explore at:
    xls (available download formats)
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Chantha Wongoutong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite the popularity of k-means clustering, feature scaling before applying it is an essential yet often neglected step. In this study, feature scaling via five methods (Z-score standardization, Min-Max normalization, Percentile transformation, Maximum absolute scaling, or RobustScaler) was compared with using the raw (i.e., non-scaled) data when applying k-means clustering to datasets whose features have different or the same units. The results of an experimental study show that, for features with different units, scaling them before k-means clustering yielded better accuracy, precision, recall, and F-score values than using the raw data, whereas for features with the same unit, scaling beforehand gave results similar to using the raw data. Thus, scaling the features beforehand is a very important step for datasets with different units, as it improves the clustering results and accuracy. Of the five feature-scaling methods, Z-score standardization and Percentile transformation performed similarly on datasets with different units and were superior to the other methods and to the raw data. Although Maximum absolute scaling performed slightly better than the other scaling methods and the raw data when the dataset contained features with the same unit, the improvement was not significant.
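    The comparison described above can be sketched with scikit-learn. This is a minimal, self-contained illustration, not the study's code: synthetic two-feature data stands in for the datasets, and `QuantileTransformer` stands in for the percentile transformation.

```python
# Sketch: k-means on raw vs. scaled features, scored against true labels.
# The discriminative feature has a small scale, so raw Euclidean distance
# is dominated by the large-unit noise feature.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   QuantileTransformer, RobustScaler,
                                   StandardScaler)

rng = np.random.default_rng(0)
# Feature 1 (e.g. metres) separates the clusters; feature 2 (e.g. dollars) is noise.
X = np.vstack([
    rng.normal([0.0, 50_000], [0.5, 10_000], size=(100, 2)),
    rng.normal([5.0, 50_000], [0.5, 10_000], size=(100, 2)),
])
y_true = np.repeat([0, 1], 100)

scalers = {
    "raw": None,
    "z-score": StandardScaler(),
    "min-max": MinMaxScaler(),
    "percentile": QuantileTransformer(n_quantiles=100, random_state=0),
    "max-abs": MaxAbsScaler(),
    "robust": RobustScaler(),
}
results = {}
for name, scaler in scalers.items():
    Xs = X if scaler is None else scaler.fit_transform(X)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xs)
    results[name] = adjusted_rand_score(y_true, labels)
    print(f"{name:10s} ARI = {results[name]:.2f}")
```

    On data like this, the raw run clusters on the noise feature (ARI near 0) while the scaled runs recover the true grouping, mirroring the study's finding for features with different units.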

  2. GC/MS Simulated Data Sets normalized using median scaling

    • dataverse.harvard.edu
    Updated Jan 25, 2017
    + more versions
    Cite
    Denise Scholtens (2017). GC/MS Simulated Data Sets normalized using median scaling [Dataset]. http://doi.org/10.7910/DVN/OYOLXD
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 25, 2017
    Dataset provided by
    Harvard Dataverse
    Authors
    Denise Scholtens
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    1000 simulated data sets, stored in a list of R data frames, used in support of Reisetter et al. (submitted), 'Mixture model normalization for non-targeted gas chromatography / mass spectrometry metabolomics data'. These are the results after normalization using median scaling as described in Reisetter et al.
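    Median scaling rescales each sample so that sample medians agree, removing per-sample intensity differences. The sketch below uses the common metabolomics definition in Python for illustration; the authors' exact R implementation is described in their paper.

```python
# Median scaling sketch (assumed definition, not the authors' code):
# rescale each row (sample) so its median equals the global median.
import numpy as np

def median_scale(X):
    """Rescale each row of X so its median matches the global median."""
    sample_medians = np.median(X, axis=1, keepdims=True)
    return X * (np.median(X) / sample_medians)

# Second sample measured at twice the intensity of the first:
X = np.array([[1.0, 2.0, 4.0],
              [2.0, 4.0, 8.0]])
Xn = median_scale(X)   # both rows now share the same median
```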

  3. MFCCs Feature Scaling Images for Multi-class Human Action Analysis: A Benchmark Dataset

    • researchdata.edu.au
    • data.mendeley.com
    Updated 2023
    Cite
    Naveed Akhtar; Syed Mohammed Shamsul Islam; Douglas Chai; Muhammad Bilal Shaikh; Computer Science and Software Engineering (2023). MFCCs Feature Scaling Images for Multi-class Human Action Analysis : A Benchmark Dataset [Dataset]. http://doi.org/10.17632/6D8V9JMVGM.1
    Explore at:
    Dataset updated
    2023
    Dataset provided by
    Mendeley Ltd.
    The University of Western Australia
    Authors
    Naveed Akhtar; Syed Mohammed Shamsul Islam; Douglas Chai; Muhammad Bilal Shaikh; Computer Science and Software Engineering
    Description

    This dataset comprises an array of Mel Frequency Cepstral Coefficients (MFCCs) that have undergone feature scaling, representing a variety of human actions. Feature scaling, or data normalization, is a preprocessing technique used to standardize the range of features in the dataset. For MFCCs, this process helps ensure all coefficients contribute equally to the learning process, preventing features with larger scales from overshadowing those with smaller scales.

    In this dataset, the audio signals correspond to diverse human actions such as walking, running, jumping, and dancing. The MFCCs are calculated via a series of signal processing stages, which capture key characteristics of the audio signal in a manner that closely aligns with human auditory perception. The coefficients are then standardized or scaled using methods such as MinMax Scaling or Standardization, thereby normalizing their range. Each normalized MFCC vector corresponds to a segment of the audio signal.
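    The scaling step can be illustrated on a placeholder MFCC matrix. The random numbers below stand in for real coefficients (which a front-end such as librosa would produce); the point is how MinMax Scaling and Standardization normalize each coefficient's range.

```python
# Scaling per MFCC coefficient, sketched on a stand-in (n_frames x n_mfcc)
# matrix with realistic per-coefficient scales. Illustrative only.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(1)
# Lower-order MFCCs typically have much larger magnitudes than higher ones.
mfcc = rng.normal(loc=[-200.0, 50.0, 10.0], scale=[30.0, 10.0, 2.0],
                  size=(100, 3))

minmax = MinMaxScaler().fit_transform(mfcc)     # each coefficient -> [0, 1]
zscore = StandardScaler().fit_transform(mfcc)   # each coefficient -> mean 0, std 1
```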

    The dataset is meticulously designed for tasks including human action recognition, classification, segmentation, and detection based on auditory cues. It serves as an essential resource for training and evaluating machine learning models focused on interpreting human actions from audio signals. This dataset proves particularly beneficial for researchers and practitioners in fields such as signal processing, computer vision, and machine learning, who aim to craft algorithms for human action analysis leveraging audio signals.

  4. Comparison of the average performance metric values for k-means clustering of datasets having features with different (D1–D5) or the same (S1–S5) units

    • plos.figshare.com
    xls
    Updated Dec 6, 2024
    Cite
    Chantha Wongoutong (2024). Comparison of the average performance metric values for k-means clustering of datasets having features with different (D1–D5) or the same (S1–S5) units. [Dataset]. http://doi.org/10.1371/journal.pone.0310839.t005
    Explore at:
    xls (available download formats)
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Chantha Wongoutong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of the average performance metric values for k-means clustering of datasets having features with different (D1–D5) or the same (S1–S5) units.

  5. The performance results for k-means clustering and testing the hypothesis for homogeneity between the true grouped data and feature scaling on datasets containing features with different units

    • plos.figshare.com
    xls
    Updated Dec 6, 2024
    + more versions
    Cite
    Chantha Wongoutong (2024). The performance results for k-means clustering and testing the hypothesis for homogeneity between the true grouped data and feature scaling on datasets containing features with different units. [Dataset]. http://doi.org/10.1371/journal.pone.0310839.t003
    Explore at:
    xls (available download formats)
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Chantha Wongoutong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The performance results for k-means clustering and testing the hypothesis for homogeneity between the true grouped data and feature scaling on datasets containing features with different units.

  6. 🌼 Unveiling the Iris Dataset 🌸

    • kaggle.com
    Updated Jul 28, 2023
    Cite
    HARISH KUMARdatalab (2023). 🌼 Unveiling the Iris Dataset 🌸 [Dataset]. http://doi.org/10.34740/kaggle/dsv/6209742
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    HARISH KUMARdatalab
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context: 🌼 The Iris flower dataset, an iconic multivariate set, was first introduced by the renowned British statistician and biologist, Ronald Fisher in 1936 📝. Commonly known as Anderson's Iris dataset, it was curated by Edgar Anderson to measure the morphologic variation of three Iris species 🌸: Iris Setosa, Iris Virginica, and Iris Versicolor.

    📊 The set comprises 50 samples from each of the three species (150 in total), with four features - sepal length, sepal width, petal length, and petal width, measured in centimetres.

    🔬 This dataset has since served as a standard test case for various statistical classification techniques in machine learning, including the widely used support vector machines (SVM).

    So, whether you're a newbie dipping your toes into the ML pond or a seasoned data scientist testing out a new classification method, the Iris dataset is a classic starting point! 🎯🚀

    Columns:

    1. Sepal Length: The length of the sepal of the iris flower, measured in centimetres.
    2. Sepal Width: The width of the sepal of the iris flower, measured in centimetres.
    3. Petal Length: The length of the petal of the iris flower, measured in centimetres.
    4. Petal Width: The width of the petal of the iris flower, measured in centimetres.
    5. Species: The species of the iris flower, categorized into Setosa, Virginica, and Versicolor.

    Problem Statement:

    1.🎯 Classification Challenge: Can you accurately predict the species of an Iris flower based on the four given measurements: sepal length, sepal width, petal length, and petal width?

    2.💡 Feature Importance: Which feature (sepal length, sepal width, petal length, or petal width) is the most significant in distinguishing between the species of Iris flowers?

    3.📈 Data Scaling: How does standardization (or normalization) of the features affect the performance of your classification models?

    4.🧪 Model Experimentation: Can simpler models such as Logistic Regression perform as well as more complex models like Support Vector Machines or Neural Networks on the Iris dataset? Compare the performance of various models.

    5.🤖 AutoML Challenge: Use AutoML tools (like Google's AutoML or H2O's AutoML) to build a classification model. How does its performance compare with your handcrafted models?
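    For problem statements 1, 3, and 4, a minimal scikit-learn baseline might look like the following. The choice of logistic regression and 5-fold cross-validation is illustrative, not prescribed by the dataset.

```python
# Baseline sketch: logistic regression on Iris, with and without
# standardization, scored by 5-fold cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

raw = LogisticRegression(max_iter=1000)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

acc_raw = cross_val_score(raw, X, y, cv=5).mean()
acc_scaled = cross_val_score(scaled, X, y, cv=5).mean()
print(f"raw: {acc_raw:.3f}  scaled: {acc_scaled:.3f}")
```

    Both variants typically score well above 90% accuracy on Iris, which is why it makes a good sanity-check dataset before moving to SVMs, neural networks, or AutoML.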

    Kindly upvote if you find the dataset interesting.

  7. Printed Digits Dataset

    • paperswithcode.com
    Updated Apr 2, 2025
    + more versions
    Cite
    (2025). Printed Digits Dataset [Dataset]. https://paperswithcode.com/dataset/printed-digits-dataset
    Explore at:
    Dataset updated
    Apr 2, 2025
    Description


    The Printed Digits Dataset is a comprehensive collection of approximately 3,000 grayscale images, specifically curated for numeric digit classification tasks. Originally created with 177 images, this dataset has undergone extensive augmentation to enhance its diversity and utility, making it an ideal resource for machine learning projects such as Sudoku digit recognition.

    Dataset Composition:

    Image Count: The dataset contains around 3,000 images, each representing a single numeric digit from 0 to 9.

    Image Dimensions: Each image is standardized to a 28×28 pixel resolution, maintaining a consistent grayscale format.

    Purpose: This dataset was developed with a specific focus on Sudoku digit classification. Notably, it includes blank images for the digit '0', reflecting the common occurrence of empty cells in Sudoku puzzles.


    Augmentation Details:

    To expand the original dataset from 177 images to 3,000, a variety of data augmentation techniques were applied. These include:

    Rotation: Images were rotated to simulate different orientations of printed digits.

    Scaling: Variations in the size of digits were introduced to mimic real-world printing inconsistencies.

    Translation: Digits were shifted within the image frame to represent slight misalignments often seen in printed text.

    Noise Addition: Gaussian noise was added to simulate varying print quality and scanner imperfections.
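    The four augmentations listed above can be sketched with scipy.ndimage on a single 28×28 image. The parameters (angle, zoom factor, shift, noise level) are illustrative, not the dataset's actual settings.

```python
# Augmentation sketch for one 28x28 grayscale digit (illustrative parameters).
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
digit = rng.random((28, 28))          # stand-in for a printed-digit image

# Rotation: simulate a different orientation of the printed digit.
rotated = ndimage.rotate(digit, angle=10, reshape=False, mode="constant")
# Scaling: enlarge slightly, then crop back to 28x28.
scaled = ndimage.zoom(digit, 1.1)[:28, :28]
# Translation: shift within the frame to mimic misalignment.
translated = ndimage.shift(digit, shift=(2, -1), mode="constant")
# Noise addition: Gaussian noise, clipped back to valid pixel range.
noisy = np.clip(digit + rng.normal(0.0, 0.05, digit.shape), 0.0, 1.0)
```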

    Applications:

    Sudoku Digit Recognition: Given its design, this dataset is particularly well-suited for training models to recognize and classify digits in Sudoku puzzles.

    Handwritten Digit Classification: Although the dataset contains printed digits, it can be adapted and utilized in combination with handwritten digit datasets for broader numeric classification tasks.

    Optical Character Recognition (OCR): This dataset can also be valuable for training OCR systems, especially those aimed at processing low-resolution or small-scale printed text.

    Dataset Quality:

    Uniformity: All images are uniformly scaled and aligned, providing a clean and consistent dataset for model training.

    Diversity: Augmentation has significantly increased the diversity of digit representation, making the dataset robust for training deep learning models.

    Usage Notes:

    Zero Representation: Users should note that the digit '0' is represented by a blank image.

    This design choice aligns with the specific application of Sudoku puzzle solving but may require adjustments if the dataset is used for other numeric classification tasks.

    Preprocessing Required: While the dataset is ready for use, additional preprocessing steps, such as normalization or further augmentation, can be applied based on the specific requirements of the intended machine learning model.

    File Format:

    The images are stored in a standardized format compatible with most machine learning frameworks, ensuring ease of integration into existing workflows.

    Conclusion: The Printed Digits Dataset offers a rich resource for those working on digit classification projects, particularly within the context of Sudoku or other numeric-based puzzles. Its extensive augmentation and attention to application-specific details make it a valuable asset for both academic research and practical AI development.

    This dataset is sourced from Kaggle.

  8. Supply Chain Management (Normalized)

    • dataverse.harvard.edu
    Updated May 6, 2025
    Cite
    Diomar Anez; Dimar Anez (2025). Supply Chain Management (Normalized) [Dataset]. http://doi.org/10.7910/DVN/WNB7AY
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 6, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Diomar Anez; Dimar Anez
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset provides processed and normalized/standardized indices for the management tool group 'Supply Chain Management' (SCM), including related concepts like Supply Chain Integration. Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding SCM dataset in the Management Tool Source Data (Raw Extracts) Dataverse.

    Data Files and Processing Methodologies:

    Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI). Input Data: native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "supply chain management" + "supply chain logistics" + "supply chain". Processing: none; the dataset utilizes the original Google Trends index, which is base-100 normalized against the peak search interest for the specified terms and period. Output Metric: Monthly Normalized RSI (Base 100). Frequency: monthly.

    Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency. Input Data: annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Supply Chain Management + Supply Chain Integration + Supply Chain. Processing: the annual relative frequency series was normalized by setting the year with the maximum value to 100 and scaling all other years proportionally. Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: annual.

    Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index. Input Data: absolute monthly publication counts matching SCM-related keywords [("supply chain management" OR ...) AND ("management" OR ...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly publication counts in Crossref; data deduplicated via DOIs. Processing: for each month, the relative share of SCM-related publications (SCM count / total Crossref count for that month) was calculated; this monthly relative share series was then normalized by setting the month with the maximum relative share to 100 and scaling all other months proportionally. Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: monthly.

    Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index. Input Data: original usability percentages (%) from Bain surveys for specific years: Supply Chain Integration (1999, 2000, 2002); Supply Chain Management (2004, 2006, 2008, 2010, 2012, 2014, 2017, 2022). Processing: semantic grouping - data points for "Supply Chain Integration" and "Supply Chain Management" were treated as a single conceptual series for SCM; normalization - the combined series of original usability percentages was normalized relative to its own highest observed historical value across all included years (max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: biennial (approx.).

    Bain & Co. Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index. Input Data: original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Supply Chain Integration (1999, 2000, 2002); Supply Chain Management (2004, 2006, 2008, 2010, 2012, 2014, 2017, 2022). Processing: semantic grouping as above; standardization (Z-scores) - original scores (X) were standardized using Z = (X - μ) / σ, with μ = 3.0 and σ ≈ 0.891609; index scale transformation - Z-scores were transformed via Index = 50 + (Z × 22). Output Metric: Biennial Standardized Satisfaction Index (center = 50, range ≈ [1, 100]). Frequency: biennial (approx.).

    File Naming Convention: files generally follow the pattern PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer. For original extraction details (specific keywords, URLs, etc.), refer to the corresponding SCM dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.
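    The satisfaction-index transformation stated in the description (Z = (X - 3.0) / 0.891609, then Index = 50 + 22 × Z) is straightforward to write out:

```python
# The Bain satisfaction transformation, exactly as given in the description.
def satisfaction_index(score: float) -> float:
    """Map a 1-5 Bain satisfaction score to the standardized index (center 50)."""
    z = (score - 3.0) / 0.891609
    return 50.0 + 22.0 * z

satisfaction_index(3.0)   # midpoint of the 1-5 scale maps to exactly 50
satisfaction_index(5.0)   # top score maps to roughly 99.3
satisfaction_index(1.0)   # bottom score maps to roughly 0.7
```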

  9. VGG-16 with batch normalization

    • kaggle.com
    zip
    Updated Dec 15, 2017
    + more versions
    Cite
    PyTorch (2017). VGG-16 with batch normalization [Dataset]. https://www.kaggle.com/pytorch/vgg16bn
    Explore at:
    zip (514090274 bytes; available download formats)
    Dataset updated
    Dec 15, 2017
    Dataset authored and provided by
    PyTorch
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    VGG-16

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

    Authors: Karen Simonyan, Andrew Zisserman
    https://arxiv.org/abs/1409.1556

    VGG Architectures

    VGG Architecture diagram: https://imgur.com/uLXrKxe.jpg

    What is a Pre-trained Model?

    A pre-trained model has been previously trained on a dataset and contains the weights and biases that represent the features of that dataset. Learned features are often transferable to different data. For example, a model trained on a large dataset of bird images will contain learned features, like edges or horizontal lines, that would be transferable to your dataset.

    Why use a Pre-trained Model?

    Pre-trained models are beneficial for many reasons. By using a pre-trained model you save time: someone else has already spent the time and compute resources to learn many features, and your model will likely benefit from them.

  10. Change Management (Normalized)

    • dataverse.harvard.edu
    Updated May 6, 2025
    Cite
    Diomar Anez; Dimar Anez (2025). Change Management (Normalized) [Dataset]. http://doi.org/10.7910/DVN/J5KRBS
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 6, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Diomar Anez; Dimar Anez
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset provides processed and normalized/standardized indices for the management tool 'Change Management' (often encompassing Change Management Programs). Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding Change Management dataset in the Management Tool Source Data (Raw Extracts) Dataverse.

    Data Files and Processing Methodologies:

    Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI). Input Data: native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "change management programs" + "change management" + "change management business". Processing: none; utilizes the original base-100 normalized Google Trends index. Output Metric: Monthly Normalized RSI (Base 100). Frequency: monthly.

    Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency. Input Data: annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Change Management Programs + Change Management. Processing: annual relative frequency series normalized (peak year = 100). Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: annual.

    Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index. Input Data: absolute monthly publication counts matching Change Management-related keywords [("change management programs" OR ...) AND (...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly Crossref publications; deduplicated via DOIs. Processing: monthly relative share calculated (Change Mgmt count / total count), then the series normalized (peak month's share = 100). Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: monthly.

    Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index. Input Data: original usability percentages (%) from Bain surveys for specific years: Change Management Programs (2002, 2004, 2010, 2012, 2014, 2017, 2022). Processing: original usability percentages normalized relative to the series' historical peak (max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: biennial (approx.).

    Bain & Co. Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index. Input Data: original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Change Management Programs (2002-2022). Processing: standardization (Z-scores) using Z = (X - 3.0) / 0.891609, then index scale transformation Index = 50 + (Z × 22). Output Metric: Biennial Standardized Satisfaction Index (center = 50, range ≈ [1, 100]). Frequency: biennial (approx.).

    File Naming Convention: files generally follow the pattern PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer. For original extraction details (specific keywords, URLs, etc.), refer to the corresponding Change Management dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.

  11. WikiMed and PubMedDS: Two large-scale datasets for medical concept extraction and normalization research

    • zenodo.org
    zip
    Updated Dec 4, 2021
    Cite
    Shikhar Vashishth; Denis Newman-Griffis; Rishabh Joshi; Ritam Dutt; Carolyn P Rosé (2021). WikiMed and PubMedDS: Two large-scale datasets for medical concept extraction and normalization research [Dataset]. http://doi.org/10.5281/zenodo.5753476
    Explore at:
    zip (available download formats)
    Dataset updated
    Dec 4, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shikhar Vashishth; Denis Newman-Griffis; Rishabh Joshi; Ritam Dutt; Carolyn P Rosé
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Two large-scale, automatically-created datasets of medical concept mentions, linked to the Unified Medical Language System (UMLS).

    WikiMed

    Derived from Wikipedia data. Mappings of Wikipedia page identifiers to UMLS Concept Unique Identifiers (CUIs) were extracted by crosswalking Wikipedia, Wikidata, Freebase, and the NCBI Taxonomy to reach existing mappings to UMLS CUIs. This created a 1:1 mapping of approximately 60,500 Wikipedia pages to UMLS CUIs. Links to these pages were then extracted as mentions of the corresponding UMLS CUIs.

    WikiMed contains:

    • 393,618 Wikipedia page texts
    • 1,067,083 mentions of medical concepts
    • 57,739 unique UMLS CUIs

    Manual evaluation of 100 random samples of WikiMed found 91% accuracy in the automatic annotations at the level of UMLS CUIs, and 95% accuracy in terms of semantic type.

    PubMedDS

    Derived from biomedical literature abstracts from PubMed. Mentions were automatically identified using distant supervision based on Medical Subject Heading (MeSH) headers assigned to the papers in PubMed, and recognition of medical concept mentions using the high-performance scispaCy model. MeSH header codes are included as well as their mappings to UMLS CUIs.

    PubMedDS contains:

    • 13,197,430 abstract texts
    • 57,943,354 medical concept mentions
    • 44,881 unique UMLS CUIs

    Comparison with existing manually-annotated datasets (NCBI Disease Corpus, BioCDR, and MedMentions) found 75-90% precision in automatic annotations. Please note this dataset is not a comprehensive annotation of medical concept mentions in these abstracts (only mentions located through distant supervision from MeSH headers were included), but is intended as data for concept normalization research.

    Due to its size, PubMedDS is distributed as 30 individual files of approximately 1.5 million mentions each.

    Data format

    Both datasets use JSON format with one document per line. Each document has the following structure:

    {
      "_id": "A unique identifier of each document",
      "text": "Contains text over which mentions are ",
      "title": "Title of Wikipedia/PubMed Article",
      "split": "[Not in PubMedDS] Dataset split: 
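    A minimal reader for this one-JSON-document-per-line format might look like the following (the file name in the usage comment is hypothetical):

```python
# Stream documents from a WikiMed/PubMedDS-style file: one JSON object
# per line, with fields such as "_id", "text", and "title".
import json

def read_documents(path):
    """Yield one parsed document per non-empty line of a JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# e.g.: for doc in read_documents("wikimed_part1.json"): print(doc["_id"])
```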

  12. 30 m-scale Annual Global Normalized Difference Urban Index Datasets from 2000 to 2021

    • scidb.cn
    Updated Jan 13, 2023
    + more versions
    Cite
    Di Liu; Qingling Zhang (2023). 30 m-scale Annual Global Normalized Difference Urban Index Datasets from 2000 to 2021 [Dataset]. http://doi.org/10.57760/sciencedb.07081
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 13, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Di Liu; Qingling Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Urban areas play a very important role in global climate change. There is an increasing interest in comprehending global urban areas with adequate geographic detail for global climate change mitigation. Accurate and frequent urban area information is fundamental to comprehending urbanization processes and land use/cover change, as well as the impact of global climate and environmental change. Defense Meteorological Satellite Program/Operational Line Scan System (DMSP/OLS) night-light (NTL) imagery contributes powerfully to the spatial characterization of global cities; however, its application potential is seriously limited by its coarse resolution. In this paper, we generate an annual Normalized Difference Urban Index (NDUI) to characterize global urban areas at a 30 m resolution from 2000 to 2021 by combining Landsat-5/7/8 Normalized Difference Vegetation Index (NDVI) composites and DMSP/OLS NTL images on the Google Earth Engine (GEE) platform. With the capability to delineate urban boundaries and, at the same time, to present sufficient spatial detail within urban areas, the NDUI datasets have potential for urbanization studies at regional and global scales.
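    The NDUI combines night-light intensity with vegetation greenness. The normalized-difference form sketched below is the standard construction for such indices and is an assumption here; the authors' exact formula is given in their paper.

```python
# Normalized-difference combination of NTL and NDVI (assumed form, for
# illustration): bright lights with sparse vegetation push values toward 1,
# dark areas with dense vegetation toward -1.
import numpy as np

def ndui(ntl, ndvi, eps=1e-9):
    """ntl: night-light intensity scaled to [0, 1]; ndvi: NDVI in [-1, 1]."""
    ntl = np.asarray(ntl, dtype=float)
    ndvi = np.asarray(ndvi, dtype=float)
    return (ntl - ndvi) / (ntl + ndvi + eps)

urban = ndui(0.9, 0.1)       # bright, low vegetation: strongly positive
rural = ndui(0.05, 0.8)      # dark, dense vegetation: strongly negative
```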

  13. Sample dataset for the models trained and tested in the paper 'Can AI be...

    • zenodo.org
    zip
    Updated Aug 1, 2024
    + more versions
    Cite
    Elena Tomasi; Elena Tomasi; Gabriele Franch; Gabriele Franch; Marco Cristoforetti; Marco Cristoforetti (2024). Sample dataset for the models trained and tested in the paper 'Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy' [Dataset]. http://doi.org/10.5281/zenodo.12934521
    Explore at:
    zip (available download format)
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Elena Tomasi; Gabriele Franch; Marco Cristoforetti
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This repository contains a sample of the input data for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy". It allows the user to test and train the models on a reduced dataset (45GB).

    This sample dataset comprises ~3 years of normalized hourly data for both low-resolution predictors and high-resolution target variables. Data has been randomly picked from the whole dataset, from 2000 to 2020, with 70% of data coming from the original training dataset, 15% from the original validation dataset, and 15% from the original test dataset. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.

    This sample dataset also includes files relative to metadata, static data, normalization, and plotting.

    To use the data, clone the corresponding repository and unzip this zip file in the data folder.

  14. H

    Business Process Reengineering (Normalized)

    • dataverse.harvard.edu
    Updated May 6, 2025
    Cite
    Diomar Anez; Dimar Anez (2025). Business Process Reengineering (Normalized) [Dataset]. http://doi.org/10.7910/DVN/QBP0E9
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 6, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Diomar Anez; Dimar Anez
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset provides processed and normalized/standardized indices for the management tool 'Business Process Reengineering' (BPR). Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding BPR dataset in the Management Tool Source Data (Raw Extracts) Dataverse.

    Data Files and Processing Methodologies:

    • Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI). Input Data: native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "business process reengineering" + "process reengineering" + "reengineering management". Processing: none; the dataset utilizes the original Google Trends index, which is base-100 normalized against the peak search interest for the specified terms and period. Output Metric: Monthly Normalized RSI (Base 100). Frequency: Monthly.
    • Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency. Input Data: annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Reengineering + Business Process Reengineering + Process Reengineering. Processing: the annual relative frequency series was normalized by setting the year with the maximum value to 100 and scaling all other years proportionally. Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: Annual.
    • Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index. Input Data: absolute monthly publication counts matching BPR-related keywords [("business process reengineering" OR ...) AND ("management" OR ...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly publication counts in Crossref; data deduplicated via DOIs. Processing: for each month, the relative share of BPR-related publications (BPR count / total Crossref count for that month) was calculated; this monthly relative share series was then normalized by setting the month with the maximum relative share to 100 and scaling all other months proportionally. Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: Monthly.
    • Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index. Input Data: original usability percentages (%) from Bain surveys for specific years: Reengineering (1993, 1996, 2000, 2002); Business Process Reengineering (2004, 2006, 2008, 2010, 2012, 2014, 2017, 2022). Processing: Semantic Grouping: data points for "Reengineering" and "Business Process Reengineering" were treated as a single conceptual series for BPR. Normalization: the combined series of original usability percentages was normalized relative to its own highest observed historical value across all included years (max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: Biennial (approx.).
    • Bain & Co. Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index. Input Data: original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Reengineering (1993, 1996, 2000, 2002); Business Process Reengineering (2004, 2006, 2008, 2010, 2012, 2014, 2017, 2022). Processing: Semantic Grouping: data points for "Reengineering" and "Business Process Reengineering" were treated as a single conceptual series for BPR. Standardization (Z-scores): original scores (X) were standardized using Z = (X - μ) / σ, with a theoretically defined neutral mean μ = 3.0 and an estimated pooled population standard deviation σ ≈ 0.891609 (calculated across all tools/years relative to μ = 3.0). Index Scale Transformation: Z-scores were transformed to an intuitive index via Index = 50 + (Z * 22); this scale centers theoretical neutrality (original score 3.0) at 50 and maps the approximate range [1, 5] to [≈1, ≈100]. Output Metric: Biennial Standardized Satisfaction Index (center = 50, range ≈ [1, 100]). Frequency: Biennial (approx.).

    File Naming Convention: files generally follow the pattern PREFIX_Tool_Processed.csv or similar, where the prefix indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer. For original extraction details (specific keywords, URLs, etc.), refer to the corresponding BPR dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.
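    The BS_ satisfaction transformation can be written out directly; a minimal sketch using the μ and σ values quoted in the description (the function name is illustrative):

```python
# Sketch of the BS_ satisfaction-index transformation described above.
MU = 3.0          # theoretically neutral satisfaction score on the 1-5 scale
SIGMA = 0.891609  # estimated pooled population standard deviation

def satisfaction_index(score):
    z = (score - MU) / SIGMA      # standardize against the neutral mean
    return 50 + z * 22            # map neutrality to 50, range ~[1, 100]

print(satisfaction_index(3.0))  # neutral score maps to 50.0
```

    With these constants, a score of 5.0 lands just under 100 and a score of 1.0 just under 1, matching the stated [≈1, ≈100] range.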

  15. CIFAR-10: Color Images, 10 Classes

    • kaggle.com
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). CIFAR-10: Color Images, 10 Classes [Dataset]. https://www.kaggle.com/datasets/thedevastator/cifar-10-color-images-10-classes/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 2, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    CIFAR-10: Color Images, 10 Classes


    By cifar10 (From Huggingface) [source]

    About this dataset

    The CIFAR-10 dataset is a highly valuable and widely-used collection of color images that has been an essential resource in the field of computer vision. This dataset comprises a total of 60,000 images, each depicted as a 32x32 pixel array. The images are rich in vibrant colors and exhibit various objects or categories, thereby ensuring visual diversity.

    To facilitate classification tasks, the dataset classifies these images into ten distinct classes. Each class represents a different object or category, contributing to the versatility and applicability of the CIFAR-10 dataset. Remarkably, there are precisely 6,000 images associated with each class.

    For convenient utilization and assessment by machine learning models, the original CIFAR-10 dataset is divided into two vital subsets: namely, the training set and the test set. The training set encompasses image data along with corresponding labels for facilitating model training endeavors effectively. Conversely, the test set consists of a comprehensive compilation of 10,000 color images meticulously aligned with their relevant labels to support evaluation purposes thoroughly.

    Overall, this extensive collection offers researchers an invaluable resource to explore cutting-edge applications for computer vision algorithms and develop robust solutions across various domains such as image recognition systems or object detection technologies.

    How to use the dataset

    The CIFAR-10 dataset is an excellent resource for anyone working in the field of computer vision. It consists of a diverse collection of 60,000 color images, each with dimensions of 32x32 pixels. These images are classified into 10 different classes, providing a valuable dataset for training and evaluating machine learning models.

    To make the most out of this dataset, follow these steps:

    Step 1: Understanding the Dataset Before diving into any analysis or model development, it's crucial to familiarize yourself with the structure and content of the CIFAR-10 dataset. The dataset contains two main files: train.csv and test.csv.

    The label column in both files represents the class labels for each image. There are a total of 10 different classes in this dataset, representing various objects or categories. This column is categorical data that will be essential during model training and evaluation.

    The img column in both files contains color images represented as a matrix/array of size 32x32 pixels. These pixel values represent Red-Green-Blue (RGB) intensity values ranging from 0 to 255. This image data will be used as input during model training and prediction phases.

    Step 2: Data Exploration Next, explore the data to gain insights and better understand its characteristics:

    • Analyze Class Distribution: Check if there is an imbalance among different classes by examining how many instances belong to each class.
    • Visualize Images: Display some sample images from each class using their corresponding image arrays/rows.
    • Understand Image Dimensions: Confirm that all images have consistent dimensions (32x32). If not, preprocessing may be required before analysis or modeling.

    Step 3: Preprocessing Depending on your specific use case or algorithm requirements, preprocessing steps may include:

    • Normalization/Scaling: Scale pixel values to a consistent range (e.g., 0-1) to facilitate model convergence.
    • One-Hot Encoding: Convert the categorical label column values into a binary vector representation for classification tasks.
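    Both preprocessing steps above can be sketched with NumPy. The toy batch below stands in for rows of train.csv; the shapes (32x32x3 uint8 pixels) and label range (0-9) follow the dataset description:

```python
import numpy as np

# Hypothetical toy batch standing in for CIFAR-10 rows: 4 images of
# 32x32x3 uint8 pixels with integer class labels in 0..9.
images = np.random.randint(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
labels = np.array([3, 0, 9, 3])

# Normalization/scaling: map pixel intensities from [0, 255] to [0, 1].
scaled = images.astype(np.float32) / 255.0

# One-hot encoding: turn each integer label into a 10-dim binary vector.
one_hot = np.eye(10, dtype=np.float32)[labels]

print(scaled.shape, one_hot.shape)  # (4, 32, 32, 3) (4, 10)
```

    The same two transforms apply unchanged to the full training and test sets once the img and label columns are loaded into arrays.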

    Step 4: Model Training and Evaluation Using the CIFAR-10 dataset, you can develop and train machine learning models to perform various tasks, such as image classification. Here are general steps for this process:

    • Splitting Data: Divide the training set into separate subsets for training and validation. This division helps assess model performance during training by providing an unbiased evaluation.

    • Model Selection: Choose an appropriate algorithm/model architecture based

    Research Ideas

    • Object Recognition: The CIFAR-10 dataset can be used to train computer vision models for object recognition. By using the labeled images in the training set, a model can learn to classify new images into one of the 10 different classes accurately.
    • Image Segmentation: The dataset can also be used for image segmentation tasks, where the goal is to identify and segment different objects within an image. By training a model on the CIFAR-10 dataset, it can learn to separate and label ind...
  16. PSML: A Multi-scale Time-series Dataset for Machine Learning in Decarbonized...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Nov 10, 2021
    Cite
    Xiangtian Zheng; Nan Xu; Dongqi Wu; Loc Trinh; Tong Huang; S Sivaranjani; Yan Liu; Le Xie; Xiangtian Zheng; Nan Xu; Dongqi Wu; Loc Trinh; Tong Huang; S Sivaranjani; Yan Liu; Le Xie (2021). PSML: A Multi-scale Time-series Dataset for Machine Learning in Decarbonized Energy Grids (Dataset) [Dataset]. http://doi.org/10.5281/zenodo.5130612
    Explore at:
    zip (available download format)
    Dataset updated
    Nov 10, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiangtian Zheng; Nan Xu; Dongqi Wu; Loc Trinh; Tong Huang; S Sivaranjani; Yan Liu; Le Xie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The electric grid is a key enabling infrastructure for the ambitious transition towards carbon neutrality as we grapple with climate change. With deepening penetration of renewable energy resources and electrified transportation, the reliable and secure operation of the electric grid becomes increasingly challenging. In this paper, we present PSML, a first-of-its-kind open-access multi-scale time-series dataset, to aid in the development of data-driven machine learning (ML) based approaches towards reliable operation of future electric grids. The dataset is generated through a novel transmission + distribution (T+D) co-simulation designed to capture the increasingly important interactions and uncertainties of the grid dynamics, containing electric load, renewable generation, weather, voltage and current measurements at multiple spatio-temporal scales. Using PSML, we provide state-of-the-art ML baselines on three challenging use cases of critical importance to achieve: (i) early detection, accurate classification and localization of dynamic disturbance events; (ii) robust hierarchical forecasting of load and renewable energy with the presence of uncertainties and extreme events; and (iii) realistic synthetic generation of physical-law-constrained measurement time series. We envision that this dataset will enable advances for ML in dynamic systems, while simultaneously allowing ML researchers to contribute towards carbon-neutral electricity and mobility.

    Data Navigation

    Please download and unzip the archive, and keep it somewhere accessible for reproducing the benchmark results, loading the data, and evaluating the performance of proposed methods.

    wget https://zenodo.org/record/5130612/files/PSML.zip?download=1
    7z x 'PSML.zip?download=1' -o./
    

    Minute-level Load and Renewable

    • File Name
      • ISO_zone_#.csv: `CAISO_zone_1.csv` contains minute-level load, renewable, and weather data from 2018 to 2020 in zone 1 of CAISO.
    • Field Description
      • Field `time`: Time of minute resolution.
      • Field `load_power`: Normalized load power.
      • Field `wind_power`: Normalized wind turbine power.
      • Field `solar_power`: Normalized solar PV power.
      • Field `DHI`: Diffuse horizontal irradiance.
      • Field `DNI`: Direct normal irradiance.
      • Field `GHI`: Global horizontal irradiance.
      • Field `Dew Point`: Dew point in degree Celsius.
      • Field `Solar Zeinth Angle`: The angle between the sun's rays and the vertical direction, in degrees.
      • Field `Wind Speed`: Wind speed (m/s).
      • Field `Relative Humidity`: Relative humidity (%).
      • Field `Temperature`: Temperature in degree Celsius.

    Minute-level PMU Measurements

    • File Name
      • case #: The `case 0` folder contains all data of scenario setting #0.
        • pf_input_#.txt: Selected load, renewable and solar generation for the simulation.
        • pf_result_#.csv: Voltage at nodes and power on branches in the transmission system via T+D simulation.
    • Field Description
      • Field `time`: Time of minute resolution.
      • Field `Vm_###`: Voltage magnitude (p.u.) at the bus ### in the simulated model.
      • Field `Va_###`: Voltage angle (rad) at the bus ### in the simulated model.
      • Field `P_#_#_#`: `P_3_4_1` means the active power transferring in the #1 branch from the bus 3 to 4.
      • Field `Q_#_#_#`: `Q_5_20_1` means the reactive power transferring in the #1 branch from the bus 5 to 20.

    Millisecond-level PMU Measurements

    • File Name
      • Forced Oscillation: The folder contains all forced oscillation cases.
        • row_#: The folder contains all data of the disturbance scenario #.
          • dist.csv: Three-phase voltage at nodes in the distribution system via T+D simulation.
          • info.csv: This file contains the start time, end time, location and type of the disturbance.
          • trans.csv: Voltage at nodes and power on branches in the transmission system via T+D simulation.
      • Natural Oscillation: The folder contains all natural oscillation cases.
        • row_#: The folder contains all data of the disturbance scenario #.
          • dist.csv: Three-phase voltage at nodes in the distribution system via T+D simulation.
          • info.csv: This file contains the start time, end time, location and type of the disturbance.
          • trans.csv: Voltage at nodes and power on branches in the transmission system via T+D simulation.
    • Field Description
      • trans.csv
        • Field `Time(s)`: Time of millisecond resolution.
        • Field `VOLT ###`: Voltage magnitude (p.u.) at the bus ### in the transmission model.
        • Field `POWR ### TO ### CKT #`: `POWR 151 TO 152 CKT '1 '` means the active power transferring in the #1 branch from the bus 151 to 152.
        • Field `VARS ### TO ### CKT #`: `VARS 151 TO 152 CKT '1 '` means the reactive power transferring in the #1 branch from the bus 151 to 152.
      • dist.csv
        • Field `Time(s)`: Time of millisecond resolution.
        • Field `####.###.#`: `3005.633.1` means per-unit voltage magnitude of the phase A at the bus 633 of the distribution grid, the one connecting to the bus 3005 in the transmission system.
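    A small helper, assumed from the naming scheme above rather than shipped with PSML, to split a dist.csv column name into its parts:

```python
def parse_dist_column(name):
    # '3005.633.1' -> transmission bus 3005, distribution bus 633, phase 1
    # (phase A per the example above).
    trans_bus, dist_bus, phase = name.split(".")
    return int(trans_bus), int(dist_bus), int(phase)

print(parse_dist_column("3005.633.1"))  # (3005, 633, 1)
```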
  17. f

    Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq...

    • figshare.com
    • frontiersin.figshare.com
    xlsx
    Updated Jun 2, 2023
    + more versions
    Cite
    Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach (2023). Table_2_Comparison of Normalization Methods for Analysis of TempO-Seq Targeted RNA Sequencing Data.xlsx [Dataset]. http://doi.org/10.3389/fgene.2020.00594.s002
    Explore at:
    xlsx (available download format)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Pierre R. Bushel; Stephen S. Ferguson; Sreenivasa C. Ramaiahgari; Richard S. Paules; Scott S. Auerbach
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between the treated vs. untreated groups. For all FC levels, the specificity of the UQ normalization was greater than 0.84 and its sensitivity greater than 0.90, except for the no-change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite its assumption that the majority of genes are unchanged, the DESeq2 scaling factors normalization method performed reasonably well, as did the simple normalization procedures counts per million (CPM) and total counts (TCs). These results suggest that for two-class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
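    For reference, one common formulation of Upper Quartile normalization (scale each sample by the 75th percentile of its nonzero counts) can be sketched as follows; this illustrates the general technique, not the paper's exact pipeline:

```python
import numpy as np

# Toy counts matrix (genes x samples).
counts = np.array([[10, 20],
                   [100, 250],
                   [0, 5],
                   [40, 90]], dtype=float)

def uq_normalize(mat):
    # 75th percentile of nonzero counts per sample (column).
    factors = np.array([np.percentile(col[col > 0], 75) for col in mat.T])
    factors = factors / factors.mean()   # center scale factors around 1
    return mat / factors

normalized = uq_normalize(counts)
print(normalized.shape)  # (4, 2)
```

    Dividing by the per-sample upper quartile removes sequencing-depth differences without letting a few very highly expressed genes dominate the scale factor, which is the usual motivation for UQ over total-count scaling.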

  18. m

    KU-BdSL: Khulna University Bengali Sign Language dataset

    • data.mendeley.com
    Updated Jul 28, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Abdullah Al Jaid Jim (2023). KU-BdSL: Khulna University Bengali Sign Language dataset [Dataset]. http://doi.org/10.17632/scpvm2nbkm.4
    Explore at:
    Dataset updated
    Jul 28, 2023
    Authors
    Abdullah Al Jaid Jim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Khulna
    Description

    The KU-BdSL refers to a Bengali sign language dataset, which includes three variants of the data: (i) the Uni-scale Sign Language Dataset (USLD), (ii) the Multi-scale Sign Language Dataset (MSLD), and (iii) the Annotated Multi-scale Sign Language Dataset (AMSLD). The dataset consists of images representing single-hand gestures for BdSL alphabets. Several smartphones were used to capture images from 39 participants (30 males and 9 females); the participants received no financial benefit for contributing to the dataset. Each variant includes 30 classes covering the 38 consonants of the Bengali alphabet (several similar consonants share a class, as the folder mapping below shows). There are a total of 1,500 images in jpg format in each variant. The images were captured on flat surfaces at different times of the day to vary the brightness and contrast. For USLD and MSLD, class names are the Unicode values corresponding to the Bengali alphabets.

    Folder Names: 2433 -> ‘Chandra Bindu’ 2434 -> ‘Anusshar’ 2435 -> ‘Bisharga’ 2453 -> ‘Ka’ 2454 -> ‘Kha’ 2455 -> ‘Ga’ 2456 -> ‘Gha’ 2457 -> ‘Uo’ 2458 -> ‘Ca’ 2459 -> ‘Cha’ 2460-2479 -> ‘Borgio Ja/Anta Ja’ 2461 -> ‘Jha’ 2462 -> ‘Yo’ 2463 -> ‘Ta’ 2464 -> ‘Tha’ 2465 -> ‘Da’ 2466 -> ‘Dha’ 2467-2472 -> ‘Murdha Na/Donto Na’ 2468-2510 -> ‘ta/Khanda ta’ 2469 -> ‘tha’ 2470 -> ‘da’ 2471 -> ‘dha’ 2474 -> ‘pa’ 2475 -> ‘fa’ 2476-2477 -> ‘Ba/Bha’ 2478 -> ‘Ma’ 2480-2524-2525 -> ‘Ba-y Ra/Da-y Ra/Dha-y Ra’ 2482 -> ‘La’ 2486-2488-2487 -> ‘Talobbo sha/Danta sa/Murdha Sha’ 2489 -> ‘Ha’

    USLD: USLD uses a single size for all images, 512*512 pixels; in the majority of images, the hand is centered. MSLD: The raw images are stored in MSLD so that researchers can make changes to the dataset; the use of various smartphones yields a wide variety of image sizes. AMSLD: AMSLD contains multi-scale annotated data, which is suitable for tasks like localization and classification. Of the many annotation formats, YOLO DarkNet annotation was selected. Each image has an annotation text file containing five numbers separated by white space. The first number is an integer indicating the class ID corresponding to the label of that image; the rest are floating-point numbers. Class IDs are mapped in a separate text file named 'obj.names'. The second and third values are the normalized coordinates of the bounding box, while the fourth and fifth define the bounding box's normalized width and height.
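    A minimal parser for one such annotation line (the helper name is illustrative; in the standard YOLO DarkNet convention the two coordinates are the box center):

```python
# Parse one YOLO DarkNet annotation line: class id, then four normalized
# floats (box coordinates, then box width and height).
def parse_yolo_line(line):
    parts = line.split()
    class_id = int(parts[0])
    x_center, y_center, width, height = map(float, parts[1:])
    return class_id, x_center, y_center, width, height

# Convert normalized values back to pixels for a 512x512 USLD image.
cid, xc, yc, w, h = parse_yolo_line("4 0.5 0.5 0.25 0.3")
print(cid, xc * 512, yc * 512, w * 512, h * 512)  # 4 256.0 256.0 128.0 153.6
```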

    This dataset is supported by the Research and Innovation Center, Khulna University, Khulna-9208, Bangladesh, and all the data from this dataset is free to download, modify, and use. The first version (Version 1) of this dataset relied on the oral permission of the volunteers, while the later versions have the written consent of the participants. Therefore, we encourage researchers to use Version 2, 3, or 4 for research purposes.

  19. [Crypto] CoinGecko vs CoinMarketCap Data

    • kaggle.com
    Updated May 11, 2020
    Cite
    Sherpa (2020). [Crypto] CoinGecko vs CoinMarketCap Data [Dataset]. https://www.kaggle.com/thesherpafromalabama/coingecko-vs-coinmarketcap-data/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 11, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sherpa
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Use the CMC_CG_Combo dataset, unless you want to recollect and DIY!

    Context

    On a quest to compare different cryptoexchanges, I came up with the idea to compare metrics across multiple platforms (at the moment just two). CoinGecko and CoinMarketCap are two of the biggest websites for monitoring both exchanges and cryptoprojects. In response to over-inflated volumes faked by crypto exchanges, both websites came up with independent metrics for assessing the worth of a given exchange.

    Content

    Collected on May 10, 2020

    CoinGecko's data is a bit more holistic, containing metrics across a multitude of areas (you can read more in the original blog post). The data from CoinGecko consists of the following:

    • Exchange Name
    • Trust Score (on a scale of N/A-10)
    • Type (centralized/decentralized)
    • AML (risk: how well prepared are they to handle financial crime?)
    • API Coverage (blanket measure that includes: (1) Tickers Data (2) Historical Trades Data (3) Order Book Data (4) Candlestick/OHLC (5) WebSocket API (6) API Trading (7) Public Documentation)
    • API Last Updated (when was the API last updated?)
    • Bid Ask Spread (average buy/sell spread across all pairs)
    • Candlestick (available/not)
    • Combined Orderbook Percentile (see above link)
    • Estimated_Reserves (estimated holdings of major crypto)
    • Grade_Score (overall API score)
    • Historical Data (available/not)
    • Jurisdiction Risk (risk: risk of terrorist activity/bribery/corruption?)
    • KYC Procedures (risk: Know Your Customer?)
    • License and Authorization (risk: has the exchange sought regulatory approval?)
    • Liquidity (don't confuse with "CMC Liquidity"; THIS column is a combo of (1) web traffic & reported volume (2) order book spread (3) trading activity (4) trust score on trading pairs)
    • Negative News (risk: any bad news?)
    • Normalized Trading Volume (trading volume normalized to web traffic)
    • Normalized Volume Percentile (see above blog link)
    • Orderbook (available/not)
    • Public Documentation (got a well-documented API available to everyone?)
    • Regulatory Compliance (risk rating from a compliance perspective)
    • Regulatory Last Updated (last time regulatory metrics were updated)
    • Reported Trading Volume (volume as listed by the exchange)
    • Reported Normalized Trading Volume (ratio of normalized to reported volume [0-1])
    • Sanctions (risk: risk of sanctions?)
    • Scale (based on: (1) Normalized Trading Volume Percentile (2) Normalized Order Book Depth Percentile)
    • Senior Public Figure (risk: does the exchange have transparent public relations? etc.)
    • Tickers (tick tick tick...)
    • Trading via API (can data be traded through the API?)
    • Websocket (got websockets?)

    • Green Pairs (percentage of trading pairs deemed to have good liquidity)
    • Yellow Pairs (percentage of trading pairs deemed to have fair liquidity)
    • Red Pairs (percentage of trading pairs deemed to have poor liquidity)
    • Unknown Pairs (percentage of trading pairs that do not have sufficient order book data)

    ~

    Again, CoinMarketCap only has one metric, which was recently updated and scales from 1-1000 (1000 being very liquid and 1 not). You can go check the article out for yourself. In the dataset, this is the "CMC Liquidity" column, not to be confused with the "Liquidity" column, which refers to the CoinGecko metric!

    Acknowledgements

    Thanks to coingecko and cmc for making their data scrapable :)

    [CMC, you should try to give us a little more access to the figures that define your metric. Thanks!]

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  20. s

    Danish Similarity Data Set

    • sprogteknologi.dk
    Updated Sep 5, 2024
    Cite
    Center for sprogteknologi (2024). Danish Similarity Data Set [Dataset]. https://sprogteknologi.dk/dataset/danish-similarity-data-set
    Explore at:
    http://publications.europa.eu/resource/authority/file-type/csv (available download format)
    Dataset updated
    Sep 5, 2024
    Dataset provided by
    Center for sprogteknologi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Denmark
    Description

    The Danish similarity dataset is a gold standard resource for the evaluation of Danish word embedding models. The dataset consists of 99 word pairs rated by 38 human judges according to their semantic similarity, i.e. the extent to which the two words are similar in meaning, in a normalized 0-1 range. Note that this dataset provides a way of measuring similarity rather than relatedness/association.

    Description of files included in this material (in both files, rows correspond to items, i.e. word pairs, and columns to properties of each item):

    • All_sims_da.csv: Contains the non-normalized mean similarity scores over all judges, along with the non-normalized scores given by each of the 38 judges on the scale 0-6, where 0 is given to the most dissimilar items and 6 to the most similar items.
    • Gold_sims_da.csv: Contains the similarity gold standard for each item, which is the normalized mean similarity score for that item over all judges. Scores are normalized to a 0-1 range, where 0 denotes the minimum degree of similarity and 1 denotes the maximum degree of similarity.
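    As a sketch of the relationship between the two files: mapping a 0-6 mean score to the 0-1 gold range is a min-max rescaling over the rating scale (an assumption consistent with the description above; the function name is illustrative):

```python
# Min-max rescale a mean similarity score from the 0-6 judge scale
# to the 0-1 gold-standard range.
def normalize_score(mean_score, lo=0.0, hi=6.0):
    return (mean_score - lo) / (hi - lo)

print(normalize_score(4.5))  # 0.75
```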

Cite
Chantha Wongoutong (2024). Binary classification using a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0310839.t002

Binary classification using a confusion matrix.

Related Article
Explore at:
3 scholarly articles cite this dataset (View in Google Scholar)
xls
Available download formats
Dataset updated
Dec 6, 2024
Dataset provided by
PLOS ONE
Authors
Chantha Wongoutong
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Despite the popularity of k-means clustering, feature scaling before applying it is an essential yet often neglected step. In this study, feature scaling via five methods (Z-score standardization, Min-Max normalization, Percentile transformation, Maximum absolute scaling, and RobustScaler) was compared with using the raw (i.e., non-scaled) data when analyzing datasets whose features have different or the same units via k-means clustering. The experimental results show that, for features with different units, scaling them before k-means clustering provided better accuracy, precision, recall, and F-score values than using the raw data. Meanwhile, when the features in the dataset had the same unit, scaling them beforehand provided results similar to using the raw data. Thus, scaling the features beforehand is a very important step for datasets with different units, as it improves the clustering results and accuracy. Of the five feature-scaling methods applied to the dataset with features in different units, Z-score standardization and Percentile transformation performed similarly and were superior to the other methods and to the raw data. While Maximum absolute scaling performed slightly better than the other scaling methods and the raw data when the dataset contained features with the same unit, the improvement was not significant.
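The effect described above can be illustrated with a small sketch. This is not the paper's exact protocol or data; it pairs Z-score standardization with a minimal Lloyd's k-means on synthetic two-feature data whose units differ wildly, so the large-unit feature dominates raw Euclidean distances:

```python
import numpy as np

def zscore(X):
    """Z-score standardization: per-feature (x - mean) / std."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm: assign each point to its nearest
    centre, then recompute centres as cluster means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
# Two groups that differ only in the small-unit feature; the large-unit
# feature is pure noise and swamps raw Euclidean distances.
a = np.column_stack([rng.normal(1.0, 0.05, 50), rng.normal(1000, 100, 50)])
b = np.column_stack([rng.normal(2.0, 0.05, 50), rng.normal(1000, 100, 50)])
X = np.vstack([a, b])

labels_raw = kmeans(X, k=2)             # distances dominated by the noisy feature
labels_scaled = kmeans(zscore(X), k=2)  # both features contribute comparably
```

After `zscore`, every feature has mean 0 and standard deviation 1, so no single unit dominates the distance computation, which is the mechanism behind the accuracy gains reported for datasets with mixed units.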
