22 datasets found
  1. KNN DATASET

    • kaggle.com
    zip
    Updated Jul 5, 2023
    Cite
    Pratyush_Ranjan (2023). KNN DATASET [Dataset]. https://www.kaggle.com/datasets/pratyushranjan01/knn-dataset/data
    Explore at:
    zip (59,421 bytes)
    Dataset updated
    Jul 5, 2023
    Authors
    Pratyush_Ranjan
    Description

    K-Nearest Neighbors (KNN) is a popular supervised learning algorithm used for classification and regression tasks. It is a non-parametric algorithm that makes predictions based on the similarity between input features and their neighboring data points.

    A "KNN dataset" typically refers to one that is well suited to applying the KNN algorithm. A few characteristics of such a dataset:

    Numerical features: KNN works with numerical features, so the dataset should contain numerical attributes. If categorical features are present, they need to be converted into numerical representations through techniques like one-hot encoding or label encoding.

    Similarity measure: KNN relies on a distance metric to determine the similarity between data points. Common distance measures include Euclidean distance, Manhattan distance, and cosine similarity. The dataset should have features that can be effectively compared using a distance metric.

    Feature scaling: Since KNN uses distance calculations, it's generally a good practice to scale the features. Features with larger scales can dominate the distance calculations and lead to biased results. Common scaling techniques include standardization (subtracting mean and dividing by standard deviation) or normalization (scaling values to a range, e.g., 0 to 1).

    Sufficient data points: KNN performs best when the dataset has a sufficient number of data points for each class or target value. Having too few instances per class can lead to overfitting or inaccurate predictions.

    It's important to note that the suitability of a dataset for KNN depends on the specific problem and domain. It's always recommended to analyze and preprocess the dataset based on its characteristics before applying any machine learning algorithm, including KNN.
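    The preprocessing and prediction steps above can be sketched in a minimal pure-Python KNN classifier. The toy feature matrix, labels, and choice of k below are illustrative assumptions, not part of any particular dataset:

```python
import math
from collections import Counter

def standardize(X):
    """Scale each feature to zero mean and unit variance (z-scores),
    so that no single feature dominates the distance calculation."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
            for j in range(d)]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in X]

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest neighbors
    (Euclidean distance)."""
    dists = sorted((math.dist(row, x), label)
                   for row, label in zip(X_train, y_train))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Made-up two-feature dataset with two classes.
X = [[1.0, 1.0], [1.0, 2.0], [5.0, 5.0], [6.0, 5.0]]
y = ["a", "a", "b", "b"]
Xs = standardize(X)                     # zero-mean, unit-variance copy of X
pred = knn_predict(X, y, [1.0, 1.5])    # query point near the "a" cluster
```

    In practice the standardization statistics are computed on the training data and reused for every query point, so that train and query features live on the same scale.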

  2. Hyperparameter settings of classification model.

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    + more versions
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Hyperparameter settings of classification model. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t006
    Explore at:
    xls
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Text classification, an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Current research on text classification tends to focus on applications such as information filtering, information retrieval, public opinion monitoring, and library and information science, with few studies applying text classification methods to the field of tourist attractions. In light of this, a corpus of tourist attraction description texts is constructed using web crawler technology in this paper. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. The proposed representation is then combined with each of seven commonly used, well-performing classifiers (DT, SVM, LR, NB, MLP, RF, and KNN) to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of this algorithm are validated by comparing overall performance, per-category performance, and model stability against several commonly used text representation methods. The results demonstrate that the proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry: accuracy, macro-F1, and micro-F1 values are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness.
    In addition, the conclusions obtained from the predicted values and the true values are consistent, indicating that the algorithm is practical. The professional-domain text dataset used in this paper poses higher challenges due to its complexity (uneven text length, relatively imbalanced categories) and the high degree of similarity between categories. Nevertheless, the proposed algorithm can efficiently classify multiple subcategories of this type of text set, which is a beneficial exploration of applying complex Chinese text datasets in specific fields and provides a useful reference for the vector representation and classification of text datasets with similar content.
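    The paper's TF-IDF-CRF-POS weighting extends plain TF-IDF. As a point of reference, the classic TF-IDF baseline it builds on can be sketched as follows; the toy corpus is a made-up example, and the CRF/POS refinements are not reproduced here:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Plain TF-IDF weights for a tokenized corpus: term frequency within
    each document times log of inverse document frequency."""
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        total = len(doc)
        weights.append({term: (count / total) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

# Hypothetical tokenized attraction descriptions.
corpus = [["temple", "mountain", "temple"],
          ["beach", "resort"],
          ["temple", "beach"]]
w = tf_idf(corpus)
```

    The paper's variant replaces the raw term-frequency and IDF factors with weights that also account for total relative term frequency, category discriminability, and part of speech.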

  3. Large-scale attributed graph & hypergraph datasets: TWeibo, Amazon2M,...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Dec 23, 2023
    Cite
    Li, Yiran; Guo, Gongyao; Yang, Renchi; Shi, Jieming (2023). Large-scale attributed graph & hypergraph datasets: TWeibo, Amazon2M, Amazon, MAG-PM [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_10426623
    Explore at:
    Dataset updated
    Dec 23, 2023
    Dataset provided by
    Hong Kong Polytechnic University
    Hong Kong Baptist University
    Authors
    Li, Yiran; Guo, Gongyao; Yang, Renchi; Shi, Jieming
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here we provide additional large-scale datasets used in our work "A Versatile Framework for Attributed Network Clustering via K-Nearest Neighbor Augmentation", along with the index files for constructing KNN graphs using ScaNN and Faiss.

    Usage:

    cd ANCKA/

    unzip ~/Download_path/ANCKA_data.zip -d data/
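    For orientation, the KNN-graph construction that the ScaNN/Faiss index files accelerate can be sketched as a brute-force NumPy version. The toy points and k below are illustrative; real use of these datasets would go through the provided index files:

```python
import numpy as np

def knn_graph(X, k):
    """Brute-force K-nearest-neighbor graph over the row vectors of X.

    Libraries such as Faiss or ScaNN replace this O(n^2) distance
    computation with an approximate index for large n."""
    # Pairwise squared Euclidean distances via the expansion
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)          # exclude self-neighbors
    return np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest rows

# Two tight pairs of points: each point's nearest neighbor is its partner.
nbrs = knn_graph(np.array([[0.0, 0.0], [0.0, 1.0],
                           [10.0, 10.0], [10.0, 11.0]]), k=1)
```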

  4. Data from: Data release for: Evaluating k-nearest neighbor (kNN) imputation...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Data release for: Evaluating k-nearest neighbor (kNN) imputation models for species-level aboveground forest biomass mapping in Northeast China [Dataset]. https://catalog.data.gov/dataset/data-release-for-evaluating-k-nearest-neighbor-knn-imputation-models-for-species-level-abo
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Northeast China
    Description

    Quantifying spatially explicit or pixel-level aboveground forest biomass (AFB) across large regions is critical for measuring forest carbon sequestration capacity, assessing forest carbon balance, and revealing changes in the structure and function of forest ecosystems. When AFB is measured at the species level using widely available remote sensing data, regional changes in forest composition can readily be monitored. In this study, wall-to-wall maps of species-level AFB were generated for forests in Northeast China by integrating forest inventory data with Moderate Resolution Imaging Spectroradiometer (MODIS) images and environmental variables through applying the optimal k-nearest neighbor (kNN) imputation model. By comparing the prediction accuracy of 630 kNN models, we found that the models with random forest (RF) as the distance metric showed the highest accuracy. Compared to the use of single-month MODIS data for September, there was no appreciable improvement for the estimation accuracy of species-level AFB by using multi-month MODIS data. When k > 7, the accuracy improvement of the RF-based kNN models using the single MODIS predictors for September was essentially negligible. Therefore, the kNN model using the RF distance metric, single-month (September) MODIS predictors and k = 7 was the optimal model to impute the species-level AFB for entire Northeast China. Our imputation results showed that average AFB of all species over Northeast China was 101.98 Mg/ha around 2000. Among 17 widespread species, larch was most dominant, with the largest AFB (20.88 Mg/ha), followed by white birch (13.84 Mg/ha). Amur corktree and willow had low AFB (0.91 and 0.96 Mg/ha, respectively). Environmental variables (e.g., climate and topography) had strong relationships with species-level AFB. 
By integrating forest inventory data and remote sensing data with complete spatial coverage using the optimal kNN model, we successfully mapped the AFB distribution of the 17 tree species over Northeast China. We also evaluated the accuracy of AFB at different spatial scales. The AFB estimation accuracy significantly improved from stand level up to the ecotype level, indicating that the AFB maps generated from this study are more suitable to apply to forest ecosystem models (e.g., LINKAGES) which require species-level attributes at the ecotype scale.
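    The study's kNN imputation uses an RF-learned distance metric that cannot be reproduced here, but the basic mechanic (impute a target pixel's response as the mean over its k nearest reference plots in predictor space) can be sketched with a plain Euclidean distance; the reference plots, target pixels, and k below are made-up placeholders:

```python
import numpy as np

def knn_impute(X_ref, y_ref, X_target, k=7):
    """Impute a response (e.g., species-level AFB) for each target row as
    the mean response of the k reference rows nearest in predictor space.

    Euclidean distance stands in for the RF-based metric used in the study.
    """
    # Squared distances from every target row to every reference row.
    d2 = ((X_target[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return y_ref[nearest].mean(axis=1)

# Toy one-predictor example: two reference plots near the target dominate.
imputed = knn_impute(np.array([[0.0], [1.0], [10.0], [11.0]]),
                     np.array([1.0, 1.0, 5.0, 5.0]),
                     np.array([[0.4]]), k=2)
```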

  5. Data from: The Laws of Anomaly: A Framework for Regression Model Selection...

    • dataverse.harvard.edu
    Updated Oct 22, 2025
    Cite
    Priyanuj Boruah (2025). The Laws of Anomaly: A Framework for Regression Model Selection Based on a Large Scale Empirical Study of Structural Data Challenges [Dataset]. http://doi.org/10.7910/DVN/VH9JJA
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 22, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Priyanuj Boruah
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The "ensemble-first" strategy, while a popular heuristic for tabular regression, lacks a formal framework and fails on specific data challenges. This thesis introduces the Efficiency-Based Model Selection Framework (EMSF), a new methodology that aligns model architecture with a dataset's primary structural challenge. We benchmarked over 20 models across 100 real-world datasets, categorized into four novel cohorts, including high row-to-size ratio (computational efficiency), wide data (parameter efficiency), and messy data (data efficiency). This large-scale empirical study establishes three fundamental laws of applied regression. The Law of Ensemble Dominance confirms that ensembles are the most efficient choice in over 70% of standard cases. The Law of Anomaly Supremacy proves the critical exceptions: we provide the first large-scale evidence that K-Nearest Neighbors (KNN) excels on high-dimensional data, and that robust models like the Huber Regressor are "silver bullet" solutions for datasets with hidden outliers, winning with performance margins exceeding 1500%. Finally, the Law of Predictive Futility reframes benchmarking as a diagnostic tool for identifying datasets that lack predictive signal. The EMSF provides a practical, evidence-based playbook for practitioners to move beyond a one-size-fits-all approach to model selection.

  6. Knn Brest Cancer Data set

    • kaggle.com
    zip
    Updated Jul 16, 2025
    Cite
    Ghaffar Baloch (2025). Knn Brest Cancer Data set [Dataset]. https://www.kaggle.com/datasets/haider8578/knn-brest-cancer-data-set
    Explore at:
    zip (49,822 bytes)
    Dataset updated
    Jul 16, 2025
    Authors
    Ghaffar Baloch
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

    This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/

    It can also be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

    Attribute Information:

    1) ID number
    2) Diagnosis (M = malignant, B = benign)
    3-32) Ten real-valued features are computed for each cell nucleus:

    a) radius (mean of distances from center to points on the perimeter)
    b) texture (standard deviation of gray-scale values)
    c) perimeter
    d) area
    e) smoothness (local variation in radius lengths)
    f) compactness (perimeter^2 / area - 1.0)
    g) concavity (severity of concave portions of the contour)
    h) concave points (number of concave portions of the contour)
    i) symmetry
    j) fractal dimension ("coastline approximation" - 1)

    The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

    All feature values are recorded with four significant digits.

    Missing attribute values: none

    Class distribution: 357 benign, 212 malignant
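    The field layout described above (means in fields 3-12, standard errors in fields 13-22, "worst" values in fields 23-32) is a simple offset scheme, which a small sketch makes explicit; the function name is ours, not part of the dataset:

```python
BASE_FEATURES = ["radius", "texture", "perimeter", "area", "smoothness",
                 "compactness", "concavity", "concave points", "symmetry",
                 "fractal dimension"]
STAT_BLOCKS = ["mean", "se", "worst"]  # three blocks of ten fields each

def field_number(feature, stat):
    """Return the 1-based field number for a feature/statistic pair:
    field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius."""
    return 3 + STAT_BLOCKS.index(stat) * 10 + BASE_FEATURES.index(feature)
```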

  7. Tourist attraction description text data.

    • figshare.com
    • plos.figshare.com
    txt
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Tourist attraction description text data. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.s003
    Explore at:
    txt
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary file from the same PLOS ONE article as dataset 2 ("Hyperparameter settings of classification model."); the description is identical to that entry's.

  8. laion-400M

    • kaggle.com
    • opendatalab.com
    • +1more
    zip
    Updated Sep 5, 2021
    Cite
    Romain Beaumont (2021). laion-400M [Dataset]. https://www.kaggle.com/romainbeaumont/laion400m
    Explore at:
    zip (48,835,402,620 bytes)
    Dataset updated
    Sep 5, 2021
    Authors
    Romain Beaumont
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Concept

    The LAION-400M dataset is completely open and freely accessible.

    Check https://laion.ai/laion-400-open-dataset/ for the full description of this dataset.

    All images and texts in the LAION-400M dataset have been filtered with OpenAI's CLIP by computing the cosine similarity between the text and image embeddings and dropping those with a similarity below 0.3.

    The threshold of 0.3 was determined through human evaluations and seems to be a good heuristic for estimating semantic image-text matching.

    The image-text-pairs have been extracted from the Common Crawl webdata dump and are from random web pages crawled between 2014 and 2021.

    Use img2dataset to download subsets of this.

    Dataset Statistics

    LAION-400M (and future, even bigger releases) is in fact a dataset of datasets. For instance, it can be filtered by image size into smaller subsets:

    Number of unique samples: 413M
    Height or width >= 1024: 26M
    Height and width >= 1024: 9.6M
    Height or width >= 512: 112M
    Height and width >= 512: 67M
    Height or width >= 256: 268M
    Height and width >= 256: 211M

    Using the kNN index, specialized subsets can also be extracted by domain of interest. These are (or will be) sufficient in size to train domain-specialized models.

    Random Samples from the dataset

    http://gallerytest.christoph-schuhmann.de/photos/index.php?/category/4 (todo: replace link with local gallery). https://rom1504.github.io/clip-retrieval/ is a simple visualization of the dataset, where you can search it using CLIP and a kNN index.

    LAION-400M Open Dataset structure

    We produced the dataset in several formats to address the various use cases:

    • a 50GB url+caption metadata dataset in parquet files, which can be used to compute statistics and re-download parts of the dataset
    • a 10TB webdataset with 256x256 images, captions, and metadata; this is the full version of the dataset and can be used directly for training
    • a 1TB set of the 400M text and image CLIP embeddings, useful for rebuilding new kNN indices
    • two 4GB kNN indices allowing easy search in the dataset

    In this kaggle, we provide the url and caption metadata dataset. Check https://laion.ai/laion-400-open-dataset/ for the other formats and the full explanation.

    URL and caption metadata dataset.

    We provide 32 parquet files of size around 1GB (total 50GB) with the image URLs, the associated texts and additional metadata in the following format:

    SAMPLE_ID | URL | TEXT | LICENSE | NSFW | similarity | WIDTH | HEIGHT

    where

    • SAMPLE_ID: a unique identifier
    • URL, TEXT: the image URL and its associated caption
    • LICENSE: if a Creative Commons license could be extracted from the image data, it is named here, e.g. "creativecommons.org/licenses/by-nc-sa/3.0/"; otherwise the field contains "?"
    • NSFW: CLIP was used to estimate whether the image has NSFW content. The estimation is fairly conservative, reducing the number of false negatives at the cost of more false positives; possible values are "UNLIKELY", "UNSURE", and "NSFW"
    • similarity: value of the cosine similarity between the text and image embeddings
    • WIDTH and HEIGHT: image size as the image was embedded; originals larger than 4K were resized to 4K

    This metadata dataset is best used to redownload the whole dataset or a subset of it. The img2dataset tool can be used to efficiently download such subsets.
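    Selecting such a subset is a row filter over the schema above. A sketch over toy in-memory rows (in practice the rows would be read from the parquet files, and the URLs then fetched with img2dataset):

```python
# Toy rows following the SAMPLE_ID | URL | TEXT | LICENSE | NSFW |
# similarity | WIDTH | HEIGHT schema; real rows live in the parquet files.
rows = [
    {"SAMPLE_ID": 1, "similarity": 0.35, "WIDTH": 512,  "HEIGHT": 300},
    {"SAMPLE_ID": 2, "similarity": 0.31, "WIDTH": 128,  "HEIGHT": 128},
    {"SAMPLE_ID": 3, "similarity": 0.45, "WIDTH": 1024, "HEIGHT": 768},
]

def subset(rows, min_side=256, min_sim=0.3):
    """Keep rows whose height AND width are at least min_side and whose
    CLIP text-image similarity is at least min_sim."""
    return [r for r in rows
            if r["WIDTH"] >= min_side and r["HEIGHT"] >= min_side
            and r["similarity"] >= min_sim]

picked = subset(rows)   # drops the 128x128 sample
```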

  9. Customized stop words list.

    • plos.figshare.com
    txt
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Customized stop words list. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.s002
    Explore at:
    txt
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary file from the same PLOS ONE article as dataset 2 ("Hyperparameter settings of classification model."); the description is identical to that entry's.

  10. Public Multimodal Dataset - Dataset - LDM

    • service.tib.eu
    • resodate.org
    Updated Dec 16, 2024
    Cite
    (2024). Public Multimodal Dataset - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/public-multimodal-dataset
    Explore at:
    Dataset updated
    Dec 16, 2024
    Description

    The dataset used for training the kNN-Diffusion model, which uses large-scale retrieval to train a text-to-image model without any text data.

  11. Hyperparameter settings for Word2Vec and Doc2vec.

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Hyperparameter settings for Word2Vec and Doc2vec. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t005
    Explore at:
    xls
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary file from the same PLOS ONE article as dataset 2 ("Hyperparameter settings of classification model."); the description is identical to that entry's.

  12. Distribution of experimental dataset categories.

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    + more versions
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Distribution of experimental dataset categories. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t004
    Explore at:
    xls
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Text classification, as an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Current research on text classification tends to focus on applications such as information filtering, information retrieval, public opinion monitoring, and library and information science, with few studies applying text classification methods to the field of tourist attractions. In light of this, a corpus of tourist attraction description texts is constructed in this paper using web crawler technology. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. The proposed representation is then combined with seven commonly used, well-performing classifiers (DT, SVM, LR, NB, MLP, RF, and KNN) to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of this algorithm are validated by comparing overall performance, per-category performance, and model stability against several commonly used text representation methods. The results demonstrate that the proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry: its accuracy, macro-F1, and micro-F1 values are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness.
In addition, the conclusions obtained from the predicted values are consistent with those from the true values, indicating that the algorithm is practical. The professional-domain text dataset used in this paper is particularly challenging due to its complexity (uneven text length, relatively imbalanced categories) and the high degree of similarity between categories. Nevertheless, the proposed algorithm can efficiently classify multiple subcategories of this type of text set, which is a useful exploration of applying text classification to complex domain-specific Chinese text datasets and provides a helpful reference for the vector representation and classification of text datasets with similar content.
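The weighting scheme described above builds on classical TF-IDF. As a point of reference, here is a minimal pure-Python sketch of that baseline only; the paper's TF-IDF-CRF-POS variant (with its relative-frequency, category-discriminability, and part-of-speech factors) is not reproduced here, and the toy corpus is invented for illustration:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute classical TF-IDF weights for a tokenized corpus.

    tf = term count / document length; idf = ln(N / df), where df is the
    number of documents containing the term.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy attraction-description corpus (hypothetical tokens).
docs = [["temple", "mountain", "temple"], ["lake", "mountain"], ["museum"]]
w = tf_idf(docs)
```

A term unique to one document ("museum") receives the maximal idf, while a term shared across documents ("mountain") is down-weighted, which is the behavior the augmented scheme refines further.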

  13. Binary contingency table.

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Binary contingency table. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t001
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Text classification, as an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Current research on text classification tends to focus on applications such as information filtering, information retrieval, public opinion monitoring, and library and information science, with few studies applying text classification methods to the field of tourist attractions. In light of this, a corpus of tourist attraction description texts is constructed in this paper using web crawler technology. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. The proposed representation is then combined with seven commonly used, well-performing classifiers (DT, SVM, LR, NB, MLP, RF, and KNN) to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of this algorithm are validated by comparing overall performance, per-category performance, and model stability against several commonly used text representation methods. The results demonstrate that the proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry: its accuracy, macro-F1, and micro-F1 values are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness.
In addition, the conclusions obtained from the predicted values are consistent with those from the true values, indicating that the algorithm is practical. The professional-domain text dataset used in this paper is particularly challenging due to its complexity (uneven text length, relatively imbalanced categories) and the high degree of similarity between categories. Nevertheless, the proposed algorithm can efficiently classify multiple subcategories of this type of text set, which is a useful exploration of applying text classification to complex domain-specific Chinese text datasets and provides a helpful reference for the vector representation and classification of text datasets with similar content.

  14. multiple-sclerosis

    • kaggle.com
    zip
    Updated Jun 15, 2025
    Cite
    BURAK TAŞCI (2025). multiple-sclerosis [Dataset]. https://www.kaggle.com/buraktaci/multiple-sclerosis
    Explore at:
    zip(446273066 bytes)Available download formats
    Dataset updated
    Jun 15, 2025
    Authors
    BURAK TAŞCI
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Cite:Macin G, Tasci B, Tasci I, Faust O, Barua PD, Dogan S, Tuncer T, Tan R-S, Acharya UR. An Accurate Multiple Sclerosis Detection Model Based on Exemplar Multiple Parameters Local Phase Quantization: ExMPLPQ. Applied Sciences. 2022; 12(10):4920. https://doi.org/10.3390/app12104920

    An Accurate Multiple Sclerosis Detection Model Based on Exemplar Multiple Parameters Local Phase Quantization: ExMPLPQ Multiple sclerosis (MS) is a chronic demyelinating condition characterized by plaques in the white matter of the central nervous system that can be detected using magnetic resonance imaging (MRI). Many deep learning models for automated MS detection based on MRI have been presented in the literature. We developed a computationally lightweight machine learning model for MS diagnosis using a novel handcrafted feature engineering approach. The study dataset comprised axial and sagittal brain MRI images that were prospectively acquired from 72 MS and 59 healthy subjects who attended the Ozal University Medical Faculty in 2021. The dataset was divided into three study subsets: axial images only (n = 1652), sagittal images only (n = 1775), and combined axial and sagittal images (n = 3427) of both MS and healthy classes. All images were resized to 224 × 224. Subsequently, the features were generated with a fixed-size patch-based (exemplar) feature extraction model based on local phase quantization (LPQ) with three-parameter settings. The resulting exemplar multiple parameters LPQ (ExMPLPQ) features were concatenated to form a large final feature vector. The top discriminative features were selected using iterative neighborhood component analysis (INCA). Finally, a k-nearest neighbor (kNN) algorithm, Fine kNN, was deployed to perform binary classification of the brain images into MS vs. healthy classes. The ExMPLPQ-based model attained 98.37%, 97.75%, and 98.22% binary classification accuracy rates for axial, sagittal, and hybrid datasets, respectively, using Fine kNN with 10-fold cross-validation. Furthermore, our model outperformed 19 established pre-trained deep learning models that were trained and tested with the same data. Unlike deep models, the ExMPLPQ-based model is computationally lightweight yet highly accurate. 
It has the potential to be implemented as an automated diagnostic tool to screen brain MRIs for white matter lesions in suspected MS patients.
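    The evaluation protocol described above (a k-nearest-neighbor classifier scored with 10-fold cross-validation) can be sketched as follows. Synthetic data stands in for the ExMPLPQ feature vectors, and k=1 is only a rough stand-in for the "Fine kNN" configuration; neither is taken from the paper:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data in place of the real feature vectors.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

clf = KNeighborsClassifier(n_neighbors=1)
# 10-fold cross-validation, as used in the study.
scores = cross_val_score(clf, X, y, cv=10)
print(f"mean 10-fold accuracy: {scores.mean():.3f}")
```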

  15. The optimal combination of text representations and classifiers for...

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). The optimal combination of text representations and classifiers for different-size text sets. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t011
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The optimal combination of text representations and classifiers for different-size text sets.

  16. Classification performance of different combinations of text representation...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Classification performance of different combinations of text representation method & classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t010
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification performance of different combinations of text representation method & classifier.

  17. Classification performance of the entire test set in the MLP classifier...

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Classification performance of the entire test set in the MLP classifier during the improved processes. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t008
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification performance of the entire test set in the MLP classifier during the improved processes.

  18. Quantity ratio of the true and predicted values of various attractions in...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Quantity ratio of the true and predicted values of various attractions in the two provinces. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t013
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Quantity ratio of the true and predicted values of various attractions in the two provinces.

  19. Comparing true-predicted values of the top 2–3 categories in different-level...

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Comparing true-predicted values of the top 2–3 categories in different-level attractions in the two provinces. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t014
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparing true-predicted values of the top 2–3 categories in different-level attractions in the two provinces.

  20. Comparing true and predicted values of the top 2–3 categories in different...

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Comparing true and predicted values of the top 2–3 categories in different level attractions. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t012
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparing true and predicted values of the top 2–3 categories in different level attractions.

Cite
Pratyush_Ranjan (2023). KNN DATASET [Dataset]. https://www.kaggle.com/datasets/pratyushranjan01/knn-dataset/data

KNN DATASET

Play with this dataset as much as you can, since I believe in learning by doing.

Explore at:
zip(59421 bytes)Available download formats
Dataset updated
Jul 5, 2023
Authors
Pratyush_Ranjan
Description

K-Nearest Neighbors (KNN) is a popular supervised learning algorithm used for classification and regression tasks. It is a non-parametric algorithm that makes predictions based on the similarity between input features and their neighboring data points.

A "KNN dataset" typically refers to a dataset that is well suited to applying the KNN algorithm. Here are a few characteristics of such a dataset:

Numerical features: KNN works with numerical features, so the dataset should contain numerical attributes. If categorical features are present, they need to be converted into numerical representations through techniques like one-hot encoding or label encoding.
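One-hot encoding can be done with library helpers, but the idea fits in a few lines of plain Python. This is a minimal sketch with an invented example column; the helper name `one_hot` is ours:

```python
def one_hot(values):
    """Map each categorical value to a binary indicator vector,
    one column per distinct category (columns in sorted order)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

colors = ["red", "green", "red", "blue"]
encoded = one_hot(colors)  # columns ordered: blue, green, red
```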

Similarity measure: KNN relies on a distance metric to determine the similarity between data points. Common distance measures include Euclidean distance, Manhattan distance, and cosine similarity. The dataset should have features that can be effectively compared using a distance metric.
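The three distance/similarity measures mentioned above are easy to write out directly. A minimal sketch (helper names are ours, not from any library):

```python
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute per-coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

p, q = [1.0, 2.0], [4.0, 6.0]
```

Note that Euclidean and Manhattan are distances (smaller = more similar), while cosine is a similarity (larger = more similar), so the two kinds are used with opposite orderings in a KNN search.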

Feature scaling: Since KNN uses distance calculations, it's generally a good practice to scale the features. Features with larger scales can dominate the distance calculations and lead to biased results. Common scaling techniques include standardization (subtracting mean and dividing by standard deviation) or normalization (scaling values to a range, e.g., 0 to 1).
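Both scaling techniques just described can be sketched in a few lines; the toy column and helper names are invented for illustration:

```python
def standardize(xs):
    """Shift to zero mean, scale to unit (population) standard deviation."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

def min_max(xs):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

raw = [10.0, 20.0, 30.0]
```

In practice the scaling parameters (mean/std or min/max) should be computed on the training set only and then reused for the test set, so that no information leaks from test to train.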

Sufficient data points: KNN performs best when the dataset has a sufficient number of data points for each class or target value. Having too few instances per class can lead to overfitting or inaccurate predictions.

It's important to note that the suitability of a dataset for KNN depends on the specific problem and domain. It's always recommended to analyze and preprocess the dataset based on its characteristics before applying any machine learning algorithm, including KNN.
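Putting the pieces together, the core KNN prediction step (find the k nearest training points, take a majority vote) can be sketched in pure Python. The tiny two-cluster dataset and the helper name `knn_predict` are invented for illustration:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query.
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Majority vote over the k closest labels.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

X = [[1, 1], [1, 2], [8, 8], [9, 8]]
y = ["a", "a", "b", "b"]
```

An odd k avoids ties in binary problems; choosing k itself is usually done via cross-validation.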
