22 datasets found
  1. KNN DATASET

    • kaggle.com
    zip
    Updated Jul 5, 2023
    Cite
    Pratyush_Ranjan (2023). KNN DATASET [Dataset]. https://www.kaggle.com/datasets/pratyushranjan01/knn-dataset/data
    Explore at:
    zip (59,421 bytes)
    Dataset updated
    Jul 5, 2023
    Authors
    Pratyush_Ranjan
    Description

    K-Nearest Neighbors (KNN) is a popular supervised learning algorithm used for classification and regression tasks. It is a non-parametric algorithm that makes predictions based on the similarity between input features and their neighboring data points.

    A "KNN dataset" typically refers to one that is well suited to applying the KNN algorithm. A few characteristics of such a dataset:

    Numerical features: KNN works with numerical features, so the dataset should contain numerical attributes. If categorical features are present, they need to be converted into numerical representations through techniques like one-hot encoding or label encoding.

    Similarity measure: KNN relies on a distance metric to determine the similarity between data points. Common distance measures include Euclidean distance, Manhattan distance, and cosine similarity. The dataset should have features that can be effectively compared using a distance metric.

    Feature scaling: Since KNN uses distance calculations, it's generally a good practice to scale the features. Features with larger scales can dominate the distance calculations and lead to biased results. Common scaling techniques include standardization (subtracting mean and dividing by standard deviation) or normalization (scaling values to a range, e.g., 0 to 1).

    Sufficient data points: KNN performs best when the dataset has a sufficient number of data points for each class or target value. Having too few instances per class can lead to overfitting or inaccurate predictions.

    It's important to note that the suitability of a dataset for KNN depends on the specific problem and domain. It's always recommended to analyze and preprocess the dataset based on its characteristics before applying any machine learning algorithm, including KNN.
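    The preprocessing and prediction steps above can be sketched in a minimal pure-Python KNN classifier. The toy feature matrix, labels, and choice of k below are illustrative assumptions, not part of any particular dataset:

```python
import math
from collections import Counter

def standardize(X):
    """Scale each feature to zero mean and unit variance (z-scores),
    so that no single feature dominates the distance calculation."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
            for j in range(d)]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in X]

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest neighbors
    (Euclidean distance)."""
    dists = sorted((math.dist(row, x), label)
                   for row, label in zip(X_train, y_train))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Made-up two-feature dataset with two classes.
X = [[1.0, 1.0], [1.0, 2.0], [5.0, 5.0], [6.0, 5.0]]
y = ["a", "a", "b", "b"]
Xs = standardize(X)                     # zero-mean, unit-variance copy of X
pred = knn_predict(X, y, [1.0, 1.5])    # query point near the "a" cluster
```

    In practice the standardization statistics are computed on the training data and reused for every query point, so that train and query features live on the same scale.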

  2. Hyperparameter settings of classification model.

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    + more versions
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Hyperparameter settings of classification model. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t006
    Explore at:
    xls
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Text classification, an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Current research on text classification tends to focus on applications such as information filtering, information retrieval, public opinion monitoring, and library and information science, with few studies applying text classification methods to the field of tourist attractions. In light of this, a corpus of tourist attraction description texts is constructed using web crawler technology in this paper. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. The proposed representation is then combined with each of seven commonly used, well-performing classifiers (DT, SVM, LR, NB, MLP, RF, and KNN) to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of this algorithm are validated by comparing overall performance, per-category performance, and model stability against several commonly used text representation methods. The results demonstrate that the proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry: accuracy, macro-F1, and micro-F1 values are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness.
    In addition, the conclusions obtained from the predicted values and the true values are consistent, indicating that the algorithm is practical. The professional-domain text dataset used in this paper poses higher challenges due to its complexity (uneven text length, relatively imbalanced categories) and the high degree of similarity between categories. Nevertheless, the proposed algorithm can efficiently classify multiple subcategories of this type of text set, which is a beneficial exploration of applying complex Chinese text datasets in specific fields and provides a useful reference for the vector representation and classification of text datasets with similar content.
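    The paper's TF-IDF-CRF-POS weighting extends plain TF-IDF. As a point of reference, the classic TF-IDF baseline it builds on can be sketched as follows; the toy corpus is a made-up example, and the CRF/POS refinements are not reproduced here:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Plain TF-IDF weights for a tokenized corpus: term frequency within
    each document times log of inverse document frequency."""
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        total = len(doc)
        weights.append({term: (count / total) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

# Hypothetical tokenized attraction descriptions.
corpus = [["temple", "mountain", "temple"],
          ["beach", "resort"],
          ["temple", "beach"]]
w = tf_idf(corpus)
```

    The paper's variant replaces the raw term-frequency and IDF factors with weights that also account for total relative term frequency, category discriminability, and part of speech.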

  3. Large-scale attributed graph & hypergraph datasets: TWeibo, Amazon2M,...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Dec 23, 2023
    Cite
    Li, Yiran; Guo, Gongyao; Yang, Renchi; Shi, Jieming (2023). Large-scale attributed graph & hypergraph datasets: TWeibo, Amazon2M, Amazon, MAG-PM [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_10426623
    Explore at:
    Dataset updated
    Dec 23, 2023
    Dataset provided by
    Hong Kong Polytechnic University
    Hong Kong Baptist University
    Authors
    Li, Yiran; Guo, Gongyao; Yang, Renchi; Shi, Jieming
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here we provide additional large-scale datasets used in our work "A Versatile Framework for Attributed Network Clustering via K-Nearest Neighbor Augmentation", along with the index files for constructing KNN graphs using ScaNN and Faiss.

    Usage:

    cd ANCKA/

    unzip ~/Download_path/ANCKA_data.zip -d data/
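    For orientation, the KNN-graph construction that the ScaNN/Faiss index files accelerate can be sketched as a brute-force NumPy version. The toy points and k below are illustrative; real use of these datasets would go through the provided index files:

```python
import numpy as np

def knn_graph(X, k):
    """Brute-force K-nearest-neighbor graph over the row vectors of X.

    Libraries such as Faiss or ScaNN replace this O(n^2) distance
    computation with an approximate index for large n."""
    # Pairwise squared Euclidean distances via the expansion
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)          # exclude self-neighbors
    return np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest rows

# Two tight pairs of points: each point's nearest neighbor is its partner.
nbrs = knn_graph(np.array([[0.0, 0.0], [0.0, 1.0],
                           [10.0, 10.0], [10.0, 11.0]]), k=1)
```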

  4. Data from: Data release for: Evaluating k-nearest neighbor (kNN) imputation...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Data release for: Evaluating k-nearest neighbor (kNN) imputation models for species-level aboveground forest biomass mapping in Northeast China [Dataset]. https://catalog.data.gov/dataset/data-release-for-evaluating-k-nearest-neighbor-knn-imputation-models-for-species-level-abo
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Northeast China
    Description

    Quantifying spatially explicit or pixel-level aboveground forest biomass (AFB) across large regions is critical for measuring forest carbon sequestration capacity, assessing forest carbon balance, and revealing changes in the structure and function of forest ecosystems. When AFB is measured at the species level using widely available remote sensing data, regional changes in forest composition can readily be monitored. In this study, wall-to-wall maps of species-level AFB were generated for forests in Northeast China by integrating forest inventory data with Moderate Resolution Imaging Spectroradiometer (MODIS) images and environmental variables through applying the optimal k-nearest neighbor (kNN) imputation model. By comparing the prediction accuracy of 630 kNN models, we found that the models with random forest (RF) as the distance metric showed the highest accuracy. Compared to the use of single-month MODIS data for September, there was no appreciable improvement for the estimation accuracy of species-level AFB by using multi-month MODIS data. When k > 7, the accuracy improvement of the RF-based kNN models using the single MODIS predictors for September was essentially negligible. Therefore, the kNN model using the RF distance metric, single-month (September) MODIS predictors and k = 7 was the optimal model to impute the species-level AFB for entire Northeast China. Our imputation results showed that average AFB of all species over Northeast China was 101.98 Mg/ha around 2000. Among 17 widespread species, larch was most dominant, with the largest AFB (20.88 Mg/ha), followed by white birch (13.84 Mg/ha). Amur corktree and willow had low AFB (0.91 and 0.96 Mg/ha, respectively). Environmental variables (e.g., climate and topography) had strong relationships with species-level AFB. 
By integrating forest inventory data and remote sensing data with complete spatial coverage using the optimal kNN model, we successfully mapped the AFB distribution of the 17 tree species over Northeast China. We also evaluated the accuracy of AFB at different spatial scales. The AFB estimation accuracy significantly improved from stand level up to the ecotype level, indicating that the AFB maps generated from this study are more suitable to apply to forest ecosystem models (e.g., LINKAGES) which require species-level attributes at the ecotype scale.
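    The study's kNN imputation uses an RF-learned distance metric that cannot be reproduced here, but the basic mechanic (impute a target pixel's response as the mean over its k nearest reference plots in predictor space) can be sketched with a plain Euclidean distance; the reference plots, target pixels, and k below are made-up placeholders:

```python
import numpy as np

def knn_impute(X_ref, y_ref, X_target, k=7):
    """Impute a response (e.g., species-level AFB) for each target row as
    the mean response of the k reference rows nearest in predictor space.

    Euclidean distance stands in for the RF-based metric used in the study.
    """
    # Squared distances from every target row to every reference row.
    d2 = ((X_target[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return y_ref[nearest].mean(axis=1)

# Toy one-predictor example: two reference plots near the target dominate.
imputed = knn_impute(np.array([[0.0], [1.0], [10.0], [11.0]]),
                     np.array([1.0, 1.0, 5.0, 5.0]),
                     np.array([[0.4]]), k=2)
```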

  5. Data from: The Laws of Anomaly: A Framework for Regression Model Selection...

    • dataverse.harvard.edu
    Updated Oct 22, 2025
    Cite
    Priyanuj Boruah (2025). The Laws of Anomaly: A Framework for Regression Model Selection Based on a Large Scale Empirical Study of Structural Data Challenges [Dataset]. http://doi.org/10.7910/DVN/VH9JJA
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 22, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Priyanuj Boruah
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The "ensemble-first" strategy, while a popular heuristic for tabular regression, lacks a formal framework and fails on specific data challenges. This thesis introduces the Efficiency-Based Model Selection Framework (EMSF), a new methodology that aligns model architecture with a dataset's primary structural challenge. We benchmarked over 20 models across 100 real-world datasets, categorized into four novel cohorts, including high row-to-size ratio (computational efficiency), wide data (parameter efficiency), and messy data (data efficiency). This large-scale empirical study establishes three fundamental laws of applied regression. The Law of Ensemble Dominance confirms that ensembles are the most efficient choice in over 70% of standard cases. The Law of Anomaly Supremacy proves the critical exceptions: we provide the first large-scale evidence that K-Nearest Neighbors (KNN) excels on high-dimensional data, and that robust models like the Huber Regressor are "silver bullet" solutions for datasets with hidden outliers, winning with performance margins exceeding 1500%. Finally, the Law of Predictive Futility reframes benchmarking as a diagnostic tool for identifying datasets that lack predictive signal. The EMSF provides a practical, evidence-based playbook for practitioners to move beyond a one-size-fits-all approach to model selection.

  6. Knn Brest Cancer Data set

    • kaggle.com
    zip
    Updated Jul 16, 2025
    Cite
    Ghaffar Baloch (2025). Knn Brest Cancer Data set [Dataset]. https://www.kaggle.com/datasets/haider8578/knn-brest-cancer-data-set
    Explore at:
    zip (49,822 bytes)
    Dataset updated
    Jul 16, 2025
    Authors
    Ghaffar Baloch
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

    This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/

    It can also be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

    Attribute Information:

    1) ID number
    2) Diagnosis (M = malignant, B = benign)
    3-32) Ten real-valued features are computed for each cell nucleus:

    a) radius (mean of distances from center to points on the perimeter)
    b) texture (standard deviation of gray-scale values)
    c) perimeter
    d) area
    e) smoothness (local variation in radius lengths)
    f) compactness (perimeter^2 / area - 1.0)
    g) concavity (severity of concave portions of the contour)
    h) concave points (number of concave portions of the contour)
    i) symmetry
    j) fractal dimension ("coastline approximation" - 1)

    The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

    All feature values are recorded with four significant digits.

    Missing attribute values: none

    Class distribution: 357 benign, 212 malignant
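    The field layout described above (means in fields 3-12, standard errors in fields 13-22, "worst" values in fields 23-32) is a simple offset scheme, which a small sketch makes explicit; the function name is ours, not part of the dataset:

```python
BASE_FEATURES = ["radius", "texture", "perimeter", "area", "smoothness",
                 "compactness", "concavity", "concave points", "symmetry",
                 "fractal dimension"]
STAT_BLOCKS = ["mean", "se", "worst"]  # three blocks of ten fields each

def field_number(feature, stat):
    """Return the 1-based field number for a feature/statistic pair:
    field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius."""
    return 3 + STAT_BLOCKS.index(stat) * 10 + BASE_FEATURES.index(feature)
```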

  7. Tourist attraction description text data.

    • figshare.com
    • plos.figshare.com
    txt
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Tourist attraction description text data. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.s003
    Explore at:
    txt
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary file from the same PLOS ONE article as dataset 2 ("Hyperparameter settings of classification model."); the description is identical to that entry's.

  8. laion-400M

    • kaggle.com
    • opendatalab.com
    • +1more
    zip
    Updated Sep 5, 2021
    Cite
    Romain Beaumont (2021). laion-400M [Dataset]. https://www.kaggle.com/romainbeaumont/laion400m
    Explore at:
    zip (48,835,402,620 bytes)
    Dataset updated
    Sep 5, 2021
    Authors
    Romain Beaumont
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Concept

    The LAION-400M dataset is completely open and freely accessible.

    Check https://laion.ai/laion-400-open-dataset/ for the full description of this dataset.

    All images and texts in the LAION-400M dataset have been filtered with OpenAI's CLIP by computing the cosine similarity between the text and image embeddings and dropping those with a similarity below 0.3.

    The threshold of 0.3 was determined through human evaluations and seems to be a good heuristic for estimating semantic image-text matching.

    The image-text-pairs have been extracted from the Common Crawl webdata dump and are from random web pages crawled between 2014 and 2021.

    Use img2dataset to download subsets of this.

    Dataset Statistics

    LAION-400M (and future, even bigger releases) is in fact a dataset of datasets. For instance, it can be filtered by image size into smaller subsets:

    Number of unique samples: 413M
    Height or width >= 1024: 26M
    Height and width >= 1024: 9.6M
    Height or width >= 512: 112M
    Height and width >= 512: 67M
    Height or width >= 256: 268M
    Height and width >= 256: 211M

    Using the kNN index, specialized subsets can also be extracted by domain of interest. These are (or will be) sufficient in size to train domain-specialized models.

    Random Samples from the dataset

    http://gallerytest.christoph-schuhmann.de/photos/index.php?/category/4 (todo: replace link with local gallery). https://rom1504.github.io/clip-retrieval/ is a simple visualization of the dataset, where you can search it using CLIP and a kNN index.

    LAION-400M Open Dataset structure

    We produced the dataset in several formats to address the various use cases:

    • a 50GB url+caption metadata dataset in parquet files, which can be used to compute statistics and re-download parts of the dataset
    • a 10TB webdataset with 256x256 images, captions, and metadata; this is the full version of the dataset and can be used directly for training
    • a 1TB set of the 400M text and image CLIP embeddings, useful for rebuilding new kNN indices
    • two 4GB kNN indices allowing easy search in the dataset

    In this kaggle, we provide the url and caption metadata dataset. Check https://laion.ai/laion-400-open-dataset/ for the other formats and the full explanation.

    URL and caption metadata dataset.

    We provide 32 parquet files of size around 1GB (total 50GB) with the image URLs, the associated texts and additional metadata in the following format:

    SAMPLE_ID | URL | TEXT | LICENSE | NSFW | similarity | WIDTH | HEIGHT

    where

    • SAMPLE_ID: a unique identifier
    • URL, TEXT: the image URL and its associated caption
    • LICENSE: if a Creative Commons license could be extracted from the image data, it is named here, e.g. "creativecommons.org/licenses/by-nc-sa/3.0/"; otherwise the field contains "?"
    • NSFW: CLIP was used to estimate whether the image has NSFW content. The estimation is fairly conservative, reducing the number of false negatives at the cost of more false positives; possible values are "UNLIKELY", "UNSURE", and "NSFW"
    • similarity: value of the cosine similarity between the text and image embeddings
    • WIDTH and HEIGHT: image size as the image was embedded; originals larger than 4K were resized to 4K

    This metadata dataset is best used to redownload the whole dataset or a subset of it. The img2dataset tool can be used to efficiently download such subsets.
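    Selecting such a subset is a row filter over the schema above. A sketch over toy in-memory rows (in practice the rows would be read from the parquet files, and the URLs then fetched with img2dataset):

```python
# Toy rows following the SAMPLE_ID | URL | TEXT | LICENSE | NSFW |
# similarity | WIDTH | HEIGHT schema; real rows live in the parquet files.
rows = [
    {"SAMPLE_ID": 1, "similarity": 0.35, "WIDTH": 512,  "HEIGHT": 300},
    {"SAMPLE_ID": 2, "similarity": 0.31, "WIDTH": 128,  "HEIGHT": 128},
    {"SAMPLE_ID": 3, "similarity": 0.45, "WIDTH": 1024, "HEIGHT": 768},
]

def subset(rows, min_side=256, min_sim=0.3):
    """Keep rows whose height AND width are at least min_side and whose
    CLIP text-image similarity is at least min_sim."""
    return [r for r in rows
            if r["WIDTH"] >= min_side and r["HEIGHT"] >= min_side
            and r["similarity"] >= min_sim]

picked = subset(rows)   # drops the 128x128 sample
```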

  9. Customized stop words list.

    • plos.figshare.com
    txt
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Customized stop words list. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.s002
    Explore at:
    txt
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary file from the same PLOS ONE article as dataset 2 ("Hyperparameter settings of classification model."); the description is identical to that entry's.

  10. Public Multimodal Dataset - Dataset - LDM

    • service.tib.eu
    • resodate.org
    Updated Dec 16, 2024
    Cite
    (2024). Public Multimodal Dataset - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/public-multimodal-dataset
    Explore at:
    Dataset updated
    Dec 16, 2024
    Description

    The dataset used for training the kNN-Diffusion model, which uses large-scale retrieval to train a text-to-image model without any text data.

  11. Hyperparameter settings for Word2Vec and Doc2vec.

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Hyperparameter settings for Word2Vec and Doc2vec. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t005
    Explore at:
    xls
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary file from the same PLOS ONE article as dataset 2 ("Hyperparameter settings of classification model."); the description is identical to that entry's.

  12. Distribution of experimental dataset categories.

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    + more versions
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Distribution of experimental dataset categories. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t004
    Explore at:
    xls
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Text classification, as an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Current research on text classification tends to focus on applications such as information filtering, information retrieval, public opinion monitoring, and library and information science, with few studies applying text classification methods to the field of tourist attractions. In light of this, a corpus of tourist attraction description texts is constructed in this paper using web crawler technology. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. The proposed representation is then combined with seven commonly used, well-performing classifiers (DT, SVM, LR, NB, MLP, RF, and KNN) to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of this algorithm are validated by comparing overall performance, per-category performance, and model stability against several commonly used text representation methods. The results demonstrate that the proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry: its accuracy, macro-F1, and micro-F1 values are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness.
In addition, the conclusions obtained from the predicted values are consistent with those from the true values, indicating that the algorithm is practical. The professional-domain text dataset used in this paper is particularly challenging due to its complexity (uneven text length, relatively imbalanced categories) and the high degree of similarity between categories. Nevertheless, the proposed algorithm can efficiently classify multiple subcategories of this type of text set, which is a useful exploration of applying text classification to complex domain-specific Chinese text datasets and provides a helpful reference for the vector representation and classification of text datasets with similar content.
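The weighting scheme described above builds on classical TF-IDF. As a point of reference, here is a minimal pure-Python sketch of that baseline only; the paper's TF-IDF-CRF-POS variant (with its relative-frequency, category-discriminability, and part-of-speech factors) is not reproduced here, and the toy corpus is invented for illustration:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute classical TF-IDF weights for a tokenized corpus.

    tf = term count / document length; idf = ln(N / df), where df is the
    number of documents containing the term.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy attraction-description corpus (hypothetical tokens).
docs = [["temple", "mountain", "temple"], ["lake", "mountain"], ["museum"]]
w = tf_idf(docs)
```

A term unique to one document ("museum") receives the maximal idf, while a term shared across documents ("mountain") is down-weighted, which is the behavior the augmented scheme refines further.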

  13. Binary contingency table.

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Binary contingency table. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t001
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Text classification, as an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Current research on text classification tends to focus on applications such as information filtering, information retrieval, public opinion monitoring, and library and information science, with few studies applying text classification methods to the field of tourist attractions. In light of this, a corpus of tourist attraction description texts is constructed in this paper using web crawler technology. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. The proposed representation is then combined with seven commonly used, well-performing classifiers (DT, SVM, LR, NB, MLP, RF, and KNN) to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of this algorithm are validated by comparing overall performance, per-category performance, and model stability against several commonly used text representation methods. The results demonstrate that the proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry: its accuracy, macro-F1, and micro-F1 values are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness.
In addition, the conclusions obtained from the predicted values are consistent with those from the true values, indicating that the algorithm is practical. The professional-domain text dataset used in this paper is particularly challenging due to its complexity (uneven text length, relatively imbalanced categories) and the high degree of similarity between categories. Nevertheless, the proposed algorithm can efficiently classify multiple subcategories of this type of text set, which is a useful exploration of applying text classification to complex domain-specific Chinese text datasets and provides a helpful reference for the vector representation and classification of text datasets with similar content.

  14. multiple-sclerosis

    • kaggle.com
    zip
    Updated Jun 15, 2025
    Cite
    BURAK TAŞCI (2025). multiple-sclerosis [Dataset]. https://www.kaggle.com/buraktaci/multiple-sclerosis
    Explore at:
    zip(446273066 bytes)Available download formats
    Dataset updated
    Jun 15, 2025
    Authors
    BURAK TAŞCI
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Cite:Macin G, Tasci B, Tasci I, Faust O, Barua PD, Dogan S, Tuncer T, Tan R-S, Acharya UR. An Accurate Multiple Sclerosis Detection Model Based on Exemplar Multiple Parameters Local Phase Quantization: ExMPLPQ. Applied Sciences. 2022; 12(10):4920. https://doi.org/10.3390/app12104920

    An Accurate Multiple Sclerosis Detection Model Based on Exemplar Multiple Parameters Local Phase Quantization: ExMPLPQ Multiple sclerosis (MS) is a chronic demyelinating condition characterized by plaques in the white matter of the central nervous system that can be detected using magnetic resonance imaging (MRI). Many deep learning models for automated MS detection based on MRI have been presented in the literature. We developed a computationally lightweight machine learning model for MS diagnosis using a novel handcrafted feature engineering approach. The study dataset comprised axial and sagittal brain MRI images that were prospectively acquired from 72 MS and 59 healthy subjects who attended the Ozal University Medical Faculty in 2021. The dataset was divided into three study subsets: axial images only (n = 1652), sagittal images only (n = 1775), and combined axial and sagittal images (n = 3427) of both MS and healthy classes. All images were resized to 224 × 224. Subsequently, the features were generated with a fixed-size patch-based (exemplar) feature extraction model based on local phase quantization (LPQ) with three-parameter settings. The resulting exemplar multiple parameters LPQ (ExMPLPQ) features were concatenated to form a large final feature vector. The top discriminative features were selected using iterative neighborhood component analysis (INCA). Finally, a k-nearest neighbor (kNN) algorithm, Fine kNN, was deployed to perform binary classification of the brain images into MS vs. healthy classes. The ExMPLPQ-based model attained 98.37%, 97.75%, and 98.22% binary classification accuracy rates for axial, sagittal, and hybrid datasets, respectively, using Fine kNN with 10-fold cross-validation. Furthermore, our model outperformed 19 established pre-trained deep learning models that were trained and tested with the same data. Unlike deep models, the ExMPLPQ-based model is computationally lightweight yet highly accurate. 
It has the potential to be implemented as an automated diagnostic tool to screen brain MRIs for white matter lesions in suspected MS patients.
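    The evaluation protocol described above (a k-nearest-neighbor classifier scored with 10-fold cross-validation) can be sketched as follows. Synthetic data stands in for the ExMPLPQ feature vectors, and k=1 is only a rough stand-in for the "Fine kNN" configuration; neither is taken from the paper:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data in place of the real feature vectors.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

clf = KNeighborsClassifier(n_neighbors=1)
# 10-fold cross-validation, as used in the study.
scores = cross_val_score(clf, X, y, cv=10)
print(f"mean 10-fold accuracy: {scores.mean():.3f}")
```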

  15. The optimal combination of text representations and classifiers for...

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). The optimal combination of text representations and classifiers for different-size text sets. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t011
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The optimal combination of text representations and classifiers for different-size text sets.

  16. Classification performance of different combinations of text representation...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Classification performance of different combinations of text representation method & classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t010
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification performance of different combinations of text representation method & classifier.

  17. Classification performance of the entire test set in the MLP classifier...

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Classification performance of the entire test set in the MLP classifier during the improved processes. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t008
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification performance of the entire test set in the MLP classifier during the improved processes.

  18. Quantity ratio of the true and predicted values of various attractions in...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Quantity ratio of the true and predicted values of various attractions in the two provinces. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t013
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Quantity ratio of the true and predicted values of various attractions in the two provinces.

  19. Comparing true-predicted values of the top 2–3 categories in different-level...

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Comparing true-predicted values of the top 2–3 categories in different-level attractions in the two provinces. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t014
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparing true-predicted values of the top 2–3 categories in different-level attractions in the two provinces.

  20. Comparing true and predicted values of the top 2–3 categories in different...

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Comparing true and predicted values of the top 2–3 categories in different level attractions. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t012
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparing true and predicted values of the top 2–3 categories in different level attractions.

Cite
Pratyush_Ranjan (2023). KNN DATASET [Dataset]. https://www.kaggle.com/datasets/pratyushranjan01/knn-dataset/data

KNN DATASET

Play with this dataset as much as you can, since I believe in learning by doing.

Explore at:
zip(59421 bytes)Available download formats
Dataset updated
Jul 5, 2023
Authors
Pratyush_Ranjan
Description

K-Nearest Neighbors (KNN) is a popular supervised learning algorithm used for classification and regression tasks. It is a non-parametric algorithm that makes predictions based on the similarity between input features and their neighboring data points.

A "KNN dataset" typically refers to a dataset that is well suited to applying the KNN algorithm. Here are a few characteristics of such a dataset:

Numerical features: KNN works with numerical features, so the dataset should contain numerical attributes. If categorical features are present, they need to be converted into numerical representations through techniques like one-hot encoding or label encoding.
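One-hot encoding can be done with library helpers, but the idea fits in a few lines of plain Python. This is a minimal sketch with an invented example column; the helper name `one_hot` is ours:

```python
def one_hot(values):
    """Map each categorical value to a binary indicator vector,
    one column per distinct category (columns in sorted order)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

colors = ["red", "green", "red", "blue"]
encoded = one_hot(colors)  # columns ordered: blue, green, red
```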

Similarity measure: KNN relies on a distance metric to determine the similarity between data points. Common distance measures include Euclidean distance, Manhattan distance, and cosine similarity. The dataset should have features that can be effectively compared using a distance metric.
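The three distance/similarity measures mentioned above are easy to write out directly. A minimal sketch (helper names are ours, not from any library):

```python
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute per-coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

p, q = [1.0, 2.0], [4.0, 6.0]
```

Note that Euclidean and Manhattan are distances (smaller = more similar), while cosine is a similarity (larger = more similar), so the two kinds are used with opposite orderings in a KNN search.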

Feature scaling: Since KNN uses distance calculations, it's generally a good practice to scale the features. Features with larger scales can dominate the distance calculations and lead to biased results. Common scaling techniques include standardization (subtracting mean and dividing by standard deviation) or normalization (scaling values to a range, e.g., 0 to 1).
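Both scaling techniques just described can be sketched in a few lines; the toy column and helper names are invented for illustration:

```python
def standardize(xs):
    """Shift to zero mean, scale to unit (population) standard deviation."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

def min_max(xs):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

raw = [10.0, 20.0, 30.0]
```

In practice the scaling parameters (mean/std or min/max) should be computed on the training set only and then reused for the test set, so that no information leaks from test to train.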

Sufficient data points: KNN performs best when the dataset has a sufficient number of data points for each class or target value. Having too few instances per class can lead to overfitting or inaccurate predictions.

It's important to note that the suitability of a dataset for KNN depends on the specific problem and domain. It's always recommended to analyze and preprocess the dataset based on its characteristics before applying any machine learning algorithm, including KNN.
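Putting the pieces together, the core KNN prediction step (find the k nearest training points, take a majority vote) can be sketched in pure Python. The tiny two-cluster dataset and the helper name `knn_predict` are invented for illustration:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query.
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Majority vote over the k closest labels.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

X = [[1, 1], [1, 2], [8, 8], [9, 8]]
y = ["a", "a", "b", "b"]
```

An odd k avoids ties in binary problems; choosing k itself is usually done via cross-validation.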
