49 datasets found
  1. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Carolina
    Description

    The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset; the data usually has to be pre-processed first, which normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much; a likely reason is that the features we selected for clustering were not well suited to it. Given the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality-reduction perspective: clustering differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with minimal loss of information. Using clusters to reduce the data dimension can lose a great deal of information, because clustering techniques are based on a metric of "distance", and at high dimensions Euclidean distance loses nearly all of its meaning. "Reducing" dimensionality by mapping data points to cluster numbers is therefore not always a good idea, since almost all the information may be lost.

    From the feature-creation perspective: clustering analysis creates labels based on patterns in the data, which introduces uncertainty. When clustering precedes classification, the choice of the number of clusters strongly affects clustering performance and, in turn, classification performance. If the subset of features we cluster on is well suited to it, clustering may improve overall classification performance; for example, if the features we run k-means on are numerical and low-dimensional, overall classification performance may be better.

    We deliberately did not fix the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results varied greatly from run to run (which they did), the data probably does not cluster well with the selected methods at all. In effect, our results were not much better than random when clustering was applied during preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data, in the same format, from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise them from time to time as things change.
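The cluster-as-feature experiment and the run-to-run stability check described above can be sketched as follows. This is a minimal illustration on synthetic data with scikit-learn models, not the actual NC schools dataset or pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the schools data (NOT the actual NC dataset).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Baseline: classify on the raw features.
base = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

# Cluster-as-feature: append the k-means label as one extra column.
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X)
X_aug = np.column_stack([X, labels])
aug = cross_val_score(RandomForestClassifier(random_state=0), X_aug, y, cv=5).mean()

# Stability check: with no fixed random_state, compare two independent runs.
a = KMeans(n_clusters=5, n_init=10).fit_predict(X)
b = KMeans(n_clusters=5, n_init=10).fit_predict(X)
print(f"baseline={base:.3f}  with cluster feature={aug:.3f}  "
      f"run-to-run ARI={adjusted_rand_score(a, b):.3f}")
```

An adjusted Rand index near 1 between runs indicates a stable clustering; values near 0 suggest the data does not cluster reliably with the chosen method, which is the failure mode the description reports.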

  2. Confusion matrix.

    • figshare.com
    xls
    Updated Jul 7, 2023
    + more versions
    Cite
    Shaoxia Mou; Heming Zhang (2023). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0288140.t002
    Explore at:
    xls (available download formats)
    Dataset updated
    Jul 7, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Shaoxia Mou; Heming Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Due to the inherent characteristics of cumulative sequences of unbalanced data, mining results for this kind of data are often dominated by the majority categories, degrading mining performance. To address this problem, the performance of data cumulative-sequence mining is optimized: an algorithm for mining cumulative sequences of unbalanced data based on probability matrix decomposition is studied. The natural nearest neighbors of the minority samples in the unbalanced cumulative sequence are determined, and the minority samples are clustered according to the natural-nearest-neighbor relationship. Within each cluster, new samples are generated from the core points of dense regions and the non-core points of sparse regions, and these new samples are added to the original sequence to balance it. The probability matrix decomposition method is used to generate two Gaussian-distributed random matrices for the balanced cumulative sequence, and a linear combination of low-dimensional eigenvectors is used to explain the preference of specific users for the data sequence. At the same time, from a global perspective, the AdaBoost idea is used to adaptively adjust the sample weights and optimize the probability matrix decomposition algorithm, optimizing global error as well as single-sample error more efficiently. Experimental results show that the algorithm effectively generates new samples, reduces the imbalance of the data cumulative sequence, and obtains more accurate mining results. The minimum RMSE is obtained when the decomposition dimension is 5. The proposed algorithm shows good classification performance on the balanced cumulative sequence, with the best average ranking on the F-value, G-mean, and AUC metrics.
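The sample-generation step described above (synthesizing new minority samples from neighbors and adding them to rebalance the data) resembles SMOTE-style interpolation. A minimal sketch of that idea on random data, not the paper's natural-nearest-neighbor algorithm:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def oversample_minority(X_min, n_new, k=5, seed=0):
    """Generate synthetic minority samples by interpolating each sample
    toward one of its k nearest neighbors (SMOTE-style sketch)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)          # idx[:, 0] is the point itself
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))       # pick a random minority sample
        j = idx[i, rng.integers(1, k + 1)] # pick one of its true neighbors
        lam = rng.random()                 # interpolate between the two
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

# Toy minority class: 20 samples with 4 features.
X_min = np.random.default_rng(1).normal(size=(20, 4))
synth = oversample_minority(X_min, n_new=30)
print(synth.shape)  # (30, 4)
```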

  3. Data from: Characterizing and classifying neuroendocrine neoplasms through...

    • datacatalog.mskcc.org
    • data.niaid.nih.gov
    • +2more
    Updated Sep 19, 2023
    Cite
    Nanayakkara, Jina; Yang, Xiaojing; Tyryshkin, Kathrin; Wong, Justin J.M.; Vanderbeck, Kaitlin; Ginter, Paula S.; Scognamiglio, Theresa; Chen, Yao-Tseng; Panarelli, Nicole; Cheung, Nai-Kong; Dijk, Frederike; Ben-Dov, Iddo Z.; Kim, Michelle Kang; Singh, Simron; Morozov, Pavel; Max, Klaas E. A.; Tuschl, Thomas; Renwick, Neil (2023). Characterizing and classifying neuroendocrine neoplasms through microRNA sequencing and data mining [Dataset]. http://doi.org/10.5061/dryad.fn2z34tqj
    Explore at:
    Dataset updated
    Sep 19, 2023
    Dataset provided by
    MSK Library
    Authors
    Nanayakkara, Jina; Yang, Xiaojing; Tyryshkin, Kathrin; Wong, Justin J.M.; Vanderbeck, Kaitlin; Ginter, Paula S.; Scognamiglio, Theresa; Chen, Yao-Tseng; Panarelli, Nicole; Cheung, Nai-Kong; Dijk, Frederike; Ben-Dov, Iddo Z.; Kim, Michelle Kang; Singh, Simron; Morozov, Pavel; Max, Klaas E. A.; Tuschl, Thomas; Renwick, Neil
    Description

    From Dryad entry:

    "Abstract
    Neuroendocrine neoplasms (NENs) are clinically diverse and incompletely characterized cancers that are challenging to classify. MicroRNAs (miRNAs) are small regulatory RNAs that can be used to classify cancers. Recently, a morphology-based classification framework for evaluating NENs from different anatomic sites was proposed by experts, with the requirement of improved molecular data integration. Here, we compiled 378 miRNA expression profiles to examine NEN classification through comprehensive miRNA profiling and data mining. Following data preprocessing, our final study cohort included 221 NEN and 114 non-NEN samples, representing 15 NEN pathological types and five site-matched non-NEN control groups. Unsupervised hierarchical clustering of miRNA expression profiles clearly separated NENs from non-NENs. Comparative analyses showed that miR-375 and miR-7 expression is substantially higher in NEN cases than non-NEN controls. Correlation analyses showed that NENs from diverse anatomic sites have convergent miRNA expression programs, likely reflecting morphologic and functional similarities. Using machine learning approaches, we identified 17 miRNAs to discriminate 15 NEN pathological types and subsequently constructed a multi-layer classifier, correctly identifying 217 (98%) of 221 samples and overturning one histologic diagnosis. Through our research, we have identified common and type-specific miRNA tissue markers and constructed an accurate miRNA-based classifier, advancing our understanding of NEN diversity.

    Methods
    Sequencing-based miRNA expression profiles from 378 clinical samples, comprising 239 neuroendocrine neoplasm (NEN) cases and 139 site-matched non-NEN controls, were used in this study. Expression profiles were either compiled from published studies (n=149) or generated through small RNA sequencing (n=229). Prior to sequencing, total RNA was isolated from formalin-fixed paraffin-embedded (FFPE) tissue blocks or fresh-frozen (FF) tissue samples. Small RNA cDNA libraries were sequenced on HiSeq 2500 Illumina platforms using an established small RNA sequencing (Hafner et al., 2012 Methods) and sequence annotation pipeline (Brown et al., 2013 Front Genet) to generate miRNA expression profiles. Scaling our existing approach to miRNA-based NEN classification (Panarelli et al., 2019 Endocr Relat Cancer; Ren et al., 2017 Oncotarget), we constructed and cross-validated a multi-layer classifier for discriminating NEN pathological types based on selected miRNAs.

    Usage notes
    Diagnostic histopathology and small RNA cDNA library preparation information for all samples are presented in Table S1 of the associated manuscript."
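Unsupervised hierarchical clustering of expression profiles, as used above to separate NENs from non-NENs, can be sketched on toy data. The matrices below are random stand-ins, not the study's miRNA profiles:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy expression matrix: 30 "NEN-like" and 20 "control-like" samples
# over 50 hypothetical miRNAs (NOT the study's data).
rng = np.random.default_rng(0)
nen = rng.normal(5.0, 1.0, size=(30, 50))
nen[:, :10] += 5.0                       # markers elevated in the NEN-like group
non = rng.normal(5.0, 1.0, size=(20, 50))
X = np.vstack([nen, non])

# Ward-linkage hierarchical clustering, cut into two groups.
Z = linkage(X, method="ward")
groups = fcluster(Z, t=2, criterion="maxclust")
print(groups)
```

With clearly elevated marker features, the two-group cut recovers the sample groups; on real profiles the clustering would follow preprocessing and normalization steps not shown here.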

  4. Zenodo Open Metadata snapshot - Training dataset for records classifier...

    • zenodo.org
    application/gzip, bin
    Updated Dec 14, 2022
    + more versions
    Cite
    Alex Ioannidis; Alex Ioannidis (2022). Zenodo Open Metadata snapshot - Training dataset for records classifier building [Dataset]. http://doi.org/10.5281/zenodo.1255786
    Explore at:
    bin, application/gzip (available download formats)
    Dataset updated
    Dec 14, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Alex Ioannidis; Alex Ioannidis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the metadata of Zenodo's published open access records, including records that were marked as spam by Zenodo staff and deleted.

    The dataset is a gzip-compressed JSON-lines file, where each line is a JSON object representing a Zenodo record.

    Each object contains the terms:
    part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date

    which correspond to the fields of the same name in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.

    In addition, some terms have been altered:

    The term files contains a list of dictionaries containing filetype, size, and filename only.
    The term license contains a short Zenodo ID of the license (e.g "cc-by").
    The term spam contains a boolean value, determining whether a given record was marked as a spam record by Zenodo staff.

    Top-level terms whose values were missing from the metadata may contain a null value.

    A smaller uncompressed random sample of 200 JSON lines is also included to allow for testing and getting familiar with the format without having to download the entire dataset.
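Reading the snapshot requires only the standard library. A sketch using a tiny in-memory sample in the same format (for the real file, pass its path to gzip.open instead of the BytesIO wrapper):

```python
import gzip
import io
import json

def load_records(fh):
    """Split a Zenodo JSON-lines stream into (non-spam records, spam count)."""
    records, spam = [], 0
    for line in fh:
        rec = json.loads(line)
        if rec.get("spam"):     # boolean flag added by Zenodo staff
            spam += 1
        else:
            records.append(rec)
    return records, spam

# Tiny invented sample in the snapshot's line format.
sample = (b'{"recid": 1, "title": "ok", "spam": false}\n'
          b'{"recid": 2, "title": "junk", "spam": true}\n')
with gzip.open(io.BytesIO(gzip.compress(sample)), "rt", encoding="utf-8") as fh:
    recs, spam = load_records(fh)
print(len(recs), spam)  # 1 1
```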

  5. SIAM 2007 Text Mining Competition dataset

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +2more
    application/rdfxml +5
    Updated Jun 26, 2018
    + more versions
    Cite
    (2018). SIAM 2007 Text Mining Competition dataset [Dataset]. https://data.nasa.gov/dataset/SIAM-2007-Text-Mining-Competition-dataset/skkr-s98t
    Explore at:
    csv, application/rssxml, json, tsv, application/rdfxml, xml (available download formats)
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Subject Area: Text Mining

    Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining algorithms for document classification. The documents in question were aviation safety reports that documented one or more problems that occurred during certain flights. The goal was to label the documents with respect to the types of problems that were described. This is a subset of the Aviation Safety Reporting System (ASRS) dataset, which is publicly available.

    How Data Was Acquired: The data for this competition came from human-generated reports on incidents that occurred during a flight.

    Sample Rates, Parameter Description, and Format: There is one document per incident. The datasets are in raw text format; all documents for each set are contained in a single file, with each row corresponding to a single document. The first characters on each line are the document number, and a tilde separates the document number from the text itself.

    Anomalies/Faults: This is a document category classification problem.
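The "document number, tilde, text" layout described above can be parsed with a simple split on the first tilde. The sample report texts below are invented for illustration:

```python
def parse_siam_docs(lines):
    """Split 'docnum~text' lines into (doc_number, text) pairs,
    per the file format described above."""
    docs = []
    for line in lines:
        num, _, text = line.rstrip("\n").partition("~")
        docs.append((num.strip(), text))
    return docs

# Invented example rows in the competition file's format.
sample = ["1~Aircraft deviated from assigned altitude.\n",
          "2~Smoke observed in cabin; returned to gate.\n"]
print(parse_siam_docs(sample)[0])
```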

  6. Malaria disease and grading system dataset from public hospitals reflecting...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Nov 10, 2023
    Cite
    Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie (2023). Malaria disease and grading system dataset from public hospitals reflecting complicated and uncomplicated conditions [Dataset]. http://doi.org/10.5061/dryad.4xgxd25gn
    Explore at:
    zip (available download formats)
    Dataset updated
    Nov 10, 2023
    Dataset provided by
    Nasarawa State University
    Authors
    Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector, making it possible to train models that predict patient health faster than clinical trials. Various machine learning algorithms, such as K-Nearest Neighbors, Bayesian methods, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), have been applied to malaria datasets in public hospitals, but there are still limitations in modeling with the Multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how dependency between attributes affects classifier performance. MNB creates a transparent and reliable graphical representation of relations between attributes, with the ability to predict new situations. The MNB model achieved 97% accuracy, compared with the GNB and RF classifiers, which each achieved 100% accuracy.

    Methods: Prior to data collection, the researcher was guided by all ethical training certifications on data collection and the right to confidentiality and privacy, as required by the Institutional Review Board (IRB). Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed to electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked against laboratory-confirmed diagnosis results. The data were divided into two tables: data1, containing data for phase 1 of the classification, and data2, containing data for phase 2.

    Data Source Collection: The malaria incidence dataset was obtained from public hospitals from 2017 to 2021. These are the data used for modeling and analysis, taking into account the geographical location and socio-economic factors available for patients inhabiting those areas. Multinomial Naive Bayes is the model used to analyze the collected data for malaria disease prediction and grading.

    Data Preprocessing: Preprocessing was done to remove noise and outliers. Transformation: The data were transformed from analog to electronic records.

    Data Partitioning: The collected data were divided into two portions, one extracted as a training set and the other used for testing. The training portion taken from one database table is called training set 1, and the portion taken from another table is called training set 2. The dataset was split 70% for training and 30% for testing. Using MNB classification algorithms implemented in Python, the models were trained on the training sample; the resulting models were then tested on the remaining 30% and compared with other machine learning models using standard metrics.

    Classification and prediction: Based on the nature of the variables in the dataset, this study uses Multinomial Naïve Bayes classification in two phases. The framework operates as follows: i. Data collection and preprocessing are done. ii. Preprocessed data are stored in training set 1 and training set 2, which are used during classification. iii. The test dataset is stored in a test database. iv. Part of the test dataset is classified with classifier 1 and the remainder with classifier 2, as follows:

    Classifier phase 1: classifies records into positive or negative classes. A patient with malaria is classified as positive (P), while a patient without malaria is classified as negative (N).
    Classifier phase 2: classifies only records labeled positive by classifier 1, further separating them into complicated and uncomplicated class labels. The classifier also captures data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system is designed so that the core parameters, as determining factors, supply their values.
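The two-phase scheme (classifier 1: positive vs. negative; classifier 2: complicated vs. uncomplicated among the positives) can be sketched with scikit-learn's MultinomialNB. The count-style features and labels below are synthetic placeholders, not the study's 15 hospital attributes:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# Synthetic non-negative count features standing in for the 15 attributes.
X = rng.integers(0, 10, size=(300, 15))
has_malaria = rng.integers(0, 2, size=300)   # phase-1 label: positive/negative
severity = rng.integers(0, 2, size=300)      # phase-2 label: complicated or not

# Phase 1: positive vs. negative.
clf1 = MultinomialNB().fit(X, has_malaria)
pred1 = clf1.predict(X)

# Phase 2: trained on true positives, applied only to predicted positives.
pos = pred1 == 1
clf2 = MultinomialNB().fit(X[has_malaria == 1], severity[has_malaria == 1])
pred2 = clf2.predict(X[pos])
print(int(pos.sum()), len(pred2))
```

On random labels the accuracies are meaningless; the sketch only shows the cascade structure, where phase 2 sees exactly the records phase 1 flags as positive.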

  7. Results for Random Forest classification models using different feature sets...

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Janna Axenbeck; Patrick Breithaupt (2023). Results for Random Forest classification models using different feature sets and target variables. [Dataset]. http://doi.org/10.1371/journal.pone.0249583.t005
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Janna Axenbeck; Patrick Breithaupt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Evaluation metrics are presented for the test sample.

  8. Application of image processing and machine learning techniques to...

    • search.dataone.org
    • data.griidc.org
    Updated Feb 5, 2025
    Cite
    Daly, Kendra (2025). Application of image processing and machine learning techniques to distinguish suspected oil droplets from plankton and other particles for the SIPPER imaging system [Dataset]. http://doi.org/10.7266/N74X55RS
    Explore at:
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    GRIIDC
    Authors
    Daly, Kendra
    Description

    Image classification features and examples of statistical results for the data mining approach using a one-versus-one strategy to implement a SVM (support vector machine) multi-class classifier. Data published in: Fefilatyev, S., K. Kramer, L. Hall, D. Goldgof, R. Kasturi, A. Remsen, K. Daly. 2011. Detection of Anomalous Particles from the Deepwater Horizon Oil Spill Using the SIPPER3 Underwater Imaging Platform. Proceedings of International Conference on Data Mining Workshops, p. 741-748. Awarded Data Mining Practice Prize at the IEEE International Conference on Data Mining (ICDM), Vancouver, Canada, December 11-14, 2011. DOI 10.1109/ICDMW.2011.65.

  9. Healthcare datasets.

    • figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Talayeh Razzaghi; Oleg Roderick; Ilya Safro; Nicholas Marko (2023). Healthcare datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0155119.t005
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Talayeh Razzaghi; Oleg Roderick; Ilya Safro; Nicholas Marko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The set “Example 1” has 10,000 observations in each class. In set “Example 2”, the majority and minority classes contain 50,400 and 33,600 observations, respectively. For details about the data see [8].

  10. Lisbon, Portugal, hotel’s customer dataset with three years of personal,...

    • data.mendeley.com
    Updated Nov 18, 2020
    Cite
    Nuno Antonio (2020). Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information [Dataset]. http://doi.org/10.17632/j83f5fsh6c.1
    Explore at:
    Dataset updated
    Nov 18, 2020
    Authors
    Nuno Antonio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Portugal, Lisbon
    Description

    Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It comprises three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset helps address the lack of real-world business data available for educational and research purposes. It can be used in data mining, machine learning, and other analytical problems in the scope of data science. Due to its unit of analysis, it is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
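A minimal RFM-style segmentation of the kind the description suggests. The column names and values below are hypothetical stand-ins; the dataset's actual variable names may differ:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical columns standing in for the dataset's RFM inputs.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "DaysSinceLastStay": rng.integers(1, 1000, 500),  # Recency
    "BookingsCount":     rng.integers(1, 20, 500),    # Frequency
    "LodgingRevenue":    rng.uniform(50, 5000, 500),  # Monetary
})

# Standardize the three RFM axes, then cluster customers into segments.
rfm = StandardScaler().fit_transform(df)
df["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(rfm)
print(df.groupby("segment")["LodgingRevenue"].mean().round(1))
```

Segment profiles (mean recency, frequency, and revenue per cluster) are then typically inspected to name the segments, e.g. frequent high spenders vs. lapsed one-time guests.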

  11. EthanolLevel UCR Archive Dataset

    • data.niaid.nih.gov
    Updated May 15, 2024
    Cite
    University of Southampton (2024). EthanolLevel UCR Archive Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11190984
    Explore at:
    Dataset updated
    May 15, 2024
    Dataset provided by
    University of California: http://universityofcalifornia.edu/
    University of Southampton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of the UCR Archive maintained by University of Southampton researchers. Please cite the relevant release, or the latest full archive release, if you use the datasets. See http://www.timeseriesclassification.com/.

    This dataset is part of a project with the Scotch Whisky Research Institute into detecting forged spirits in a non-intrusive manner. One way of detecting forgery without sampling the spirit is to inspect the ethanol level by spectrograph. The dataset covers 20 different bottle types and four levels of alcohol: 35%, 38%, 40% and 45%. Each series is a spectrograph of 1,751 observations. This dataset is an example of when it is wrong to merge and resample, because the train/test splits are constructed so that the same bottle type is never in both the train and test sets. There are 4 classes:

    - Class 1: E35
    - Class 2: E38
    - Class 3: E40
    - Class 4: E45

    For more information about this dataset, see [1,2].

    [1] Lines, Jason, Sarah Taylor, and Anthony Bagnall. "Hive-cote: The hierarchical vote collective of transformation-based ensembles for time series classification." Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 2016.

    [2] J. Large, E. K. Kemsley, N. Wellner, I. Goodall, and A. Bagnall, "Detecting forged alcohol non-invasively through vibrational spectroscopy and machine learning," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2018.

    Donator: A. Bagnall
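The caveat above about merging and resampling can be respected by splitting on bottle type with a group-aware splitter. A sketch on synthetic data of the same shape (sklearn's GroupShuffleSplit guarantees no group appears on both sides):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-ins: 80 spectrograph series of length 1751,
# 4 alcohol-level classes, 20 bottle types (NOT the real archive files).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 1751))
y = rng.integers(1, 5, 80)        # classes 1-4 (E35/E38/E40/E45)
bottle = rng.integers(0, 20, 80)  # bottle type of each series

gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train, test = next(gss.split(X, y, groups=bottle))
# No bottle type appears in both sides of the split.
print(set(bottle[train]) & set(bottle[test]))  # set()
```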

  12. Sensitivity, specificity and G-mean of financial risk problem with five risk...

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Talayeh Razzaghi; Oleg Roderick; Ilya Safro; Nicholas Marko (2023). Sensitivity, specificity and G-mean of financial risk problem with five risk classes (Example 1) using ML(W)SVM and REM imputation methods. [Dataset]. http://doi.org/10.1371/journal.pone.0155119.t007
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Talayeh Razzaghi; Oleg Roderick; Ilya Safro; Nicholas Marko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sensitivity, specificity and G-mean of financial risk problem with five risk classes (Example 1) using ML(W)SVM and REM imputation methods.

  13. SAT Questions and Answers for LLM 🏛️

    • kaggle.com
    Updated Oct 16, 2023
    + more versions
    Cite
    Training Data (2023). SAT Questions and Answers for LLM 🏛️ [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/sat-history-questions-and-answers/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 16, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    SAT History Questions and Answers 🏛️ - Text Classification Dataset

    This dataset contains a collection of questions and answers for the SAT Subject Test in World History and US History. Each question is accompanied by its answer options and the correct response.

    The dataset includes questions from various topics, time periods, and regions on both World History and US History.

    💴 For Commercial Usage: leave a request on TrainingData to discuss your requirements, learn about the price, and buy the dataset


    Content

    For each question, we extracted:

    - id: number of the question,
    - subject: SAT subject (World History or US History),
    - prompt: text of the question,
    - A: answer A,
    - B: answer B,
    - C: answer C,
    - D: answer D,
    - E: answer E,
    - answer: letter of the correct answer to the question
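For LLM use, the extracted fields can be assembled into a single prompt string. A sketch with an invented example question (the field layout follows the list above):

```python
def format_sat_question(q):
    """Assemble one record's extracted fields into a single prompt string."""
    options = "\n".join(f"{k}. {q[k]}" for k in "ABCDE" if q.get(k))
    return f"[{q['subject']}] {q['prompt']}\n{options}\nAnswer: {q['answer']}"

# Invented example record in the dataset's field layout.
q = {"id": 1, "subject": "World History",
     "prompt": "Which empire built Machu Picchu?",
     "A": "Aztec", "B": "Inca", "C": "Maya", "D": "Olmec", "E": "Toltec",
     "answer": "B"}
print(format_sat_question(q))
```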

    💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

    TrainingData provides high-quality data annotation tailored to your needs

    keywords: answer questions, sat, gpa, university, school, exam, college, web scraping, parsing, online database, text dataset, sentiment analysis, llm dataset, language modeling, large language models, text classification, text mining dataset, natural language texts, nlp, nlp open-source dataset, text data, machine learning

  14. Character classification data for license plates

    • figshare.com
    txt
    Updated Mar 13, 2016
    Cite
    Rohit Rawat; M.T. Manry; Fernando Martinez (2016). Character classification data for license plates [Dataset]. http://doi.org/10.6084/m9.figshare.3113449.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Mar 13, 2016
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Rohit Rawat; M.T. Manry; Fernando Martinez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Licence Plate Character Classification Data

    Authors: Rohit Rawat, Dr. M. T. Manry, Fernando Martinez
    Image Processing and Neural Networks Lab, The University of Texas at Arlington
    http://www.uta.edu/faculty/manry/

    This dataset has 49 numerical features extracted from character images taken from license plate images. The dataset has 12,757 examples extracted from plate images, split into training and testing sets. The data has 36 output classes: the letters 'A' to 'Z' excluding 'O' and 'Q', the numbers '0' to '9', and two state-map characters. Data is tab-separated, one line per example, with the correct class (between 1 and 36) at the end of the line.

    This data should be cited as: Rawat, Rohit; Manry, M.T.; Martinez, Fernando (2016): Character classification data for license plates. figshare. https://dx.doi.org/10.6084/m9.figshare.3113449.v1
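A minimal parser for the tab-separated format described above, 49 feature values followed by the class label. The feature values in the sample row are invented:

```python
def parse_plate_line(line):
    """Split one tab-separated example into (features, class_label);
    the class (1-36) is the last field, per the description above."""
    fields = line.rstrip("\n").split("\t")
    return [float(v) for v in fields[:-1]], int(fields[-1])

# Invented row: 49 feature values followed by the class label.
row = "\t".join(["0.5"] * 49 + ["12"])
feats, label = parse_plate_line(row)
print(len(feats), label)  # 49 12
```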

  15. ag_news_subset

    • tensorflow.org
    Updated Dec 6, 2022
    + more versions
    Cite
    (2022). ag_news_subset [Dataset]. http://identifiers.org/arxiv:1509.01626
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.

    The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

    The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and the total number of testing samples is 7,600.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('ag_news_subset', split='train')
    for ex in ds.take(4):
      print(ex)
    

    See the guide for more information on tensorflow_datasets.

  16. Rock mass quality and structural geology observations in northwest Prince...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 31, 2024
    Cite
    U.S. Geological Survey (2024). Rock mass quality and structural geology observations in northwest Prince William Sound, Alaska from the summer of 2021 [Dataset]. https://catalog.data.gov/dataset/rock-mass-quality-and-structural-geology-observations-in-northwest-prince-william-sound-al
    Explore at:
    Dataset updated
    Jul 31, 2024
    Dataset provided by
    United States Geological Survey, http://www.usgs.gov/
    Area covered
    Prince William Sound, Alaska
    Description

    Multiple subaerial landslides adjacent to Prince William Sound, Alaska (for example, Dai and others, 2020; Higman and others, 2023; Schaefer and others, 2024) pose a threat to the public because of their potential to generate ocean waves (Dai and others, 2020; Barnhart and others, 2021; Barnhart and others, 2022) that could impact towns and marine activities. One bedrock landslide on the west side of Barry Arm fjord drew international attention in 2020 because of its large size (~500 M m3) and tsunamigenic potential (Dai and others, 2020). As part of the U.S. Geological Survey response to the detection of the potentially tsunamigenic landslide at Barry Arm, as well as a broader effort to evaluate bedrock landslide and tsunamigenic potential throughout Prince William Sound (for example, Schaefer and others, 2024), we assessed rock mass quality and collected structural geology data in a large part of northwest Prince William Sound (including Barry Arm) in June and July, 2021. The quality (strength) of a rock mass depends on the properties of intact rock and the characteristics of discontinuities (for example, bedding, fractures, cleavage) that cut the rock. Rock mass quality can be estimated in the field using a variety of classification schemes. In the summer of 2021, most of our fieldwork was boat-based and was therefore conducted at sites along the coastline. A small number of sites in and near Barry Arm were accessed by helicopter, and sites near the town of Whittier were accessed by driving and hiking. At each field site, we made our measurements at rock outcrops, which were typically found at the base of cliffs, along ridge lines, in flat areas in coastal zones, and in areas recently scoured and plucked by glaciers. In two dimensions, outcrops ranged in size from about 30 m2 to 100 m2. We visited a total of 73 sites in the field. 
    Most sites were in metamorphosed Cretaceous flysch, but a few were in Tertiary granitic rocks (Nelson and others, 1985; Winkler, 1992; Wilson and others, 2015). Of the 73 sites, we collected rock mass quality data and structural data at 54 sites, and only strike and dip of bedding in flysch at 19 sites. At each of the 54 sites, we collected data that we later used to classify rock mass quality according to four commonly used classification schemes: Rock Mass Quality (Q, for example, Barton and others, 1974, Coe and others, 2005); Rock Mass Rating (RMR, for example, Bieniawski, 1989); Slope Mass Rating (SMR, for example, Romana, 1995, Moore and others, 2009); and Geologic Strength Index (GSI, for example, Marinos and Hoek, 2000, Marinos and others, 2005). We also determined Rock Quality Designation (RQD, for example, Deere and Deere, 1989, Palmström, 1982) and estimated intact rock strength using a Proceq Rock Schmidt Type N hammer (see RatingsReadMe.pdf for details). Schmidt hammer rebound values were converted to Uniaxial Compressive Strength (UCS) using equations developed for the same rock types that we observed in the field, but at different locations. For flysch, rebound values from the Type N Schmidt hammer were converted to UCS by first converting Type N rebound values to Type L rebound values, then using these Type L values in the equation shown in Table 3 and Figure 3 of Morales and others (2004). For granitic rocks, UCS values were calculated using Type N rebound values in equation 2 of Katz and others (2000). Additionally, we collected strikes and dips of any observed bedding, fractures, and cleavage. All four rock mass quality classification schemes use data from characteristics of discontinuities present in the rock.
Discontinuity data that we collected in the field included: total number of discontinuities, roughness of the surface of the discontinuities, number of sets of discontinuities, type of filling or alteration on the surface of discontinuities, aperture or “openness” of discontinuities, and the amount of water present. A file of a blank field data collection sheet (FieldDataCollectionSheet) is included in this data release. Numerical ratings for each of these factors are assigned based on the correlation of field measurements and observations with descriptive rankings. The rankings used for Q, RMR, SMR, and GSI classification schemes are shown in Table 1, Table 2, Table 3, and Figures 1 and 2. Additional details regarding descriptive rankings and numerical ratings not shown in the tables and figures are provided in the RatingsReadMe.pdf. All field measurements, numerical ranking values, and calculated Q, RMR, SMR, GSI, and RQD values are included in the RMQMeasurements_Ratings_Values2021 file (.csv and .xlsx). Site names beginning with “JAC”, followed by numbers, are locations where both rock mass quality and structural data were collected. Site names beginning with “JACSD”, “srl”, and “fault” are locations where only the strike and dip of bedding was measured. Question marks in the data files indicate a lack of certainty in field observations. Abbreviations of rating parameters (for example, R4e, Jw, etc.) for the RMR, SMR, and Q classification systems used in column headings are defined in more detail in Tables 1 and 2. All structural measurements are provided in the StructuralData2021 file (.csv and .xlsx). The planar and toppling calculations used for determining SMR values are included in the SMRCalculationsWorksheet2021 file (.csv and .xlsx). Final Q, RMR, SMR, GSI, and RQD values for each site are presented in a separate file (FinalRockStength_QualityValues2021, .csv and .xlsx). All rock mass quality values are positively correlated with rock quality. 
    That is, as Q, RMR, SMR, GSI, and RQD values increase, rock quality increases. Additional information in this release includes photos, field sketches, and geographic data. Photos from each site are included in a separate folder (2021PhotosbySiteName), organized by the individual site names and the names of the photographers. Field sketches for eight sites are in a SketchesinFieldNotesbySiteName zipped folder. A Google Earth 2021SiteLocations.kml file showing site locations, site names, and geographic coordinates is also included. Samples of rock were collected at some of the 2021 sites in the summer of 2022. These sample names are noted in a column in the RMQMeasurements_Rating_Values2021 file. Physical samples are held by Lauren N. Schaefer with the U.S. Geological Survey, Geologic Hazards Science Center in Golden, Colorado. Disclaimer: Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
    References
    Barton, N., Lien, R., and Lunde, J., 1974, Engineering classification of rock masses for the design of tunnel support: Rock Mechanics, v. 6, p. 189-236. https://doi.org/10.1007/BF01239496
    Barnhart, K.R., Jones, R.P., George, D.L., Coe, J.A., and Staley, D.M., 2021, Preliminary assessment of the wave generating potential from landslides at Barry Arm, Prince William Sound, Alaska: U.S. Geological Survey Open-File Report 2021–1071, 28 p., https://doi.org/10.3133/ofr20211071
    Barnhart, K.R., Collins, A.L., Avdievitch, N.N., Jones, R.P., George, D.L., Coe, J.A., and Staley, D.M., 2022, Simulated inundation extent and depth in Harriman Fjord and Barry Arm, western Prince William Sound, Alaska resulting from the hypothetical rapid motion of landslides into Barry Arm Fjord, Prince William Sound, Alaska: U.S. Geological Survey data release, https://doi.org/10.5066/P9QGWH9Z
    Bieniawski, Z.T., 1989, Engineering rock mass classifications: a complete manual for engineers and geologists in mining, civil, and petroleum engineering: John Wiley & Sons, New York, 251 p.
    Coe, J.A., Harp, E.L., Tarr, A.C., and Michael, J.A., 2005, Rock-fall hazard assessment of Little Mill campground, American Fork Canyon, Uinta National Forest, Utah: U.S. Geological Survey Open File Report 2005-1229, 48 p., two 1:3000-scale plates. http://pubs.usgs.gov/of/2005/1229/
    Dai, C., Higman, B., Lynett, P.J., Jacquemart, M., Howat, I.M., Liljedahl, A.K., Dufresne, A., Freymueller, J.T., Geertsema, M., Ward Jones, M., and Haeussler, P.J., 2020, Detection and assessment of a large and potentially tsunamigenic periglacial landslide in Barry Arm, Alaska: Geophysical Research Letters, v. 47 (22), e2020GL089800. https://doi.org/10.1029/2020GL089800
    Deere, D.U., and Deere, D.W., 1989, Rock Quality Designation (RQD) after twenty years: Contract Report GL-89-1, U.S. Army Engineer Waterways Experiment Station, Vicksburg, Miss., 25 p.
    Higman, B., Lahusen, S.R., Belair, G.M., Staley, D.M., and Jacquemart, M., 2023, Inventory of Large Slope Instabilities, Prince William Sound, Alaska: U.S. Geological Survey data release, https://doi.org/10.5066/P9XGMHHP
    Katz, O., Reches, Z., and Roegiers, J.-C., 2000, Evaluation of mechanical rock properties using a Schmidt hammer: International Journal of Rock Mechanics and Mining Sciences, v. 37, p. 723-728. https://doi.org/10.1016/S1365-1609(00)00004-6
    Marinos, P., and Hoek, E., 2000, GSI: a geologically friendly tool for rock mass strength estimation, in Proceedings of GeoEng2000, international conference on geotechnical and geological engineering, Melbourne: Technomic Publishers, Lancaster, p. 1422–1446.
    Marinos, V., Marinos, P., and Hoek, E., 2005, The geological strength index: applications and limitations: Bulletin of Engineering Geology and the Environment, v. 64, p. 55-65. https://doi.org/10.1007/s10064-004-0270-5
    Moore, J.R., Sanders, J.W., Dietrich, W.E., and Glaser, S.D., 2009, Influence of rock mass strength on the erosion rate of alpine cliffs: Earth Surface Processes and Landforms, v. 34, p. 1339-1352. https://doi.org/10.1002/esp.1821
    Morales, T., Uribe-Etxebarria, G., Uriarte, J.A., and Fernández de Valderrama, I., 2004, Geomechanical characterisation of rock masses in Alpine regions: the Basque Arc (Basque-Cantabrian basin, Northern Spain): Engineering Geology, v. 71, p. 343–362.
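Of the four schemes, Q has a simple closed form that combines several of the discontinuity ratings named in the description. As a sketch, using the standard Barton and others (1974) formula; the input values below are illustrative only and are not taken from this data release:

```python
def barton_q(rqd, jn, jr, ja, jw, srf):
    """Rock Mass Quality Q (Barton and others, 1974):
    Q = (RQD / Jn) * (Jr / Ja) * (Jw / SRF),
    i.e. a block-size term, an inter-block shear strength term,
    and an active-stress term multiplied together."""
    return (rqd / jn) * (jr / ja) * (jw / srf)

# Illustrative inputs only (not values from this dataset):
q = barton_q(rqd=90, jn=9, jr=1.5, ja=1.0, jw=1.0, srf=1.0)
```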

  17. ChemTables Sample: dataset for table classification in chemical patents

    • data.mendeley.com
    Updated Nov 4, 2020
    Cite
    Zenan Zhai (2020). ChemTables Sample: dataset for table classification in chemical patents [Dataset]. http://doi.org/10.17632/g7tjh7tbrj.1
    Explore at:
    Dataset updated
    Nov 4, 2020
    Authors
    Zenan Zhai
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of the tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Categorisation of tables based on the nature of their content can help users find tables containing key information, improving the accessibility of patent information that is highly relevant for new inventions. To enable research on methods for automatic table categorisation, we developed a new dataset, called ChemTables, which consists of 7,886 chemical patent tables with labels of their content type. This sample is 10% of the created ChemTables dataset. We also provide a stratified 60:20:20 train/dev/test split here, which can be used as a standard split for evaluating methods on the table categorisation task on this dataset.
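A 60:20:20 stratified split like the one described above can be sketched in pure Python; the table IDs and class names below are synthetic stand-ins, not the real ChemTables labels:

```python
import random

def stratified_split(items, labels, fracs=(0.6, 0.2, 0.2), seed=0):
    """Split items into train/dev/test while preserving per-class ratios."""
    rng = random.Random(seed)
    by_class = {}
    for item, lab in zip(items, labels):
        by_class.setdefault(lab, []).append(item)
    splits = ([], [], [])
    for lab, group in by_class.items():
        rng.shuffle(group)
        n_train = round(fracs[0] * len(group))
        n_dev = round(fracs[1] * len(group))
        splits[0].extend(group[:n_train])
        splits[1].extend(group[n_train:n_train + n_dev])
        splits[2].extend(group[n_train + n_dev:])
    return splits

tables = list(range(100))                      # synthetic table IDs
labels = ["SPECTRAL"] * 50 + ["PHARM"] * 50    # hypothetical class names
train, dev, test = stratified_split(tables, labels)
```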

  18. Classification of Swift and XMM-Newton sources - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Oct 23, 2023
    Cite
    (2023). Classification of Swift and XMM-Newton sources - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/4bc3af0b-902a-5251-9c17-b9ffd6721a01
    Explore at:
    Dataset updated
    Oct 23, 2023
    Description

    With the advent of very large X-ray surveys, an automated classification of X-ray sources becomes increasingly valuable. This work proposes a revisited naive Bayes classification of the X-ray sources in the Swift-XRT and XMM-Newton catalogs into four classes - AGN, stars, X-ray binaries (XRBs), and cataclysmic variables (CVs) - based on their spatial, spectral, and timing properties and their multiwavelength counterparts. An outlier measure is used to identify objects of other natures. The classifier is optimized to maximize the classification performance of a chosen class (here XRBs), and it is adapted to data mining purposes. We augmented the X-ray catalogs with multiwavelength data, source class, and variability properties. We then built a reference sample of about 25000 X-ray sources of known nature. From this sample, the distribution of each property was carefully estimated and taken as a reference to assign probabilities of belonging to each class. The classification was then performed on the whole catalog, combining the information from each property. Using the algorithm on the Swift reference sample, we retrieved 99%, 98%, 92%, and 34% of AGN, stars, XRBs, and CVs, respectively, and the false positive rates are 3%, 1%, 9%, and 15%. Similar results are obtained on XMM sources. When applied to a carefully selected test sample, representing 55% of the X-ray catalog, the classification gives consistent results in terms of distributions of source properties. A substantial fraction of sources not belonging to any class is efficiently retrieved using the outlier measure, as are AGN and stars with properties deviating from the bulk of their class. Our algorithm was then compared to a random forest method; the two show similar performance, but the algorithm presented in this paper provides better insight into the grounds of each classification.
This robust classification method can be tailored to include additional or different source classes and can be applied to other X-ray catalogs. The transparency of the classification compared to other methods makes it a useful tool in the search for homogeneous populations or rare source types, including multi-messenger events. Such a tool will be increasingly valuable with the development of surveys of unprecedented size, such as LSST, SKA, and Athena, and the search for counterparts of multi-messenger events.
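The core naive Bayes idea (independent per-property likelihoods combined with class priors) can be sketched in a few lines. The classes, properties, priors, and Gaussian parameters below are toy values for illustration, not the distributions estimated in the study:

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log-density of a 1D Gaussian, used as a per-property likelihood."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# Per-class (mean, std) for two toy properties (e.g. a hardness ratio
# and a variability measure). Values are illustrative only.
params = {
    "AGN":  [(0.5, 0.2), (0.1, 0.05)],
    "star": [(-0.5, 0.2), (0.05, 0.02)],
}
priors = {"AGN": 0.7, "star": 0.3}

def classify(x):
    scores = {}
    for cls, dists in params.items():
        score = math.log(priors[cls])
        for xi, (mu, sigma) in zip(x, dists):
            score += gaussian_logpdf(xi, mu, sigma)  # naive independence assumption
        scores[cls] = score
    return max(scores, key=scores.get)
```

The real classifier also uses an outlier measure to flag sources that fit none of the classes; that step is omitted here.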

  19. Data from: New Variable Selection Method Using Interval Segmentation Purity...

    • figshare.com
    • acs.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Li-Juan Tang; Wen Du; Hai-Yan Fu; Jian-Hui Jiang; Hai-Long Wu; Guo-Li Shen; Ru-Qin Yu (2023). New Variable Selection Method Using Interval Segmentation Purity with Application to Blockwise Kernel Transform Support Vector Machine Classification of High-Dimensional Microarray Data [Dataset]. http://doi.org/10.1021/ci900032q.s001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Li-Juan Tang; Wen Du; Hai-Yan Fu; Jian-Hui Jiang; Hai-Long Wu; Guo-Li Shen; Ru-Qin Yu
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    One problem with discriminant analysis of microarray data is the representation of each sample by a large number of genes that are possibly irrelevant, insignificant, or redundant. Methods of variable selection are, therefore, of great significance in microarray data analysis. A new method for key gene selection has been proposed on the basis of interval segmentation purity, which is defined as the purity of samples belonging to a certain class in intervals segmented by a mode search algorithm. This method identifies the key variables most discriminative for each class, which offers the possibility of unraveling the biological implications of the selected genes. A salient advantage of the new strategy over existing methods is its capability of selecting genes that, though they may exhibit a multimodal distribution, are the most discriminative for the classes of interest, considering that the expression levels of some genes may reflect systematic differences among within-class samples derived from different pathogenic mechanisms. On the basis of the key genes selected for individual classes, a support vector machine with block-wise kernel transform is developed for the classification of the different classes. The combination of the proposed gene mining approach with the support vector machine is demonstrated in cancer classification using two public data sets. The results reveal that significant genes have been identified for each class, and the classification model shows satisfactory performance in training and prediction for both data sets.
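A much-simplified stand-in for this kind of per-class variable selection is sketched below: genes are ranked by a t-like separation score between two classes rather than by the paper's interval segmentation purity, and the expression data is synthetic:

```python
import statistics

def separation_score(values_a, values_b):
    """Rank genes by between-class mean separation relative to spread.
    A simplified substitute for interval segmentation purity."""
    ma, mb = statistics.mean(values_a), statistics.mean(values_b)
    sa, sb = statistics.stdev(values_a), statistics.stdev(values_b)
    return abs(ma - mb) / (sa + sb + 1e-9)

# expression[gene] = (class A sample values, class B sample values); synthetic.
expression = {
    "geneA": ([1.0, 1.1, 0.9], [3.0, 3.2, 2.8]),  # well separated between classes
    "geneB": ([1.0, 2.0, 3.0], [1.1, 2.1, 2.9]),  # heavily overlapping
}
ranked = sorted(expression,
                key=lambda g: separation_score(*expression[g]),
                reverse=True)
```

The top-ranked genes would then feed a downstream classifier, as the key genes do for the block-wise kernel SVM in the paper.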

  20. Riverine Sand Mining/Scofield Island Restoration (BA-40): 2018 habitat...

    • data.usgs.gov
    • catalog.data.gov
    Updated Nov 19, 2021
    Cite
    Holly Beck; Hana Thurman; Nicholas Enwright; Jason Dugas; Wyatt Cheney (2021). Riverine Sand Mining/Scofield Island Restoration (BA-40): 2018 habitat classification, detailed habitat classes [Dataset]. http://doi.org/10.5066/P97NSPBM
    Explore at:
    Dataset updated
    Nov 19, 2021
    Dataset provided by
    United States Geological Survey, http://www.usgs.gov/
    Authors
    Holly Beck; Hana Thurman; Nicholas Enwright; Jason Dugas; Wyatt Cheney
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Dec 22, 2018
    Description

    The Barrier Island Comprehensive Monitoring (BICM) program was developed by Louisiana’s Coastal Protection and Restoration Authority (CPRA) and is implemented as a component of the System Wide Assessment and Monitoring Program (SWAMP). The program uses both historical data and contemporary data collections to assess and monitor changes in the areal and subaqueous extent of islands, habitat types, sediment texture and geotechnical properties, environmental processes, and vegetation composition. Examples of BICM datasets include still and video aerial photography for documenting shoreline changes, shoreline positions, habitat mapping, land change analyses, light detection and ranging (lidar) surveys for topographic elevations, single-beam and swath bathymetry, and sediment grab samples. For more information about the BICM program, see Kindinger and others (2013). The U.S. Geological Survey, Wetland and Aquatic Research Center provides support to the BICM program through the develop ...

Cite
Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1

Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets.

Explore at:
Dataset updated
Nov 14, 2018
Authors
Scott Herford
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered
North Carolina
Description

The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, 'reducing' dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all the information. From the creating-new-features perspective: clustering analysis creates labels based on patterns in the data, which brings uncertainty into the data. When using clustering prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, which in turn affects the performance of the classification. If the subset of features we apply clustering to is well suited for it, it might increase the overall classification performance.
For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not fix the clustering outputs with a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods at all. In practice, the results we saw after applying clustering in preprocessing were not much better than random. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to continue to revise the models from time to time as things change.
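The k-means idea mentioned above can be made concrete with a minimal sketch (pure Python, synthetic one-dimensional data): cluster a numeric feature and attach the cluster id as a new feature. The actual project used different features and tooling, so treat this only as an illustration of the idea:

```python
import random

def kmeans_1d(values, k=2, iters=20, seed=0):
    """Minimal 1D k-means: returns a cluster id per input value."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # assign each value to its nearest center
            clusters[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        # recompute centers; keep the old center if a cluster went empty
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [min(range(k), key=lambda i: abs(v - centers[i])) for v in values]

scores = [0.1, 0.2, 0.15, 0.9, 0.95, 0.85]   # synthetic numeric feature
cluster_ids = kmeans_1d(scores)
# append the cluster id to each row as a new categorical feature
rows = [(s, c) for s, c in zip(scores, cluster_ids)]
```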
