https://paper.erudition.co.in/terms
Question Paper Solutions of chapter Classification and Prediction of Data Warehousing and Data Mining, 3rd Semester, Master of Computer Applications (2 Years)
https://www.gnu.org/licenses/gpl-3.0.html
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This spreadsheet presents the meticulously classified results from the conducting phase of our systematic literature review titled "From Manual to Automated: A State-of-the-Art Review to Examine the Impact of Intelligent Document Processing in Banking Automation". Each entry within this document represents an individual study analyzed during our research, categorized according to a carefully designed classification framework to ensure a comprehensive and clear understanding of the evolving landscape in banking automation through Intelligent Document Processing (IDP) technologies.
Classification Framework Overview
This classification scheme is instrumental in providing a structured, in-depth analysis of the field's current state, trends, and future directions. The framework aids in navigating the vast amount of information in the domain, offering researchers, practitioners, and policymakers a clear vision of the significant aspects of each study to foster informed decisions and further innovation in banking automation through IDP.
AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
The AG's news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('ag_news_subset', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
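Since the snippet above yields examples with an integer label feature, it can help to map labels back to topic names. A minimal helper, assuming the standard AG News class order (World, Sports, Business, Sci/Tech):

```python
# Topic names for ag_news_subset labels 0-3, in the dataset's class order.
CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

def label_to_topic(label: int) -> str:
    """Map the integer 'label' feature from ag_news_subset to a topic name."""
    if not 0 <= label < len(CLASS_NAMES):
        raise ValueError(f"ag_news_subset labels are 0-3, got {label}")
    return CLASS_NAMES[label]

print(label_to_topic(2))  # -> Business
```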
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This dataset provides a comprehensive overview of 200 unique bacterial species, highlighting their scientific classification, natural habitats, and potential impacts on human health. Designed for data scientists and researchers, this collection serves as a foundational resource for studies in microbiology, public health, and environmental science. Each entry has been meticulously compiled to offer insights into the diverse roles bacteria play in ecosystems and their interactions with humans.
With 200 carefully curated entries, this dataset is ideal for a variety of data science applications, including but not limited to: - Predictive modeling to understand factors influencing bacterial habitats and human health implications. - Clustering analyses to uncover patterns and relationships among bacterial families and their characteristics. - Data visualization projects to illustrate the diversity of bacterial life and its relevance to ecosystems and health.
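As an illustration of the clustering use case above, the sketch below groups entries by one-hot-encoded categorical traits using scikit-learn's k-means. The column names (`family`, `habitat`) and the toy rows are hypothetical stand-ins; adjust them to the dataset's actual schema:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in rows; the real CSV's columns may differ.
df = pd.DataFrame({
    "family":  ["Enterobacteriaceae", "Bacillaceae",
                "Enterobacteriaceae", "Vibrionaceae"],
    "habitat": ["gut", "soil", "water", "water"],
})

# One-hot encode the categorical traits so a distance-based
# algorithm like k-means can operate on them.
X = OneHotEncoder().fit_transform(df[["family", "habitat"]]).toarray()

# Group the species into two clusters.
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(df)
```

Richer analyses would add numeric features (e.g., any health-impact scores present in the data) before clustering.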
The compilation of this dataset adheres to ethical data mining practices, ensuring respect for intellectual property rights and scientific integrity. No proprietary or confidential information has been included without appropriate permissions and acknowledgments.
The data within this dataset has been gathered and synthesized from a range of authoritative sources, ensuring reliability and accuracy:
Websites: - CDC (Centers for Disease Control and Prevention): Offers extensive information on pathogenic bacteria and their impact on human health. - WHO (World Health Organization): Provides global health-related data, including details on bacteria responsible for infectious diseases.
Scientific Journals: - "Journal of Bacteriology": A peer-reviewed scientific journal that publishes research articles on the biology of bacteria. - "Microbiology": Offers articles on microbiology, virology, and molecular biology, with a focus on novel bacterial species and their functions.
Textbooks: - "Brock Biology of Microorganisms" by Michael T. Madigan et al.: A comprehensive textbook covering the principles of microbiology, including detailed information on bacteria. - "Prescott's Microbiology" by Joanne Willey, Linda Sherwood, and Christopher J. Woolverton: Provides a thorough introduction to the field of microbiology, with an emphasis on bacterial species and their roles.
This dataset represents a synthesis of credible scientific knowledge aimed at fostering research and education in microbiology and related fields.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset, named hate_speech_offensive, is a meticulously curated collection of annotated tweets with the specific purpose of detecting hate speech and offensive language. The dataset primarily consists of English tweets and is designed to train machine learning models or algorithms in the task of hate speech detection. It should be noted that the dataset has not been divided into multiple subsets, and only the train split is currently available for use.
The dataset includes several columns that provide valuable information for understanding each tweet's classification. The column count represents the total number of annotations provided for each tweet, whereas hate_speech_count signifies how many annotations classified a particular tweet as hate speech. On the other hand, offensive_language_count indicates the number of annotations categorizing a tweet as containing offensive language. Additionally, neither_count denotes how many annotations identified a tweet as neither hate speech nor offensive language.
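A common way to use these columns is to collapse the per-tweet tallies into a single majority-vote class. A minimal sketch (ties here fall back to the first category listed, which is an arbitrary choice):

```python
def majority_label(hate_speech_count, offensive_language_count, neither_count):
    """Collapse the three per-tweet annotation tallies into one class.

    Ties are broken by the (arbitrary) order below: hate_speech first.
    """
    counts = {
        "hate_speech": hate_speech_count,
        "offensive_language": offensive_language_count,
        "neither": neither_count,
    }
    # max() returns the first key with the highest count.
    return max(counts, key=counts.get)

print(majority_label(1, 4, 1))  # -> offensive_language
```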
For researchers and developers aiming to create effective models or algorithms capable of detecting hate speech and offensive language on Twitter, this comprehensive dataset offers a rich resource for training and evaluation purposes.
How to use the dataset
Dataset Overview:
The dataset is presented in a CSV file named 'train.csv'. It consists of annotated tweets with information about their classification as hate speech, offensive language, or neither. Each row represents a tweet along with the corresponding annotations provided by multiple annotators. The main columns essential for analysis are: count (total number of annotations), hate_speech_count (number of annotations classifying a tweet as hate speech), offensive_language_count (number of annotations classifying a tweet as offensive language), and neither_count (number of annotations classifying a tweet as neither).
Data Collection Methodology: Tweets were obtained from Twitter's public API using specific search terms related to hate speech and offensive language, then manually labeled by multiple annotators for classification purposes.
Data Quality: Although efforts have been made to ensure the accuracy of the data, it is important to acknowledge that annotations are subjective opinions provided by individual annotators. As such, there may be variations in classifications between annotators.
Preprocessing Techniques: Prior to training machine learning models or algorithms on this dataset, it is recommended to apply standard preprocessing techniques such as removing URLs, usernames/handles, special characters/punctuation marks, stop words removal, tokenization, stemming/lemmatization etc., depending on your analysis requirements.
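A minimal sketch of the suggested cleaning steps, using only the standard library (stop-word removal and stemming/lemmatization are omitted here and would typically come from a library such as NLTK or spaCy):

```python
import re

def clean_tweet(text: str) -> str:
    """Strip URLs, @handles, and punctuation, then lowercase and
    collapse whitespace -- the basic steps suggested above."""
    text = re.sub(r"https?://\S+", " ", text)          # URLs
    text = re.sub(r"@\w+", " ", text)                  # usernames/handles
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # punctuation/special chars
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("@user Check this!!! https://t.co/xyz So rude..."))
# -> "check this so rude"
```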
Exploratory Data Analysis (EDA): Conducting EDA on the dataset will help you gain insights and understand the underlying patterns in hate speech and offensive language. Some potential analysis ideas include:
- Distribution of tweet counts per classification category (hate speech, offensive language, neither).
- Most common words/phrases associated with each class.
- Co-occurrence analysis to identify correlations between hate speech and offensive language.
Building Machine Learning Models: To train models for automatic detection of hate speech and offensive language, you can follow these steps: a) Split the dataset into training and testing sets for model evaluation purposes. b) Choose appropriate features/
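The training steps above can be sketched with scikit-learn; the toy texts and binary labels below are placeholders for the real tweets and their majority-vote classes:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the tweet texts and their majority-vote classes.
texts  = ["you are awful", "have a nice day", "terrible hateful person",
          "lovely weather", "so hateful", "great game today",
          "awful hateful take", "good morning all"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = hateful/offensive, 0 = neither

# Step a) split into training and testing sets ...
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

# Step b) ... and choose features (TF-IDF) plus a simple linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```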
Research Ideas
Sentiment Analysis: This dataset can be used to train models for sentiment analysis on Twitter data. By classifying tweets as hate speech, offensive language, or neither, the dataset can help in understanding the sentiment behind different tweets and identifying patterns of negative or offensive language.
Hate Speech Detection: The dataset can be used to develop models that automatically detect hate speech on Twitter. By training machine learning algorithms on this annotated dataset, it becomes possible to create systems that can identify and flag hate speech in real time, making social media platforms safer and more inclusive.
Content Moderation: Social media platforms can use this dataset to improve their content moderation systems. By using machine learning algorithms trained on this data, it becomes easier to automatically detect and remove offensive or hateful content, reducing the burden on human moderators and keeping online spaces free from toxic behavior.
Acknowledgements: If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive contains the replication package for the DRAGON multi-label classification models, which leverage BERT-based architectures. The package includes scripts for repository mining, dataset creation, data processing, model training, and evaluation. The two main models used are DRAGON and LEGION.
Before running any commands, ensure you have the necessary dependencies installed. It is recommended to use a virtual environment:
python3 -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
repository_mining/: Contains scripts for mining the initial set of repositories.
repository_mining/doc/: Includes documentation with the necessary information for repository mining.
dataset_creation/: Contains all the notebooks to be run sequentially to prepare the dataset.
multilabel_class/: Contains scripts for classification, threshold tuning, and evaluation.
multilabel_class/model_output/: Contains the trained models, organized first by dataset and then by model variant.
data/: Contains the Hugging Face datasets (our dataset and the LEGION dataset) ready for training/evaluation.
To mine the initial set of repositories from Software Heritage, use the scripts available in the repository_mining/ folder. Detailed information and steps for repository mining can be found in repository_mining/doc/.
After mining the repositories, prepare the dataset by running the Jupyter notebooks inside the dataset_creation/ folder in sequence. These notebooks handle the data cleaning, transformation, and formatting necessary for model training; each notebook documents every step it performs.
Once the dataset is prepared, convert it into a Hugging Face dataset using:
python3 multilabel_class/create_dataset.py --file_path data/02_processed_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned.csv
After processing the dataset, train the DRAGON model and tune its classification thresholds with the following command:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
Modify the configuration file multilabel_class/utils/config.py to set the following parameter to True:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': True # If True, process as (text1, text2); if False, concatenate texts
}
To train DRAGON without using sentence pairs, use the same command but set use_sentence_pairs to False in the config file:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': False
}
To train DRAGON on a benchmark dataset, use:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/LEGION/dataset
Ensure the use_sentence_pairs parameter is set to True in config.py.
To train LEGION on the DRAGON dataset, use:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant db --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
Ensure the use_sentence_pairs parameter is set to False in config.py:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': False
}
To train LEGION on a baseline dataset, run:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant db --dataset_path data/03_huggingaceV_datasets/LEGION/dataset
Once thresholds are tuned, you can evaluate the model using:
python3 multilabel_class/evaluation.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
This evaluation script computes standard multi-label classification metrics.
Ensure that the model variant and dataset path correspond to the previously trained model.
For an interactive and visual analysis of model performance, you can also use the provided Jupyter notebooks located in:
DRAGON_replication/multilabel_class/notebooks/
These notebooks reproduce the complete evaluation pipeline and generate additional visualizations and metrics discussed in the associated paper.
Both command-line and notebook-based evaluations ensure reproducibility and offer complementary insights into model behavior.
Several folders in this replication package have been compressed into .zip files to reduce package size. Before running any code, you must unzip all the provided .zip files in place: extract each archive into the same directory as the .zip file, into a folder with the same name as the archive (without the .zip extension).
For example:
DRAGON_replication\data\02_processed_dataset\2024-05-22.zip
should be extracted to:
DRAGON_replication\data\02_processed_dataset\2024-05-22\
.zip files to extract:
DRAGON_replication\data\02_processed_dataset\2024-05-22.zip
DRAGON_replication\data\03_huggingaceV_datasets\2024-05-22.zip
DRAGON_replication\data\03_huggingaceV_datasets\LEGION.zip
DRAGON_replication\dataset_creation\data.zip
DRAGON_replication\multilabel_class\model_output\2024-05-22.zip
DRAGON_replication\multilabel_class\model_output\LEGION.zip
Make sure that after extraction, each corresponding folder exists and contains the expected files. Do not change the folder names or directory structure after unzipping.
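The in-place extraction described above can be scripted. This sketch assumes each archive's contents should land in a sibling folder named after the archive, and it silently skips archives that are not present:

```python
import zipfile
from pathlib import Path

ROOT = Path("DRAGON_replication")

# The archives listed above, relative to the package root.
ARCHIVES = [
    ROOT / "data" / "02_processed_dataset" / "2024-05-22.zip",
    ROOT / "data" / "03_huggingaceV_datasets" / "2024-05-22.zip",
    ROOT / "data" / "03_huggingaceV_datasets" / "LEGION.zip",
    ROOT / "dataset_creation" / "data.zip",
    ROOT / "multilabel_class" / "model_output" / "2024-05-22.zip",
    ROOT / "multilabel_class" / "model_output" / "LEGION.zip",
]

for zip_path in ARCHIVES:
    if not zip_path.exists():  # skip archives that are not present
        continue
    target = zip_path.with_suffix("")  # same name, ".zip" stripped
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target)
    print(f"extracted {zip_path} -> {target}")
```

Note that if an archive already contains a top-level folder named after itself, extracting into `target` would nest it one level deeper; check one archive first and adjust accordingly.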
This README provides an overview of the essential steps for repository mining, dataset preparation, processing, model training, and evaluation. For further customization, refer to the configuration files and experiment with different preprocessing settings.
Data Science Platform Market Size 2025-2029
The data science platform market size is forecast to increase by USD 763.9 million, at a CAGR of 40.2% between 2024 and 2029.
The market is experiencing significant growth, driven by the increasing integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. This fusion enables organizations to derive deeper insights from their data, fueling business innovation and decision-making. Another trend shaping the market is the emergence of containerization and microservices in data science platforms. This approach offers enhanced flexibility, scalability, and efficiency, making it an attractive choice for businesses seeking to streamline their data science operations. However, the market also faces challenges. Data privacy and security remain critical concerns, with the increasing volume and complexity of data posing significant risks. Ensuring robust data security and privacy measures is essential for companies to maintain customer trust and comply with regulatory requirements. Additionally, managing the complexity of data science platforms and ensuring seamless integration with existing systems can be a daunting task, requiring significant investment in resources and expertise. Companies must navigate these challenges effectively to capitalize on the market's opportunities and stay competitive in the rapidly evolving data landscape.
What will be the Size of the Data Science Platform Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
The market continues to evolve, driven by the increasing demand for advanced analytics and artificial intelligence solutions across various sectors. Real-time analytics and classification models are at the forefront of this evolution, with API integrations enabling seamless implementation. Deep learning and model deployment are crucial components, powering applications such as fraud detection and customer segmentation. Data science platforms provide essential tools for data cleaning and data transformation, ensuring data integrity for big data analytics. Feature engineering and data visualization facilitate model training and evaluation, while data security and data governance ensure data privacy and compliance. Machine learning algorithms, including regression models and clustering models, are integral to predictive modeling and anomaly detection.
Statistical analysis and time series analysis provide valuable insights, while ETL processes streamline data integration. Cloud computing enables scalability and cost savings, while risk management and algorithm selection optimize model performance. Natural language processing and sentiment analysis offer new opportunities for data storytelling and computer vision. Supply chain optimization and recommendation engines are among the latest applications of data science platforms, demonstrating their versatility and continuous value proposition. Data mining and data warehousing provide the foundation for these advanced analytics capabilities.
How is this Data Science Platform Industry segmented?
The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Deployment: On-premises, Cloud
Component: Platform, Services
End-user: BFSI, Retail and e-commerce, Manufacturing, Media and entertainment, Others
Sector: Large enterprises, SMEs
Application: Data Preparation, Data Visualization, Machine Learning, Predictive Analytics, Data Governance, Others
Geography: North America (US, Canada), Europe (France, Germany, UK), Middle East and Africa (UAE), APAC (China, India, Japan), South America (Brazil), Rest of World (ROW)
By Deployment Insights
The on-premises segment is estimated to witness significant growth during the forecast period. In this dynamic market, businesses increasingly adopt solutions to gain real-time insights from their data, enabling them to make informed decisions. Classification models and deep learning algorithms are integral parts of these platforms, providing capabilities for fraud detection, customer segmentation, and predictive modeling. API integrations facilitate seamless data exchange between systems, while data security measures ensure the protection of valuable business information. Big data analytics and feature engineering are essential for deriving meaningful insights from vast datasets. Data transformation, data mining, and statistical analysis are crucial processes in data preparation and discovery. Machine learning models, including regression and clustering, are employed for model training and evaluation. Time series analysis and natural language processing are valuable tools for understanding trends and customer sentiment.
description: The North American Product Classification System (NAPCS) is a joint multi-phase initiative to develop a comprehensive demand-oriented product classification developed by the statistical agencies of Canada, Mexico, and the United States. Work to date has focused on the products produced by service industries in 12 NAICS sectors 48-49 through 81. With that work provisionally complete, this web page provides an overview of and progress report on the NAPCS initiative and presents the final versions of the product lists developed so far for the service industries included in those 12 sectors. Work is underway developing NAPCS products of industries in NAICS sectors not yet covered (Sector 11: Agriculture, Forestry, Fishing and Hunting; Sector 21: Mining; Sector 22: Utilities; Sector 23: Construction; Sector 31-33: Manufacturing; Sector 42: Wholesale Trade; and Sector 44-45: Retail Trade). Provisional lists will be announced on this site as they are decided.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: In silico tools capable of predicting the functional consequences of genomic differences between individuals, many of them AI-driven, have proven most effective over the past two decades for non-synonymous single nucleotide variants (nsSNVs). When appropriately selected for the purpose of the study, a high predictive performance can be expected. In this feasibility study, we investigate the distribution of nsSNVs with an allele frequency below 5%. To classify the putative functional consequences, a tier-based filtration led by AI-driven predictors and a scoring system was integrated into the overall decision-making process, resulting in a list of prioritised genes. Methods: The study was conducted on breast cancer patients of homogeneous ethnicity. Rare germline variants were sequenced in genes that influence pharmacokinetic parameters of anticancer drugs or molecular signalling pathways in cancer. After AI-driven functional pathogenicity classification and data mining in pharmacogenomic (PGx) databases, variants were collapsed to the gene level and ranked according to their putative deleterious role. Results: In breast cancer patients, seven of the twelve genes prioritised based on the predictions were found to be associated with response to oncotherapy, histological grade, and tumour subtype. Most importantly, we showed that the group of patients with at least one rare nsSNV in Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) had significantly reduced disease-free (Log Rank, p=0.002) and overall survival (Log Rank, p=0.006). Conclusion: AI-driven in silico analysis with PGx data mining provided an effective approach for navigating functional consequences across the germline genetic background, which can be easily integrated into the overall decision-making process in future studies. The study revealed statistically significant associations with numerous clinicopathological parameters, including treatment response.
Our study indicates that CFTR may be involved in the processes influencing the effectiveness of oncotherapy or in the malignant progression of the disease itself.
Ⅰ. Overview
This data set is derived from Landsat MSS, TM, and ETM remote sensing imagery. Using a hierarchical land cover classification system, it divides the whole region into six first-level classes (cultivated land, forest land, grassland, water area, urban/rural-industrial/mining-residential land, and unused land) and 31 second-level classes.
Ⅱ. Data processing description
The data set uses Landsat MSS, TM, and ETM remote sensing data as the base map. Its projection is the Albers equal-area conic projection, the interpretation scale is 1:24,000 using human-computer interactive visual interpretation, and the data are stored in ESRI coverage format.
Ⅲ. Data content description
The data set adopts the hierarchical land cover classification system described above, with 6 first-level classes and 31 second-level classes.
Ⅳ. Data use description
The data can be used mainly in national land resources surveys and in research on climate change, hydrology, and ecology.