100+ datasets found
  1. Preprocessing Tool Kit

    • kaggle.com
    Updated Jan 16, 2022
    Cite
    Nirmal Dash (2022). Preprocessing Tool Kit [Dataset]. https://www.kaggle.com/nirmaldash/preprocessing-tool-kit/discussion
    Dataset updated
    Jan 16, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nirmal Dash
    Description

    Dataset

    This dataset was created by Nirmal Dash


  2. Data_Sheet_1_FuNP (Fusion of Neuroimaging Preprocessing) Pipelines: A Fully...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Bo-yong Park; Kyoungseob Byeon; Hyunjin Park (2023). Data_Sheet_1_FuNP (Fusion of Neuroimaging Preprocessing) Pipelines: A Fully Automated Preprocessing Software for Functional Magnetic Resonance Imaging.docx [Dataset]. http://doi.org/10.3389/fninf.2019.00005.s001
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Bo-yong Park; Kyoungseob Byeon; Hyunjin Park
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The preprocessing of functional magnetic resonance imaging (fMRI) data is necessary to remove unwanted artifacts and transform the data into a standard format. There are several neuroimaging data processing tools that are widely used, such as SPM, AFNI, FSL, FreeSurfer, Workbench, and fMRIPrep. Different data preprocessing pipelines yield differing results, which might reduce the reproducibility of neuroimaging studies. Here, we developed a preprocessing pipeline for T1-weighted structural MRI and fMRI data by combining components of well-known software packages to fully incorporate recent developments in MRI preprocessing into a single coherent software package. The developed software, called FuNP (Fusion of Neuroimaging Preprocessing) pipelines, is fully automatic and provides both volume- and surface-based preprocessing pipelines with a user-friendly graphical interface. The reliability of the software was assessed by comparing resting-state networks (RSNs) obtained using FuNP with pre-defined RSNs using open research data (n = 90). The obtained RSNs were well-matched with the pre-defined RSNs, suggesting that the pipelines in FuNP are reliable. In addition, image quality metrics (IQMs) were calculated from the results of three different software packages (i.e., FuNP, FSL, and fMRIPrep) to compare the quality of the preprocessed data. We found that our FuNP outperformed other software in terms of temporal characteristics and artifacts removal. We validated our pipeline with independent local data (n = 28) in terms of IQMs. The IQMs of our local data were similar to those obtained from the open research data. The codes for FuNP are available online to help researchers.

  3. Processing Tools for Verifying the Yang Pressure Boundary Condition in...

    • rdr.ucl.ac.uk
    zip
    Updated Jun 11, 2025
    Cite
    Sharp C. Y. Lo (2025). Processing Tools for Verifying the Yang Pressure Boundary Condition in HemeLB [Dataset]. http://doi.org/10.5522/04/29001521.v1
    Available download formats: zip
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    University College London
    Authors
    Sharp C. Y. Lo
    License

    https://www.apache.org/licenses/LICENSE-2.0.html

    Description

    This release provides the preprocessing and postprocessing scripts used to verify the Yang pressure boundary condition implemented in HemeLB. The scripts are designed to support the preparation and analysis of simulations of Poiseuille flow in cylindrical pipe domains, which are derived from a common cylinder geometry but differ in orientation: one aligned with the computational grid and the other rotated by 20 degrees to introduce grid misalignment. These tools facilitate input generation and postprocessing across a range of spatial resolutions, enabling evaluation of the Yang pressure boundary condition’s performance under different geometric and discretisation scenarios.
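
    As an illustration of the kind of check such postprocessing involves, the sketch below evaluates the standard analytical Poiseuille profile for a point in a cylinder whose axis has been rotated by 20 degrees. It is not taken from the released scripts: the function names, the choice of the y-axis as rotation axis, and the parameter values are all assumptions made for the example.

```python
import math

def poiseuille_velocity(r, radius, dp, mu, length):
    """Analytical axial speed of laminar pipe flow at radial distance r.

    u(r) = dp / (4 * mu * length) * (radius^2 - r^2)
    """
    return dp / (4.0 * mu * length) * (radius ** 2 - r ** 2)

def rotate_about_y(vec, degrees):
    """Rotate a 3D vector about the y-axis (rotation axis is an assumption)."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    x, y, z = vec
    return (c * x + s * z, y, -s * x + c * z)

def radial_distance(point, origin, axis):
    """Distance from a point to the cylinder axis through `origin`."""
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    rel = [p - o for p, o in zip(point, origin)]
    along = sum(r * u for r, u in zip(rel, unit))
    perp = [r - along * u for r, u in zip(rel, unit)]
    return math.sqrt(sum(c * c for c in perp))

# Grid-aligned cylinder axis, and the same axis rotated by 20 degrees.
aligned_axis = (0.0, 0.0, 1.0)
rotated_axis = rotate_about_y(aligned_axis, 20.0)

# Hypothetical sample point and flow parameters (illustrative units).
r = radial_distance((0.3, 0.0, 0.5), (0.0, 0.0, 0.0), rotated_axis)
u = poiseuille_velocity(r, radius=1.0, dp=4.0, mu=1.0, length=1.0)
```

    A simulated velocity field could then be compared against `u` at each lattice site to estimate the error for each orientation and resolution.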

  4. Car Highway Dataset

    • universe.roboflow.com
    zip
    Updated Sep 13, 2023
    Cite
    Sallar (2023). Car Highway Dataset [Dataset]. https://universe.roboflow.com/sallar/car-highway/dataset/1
    Available download formats: zip
    Dataset updated
    Sep 13, 2023
    Dataset authored and provided by
    Sallar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Vehicles Bounding Boxes
    Description

    Car-Highway Data Annotation Project

    Introduction

    In this project, we aim to annotate car images captured on highways. The annotated data will be used to train machine learning models for various computer vision tasks, such as object detection and classification.

    Project Goals

    • Collect a diverse dataset of car images from highway scenes.
    • Annotate the dataset to identify and label cars within each image.
    • Organize and format the annotated data for machine learning model training.

    Tools and Technologies

    For this project, we will be using Roboflow, a powerful platform for data annotation and preprocessing. Roboflow simplifies the annotation process and provides tools for data augmentation and transformation.

    Annotation Process

    1. Upload the raw car images to the Roboflow platform.
    2. Use the annotation tools in Roboflow to draw bounding boxes around each car in the images.
    3. Label each bounding box with the corresponding class (e.g., car).
    4. Review and validate the annotations for accuracy.

    Data Augmentation

    Roboflow offers data augmentation capabilities, such as rotation, flipping, and resizing. These augmentations can help improve the model's robustness.
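
    To make the three augmentations concrete, here is a minimal pure-Python sketch that treats an image as a nested list of pixel values. It is not Roboflow's implementation; the function names are hypothetical, rotation is limited to a 90-degree turn, and resizing uses simple nearest-neighbour sampling.

```python
def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]

def rotate_90_clockwise(img):
    """Rotate the image a quarter turn clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def resize_nearest(img, new_h, new_w):
    """Resize by picking the nearest source pixel for each target pixel."""
    h, w = len(img), len(img[0])
    return [
        [img[i * h // new_h][j * w // new_w] for j in range(new_w)]
        for i in range(new_h)
    ]

image = [[1, 2],
         [3, 4]]

flipped = flip_horizontal(image)
rotated = rotate_90_clockwise(image)
enlarged = resize_nearest(image, 4, 4)
```

    When an image is flipped or rotated, its bounding boxes need the same geometric transform so the labels stay aligned with the pixels.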

    Data Export

    Once the data is annotated and augmented, Roboflow allows us to export the dataset in various formats suitable for training machine learning models, such as YOLO, COCO, or TensorFlow Record.
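
    As a rough idea of what an export to the YOLO text format involves, the sketch below converts one pixel-space bounding box into a YOLO label line (class index followed by normalized centre x, centre y, width, and height). The function name, the box layout `(x_min, y_min, x_max, y_max)`, and the six-decimal formatting are assumptions for illustration, not Roboflow's exporter.

```python
def to_yolo_line(class_id, box, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2.0 / img_w   # normalized to [0, 1]
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# One car (class 0) in a hypothetical 400x200 image.
line = to_yolo_line(0, (100, 50, 300, 150), img_w=400, img_h=200)
```

    Each image gets one text file with one such line per annotated car.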

    Milestones

    1. Data Collection and Preprocessing
    2. Annotation of Car Images
    3. Data Augmentation
    4. Data Export
    5. Model Training

    Conclusion

    By completing this project, we will have a well-annotated dataset ready for training machine learning models. This dataset can be used for a wide range of applications in computer vision, including car detection and tracking on highways.

  5. Climatologies for the IceNet Preprocessing Tool

    • zenodo.org
    nc
    Updated Sep 9, 2023
    Cite
    Vanessa Stöckl (2023). Climatologies for the IceNet Preprocessing Tool [Dataset]. http://doi.org/10.5281/zenodo.8328634
    Available download formats: nc
    Dataset updated
    Sep 9, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Vanessa Stöckl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Climatologies for the IceNet Preprocessing Tool

  6. Table_1_Overview of data preprocessing for machine learning applications in...

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Oct 6, 2023
    + more versions
    Cite
    Eliana Ibrahimi; Marta B. Lopes; Xhilda Dhamo; Andrea Simeon; Rajesh Shigdel; Karel Hron; Blaž Stres; Domenica D’Elia; Magali Berland; Laura Judith Marcos-Zambrano (2023). Table_1_Overview of data preprocessing for machine learning applications in human microbiome research.XLSX [Dataset]. http://doi.org/10.3389/fmicb.2023.1250909.s002
    Available download formats: xlsx
    Dataset updated
    Oct 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Eliana Ibrahimi; Marta B. Lopes; Xhilda Dhamo; Andrea Simeon; Rajesh Shigdel; Karel Hron; Blaž Stres; Domenica D’Elia; Magali Berland; Laura Judith Marcos-Zambrano
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Although metagenomic sequencing is now the preferred technique to study microbiome-host interactions, analyzing and interpreting microbiome sequencing data presents challenges primarily attributed to the statistical specificities of the data (e.g., sparse, over-dispersed, compositional, inter-variable dependency). This mini review explores preprocessing and transformation methods applied in recent human microbiome studies to address microbiome data analysis challenges. Our results indicate a limited adoption of transformation methods targeting the statistical characteristics of microbiome sequencing data. Instead, there is a prevalent usage of relative and normalization-based transformations that do not specifically account for the specific attributes of microbiome data. The information on preprocessing and transformations applied to the data before analysis was incomplete or missing in many publications, leading to reproducibility concerns, comparability issues, and questionable results. We hope this mini review will provide researchers and newcomers to the field of human microbiome research with an up-to-date point of reference for various data transformation tools and assist them in choosing the most suitable transformation method based on their research questions, objectives, and data characteristics.

  7. Data from: A New Evaluation Metric for Quantitative Accuracy of...

    • acs.figshare.com
    zip
    Updated Aug 28, 2024
    Cite
    Mengtian Shi; Chiyuan Huang; Renhui Chen; David Da Yong Chen; Binjun Yan (2024). A New Evaluation Metric for Quantitative Accuracy of LC–MS/MS-Based Proteomics with Data-Independent Acquisition [Dataset]. http://doi.org/10.1021/acs.jproteome.4c00088.s003
    Available download formats: zip
    Dataset updated
    Aug 28, 2024
    Dataset provided by
    ACS Publications
    Authors
    Mengtian Shi; Chiyuan Huang; Renhui Chen; David Da Yong Chen; Binjun Yan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Data-independent acquisition (DIA) has improved the identification and quantitation coverage of peptides and proteins in liquid chromatography–tandem mass spectrometry-based proteomics. However, different DIA data-processing tools can produce very different identification and quantitation results for the same data set. Currently, benchmarking studies of DIA tools are predominantly focused on comparing the identification results, while the quantitative accuracy of DIA measurements is acknowledged to be important but insufficiently investigated, and the absence of suitable metrics for comparing quantitative accuracy is one of the reasons. A new metric is proposed for the evaluation of quantitative accuracy to avoid the influence of differences in false discovery rate control stringency. The part of the quantitation results with high reliability was acquired from each DIA tool first, and the quantitative accuracy was evaluated by comparing quantification error rates at the same number of accurate ratios. From the results of four benchmark data sets, the proposed metric was shown to be more sensitive to discriminating the quantitative performance of DIA tools. Moreover, the DIA tools with advantages in quantitative accuracy were consistently revealed by this metric. The proposed metric can also help researchers in optimizing algorithms of the same DIA tool and sample preprocessing methods to enhance quantitative accuracy.

  8. Simulated RNA-seq data of Homo sapiens and Mus musculus

    • figshare.com
    • data.4tu.nl
    • +1 more
    txt
    Updated Jun 20, 2023
    Cite
    Julia Engelmann (2023). Simulated RNA-seq data of Homo sapiens and Mus musculus [Dataset]. http://doi.org/10.4121/uuid:f8f12fa1-ea24-4074-a231-89b075d13d28
    Available download formats: txt
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Julia Engelmann
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset consists of simulated reads of the human and mouse transcriptome. For each transcript as annotated in the respective reference genome (hg19, mm9), 20 reads were simulated using the R package polyester. Base qualities were uniformly sampled from a range of high quality scores. The reads of both organisms were shuffled and written to one file.
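
    The procedure described above can be sketched in a few lines. The original data were produced with the R package polyester; the Python stand-in below only mirrors the described steps (a fixed number of reads per transcript, uniformly sampled high base qualities, shuffling into one pool). The read length, the Phred range of 30 to 40, the FASTQ-style record layout, and all names are assumptions.

```python
import random

def simulate_reads(transcripts, reads_per_transcript=20, read_len=50,
                   qual_range=(30, 40), seed=0):
    """Draw reads from each transcript and return them in shuffled order.

    `transcripts` maps a transcript name to its sequence. Read length and
    quality range are illustrative assumptions, not values from the dataset.
    """
    rng = random.Random(seed)
    reads = []
    for name, seq in transcripts.items():
        for i in range(reads_per_transcript):
            # Pick a random start so the read fits inside the transcript.
            start = rng.randint(0, max(0, len(seq) - read_len))
            bases = seq[start:start + read_len]
            # Uniformly sample Phred scores and encode them as Phred+33.
            quals = "".join(chr(rng.randint(*qual_range) + 33) for _ in bases)
            reads.append(f"@{name}_read{i}\n{bases}\n+\n{quals}")
    # Mix reads from both organisms before writing them to a single file.
    rng.shuffle(reads)
    return reads

# Toy sequences standing in for annotated human and mouse transcripts.
toy = {"human_tx1": "ACGT" * 30, "mouse_tx1": "TTGCA" * 30}
pool = simulate_reads(toy)
```

    Joining `pool` with newlines and writing it to disk would give a single mixed-organism file of the kind the description mentions.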

  9. Preprocessing Part 0 Dataset

    • universe.roboflow.com
    zip
    Updated Apr 25, 2023
    Cite
    Visionifyai Arunima (2023). Preprocessing Part 0 Dataset [Dataset]. https://universe.roboflow.com/visionifyai-arunima/preprocessing-part-0/dataset/2
    Available download formats: zip
    Dataset updated
    Apr 25, 2023
    Dataset authored and provided by
    Visionifyai Arunima
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    PPE KIT Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Safety Regulations Compliance: Industries and construction companies can use this computer vision model to monitor their sites and ensure every worker is complying with safety regulations by properly wearing necessary PPE kits.

    2. Automated Surveillance: The computer vision model can be integrated into CCTV systems in hospitals, factories, or construction sites to monitor if everyone on site is wearing their PPE kits correctly for their safety, helping to prevent workplace accidents.

    3. Training & Education: This model could serve as a tool in training simulations or educational materials, identifying correct and incorrect PPE usage to facilitate learning among new employees or students in safety-oriented programs.

    4. Infection Spread Prevention: In medical and healthcare settings, the model could be used to monitor healthcare professionals and patients to verify that they are using PPE accurately, potentially preventing the spread of infectious diseases.

    5. Rescue Operations Evaluation: For emergency services like firefighters or disaster relief teams, the model could help assess the team's preparedness by checking if all necessary safety equipment is being worn correctly before entering dangerous situations.

  10. North America Pathological Analysis Pre-processing Equipment Market

    • imrmarketreports.com
    Cite
    Swati Kalagate; Akshay Patil; Vishal Kumbhar, North America Pathological Analysis Pre-processing Equipment Market [Dataset]. https://www.imrmarketreports.com/reports/north-america-pathological-analysis-pre-processing-equipment-market
    Dataset provided by
    IMR Market Reports
    Authors
    Swati Kalagate; Akshay Patil; Vishal Kumbhar
    License

    https://www.imrmarketreports.com/privacy-policy/

    Area covered
    North America
    Description

    The North America Pathological Analysis Pre-processing Equipment report features an extensive regional analysis, identifying market penetration levels across major geographic areas. It highlights regional growth trends and opportunities, allowing businesses to tailor their market entry strategies and maximize growth in specific regions.

  11. AH&AITD – Arslan’s Human and AI Text Database

    • kaggle.com
    Updated May 24, 2025
    + more versions
    Cite
    Arslan Doula (2025). AH&AITD – Arslan’s Human and AI Text Database [Dataset]. http://doi.org/10.34740/kaggle/dsv/11936841
    Dataset updated
    May 24, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arslan Doula
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Description: AH&AITD is a comprehensive benchmark dataset designed to support the evaluation of AI-generated text detection tools. The dataset contains 11,580 samples spanning both human-written and AI-generated content across multiple domains. It was developed to address limitations in previous datasets, particularly in terms of diversity, scale, and real-world applicability.

    Purpose: To facilitate research in the detection of AI-generated text by providing a diverse, multi-domain dataset. This dataset enables fair benchmarking of detection tools across various writing styles and content categories.

    Composition: Human-Written Samples (Total: 5,790), collected from:

    Open Web Text (2,343 samples)

    Blogs (196 samples)

    Web Text (397 samples)

    Q&A Platforms (670 samples)

    News Articles (430 samples)

    Opinion Statements (1,549 samples)

    Scientific Research Abstracts (205 samples)

    AI-Generated Samples (Total: 5,790), generated using:

    ChatGPT (1,130 samples)

    GPT-4 (744 samples)

    Paraphrase Models (1,694 samples)

    GPT-2 (328 samples)

    GPT-3 (296 samples)

    DaVinci (GPT-3.5 variant) (433 samples)

    GPT-3.5 (364 samples)

    OPT-IML (406 samples)

    Flan-T5 (395 samples)

    Licensing: Creative Commons Attribution 4.0 International (CC BY 4.0). Citation required when used in academic or commercial work.

    Citation: Akram, A. (2023). AH&AITD: Arslan’s Human and AI Text Database. [Dataset]. Associated with the article: An Empirical Study of AI-Generated Text Detection Tools. Advances in Machine Learning & Artificial Intelligence, 4(2), 44–55.

  12. Data Science And Ml Platforms Market Report | Global Forecast From 2025 To...

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Data Science And Ml Platforms Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/data-science-and-ml-platforms-market
    Available download formats: csv, pptx, pdf
    Dataset updated
    Jan 7, 2025
    Authors
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Science And ML Platforms Market Outlook



    The global market size for Data Science and ML Platforms was estimated to be approximately USD 78.9 billion in 2023, and it is projected to reach around USD 307.6 billion by 2032, growing at a Compound Annual Growth Rate (CAGR) of 16.4% during the forecast period. This remarkable growth can be largely attributed to the increasing adoption of artificial intelligence (AI) and machine learning (ML) across various industries to enhance operational efficiency, predictive analytics, and decision-making processes.
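
    The quoted growth rate can be sanity-checked from the two endpoint figures with the usual compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The sketch below assumes nine compounding years (2023 to 2032); the function name is hypothetical.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over `years` years."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

# USD 78.9 billion in 2023 growing to USD 307.6 billion by 2032.
rate = cagr(78.9, 307.6, 2032 - 2023)
print(f"Implied CAGR: {rate:.1%}")
```

    Under that nine-year assumption the implied rate comes out near 16.3%, close to the quoted 16.4%; the small gap may reflect rounding of the endpoint values or a different base year in the report.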



    The surge in big data and the necessity to make sense of unstructured data is a substantial growth driver for the Data Science and ML Platforms market. Organizations are increasingly leveraging data science and machine learning to gain insights that can help them stay competitive. This is especially true in sectors like retail and e-commerce where customer behavior analytics can lead to more targeted marketing strategies, personalized shopping experiences, and improved customer retention rates. Additionally, the proliferation of IoT devices is generating massive amounts of data, which further fuels the need for advanced data analytics platforms.



    Another significant growth factor is the increasing adoption of cloud-based solutions. Cloud platforms offer scalable resources, flexibility, and substantial cost savings, making them attractive for enterprises of all sizes. Cloud-based data science and machine learning platforms also facilitate collaboration among distributed teams, enabling more efficient workflows and faster time-to-market for new products and services. Furthermore, advancements in cloud technologies, such as serverless computing and containerization, are making it easier for organizations to deploy and manage their data science models.



    Investment in AI and ML by key industry players also plays a crucial role in market growth. Tech giants like Google, Amazon, Microsoft, and IBM are making substantial investments in developing advanced AI and ML tools and platforms. These investments are not only driving innovation but also making these technologies more accessible to smaller enterprises. Additionally, mergers and acquisitions in this space are leading to more integrated and comprehensive solutions, which are further accelerating market growth.



    Machine Learning Tools are at the heart of this technological evolution, providing the necessary frameworks and libraries that empower developers and data scientists to create sophisticated models and algorithms. These tools, such as TensorFlow, PyTorch, and Scikit-learn, offer a range of functionalities from data preprocessing to model deployment, catering to both beginners and experts. The accessibility and versatility of these tools have democratized machine learning, enabling a wider audience to harness the power of AI. As organizations continue to embrace digital transformation, the demand for robust machine learning tools is expected to grow, driving further innovation and development in this space.



    From a regional perspective, North America is expected to hold the largest market share due to the early adoption of advanced technologies and the presence of major market players. However, the Asia Pacific region is anticipated to exhibit the highest growth rate during the forecast period. This is driven by increasing investments in AI and ML, a burgeoning start-up ecosystem, and supportive government policies aimed at digital transformation. Countries like China, India, and Japan are at the forefront of this growth, making significant strides in AI research and application.



    Component Analysis



    When analyzing the Data Science and ML Platforms market by component, it's essential to differentiate between software and services. The software segment includes platforms and tools designed for data ingestion, processing, visualization, and model building. These software solutions are crucial for organizations looking to harness the power of big data and machine learning. They provide the necessary infrastructure for data scientists to develop, test, and deploy ML models. The software segment is expected to grow significantly due to ongoing advancements in AI algorithms and the increasing need for more sophisticated data analysis tools.



    The services segment in the Data Science and ML Platforms market encompasses consulting, system integration, and support services. Consulting services help organizatio

  13. ASA³P Software & Database volume

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 22, 2024
    Cite
    Schwengers, Oliver (2024). ASA³P Software & Database volume [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3606299
    Dataset updated
    Jul 22, 2024
    Dataset authored and provided by
    Schwengers, Oliver
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ASA³P is an automatic and highly scalable assembly, annotation and higher-level analyses pipeline for closely related bacterial isolates. https://github.com/oschwengers/asap

    ASA³P is a fully automatic, locally executable and scalable assembly, annotation and higher-level analysis pipeline creating results in standard bioinformatics file formats as well as sophisticated HTML5 documents. Its main purpose is the automatic processing of NGS WGS data of multiple closely related isolates, thus transforming raw reads into assembled and annotated genomes and finally gathering as much information on every single bacterial genome as possible. Per-isolate analyses are complemented by comparative insights. Therefore, the pipeline incorporates many best-in-class open source bioinformatics tools and thus minimizes the burden of ever-repeating tasks. Envisaged as a preprocessing tool it provides comprehensive insights as well as a general overview and comparison of analysed genomes along with all necessary result files for subsequent deeper analyses. All results are presented via modern HTML5 documents comprising interactive visualizations.

    Schwengers et al, 2020 PLOS Comp Bio DOI:10.1371/journal.pcbi.1007134

  14. AI-Enabled Testing Tools Market By technology (natural language processing...

    • zionmarketresearch.com
    pdf
    Updated Jul 2, 2025
    Cite
    Zion Market Research (2025). AI-Enabled Testing Tools Market By technology (natural language processing (NLP), machine learning & pattern recognition, and computer vision and image processing), By solution (services, which include professional services & managed services, and AI-based tools which are reduction & feature selection, data pre-processing & wrangling, data visualization, and others), By application (efficiency and time-to-market, further categorized into test automation, data analytics, and infrastructure optimization, agility & coverage) And By Region: - Global And Regional Industry Overview, Market Intelligence, Comprehensive Analysis, Historical Data, And Forecasts, 2024-2032 [Dataset]. https://www.zionmarketresearch.com/report/ai-enabled-testing-tools-market
    Available download formats: pdf
    Dataset updated
    Jul 2, 2025
    Dataset authored and provided by
    Zion Market Research
    License

    https://www.zionmarketresearch.com/privacy-policy

    Time period covered
    2022 - 2030
    Area covered
    Global
    Description

    The global AI-Enabled Testing Tools market was valued at USD 437.56 million in 2023 and is projected to reach USD 1,693.95 million by 2032, at a CAGR of 16.23%.

  15. turkish-function-calling-20k

    • huggingface.co
    Updated Mar 25, 2025
    Cite
    Ahmet (2025). turkish-function-calling-20k [Dataset]. https://huggingface.co/datasets/atasoglu/turkish-function-calling-20k
    Dataset updated
    Mar 25, 2025
    Authors
    Ahmet
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Used argilla-warehouse/python-seed-tools to sample tools.

    Preprocessing

    Since some answers might not contain a valid JSON schema, ensure that you preprocess and validate the answer to check if it satisfies the query using the given tools. You can use the preprocessing code below:

        import json
        from datasets import Dataset, load_dataset

        def validate_answers(sample):
            if sample["answers"] is None:
                return True
            try:
                tools = json.loads(sample["tools"])…

    See the full description on the dataset page: https://huggingface.co/datasets/atasoglu/turkish-function-calling-20k.

  16. Machine Learning As A Service (MLaaS) Global Market Report 2025

    • thebusinessresearchcompany.com
    pdf,excel,csv,ppt
    Updated Jan 20, 2025
    + more versions
    Cite
    The Business Research Company (2025). Machine Learning As A Service (MLaaS) Global Market Report 2025 [Dataset]. https://www.thebusinessresearchcompany.com/report/machine-learning-as-a-service-mlaas-global-market-report
    Available download formats: pdf, excel, csv, ppt
    Dataset updated
    Jan 20, 2025
    Dataset authored and provided by
    The Business Research Company
    License

    https://www.thebusinessresearchcompany.com/privacy-policy

    Description

    The global Machine Learning As A Service (MLaaS) market is expected to reach $278.65 billion by 2029, growing at a rate of 36.9%. It is segmented by software tools, data preprocessing tools, machine learning algorithms and frameworks, model training and validation tools.

  17. AI Framework Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Feb 11, 2025
    Cite
    Data Insights Market (2025). AI Framework Report [Dataset]. https://www.datainsightsmarket.com/reports/ai-framework-1402663
    Available download formats: doc, ppt, pdf
    Dataset updated
    Feb 11, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global AI Framework market size was valued at USD XXX million in 2022 and is projected to reach USD XXX million by 2033, exhibiting a CAGR of XX% during the forecast period (2023-2033). The market is driven by the increasing adoption of AI in various industries, such as healthcare, manufacturing, and finance. The growing demand for AI-powered solutions that can automate tasks, improve decision-making, and enhance customer engagement is also contributing to the market growth. The AI framework is an open-source software that provides a set of tools and libraries that developers can use to build AI applications. AI frameworks simplify the development process by providing pre-built components, such as machine learning algorithms, data preprocessing tools, and performance optimization techniques. This reduces the time and effort required to develop and deploy AI solutions. The top companies in the AI framework market include Google, Meta, Apache MXNet, Amazon, Skymind, MindSpore, PaddlePaddle, Baidu, Tencent, Ali, and ByteDance. These companies offer a wide range of AI frameworks that cater to different needs and use cases.

  18. Pathological Analysis Pre-processing Equipment Market Report

    • imrmarketreports.com
    Updated Oct 11, 2024
    Cite
    Swati Kalagate; Akshay Patil; Vishal Kumbhar (2024). Pathological Analysis Pre-processing Equipment Market Report [Dataset]. https://www.imrmarketreports.com/reports/pathological-analysis-pre-processing-equipment-market
    Explore at:
    Dataset updated
    Oct 11, 2024
    Dataset provided by
    IMR Market Reports
    Authors
    Swati Kalagate; Akshay Patil; Vishal Kumbhar
    License

    https://www.imrmarketreports.com/privacy-policy/

    Description

    The Pathological Analysis Pre-processing Equipment report provides extensive industry analysis of development components, patterns, flows, and sizes. It calculates present and past market values to forecast the market over the 2024 - 2032 period.

  19. Data from: COVID-19 and media dataset: Mining textual data according periods...

    • dataverse.cirad.fr
    application/x-gzip +1
    Updated Dec 21, 2020
    Cite
    Mathieu Roche; Mathieu Roche (2020). COVID-19 and media dataset: Mining textual data according periods and countries (UK, Spain, France) [Dataset]. http://doi.org/10.18167/DVN1/ZUA8MF
    Explore at:
    Available download formats: application/x-gzip (511157), application/x-gzip (97349), text/x-perl-script (4982), application/x-gzip (93110), application/x-gzip (23765310), application/x-gzip (107669)
    Dataset updated
    Dec 21, 2020
    Authors
    Mathieu Roche; Mathieu Roche
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    France, Spain, United Kingdom
    Dataset funded by
    ANR (#DigitAg)
    Horizon 2020 - European Commission - (MOOD project)
    Description

    These datasets contain a set of news articles in English, French and Spanish extracted from Medisys (i.e., advanced search) according to the following criteria: (1) Keywords (at least): COVID-19, ncov2019, cov2019, coronavirus; (2) Keywords (all words): masque (French), mask (English), máscara (Spanish); (3) Periods: March 2020, May 2020, July 2020; (4) Countries: UK (English), Spain (Spanish), France (French). A corpus by country has been manually collected (copy/paste) from Medisys. For each country, 100 snippets by period (the 1st, 10th, 15th, 20th for each month) are built. The datasets are composed of: (1) A corpus preprocessed for the BioTex tool - https://gitlab.irstea.fr/jacques.fize/biotex_python (.txt) [~ 900 texts]; (2) The same corpus preprocessed for the Weka tool - https://www.cs.waikato.ac.nz/ml/weka/ (.arff); (3) Terms extracted with BioTex according to spatio-temporal criteria (*.csv) [~ 9000 terms]. Other corpora can be collected with this same method. The Perl code used to preprocess textual data for terminology extraction (with BioTex) and classification (with Weka) tasks is available. A new version of this dataset (December 2020) includes additional data: - Python preprocessing and BioTex code [Execution_BioTex.tgz]. - Terms extracted with different ranking measures (i.e., C-Value, F-TFIDF-C_M) and methods (i.e., extraction of words and multi-word terms) with the online version of BioTex [Terminology_with_BioTex_online_dec2020.tgz].

  20. Employment Of India CLeaned and Messy Data

    • kaggle.com
    Updated Apr 7, 2025
    Cite
    SONIA SHINDE (2025). Employment Of India CLeaned and Messy Data [Dataset]. https://www.kaggle.com/datasets/soniaaaaaaaa/employment-of-india-cleaned-and-messy-data
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SONIA SHINDE
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    India
    Description

    This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.

    🔹 Dataset Composition:

    It includes two parallel datasets:
    1. Messy Dataset (Raw) – Represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
    2. Cleaned Dataset – This version demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.

    Each record captures multiple attributes related to individuals in the Indian job market, including:
    - Age Group
    - Employment Status (Employed/Unemployed)
    - Monthly Salary (INR)
    - Education Level
    - Industry Sector
    - Years of Experience
    - Location
    - Perceived AI Risk
    - Date of Data Recording

    Transformations & Cleaning Applied:

    The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form:
    - Missing Values: Identified and handled using either row elimination (where critical data was missing) or imputation techniques.
    - Duplicate Records: Identified using row comparison and removed to prevent analytical skew.
    - Inconsistent Formatting: Unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
    - Incorrect Data Types: Converted columns like salary from string/object to float for numerical analysis.
    - Outliers: Detected and handled based on domain logic and distribution analysis.
    - Categorization: Converted numeric ages into grouped age categories for comparative analysis.
    - Standardization: Uniform labels for employment status, industry names, education, and AI risk levels were applied for visualization clarity.
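
    The cleaning steps listed above can be sketched in plain Python. This is an illustrative sketch only, not the dataset author's actual pipeline: the raw column names `age` and `employment_status`, the age-group boundaries, and the sample rows are assumptions made up for the example; only the 'monthly_salary_(inr)' → 'Monthly Salary (INR)' rename comes from the description. Outlier handling and imputation are left out because the description does not say how they were done.

```python
# Hypothetical mapping from raw column names to cleaned names.
# Only the salary rename is taken from the dataset description.
COLUMN_NAMES = {
    "age": "Age",
    "employment_status": "Employment Status",
    "monthly_salary_(inr)": "Monthly Salary (INR)",
}

# Uniform labels for employment status.
STATUS_LABELS = {"employed": "Employed", "unemployed": "Unemployed"}


def as_text(value):
    """Return a stripped string, treating None as an empty value."""
    return "" if value is None else str(value).strip()


def age_group(age):
    """Bucket a numeric age into a category (boundaries are assumed)."""
    if age < 25:
        return "18-24"
    if age < 35:
        return "25-34"
    if age < 45:
        return "35-44"
    return "45+"


def clean_rows(raw_rows):
    """Clean a list of dict rows: rename, type-convert, bin, deduplicate."""
    cleaned = []
    seen = set()
    for raw in raw_rows:
        # Inconsistent formatting: unify column names.
        row = {COLUMN_NAMES.get(k.strip().lower(), k): v for k, v in raw.items()}

        age_text = as_text(row.get("Age"))
        salary_text = as_text(row.get("Monthly Salary (INR)")).replace(",", "")
        status = STATUS_LABELS.get(as_text(row.get("Employment Status")).lower())

        # Missing values: drop rows where a critical field is absent.
        if not age_text or not salary_text or status is None:
            continue

        # Incorrect data types: convert strings to numbers.
        try:
            age = int(age_text)
            salary = float(salary_text)
        except ValueError:
            continue

        record = {
            "Age Group": age_group(age),          # categorization
            "Employment Status": status,          # standardization
            "Monthly Salary (INR)": salary,
        }

        # Duplicate records: compare rows after normalization.
        key = tuple(record.items())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Made-up sample rows for illustration.
sample = [
    {"age": "29", "employment_status": " employed ", "monthly_salary_(inr)": "45,000"},
    {"age": "29", "employment_status": "Employed", "monthly_salary_(inr)": "45000"},
    {"age": "41", "employment_status": "UNEMPLOYED", "monthly_salary_(inr)": ""},
    {"age": "52", "employment_status": "unemployed", "monthly_salary_(inr)": "0"},
]
result = clean_rows(sample)
```

    In this sketch the second sample row is dropped as a duplicate of the first once formatting is normalized, and the third is dropped for its missing salary. A real pipeline might impute the missing value instead, as the description allows.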

    Purpose & Utility:

    This dataset is ideal for learners and professionals who want to understand:
    - The impact of messy data on visualization and insights
    - How transformation steps can dramatically improve data interpretation
    - Practical examples of preprocessing techniques before feeding into ML models or BI tools

    It's also useful for:
    - Training ML models with clean inputs
    - Data storytelling with visual clarity
    - Demonstrating reproducibility in data cleaning pipelines

    By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.

Cite
Nirmal Dash (2022). Preprocessing Tool Kit [Dataset]. https://www.kaggle.com/nirmaldash/preprocessing-tool-kit/discussion
Preprocessing Tool Kit

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 16, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Nirmal Dash
Description

Dataset

This dataset was created by Nirmal Dash

Contents
