98 datasets found
  1. Machine Learning Basics for Beginners🤖🧠

    • kaggle.com
    zip
    Updated Jun 22, 2023
    Cite
    Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners
    Explore at:
    Available download formats: zip (492015 bytes)
    Dataset updated
    Jun 22, 2023
    Authors
    Bhanupratap Biswas
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0 (http://www.opendatacommons.org/licenses/pddl/1.0/)
    License information was derived automatically

    Description

    An introduction to machine learning basics for beginners. Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:

    1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

    2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

    3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.

    4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

    5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positive examples that the model correctly identifies), and F1 score (the harmonic mean of precision and recall).

    6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

    7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

    8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

    9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.

    10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.

    These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
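    The evaluation metrics in item 5 can be made concrete with a small sketch. It assumes a binary task; the function name and the example labels (1 = spam, 0 = not spam) are hypothetical and are not part of the dataset.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Precision: share of positive predictions that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

    Precision and recall diverge whenever false positives and false negatives are unequal, which is why both are usually reported alongside accuracy.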

  2. Machine Learning in Chip Design Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 22, 2025
    Cite
    Archive Market Research (2025). Machine Learning in Chip Design Report [Dataset]. https://www.archivemarketresearch.com/reports/machine-learning-in-chip-design-40714
    Explore at:
    Available download formats: pdf, ppt, doc
    Dataset updated
    Feb 22, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Market Size and Growth: The global market for Machine Learning (ML) in Chip Design is projected to reach USD 19.7 billion by 2033, registering a CAGR of 25.2% from 2025 to 2033. This growth is attributed to the increasing demand for faster, more power-efficient chips and the ability of ML to automate and optimize the chip design process. Key drivers include the need to reduce design time and cost, improve performance, and address emerging technologies such as AI and IoT.

    Market Segmentation and Trends: Based on type, supervised learning is expected to dominate the market due to its wide applications in chip design, including design rule checking, yield prediction, and fault diagnosis. Semi-supervised learning is gaining traction as it combines labeled and unlabeled data for training, offering improved accuracy. Unsupervised learning and reinforcement learning are also finding use in chip design, particularly in areas such as automated layout and routing. Major semiconductor and design-automation companies such as Intel, NVIDIA, and Cadence Design Systems are investing heavily in ML technologies to enhance their chip design capabilities. Additionally, the adoption of ML in foundries is growing as they seek to improve yield and efficiency for their customers.

    This comprehensive report provides an in-depth analysis of the Machine Learning in Chip Design market, offering insights into key market dynamics, regional trends, growth drivers, and competitive landscapes. Covering the period from 2023 to 2029, the report forecasts market size and growth to assist businesses in making strategic decisions and capturing untapped opportunities.
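    As a rough check on the headline figures, the compound annual growth rate relation final = base * (1 + CAGR)^years can be inverted to estimate the implied starting market size. The sketch below assumes eight compounding periods (2025 to 2033); the report does not state a 2025 base value, so the result is an inference and not a reported number.

```python
def implied_base_value(final_value, cagr, years):
    """Invert final = base * (1 + cagr) ** years to recover the base."""
    return final_value / (1.0 + cagr) ** years


final_usd_bn = 19.7      # projected 2033 market size (from the description)
cagr = 0.252             # 25.2% per year (from the description)
years = 2033 - 2025      # assumption: eight annual compounding periods

base_usd_bn = implied_base_value(final_usd_bn, cagr, years)
print(f"Implied 2025 market size: about USD {base_usd_bn:.1f} billion")
```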

  3. Unlabeled AnuraSet: A dataset for leveraging unlabeled data in machine...

    • data.niaid.nih.gov
    Updated May 27, 2024
    Cite
    Soundclim Network; Cañas, Juan Sebastián; María Paula, Toro-Gómez; Larissa Sayuri, Moreira Sugai; Toledo, Luis Felipe; Franco Leandro, De Souza; Selvino, Neckel De Oliveira; Rogerio, Pereira Bastos; Diego, Llusia; Juan Sebastián, Ulloa (2024). Unlabeled AnuraSet: A dataset for leveraging unlabeled data in machine learning models for passive acoustic monitoring [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11244813
    Explore at:
    Dataset updated
    May 27, 2024
    Authors
    Soundclim Network; Cañas, Juan Sebastián; María Paula, Toro-Gómez; Larissa Sayuri, Moreira Sugai; Toledo, Luis Felipe; Franco Leandro, De Souza; Selvino, Neckel De Oliveira; Rogerio, Pereira Bastos; Diego, Llusia; Juan Sebastián, Ulloa
    License

    Attribution 1.0 (CC BY 1.0) (https://creativecommons.org/licenses/by/1.0/)
    License information was derived automatically

    Description

    The Unlabeled AnuraSet (U-AnuraSet) is an extension of the original AnuraSet dataset. It consists of soundscape recordings from passive acoustic monitoring conducted in Brazil. The recording sites are identical to those in the original AnuraSet. Each site comprises 2,666 one-minute raw audio files of unlabeled data. The U-AnuraSet is publicly available to encourage machine learning researchers to explore innovative methods for leveraging unlabeled data in the training of models aimed at solving problems such as anuran call identification.

    If you find the Unlabeled AnuraSet useful for your research, please consider citing it as follows:

    Cañas, J.S., Toro-Gómez, M.P., Sugai, L.S.M., et al. A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring. Sci Data 10, 771 (2023). https://doi.org/10.1038/s41597-023-02666-2

  4. Brazilian Legal Proceedings

    • kaggle.com
    zip
    Updated May 14, 2021
    Cite
    Felipe Maia Polo (2021). Brazilian Legal Proceedings [Dataset]. https://www.kaggle.com/felipepolo/brazilian-legal-proceedings
    Explore at:
    Available download formats: zip (124024147 bytes)
    Dataset updated
    May 14, 2021
    Authors
    Felipe Maia Polo
    Description

    The Dataset

    These datasets were used while writing the following work:

    Polo, F. M., Ciochetti, I., and Bertolo, E. (2021). Predicting legal proceedings status: approaches based on sequential text data. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, pages 264–265.
    

    Please cite us if you use our datasets in your academic work:

    @inproceedings{polo2021predicting,
     title={Predicting legal proceedings status: approaches based on sequential text data},
     author={Polo, Felipe Maia and Ciochetti, Itamar and Bertolo, Emerson},
     booktitle={Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law},
     pages={264--265},
     year={2021}
    }
    

    More details below!

    Context

    Every legal proceeding in Brazil has one of three possible statuses: (i) archived, (ii) active, or (iii) suspended. A status is assigned at a specific instant in time and may be temporary or permanent. Statuses are decided by the courts to organize their workflow, which in Brazil may reach thousands of simultaneous cases per judge. Developing machine learning models to classify legal proceedings according to their status can assist public and private institutions in managing large portfolios of legal proceedings, providing gains in scale and efficiency.

    In this dataset, each proceeding is made up of a sequence of short texts called “motions” written in Portuguese by the courts’ administrative staff. The motions relate to the proceedings, but not necessarily to their legal status.

    Content

    Our data is composed of two datasets: a dataset of ~3*10^6 unlabeled motions and a dataset of 6449 legal proceedings, each with a variable number of motions, that have been labeled by lawyers. Among the labeled data, 47.14% is classified as archived (class 1), 45.23% is classified as active (class 2), and 7.63% is classified as suspended (class 3).

    The datasets we use are representative samples from the first (São Paulo) and third (Rio de Janeiro) most significant state courts. State courts handle the most varied types of cases throughout Brazil and are responsible for 80% of all lawsuits. Therefore, these datasets are a good representation of a very significant portion of the use of language and expressions in Brazilian legal vocabulary.

    Regarding the labels dataset, the key "-1" denotes the most recent text while "-2" the second most recent and so on.
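    The class shares above imply a noticeable imbalance, with the suspended class being rare. A minimal sketch of how approximate class counts and inverse-frequency class weights could be derived from those figures follows. It assumes the percentages refer to the 6449 labeled proceedings; the weighting scheme is a common heuristic and is not taken from the authors' work.

```python
total = 6449  # labeled proceedings (from the description)
shares = {"archived": 0.4714, "active": 0.4523, "suspended": 0.0763}

# Approximate counts per class, rounded to whole proceedings.
counts = {name: round(total * share) for name, share in shares.items()}

# Inverse-frequency weights: total / (n_classes * class_count).
n_classes = len(counts)
weights = {name: total / (n_classes * count) for name, count in counts.items()}

for name in counts:
    print(f"{name}: about {counts[name]} proceedings, weight {weights[name]:.2f}")
```

    Weights of this kind can be passed to a classifier's loss so that errors on the rare class count more heavily.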

    Acknowledgements

    We would like to thank Ana Carolina Domingues Borges, Andrews Adriani Angeli, and Nathália Caroline Juarez Delgado from Tikal Tech for helping us to obtain the datasets. This work would not be possible without their efforts.

    Inspiration

    Can you develop good machine learning classifiers for text sequences? :)

  5. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jul 25, 2023
    + more versions
    Cite
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
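    The two protocols can be sketched as follows. The sketch draws label distributions uniformly from the probability simplex and then keeps the smoothest 20%. The smoothness measure used here (sum of squared second differences) is an assumption made for illustration; the exact measure and sampling code are defined in the authors' implementation, not here.

```python
import random


def sample_prevalences(n_classes, rng):
    """Draw one label distribution uniformly from the probability simplex.

    Normalized i.i.d. exponential draws follow a flat Dirichlet distribution.
    """
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]


def roughness(p):
    """Sum of squared second differences; lower means smoother (assumed measure)."""
    return sum((p[i - 1] - 2 * p[i] + p[i + 1]) ** 2
               for i in range(1, len(p) - 1))


rng = random.Random(0)

# APP: every label distribution is equally likely.
app_samples = [sample_prevalences(5, rng) for _ in range(1000)]

# APP-OQ: keep only the smoothest 20% of the APP samples.
keep = len(app_samples) // 5
app_oq_samples = sorted(app_samples, key=roughness)[:keep]

print(len(app_samples), len(app_oq_samples))
```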

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
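    Once a CSV file has been extracted, the class prevalences that quantification aims to estimate can be read off the "class_label" column. The sketch below uses a small in-memory stand-in for an extracted file; the feature column names and values are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for an extracted CSV file; only the header row and
# the "class_label" first column are taken from the description.
csv_text = """class_label,feature_a,feature_b
1,0.2,3.1
2,0.4,2.7
2,0.1,3.3
3,0.9,1.8
"""

reader = csv.DictReader(io.StringIO(csv_text))
labels = [int(row["class_label"]) for row in reader]

# Prevalence of each ordinal class: the quantity a quantifier predicts.
counts = Counter(labels)
prevalences = {label: counts[label] / len(labels) for label in sorted(counts)}
print(prevalences)
```

    For a real file, replace the `io.StringIO` object with an opened file handle.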

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  6. Data_Sheet_1_Building One-Shot Semi-Supervised (BOSS) Learning Up to Fully...

    • frontiersin.figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Leslie N. Smith; Adam Conovaloff (2023). Data_Sheet_1_Building One-Shot Semi-Supervised (BOSS) Learning Up to Fully Supervised Performance.pdf [Dataset]. http://doi.org/10.3389/frai.2022.880729.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    Frontiers
    Authors
    Leslie N. Smith; Adam Conovaloff
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Reaching the performance of fully supervised learning with unlabeled data and only labeling one sample per class might be ideal for deep learning applications. We demonstrate for the first time the potential for building one-shot semi-supervised (BOSS) learning on CIFAR-10 and SVHN to attain test accuracies that are comparable to fully supervised learning. Our method combines class prototype refining, class balancing, and self-training. A good prototype choice is essential and we propose a technique for obtaining iconic examples. In addition, we demonstrate that class balancing methods substantially improve accuracy results in semi-supervised learning to levels that allow self-training to reach the level of fully supervised learning performance. Our experiments demonstrate the value of computing and analyzing test accuracies for every class, rather than only a total test accuracy. We show that our BOSS methodology can obtain total test accuracies with CIFAR-10 images and only one labeled sample per class up to 95% (compared to 94.5% for fully supervised). Similarly, SVHN images obtain test accuracies of 97.8%, compared to 98.27% for fully supervised. Rigorous empirical evaluations provide evidence that labeling large datasets is not necessary for training deep neural networks. Our code is available at https://github.com/lnsmith54/BOSS to facilitate replication.
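    To illustrate how class balancing and self-training can interact, here is a generic sketch of class-balanced pseudo-label selection: for each predicted class, only the most confident unlabeled samples are kept, so that no class dominates the next training round. This is not the BOSS implementation (see the linked repository for that); the function name, data layout and values are hypothetical.

```python
from collections import defaultdict


def select_balanced_pseudo_labels(predictions, per_class):
    """Keep the `per_class` most confident samples for each predicted class.

    predictions: list of (sample_id, predicted_class, confidence) tuples.
    """
    by_class = defaultdict(list)
    for sample_id, label, confidence in predictions:
        by_class[label].append((confidence, sample_id))

    selected = {}
    for label, items in by_class.items():
        items.sort(reverse=True)  # highest confidence first
        selected[label] = [sample_id for _, sample_id in items[:per_class]]
    return selected


# Hypothetical model outputs on unlabeled data.
predictions = [
    ("a", 0, 0.99), ("b", 0, 0.95), ("c", 0, 0.90),
    ("d", 1, 0.70), ("e", 1, 0.60),
]
selected = select_balanced_pseudo_labels(predictions, per_class=2)
print(selected)
```

    A plain confidence threshold would have selected three samples of class 0 and possibly none of class 1; the per-class cap keeps the pseudo-labeled set balanced.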

  7. Dataset for Fetal Ultrasound Grand Challenge: Semi-Supervised Cervical...

    • zenodo.org
    png
    Updated Dec 8, 2024
    Cite
    Jieyun Bai; Ziduo Yang; Jie Gan; Hasan Md. Kamrul; Zhuonan Liang; Weidong Cai; Tan Tao; Ye Jing; Yaqub Mohammad; Ni Dong; Slimani Saad; Ohene-Botwe Benard; Víctor Manuel Campello; Karim Lekadir (2024). Dataset for Fetal Ultrasound Grand Challenge: Semi-Supervised Cervical Segmentation (ISBI 2025) [Dataset]. http://doi.org/10.5281/zenodo.14305302
    Explore at:
    Available download formats: png
    Dataset updated
    Dec 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jieyun Bai; Ziduo Yang; Jie Gan; Hasan Md. Kamrul; Zhuonan Liang; Weidong Cai; Tan Tao; Ye Jing; Yaqub Mohammad; Ni Dong; Slimani Saad; Ohene-Botwe Benard; Víctor Manuel Campello; Karim Lekadir
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    Dec 6, 2024
    Description

    Transvaginal ultrasound is the preferred method for visualizing the cervix in most patients, offering detailed insight into cervical anatomy and structure. Accurate segmentation of ultrasound (US) images of the cervical muscles is essential for analyzing deep muscle structures, assessing their function, and monitoring treatment protocols tailored to individual patients.

    The manual annotation of cervical structures in transvaginal ultrasound images is labor-intensive and time-consuming, limiting the availability of large labeled datasets required for robust machine learning models. In response to this challenge, semi-supervised learning approaches have shown potential by leveraging both labeled and unlabeled data, enabling the extraction of useful information from unannotated cases. This method could reduce the need for extensive manual annotation while maintaining accuracy, thus accelerating the development of automated cervical image segmentation systems. The envisioned impact of this challenge is twofold: improving clinical decision-making through more accessible and accurate diagnostic tools and advancing machine learning techniques for medical image analysis, particularly in resource-constrained environments.

    We extend the MICCAI PSFHS 2023 Challenge and the MICCAI IUGC 2024 Challenge from fully supervised settings to a semi-supervised setting that focuses on how to use unlabeled data.

    Training/Validation/Test=500/90/300

    The dataset can be accessed after signing the data-sharing agreement and sending it to the organizer (fugc.isbi25@gmail.com).

  8. Data used in Machine learning reveals the waggle drift's role in the honey...

    • data-staging.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated May 18, 2023
    + more versions
    Cite
    Dormagen, David M; Wild, Benjamin; Wario, Fernando; Landgraf, Tim (2023). Data used in Machine learning reveals the waggle drift's role in the honey bee dance communication system [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7928120
    Explore at:
    Dataset updated
    May 18, 2023
    Dataset provided by
    Freie Universität Berlin
    Universidad de Guadalajara
    Authors
    Dormagen, David M; Wild, Benjamin; Wario, Fernando; Landgraf, Tim
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Data and metadata used in "Machine learning reveals the waggle drift’s role in the honey bee dance communication system"

    All timestamps are given in ISO 8601 format.

    The following files are included:

    Berlin2019_waggle_phases.csv, Berlin2021_waggle_phases.csv

    Automatic individual detections of waggle phases during our recording periods in 2019 and 2021.

    timestamp: Date and time of the detection.

    cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

    x_median, y_median: Median position of the bee during the waggle phase (for 2019 given in millimeters after applying a homography, for 2021 in the original image coordinates).

    waggle_angle: Body orientation of the bee during the waggle phase in radians (0: oriented to the right, PI / 4: oriented upwards).

    Berlin2019_dances.csv

    Automatic detections of dance behavior during our recording period in 2019.

    dancer_id: Unique ID of the individual bee.

    dance_id: Unique ID of the dance.

    ts_from, ts_to: Date and time of the beginning and end of the dance.

    cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

    median_x, median_y: Median position of the individual during the dance.

    feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance.

    Berlin2019_followers.csv

    Automatic detections of attendance and following behavior, corresponding to the dances in Berlin2019_dances.csv.

    dance_id: Unique ID of the dance being attended or followed.

    follower_id: Unique ID of the individual attending or following the dance.

    ts_from, ts_to: Date and time of the beginning and end of the interaction.

    label: “attendance” or “follower”

    cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
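    Because Berlin2019_followers.csv refers to dances by dance_id, the two files can be joined on that column. The sketch below counts followers per dance using small in-memory stand-ins for the files; the rows, the column order and the subset of columns shown are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpts; real files contain more columns and rows.
dances_csv = """dance_id,dancer_id,cam_id
10,501,0
11,502,1
"""
followers_csv = """dance_id,follower_id,label,cam_id
10,601,follower,0
10,602,attendance,0
11,603,follower,1
"""

# Index dances by their ID so follower rows can be matched against them.
dances = {row["dance_id"]: row
          for row in csv.DictReader(io.StringIO(dances_csv))}

followers_by_dance = defaultdict(list)
for row in csv.DictReader(io.StringIO(followers_csv)):
    # Keep only rows labeled "follower" that belong to a known dance.
    if row["dance_id"] in dances and row["label"] == "follower":
        followers_by_dance[row["dance_id"]].append(row["follower_id"])

follower_counts = {dance_id: len(followers_by_dance[dance_id])
                   for dance_id in dances}
print(follower_counts)
```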

    Berlin2019_dances_with_manually_verified_times.csv

    A sample of dances from Berlin2019_dances.csv where the exact timestamps have been manually verified to correspond to the beginning of the first and last waggle phase down to a precision of ca. 166 ms (video material was recorded at 6 FPS).

    dance_id: Unique ID of the dance.

    dancer_id: Unique ID of the dancing individual.

    cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

    feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance.

    dance_start, dance_end: Manually verified date and times of the beginning and end of the dance.
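    The stated precision follows from the frame rate: at 6 FPS, consecutive frames are 1000 / 6 ms apart. The sketch below shows that arithmetic and how a dance duration could be computed from two ISO 8601 timestamps. The timestamp values are hypothetical, and the exact ISO 8601 variant used in the files is an assumption.

```python
from datetime import datetime

FPS = 6
frame_interval_ms = 1000 / FPS  # about 166.7 ms between frames

# Hypothetical dance_start / dance_end values in ISO 8601 format.
dance_start = datetime.fromisoformat("2019-08-01T10:15:30.500000+00:00")
dance_end = datetime.fromisoformat("2019-08-01T10:15:42.000000+00:00")

duration_s = (dance_end - dance_start).total_seconds()
duration_frames = round(duration_s * FPS)

print(f"frame interval: {frame_interval_ms:.1f} ms")
print(f"dance duration: {duration_s} s, about {duration_frames} frames")
```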

    Berlin2019_dance_classifier_labels.csv

    Manually annotated waggle phases or following behavior for our recording season in 2019 that was used to train the dancing and following classifier. Can be merged with the supplied individual detections.

    timestamp: Timestamp of the individual frame the behavior was observed in.

    frame_id: Unique ID of the video frame the behavior was observed in.

    bee_id: Unique ID of the individual bee.

    label: One of “nothing”, “waggle”, “follower”

    Berlin2019_dance_classifier_unlabeled.csv

    Additional unlabeled samples of timestamp and individual ID with the same format as Berlin2019_dance_classifier_labels.csv, but without a label. The data points have been sampled close to detections of our waggle phase classifier, so behaviors related to the waggle dance are likely overrepresented in that sample.

    Berlin2021_waggle_phase_classifier_labels.csv

    Manually annotated detections of our waggle phase detector (bb_wdd2) that were used to train the neural network filter (bb_wdd_filter) for the 2021 data.

    detection_id: Unique ID of the waggle phase.

    label: One of “waggle”, “activating”, “ventilating”, “trembling”, “other”. Here, “waggle” denotes a waggle phase, “activating” is the shaking signal, and “ventilating” is a bee fanning her wings. “trembling” denotes a tremble dance, but the distinction from the “other” class was often not clear, so “trembling” was merged into “other” for training.

    orientation: The body orientation of the bee that triggered the detection in radians (0: facing to the right, PI /4: facing up).

    metadata_path: Path to the individual detection in the same directory structure as created by the waggle dance detector.

    Berlin2021_waggle_phase_classifier_ground_truth.zip

    The output of the waggle dance detector (bb_wdd2) that corresponds to Berlin2021_waggle_phase_classifier_labels.csv and is used for training. The archive includes a directory structure as output by the bb_wdd2 and each directory includes the original image sequence that triggered the detection in an archive and the corresponding metadata. The training code supplied in bb_wdd_filter directly works with this directory structure.

    Berlin2019_tracks.zip

    Detections and tracks from the recording season in 2019 as produced by our tracking system. As the full data is several terabytes in size, we include the subset of our data here that is relevant for our publication which comprises over 46 million detections. We included tracks for all detected behaviors (dancing, following, attending) including one minute before and after the behavior. We also included all tracks that correspond to the labeled and unlabeled data that was used to train the dance classifier including 30 seconds before and after the data used for training. We grouped the exported data by date to make the handling easier, but to efficiently work with the data, we recommend importing it into an indexable database.

    The individual files contain the following columns:

    cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

    timestamp: Date and time of the detection.

    frame_id: Unique ID of the video frame of the recording from which the detection was extracted.

    track_id: Unique ID of an individual track (short motion path from one individual). For longer tracks, the detections can be linked based on the bee_id.

    bee_id: Unique ID of the individual bee.

    bee_id_confidence: Confidence between 0 and 1 that the bee_id is correct as output by our tracking system.

    x_pos_hive, y_pos_hive: Spatial position of the bee in the hive on the side indicated by cam_id. Given in millimeters after applying a homography on the video material.

    orientation_hive: Orientation of the bees’ thorax in the hive in radians (0: oriented to the right, PI / 4: oriented upwards).
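    Following the recommendation to import the tracks into an indexable database, here is a minimal sketch using SQLite from the Python standard library. The table name, column types, index choice and example rows are assumptions; only the column names come from the description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real data
conn.execute("""
    CREATE TABLE detections (
        cam_id INTEGER, timestamp TEXT, frame_id INTEGER, track_id INTEGER,
        bee_id INTEGER, bee_id_confidence REAL,
        x_pos_hive REAL, y_pos_hive REAL, orientation_hive REAL
    )
""")

# Hypothetical rows standing in for one exported file.
rows = [
    (0, "2019-08-01T10:00:00.000000+00:00", 1, 100, 42, 0.97, 10.0, 20.0, 0.1),
    (0, "2019-08-01T10:00:00.166667+00:00", 2, 100, 42, 0.95, 10.5, 20.2, 0.2),
    (1, "2019-08-01T10:00:00.000000+00:00", 3, 101, 7, 0.40, 55.0, 80.0, 1.3),
]
conn.executemany("INSERT INTO detections VALUES (?,?,?,?,?,?,?,?,?)", rows)

# An index on (bee_id, timestamp) makes per-bee trajectory queries efficient.
conn.execute("CREATE INDEX idx_bee_time ON detections (bee_id, timestamp)")

trajectory = conn.execute(
    "SELECT timestamp, x_pos_hive, y_pos_hive FROM detections "
    "WHERE bee_id = ? AND bee_id_confidence >= ? ORDER BY timestamp",
    (42, 0.5),
).fetchall()
print(trajectory)
```

    Linking detections by bee_id in this way is how longer tracks can be assembled from the short track_id segments.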

    Berlin2019_feeder_experiment_log.csv

    Experiment log for our feeder experiments in 2019.

    date: Date given in the format year-month-day.

    feeder_cam_id: Numeric ID of the feeder.

    coordinates: Longitude and latitude of the feeder. For feeders 1 and 2 this is only given once and held constant. Feeder 3 had varying locations.

    time_opened, time_closed: Date and time when the feeder was set up or closed again.

    sucrose_solution: Concentration of the sucrose solution given as sugar:water (in terms of weight). On days where feeder 3 was open, the other two feeders offered water without sugar.

    Software used to acquire and analyze the data:

    bb_pipeline: Tag localization and decoding pipeline

    bb_pipeline_models: Pretrained localizer and decoder models for bb_pipeline

    bb_binary: Raw detection data storage format

    bb_irflash: IR flash system schematics and arduino code

    bb_imgacquisition: Recording and network storage

    bb_behavior: Database interaction and data (pre)processing, feature extraction

    bb_tracking: Tracking of bee detections over time

    bb_wdd2: Automatic detection and decoding of honey bee waggle dances

    bb_wdd_filter: Machine learning model to improve the accuracy of the waggle dance detector

    bb_dance_networks: Detection of dancing and following behavior from trajectories

  9. Data from: Exploring deep learning techniques for wild animal behaviour...

    • data.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Feb 22, 2024
    Cite
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 22, 2024
    Dataset provided by
    Osaka University
    Nagoya University
    Authors
    Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. 
    Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.

    This abstract is quoted from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for details of the datasets.
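
    The random augmentation step described in the abstract (one technique chosen at random for each window during mini-batch training) could be sketched roughly as follows. This is an illustration only, not the authors' code: the function names, the sigma values, the segment count and the restriction of rotation to the z axis are all assumptions, and time-warping is left out for brevity.

```python
import math
import random

# A window is a list of (x, y, z) acceleration samples.
# All parameter defaults below are illustrative assumptions.

def identity(window, rng):
    """The 'none' option: return the window unchanged."""
    return list(window)

def scale(window, rng, sigma=0.1):
    """Multiply each axis by a random factor drawn around 1.0."""
    factors = [rng.gauss(1.0, sigma) for _ in range(3)]
    return [tuple(v * f for v, f in zip(s, factors)) for s in window]

def jitter(window, rng, sigma=0.05):
    """Add independent Gaussian noise to every value."""
    return [tuple(v + rng.gauss(0.0, sigma) for v in s) for s in window]

def permute(window, rng, n_segments=4):
    """Split the window into segments and shuffle their order."""
    n = len(window)
    bounds = [round(i * n / n_segments) for i in range(n_segments + 1)]
    segments = [window[bounds[i]:bounds[i + 1]] for i in range(n_segments)]
    rng.shuffle(segments)
    return [s for seg in segments for s in seg]

def rotate_z(window, rng):
    """Rotate about the z axis by a random angle (a simplified rotation)."""
    theta = rng.uniform(-math.pi, math.pi)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in window]

AUGMENTATIONS = [identity, scale, jitter, permute, rotate_z]

def augment_batch(batch, rng):
    """Apply one randomly chosen augmentation to each window in the batch."""
    return [rng.choice(AUGMENTATIONS)(w, rng) for w in batch]
```

    Passing an explicit `random.Random` instance keeps the augmentation reproducible when a seed is fixed.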

  10. S1 Appendix -

    • plos.figshare.com
    zip
    Updated Sep 29, 2023
    Cite
    Karina Shyrokykh; Max Girnyk; Lisa Dellmuth (2023). S1 Appendix - [Dataset]. http://doi.org/10.1371/journal.pone.0290762.s001
    Available download formats: zip
    Dataset updated
    Sep 29, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Karina Shyrokykh; Max Girnyk; Lisa Dellmuth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To analyse large numbers of texts, social science researchers are increasingly confronting the challenge of text classification. When manual labeling is not possible and researchers have to find automatized ways to classify texts, computer science provides a useful toolbox of machine-learning methods whose performance remains understudied in the social sciences. In this article, we compare the performance of the most widely used text classifiers by applying them to a typical research scenario in social science research: a relatively small labeled dataset with infrequent occurrence of categories of interest, which is a part of a large unlabeled dataset. As an example case, we look at Twitter communication regarding climate change, a topic of increasing scholarly interest in interdisciplinary social science research. Using a novel dataset including 5,750 tweets from various international organizations regarding the highly ambiguous concept of climate change, we evaluate the performance of methods in automatically classifying tweets based on whether they are about climate change or not. In this context, we highlight two main findings. First, supervised machine-learning methods perform better than state-of-the-art lexicons, in particular as class balance increases. Second, traditional machine-learning methods, such as logistic regression and random forest, perform similarly to sophisticated deep-learning methods, whilst requiring much less training time and computational resources. The results have important implications for the analysis of short texts in social science research.
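
    To make the comparison setting concrete, here is a minimal sketch of a keyword-lexicon baseline scored with precision, recall and F1 on the positive (rare) class. The keyword list, the example texts and the labels are invented for illustration and are not taken from the study's dataset or lexicons.

```python
def lexicon_classify(text, keywords):
    """Label a text 1 if it contains any lexicon keyword, else 0."""
    lowered = text.lower()
    return int(any(k in lowered for k in keywords))

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical lexicon and toy data (assumptions for this sketch).
KEYWORDS = ("climate change", "global warming", "emissions")
TEXTS = [
    "New report on climate change adaptation",
    "Join our webinar on trade policy",
    "Cutting emissions in the transport sector",
    "Rising seas threaten coastal communities",  # relevant, but no keyword
    "Annual budget meeting today",
]
LABELS = [1, 0, 1, 1, 0]

predictions = [lexicon_classify(t, KEYWORDS) for t in TEXTS]
precision, recall, f1 = precision_recall_f1(LABELS, predictions)
```

    The fourth text shows the typical weakness of a lexicon: a relevant tweet that avoids the listed keywords is missed, which lowers recall while precision stays high.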

  11. network-anomaly-dataset

    • kaggle.com
    zip
    Updated Sep 5, 2024
    Cite
    Alberto del Rio (2024). network-anomaly-dataset [Dataset]. https://www.kaggle.com/datasets/kaiser14/network-anomaly-dataset
    Available download formats: zip (29839 bytes)
    Dataset updated
    Sep 5, 2024
    Authors
    Alberto del Rio
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset, titled "Network Anomaly Dataset," is designed for the development and evaluation of machine learning models focused on network anomaly detection. The dataset is available in two versions: a labeled version where each instance is marked as "Anomaly" or "Normal," and an unlabeled version that can be used for unsupervised learning techniques.

    Dataset Features:
    - Throughput: The amount of data successfully transmitted over a network in a given period.
    - Congestion: The degree of network traffic load, potentially leading to delays or packet loss.
    - Packet Loss: The percentage of packets that fail to reach their destination, indicative of network issues.
    - Latency: The time taken for data to travel from the source to the destination, crucial for time-sensitive applications.
    - Jitter: The variation in packet arrival times, affecting the quality of real-time communications.

    Applications:
    - Supervised Learning: Use the labeled dataset to train and evaluate models such as Random Forest, SVM, and Logistic Regression for anomaly detection.
    - Unsupervised Learning: Apply techniques like clustering and change point detection on the unlabeled dataset to discover hidden patterns and anomalies.

    This dataset is ideal for practitioners and researchers aiming to explore network security, develop robust anomaly detection models, or conduct comparative analysis between supervised and unsupervised learning methods.
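
    As a starting point for the unlabeled version, a very simple baseline is to flag values that lie far from the mean of a single feature. The sketch below uses z-score thresholding, which is a simpler stand-in for the clustering and change point detection mentioned above. The latency values and the threshold are made up for illustration; the dataset's actual column names and value ranges are not assumed here.

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds threshold std devs."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values identical: nothing stands out.
        return [False] * len(values)
    return [abs(v - mean) / stdev > threshold for v in values]

# Hypothetical latency readings in milliseconds (illustrative only).
latency_ms = [20, 21, 19, 22, 20, 21, 250, 20]
flags = zscore_anomalies(latency_ms, threshold=2.0)
```

    On short series a single extreme value inflates the standard deviation, so the largest reachable z-score is limited; that is why this toy example uses a threshold of 2.0 instead of the conventional 3.0.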

  12. AI in Semi-supervised Learning Market Research Report 2033

    • researchintelo.com
    csv, pdf, pptx
    Updated Jul 24, 2025
    Cite
    Research Intelo (2025). AI in Semi-supervised Learning Market Research Report 2033 [Dataset]. https://researchintelo.com/report/ai-in-semi-supervised-learning-market
    Available download formats: pdf, csv, pptx
    Dataset updated
    Jul 24, 2025
    Dataset authored and provided by
    Research Intelo
    License

    https://researchintelo.com/privacy-and-policy

    Time period covered
    2024 - 2033
    Area covered
    Global
    Description

    AI in Semi-supervised Learning Market Outlook



    According to our latest research, the AI in Semi-supervised Learning market reached USD 1.82 billion globally in 2024, driven by rapid advancements in artificial intelligence and machine learning applications across diverse industries. The market is expected to expand at a CAGR of 28.1% from 2025 to 2033, reaching a projected value of USD 17.17 billion by 2033. This growth is primarily fueled by the increasing need for efficient data labeling, the proliferation of unstructured data, and the growing adoption of AI-driven solutions in both large enterprises and small and medium businesses. The surging demand for automation, accuracy, and cost-efficiency in data processing is also accelerating the adoption of semi-supervised learning models worldwide.
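
    The relationship between a base-year value, a compound annual growth rate (CAGR) and a projected value can be checked with a few lines of arithmetic. The sketch below assumes nine annual compounding steps from the 2024 base to 2033; the report does not state its exact convention, so the step count is an assumption.

```python
def project(start_value, cagr, periods):
    """Compound start_value forward by `periods` steps at rate `cagr`."""
    return start_value * (1.0 + cagr) ** periods

def implied_cagr(start_value, end_value, periods):
    """Growth rate that turns start_value into end_value over `periods` steps."""
    return (end_value / start_value) ** (1.0 / periods) - 1.0

# Figures quoted in the description (USD billion).
start, end, rate = 1.82, 17.17, 0.281
periods = 2033 - 2024  # assumption: nine annual compounding steps

projected = project(start, rate, periods)
implied = implied_cagr(start, end, periods)
```

    Comparing `projected` with the stated end value, or `implied` with the stated rate, is a quick way to see whether the quoted figures are mutually consistent under the chosen convention.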



    One of the most significant growth factors for the AI in Semi-supervised Learning market is the explosive increase in data generation across industries such as healthcare, finance, retail, and automotive. Organizations are continually collecting vast amounts of structured and unstructured data, but the process of labeling this data for supervised learning remains time-consuming and expensive. Semi-supervised learning offers a compelling solution by leveraging small amounts of labeled data alongside large volumes of unlabeled data, thus reducing the dependency on extensive manual annotation. This approach not only accelerates the deployment of AI models but also enhances their accuracy and scalability, making it highly attractive for enterprises seeking to maximize the value of their data assets while minimizing operational costs.
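
    The idea of combining a small labeled set with a larger unlabeled pool can be illustrated with self-training (pseudo-labeling). The sketch below is one common semi-supervised recipe, not a method attributed to the report: it uses a nearest-centroid classifier, and the margin threshold and round limit are arbitrary assumptions.

```python
import math

def centroids(points, labels):
    """Mean point per class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(cents, p):
    """Return (label, margin): nearest class and its lead over the runner-up."""
    dists = sorted((math.dist(p, c), y) for y, c in cents.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin

def self_train(labeled, labels, unlabeled, min_margin=1.0, rounds=5):
    """Repeatedly adopt confident predictions on unlabeled points as labels."""
    points, ys, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(rounds):
        cents = centroids(points, ys)
        confident, rest = [], []
        for p in pool:
            y, margin = predict(cents, p)
            (confident if margin >= min_margin else rest).append((p, y))
        if not confident:
            break  # nothing new to learn from
        for p, y in confident:
            points.append(p)
            ys.append(y)
        pool = [p for p, _ in rest]
    return centroids(points, ys), pool
```

    Points that never clear the margin stay in the returned pool, so ambiguous examples are left unlabeled instead of being forced into a class.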



    Another critical driver propelling the growth of the AI in Semi-supervised Learning market is the increasing sophistication of AI algorithms and the integration of advanced technologies such as deep learning, natural language processing, and computer vision. These advancements have enabled semi-supervised learning models to achieve remarkable performance in complex tasks like image and speech recognition, medical diagnostics, and fraud detection. The ability to process and interpret vast datasets with minimal supervision is particularly valuable in sectors where labeled data is scarce or expensive to obtain. Furthermore, the ongoing investments in research and development by leading technology companies and academic institutions are fostering innovation, resulting in more robust and scalable semi-supervised learning frameworks that can be seamlessly integrated into enterprise workflows.



    The proliferation of cloud computing and the increasing adoption of hybrid and multi-cloud environments are also contributing significantly to the expansion of the AI in Semi-supervised Learning market. Cloud-based deployment offers unparalleled scalability, flexibility, and cost-efficiency, allowing organizations of all sizes to access cutting-edge AI tools and infrastructure without the need for substantial upfront investments. This democratization of AI technology is empowering small and medium enterprises to leverage semi-supervised learning for competitive advantage, driving widespread adoption across regions and industries. Additionally, the emergence of AI-as-a-Service (AIaaS) platforms is further simplifying the integration and management of semi-supervised learning models, enabling businesses to accelerate their digital transformation initiatives and unlock new growth opportunities.



    From a regional perspective, North America currently dominates the AI in Semi-supervised Learning market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The strong presence of leading AI vendors, robust technological infrastructure, and high investments in AI research and development are key factors driving market growth in these regions. Asia Pacific is expected to witness the fastest CAGR during the forecast period, fueled by rapid digitalization, expanding IT infrastructure, and increasing government initiatives to promote AI adoption. Meanwhile, Latin America and the Middle East & Africa are also showing promising growth potential, supported by rising awareness of AI benefits and growing investments in digital transformation projects across various sectors.



    Component Analysis



    The component segment of the AI in Semi-supervised Learning market is divided into software, hardware, and services, each playing a pivotal role in the adoption and implementation of semi-s

  13. Dataset: Data-Driven Machine Learning-Informed Framework for Model...

    • zenodo.org
    csv
    Updated May 12, 2025
    Cite
    Edgar Amalyan (2025). Dataset: Data-Driven Machine Learning-Informed Framework for Model Predictive Control in Vehicles [Dataset]. http://doi.org/10.5281/zenodo.15288740
    Available download formats: csv
    Dataset updated
    May 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Edgar Amalyan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset belonging to the paper: Data-Driven Machine Learning-Informed Framework for Model Predictive Control in Vehicles

    labeled_seed.csv: Processed and labeled data of all maneuvers combined into a single file, sorted by label

    raw_track_session.csv: Untouched CSV file from Racebox track session

    unlabeled_exemplar.csv: Processed but unlabeled data of street and track data

  14. Self-Supervised Learning Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Sep 1, 2025
    Cite
    Growth Market Reports (2025). Self-Supervised Learning Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/self-supervised-learning-market
    Available download formats: pdf, pptx, csv
    Dataset updated
    Sep 1, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Self-Supervised Learning Market Outlook



    According to our latest research, the global self-supervised learning market size reached USD 10.2 billion in 2024, demonstrating rapid adoption across multiple sectors. The market is set to expand at a strong CAGR of 33.1% from 2025 to 2033, propelled by the growing need for advanced artificial intelligence solutions that minimize dependency on labeled data. By 2033, the market is forecasted to achieve an impressive size of USD 117.2 billion, underscoring the transformative potential of self-supervised learning in revolutionizing data-driven decision-making and automation across industries. This growth trajectory is supported by increasing investments in AI research, the proliferation of big data, and the urgent demand for scalable machine learning models.




    The primary growth driver for the self-supervised learning market is the exponential surge in data generation across industries and the corresponding need for efficient data labeling techniques. Traditional supervised learning requires vast amounts of labeled data, which is both time-consuming and expensive to annotate. Self-supervised learning, by contrast, leverages unlabeled data to train models, significantly reducing operational costs and accelerating the deployment of AI systems. This paradigm shift is particularly critical in sectors like healthcare, finance, and autonomous vehicles, where large datasets are abundant but labeled examples are scarce. As organizations seek to unlock value from their data assets, self-supervised learning is emerging as a cornerstone technology, enabling more robust, scalable, and generalizable AI applications.




    Another significant factor fueling market expansion is the rapid advancement in computing infrastructure and algorithmic innovation. The availability of high-performance hardware, such as GPUs and TPUs, coupled with breakthroughs in neural network architectures, has made it feasible to train complex self-supervised models on massive datasets. Additionally, the open-source movement and collaborative research have democratized access to state-of-the-art self-supervised learning frameworks, fostering innovation and lowering barriers to entry for enterprises of all sizes. These technological advancements are empowering organizations to experiment with self-supervised learning at scale, driving adoption across a wide range of applications, from natural language processing to computer vision and robotics.




    The market is also benefiting from the growing emphasis on ethical AI and data privacy. Self-supervised learning methods, which minimize the need for sensitive labeled data, are increasingly being adopted to address privacy concerns and regulatory compliance requirements. This is particularly relevant in regions with stringent data protection regulations, such as the European Union. Furthermore, the ability of self-supervised learning to generalize across domains and tasks is enabling businesses to build more resilient and adaptable AI systems, further accelerating market growth. The convergence of these factors is positioning self-supervised learning as a key enabler of next-generation AI solutions.



    Transfer Learning is emerging as a pivotal technique in the realm of self-supervised learning, offering a bridge between different domains and tasks. By leveraging knowledge from pre-trained models, transfer learning allows for the adaptation of AI systems to new, related tasks with minimal additional data. This approach is particularly beneficial in scenarios where labeled data is scarce, enabling models to generalize better and learn more efficiently. The integration of transfer learning into self-supervised frameworks is enhancing the ability of AI systems to tackle complex problems across various industries, from healthcare diagnostics to autonomous driving. As the demand for versatile and efficient AI solutions grows, transfer learning is set to play a crucial role in the evolution of self-supervised learning technologies.




    From a regional perspective, North America currently leads the self-supervised learning market, accounting for the largest share due to its robust AI research ecosystem, significant investments from technology giants, and early adoption across verticals. However, Asia Pacific is projected to witness the fastest growth over the forecast period, driven by the rapid digital tran

  15. Comprehensive Dataset for Event Classification Using Distributed Acoustic...

    • springernature.figshare.com
    bin
    Updated May 15, 2025
    Cite
    Adrian Tomasov; Pavel Zaviska; Petr Dejdar; Ondrej Klicnik; Tomas Horvath; Petr Munster (2025). Comprehensive Dataset for Event Classification Using Distributed Acoustic Sensing (DAS) Systems [Dataset]. http://doi.org/10.6084/m9.figshare.27004732.v1
    Available download formats: bin
    Dataset updated
    May 15, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Adrian Tomasov; Pavel Zaviska; Petr Dejdar; Ondrej Klicnik; Tomas Horvath; Petr Munster
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was collected using a Distributed Acoustic Sensing (DAS) system with phase-sensitive Optical Time-Domain Reflectometry (Φ-OTDR) technology. It includes labeled and unlabeled acoustic signal measurements gathered around a university campus, covering activities such as walking, running, vehicular movement, and potential security threats like fiber manipulation and fence climbing. The data was captured using an Optasense ODH-F DAS interrogator, which monitors signals from a buried single-mode fiber optic cable. The dataset, stored in HDF5 format, serves as a critical resource for training machine learning models aimed at event classification in DAS systems. Each event is identified by power spectral density (PSD) representations and labeled accordingly. This dataset is ideal for researchers developing and validating machine learning algorithms for DAS-based applications, including structural health monitoring and perimeter security.
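
    For readers unfamiliar with power spectral density, the sketch below computes a basic one-sided periodogram of a real-valued signal with a direct discrete Fourier transform. How the dataset's PSD representations were actually produced (windowing, averaging, resolution) is not stated here, so this is only a generic illustration; the direct DFT is quadratic in signal length and suits short signals only.

```python
import cmath
import math

def periodogram(signal, fs):
    """One-sided periodogram PSD estimate of a real signal sampled at fs Hz."""
    n = len(signal)
    freqs, psd = [], []
    for k in range(n // 2 + 1):
        # Direct DFT coefficient for frequency bin k.
        xk = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                 for i, x in enumerate(signal))
        power = abs(xk) ** 2 / (fs * n)
        # Double interior bins to fold in the negative-frequency half.
        if 0 < k < n / 2:
            power *= 2
        freqs.append(k * fs / n)
        psd.append(power)
    return freqs, psd

# Synthetic example: an 8 Hz sine sampled at 64 Hz for one second.
fs = 64
tone = [math.sin(2 * math.pi * 8 * i / fs) for i in range(fs)]
freqs, psd = periodogram(tone, fs)
peak_frequency = freqs[psd.index(max(psd))]
```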

  16. AI in Unsupervised Learning Market Research Report 2033

    • researchintelo.com
    csv, pdf, pptx
    Updated Jul 24, 2025
    Cite
    Research Intelo (2025). AI in Unsupervised Learning Market Research Report 2033 [Dataset]. https://researchintelo.com/report/ai-in-unsupervised-learning-market
    Available download formats: csv, pptx, pdf
    Dataset updated
    Jul 24, 2025
    Dataset authored and provided by
    Research Intelo
    License

    https://researchintelo.com/privacy-and-policy

    Time period covered
    2024 - 2033
    Area covered
    Global
    Description

    AI in Unsupervised Learning Market Outlook



    According to our latest research, the AI in Unsupervised Learning market size reached USD 3.8 billion globally in 2024, demonstrating robust expansion as organizations increasingly leverage unsupervised techniques for extracting actionable insights from unlabelled data. The market is forecasted to grow at a CAGR of 28.2% from 2025 to 2033, propelling the industry to an estimated USD 36.7 billion by 2033. This remarkable growth trajectory is primarily fueled by the escalating adoption of artificial intelligence across diverse sectors, an exponential surge in data generation, and the pressing need for advanced analytics that can operate without manual data labeling.



    One of the key growth factors driving the AI in Unsupervised Learning market is the rising complexity and volume of data generated by enterprises in the digital era. Organizations are inundated with unstructured and unlabelled data from sources such as social media, IoT devices, and transactional systems. Traditional supervised learning methods are often impractical due to the time and cost associated with manual labeling. Unsupervised learning algorithms, such as clustering and dimensionality reduction, offer a scalable solution by autonomously identifying patterns, anomalies, and hidden structures within vast datasets. This capability is increasingly vital for industries aiming to enhance decision-making, streamline operations, and gain a competitive edge through advanced analytics.
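
    Clustering, one of the algorithm families named above, groups unlabeled points by similarity alone. A minimal k-means sketch follows; the toy points, the value of k and the iteration cap are illustrative assumptions, and production code would normally rely on an established library.

```python
import math
import random

def kmeans(points, k, rng, iterations=20):
    """Basic k-means: returns (centers, assignment) for a list of tuples."""
    centers = rng.sample(points, k)  # initialise from k distinct data points
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, assignment

# Two well-separated toy groups with no labels attached.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, k=2, rng=random.Random(0))
```

    No labels are supplied at any point; the grouping emerges purely from distances between points.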



    Another significant driver is the rapid advancement in computational power and AI infrastructure, which has made it feasible to implement sophisticated unsupervised learning models at scale. The proliferation of cloud computing and specialized AI hardware has reduced barriers to entry, enabling even small and medium enterprises to deploy unsupervised learning solutions. Additionally, the evolution of neural networks and deep learning architectures has expanded the scope of unsupervised algorithms, allowing for more complex tasks such as image recognition, natural language processing, and anomaly detection. These technological advancements are not only accelerating adoption but also fostering innovation across sectors including healthcare, finance, manufacturing, and retail.



    Furthermore, regulatory compliance and the growing emphasis on data privacy are pushing organizations to adopt unsupervised learning methods. Unlike supervised approaches that require sensitive data labeling, unsupervised algorithms can process data without explicit human intervention, thereby reducing the risk of privacy breaches. This is particularly relevant in sectors such as healthcare and BFSI, where stringent data protection regulations are in place. The ability to derive insights from unlabelled data while maintaining compliance is a compelling value proposition, further propelling the market forward.



    Regionally, North America continues to dominate the AI in Unsupervised Learning market owing to its advanced technological ecosystem, significant investments in AI research, and strong presence of leading market players. Europe follows closely, driven by robust regulatory frameworks and a focus on ethical AI deployment. The Asia Pacific region is exhibiting the fastest growth, fueled by rapid digital transformation, government initiatives, and increasing adoption of AI across industries. Latin America and the Middle East & Africa are also witnessing steady growth, albeit at a slower pace, as awareness and infrastructure continue to develop.



    Component Analysis



    The Component segment of the AI in Unsupervised Learning market is categorized into Software, Hardware, and Services, each playing a pivotal role in the overall ecosystem. The software segment, comprising machine learning frameworks, data analytics platforms, and AI development tools, holds the largest market share. This dominance is attributed to the continuous evolution of AI algorithms and the increasing availability of open-source and proprietary solutions tailored for unsupervised learning. Enterprises are investing heavily in software that can facilitate the seamless integration of unsupervised learning capabilities into existing workflows, enabling automation, predictive analytics, and pattern recognition without the need for labeled data.



    The hardware segment, while smaller in comparison to software, is experiencing significant growth due to the escalating demand for high-perf

  17. Table_1_sscNOVA: a semi-supervised convolutional neural network for...

    • frontiersin.figshare.com
    xlsx
    Updated Feb 6, 2024
    Cite
    Haibo Li; Zhenhua Yu; Fang Du; Lijuan Song; Yang Gao; Fangyuan Shi (2024). Table_1_sscNOVA: a semi-supervised convolutional neural network for predicting functional regulatory variants in autoimmune diseases.xlsx [Dataset]. http://doi.org/10.3389/fimmu.2024.1323072.s002
    Available download formats: xlsx
    Dataset updated
    Feb 6, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Haibo Li; Zhenhua Yu; Fang Du; Lijuan Song; Yang Gao; Fangyuan Shi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Genome-wide association studies (GWAS) have identified thousands of variants in the human genome associated with autoimmune diseases. However, identifying functional regulatory variants associated with autoimmune diseases remains challenging, largely because of insufficient experimental validation data. We adopt the concept of semi-supervised learning by combining labeled and unlabeled data to develop a deep learning-based algorithm framework, sscNOVA, to predict functional regulatory variants in autoimmune diseases and analyze the functional characteristics of these regulatory variants. Compared to traditional supervised learning methods, our approach leverages more variants’ data to explore the relationship between functional regulatory variants and autoimmune diseases. Based on the experimentally curated testing dataset and evaluation metrics, we find that sscNOVA outperforms other state-of-the-art methods. Furthermore, we illustrate that sscNOVA can help to improve the prioritization of functional regulatory variants from lead single-nucleotide polymorphisms and the proxy variants in autoimmune GWAS data.

  18. Video Dataset Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Video Dataset Market Research Report 2033 [Dataset]. https://dataintelo.com/report/video-dataset-market
    Available download formats: pptx, csv, pdf
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Video Dataset Market Outlook



    Based on our latest research, the global video dataset market size reached USD 2.1 billion in 2024 and is projected to grow at a robust CAGR of 19.7% during the forecast period, reaching a value of USD 10.3 billion by 2033. This remarkable growth trajectory is driven by the increasing adoption of artificial intelligence and machine learning technologies, which heavily rely on high-quality video datasets for training and validation purposes. As organizations across industries seek to leverage advanced analytics and automation, the demand for comprehensive, well-annotated video datasets is accelerating rapidly, establishing the video dataset market as a critical enabler for next-generation digital solutions.




    One of the primary growth factors propelling the video dataset market is the exponential rise in the deployment of computer vision applications across diverse sectors. Industries such as automotive, healthcare, retail, and security are increasingly integrating AI-powered vision systems for tasks ranging from autonomous navigation and medical diagnostics to customer behavior analysis and surveillance. The effectiveness of these systems hinges on the availability of large, diverse, and accurately labeled video datasets that can be used to train robust machine learning models. With the proliferation of video-enabled devices and sensors, the volume of raw video data has surged, further fueling the need for curated datasets that can be harnessed to unlock actionable insights and drive automation.




    Another significant driver for the video dataset market is the growing emphasis on data-driven research and innovation within academic, commercial, and governmental institutions. Universities and research organizations are leveraging video datasets to advance studies in areas such as robotics, behavioral science, and smart city development. Similarly, commercial entities are utilizing these datasets to enhance product offerings, improve customer experiences, and gain a competitive edge through AI-driven solutions. Government and defense agencies are also investing in video datasets to bolster national security, surveillance, and public safety initiatives. This broad-based adoption across end-users is catalyzing the expansion of the video dataset market, as stakeholders recognize the strategic value of high-quality video data in driving technological progress and operational efficiency.




    The emergence of synthetic and augmented video datasets represents a transformative trend within the market, addressing challenges related to data scarcity, privacy, and bias. Synthetic datasets, generated using advanced simulation and generative AI techniques, enable organizations to create vast amounts of labeled video data tailored to specific scenarios without the need for extensive real-world data collection. This approach not only accelerates model development but also enhances data diversity and mitigates ethical concerns associated with using sensitive or personally identifiable information. As the technology for generating and validating synthetic video data matures, its adoption is expected to further accelerate, opening new avenues for innovation and market growth.




    Regionally, North America continues to dominate the video dataset market, accounting for the largest share in 2024 due to its advanced technological ecosystem, strong presence of leading AI companies, and substantial investments in research and development. However, the Asia Pacific region is witnessing the fastest growth, driven by rapid digital transformation, increasing adoption of AI in sectors like manufacturing and healthcare, and supportive government policies. Europe also represents a significant market, characterized by its focus on data privacy and regulatory compliance, which is shaping the development and utilization of video datasets across industries. These regional dynamics underscore the global nature of the video dataset market and highlight the diverse opportunities for stakeholders worldwide.



    Dataset Type Analysis



    The video dataset market is segmented by dataset type into labeled, unlabeled, and synthetic datasets, each serving distinct purposes and addressing unique industry requirements. Labeled video datasets are foundational for supervised learning applications, where annotated frames and sequences enable machine learning models to learn complex patterns and behaviors. The demand for labeled datasets is particularly high in sectors

  19. DataSheet_1_HiRAND: A novel GCN semi-supervised deep learning-based...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Jan 26, 2023
    Cite
    Huang, Yue; Zhang, Liuchao; He, Jia; Li, Kang; Rong, Zhiwei; Xu, Zhenyi; Ji, Jianxin; Hou, Yan; Liu, Weisha (2023). DataSheet_1_HiRAND: A novel GCN semi-supervised deep learning-based framework for classification and feature selection in drug research and development.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000994299
    Explore at:
    Dataset updated
    Jan 26, 2023
    Authors
    Huang, Yue; Zhang, Liuchao; He, Jia; Li, Kang; Rong, Zhiwei; Xu, Zhenyi; Ji, Jianxin; Hou, Yan; Liu, Weisha
    Description

    Predicting response to drugs from transcriptome data before initiating therapy is a major challenge, and obtaining reliable drug response labels costs time and resources. Available methods often predict poorly and fail to identify robust biomarkers because of the curse of dimensionality: high dimensionality combined with low sample size. This calls for predictive models that can predict drug response effectively from limited labeled data while remaining interpretable. In this study, we report a novel Hierarchical Graph Random Neural Networks (HiRAND) framework that predicts drug response from transcriptome data using a few labeled samples together with additional unlabeled samples. HiRAND integrates information from the gene graph and the sample graph with a graph convolutional network (GCN). The innovation of our model lies in using a data augmentation strategy to address the shortage of labeled data and consistency regularization to keep predictions on unlabeled data consistent across different data augmentations. HiRAND achieved better performance than competing methods in various prediction scenarios, on both simulated data and multiple drug response datasets. Across all 62 drugs, HiRAND's predictions were best for vorinostat. In addition, interpreting HiRAND identified the key genes most important to vorinostat response, highlighting critical roles for ribosomal protein-related genes in the response to histone deacetylase inhibition. HiRAND could serve as an efficient framework for improving drug response prediction with few labeled data.
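
The consistency regularization mentioned in this abstract can be illustrated with a minimal sketch. It is not the HiRAND implementation: the linear `predict` model, the feature-dropout augmentation in `drop_features`, the squared-distance penalty, and all parameter values are assumptions chosen only to show the general idea of penalizing disagreement between predictions made on differently augmented copies of the same unlabeled sample.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into class probabilities."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def drop_features(x, drop_prob, rng):
    """Augment one sample by zeroing each feature with probability drop_prob."""
    keep_scale = 1.0 / (1.0 - drop_prob)  # rescale so the expected value is unchanged
    return [0.0 if rng.random() < drop_prob else v * keep_scale for v in x]


def predict(x, weights):
    """Hypothetical stand-in model: one linear layer followed by softmax."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return softmax(logits)


def consistency_loss(x, weights, n_views=4, drop_prob=0.3, seed=0):
    """Mean squared distance between each view's prediction and the average prediction."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)
    views = [predict(drop_features(x, drop_prob, rng), weights) for _ in range(n_views)]
    n_classes = len(views[0])
    mean = [sum(p[c] for p in views) / n_views for c in range(n_classes)]
    return sum(
        sum((p[c] - mean[c]) ** 2 for c in range(n_classes)) for p in views
    ) / n_views


# Toy unlabeled sample with four features and a two-class model (made-up numbers).
x_unlabeled = [0.5, 1.2, -0.7, 2.0]
toy_weights = [[0.4, -0.1, 0.3, 0.2], [-0.2, 0.5, 0.1, -0.3]]
loss = consistency_loss(x_unlabeled, toy_weights)
```

In a semi-supervised setup of this kind, a term like `loss` would typically be added to the supervised loss on the labeled samples, so the unlabeled data still shapes the model even though it has no targets.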

  20. Data from: Solutions to Limited Annotation Problems of Deep Learning in...

    • curate.nd.edu
    pdf
    Updated Nov 11, 2024
    Cite
    Xinrong Hu (2024). Solutions to Limited Annotation Problems of Deep Learning in Medical Image Segmentation [Dataset]. http://doi.org/10.7274/25604643.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Nov 11, 2024
    Dataset provided by
    University of Notre Dame
    Authors
    Xinrong Hu
    License

    https://www.law.cornell.edu/uscode/text/17/106

    Description

    Image segmentation has broad applications in medical image analysis, providing crucial support to doctors in both automatic diagnosis and computer-assisted interventions. The heterogeneity observed across medical image datasets necessitates training task-specific segmentation models. However, effectively supervising the training of deep learning segmentation models typically demands dense label masks, a requirement that is hard to meet given the privacy and cost constraints of collecting large-scale medical datasets. Together, these challenges give rise to the limited annotation problem in medical image segmentation.

    In this dissertation, we address the challenges posed by annotation deficiencies by exploring several strategies. First, we employ self-supervised learning to extract information from unlabeled data, presenting a self-supervised method tailored to convolutional neural networks and 3D Vision Transformers. Second, we turn to domain adaptation problems, leveraging images with similar content but in different modalities. We introduce contrastive loss as a shape constraint in our image translation framework, which improves both performance and training robustness. Third, we incorporate diffusion models for data augmentation, expanding datasets with generated image-label pairs. Lastly, we explore extracting segmentation masks from image-level annotations alone. We propose a multi-task training framework for localizing abnormal ECG beats and a conditional diffusion-based algorithm for tumor detection.

Cite
Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners

Machine Learning Basics for Beginners🤖🧠

Machine Learning Basics

Explore at:
Available download formats: zip (492015 bytes)
Dataset updated
Jun 22, 2023
Authors
Bhanupratap Biswas
License

ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically

Description

This is an introduction to machine learning basics for beginners. Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. The key concepts and terms below will help you get started:

  1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

  2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

  3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.

  4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

  5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of all examples that are predicted correctly), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that the model correctly identifies), and F1 score (the harmonic mean of precision and recall).

  6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

  7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

  8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

  9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.

  10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
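
The evaluation metrics in item 5 can be computed directly from the four confusion-matrix counts. The sketch below is one plain-Python way to do that for binary labels; the function names and the spam-filter labels are made up for illustration and do not come from the dataset.

```python
from __future__ import annotations

from typing import Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1, returning 0.0 when a ratio is undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical spam-filter results: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Here the model finds 3 of the 4 spam emails and raises 1 false alarm, so precision and recall are both 3/4 while accuracy is 8/10. On imbalanced data the gap between accuracy and the other metrics is usually much larger, which is why precision and recall are reported alongside it.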

These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
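
As a first hands-on step, the train/test split described in item 3 can be sketched in a few lines. The helper below is an illustrative assumption rather than part of the dataset: the name `train_test_split`, the 20% default, and the fixed seed are arbitrary choices, and libraries such as scikit-learn provide more complete versions.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction of them for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)              # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Ten toy (feature, label) pairs standing in for a real dataset.
dataset = [(i, i % 2) for i in range(10)]
train, test = train_test_split(dataset)
```

The model should only ever be fitted on `train`. Evaluating on the held-out `test` examples is what exposes overfitting, because a model that merely memorized its training data will score noticeably worse there.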
