Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Dataset Name
Dataset Summary
This diagnostic dataset (website, paper) is specifically designed to evaluate the visual logical learning capabilities of machine learning models. It offers a seamless integration of visual and logical challenges, providing 2D images of complex visual trains, where the classification is derived from rule-based logic. The fundamental idea of V-LoL remains to integrate the explicit logical learning tasks of classic symbolic AI… See the full description on the dataset page: https://huggingface.co/datasets/AIML-TUDA/v-lol-trains.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Logic synthesis is a challenging and widely researched combinatorial optimization problem in integrated circuit (IC) design. It transforms a high-level description of hardware in a programming language like Verilog into an optimized digital circuit netlist, a network of interconnected Boolean logic gates, that implements the function. Spurred by the success of ML in solving combinatorial and graph problems in other domains, there is growing interest in the design of ML-guided logic synthesis tools. Yet, there are no standard datasets or prototypical learning tasks defined for this problem domain. Here, we describe OpenABC-D, a large-scale, labeled dataset produced by synthesizing open source designs with a leading open-source logic synthesis tool, and illustrate its use in developing, evaluating and benchmarking ML-guided logic synthesis. OpenABC-D has intermediate and final outputs in the form of 870,000 And-Inverter-Graphs (AIGs) produced from 1500 synthesis runs, plus labels such as optimized node counts and delay. We define a generic learning problem on this dataset and benchmark existing solutions for it. The code for dataset creation and the benchmark models is available at https://github.com/NYU-MLDA/OpenABC.git.
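As context for the And-Inverter-Graph representation mentioned above, here is a minimal, illustrative AIG sketch: every internal node is a 2-input AND gate, and each edge carries an optional inversion flag. The class and node names are inventions for this example, not the OpenABC-D schema.

```python
# Minimal And-Inverter-Graph (AIG) sketch. Not the OpenABC-D data format.

class AIG:
    def __init__(self):
        self.inputs = []   # primary input ids
        self.ands = {}     # node id -> ((fanin_id, inverted), (fanin_id, inverted))

    def add_input(self, nid):
        self.inputs.append(nid)

    def add_and(self, nid, a, b):
        self.ands[nid] = (a, b)

    def evaluate(self, nid, values):
        """Recursively evaluate node nid given {input_id: bool} values."""
        if nid in values:
            return values[nid]
        (fa, ia), (fb, ib) = self.ands[nid]
        va = self.evaluate(fa, values) ^ ia   # apply edge inversion
        vb = self.evaluate(fb, values) ^ ib
        return va and vb

# XOR(x, y) built from three AND nodes:
# x ^ y = !( !(x AND !y) AND !(!x AND y) )
g = AIG()
g.add_input("x")
g.add_input("y")
g.add_and("n1", ("x", False), ("y", True))   # x AND !y
g.add_and("n2", ("x", True), ("y", False))   # !x AND y
g.add_and("n3", ("n1", True), ("n2", True))  # !n1 AND !n2 = !(x ^ y)

def xor(x, y):
    return not g.evaluate("n3", {"x": x, "y": y})
```

Labels like the optimized node count then correspond to quantities such as `len(g.ands)` after synthesis rewrites the graph.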
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Detailed results of four ML models on the overall test dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistical machine learning usually achieves high-accuracy models by employing tens of thousands of examples. By contrast, both children and adult humans typically learn new concepts from either one or a small number of instances. The high data efficiency of human learning is not easily explained in terms of standard formal frameworks for machine learning, including Gold’s learning-in-the-limit framework and Valiant’s probably approximately correct (PAC) model. This paper explores ways in which this apparent disparity between human and machine learning can be reconciled by considering algorithms involving a preference for specificity combined with program minimality. It is shown how this can be efficiently enacted using hierarchical search based on identification of certificates and push-down automata to support hypothesizing compactly expressed maximal efficiency algorithms. Early results of a new system called DeepLog indicate that such approaches can support efficient top-down construction of relatively complex logic programs from a single example. This article is part of a discussion meeting issue ‘Cognitive artificial intelligence’.
MLRegTest is a benchmark for machine learning systems on sequence classification, which contains training, development, and test sets from 1,800 regular languages. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals (string, tier-string, subsequence, or combinations thereof). The logical complexity and choice of literal provide a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies. The languages were generated by creating finite-state acceptors, and the datasets were generated by sampling from these finite-state acceptors. The scripts and software used for these processes are open source and available; for details, see https://github.com/heinz-jeffrey/subregular-learning. Details are described in the arXiv preprint "MLRegTest: A Benchmark for the Machine Learning of Regular Languages".

# MLRegTest: A benchmark for the machine learning of regular languages
https://doi.org/10.5061/dryad.dncjsxm4h
MLRegTest provides training and testing data for 1800 regular languages.
This repository contains three gzipped tar archives.
> data.tar.gz (21GB)
> languages.tar.gz (4.5MB)
> models.tar.gz (76GB)
When uncompressed, these yield three directories, described in detail below.
> data (43GB)
> languages (38MB)
> models (87GB)
Languages are named according to the scheme Sigma.Tau.class.k.t.i.plebby, where Sigma is a two-digit alphabet size, Tau a two-digit number of salient symbols (the 'tier'), class the named subregular class, k the width of factors used (if applicable), t the threshold counted to (if applicable), and i a unique identifier. The table below unabbreviates the class names, and shows how many languages of each class there are.
| class | name ...
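The naming scheme above lends itself to mechanical parsing. A minimal sketch, where the example filename and the integer handling of k and t are assumptions based solely on the scheme as described:

```python
# Parse an MLRegTest language filename of the form
# Sigma.Tau.class.k.t.i.plebby. The example filename below is hypothetical.

def parse_language_name(fname):
    parts = fname.split(".")
    if len(parts) != 7 or parts[-1] != "plebby":
        raise ValueError(f"unexpected filename: {fname}")
    sigma, tau, cls, k, t, i = parts[:6]
    return {
        "alphabet_size": int(sigma),   # two-digit alphabet size
        "tier_size": int(tau),         # two-digit number of salient symbols
        "class": cls,                  # subregular class abbreviation
        "k": int(k),                   # factor width (if applicable)
        "t": int(t),                   # threshold (if applicable)
        "id": i,                       # unique identifier
    }

info = parse_language_name("04.04.SL.2.1.0.plebby")
```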
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for modeling risky driver behaviors based on accelerometer (X, Y, Z axes in meters per second squared (m/s²)) and gyroscope (X, Y, Z axes in degrees per second (°/s)) data.

Sampling rate: average 2 samples (rows) per second
Cars: Ford Fiesta 1.4, Ford Fiesta 1.25, Hyundai i20
Drivers: 3 different drivers, aged 27, 28 and 37
Driver behaviors: Sudden Acceleration (Class Label: 1), Sudden Right Turn (Class Label: 2), Sudden Left Turn (Class Label: 3), Sudden Brake (Class Label: 4)
Best window size: 14 seconds
Sensor: MPU6050
Device: Raspberry Pi 3 Model B

Please see the Summary Table for a summary of the collected data.
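At roughly 2 samples per second, the stated best window size of 14 seconds corresponds to 28 rows per window. A minimal windowing sketch; the array shape and six-channel ordering (accelerometer then gyroscope axes) are assumptions, not the dataset's actual file layout:

```python
import numpy as np

# Slice a 6-channel accelerometer/gyroscope stream into non-overlapping
# 14-second windows at ~2 Hz (28 rows per window).

SAMPLES_PER_SEC = 2
WINDOW_SEC = 14
WINDOW_LEN = SAMPLES_PER_SEC * WINDOW_SEC  # 28 rows

def make_windows(stream):
    """stream: (n_samples, 6) array -> (n_windows, 28, 6) array."""
    n = (len(stream) // WINDOW_LEN) * WINDOW_LEN  # drop the trailing partial window
    return stream[:n].reshape(-1, WINDOW_LEN, stream.shape[1])

stream = np.zeros((100, 6))       # placeholder for real sensor rows
windows = make_windows(stream)    # 100 rows -> 3 full windows
```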
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artificial Intelligence (AI) plays a fundamental role in the modern world, especially when used as an autonomous decision maker. One common concern nowadays is “how trustworthy the AIs are.” Human operators follow a strict educational curriculum and performance assessment that could be exploited to quantify how much trust we place in them. To quantify the trust of AI decision makers, we must go beyond task accuracy, especially when facing limited, incomplete, misleading, controversial or noisy datasets. Toward addressing these challenges, we describe DeepTrust, a Subjective Logic (SL) inspired framework that constructs a probabilistic logic description of an AI algorithm and takes into account the trustworthiness of both the dataset and the inner algorithmic workings. DeepTrust identifies proper multi-layered neural network (NN) topologies that have high projected trust probabilities, even when trained with untrusted data. We show that, when evaluating an NN's opinion and trustworthiness, an uncertain opinion of the data is not always harmful, whereas a disbelief opinion hurts trust the most. Also, trust probability does not necessarily correlate with accuracy. DeepTrust also provides a projected trust probability of the NN's prediction, which is useful when the NN generates an over-confident output under problematic datasets. These findings open new analytical avenues for designing and improving the NN topology by optimizing opinion and trustworthiness, along with accuracy, in a multi-objective optimization formulation, subject to space and time constraints.
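For readers unfamiliar with Subjective Logic, a binomial SL opinion is a tuple (belief, disbelief, uncertainty, base rate) with b + d + u = 1, and the projected probability is P = b + a·u. That formula is standard SL; the class below is only an illustrative sketch, not DeepTrust's implementation:

```python
from dataclasses import dataclass

# Sketch of a binomial Subjective Logic opinion and its projected
# probability, the kind of quantity a DeepTrust-style framework reports.

@dataclass
class Opinion:
    belief: float       # b
    disbelief: float    # d
    uncertainty: float  # u, with b + d + u = 1
    base_rate: float    # a, the prior probability

    def __post_init__(self):
        total = self.belief + self.disbelief + self.uncertainty
        assert abs(total - 1.0) < 1e-9, "b + d + u must sum to 1"

    def projected_probability(self):
        # Standard SL projection: P = b + a * u
        return self.belief + self.base_rate * self.uncertainty

# A vacuous opinion (all uncertainty) projects to the base rate alone;
# a confident opinion is dominated by its belief mass.
vacuous = Opinion(0.0, 0.0, 1.0, 0.5)
confident = Opinion(0.8, 0.1, 0.1, 0.5)
```

Note how projection separates trust from accuracy: two opinions can project to the same probability while carrying very different uncertainty, which is the distinction the abstract draws.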
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Recently, a growing number of researchers have applied machine learning to assist users of interactive theorem provers.
However, the expressive nature of the underlying logics and the esoteric structure of proof documents deter machine learning practitioners, who often have little expertise in formal logic, let alone Isabelle/HOL, from applying their tools and expertise to theorem proving.
In this data description, we present a simple dataset that contains data on over 400k proof method applications in the Archive of Formal Proofs along with over 100 extracted features for each in a format that can be processed easily without any knowledge about formal logic.
Our simple data format allows machine learning practitioners to try machine learning tools to predict proof methods in Isabelle/HOL, even if they are unfamiliar with theorem proving.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Among many types of cancers, lung cancer remains, to date, one of the deadliest around the world. Many researchers, scientists, doctors, and people from other fields continuously contribute to this subject, particularly regarding early prediction and diagnosis. One of the significant problems in prediction is the black-box nature of machine learning models: though the detection rate is comparatively satisfactory, it is often unclear how a model reached its decision, causing trust issues among patients and healthcare workers. This work applies multiple machine learning models to a numerical dataset of lung cancer-relevant parameters and compares their performance and accuracy. After comparison, each model has been explained using different methods. The main contribution of this research is to give logical explanations of why the model reached a particular decision, in order to achieve trust. This research has also been compared with a previous study that worked with a similar dataset and took expert opinions regarding its proposed model. Using hyperparameter tuning, our research achieved better results than that model and the specialist opinion, with an improved accuracy of almost 100% in all four models.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of:
H.F. Mateo-Romero, M.E. Carbonó de la Rosa, L. Hernández-Callejo, M.A. González-Rebollo, V. Cardeñoso-Payo, V. Alonso-Gómez, S. Gallardo-Saavedra, J.I. Morales Aragonés, “Enhancing photovoltaic cell classification through mamdani fuzzy logic: a comparative study with machine learning approaches employing electroluminescence images”, Progress in Artificial Intelligence (2024) pp. 1-11.
https://doi.org/10.1007/s13748-024-00353-w
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
See the official website: https://autovi.utc.fr
Modern industrial production lines must be set up with robust defect inspection modules that are able to withstand high product variability. This means that in a context of industrial production, new defects that are not yet known may appear, and must therefore be identified.
On industrial production lines, the typology of potential defects is vast (texture, part failure, logical defects, etc.). Inspection systems must therefore be able to detect non-listed defects, i.e. defects not yet observed when the inspection system was developed. Solving this problem requires research and development of unsupervised AI algorithms on real-world data.
Renault Group and the Université de technologie de Compiègne (Roberval and Heudiasyc Laboratories) have jointly developed the Automotive Visual Inspection Dataset (AutoVI), the purpose of which is to be used as a scientific benchmark to compare and develop advanced unsupervised anomaly detection algorithms under real production conditions. The images were acquired on Renault Group's automotive production lines, in a genuine industrial production line environment, with variations in brightness and lighting on constantly moving components. This dataset is representative of actual data acquisition conditions on automotive production lines.
The dataset contains 3950 images, split into 1530 training images and 2420 testing images.
The evaluation code can be found at https://github.com/phcarval/autovi_evaluation_code.
Disclaimer
All defects shown were intentionally created on Renault Group's production lines for the purpose of producing this dataset. The images were examined and labeled by Renault Group experts, and all defects were corrected after shooting.
License
Copyright © 2023-2024 Renault Group
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of the license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/.
To use the data in a way that falls under the commercial-use clause of the license, please contact us.
Attribution
Please use the following for citing the dataset in scientific work:
Carvalho, P., Lafou, M., Durupt, A., Leblanc, A., & Grandvalet, Y. (2024). The Automotive Visual Inspection Dataset (AutoVI): A Genuine Industrial Production Dataset for Unsupervised Anomaly Detection [Dataset]. https://doi.org/10.5281/zenodo.10459003
Contact
If you have any questions or remarks about this dataset, please contact us at philippe.carvalho@utc.fr, meriem.lafou@renault.com, alexandre.durupt@utc.fr, antoine.leblanc@renault.com, yves.grandvalet@utc.fr.
Changelog
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TextBite is a dataset of historical Czech documents spanning the 18th to 20th centuries, featuring diverse layouts from newspapers, dictionaries, and handwritten records. It is mainly aimed at logical segmentation, but can be used for other tasks as well. Additionally, part of the dataset contains handwritten documents, primarily records from schools and public organizations, introducing extra segmentation challenges due to their more loosely structured layouts.
In total, the dataset contains 8,449 annotated pages, from which 7,346 pages are printed and 1,103 are handwritten. The pages contain a total of 78,863 segments. The test subset contains 964 pages, of which 185 are handwritten. The annotations are provided in an extended COCO format. Each segment is represented by a set of axis aligned bounding boxes, which are connected by directed relationships, representing reading order. To include these relationships in the COCO format, a new top-level key relations is added. Each relation entry specifies a source and a target bounding box.
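The extended COCO layout described above can be consumed with a few lines of standard JSON handling; following the `source`/`target` links recovers the reading order. The tiny inline document below is an illustrative stand-in, not a real TextBite page:

```python
import json

# Read the extended COCO structure and follow the top-level "relations"
# key (source -> target bounding-box ids) to recover reading order.

doc = json.loads("""
{
  "annotations": [{"id": 1, "bbox": [0, 0, 10, 5]},
                  {"id": 2, "bbox": [0, 6, 10, 5]},
                  {"id": 3, "bbox": [0, 12, 10, 5]}],
  "relations": [{"source": 1, "target": 2},
                {"source": 2, "target": 3}]
}
""")

def reading_chain(relations, start):
    """Follow source->target links from a starting box id."""
    nxt = {r["source"]: r["target"] for r in relations}
    order = [start]
    while order[-1] in nxt:
        order.append(nxt[order[-1]])
    return order

chain = reading_chain(doc["relations"], 1)
```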
In addition to the layout annotations, we provide a textual representation of the pages produced by the Optical Character Recognition (OCR) tool PERO-OCR. These come in the form of XML files in the PAGE-XML format, which include an enclosing polygon for each individual textline along with the transcriptions and their confidences. Lastly, we provide the OCR results in the ALTO format, which includes polygons for individual words in the page image.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Game Development and Enhancement: Developers can incorporate the MKSC model into their game development process for identifying different game elements like characters or objects (coins, trees, peaches, etc.). This can facilitate automatic level design, character recognition and movement logic.
Interactive Content Creation: Streamers, digital content creators, or video game reviewers can use this model to analyze gameplay, identifying key characters and events in real-time or during video editing. This can open doors to more interactive and engaging content for audiences, possibly even automated highlights or recaps based on character occurrences.
Gaming Tutorials and Guides: The MKSC model can be used to develop comprehensive gaming guides and step-by-step tutorials. By recognizing game elements, it can show players where to find specific items or characters, or provide an analysis of gameplay to help players improve.
Machine Learning Research: Researchers can use the MKSC model as a baseline or reference for their research in video game AI or broader computer vision/ML studies. It provides a good use-case for pixel class recognition in complex, dynamic environments like video games.
Video Game AI Training: AI bots can be trained using the MKSC model. It can help build a neural network that understands video game landscapes, enabling the bots to interact more diversely and intelligently in a video game setup, and enhancing player vs. AI experiences.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Binarized version of the UNSW-NB15 dataset, where the original features (a mix of strings, categorical values, floating-point values, etc.) are converted to a bit string of 593 bits. Each value in each feature is either 0 or 1, stored as a uint8 value. The uint8 values are represented as numpy arrays, provided separately for training and test data (the same train/test split as the original dataset is used). The final binary value in each sample is the expected output.
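Given that layout, splitting each 593-bit row into 592 feature bits and a final label bit is a one-liner. A minimal sketch; the random array stands in for the real training file, whose name and loading call are not specified above:

```python
import numpy as np

# Split 593-bit samples into features and label: the final bit of each
# row is the expected output, per the dataset description.

rng = np.random.default_rng(0)
train = rng.integers(0, 2, size=(8, 593), dtype=np.uint8)  # stand-in data

X = train[:, :-1]   # 592 feature bits
y = train[:, -1]    # final bit = label
```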
Among others, this dataset has been used for quantized neural network research:
Umuroglu, Y., Akhauri, Y., Fraser, N. J., & Blott, M. (2020, August). LogicNets: Co-Designed Neural Networks and Circuits for Extreme-Throughput Applications. In 2020 30th International Conference on Field-Programmable Logic and Applications (FPL) (pp. 291-297). IEEE.
The method for binarization is identical to the one described in 10.5281/zenodo.3258657:
"T. Murovič, A. Trost, Massively Parallel Combinational Binary Neural Networks for Edge Processing, Elektrotehniški vestnik, vol. 86, no. 1-2, pp. 47-53, 2019"
The original UNSW-NB15 dataset is by:
Moustafa, Nour, and Jill Slay. "UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set)." Military Communications and Information Systems Conference (MilCIS), 2015. IEEE, 2015.
Developing an MLP-Based AI/ML Model for Sudoku Puzzle Solving
Introduction to AI/ML Sudoku Solvers
Sudoku, a widely recognized logic-based combinatorial number-placement puzzle, presents a compelling challenge for Artificial Intelligence and Machine Learning models. The objective of Sudoku is to populate a 9x9 grid, which is further subdivided into nine 3x3 subgrids, with digits ranging from 1 to 9. The fundamental constraint is that each digit must appear exactly once within each row, each… See the full description on the dataset page: https://huggingface.co/datasets/MartialTerran/Sodoku_Puzzle_Generator.
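The row/column/subgrid constraint described above can be stated directly in code. A minimal validity check (0 marks an empty cell); this is a generic sketch, not the MLP model from the dataset:

```python
# A digit placement is legal only if the digit appears nowhere else in
# its row, its column, or its 3x3 subgrid.

def is_valid(grid, row, col, digit):
    if any(grid[row][c] == digit for c in range(9)):
        return False
    if any(grid[r][col] == digit for r in range(9)):
        return False
    br, bc = 3 * (row // 3), 3 * (col // 3)   # top-left of the 3x3 subgrid
    return all(grid[br + r][bc + c] != digit
               for r in range(3) for c in range(3))

grid = [[0] * 9 for _ in range(9)]
grid[0][0] = 5
```

Checks like this are what a learned solver's outputs are measured against: every emitted digit must satisfy all three constraints simultaneously.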
The functional diversity of microbial communities emerges from a combination of the great number of species and the many interaction types, such as competition, mutualism, predation or parasitism, in microbial ecological networks. Understanding the relationship between microbial networks and the services and functions delivered by the microbial communities is a key challenge for Microbial Ecology, particularly as so many of these interactions are difficult to observe and characterize. We believe that this 'Dark Web' of interactions could be unravelled using an explainable machine learning approach, called Abductive/Inductive Logic Programming (A/ILP) in the R package InfIntE, which uses mechanistic rules (interaction hypotheses) to infer directly the network structure and interaction types. Here we attempt to unravel the dark web of the plant microbiome embodied in metabarcoding data sampled from the grapevine foliar microbiome. Using synthetic, simulated data, we first show that it is possible to satisfactorily reconstruct microbial networks using explainable machine learning. Then we confirm that the dark web of the grapevine microbiome is diverse, being composed of a range of interaction types consistent with the literature. This first attempt to use explainable machine learning to infer microbial interaction networks advances our understanding of the ecological processes that occur in microbial communities and allows us to infer specific types of interaction within the grapevine microbiome that could be validated through experimentation. This work will have potentially valuable applications, such as the discovery of antagonistic interactions that might be used to identify potential biological control agents within the microbiome.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AcTBeCalf Dataset Description
The AcTBeCalf dataset is a comprehensive dataset designed to support the classification of pre-weaned calf behaviors from accelerometer data. It contains detailed accelerometer readings aligned with annotated behaviors, providing a valuable resource for research in multivariate time-series classification and animal behavior analysis. The dataset includes accelerometer data collected from 30 pre-weaned Holstein Friesian and Jersey calves, housed in group pens at the Teagasc Moorepark Research Farm, Ireland. Each calf was equipped with a 3D accelerometer sensor (AX3, Axivity Ltd, Newcastle, UK) sampling at 25 Hz, attached to a neck collar from one week of age for 13 weeks.
This dataset encompasses 27.4 hours of accelerometer data aligned with calf behaviors, including both prominent behaviors like lying, standing, and running, as well as less frequent behaviors such as grooming, social interaction, and abnormal behaviors.
The dataset consists of a single CSV file with the following columns:
dateTime: Timestamp of the accelerometer reading, sampled at 25 Hz.
calfid: Identification number of the calf (1-30).
accX: Accelerometer reading for the X axis (top-bottom direction)*.
accY: Accelerometer reading for the Y axis (backward-forward direction)*.
accZ: Accelerometer reading for the Z axis (left-right direction)*.
behavior: Annotated behavior based on an ethogram of 23 behaviors.
segId: Segment identification number associated with each accelerometer reading/row, representing all readings of the same behavior segment.
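The single-CSV layout above can be grouped into behavior segments via segId with the standard library alone. A minimal sketch; the two inline rows are made-up examples, not real dataset values:

```python
import csv
import io

# Read the CSV layout described above and group accelerometer rows into
# behavior segments keyed by segId.

sample = io.StringIO(
    "dateTime,calfid,accX,accY,accZ,behavior,segId\n"
    "2022-01-01 00:00:00.00,1,0.01,-0.98,0.03,lying,42\n"
    "2022-01-01 00:00:00.04,1,0.02,-0.97,0.02,lying,42\n"
)

segments = {}
for row in csv.DictReader(sample):
    segments.setdefault(row["segId"], []).append(
        (float(row["accX"]), float(row["accY"]), float(row["accZ"]))
    )
```

Each value of `segments` is then one contiguous run of readings sharing a behavior label, the unit most time-series classifiers consume.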
Code Files Description
The dataset is accompanied by several code files to facilitate the preprocessing and analysis of the accelerometer data and to support the development and evaluation of machine learning models. The main code files included in the dataset repository are:
accelerometer_time_correction.ipynb: This script corrects the accelerometer time drift, ensuring the alignment of the accelerometer data with the reference time.
shake_pattern_detector.py: This script includes an algorithm to detect shake patterns in the accelerometer signal for aligning the accelerometer time series with reference times.
aligning_accelerometer_data_with_annotations.ipynb: This notebook aligns the accelerometer time series with the annotated behaviors based on timestamps.
manual_inspection_ts_validation.ipynb: This notebook provides a manual inspection process for ensuring the accurate alignment of the accelerometer data with the annotated behaviors.
additional_ts_generation.ipynb: This notebook generates additional time-series data from the original X, Y, and Z accelerometer readings, including Magnitude, ODBA (Overall Dynamic Body Acceleration), VeDBA (Vectorial Dynamic Body Acceleration), pitch, and roll.
genSplit.py: This script provides the logic used for the generalized subject separation for machine learning model training, validation and testing.
active_inactive_classification.ipynb: This notebook details the process of classifying behaviors into active and inactive categories using a RandomForest model, achieving a balanced accuracy of 92%.
four_behv_classification.ipynb: This notebook employs the mini-ROCKET feature derivation mechanism and a RidgeClassifierCV to classify behaviors into four categories: drinking milk, lying, running, and other, achieving a balanced accuracy of 84%.
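The derived signals named in additional_ts_generation.ipynb have standard definitions. The sketch below computes them for a single sample after separating static (gravity) and dynamic components; the static estimate and the axis conventions for pitch and roll are simplified assumptions, and the notebook's exact smoothing may differ:

```python
import math

# Magnitude, ODBA, VeDBA, pitch and roll for one accelerometer sample.
# (x, y, z): raw reading; (sx, sy, sz): static/gravity estimate,
# typically a running mean of each axis.

def derived_signals(x, y, z, sx, sy, sz):
    dx, dy, dz = x - sx, y - sy, z - sz   # dynamic components
    return {
        "magnitude": math.sqrt(x * x + y * y + z * z),
        "odba": abs(dx) + abs(dy) + abs(dz),              # Overall Dynamic Body Acceleration
        "vedba": math.sqrt(dx * dx + dy * dy + dz * dz),  # Vectorial Dynamic Body Acceleration
        "pitch": math.degrees(math.atan2(sx, math.sqrt(sy * sy + sz * sz))),
        "roll": math.degrees(math.atan2(sy, sz)),
    }

# A calf standing still with gravity mostly on the Y axis, plus a small
# movement on X and Z.
sig = derived_signals(0.1, -1.0, 0.2, 0.0, -1.0, 0.0)
```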
Kindly cite one of the following papers when using this data:
Dissanayake, O., McPherson, S. E., Allyndrée, J., Kennedy, E., Cunningham, P., & Riaboff, L. (2024). Evaluating ROCKET and Catch22 features for calf behaviour classification from accelerometer data using Machine Learning models. arXiv preprint arXiv:2404.18159.
Dissanayake, O., McPherson, S. E., Allyndrée, J., Kennedy, E., Cunningham, P., & Riaboff, L. (2024). Development of a digital tool for monitoring the behaviour of pre-weaned calves using accelerometer neck-collars. arXiv preprint arXiv:2406.17352.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of results due to parameter tuning for five-fold cross-validation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Decisions surrounding involuntary psychiatric treatment orders often involve complex clinical, legal, and ethical considerations, especially when patients lack decisional capacity and refuse treatment. In Quebec, these orders are issued by the Superior Court based on a combination of medical, legal, and behavioral evidence. However, no transparent, evidence-informed predictive tools currently exist to estimate the likelihood of full treatment order acceptance. This study aims to develop and evaluate a hybrid fuzzy logic–machine learning model to predict such outcomes and identify important influencing factors.

Methods
A retrospective dataset of 176 Superior Court judgments rendered in Quebec in 2024 was curated from SOQUIJ, encompassing demographic, clinical, and legal variables. A Mamdani-type fuzzy inference system was constructed to simulate expert decision logic and output a continuous likelihood score. This score, along with structured features, was used to train a Random Forest classifier. Model performance was evaluated using accuracy, precision, recall and F1 score. A 10-fold stratified cross-validation was employed for internal validation. Feature importance was also computed to assess the influence of each variable on the prediction outcome.

Results
The hybrid model achieved an accuracy of 98.1%, precision of 93.3%, recall of 100%, and an F1 score of 96.6%. The most influential predictors were the duration of time granted by the court, the duration requested by the clinical team, and the age of the defendant. Fuzzy logic features such as severity, compliance, and a composite Burden_Score also contributed significantly to prediction accuracy. Only one misclassified case was observed in the test set, and the system provided interpretable decision logic consistent with expert reasoning.

Conclusion
This exploratory study offers a novel approach for decision support in forensic psychiatric contexts. Future work should aim to validate the model across other jurisdictions, incorporate more advanced natural language processing for semantic feature extraction, and explore dynamic rule optimization techniques. These enhancements would further improve generalizability, fairness, and practical utility in real-world clinical and legal settings.
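The hybrid idea described in the Methods can be sketched in miniature: a small Mamdani-style step turns raw variables into a continuous likelihood score, which is then appended to the feature vector fed to a classifier. The membership functions, rules, and variable ranges below are illustrative inventions, not the study's actual system:

```python
# Tiny Mamdani-style scoring step (illustrative, not the study's rules).

def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_likelihood(severity, compliance):
    """severity, compliance assumed normalized to [0, 1]."""
    # Rule 1: high severity AND low compliance -> high likelihood (output 1.0)
    r1 = min(tri(severity, 0.5, 1.0, 1.5), tri(compliance, -0.5, 0.0, 0.5))
    # Rule 2: low severity AND high compliance -> low likelihood (output 0.0)
    r2 = min(tri(severity, -0.5, 0.0, 0.5), tri(compliance, 0.5, 1.0, 1.5))
    # Crude defuzzification: firing-strength-weighted average of rule outputs.
    if r1 + r2 == 0.0:
        return 0.5   # no rule fires: fall back to an uninformative score
    return (r1 * 1.0 + r2 * 0.0) / (r1 + r2)

features = [0.9, 0.1]   # severity, compliance
augmented = features + [fuzzy_likelihood(*features)]   # feature vector + fuzzy score
```

The augmented vector is what a Random Forest (or any classifier) would then train on, which is what makes the pipeline hybrid: the fuzzy score carries interpretable expert logic, and the classifier learns the residual structure.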
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
D_start,goal: Distance from play origin to goal. D_end,goal: Distance from play end position to goal-line. A_open: Opening angle of the goal from play origin. Y_end*: End position of the play, projected onto the goal-line.
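The distance and opening-angle features above are plain plane geometry. A sketch, where the goal-post coordinates and the goal-centered coordinate frame are hypothetical (real pitch dimensions depend on the sport):

```python
import math

# Distance to goal and the opening angle subtended by the goal mouth,
# in a frame where the goal-line is y = 0 and the goal is centered at x = 0.

POST_LEFT = (-3.66, 0.0)    # hypothetical post positions on the goal-line
POST_RIGHT = (3.66, 0.0)

def distance_to_goal(x, y):
    """Distance from (x, y) to the goal center at the origin."""
    return math.hypot(x, y)

def opening_angle(x, y):
    """Angle (radians) subtended by the goal mouth at (x, y)."""
    a1 = math.atan2(POST_LEFT[1] - y, POST_LEFT[0] - x)
    a2 = math.atan2(POST_RIGHT[1] - y, POST_RIGHT[0] - x)
    d = abs(a1 - a2)
    return min(d, 2 * math.pi - d)   # take the interior angle

d = distance_to_goal(0.0, 11.0)      # straight in front of goal, 11 units out
theta = opening_angle(0.0, 11.0)
```

For a point straight in front of the goal, the angle reduces to 2·atan(half-width / distance), a useful sanity check on the implementation.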