Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The motivation for preprocessing large-scale CAD models stems from the limitations of assembly-by-disassembly approaches. Assembly-by-disassembly is only suitable for assemblies with a small number of parts (n_{parts} < 22). When dealing with large-scale products of high complexity, however, the CAD models may not contain feasible subassemblies (e.g. with connected and interference-free parts) and have too many parts to be processed with assembly-by-disassembly. Product designers' preferences during the design phase may be ill-suited to assembly-by-disassembly processing, because subassembly feasibility and the number of parts per subassembly are not considered explicitly. An automated preprocessing approach is therefore proposed that splits the model into manageable partitions using community detection. This enables parallelised, efficient and accurate assembly-by-disassembly of large-scale CAD models. However, applying community detection methods to automatically split CAD models into smaller subassemblies is a new concept, and its suitability for assembly sequence planning (ASP) still needs to be investigated. Therefore, the following underlying research question will be answered in these experiments:
Underlying research question 2: Can automated preprocessing increase the suitability of CAD-based assembly-by-disassembly for large-scale products?
A hypothesis is formulated to answer this research question, which will be utilised to design experiments for hypothesis testing.
Hypothesis 2: Community detection algorithms can be applied to automatically split large-scale assemblies into suitable candidates for CAD-based AND/OR graph generation.
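The partitioning code itself is not part of this description, so the following is only a minimal sketch of the idea under stated assumptions: the assembly is represented as a part-contact graph (the part names are hypothetical) and networkx's modularity-based community detection stands in for whichever algorithm the experiments actually use.

```python
# Illustrative sketch only (not the authors' implementation): partition a
# part-contact graph of an assembly into candidate subassemblies.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Nodes are parts; edges are contacts/liaisons between parts (hypothetical data).
contacts = [
    ("bolt_1", "bracket"), ("bracket", "frame"), ("frame", "panel_a"),
    ("panel_a", "panel_b"), ("frame", "motor"), ("motor", "motor_mount"),
]
G = nx.Graph(contacts)

# Each detected community is a candidate subassembly that should stay below
# the part-count limit of assembly-by-disassembly (n_parts < 22).
for i, parts in enumerate(greedy_modularity_communities(G)):
    print(f"Candidate subassembly {i}: {sorted(parts)}")
```

Each detected community could then be processed by assembly-by-disassembly independently, which is the parallelisation motivated above.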
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains information about three species of Iris flowers: Setosa, Versicolour, and Virginica. It is a well-known dataset in the machine learning and statistics communities, often used for classification and clustering tasks. Each row represents a sample of an Iris flower, with measurements of its physical attributes and the corresponding target label.
Dataset Features: sepal length (cm): The length of the sepal in centimeters. sepal width (cm): The width of the sepal in centimeters. petal length (cm): The length of the petal in centimeters. petal width (cm): The width of the petal in centimeters. target: A numerical label (0, 1, or 2) indicating the flower species: 0: Setosa 1: Versicolour 2: Virginica
Purpose: This dataset can be used for: Supervised learning tasks, particularly classification. Exploratory data analysis and visualization of flower attributes. Understanding the application of machine learning algorithms like decision trees, KNN, and support vector machines.
Source: This is a modified version of the classic Iris flower dataset, often used for beginner-level machine learning projects and demonstrations.
Potential Use Cases: Training machine learning models for flower classification. Practicing data preprocessing, feature scaling, and visualization techniques. Understanding the relationships between features through scatter plots and correlation analysis.
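As a quick illustration of the classification use case listed above, the sketch below trains a k-nearest-neighbours classifier with scikit-learn; it uses sklearn's built-in copy of Iris as a stand-in for the CSV described here.

```python
# Minimal classification example on the Iris data (4 features, target in {0, 1, 2}).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```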
Python 2 Jupyter notebook that aggregates sub-daily time series observations up to a daily time scale. The code was originally written to aggregate data stored in the sqlite database in this resource: https://www.hydroshare.org/resource/9e1b23607ac240588ba50d6b5b9a49b5/
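The notebook itself is not reproduced here; the following is a hedged Python 3/pandas sketch of the same aggregation idea. The sqlite table and column names are assumptions, not taken from the linked resource.

```python
# Sketch: aggregate sub-daily observations stored in a sqlite database up to daily values.
import sqlite3
import pandas as pd

conn = sqlite3.connect("observations.sqlite")                    # hypothetical database file
df = pd.read_sql_query(
    "SELECT ValueDateTime, DataValue FROM TimeSeriesResultValues",  # assumed schema
    conn, parse_dates=["ValueDateTime"])

# Resample the sub-daily series to a daily time scale (mean per day).
daily = (df.set_index("ValueDateTime")["DataValue"]
           .resample("D")
           .mean())
print(daily.head())
```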
Empirically analyzing household behavior usually relies on informal data preprocessing. That is, before the estimation, observations are preselected to obtain a sufficiently homogeneous subset of data. In the context of estimating equivalence scales for household income, we use matching techniques and balance checking at this initial stage. This can be interpreted as a non-parametric approach to preprocessing data that formalizes informal procedures. We illustrate this using German micro-data on household expenditure, showing that matching leads to results which are more stable with respect to model specification and is especially useful when applied to specific subgroups, such as low-income households. The files provided here contain the code (in "R") which is needed to replicate our analyses.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the USI repository.
The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.
The preprocessing steps include:
One-hot-encoding of categorical values
Imputation of missing values using knn-imputer with k=1
Standard scaling of ordinal attributes
Note: we assume a scenario in which the test set is available before training (every attribute besides the target, "income"); therefore we combine the train and test sets before preprocessing.
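A minimal sketch of these steps with scikit-learn is shown below; it is not the notebook's exact code, and the raw file names are assumptions.

```python
# Hedged sketch of the listed preprocessing steps; file names and column
# handling are assumptions, not taken from adult_preprocessing.ipynb.
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("adult_train_raw.csv")   # hypothetical raw inputs
test = pd.read_csv("adult_test_raw.csv")

# Combine train and test before preprocessing (target "income" set aside).
full = pd.concat([train, test], keys=["train", "test"])
income = full.pop("income")
num_cols = full.select_dtypes("number").columns

# 1) One-hot-encoding of categorical values.
full = pd.get_dummies(full, columns=full.select_dtypes("object").columns)

# 2) Imputation of missing values with a 1-nearest-neighbour imputer (k=1).
full = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(full),
                    columns=full.columns, index=full.index)

# 3) Standard scaling of the ordinal/numeric attributes.
full[num_cols] = StandardScaler().fit_transform(full[num_cols])

# Split back into the preprocessed train and test sets.
train_pre, test_pre = full.loc["train"], full.loc["test"]
```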
This dataset comprises an array of Mel Frequency Cepstral Coefficients (MFCCs) that have undergone feature scaling, representing a variety of human actions. Feature scaling, or data normalization, is a preprocessing technique used to standardize the range of features in the dataset. For MFCCs, this process helps ensure all coefficients contribute equally to the learning process, preventing features with larger scales from overshadowing those with smaller scales.
In this dataset, the audio signals correspond to diverse human actions such as walking, running, jumping, and dancing. The MFCCs are calculated via a series of signal processing stages, which capture key characteristics of the audio signal in a manner that closely aligns with human auditory perception. The coefficients are then standardized or scaled using methods such as MinMax Scaling or Standardization, thereby normalizing their range. Each normalized MFCC vector corresponds to a segment of the audio signal.
The dataset is meticulously designed for tasks including human action recognition, classification, segmentation, and detection based on auditory cues. It serves as an essential resource for training and evaluating machine learning models focused on interpreting human actions from audio signals. This dataset proves particularly beneficial for researchers and practitioners in fields such as signal processing, computer vision, and machine learning, who aim to craft algorithms for human action analysis leveraging audio signals.
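The dataset description does not include extraction code; the sketch below only illustrates the described normalization step on MFCCs, assuming librosa for feature extraction and scikit-learn's MinMaxScaler, with a hypothetical audio clip.

```python
# Sketch: MFCC extraction followed by Min-Max scaling (one vector per frame).
import librosa
from sklearn.preprocessing import MinMaxScaler

signal, sr = librosa.load("walking_clip.wav", sr=None)       # hypothetical clip

# MFCCs: one 13-dimensional coefficient vector per audio frame.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T    # shape (frames, 13)

# Scale each coefficient to [0, 1] so no coefficient dominates learning.
mfcc_scaled = MinMaxScaler().fit_transform(mfcc)
print(mfcc_scaled.shape, mfcc_scaled.min(), mfcc_scaled.max())
```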
Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.
The dataset is suitable for laboratory exercises, assignments, and demonstrations of key preprocessing techniques such as handling missing values, removing duplicates, and treating outliers. The columns are described below:
| Column Name | Description |
|---|---|
| Student_ID | Unique identifier for each student (e.g., S0001, S0002, …) |
| Age | Age of the student (between 18 and 25 years) |
| Gender | Gender of the student (Male/Female) |
| Study_Hours | Average number of study hours per day (contains missing values and outliers) |
| Attendance(%) | Percentage of class attendance (contains missing values) |
| Test_Score | Final exam score (0–100 scale) |
| Grade | Letter grade derived from test scores (F, C, B, A, A+) |
Suggested task (Test_Score prediction): predict the student's test score from study hours, attendance percentage, age, and gender.
🧠 Sample Features: X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)'] y = ['Test_Score']
You can use standard regression models for this task (see the sketch below) and analyze feature influence using correlation or SHAP/LIME explainability.
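A minimal sketch of this regression task follows; the CSV file name is an assumption, and the preprocessing choices (median imputation, one-hot encoding of Gender, standard scaling) are illustrative rather than prescribed by the dataset.

```python
# Sketch: predict Test_Score from Age, Gender, Study_Hours and Attendance(%).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("student_performance.csv")          # hypothetical file name
df = df.drop_duplicates().dropna(subset=["Test_Score"])

X = df[["Age", "Gender", "Study_Hours", "Attendance(%)"]]
y = df["Test_Score"]

numeric = ["Age", "Study_Hours", "Attendance(%)"]
categorical = ["Gender"]

prep = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([("prep", prep), ("reg", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
print("R^2 on held-out data:", model.fit(X_train, y_train).score(X_test, y_test))
```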
https://www.marketreportanalytics.com/privacy-policy
The size of the Pharmaceutical Sample Preprocessing System market was valued at USD XXX million in 2024 and is projected to reach USD XXX million by 2033, with an expected CAGR of XX% during the forecast period.
https://researchintelo.com/privacy-and-policy
According to our latest research, the Global RAP Preprocessing Moisture Reduction Systems market size was valued at $1.2 billion in 2024 and is projected to reach $2.8 billion by 2033, expanding at a robust CAGR of 9.7% during the forecast period of 2025–2033. The primary driver for this substantial growth is the increasing emphasis on sustainable infrastructure development and the rising adoption of recycled asphalt pavement (RAP) in road construction projects worldwide. As governments and private stakeholders prioritize cost-effective, environmentally responsible construction practices, the demand for advanced moisture reduction systems that enhance RAP quality and performance is set to accelerate significantly.
North America currently holds the largest market share in the global RAP Preprocessing Moisture Reduction Systems market, accounting for approximately 38% of the total market value in 2024. This dominance is attributed to the region’s mature road construction industry, stringent environmental regulations, and widespread adoption of asphalt recycling practices. The United States, in particular, has implemented robust policies that incentivize the use of RAP, driving investments in advanced moisture reduction technologies. Furthermore, established infrastructure, a strong network of asphalt recycling plants, and the presence of leading technology providers contribute to North America’s leadership position. The region’s focus on reducing greenhouse gas emissions and minimizing landfill waste further supports the rapid integration of moisture reduction systems into both new and existing asphalt production facilities.
The Asia Pacific region is expected to experience the fastest CAGR of 12.3% from 2025 to 2033, driven by rapid urbanization, expanding infrastructure projects, and growing government investments in sustainable road construction. Countries such as China, India, and Southeast Asian nations are witnessing a surge in road-building activities to support economic development and urban connectivity. The increasing awareness of the benefits of RAP, coupled with rising material costs and environmental concerns, is prompting both public and private sector players to adopt advanced preprocessing moisture reduction systems. Regional governments are also launching pilot projects and offering incentives to promote recycling technologies, which is anticipated to further boost market growth in the coming years.
Emerging economies in Latin America and the Middle East & Africa are gradually adopting RAP preprocessing moisture reduction systems, although market penetration remains in its nascent stages due to challenges such as limited technical expertise, budget constraints, and inconsistent regulatory frameworks. In these regions, the adoption of RAP technologies is often driven by large-scale infrastructure projects and international development funding. However, the lack of standardized policies and localized supply chains poses hurdles to widespread implementation. Nevertheless, as these economies continue to urbanize and prioritize cost-effective, sustainable construction methods, the long-term outlook for RAP moisture reduction systems remains positive, with significant growth potential as awareness and policy support increase.
| Attributes | Details |
|---|---|
| Report Title | RAP Preprocessing Moisture Reduction Systems Market Research Report 2033 |
| By Technology | Thermal Drying, Mechanical Dewatering, Chemical Treatment, Others |
| By Application | Road Construction, Asphalt Recycling Plants, Infrastructure Projects, Others |
| By System Type | Batch Systems, Continuous Systems |
| By End-User | Construction Companies, Municipalities, Contractors, Others |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study presents a comprehensive comparative analysis of Machine Learning (ML) and Deep Learning (DL) models for predicting Wind Turbine (WT) power output based on environmental variables such as temperature, humidity, wind speed, and wind direction. Along with Artificial Neural Network (ANN), Long Short-Term Memory (LSTM), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN), the following ML models were evaluated: Linear Regression (LR), Support Vector Regressor (SVR), Random Forest (RF), Extra Trees (ET), Adaptive Boosting (AdaBoost), Categorical Boosting (CatBoost), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM). Using a dataset of 40,000 observations, the models were assessed based on R-squared, Mean Absolute Error (MAE), and Root Mean Square Error (RMSE). ET achieved the highest performance among ML models, with an R-squared value of 0.7231 and an RMSE of 0.1512. Among DL models, ANN demonstrated the best performance, achieving an R-squared value of 0.7248 and an RMSE of 0.1516. The results show that DL models, especially ANN, performed slightly better than the best ML models, suggesting that they are better at modeling non-linear dependencies in multivariate data. Preprocessing techniques, including feature scaling and parameter tuning, improved model performance by enhancing data consistency and optimizing hyperparameters. When compared to previous benchmarks, the performance of both ANN and ET demonstrates significant predictive accuracy gains in WT power output forecasting. This study's novelty lies in directly comparing a diverse range of ML and DL algorithms while highlighting the potential of advanced computational approaches for renewable energy optimization.
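The paper's own code is not part of this record; as a hedged illustration of the evaluation protocol (R-squared, MAE, RMSE on a held-out split), the sketch below fits an Extra Trees regressor on assumed environmental feature columns.

```python
# Sketch: evaluate one of the compared models (Extra Trees) with the reported metrics.
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("wind_turbine.csv")     # hypothetical file with the 40,000 observations
X = df[["temperature", "humidity", "wind_speed", "wind_direction"]]
y = df["power_output"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
model = ExtraTreesRegressor(n_estimators=200, random_state=42)
pred = model.fit(X_train, y_train).predict(X_test)

print("R-squared:", r2_score(y_test, pred))
print("MAE:      ", mean_absolute_error(y_test, pred))
print("RMSE:     ", np.sqrt(mean_squared_error(y_test, pred)))
```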
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT: Data Mining techniques play an important role in the prediction of soil spatial distribution in systematic soil surveying, though existing methodologies still lack standardization and a full understanding of their capabilities. The aim of this work was to evaluate the performance of preprocessing procedures and supervised classification approaches for predicting map units from 1:100,000-scale conventional semi-detailed soil surveys. Sheets of the Brazilian National Cartographic System on the 1:50,000 scale, “Dois Córregos” (“Brotas” 1:100,000-scale sheet), “São Pedro” and “Laras” (“Piracicaba” 1:100,000-scale sheet) were used for developing models. Soil map information and predictive environmental covariates for the dataset were obtained from the semi-detailed soil survey of the state of São Paulo, from the Brazilian Institute of Geography and Statistics (IBGE) 1:50,000-scale topographic sheets and from the 1:750,000-scale geological map of the state of São Paulo. The target variable was a soil map unit of four types: local “soil unit” name and soil class at three hierarchical levels of the Brazilian System of Soil Classification (SiBCS). Different data preprocessing treatments and four algorithms all having different approaches were also tested. Results showed that composite soil map units were not adequate for the machine learning process. Class balance did not contribute to improving the performance of classifiers. Accuracy values of 78 % and a Kappa index of 0.67 were obtained after preprocessing procedures with Random Forest, the algorithm that performed best. Information from conventional map units of semi-detailed (4th order) 1:100,000 soil survey generated models with values for accuracy, precision, sensitivity, specificity and Kappa indexes that support their use in programs for systematic soil surveying.
https://www.marketresearchforecast.com/privacy-policy
The pharmaceutical sample preprocessing system market is booming, projected to reach $3.8 billion by 2033, driven by automation, personalized medicine, and high-throughput screening. Learn about market trends, key players (Roche, Menarini Diagnostics, Sekisui Diagnostics), and regional growth in this comprehensive analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The methodology is the core component of any research work: it describes the methods used to obtain the results. Here, the whole implementation is done in Python. The research work involves the following steps:
1. Acquire Personality Dataset
Kaggle hosts a collection of machine learning datasets and data generators used by the machine learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive on-line personality test constructed from IPIP items, and can be downloaded as a zip file from the link provided. It consists of two CSV files (test.csv & train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output, and the dataset has multivariate characteristics. Data preprocessing is performed to check for inconsistent behaviours or trends.
2. Data preprocessing
After data acquisition, the next step is to clean and preprocess the data. The available dataset has numerical features. The target value is a five-level personality label: serious, lively, responsible, dependable, and extraverted. The preprocessed dataset is split into training and testing sets by passing the feature values, target values, and test size to the train_test_split method of the scikit-learn package. After splitting, Logistic Regression and SVM models are trained on the training data, and the test data is used to estimate the accuracy of the trained models.
3. Feature Extraction
The following items were presented on one page and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.
EXT1 I am the life of the party.
EXT2 I don't talk a lot.
EXT3 I feel comfortable around people.
EXT4 I am quiet around strangers.
EST1 I get stressed out easily.
EST2 I get irritated easily.
EST3 I worry about things.
EST4 I change my mood a lot.
AGR1 I have a soft heart.
AGR2 I am interested in people.
AGR3 I insult people.
AGR4 I am not really interested in others.
CSN1 I am always prepared.
CSN2 I leave my belongings around.
CSN3 I follow a schedule.
CSN4 I make a mess of things.
OPN1 I have a rich vocabulary.
OPN2 I have difficulty understanding abstract ideas.
OPN3 I do not have a good imagination.
OPN4 I use difficult words.
4. Training the Model
Train/Test is a method to measure the accuracy of your model. It is called Train/Test because you split the data set into two sets, a training set and a testing set: 80% for training and 20% for testing. You train the model using the training set. In this work the dataset is trained using linear_model.LogisticRegression() & svm.SVC() from the sklearn package.
5. Personality Prediction Output
After training, the Logistic Regression and SVM models are evaluated on the test data using cohen_kappa_score and accuracy_score, as sketched below.
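The following is a condensed sketch of steps 2, 4 and 5 with scikit-learn; the file layout and the target column name are assumptions, not taken from the Kaggle files.

```python
# Sketch: 80/20 split, train Logistic Regression and SVM, report accuracy and kappa.
import pandas as pd
from sklearn import linear_model, svm
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("train.csv")                       # Kaggle training file
X = data.drop(columns=["Personality"])                # assumed target column name
y = data["Personality"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

for name, clf in [("LogisticRegression", linear_model.LogisticRegression(max_iter=1000)),
                  ("SVM", svm.SVC())]:
    pred = clf.fit(X_train, y_train).predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, pred),
          "kappa:", cohen_kappa_score(y_test, pred))
```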
Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This data set was collected from an online e-commerce site. It is very raw, junk-filled data, of the kind you encounter in industry data science projects.
Trying your image preprocessing skills on such data will help you understand the real-world problems and challenges in industry projects.
It contains junk, partial, as well as full jeans images.
You can perform different tasks on this data set, for example (a beginner-level sketch follows this list):
Beginner
- Resize all images to 48 x 48
- Convert all images to grayscale
Intermediate
- Perform image masking on all images
Advanced
- Try to cluster the jeans images: by color, and by full, partial and junk jeans images
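A minimal sketch of the beginner tasks with Pillow is given below; the folder names are assumptions.

```python
# Sketch: resize every image to 48x48 and convert it to grayscale.
from pathlib import Path
from PIL import Image

src = Path("jeans_images")            # hypothetical raw image folder
dst = Path("jeans_preprocessed")
dst.mkdir(exist_ok=True)

for path in src.glob("*.jpg"):
    img = Image.open(path).convert("L")       # grayscale
    img = img.resize((48, 48))                # 48 x 48 pixels
    img.save(dst / path.name)
```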
https://www.datainsightsmarket.com/privacy-policy
Explore the dynamic Pharmaceutical Sample Preprocessing System market with insights on growth drivers, trends, restraints, and regional analysis. Discover market size projections and key players shaping the future of drug discovery and diagnostics.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global RAP Preprocessing Moisture Reduction Systems market size in 2024 stands at USD 1.28 billion, reflecting robust demand from the infrastructure and road construction sectors. The market is exhibiting a healthy growth trajectory, registering a CAGR of 6.9% from 2025 to 2033. By 2033, the market is forecasted to reach USD 2.39 billion, underpinned by ongoing advancements in moisture reduction technologies and increasing emphasis on sustainable construction practices. The primary driver fueling this growth is the surge in global infrastructure investments and the growing need for efficient asphalt recycling, as governments and private entities focus on cost-effective and environmentally conscious solutions.
One of the key growth factors propelling the RAP Preprocessing Moisture Reduction Systems market is the escalating focus on sustainable road construction and rehabilitation. The use of Reclaimed Asphalt Pavement (RAP) is increasingly favored due to its environmental benefits, such as reduced need for virgin materials, lower greenhouse gas emissions, and minimized landfill use. However, RAP often contains significant moisture, which can hinder its reuse and impact the quality of the final asphalt mix. As a result, the demand for advanced moisture reduction systems has surged, with technologies that efficiently remove moisture from RAP becoming critical for ensuring the durability and performance of recycled asphalt. This trend is further amplified by stringent government regulations and policies aimed at promoting green construction practices, thereby driving adoption across both developed and emerging markets.
Another significant factor contributing to market expansion is the rapid pace of technological innovation within the RAP preprocessing sector. Companies are investing heavily in research and development to introduce systems that deliver higher energy efficiency, faster processing times, and greater reliability. The integration of automation, real-time monitoring, and data analytics into moisture reduction systems is enabling construction companies and recycling facilities to optimize their operations, reduce operational costs, and improve output quality. Additionally, the availability of modular and scalable solutions is making it easier for end-users to customize their systems according to project size and specific requirements, further broadening the market’s appeal across diverse application segments.
The growing emphasis on infrastructure modernization and maintenance, particularly in regions with aging road networks, is also catalyzing market growth. Governments worldwide are allocating substantial budgets to upgrade existing transportation infrastructure, with a keen focus on sustainability and cost efficiency. The ability of RAP preprocessing moisture reduction systems to enhance the performance and longevity of recycled asphalt is making them indispensable tools for infrastructure maintenance and rehabilitation projects. Furthermore, collaborations between public agencies and private companies are accelerating the adoption of these systems, as stakeholders seek to maximize the value of available resources while minimizing environmental impact.
From a regional perspective, Asia Pacific is emerging as the dominant market for RAP Preprocessing Moisture Reduction Systems, driven by rapid urbanization, expanding road networks, and significant investments in infrastructure development. North America and Europe are also witnessing substantial growth, supported by strong regulatory frameworks and a mature construction industry. Meanwhile, Latin America and the Middle East & Africa are gradually increasing their adoption rates, spurred by rising awareness of the benefits of asphalt recycling and the need to address infrastructure deficits. The regional outlook remains highly favorable, with each region presenting unique opportunities and challenges that are shaping the overall trajectory of the global market.
The RAP Preprocessing Moisture Reduction Systems market is segmented by technology into Thermal Drying, Mechanical Dewatering, Chemical Treatment, and Others, each offering distinct advantages and serving specific operational needs. Thermal drying remains the most widely adopted technology, accounting for a signif
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package accompanies the paper "How Do Machine Learning Models Change?" In this study, we conducted a comprehensive analysis of over 200,000 commits and 1,200 releases across more than 50,000 models on the Hugging Face (HF) platform. Our goal was to understand how machine learning (ML) models evolve over time by classifying commit types based on an extended ML change taxonomy and analyzing patterns in commit and release activities using Bayesian networks.
Our research addresses three main aspects:
This replication package contains all the necessary code, datasets, and documentation to reproduce the results presented in the paper.
We collected data from the Hugging Face platform using the Hugging Face Hub API and the `HfApi` class. The data extraction was performed on November 6th, 2023. The collected data includes:
To enrich the commit data with detailed file change information, we integrated the PyDriller framework within the HFCommunity dataset.
Commit Diffs
We computed the differences between commits for key files, specifically JSON configuration files (e.g., `config.json`). For each commit that modifies these files, we compared the changes with the previous commit affecting the same file to identify added, deleted, and updated keys.
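The replication code for this step lives in the notebooks listed further below; as a standalone illustration of the key-level diff idea, the following sketch compares two hypothetical `config.json` snapshots.

```python
# Sketch: identify added, deleted, and updated top-level keys between two JSON configs.
import json

def diff_config(old_text: str, new_text: str) -> dict:
    old, new = json.loads(old_text), json.loads(new_text)
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "updated": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

previous = '{"hidden_size": 768, "num_layers": 12, "dropout": 0.1}'
current = '{"hidden_size": 1024, "num_layers": 12, "vocab_size": 50257}'
print(diff_config(previous, current))
# {'added': ['vocab_size'], 'deleted': ['dropout'], 'updated': ['hidden_size']}
```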
Commit Classification
We classified each commit according to Bhatia et al.'s ML change taxonomy using the Gemini 1.5 Flash Large Language Model (LLM). This classification, using LLMs to apply Bhatia et al.'s taxonomy on a large-scale ML repository, is one of the main contributions of our paper. We ensured the correctness of the classification by achieving a Cohen's kappa coefficient ≥ 0.9 through iterative validation. In addition, we performed classification based on Swanson's categories using a simpler neural network approach, following methods from prior work. This classification has less impact compared to the detailed classification using Bhatia et al.'s taxonomy.
Model Metadata
We extracted detailed metadata from the model files of selected releases, focusing on attributes such as the number of parameters, tensor shapes, etc. We also calculated the differences between the metadata of successive releases.
The replication package is organized as follows:
- code/: Contains the Jupyter notebooks with the data extraction, preprocessing, analysis, and model training scripts.
  - HFTotalExtraction.ipynb: Script for collecting data on the entire Hugging Face platform.
  - HFReleasesExtraction.ipynb: Script for collecting data on models that contain releases.
  - HFTotalPreprocessing.ipynb: Preprocesses the dataset obtained from `HFTotalExtraction.ipynb`.
  - HFCommitsPreprocessing.ipynb: Processes commit data, including:
  - HFReleasesPreprocessing.ipynb: Processes release data, including classification and preparation for analysis.
  - RQ1_Analysis.ipynb: Analysis for RQ1.
  - RQ2_Analysis.ipynb: Analysis for RQ2.
  - RQ3_Analysis.ipynb: Analysis for RQ3.
- datasets/: Contains the raw, processed, and manually curated datasets used for the analysis.
  - HFCommits_50K_RANDOM.csv: Contains the commits of 50,000 randomly sampled models from HF with the classification based on Bhatia et al.'s taxonomy.
  - HFCommits_MultipleCommits.csv: Contains the commits of 10,000 models with at least 10 commits, used for analyzing commit sequences.
  - HFReleases.csv: Contains over 1,200 releases from 127 models, classified using Bhatia et al.'s taxonomy.
  - model_metadata_with_diff.csv: Contains the metadata of releases from 27 models, including differences between successive releases.
  - HF_Total_Raw.csv: Contains a snapshot of the entire Hugging Face platform with over 380,000 models, as obtained from HFTotalExtraction.ipynb.
  - HF_Total_Preprocessed.csv: Contains the preprocessed version of the entire HF dataset, as obtained from HFTotalPreprocessing.ipynb. This dataset is needed for the commits preprocessing.
- metadata/: Contains the tags_metadata.yaml file used during preprocessing.
- models/: Contains the model trained to classify commit messages into corrective, perfective, and adaptive types based on Swanson's traditional software maintenance categories.
- requirements.txt: Lists the required Python packages to set up the environment and run the code.
```bash
python -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate
```
```bash
pip install -r requirements.txt
```
- LLM Usage: The classification of commits using the Gemini 1.5 Flash LLM requires access to the model. Ensure you have the necessary permissions and API keys to use the model.
- Computational Resources: Processing large datasets and running Bayesian network analyses may require significant computational resources. It is recommended to use a machine with ample memory and processing power.
- Reproducing Results: The auxiliary datasets included can be used to reproduce specific parts of the code without re-running the entire data collection and preprocessing pipeline.
Contact: If you have any questions or encounter issues, please contact the authors at joel.castano@upc.edu.
This README provides detailed instructions and information to reproduce and understand the analyses performed in the paper. If you find this package useful, please cite our work.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Crop cultivar identification is fundamental for agricultural research, industry and policies. This paper investigates the feasibility of using visible/near infrared hyperspectral data collected with a miniaturized NIR spectrometer to identify cultivars of barley, chickpea and sorghum in the context of Ethiopia. A total of 2650 grains of barley, chickpea and sorghum cultivars were scanned using the SCIO, a recently released miniaturized NIR spectrometer. The effects of data preprocessing techniques and the choice of machine learning algorithm on distinguishing cultivars are further evaluated. Predictive multiclass models of 24 barley cultivars, 19 chickpea cultivars and 10 sorghum cultivars delivered accuracies of 89%, 96% and 87% on a hold-out sample. The Support Vector Machine (SVM) and Partial least squares discriminant analysis (PLS-DA) algorithms consistently outperformed other algorithms. Several cultivars, believed to be widely adopted in Ethiopia, were identified with perfect accuracy. These results advance the discussion on cultivar identification survey methods by demonstrating that miniaturized NIR spectrometers represent a low-cost, rapid and viable tool. We further discuss the potential utility of the method for adoption surveys, field-scale agronomic studies, socio-economic impact assessments and value chain quality control. Finally, we provide a free tool for R to easily carry out crop cultivar identification and measure uncertainty based on spectral data.
https://cubig.ai/store/terms-of-service
1) Data Introduction • The Fruit Classification Dataset is a beginner classification dataset configured to classify fruit types based on fruit name, color, and weight information.
2) Data Utilization (1) The Fruit Classification Dataset has the following characteristics: • This dataset consists of a total of three columns: the categorical variable Color, the continuous variable Weight, and the target class Fruit, allowing you to preprocess categorical and numerical variables when training classification models. (2) The Fruit Classification Dataset can be used to: • Model training and evaluation: it can be used as educational and research data to compare and evaluate the performance of various machine learning classification algorithms using the color and weight features. • Data preprocessing practice: it can serve as hands-on data for learning basic preprocessing and feature engineering steps such as categorical variable encoding and continuous variable scaling (see the sketch below).
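As a hands-on illustration of the preprocessing just described (one-hot encoding of Color, scaling of Weight) feeding a simple classifier, the following sketch builds a small scikit-learn pipeline; the CSV file name is an assumption.

```python
# Sketch: encode the categorical Color, scale the continuous Weight, classify Fruit.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("fruit_classification.csv")     # columns: Color, Weight, Fruit
X, y = df[["Color", "Weight"]], df["Fruit"]

prep = ColumnTransformer([
    ("color", OneHotEncoder(handle_unknown="ignore"), ["Color"]),
    ("weight", StandardScaler(), ["Weight"]),
])
model = Pipeline([("prep", prep), ("clf", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0, stratify=y)
print("Accuracy:", model.fit(X_train, y_train).score(X_test, y_test))
```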
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Algorithms proposed in computational pathology can automatically analyze digitized tissue samples of histopathological images to help diagnose diseases. Tissue samples are scanned at high resolution and usually saved as images with several magnification levels, namely whole slide images (WSIs). Convolutional neural networks (CNNs) represent the state-of-the-art computer vision methods targeting the analysis of histopathology images, aiming for detection, classification and segmentation. However, the development of CNNs that work with multi-scale images such as WSIs is still an open challenge. The image characteristics and the CNN properties impose architecture designs that are not trivial. Therefore, single-scale CNN architectures are still often used. This paper presents Multi_Scale_Tools, a library aiming to facilitate exploiting the multi-scale structure of WSIs. Multi_Scale_Tools currently includes four components: a pre-processing component, a scale detector, a multi-scale CNN for classification and a multi-scale CNN for segmentation of the images. The pre-processing component includes methods to extract patches at several magnification levels. The scale detector identifies the magnification level of images that do not contain this information, such as images from the scientific literature. The multi-scale CNNs are trained combining features and predictions that originate from different magnification levels. The components are developed using private datasets, including colon and breast cancer tissue samples. They are tested on private and public external data sources, such as The Cancer Genome Atlas (TCGA). The results of the library demonstrate its effectiveness and applicability. The scale detector accurately predicts multiple levels of image magnification and generalizes well to independent external data. The multi-scale CNNs outperform the single-magnification CNN for both classification and segmentation tasks. The code is developed in Python and will be made publicly available upon publication. It aims to be easy to use and easy to improve with additional functions.