Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The motivation for preprocessing large-scale CAD models stems from the limitations of assembly-by-disassembly approaches. Assembly-by-disassembly is only suitable for assemblies with a small number of parts (n_{parts} < 22). When dealing with large-scale products of high complexity, however, the CAD models may not contain feasible subassemblies (e.g. with connected and interference-free parts) and have too many parts to be processed with assembly-by-disassembly. Product designers' preferences during the design phase may be ill-suited to assembly-by-disassembly processing, because subassembly feasibility and the number of parts per subassembly are not considered explicitly. An automated preprocessing approach is therefore proposed that splits the model into manageable partitions using community detection. This enables parallelised, efficient and accurate assembly-by-disassembly of large-scale CAD models. However, applying community detection methods to automatically split CAD models into smaller subassemblies is a new concept, and its suitability for assembly sequence planning (ASP) still needs to be investigated. Therefore, the following underlying research question will be answered in these experiments:
Underlying research question 2: Can automated preprocessing increase the suitability of CAD-based assembly-by-disassembly for large-scale products?
A hypothesis is formulated to answer this research question, which will be utilised to design experiments for hypothesis testing.
Hypothesis 2: Community detection algorithms can be applied to automatically split large-scale assemblies into suitable candidates for CAD-based AND/OR graph generation.
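The partitioning code itself is not part of this description, so the following is only a minimal sketch of the idea under stated assumptions: the assembly is represented as a part-contact graph (the part names are hypothetical) and networkx's modularity-based community detection stands in for whichever algorithm the experiments actually use.

```python
# Illustrative sketch only (not the authors' implementation): partition a
# part-contact graph of an assembly into candidate subassemblies.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Nodes are parts; edges are contacts/liaisons between parts (hypothetical data).
contacts = [
    ("bolt_1", "bracket"), ("bracket", "frame"), ("frame", "panel_a"),
    ("panel_a", "panel_b"), ("frame", "motor"), ("motor", "motor_mount"),
]
G = nx.Graph(contacts)

# Each detected community is a candidate subassembly that should stay below
# the part-count limit of assembly-by-disassembly (n_parts < 22).
for i, parts in enumerate(greedy_modularity_communities(G)):
    print(f"Candidate subassembly {i}: {sorted(parts)}")
```

Each detected community could then be processed by assembly-by-disassembly independently, which is the parallelisation motivated above.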
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains information about three species of Iris flowers: Setosa, Versicolour, and Virginica. It is a well-known dataset in the machine learning and statistics communities, often used for classification and clustering tasks. Each row represents a sample of an Iris flower, with measurements of its physical attributes and the corresponding target label.
Dataset Features: sepal length (cm): The length of the sepal in centimeters. sepal width (cm): The width of the sepal in centimeters. petal length (cm): The length of the petal in centimeters. petal width (cm): The width of the petal in centimeters. target: A numerical label (0, 1, or 2) indicating the flower species: 0: Setosa 1: Versicolour 2: Virginica
Purpose: This dataset can be used for: Supervised learning tasks, particularly classification. Exploratory data analysis and visualization of flower attributes. Understanding the application of machine learning algorithms like decision trees, KNN, and support vector machines.
Source: This is a modified version of the classic Iris flower dataset, often used for beginner-level machine learning projects and demonstrations.
Potential Use Cases: Training machine learning models for flower classification. Practicing data preprocessing, feature scaling, and visualization techniques. Understanding the relationships between features through scatter plots and correlation analysis.
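As a quick illustration of the classification use case listed above, the sketch below trains a k-nearest-neighbours classifier with scikit-learn; it uses sklearn's built-in copy of Iris as a stand-in for the CSV described here.

```python
# Minimal classification example on the Iris data (4 features, target in {0, 1, 2}).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```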
Python 2 Jupyter notebook that aggregates sub-daily time series observations up to a daily time scale. The code was originally written to aggregate data stored in the sqlite database in this resource: https://www.hydroshare.org/resource/9e1b23607ac240588ba50d6b5b9a49b5/
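The notebook itself is not reproduced here; the following is a hedged Python 3/pandas sketch of the same aggregation idea. The sqlite table and column names are assumptions, not taken from the linked resource.

```python
# Sketch: aggregate sub-daily observations stored in a sqlite database up to daily values.
import sqlite3
import pandas as pd

conn = sqlite3.connect("observations.sqlite")                    # hypothetical database file
df = pd.read_sql_query(
    "SELECT ValueDateTime, DataValue FROM TimeSeriesResultValues",  # assumed schema
    conn, parse_dates=["ValueDateTime"])

# Resample the sub-daily series to a daily time scale (mean per day).
daily = (df.set_index("ValueDateTime")["DataValue"]
           .resample("D")
           .mean())
print(daily.head())
```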
Empirically analyzing household behavior usually relies on informal data preprocessing. That is, before the estimation, observations are preselected to obtain a sufficiently homogeneous subset of data. In the context of estimating equivalence scales for household income, we use matching techniques and balance checking at this initial stage. This can be interpreted as a non-parametric approach to preprocessing data that formalizes informal procedures. We illustrate this using German micro-data on household expenditure, showing that matching leads to results which are more stable with respect to model specification and is especially useful when applied to specific subgroups, such as low-income households. The files provided here contain the code (in "R") which is needed to replicate our analyses.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the USI repository.
The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.
The preprocessing steps include:
One-hot-encoding of categorical values
Imputation of missing values using knn-imputer with k=1
Standard scaling of ordinal attributes
Note: we assume a scenario in which the test set is available before training (every attribute besides the target, "income"); therefore we combine the train and test sets before preprocessing.
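A minimal sketch of these steps with scikit-learn is shown below; it is not the notebook's exact code, and the raw file names are assumptions.

```python
# Hedged sketch of the listed preprocessing steps; file names and column
# handling are assumptions, not taken from adult_preprocessing.ipynb.
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("adult_train_raw.csv")   # hypothetical raw inputs
test = pd.read_csv("adult_test_raw.csv")

# Combine train and test before preprocessing (target "income" set aside).
full = pd.concat([train, test], keys=["train", "test"])
income = full.pop("income")
num_cols = full.select_dtypes("number").columns

# 1) One-hot-encoding of categorical values.
full = pd.get_dummies(full, columns=full.select_dtypes("object").columns)

# 2) Imputation of missing values with a 1-nearest-neighbour imputer (k=1).
full = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(full),
                    columns=full.columns, index=full.index)

# 3) Standard scaling of the ordinal/numeric attributes.
full[num_cols] = StandardScaler().fit_transform(full[num_cols])

# Split back into the preprocessed train and test sets.
train_pre, test_pre = full.loc["train"], full.loc["test"]
```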
This dataset comprises an array of Mel Frequency Cepstral Coefficients (MFCCs) that have undergone feature scaling, representing a variety of human actions. Feature scaling, or data normalization, is a preprocessing technique used to standardize the range of features in the dataset. For MFCCs, this process helps ensure all coefficients contribute equally to the learning process, preventing features with larger scales from overshadowing those with smaller scales.
In this dataset, the audio signals correspond to diverse human actions such as walking, running, jumping, and dancing. The MFCCs are calculated via a series of signal processing stages, which capture key characteristics of the audio signal in a manner that closely aligns with human auditory perception. The coefficients are then standardized or scaled using methods such as MinMax Scaling or Standardization, thereby normalizing their range. Each normalized MFCC vector corresponds to a segment of the audio signal.
The dataset is meticulously designed for tasks including human action recognition, classification, segmentation, and detection based on auditory cues. It serves as an essential resource for training and evaluating machine learning models focused on interpreting human actions from audio signals. This dataset proves particularly beneficial for researchers and practitioners in fields such as signal processing, computer vision, and machine learning, who aim to craft algorithms for human action analysis leveraging audio signals.
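The dataset description does not include extraction code; the sketch below only illustrates the described normalization step on MFCCs, assuming librosa for feature extraction and scikit-learn's MinMaxScaler, with a hypothetical audio clip.

```python
# Sketch: MFCC extraction followed by Min-Max scaling (one vector per frame).
import librosa
from sklearn.preprocessing import MinMaxScaler

signal, sr = librosa.load("walking_clip.wav", sr=None)       # hypothetical clip

# MFCCs: one 13-dimensional coefficient vector per audio frame.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T    # shape (frames, 13)

# Scale each coefficient to [0, 1] so no coefficient dominates learning.
mfcc_scaled = MinMaxScaler().fit_transform(mfcc)
print(mfcc_scaled.shape, mfcc_scaled.min(), mfcc_scaled.max())
```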
Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.
The dataset is suitable for laboratory exercises, assignments, and demonstrations of key preprocessing techniques such as handling missing values, removing duplicates, and treating outliers. The columns are described below:
| Column Name | Description |
|---|---|
| Student_ID | Unique identifier for each student (e.g., S0001, S0002, …) |
| Age | Age of the student (between 18 and 25 years) |
| Gender | Gender of the student (Male/Female) |
| Study_Hours | Average number of study hours per day (contains missing values and outliers) |
| Attendance(%) | Percentage of class attendance (contains missing values) |
| Test_Score | Final exam score (0–100 scale) |
| Grade | Letter grade derived from test scores (F, C, B, A, A+) |
Suggested task (Test_Score prediction): predict the student's test score from study hours, attendance percentage, age, and gender.
🧠 Sample Features: X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)'] y = ['Test_Score']
You can use standard regression models for this task (see the sketch below) and analyze feature influence using correlation or SHAP/LIME explainability.
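A minimal sketch of this regression task follows; the CSV file name is an assumption, and the preprocessing choices (median imputation, one-hot encoding of Gender, standard scaling) are illustrative rather than prescribed by the dataset.

```python
# Sketch: predict Test_Score from Age, Gender, Study_Hours and Attendance(%).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("student_performance.csv")          # hypothetical file name
df = df.drop_duplicates().dropna(subset=["Test_Score"])

X = df[["Age", "Gender", "Study_Hours", "Attendance(%)"]]
y = df["Test_Score"]

numeric = ["Age", "Study_Hours", "Attendance(%)"]
categorical = ["Gender"]

prep = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([("prep", prep), ("reg", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
print("R^2 on held-out data:", model.fit(X_train, y_train).score(X_test, y_test))
```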
https://www.marketreportanalytics.com/privacy-policy
The size of the Pharmaceutical Sample Preprocessing System market was valued at USD XXX million in 2024 and is projected to reach USD XXX million by 2033, with an expected CAGR of XX% during the forecast period.
https://researchintelo.com/privacy-and-policy
According to our latest research, the Global RAP Preprocessing Moisture Reduction Systems market size was valued at $1.2 billion in 2024 and is projected to reach $2.8 billion by 2033, expanding at a robust CAGR of 9.7% during the forecast period of 2025–2033. The primary driver for this substantial growth is the increasing emphasis on sustainable infrastructure development and the rising adoption of recycled asphalt pavement (RAP) in road construction projects worldwide. As governments and private stakeholders prioritize cost-effective, environmentally responsible construction practices, the demand for advanced moisture reduction systems that enhance RAP quality and performance is set to accelerate significantly.
North America currently holds the largest market share in the global RAP Preprocessing Moisture Reduction Systems market, accounting for approximately 38% of the total market value in 2024. This dominance is attributed to the region’s mature road construction industry, stringent environmental regulations, and widespread adoption of asphalt recycling practices. The United States, in particular, has implemented robust policies that incentivize the use of RAP, driving investments in advanced moisture reduction technologies. Furthermore, established infrastructure, a strong network of asphalt recycling plants, and the presence of leading technology providers contribute to North America’s leadership position. The region’s focus on reducing greenhouse gas emissions and minimizing landfill waste further supports the rapid integration of moisture reduction systems into both new and existing asphalt production facilities.
The Asia Pacific region is expected to experience the fastest CAGR of 12.3% from 2025 to 2033, driven by rapid urbanization, expanding infrastructure projects, and growing government investments in sustainable road construction. Countries such as China, India, and Southeast Asian nations are witnessing a surge in road-building activities to support economic development and urban connectivity. The increasing awareness of the benefits of RAP, coupled with rising material costs and environmental concerns, is prompting both public and private sector players to adopt advanced preprocessing moisture reduction systems. Regional governments are also launching pilot projects and offering incentives to promote recycling technologies, which is anticipated to further boost market growth in the coming years.
Emerging economies in Latin America and the Middle East & Africa are gradually adopting RAP preprocessing moisture reduction systems, although market penetration remains in its nascent stages due to challenges such as limited technical expertise, budget constraints, and inconsistent regulatory frameworks. In these regions, the adoption of RAP technologies is often driven by large-scale infrastructure projects and international development funding. However, the lack of standardized policies and localized supply chains poses hurdles to widespread implementation. Nevertheless, as these economies continue to urbanize and prioritize cost-effective, sustainable construction methods, the long-term outlook for RAP moisture reduction systems remains positive, with significant growth potential as awareness and policy support increase.
| Attributes | Details |
|---|---|
| Report Title | RAP Preprocessing Moisture Reduction Systems Market Research Report 2033 |
| By Technology | Thermal Drying, Mechanical Dewatering, Chemical Treatment, Others |
| By Application | Road Construction, Asphalt Recycling Plants, Infrastructure Projects, Others |
| By System Type | Batch Systems, Continuous Systems |
| By End-User | Construction Companies, Municipalities, Contractors, Others |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study presents a comprehensive comparative analysis of Machine Learning (ML) and Deep Learning (DL) models for predicting Wind Turbine (WT) power output based on environmental variables such as temperature, humidity, wind speed, and wind direction. Along with Artificial Neural Network (ANN), Long Short-Term Memory (LSTM), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN), the following ML models were evaluated: Linear Regression (LR), Support Vector Regressor (SVR), Random Forest (RF), Extra Trees (ET), Adaptive Boosting (AdaBoost), Categorical Boosting (CatBoost), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM). Using a dataset of 40,000 observations, the models were assessed based on R-squared, Mean Absolute Error (MAE), and Root Mean Square Error (RMSE). ET achieved the highest performance among ML models, with an R-squared value of 0.7231 and an RMSE of 0.1512. Among DL models, ANN demonstrated the best performance, achieving an R-squared value of 0.7248 and an RMSE of 0.1516. The results show that DL models, especially ANN, performed slightly better than the best ML models, suggesting that they are better at modeling non-linear dependencies in multivariate data. Preprocessing techniques, including feature scaling and parameter tuning, improved model performance by enhancing data consistency and optimizing hyperparameters. When compared to previous benchmarks, the performance of both ANN and ET demonstrates significant predictive accuracy gains in WT power output forecasting. This study's novelty lies in directly comparing a diverse range of ML and DL algorithms while highlighting the potential of advanced computational approaches for renewable energy optimization.
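The paper's own code is not part of this record; as a hedged illustration of the evaluation protocol (R-squared, MAE, RMSE on a held-out split), the sketch below fits an Extra Trees regressor on assumed environmental feature columns.

```python
# Sketch: evaluate one of the compared models (Extra Trees) with the reported metrics.
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("wind_turbine.csv")     # hypothetical file with the 40,000 observations
X = df[["temperature", "humidity", "wind_speed", "wind_direction"]]
y = df["power_output"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
model = ExtraTreesRegressor(n_estimators=200, random_state=42)
pred = model.fit(X_train, y_train).predict(X_test)

print("R-squared:", r2_score(y_test, pred))
print("MAE:      ", mean_absolute_error(y_test, pred))
print("RMSE:     ", np.sqrt(mean_squared_error(y_test, pred)))
```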
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT: Data Mining techniques play an important role in the prediction of soil spatial distribution in systematic soil surveying, though existing methodologies still lack standardization and a full understanding of their capabilities. The aim of this work was to evaluate the performance of preprocessing procedures and supervised classification approaches for predicting map units from 1:100,000-scale conventional semi-detailed soil surveys. Sheets of the Brazilian National Cartographic System on the 1:50,000 scale, “Dois Córregos” (“Brotas” 1:100,000-scale sheet), “São Pedro” and “Laras” (“Piracicaba” 1:100,000-scale sheet) were used for developing models. Soil map information and predictive environmental covariates for the dataset were obtained from the semi-detailed soil survey of the state of São Paulo, from the Brazilian Institute of Geography and Statistics (IBGE) 1:50,000-scale topographic sheets and from the 1:750,000-scale geological map of the state of São Paulo. The target variable was a soil map unit of four types: local “soil unit” name and soil class at three hierarchical levels of the Brazilian System of Soil Classification (SiBCS). Different data preprocessing treatments and four algorithms all having different approaches were also tested. Results showed that composite soil map units were not adequate for the machine learning process. Class balance did not contribute to improving the performance of classifiers. Accuracy values of 78 % and a Kappa index of 0.67 were obtained after preprocessing procedures with Random Forest, the algorithm that performed best. Information from conventional map units of semi-detailed (4th order) 1:100,000 soil survey generated models with values for accuracy, precision, sensitivity, specificity and Kappa indexes that support their use in programs for systematic soil surveying.
https://www.marketresearchforecast.com/privacy-policy
The pharmaceutical sample preprocessing system market is booming, projected to reach $3.8 billion by 2033, driven by automation, personalized medicine, and high-throughput screening. Learn about market trends, key players (Roche, Menarini Diagnostics, Sekisui Diagnostics), and regional growth in this comprehensive analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The methodology is the core component of any research work: it describes the methods used to obtain the results. Here, the whole implementation is done in Python. The research work involves the following steps:
1. Acquire Personality Dataset
Kaggle hosts a collection of machine learning datasets and data generators used by the machine learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive on-line personality test constructed from IPIP items, and can be downloaded as a zip file from the link provided. It consists of two CSV files (test.csv & train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output, and the dataset has multivariate characteristics. Data preprocessing is performed to check for inconsistent behaviours or trends.
2. Data preprocessing
After data acquisition, the next step is to clean and preprocess the data. The available dataset has numerical features. The target value is a five-level personality label: serious, lively, responsible, dependable, and extraverted. The preprocessed dataset is split into training and testing sets by passing the feature values, target values, and test size to the train_test_split method of the scikit-learn package. After splitting, Logistic Regression and SVM models are trained on the training data, and the test data is used to estimate the accuracy of the trained models.
3. Feature Extraction
The following items were presented on one page and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.
EXT1 I am the life of the party.
EXT2 I don't talk a lot.
EXT3 I feel comfortable around people.
EXT4 I am quiet around strangers.
EST1 I get stressed out easily.
EST2 I get irritated easily.
EST3 I worry about things.
EST4 I change my mood a lot.
AGR1 I have a soft heart.
AGR2 I am interested in people.
AGR3 I insult people.
AGR4 I am not really interested in others.
CSN1 I am always prepared.
CSN2 I leave my belongings around.
CSN3 I follow a schedule.
CSN4 I make a mess of things.
OPN1 I have a rich vocabulary.
OPN2 I have difficulty understanding abstract ideas.
OPN3 I do not have a good imagination.
OPN4 I use difficult words.
4. Training the Model
Train/Test is a method to measure the accuracy of your model. It is called Train/Test because you split the data set into two sets, a training set and a testing set: 80% for training and 20% for testing. You train the model using the training set. In this work the dataset is trained using linear_model.LogisticRegression() & svm.SVC() from the sklearn package.
5. Personality Prediction Output
After training, the Logistic Regression and SVM models are evaluated on the test data using cohen_kappa_score and accuracy_score, as sketched below.
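The following is a condensed sketch of steps 2, 4 and 5 with scikit-learn; the file layout and the target column name are assumptions, not taken from the Kaggle files.

```python
# Sketch: 80/20 split, train Logistic Regression and SVM, report accuracy and kappa.
import pandas as pd
from sklearn import linear_model, svm
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("train.csv")                       # Kaggle training file
X = data.drop(columns=["Personality"])                # assumed target column name
y = data["Personality"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

for name, clf in [("LogisticRegression", linear_model.LogisticRegression(max_iter=1000)),
                  ("SVM", svm.SVC())]:
    pred = clf.fit(X_train, y_train).predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, pred),
          "kappa:", cohen_kappa_score(y_test, pred))
```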
Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This data set was collected from an online e-commerce site. It is very raw, junk-filled data, of the kind you encounter in industry data science projects.
Trying your image preprocessing skills on such data will help you understand the real-world problems and challenges in industry projects.
It contains junk, partial, as well as full jeans images.
You can perform different tasks on this data set, for example (a beginner-level sketch follows this list):
Beginner
- Resize all images to 48 x 48
- Convert all images to grayscale
Intermediate
- Perform image masking on all images
Advanced
- Try to cluster the jeans images: by color, and by full, partial and junk jeans images
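A minimal sketch of the beginner tasks with Pillow is given below; the folder names are assumptions.

```python
# Sketch: resize every image to 48x48 and convert it to grayscale.
from pathlib import Path
from PIL import Image

src = Path("jeans_images")            # hypothetical raw image folder
dst = Path("jeans_preprocessed")
dst.mkdir(exist_ok=True)

for path in src.glob("*.jpg"):
    img = Image.open(path).convert("L")       # grayscale
    img = img.resize((48, 48))                # 48 x 48 pixels
    img.save(dst / path.name)
```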
https://www.datainsightsmarket.com/privacy-policy
Explore the dynamic Pharmaceutical Sample Preprocessing System market with insights on growth drivers, trends, restraints, and regional analysis. Discover market size projections and key players shaping the future of drug discovery and diagnostics.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global RAP Preprocessing Moisture Reduction Systems market size in 2024 stands at USD 1.28 billion, reflecting robust demand from the infrastructure and road construction sectors. The market is exhibiting a healthy growth trajectory, registering a CAGR of 6.9% from 2025 to 2033. By 2033, the market is forecasted to reach USD 2.39 billion, underpinned by ongoing advancements in moisture reduction technologies and increasing emphasis on sustainable construction practices. The primary driver fueling this growth is the surge in global infrastructure investments and the growing need for efficient asphalt recycling, as governments and private entities focus on cost-effective and environmentally conscious solutions.
One of the key growth factors propelling the RAP Preprocessing Moisture Reduction Systems market is the escalating focus on sustainable road construction and rehabilitation. The use of Reclaimed Asphalt Pavement (RAP) is increasingly favored due to its environmental benefits, such as reduced need for virgin materials, lower greenhouse gas emissions, and minimized landfill use. However, RAP often contains significant moisture, which can hinder its reuse and impact the quality of the final asphalt mix. As a result, the demand for advanced moisture reduction systems has surged, with technologies that efficiently remove moisture from RAP becoming critical for ensuring the durability and performance of recycled asphalt. This trend is further amplified by stringent government regulations and policies aimed at promoting green construction practices, thereby driving adoption across both developed and emerging markets.
Another significant factor contributing to market expansion is the rapid pace of technological innovation within the RAP preprocessing sector. Companies are investing heavily in research and development to introduce systems that deliver higher energy efficiency, faster processing times, and greater reliability. The integration of automation, real-time monitoring, and data analytics into moisture reduction systems is enabling construction companies and recycling facilities to optimize their operations, reduce operational costs, and improve output quality. Additionally, the availability of modular and scalable solutions is making it easier for end-users to customize their systems according to project size and specific requirements, further broadening the market’s appeal across diverse application segments.
The growing emphasis on infrastructure modernization and maintenance, particularly in regions with aging road networks, is also catalyzing market growth. Governments worldwide are allocating substantial budgets to upgrade existing transportation infrastructure, with a keen focus on sustainability and cost efficiency. The ability of RAP preprocessing moisture reduction systems to enhance the performance and longevity of recycled asphalt is making them indispensable tools for infrastructure maintenance and rehabilitation projects. Furthermore, collaborations between public agencies and private companies are accelerating the adoption of these systems, as stakeholders seek to maximize the value of available resources while minimizing environmental impact.
From a regional perspective, Asia Pacific is emerging as the dominant market for RAP Preprocessing Moisture Reduction Systems, driven by rapid urbanization, expanding road networks, and significant investments in infrastructure development. North America and Europe are also witnessing substantial growth, supported by strong regulatory frameworks and a mature construction industry. Meanwhile, Latin America and the Middle East & Africa are gradually increasing their adoption rates, spurred by rising awareness of the benefits of asphalt recycling and the need to address infrastructure deficits. The regional outlook remains highly favorable, with each region presenting unique opportunities and challenges that are shaping the overall trajectory of the global market.
The RAP Preprocessing Moisture Reduction Systems market is segmented by technology into Thermal Drying, Mechanical Dewatering, Chemical Treatment, and Others, each offering distinct advantages and serving specific operational needs. Thermal drying remains the most widely adopted technology, accounting for a signif
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package accompanies the paper "How Do Machine Learning Models Change?" In this study, we conducted a comprehensive analysis of over 200,000 commits and 1,200 releases across more than 50,000 models on the Hugging Face (HF) platform. Our goal was to understand how machine learning (ML) models evolve over time by classifying commit types based on an extended ML change taxonomy and analyzing patterns in commit and release activities using Bayesian networks.
Our research addresses three main aspects:
This replication package contains all the necessary code, datasets, and documentation to reproduce the results presented in the paper.
We collected data from the Hugging Face platform using the Hugging Face Hub API and the `HfApi` class. The data extraction was performed on November 6th, 2023. The collected data includes:
To enrich the commit data with detailed file change information, we integrated the PyDriller framework within the HFCommunity dataset.
Commit Diffs
We computed the differences between commits for key files, specifically JSON configuration files (e.g., `config.json`). For each commit that modifies these files, we compared the changes with the previous commit affecting the same file to identify added, deleted, and updated keys.
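The replication code for this step lives in the notebooks listed further below; as a standalone illustration of the key-level diff idea, the following sketch compares two hypothetical `config.json` snapshots.

```python
# Sketch: identify added, deleted, and updated top-level keys between two JSON configs.
import json

def diff_config(old_text: str, new_text: str) -> dict:
    old, new = json.loads(old_text), json.loads(new_text)
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "updated": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

previous = '{"hidden_size": 768, "num_layers": 12, "dropout": 0.1}'
current = '{"hidden_size": 1024, "num_layers": 12, "vocab_size": 50257}'
print(diff_config(previous, current))
# {'added': ['vocab_size'], 'deleted': ['dropout'], 'updated': ['hidden_size']}
```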
Commit Classification
We classified each commit according to Bhatia et al.'s ML change taxonomy using the Gemini 1.5 Flash Large Language Model (LLM). This classification, using LLMs to apply Bhatia et al.'s taxonomy on a large-scale ML repository, is one of the main contributions of our paper. We ensured the correctness of the classification by achieving a Cohen's kappa coefficient ≥ 0.9 through iterative validation. In addition, we performed classification based on Swanson's categories using a simpler neural network approach, following methods from prior work. This classification has less impact compared to the detailed classification using Bhatia et al.'s taxonomy.
Model Metadata
We extracted detailed metadata from the model files of selected releases, focusing on attributes such as the number of parameters, tensor shapes, etc. We also calculated the differences between the metadata of successive releases.
The replication package is organized as follows:
- code/: Contains the Jupyter notebooks with the data extraction, preprocessing, analysis, and model training scripts.
  - HFTotalExtraction.ipynb: Script for collecting data on the entire Hugging Face platform.
  - HFReleasesExtraction.ipynb: Script for collecting data on models that contain releases.
  - HFTotalPreprocessing.ipynb: Preprocesses the dataset obtained from `HFTotalExtraction.ipynb`.
  - HFCommitsPreprocessing.ipynb: Processes commit data, including:
  - HFReleasesPreprocessing.ipynb: Processes release data, including classification and preparation for analysis.
  - RQ1_Analysis.ipynb: Analysis for RQ1.
  - RQ2_Analysis.ipynb: Analysis for RQ2.
  - RQ3_Analysis.ipynb: Analysis for RQ3.
- datasets/: Contains the raw, processed, and manually curated datasets used for the analysis.
  - HFCommits_50K_RANDOM.csv: Contains the commits of 50,000 randomly sampled models from HF with the classification based on Bhatia et al.'s taxonomy.
  - HFCommits_MultipleCommits.csv: Contains the commits of 10,000 models with at least 10 commits, used for analyzing commit sequences.
  - HFReleases.csv: Contains over 1,200 releases from 127 models, classified using Bhatia et al.'s taxonomy.
  - model_metadata_with_diff.csv: Contains the metadata of releases from 27 models, including differences between successive releases.
  - HF_Total_Raw.csv: Contains a snapshot of the entire Hugging Face platform with over 380,000 models, as obtained from HFTotalExtraction.ipynb.
  - HF_Total_Preprocessed.csv: Contains the preprocessed version of the entire HF dataset, as obtained from HFTotalPreprocessing.ipynb. This dataset is needed for the commits preprocessing.
- metadata/: Contains the tags_metadata.yaml file used during preprocessing.
- models/: Contains the model trained to classify commit messages into corrective, perfective, and adaptive types based on Swanson's traditional software maintenance categories.
- requirements.txt: Lists the required Python packages to set up the environment and run the code.
```bash
python -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate
```
```bash
pip install -r requirements.txt
```
- LLM Usage: The classification of commits using the Gemini 1.5 Flash LLM requires access to the model. Ensure you have the necessary permissions and API keys to use the model.
- Computational Resources: Processing large datasets and running Bayesian network analyses may require significant computational resources. It is recommended to use a machine with ample memory and processing power.
- Reproducing Results: The auxiliary datasets included can be used to reproduce specific parts of the code without re-running the entire data collection and preprocessing pipeline.
Contact: If you have any questions or encounter issues, please contact the authors at joel.castano@upc.edu.
This README provides detailed instructions and information to reproduce and understand the analyses performed in the paper. If you find this package useful, please cite our work.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Crop cultivar identification is fundamental for agricultural research, industry and policies. This paper investigates the feasibility of using visible/near infrared hyperspectral data collected with a miniaturized NIR spectrometer to identify cultivars of barley, chickpea and sorghum in the context of Ethiopia. A total of 2650 grains of barley, chickpea and sorghum cultivars were scanned using the SCIO, a recently released miniaturized NIR spectrometer. The effects of data preprocessing techniques and the choice of machine learning algorithm on distinguishing cultivars are further evaluated. Predictive multiclass models of 24 barley cultivars, 19 chickpea cultivars and 10 sorghum cultivars delivered accuracies of 89%, 96% and 87% on a hold-out sample. The Support Vector Machine (SVM) and Partial least squares discriminant analysis (PLS-DA) algorithms consistently outperformed other algorithms. Several cultivars, believed to be widely adopted in Ethiopia, were identified with perfect accuracy. These results advance the discussion on cultivar identification survey methods by demonstrating that miniaturized NIR spectrometers represent a low-cost, rapid and viable tool. We further discuss the potential utility of the method for adoption surveys, field-scale agronomic studies, socio-economic impact assessments and value chain quality control. Finally, we provide a free tool for R to easily carry out crop cultivar identification and measure uncertainty based on spectral data.
https://cubig.ai/store/terms-of-service
1) Data Introduction • The Fruit Classification Dataset is a beginner classification dataset configured to classify fruit types based on fruit name, color, and weight information.
2) Data Utilization (1) The Fruit Classification Dataset has the following characteristics: • This dataset consists of a total of three columns: the categorical variable Color, the continuous variable Weight, and the target class Fruit, allowing you to preprocess categorical and numerical variables when training classification models. (2) The Fruit Classification Dataset can be used to: • Model training and evaluation: it can be used as educational and research data to compare and evaluate the performance of various machine learning classification algorithms using the color and weight features. • Data preprocessing practice: it can serve as hands-on data for learning basic preprocessing and feature engineering steps such as categorical variable encoding and continuous variable scaling (see the sketch below).
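As a hands-on illustration of the preprocessing just described (one-hot encoding of Color, scaling of Weight) feeding a simple classifier, the following sketch builds a small scikit-learn pipeline; the CSV file name is an assumption.

```python
# Sketch: encode the categorical Color, scale the continuous Weight, classify Fruit.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("fruit_classification.csv")     # columns: Color, Weight, Fruit
X, y = df[["Color", "Weight"]], df["Fruit"]

prep = ColumnTransformer([
    ("color", OneHotEncoder(handle_unknown="ignore"), ["Color"]),
    ("weight", StandardScaler(), ["Weight"]),
])
model = Pipeline([("prep", prep), ("clf", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0, stratify=y)
print("Accuracy:", model.fit(X_train, y_train).score(X_test, y_test))
```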
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Algorithms proposed in computational pathology can automatically analyze digitized tissue samples of histopathological images to help diagnose diseases. Tissue samples are scanned at high resolution and usually saved as images with several magnification levels, namely whole slide images (WSIs). Convolutional neural networks (CNNs) represent the state-of-the-art computer vision methods targeting the analysis of histopathology images, aiming for detection, classification and segmentation. However, the development of CNNs that work with multi-scale images such as WSIs is still an open challenge. The image characteristics and the CNN properties impose architecture designs that are not trivial. Therefore, single-scale CNN architectures are still often used. This paper presents Multi_Scale_Tools, a library aiming to facilitate exploiting the multi-scale structure of WSIs. Multi_Scale_Tools currently includes four components: a pre-processing component, a scale detector, a multi-scale CNN for classification and a multi-scale CNN for segmentation of the images. The pre-processing component includes methods to extract patches at several magnification levels. The scale detector identifies the magnification level of images that do not contain this information, such as images from the scientific literature. The multi-scale CNNs are trained combining features and predictions that originate from different magnification levels. The components are developed using private datasets, including colon and breast cancer tissue samples. They are tested on private and public external data sources, such as The Cancer Genome Atlas (TCGA). The results of the library demonstrate its effectiveness and applicability. The scale detector accurately predicts multiple levels of image magnification and generalizes well to independent external data. The multi-scale CNNs outperform the single-magnification CNN for both classification and segmentation tasks. The code is developed in Python and will be made publicly available upon publication. It aims to be easy to use and easy to improve with additional functions.