The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing volume and complexity of data across various industries. The market, estimated at $1.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $5 billion by 2033.

This expansion is fueled by several key factors. Firstly, the rising adoption of big data analytics and business intelligence initiatives across large enterprises and SMEs is creating significant demand for efficient EDA tools. Secondly, the growing need for faster, more insightful data analysis to support better decision-making is driving the preference for user-friendly graphical EDA tools over traditional non-graphical methods. Furthermore, advancements in artificial intelligence and machine learning are being integrated seamlessly into EDA tools, enhancing their capabilities and broadening their appeal. The market segmentation reveals a significant portion held by large enterprises, reflecting their greater resources and data handling needs; however, the SME segment is rapidly gaining traction, driven by the increasing affordability and accessibility of cloud-based EDA solutions. Geographically, North America currently dominates the market, but regions like Asia-Pacific are exhibiting high growth potential due to increasing digitalization and technological advancements.

Despite this positive outlook, certain restraints remain. The high initial investment cost associated with implementing advanced EDA solutions can be a barrier for some SMEs, and the need for skilled professionals to effectively utilize these tools can create a challenge for organizations. However, the ongoing development of user-friendly interfaces and the availability of training resources are actively mitigating these limitations.

The competitive landscape is characterized by a mix of established players like IBM and emerging innovative companies offering specialized solutions. Continuous innovation in areas like automated data preparation and advanced visualization techniques will further shape the future of the EDA tools market, ensuring its sustained growth trajectory.
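As a quick arithmetic check on the figures above (our own, not from the report), compounding the 2025 estimate at the stated CAGR over the eight years to 2033 does land near the $5 billion projection:

```python
# Sanity check of the stated projection using the figures quoted above.
base_2025 = 1.5          # estimated 2025 market size, USD billions
cagr = 0.15              # stated compound annual growth rate
years = 2033 - 2025      # eight compounding periods

projected_2033 = base_2025 * (1 + cagr) ** years
print(f"Projected 2033 market size: ${projected_2033:.2f}B")  # ~$4.59B, i.e. "approximately $5 billion"
```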
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques to isotope ratio mass spectrometry (IRMS) data from volatile laboratory analogs of Europa and Enceladus seawaters, as a case study for the development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and to determine whether a data science pipeline could be designed to quickly analyze data from future ocean worlds missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns and significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower-dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared the dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (“clusters”) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration. Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal, where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.
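The paper's actual pipeline is not reproduced here, but the kind of comparison it describes (PCA versus UMAP embeddings, each followed by clustering) can be sketched on stand-in data. This assumes scikit-learn and the umap-learn package, with synthetic blobs in place of the IRMS spectra:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import umap  # from the umap-learn package

# Synthetic stand-in for high-dimensional IRMS spectra: 300 samples, 50 channels.
X, _ = make_blobs(n_samples=300, n_features=50, centers=4, random_state=0)

# Reduce to 2 dimensions with PCA (linear) and UMAP (nonlinear).
X_pca = PCA(n_components=2, random_state=0).fit_transform(X)
X_umap = umap.UMAP(n_components=2, random_state=0).fit_transform(X)

# Cluster each embedding and compare cluster quality.
for name, emb in [("PCA", X_pca), ("UMAP", X_umap)]:
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(emb)
    print(name, "silhouette:", round(silhouette_score(emb, labels), 3))
```

Silhouette scores are only one way to compare embeddings; the study instead maps discovered clusters back to experimental conditions such as seawater composition.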
The dataset contains sales data of an automobile company.
Do explore the pinned 📌 notebook under the Code section for a quick EDA 📊 reference.
Consider an upvote ^ if you find the dataset useful.
Data Description
| Column Name | Description |
|---|---|
| ORDERNUMBER | This column represents the unique identification number assigned to each order. |
| QUANTITYORDERED | It indicates the number of items ordered in each order. |
| PRICEEACH | This column specifies the price of each item in the order. |
| ORDERLINENUMBER | It represents the line number of each item within an order. |
| SALES | This column denotes the total sales amount for each order, which is calculated by multiplying the quantity ordered by the price of each item. |
| ORDERDATE | It denotes the date on which the order was placed. |
| DAYS_SINCE_LASTORDER | This column represents the number of days that have passed since the last order for each customer. It can be used to analyze customer purchasing patterns. |
| STATUS | It indicates the status of the order, such as "Shipped," "In Process," "Cancelled," "Disputed," "On Hold," or "Resolved." |
| PRODUCTLINE | This column specifies the product line categories to which each item belongs. |
| MSRP | It stands for Manufacturer's Suggested Retail Price and represents the suggested selling price for each item. |
| PRODUCTCODE | This column represents the unique code assigned to each product. |
| CUSTOMERNAME | It denotes the name of the customer who placed the order. |
| PHONE | This column contains the contact phone number for the customer. |
| ADDRESSLINE1 | It represents the first line of the customer's address. |
| CITY | This column specifies the city where the customer is located. |
| POSTALCODE | It denotes the postal code or ZIP code associated with the customer's address. |
| COUNTRY | This column indicates the country where the customer is located. |
| CONTACTLASTNAME | It represents the last name of the contact person associated with the customer. |
| CONTACTFIRSTNAME | This column denotes the first name of the contact person associated with the customer. |
| DEALSIZE | It indicates the size of the deal or order, with the categories "Small," "Medium," or "Large." |
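For a quick first pass over the columns described above, a few pandas calls cover the basics; the file name here is a placeholder for the CSV actually shipped with the dataset:

```python
import pandas as pd

# Placeholder file name; substitute the actual CSV from the dataset.
df = pd.read_csv("auto_sales.csv")

print(df.shape)                         # rows x columns
print(df[["QUANTITYORDERED", "PRICEEACH", "SALES", "MSRP"]].describe())
print(df["STATUS"].value_counts())      # order status distribution
print(df["DEALSIZE"].value_counts())    # Small / Medium / Large split

# Per the description, SALES should equal QUANTITYORDERED * PRICEEACH;
# flag rows where the two disagree beyond rounding.
mismatch = (df["QUANTITYORDERED"] * df["PRICEEACH"] - df["SALES"]).abs() > 0.01
print("rows violating SALES = QUANTITYORDERED * PRICEEACH:", int(mismatch.sum()))
```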
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Benchmark and compare third-party AI models for weld defect detection and non-destructive testing (NDT) in automotive production lines, with a focus on recall, latency, and enterprise deployment.
This dataset was created by Emmy Manders.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Traditional methods of data analysis in animal behavior research are usually based on measuring behavior by manually coding a set of chosen behavioral parameters, which is naturally prone to human bias and error, and is also a tedious, labor-intensive task. Machine learning techniques are increasingly applied to support researchers in this field, mostly in a supervised manner: for tracking animals, detecting landmarks, or recognizing actions. Unsupervised methods are increasingly used, but remain under-explored in the context of behavior studies and applied contexts such as behavioral testing of dogs. This study explores the potential of unsupervised approaches such as clustering for the automated discovery of patterns in data which have potential behavioral meaning. We aim to demonstrate that such patterns can be useful at exploratory stages of data analysis, before forming specific hypotheses. To this end, we propose a concrete method for grouping video trials of behavioral testing of animal individuals into clusters using a set of potentially relevant features. Using an example protocol for a “Stranger Test”, we compare the discovered clusters against the C-BARQ owner-based questionnaire, which is commonly used for dog behavioral trait assessment, showing that our method separated well between dogs with higher C-BARQ scores for stranger fear and those with lower scores. This demonstrates the potential of such a clustering approach for exploration prior to hypothesis forming and testing in behavioral research.
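The study's feature set and clustering choices are not given above, so the following is only a rough sketch of the described validation idea: cluster per-trial features, then check how the discovered groups differ on C-BARQ stranger-fear scores. All column names and values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical per-trial features plus each dog's C-BARQ stranger-fear score.
rng = np.random.default_rng(0)
trials = pd.DataFrame({
    "time_near_stranger": rng.normal(30, 10, 100),
    "avg_distance": rng.normal(2.0, 0.5, 100),
    "cbarq_stranger_fear": rng.uniform(0, 4, 100),
})

features = ["time_near_stranger", "avg_distance"]
X = StandardScaler().fit_transform(trials[features])
trials["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Do the discovered clusters differ on the owner-reported stranger-fear score?
print(trials.groupby("cluster")["cbarq_stranger_fear"].describe())
```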
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Depression is a psychological state of mind that often influences a person in an unfavorable manner. While it can occur in people of all ages, students are especially vulnerable to it throughout their academic careers. Beginning in 2020, the COVID-19 epidemic caused major problems in people’s lives by driving them into quarantine and forcing them to be connected continually with mobile devices, such that mobile connectivity became the new norm during the pandemic and beyond. This situation is further accelerated for students as universities move towards a blended learning mode. In these circumstances, monitoring student mental health in terms of mobile and Internet connectivity is crucial for their wellbeing. This study focuses on students attending an international university in Bangladesh to investigate their mental health in relation to their continual use of mobile devices (e.g., smartphones, tablets, laptops, etc.). A cross-sectional survey method was employed to collect data from 444 participants. Following the exploratory data analysis, eight machine learning (ML) algorithms were used to develop an automated normal-to-extreme severe depression identification and classification system. When the automated detection incorporated feature selection such as the Chi-square test and Recursive Feature Elimination (RFE), an increase in accuracy of about 3 to 5% was observed. Similarly, an increase in accuracy of 5 to 15% was observed when a feature extraction method such as Principal Component Analysis (PCA) was applied. The SparsePCA feature extraction technique in combination with the CatBoost classifier showed the best results in terms of accuracy, F1-score, and ROC-AUC. The data analysis revealed no sign of depression in about 44% of the total participants, while about 25% of students showed mild-to-moderate and 31% showed severe-to-extreme signs of depression. The results suggest that ML models incorporating a proper feature engineering method can serve adequately in multi-stage depression detection among students. This model might be utilized in other disciplines for detecting early signs of depression among people.
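The survey features themselves are not included here, but the best-performing combination the study reports, SparsePCA feature extraction feeding a CatBoost classifier, can be sketched on synthetic stand-in data (assumes the catboost package is installed):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import SparsePCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from catboost import CatBoostClassifier  # assumes catboost is installed

# Synthetic stand-in for survey features with 4 depression-severity classes.
X, y = make_classification(n_samples=600, n_features=30, n_informative=10,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Feature extraction, then gradient-boosted classification.
spca = SparsePCA(n_components=10, random_state=0).fit(X_tr)
clf = CatBoostClassifier(iterations=200, verbose=0, random_seed=0)
clf.fit(spca.transform(X_tr), y_tr)

# ravel() flattens the prediction array (CatBoost may return a column vector).
y_pred = clf.predict(spca.transform(X_te)).ravel()
print(classification_report(y_te, y_pred))
```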
This dataset was created by Avish.
This dataset was created by blue7red.
Released under CC0: Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Sales Analysis 2023 Summary for an Auto Parts Manufacturer

Total Sales and Profit: The company achieved total sales of 56 million AED in 2023. Net profit for the year was 34 million AED.

Monthly Highlights: October 2023 saw the highest sales, exceeding 5 million AED. The average unit price also peaked in October 2023.

Product Performance:
- Car Accessories: the most-sold category, due to their lower price and smaller size.
- Body Parts: generated the highest revenue, approximately 13 million AED, due to their higher price; recorded the highest gross margin at 66% and the highest net profit of any category, amounting to 8 million AED.
- Wheels and Tires: have the highest average unit price.

Sales Volume: March 2023 recorded the maximum quantity sold, with 20,000 items.

Key Insights:
- Revenue Distribution: Body parts are the main revenue drivers, contributing significantly due to their high price.
- Profit Margins: Body parts not only contribute the most to revenue but also have the highest gross margin, highlighting their profitability.
- Sales Trends: October is the peak month for sales and unit price, indicating potential seasonal factors or successful marketing strategies.
- Product Mix: Car accessories are the most popular items by volume, while body parts dominate in terms of revenue and profit.
- Sales Volume: March is notable for the highest sales volume, which could be leveraged for future sales planning and inventory management.

This analysis provides a comprehensive view of the sales performance in 2023, highlighting key areas of success and opportunities for strategic focus in the future.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
HARMOGEN-R Dataset: Large-Scale Human–LLM Evaluation Records for Automated Grading Research
The HARMOGEN-R dataset comprises 69,028 structured records of educational assessments that compare human and Large Language Model (LLM) grading. It includes 770 anonymized student responses from a Data Structures and Algorithms course, evaluated by human instructors and several LLMs using five rubric variants: one human-created and four AI-generated. Each response was assessed across multiple criteria, producing 50,050 individual evaluations, 15,480 aggregated scores, and 2,310 evaluation reports.
The dataset consists of 15 relational tables with enforced foreign keys and validated JSON fields that document the reasoning process used in automated grading. Each LLM evaluation includes a criteria_evaluations_by_llm JSON object describing the structured rationale applied by the model to each criterion, the intermediate sub-scores, and the corresponding textual justification. This structure enables reproducible quantitative and qualitative analyses of model evaluation behaviour.
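The description above names the object's contents but not its exact schema, so the following Python sketch only illustrates the kind of structure involved (per-criterion rationale, intermediate sub-score, and textual justification); every key name is hypothetical and should be checked against the dataset's schema files:

```python
import json

# Hypothetical shape of a criteria_evaluations_by_llm object; the real field
# names should be taken from the dataset's MySQL schema files.
criteria_evaluations_by_llm = {
    "criteria": [
        {
            "criterion_id": "C1",
            "rationale": "The response defines the data structure correctly ...",
            "sub_score": 4,
            "justification": "All required properties are named and explained.",
        },
        {
            "criterion_id": "C2",
            "rationale": "Complexity analysis is only partially addressed ...",
            "sub_score": 2,
            "justification": "Big-O for the worst case is missing.",
        },
    ],
}
print(json.dumps(criteria_evaluations_by_llm, indent=2))
```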
The repository also provides CSV extracts, Python scripts for data normalization and exploratory analysis, and MySQL schema files to ensure reproducibility. All identifiers are anonymized, and the dataset does not contain personal information.
The HARMOGEN-R dataset facilitates empirical research on automated assessment, rubric generation, and human–AI grading consistency. It is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
License: http://opendatacommons.org/licenses/dbcl/1.0/
Problem statement I worked on to create this data: The Automated Bus Scheduling and Route Management System is designed to streamline and automate the scheduling and route planning process for the Delhi Transport Corporation (DTC). This project aims to improve operational efficiency, reduce errors, and enhance the reliability of bus services by replacing the current manual methods with an automated software solution. The system leverages algorithms, data analytics, and Geographic Information System (GIS) technologies to manage both linked and unlinked duty scheduling and to optimize route planning.
Feel free to play with this!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains raw data, code, and analysis scripts related to the experiments performed in 'A combined microfluidic deep learning approach for lung cancer cell high throughput screening toward automatic cancer screening applications'. The data, code, and documentation are provided here to facilitate reproducible research and enable further exploration and analysis of the experimental results.
Analysis Code:
Languages: MATLAB 2020a or later with Deep Learning Toolbox
Description: This repository contains MATLAB scripts for data preprocessing, deep learning-based classification, and visualization of lung cancer cell images. The scripts train convolutional neural networks (CNNs) to classify six lung cell lines, including normal and five cancer subtypes.
Documentation:
File: LungCancer_CellLine_Code.zip
Description: This file provides exemplary code and sample images used for the machine learning approach.
File: Supplementary information and instructions.pdf
Description: This file provides instructions and a description of the individual steps from raw data to image analysis.
File: Original Image data and Metadata Example - pc9.zip
Description: This .zip archive provides an example of raw data in the native .vsi file format, with folders containing the .ets file and metadata documentation of the imaging parameters for a microfluidic channel imaged with the IX83 microscope.
File: Data augmentation documentation.docx (and Data augmentation documentation.pdf)
Description: This document provides descriptions of how data augmentation was performed.
File: Raw data.zip
Description: This file contains raw image data.
File: GrayCellData.rar
Description: This file contains image data converted to grayscale images.
File: CellData_Full.rar
Description: This file contains RGB image data.
Cell Lines: Normal lung cells and non-small cell lung cancer cells (PC-9, SK-LU-1, H-1975, A-427, and A-549)
Plate Format: Plasma-bonded and coated microfluidic chip platform fabricated with silicone sheets and sterile glass microscope slides.
Surface Coating
Prior to cell seeding, the surface of the polydimethylsiloxane (PDMS) microfluidic chip was treated with collagen to enhance cell adhesion. A 0.1% (w/v) collagen solution was prepared using Type I collagen (derived from rat tail) dissolved in a 0.02 M acetic acid buffer. The PDMS surfaces were incubated with the collagen solution for 2 hours at room temperature to allow for proper coating. Following this, the chips were rinsed with phosphate-buffered saline (PBS) to remove any unbound collagen. Collagen, being a key extracellular matrix component, provides a conducive environment for cell attachment and proliferation. This surface modification was crucial for ensuring that the cells would adhere effectively to the microfluidic architecture, promoting optimal growth conditions. The collagen coating facilitated stronger cell-matrix interactions, thereby improving the overall experimental reliability and enabling accurate analysis of cell behavior in the microfluidic system.
Seeding Density
In this study, various cell types (lung normal cells and non-small cell lung cancer cells: PC-9, SK-LU-1, H-1975, A-427, and A-549) were cultured within a microfluidic chip designed with a total length of 75 mm and a width of 25 mm, featuring three separate chambers, each with a diameter of 900 μm. The seeding density was calculated to be approximately 5,000 cells/mL. Given the chamber dimensions, this density was optimized to ensure that the cells could achieve ~70% confluency within a reasonable timeframe while maintaining their viability and functionality. The initial seeding in a 25 cm² culture flask allowed for efficient expansion and preparation of the cells prior to their transfer to the microfluidic environment (the cell culture medium was DMEM or RPMI supplemented with 10% FBS and 1% PS).
Cultivation Duration
After trypsin treatment of cells cultured in a flask, the cells were allowed to adhere to the microfluidic chip for a duration of 48-72 hours post-injection. This incubation period was essential for the cells to establish stable adhesion to the collagen-coated surfaces, enabling them to regain their morphology and functionality. It ensured that the cellular environment within the microfluidic chambers mimicked in vivo conditions, allowing for proper cell spreading and growth.
Medium Composition
The medium utilized for cell cultivation consisted of DMEM (Dulbecco's Modified Eagle Medium) or RPMI-1640, supplemented with 10% fetal bovine serum (FBS) and 1% penicillin-streptomycin (PS), tailored to the specific cell types used. This composition was chosen to provide the necessary nutrients, growth factors, and antibiotics to support cell proliferation and prevent contamination. DMEM and RPMI are known to support a wide range of mammalian cell types, thereby enhancing the versatility of the experimental setup. The medium was pre-warmed to 37°C before use, and the cells were maintained in a humidified incubator at 37°C with 5% CO₂ during cultivation.
Imaging Setup
The imaging data was acquired using an automated IX83 microscope (Olympus, Japan), featuring a Märzhäuser motorized stage, a Hamamatsu ORCA-Flash4.0 camera, and a Lumencor Spectra X fluorescent light source. This setup ensures high-resolution fluorescence imaging with precise stage control and sensitive image capture. Data was recorded automatically after adjustment of the z-axis, using a multi-region area of interest on each microfluidic channel with the focus map function (medium density setting) in cellSens Dimension software (Version 2.1-2.3, Olympus). The blue fluorescence channel (DAPI staining) was used to facilitate large-area adjustment of the focus map prior to automated imaging. The green fluorescence channel, representing phalloidin staining of F-actin, was exported as single-channel images for the deep learning procedure outlined in the paper.
1. Extract the Raw Data:
Unzip the Raw data.zip file into your working directory.
2. Environment Setup:
Read the documentation Supplementary information and instructions.pdf and the readme.txt in the code for more details on the setup.
3. Running the Analysis:
Open the file Supplementary information and instructions.pdf for a detailed description.
Data Exploration: The analysis scripts include functions for exploratory data analysis (EDA). You can modify these scripts to investigate specific experimental conditions.
Reproducibility
Follow the code comments and documentation to replicate the analyses. Ensure that the environment and dependencies are correctly configured as described in the setup section.
Licensing
This repository is licensed as follows: Code is accessible under BSD 2-Clause "Simplified" license and data under a Creative Commons Attribution 4.0 International license.
Acknowledgement:
This work was supported by the Iran National Science Foundation (INSF) Grant No. 96006759.
For data acquisition:
Abdullah Allahverdi, a-allahverdi@modares.ac.ir;
Hadi Hashemzadeh, Hashemzadeh.hadi@gmail.com;
Mario Rothbauer, mario.rothbauer@tuwien.ac.at
For data processing and augmentation:
Seyedehsamaneh Shojaei, s.shojaie@irost.ir, samane.shojaie@gmail.com
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Sample dataset for HR Attrition Intelligence Agent [Capstone Project > 5-Day AI Agents Intensive Course with Google (Nov 10-14, 2025)]
This dataset captures detailed employee lifecycle information—including department, role, tenure, employment type, and termination details—across global regions. It powers the HR Attrition Intelligence Agent, enabling natural-language queries, pattern recognition, and automated insights into workforce turnover. Designed for enterprise-grade analytics, it supports both individual lookups and macro-level attrition trend analysis. Ideal for building intelligent HR systems, predictive models, and workflow automation tools.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA): The dataset was examined using visuals such as scatter plots and histograms, and was inspected for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.
Methodology
Machine Learning Algorithms: Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered (a sketch follows below).
Train-Test Split: The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
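As a rough illustration of the transfer-learning option mentioned above, a Keras sketch with a frozen MobileNetV2 backbone and a binary healthy/diseased head might look like this; the directory layout, image size, and hyperparameters are assumptions, not the project's actual configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# Assumed layout: cucumber_images/{healthy,diseased}/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    "cucumber_images", validation_split=0.2, subset="training", seed=0,
    label_mode="binary", image_size=(224, 224), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "cucumber_images", validation_split=0.2, subset="validation", seed=0,
    label_mode="binary", image_size=(224, 224), batch_size=32)

# Frozen MobileNetV2 backbone with a small regularized classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    layers.Rescaling(1.0 / 127.5, offset=-1),   # MobileNetV2 expects [-1, 1]
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),                        # dropout regularization
    layers.Dense(1, activation="sigmoid",
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 regularization
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=5)
```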
Model Development: The CNN architecture is defined by its layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
Model Training: During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.
Model Evaluation
Evaluation Metrics: Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both the training and test datasets (see the sketch after this section).
Performance Discussion: The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
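A minimal scikit-learn sketch of the metric computation described above; the label and prediction arrays are placeholders for the model's actual test-set outputs:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Placeholders: ground-truth labels and model predictions on the test set
# (0 = healthy, 1 = diseased).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
```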
Results and Discussion: Key project findings include model performance and disease-detection precision; a comparison of the models employed, showing the benefits and drawbacks of each; and the challenges faced throughout the project, along with the methods used to solve them.
Conclusion: A recap of the project's key learnings, highlighting the project's importance to early disease detection in agriculture. Future enhancements and potential research directions are suggested.
References
Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib
Dataset: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository: https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
This repository contains the data used to obtain the results of 5-fold cross-validation testing of how accurately MSclassifier and other packages predict protein subcellular localization, reported in the software article entitled "MSclassifier: median-supplement model-based classification tool for automated knowledge discovery." The data used in the software article are derived from data generated in "G. K. Acquaah-Mensah, S. M. Leach, and C. Guda, Predicting the subcellular localization of human proteins using machine learning and exploratory data analysis, Genomics Proteomics Bioinformatics, 4(2):120-133, 2006, https://doi.org/10.1016/S1672-0229(06)60023-5".
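MSclassifier's median-supplement model is not reproduced here; the sketch below illustrates only the 5-fold cross-validation protocol itself, with synthetic features and a generic scikit-learn classifier standing in for the compared tools:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for protein feature vectors with subcellular-location labels.
X, y = make_classification(n_samples=500, n_features=40, n_classes=3,
                           n_informative=12, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="accuracy")
print("per-fold accuracy:", scores.round(3), "mean:", scores.mean().round(3))
```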
The importance of understanding biological interaction networks has fueled the development of numerous interaction data generation techniques, databases, and prediction tools. Generating high-confidence interaction networks is the first step in the study of protein–protein interactions (PPI). A number of experimental methods based on distinct physical principles have been developed to identify PPI, such as the yeast two-hybrid (Y2H) method. In this work, we focus on one example of a biological network, namely the yeast protein interaction network (YPIN). We design and implement a computational model that captures the discrete and stochastic nature of protein interactions. In this model, we apply a spectral analysis method to the variance of the protein nodes that play an important role in PPI networks, which can reveal the topological structure and the dynamic, collective behavior of these networks. As an example, we take the YPIN (48 "quasi-cliques" and 6 "quasi-bipartites" separated from 11,855 yeast protein–protein interactions among 2,617 proteins) and apply spectral analysis to characterize the topological structure and the dynamic, collective behavior of PPI networks. The obtained results may be valuable for deciphering unknown protein functions, determining protein complexes, and inventing drugs. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
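The paper's precise spectral method is not detailed above, but the basic operation behind such topology analysis, computing an eigenvalue spectrum from an interaction graph, can be sketched as follows (assumes networkx; a random graph stands in for the YPIN):

```python
import numpy as np
import networkx as nx

# Random graph standing in for a yeast protein-interaction network.
G = nx.gnm_random_graph(200, 800, seed=0)

# Eigenvalue spectrum of the normalized Laplacian; its shape reflects
# network topology (e.g., near-zero eigenvalues indicate loosely
# connected communities such as quasi-cliques).
L = nx.normalized_laplacian_matrix(G).toarray()
eigvals = np.linalg.eigvalsh(L)
print("smallest five eigenvalues:", eigvals[:5].round(4))
print("largest eigenvalue:", eigvals[-1].round(4))
```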
Contributors: Monash University, Faculty of Information Technology, Gippsland School of Information Technology; Chetty, Madhu; Ahmad, Shandar; Ngom, Alioune; Teng, Shyh Wei; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd: 2008: Melbourne, Australia). Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.